NikitolProject/android-encrypted-archiver

NikitolProject 041a00913b docs(01-format-specification): research phase domain

2026-02-24 23:05:52 +03:00

Phase 1: Format Specification - Research

Researched: 2026-02-24
Domain: Binary format design, cryptographic data structures, cross-platform compatibility
Confidence: HIGH

Summary

Phase 1 has a single deliverable: a complete binary format specification document that three independent implementations (Rust CLI, Kotlin Android decoder, busybox shell script) will build against. The spec must define every byte offset, field size, and encoding so that implementers face no ambiguity. This is a documentation phase; no implementation code is written.

The critical design tensions are: (1) the format must be simple enough for a shell script using dd/xxd/openssl to parse, yet structured enough to hold per-file encryption metadata; (2) the format must accommodate Phase 6 obfuscation features (XOR headers, encrypted TOC, decoy padding) even though they are not implemented until later; (3) all three decoders share a single hardcoded 32-byte key, and the spec must resolve the open question of whether HMAC uses the same key or a derived subkey.

Primary recommendation: Write the spec as a single Markdown document with ASCII diagrams showing byte layouts, a field-reference table for every structure (header, file table entry, data block), and a complete worked example showing the hex dump of a 2-file archive. Address obfuscation as reserved fields and documented future behavior (version flag controls whether obfuscation is active).

<phase_requirements>

Phase Requirements

ID	Description	Research Support
FMT-05	Specification of the format as a document (before any implementation begins)	All findings below directly support creating this spec: field layouts, crypto parameter placement, worked example patterns, shell compatibility constraints, and obfuscation placeholders
</phase_requirements>

Standard Stack

This phase produces a document, not code. There are no library dependencies. The "stack" here is the set of standards and reference materials the spec must conform to.

Core Standards

Standard	Reference	Purpose	Why It Governs the Spec
AES-256-CBC	NIST SP 800-38A	Block cipher mode	Defines 16-byte block size, IV requirements, PKCS7 padding behavior
HMAC-SHA-256	RFC 2104, FIPS 198-1	Message authentication	Defines 32-byte output, key requirements
PKCS7 padding	RFC 5652 Section 6.3	Block alignment	Encrypted size = ceil((input_len + 1) / 16) * 16; always adds at least 1 byte
SHA-256	FIPS 180-4	File integrity checksum	32-byte digest for post-decompression verification
DEFLATE/gzip	RFC 1952	Compression	Standard gzip stream, decompressed by GZIPInputStream (Kotlin) and gunzip (shell)
Encrypt-then-MAC	Bellare & Namprempre 2000; IETF draft-mcgrew-aead-aes-cbc-hmac-sha2	Authenticated encryption construction	HMAC computed over IV + ciphertext, verified before decryption

Key Size Constants

Parameter	Size	Notes
AES key	32 bytes (256 bits)	Hardcoded, shared across all decoders
AES block size	16 bytes	Governs IV size and PKCS7 padding
IV	16 bytes	Random per file, stored in cleartext
HMAC-SHA-256 output	32 bytes	Appended after or stored alongside ciphertext
SHA-256 checksum	32 bytes	Stored in file table, verified after decompression

Alternatives Considered

Instead of	Could Use	Tradeoff
AES-256-CBC + HMAC	AES-256-GCM (AEAD)	GCM is simpler (single operation) but `openssl enc` in busybox does NOT support GCM mode; CBC + HMAC is the only option compatible with all three decoders
HMAC-SHA-256	HMAC-SHA-512 truncated	IETF AEAD spec uses SHA-512 for AES-256, but SHA-256 is simpler, sufficient for integrity, and natively available in all three target environments
Little-endian	Big-endian (network order)	Big-endian is traditional for network protocols, but the x86 and ARM (Android) targets are little-endian natively; shell `od -t u4` reads host-endian, which is LE on ARM
PKCS7	Zero-padding	PKCS7 is standard for AES-CBC, directly supported by `openssl enc` and javax.crypto PKCS5Padding (PKCS5 = PKCS7 for 16-byte blocks)

Architecture Patterns

Pattern 1: Field Definition Table

What: Every binary structure is specified as a table with offset, size, type, endianness, and description.
When to use: For every fixed-layout structure in the format (header, file table entries).

Example:

### Archive Header (40 bytes)

| Offset | Size | Type    | Endian | Field            | Description                          |
|--------|------|---------|--------|------------------|--------------------------------------|
| 0x00   | 4    | bytes   | -      | magic            | Custom magic bytes, e.g. 0xCA 0xFE followed by two further bytes (placeholder values) |
| 0x04   | 1    | u8      | -      | version          | Format version (1 for v1)            |
| 0x05   | 1    | u8      | -      | flags            | Bit 0: compression, Bit 1: TOC encryption, Bit 2: XOR header, Bit 3: decoy padding |
| 0x06   | 2    | u16     | LE     | file_count       | Number of files in archive           |
| 0x08   | 4    | u32     | LE     | toc_offset       | Offset to file table from file start |
| 0x0C   | 4    | u32     | LE     | toc_size         | Size of file table in bytes          |
| 0x10   | 16   | bytes   | -      | toc_iv           | IV for encrypted file table (Phase 6)|
| 0x20   | 8    | bytes   | -      | reserved         | Reserved for future use (zero-filled)|
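
The offsets in the table above can be cross-checked mechanically: each field should start exactly where the previous one ends, and the sizes should sum to the stated header size. The following is a minimal sketch of such a check; the function name `check_layout` and the row format are illustrative assumptions, not part of the format.

```shell
# Verify that each field starts where the previous one ends.
# Reads "offset size name" rows on stdin; prints the total size on success.
check_layout() {
  local expected=0 offset size name
  while read -r offset size name; do
    if [ "$(( $offset ))" -ne "$expected" ]; then
      echo "offset mismatch at $name: expected $expected" >&2
      return 1
    fi
    expected=$(( expected + size ))
  done
  echo "$expected"
}

# Rows copied from the example header table.
header_total=$(check_layout <<'EOF'
0x00 4 magic
0x04 1 version
0x05 1 flags
0x06 2 file_count
0x08 4 toc_offset
0x0C 4 toc_size
0x10 16 toc_iv
0x20 8 reserved
EOF
)
echo "header size: $header_total bytes"
```

Running the same check against every structure in the spec is a cheap way to catch off-by-one offsets before any decoder is written.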

Pattern 2: Encrypt-then-MAC Construction (per file)

What: The exact order of operations and data layout for each file's encrypted block.
When to use: Defines how every file is stored in the archive.

Pipeline per file:
  1. Read original file -> compute SHA-256 checksum -> store in file table
  2. Compress with gzip (if compression flag set) -> compressed_data
  3. Pad compressed_data with PKCS7 to AES block boundary
  4. Generate random 16-byte IV
  5. Encrypt padded data with AES-256-CBC using IV -> ciphertext
  6. Compute HMAC-SHA-256 over (IV || ciphertext) -> mac
  7. Store: IV (16) || ciphertext (variable) || HMAC (32)

Data block layout:
  [IV: 16 bytes][ciphertext: N bytes][HMAC: 32 bytes]
  Where N = (floor(compressed_size / 16) + 1) * 16  (PKCS7 always adds 1 to 16 bytes of padding)
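
The pipeline and block layout above can be exercised end to end with the `openssl` command line. The sketch below is only an illustration under stated assumptions: `KEY_HEX` is a made-up test key (not the project's key), the helper names `seal_block` and `open_block` are hypothetical, compression is left out, and a full OpenSSL build with `-mac HMAC` support is assumed.

```shell
# Hypothetical 32-byte test key in hex; NOT the project's real key.
KEY_HEX=00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff

to_hex() {  # stdin -> one lowercase hex string
  od -An -v -tx1 | tr -d ' \n'
}

hmac_hex() {  # HMAC-SHA-256 of stdin as lowercase hex
  openssl dgst -sha256 -mac HMAC -macopt "hexkey:$KEY_HEX" -hex | awk '{print $NF}'
}

# seal_block <plain_file> <block_file>: writes IV || ciphertext || HMAC
seal_block() {
  openssl rand 16 > "$2.iv"
  openssl enc -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$(to_hex < "$2.iv")" \
    -in "$1" -out "$2.ct"                       # PKCS7 padding is the default
  cat "$2.iv" "$2.ct" > "$2.body"               # HMAC input is IV || ciphertext
  openssl dgst -sha256 -mac HMAC -macopt "hexkey:$KEY_HEX" -binary "$2.body" > "$2.mac"
  cat "$2.body" "$2.mac" > "$2"
  rm -f "$2.iv" "$2.ct" "$2.body" "$2.mac"
}

# open_block <block_file>: verifies the HMAC first, then prints the plaintext
open_block() {
  local size body_len expected actual iv
  size=$(wc -c < "$1" | tr -d ' ')
  body_len=$(( size - 32 ))
  expected=$(dd if="$1" bs=1 skip="$body_len" count=32 2>/dev/null | to_hex)
  actual=$(dd if="$1" bs=1 count="$body_len" 2>/dev/null | hmac_hex)
  [ "$actual" = "$expected" ] || { echo "HMAC mismatch" >&2; return 1; }
  iv=$(dd if="$1" bs=1 count=16 2>/dev/null | to_hex)
  dd if="$1" bs=1 skip=16 count=$(( body_len - 16 )) 2>/dev/null \
    | openssl enc -d -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$iv"
}
```

For a 5-byte input such as "Hello", `seal_block` should produce a 64-byte block (16 IV + 16 ciphertext + 32 HMAC), which is a useful sanity value for the worked example.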

Pattern 3: Version-Gated Features

What: Use the version byte and flags field to control which features are active, allowing the same format to work with and without obfuscation.
When to use: Phase 6 obfuscation features are defined in the spec now but activated by flag bits.

Flags byte (offset 0x05):
  Bit 0: Per-file compression enabled (0 = raw, 1 = gzip)
  Bit 1: TOC encryption enabled (0 = plaintext TOC, 1 = AES-encrypted TOC)
  Bit 2: XOR header obfuscation (0 = off, 1 = on)
  Bit 3: Decoy padding between blocks (0 = off, 1 = on)
  Bits 4-7: Reserved (must be 0)

Decoders MUST check flags and skip unsupported features gracefully.
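
A decoder-side check of the flags byte might look like the sketch below. The function name `decode_flags` and the feature labels are illustrative; rejecting archives with reserved bits set is an assumption made here, since the text only states that those bits must be 0.

```shell
# decode_flags <flags byte as decimal>: prints the active features
decode_flags() {
  local flags="$1" out=""
  # Bits 4-7 are reserved; this sketch treats any of them being set as an error.
  if [ $(( flags & 0xF0 )) -ne 0 ]; then
    echo "reserved flag bits set" >&2
    return 1
  fi
  [ $(( flags & 1 )) -ne 0 ] && out="$out compression"
  [ $(( flags & 2 )) -ne 0 ] && out="$out toc_encryption"
  [ $(( flags & 4 )) -ne 0 ] && out="$out xor_header"
  [ $(( flags & 8 )) -ne 0 ] && out="$out decoy_padding"
  echo "${out# }"
}

decode_flags 5   # bits 0 and 2 set
```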

Pattern 4: Worked Example with Hex Dump

What: A concrete archive with known inputs showing every byte.
When to use: Mandatory per success criteria -- the spec must include at least one complete worked example.

Example archive: 2 files
  File 1: "hello.txt" (5 bytes: "Hello")
  File 2: "test.apk" (simulated 32 bytes)

Key: 0x00112233...EEFF (32 bytes, shown in full)
IV for file 1: 0xAABBCCDD... (16 bytes)
IV for file 2: 0x11223344... (16 bytes)

Complete hex dump:
  0000: CA FE xx xx 01 01 02 00  ...  <- header (magic, version, flags, count)
  ...
  [every byte annotated with field name]

Anti-Patterns to Avoid

Variable-length header without explicit size field: Shell decoders need to know exactly where to dd skip= to. Every variable-length region must have its size recorded in a preceding fixed-offset field.
Implicit padding assumptions: Never assume "the decoder will figure out padding." Explicitly state PKCS7 rules and encrypted size formula in the spec.
Mixing concerns in field table: Don't combine "offset within archive" with "offset within data block." Use absolute offsets from archive start everywhere.
Underspecifying endianness: Every multi-byte integer must state "LE" (little-endian) explicitly. The shell decoder reads bytes with dd and must know byte order.
Ambiguous HMAC scope: The spec must state EXACTLY which bytes are fed to HMAC. "HMAC of the ciphertext" is ambiguous (does it include IV? padding? length?). State: "HMAC-SHA-256(key, IV || ciphertext)" with byte ranges.

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Authenticated encryption	Custom MAC scheme	Standard encrypt-then-MAC (HMAC-SHA-256 over IV+ciphertext)	Subtle errors (MAC-then-encrypt, encrypt-and-MAC) lead to padding oracle attacks
Key for HMAC vs encryption	Ad-hoc key splitting	Either: (a) use same 32-byte key for both (acceptable per v1 scope), or (b) HKDF with distinct labels	IETF AEAD spec splits key; but for hardcoded key with no key reuse across protocols, same key is cryptographically safe for AES-CBC + HMAC-SHA-256 specifically
Block padding	Manual zero-padding	PKCS7 (built into openssl enc, javax.crypto)	Zero-padding is ambiguous for binary files; PKCS7 is unambiguous and universally supported
Compression framing	Custom compression headers	Standard gzip stream (RFC 1952)	GZIPInputStream and gunzip handle framing automatically

Key insight: The format spec should use standard cryptographic constructions (encrypt-then-MAC, PKCS7, gzip) composed together, rather than inventing novel schemes. The "custom" part is the container format (header, TOC, block layout), not the cryptographic primitives inside it.

Common Pitfalls

Pitfall 1: Shell Decoder Byte Extraction Fragility

What goes wrong: The shell decoder uses dd bs=1 skip=N count=M to extract fields. If any offset or size in the spec is wrong by even 1 byte, the entire decode chain fails silently (produces garbage, not an error).
Why it happens: Off-by-one errors in offset calculations, or forgetting that PKCS7 adds a full block when input is already block-aligned.
How to avoid: The spec's worked example must include a step-by-step shell decode walkthrough: "To extract file 1 IV: dd if=archive.bin bs=1 skip=48 count=16". Test the worked example's offsets manually.
Warning signs: The worked example's offsets don't add up when you manually sum field sizes.

Pitfall 2: HMAC Input Ambiguity

What goes wrong: Rust computes HMAC over IV || ciphertext, Kotlin computes it over just ciphertext, shell computes it over IV || ciphertext || padding_length. All three produce different MACs for the same data.
Why it happens: The spec says "HMAC of the encrypted data" without defining the exact byte range.
How to avoid: Specify HMAC input as: "The 16-byte IV followed by the ciphertext bytes (including PKCS7 padding). Total HMAC input length = 16 + encrypted_size." Include the expected HMAC value in the worked example.
Warning signs: Any phrase like "HMAC of the data" without byte-range specification.

Pitfall 3: Encrypted Size Calculation Error

What goes wrong: The file table stores encrypted_size but the value is wrong because the spec doesn't account for PKCS7 padding correctly.
Why it happens: AES-CBC with PKCS7 always pads: if input is N bytes, output is (floor(N/16) + 1) * 16 bytes. A 16-byte input produces 32 bytes of ciphertext, not 16.
How to avoid: State the formula explicitly, using integer (floor) division: encrypted_size = ((compressed_size / 16) + 1) * 16. Include examples: 0 bytes -> 16, 1 byte -> 16, 15 bytes -> 16, 16 bytes -> 32, 17 bytes -> 32.
Warning signs: File table encrypted_size equals compressed_size rounded up (misses the always-add-block rule).
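
The size formula maps directly onto shell integer arithmetic, where `/` already truncates. A minimal sketch (the function name `encrypted_size` is chosen here for illustration):

```shell
# encrypted_size <plaintext length in bytes>: ciphertext length under AES-CBC + PKCS7
encrypted_size() {
  echo $(( ($1 / 16 + 1) * 16 ))
}

# Reproduce the boundary cases listed above.
for n in 0 1 15 16 17; do
  echo "$n -> $(encrypted_size "$n")"
done
```

The 16 -> 32 case is the one that a plain "round up to a multiple of 16" gets wrong.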

Pitfall 4: Little-Endian Parsing in Shell

What goes wrong: Shell script reads a 4-byte LE integer as big-endian, getting the wrong value.
Why it happens: xxd -p prints bytes in file order, so converting its output directly from hex to decimal treats the value as big-endian, whereas od with multi-byte types (e.g., od -t u4) groups bytes in host order. Busybox xxd may not support the -e flag.
How to avoid: The spec should include a reference shell function for reading LE integers: extract 4 bytes with dd, reverse byte order with a shell snippet, convert hex to decimal with printf. Document this in the spec appendix.
Warning signs: Testing only with values < 256 (where endianness doesn't matter).

Pitfall 5: XOR Obfuscation Key in Spec vs. Implementation

What goes wrong: Phase 6 implements XOR obfuscation but the key or XOR range wasn't specified, so each decoder uses different parameters.
Why it happens: Phase 1 defers obfuscation to Phase 6 and doesn't fully specify it.
How to avoid: The spec MUST define: XOR key bytes, which byte range is XORed (e.g., "bytes 0x00-0x27 of the header"), and what the XORed header looks like in the worked example (even if the v1 example shows flags=0 with obfuscation off).
Warning signs: Obfuscation section says "TBD" or "see Phase 6."
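
To make the "key bytes plus byte range" requirement concrete, here is a sketch of a repeating-key XOR over the first 40 bytes of a file. The repeating-key scheme, the key values in the usage line, and the function name `xor_header_hex` are all assumptions for illustration; the actual XOR parameters are a design choice the spec still has to make.

```shell
# xor_header_hex <file> <key bytes as space-separated decimals>
# Prints the first 40 bytes (0x00-0x27) XORed with the repeating key, as hex.
xor_header_hex() {
  local file="$1" key="$2" i=0 nkey b k
  nkey=$(echo $key | wc -w | tr -d ' ')
  for b in $(od -An -v -tu1 -N 40 "$file"); do
    # Pick key byte number (i mod nkey), 1-based for cut.
    k=$(echo $key | cut -d ' ' -f $(( i % nkey + 1 )))
    printf '%02x' $(( b ^ k ))
    i=$(( i + 1 ))
  done
  echo
}

# Usage with a hypothetical 2-byte key 0xA5 0x5A:
#   xor_header_hex archive.bin "165 90"
```

Because XOR is its own inverse, the same routine de-obfuscates the header when applied to the obfuscated bytes.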

Pitfall 6: Filename Encoding

What goes wrong: Cyrillic filenames (requirement SHL-03) are garbled because the spec doesn't specify encoding.
Why it happens: UTF-8 vs. Latin-1 assumption mismatch between encoders.
How to avoid: Spec must state: "All filenames are UTF-8 encoded. The file table stores filename as a length-prefixed byte string: u16 length (in bytes) followed by that many UTF-8 bytes."
Warning signs: Spec shows filename field as "fixed N bytes, null-terminated."
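
The length prefix counts bytes, not characters, which matters for Cyrillic names where each letter takes two bytes in UTF-8. A small sketch that prints the encoded name field as hex (the function name `name_field_hex` is illustrative):

```shell
# name_field_hex <filename>: u16 LE byte length followed by the UTF-8 bytes, as hex
name_field_hex() {
  local len
  len=$(printf '%s' "$1" | wc -c | tr -d ' ')          # wc -c counts bytes
  printf '%02x%02x' $(( len & 0xFF )) $(( (len >> 8) & 0xFF ))
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
  echo
}

name_field_hex "hello.txt"
name_field_hex "привет.txt"
```

For "hello.txt" the prefix is `0900` (9 bytes); for "привет.txt" it is `1000` (16 bytes for 10 characters), assuming the script itself is stored as UTF-8.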

Code Examples

This phase produces no code. However, the following reference patterns should appear IN the spec document itself:

Shell LE Integer Reading Function (Spec Appendix)

# Read a little-endian u32 from binary file at offset
# Usage: read_le_u32 <file> <offset>
read_le_u32() {
  local file="$1" offset="$2"
  local hex=$(dd if="$file" bs=1 skip="$offset" count=4 2>/dev/null | xxd -p)
  # Reverse bytes: abcdef01 -> 01efcdab
  local b0=${hex:0:2} b1=${hex:2:2} b2=${hex:4:2} b3=${hex:6:2}
  printf '%d' "0x${b3}${b2}${b1}${b0}"
}

# Read a little-endian u16 from binary file at offset
read_le_u16() {
  local file="$1" offset="$2"
  local hex=$(dd if="$file" bs=1 skip="$offset" count=2 2>/dev/null | xxd -p)
  local b0=${hex:0:2} b1=${hex:2:2}
  printf '%d' "0x${b1}${b0}"
}
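
Where xxd is unavailable, the same decoding can be sketched with od alone. The variant below is an assumption-labeled alternative, not the reference function: it reads the bytes as decimals with `od -t u1` and assembles the value arithmetically, so no hex string reversal is needed. The name `read_le` is illustrative.

```shell
# read_le <file> <offset> <nbytes>: decode an unsigned little-endian integer
read_le() {
  local file="$1" offset="$2" nbytes="$3" value=0 pos=0 b
  for b in $(dd if="$file" bs=1 skip="$offset" count="$nbytes" 2>/dev/null \
               | od -An -v -tu1); do
    value=$(( value | (b << pos) ))   # first byte is least significant
    pos=$(( pos + 8 ))
  done
  echo "$value"
}

# Usage: read_le archive.bin 8 4   # u32 at offset 0x08
```

Testing with a value of 256 or more (for example bytes `02 01`, which decode to 258) confirms that the byte order is handled correctly.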

Shell HMAC Verification (Spec Appendix)

# Verify HMAC-SHA256 of a data block
# Usage: verify_hmac <file> <data_offset> <data_length> <expected_hmac_hex> <key_hex>
verify_hmac() {
  local file="$1" offset="$2" length="$3" expected="$4" key="$5"
  local actual=$(dd if="$file" bs=1 skip="$offset" count="$length" 2>/dev/null \
    | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${key}" -hex 2>/dev/null \
    | awk '{print $NF}')
  [ "$actual" = "$expected" ]
}

Kotlin Decrypt Pattern (Spec Appendix)

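// Reference decrypt for a single file entry.
// Callers must verify the HMAC over IV || ciphertext before calling this.
import javax.crypto.Cipher
import javax.crypto.spec.IvParameterSpec
import javax.crypto.spec.SecretKeySpec

fun decryptFileEntry(data: ByteArray, iv: ByteArray, key: ByteArray): ByteArray {
    // "PKCS5Padding" is the JCA name; with AES's 16-byte blocks it behaves as PKCS7
    val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
    val secretKey = SecretKeySpec(key, "AES")
    val ivSpec = IvParameterSpec(iv)
    cipher.init(Cipher.DECRYPT_MODE, secretKey, ivSpec)
    return cipher.doFinal(data)  // PKCS7 unpadding is automatic
}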

State of the Art

Old Approach	Current Approach	When Changed	Impact
MAC-then-encrypt	Encrypt-then-MAC	~2010 (formalized by Bellare & Namprempre in 2000)	Prevents padding oracle attacks; HMAC verification can reject before decryption
Fixed filenames (8.3)	Length-prefixed UTF-8	Standard practice	Supports Cyrillic/Unicode filenames (SHL-03)
Single IV for entire archive	Per-file random IV	Standard practice	Prevents cross-file pattern analysis
AEAD modes (GCM)	Still CBC+HMAC for shell compat	Ongoing	GCM is preferred when all consumers support it; busybox openssl does not support GCM

Deprecated/outdated:

openssl enc with -salt and password-based key derivation: Not applicable here (we use raw key with -K/-iv/-nosalt)
PKCS5Padding vs PKCS7Padding naming confusion: In Java/Android, PKCS5Padding actually implements PKCS7 for 16-byte blocks. The spec should note this equivalence.

Open Questions

HMAC Key: Same as encryption key or derived subkey?
- What we know: IETF AEAD spec uses split keys (first 32 bytes for MAC, last 32 bytes for encryption from a 64-byte master key). Best practice recommends separate keys. However, for a hardcoded key scenario with no key reuse across protocols, using the same 32-byte key for both AES-CBC and HMAC-SHA-256 is cryptographically safe (AES and HMAC have different internal structures, no known attack exploits key reuse between them).
- What's unclear: The project STATE.md explicitly flags this as an open question.
- Recommendation: Use the SAME 32-byte hardcoded key for both AES-256-CBC encryption and HMAC-SHA-256. Rationale: (a) simplifies all three decoders, (b) the shell decoder would need an HKDF implementation if keys differ (busybox has no HKDF), (c) cryptographically safe for this specific combination, (d) v2 requirement SEC-01 already plans HKDF-derived per-file keys which will supersede this. Document in the spec that v1 uses a single key and v2 will derive subkeys.
Busybox xxd availability on target device
- What we know: BusyBox source includes xxd as a configurable applet (hexdump_xxd.c). The -p (plain hex dump) flag is widely supported. The -e (little-endian) flag may NOT be available in busybox xxd.
- What's unclear: The exact busybox build on the target Android 13 Qualcomm device.
- Recommendation: The spec should define shell operations using xxd -p (plain hex) only, and implement LE byte reversal manually in shell (as shown in code examples above). Fallback: od -A n -t x1 can replace xxd -p. Document both options in the spec appendix.
Busybox openssl dgst -sha256 -mac HMAC availability
- What we know: Standard OpenSSL supports -mac HMAC -macopt hexkey:.... Busybox builds vary; some include a stripped-down openssl.
- What's unclear: Whether the target device's busybox-openssl supports the -mac/-macopt flags.
- Recommendation: The spec should document the HMAC verification command and note that if busybox openssl lacks -mac support, the shell decoder may skip HMAC verification (degrade gracefully). This is acceptable for a fallback decoder. Document the exact command and the degraded path.
Decoy padding size and placement
- What we know: Phase 6 requires random data between blocks. The spec must define this now.
- What's unclear: How much padding, whether it's fixed or variable per gap, how the decoder knows where real data starts.
- Recommendation: Define in the file table entry: a padding_after field (u16, LE) indicating how many random bytes follow this file's data block. Starting from data_offset, the decoder skips 16 (IV) + encrypted_size + 32 (HMAC) + padding_after bytes to reach the next file's data. When flags bit 3 is 0 (no decoy padding), padding_after is always 0.
TOC (file table) encryption IV storage
- What we know: FMT-07 requires encrypted file table with its own IV.
- What's unclear: Where the TOC IV is stored if the header itself is XOR-obfuscated.
- Recommendation: Store the TOC IV in the header at a fixed offset (e.g., bytes 0x10-0x1F). XOR obfuscation (FMT-06) is applied AFTER the header is fully constructed, including the TOC IV. The decoder de-XORs the header first, then reads the TOC IV, then decrypts the TOC. Order of operations: de-XOR header -> read TOC IV -> decrypt TOC -> read file entries -> for each file: verify HMAC -> decrypt -> decompress -> verify SHA-256.
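
The decoy-padding recommendation implies a simple rule for walking from one data block to the next. The sketch below assumes that data_offset points at the block's IV and that encrypted_size counts ciphertext only, as in the IV(16) + ciphertext(N) + HMAC(32) block layout; the function name is illustrative.

```shell
# next_block_offset <data_offset> <encrypted_size> <padding_after>
next_block_offset() {
  # 16-byte IV + ciphertext + 32-byte HMAC + optional decoy padding
  echo $(( $1 + 16 + $2 + 32 + $3 ))
}

next_block_offset 259 16 0    # no decoy padding
next_block_offset 259 16 100  # 100 bytes of decoy padding
```

Since every file table entry already carries an absolute data_offset, this value is best used as a consistency check on the next entry rather than as the primary way to locate blocks.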

Recommended Format Layout

Based on research, the following layout balances all constraints:

+==========================+
|     ARCHIVE HEADER       |  Fixed size (e.g., 40 bytes)
|  magic(4) | ver(1) |     |
|  flags(1) | count(2) |   |
|  toc_offset(4) |         |
|  toc_size(4) |           |
|  toc_iv(16) |            |
|  reserved(8)             |
+==========================+
|     FILE TABLE (TOC)     |  Variable size, optionally encrypted
|  Entry 1: name, sizes,   |
|    offset, iv, hmac,     |
|    sha256, flags         |
|  Entry 2: ...            |
|  ...                     |
+==========================+
|     DATA BLOCK 1         |  IV(16) + ciphertext(N) + HMAC(32)
+--------------------------+
|     [DECOY PADDING 1]    |  Optional random bytes (Phase 6)
+--------------------------+
|     DATA BLOCK 2         |  IV(16) + ciphertext(N) + HMAC(32)
+--------------------------+
|     [DECOY PADDING 2]    |  Optional random bytes (Phase 6)
+--------------------------+
|     ...                  |
+==========================+

File Table Entry Fields (per file)

Field	Size	Type	Description
name_length	2	u16 LE	Filename length in bytes
name	variable	UTF-8 bytes	Filename (not null-terminated)
original_size	4	u32 LE	Original file size before compression
compressed_size	4	u32 LE	Size after gzip compression
encrypted_size	4	u32 LE	Size after AES-CBC encryption (with PKCS7)
data_offset	4	u32 LE	Absolute offset of this file's data block
iv	16	bytes	AES-CBC IV for this file
hmac	32	bytes	HMAC-SHA-256 of (IV + ciphertext)
sha256	32	bytes	SHA-256 of original (uncompressed) file
compression_flag	1	u8	0 = raw (no compression), 1 = gzip
padding_after	2	u16 LE	Bytes of decoy padding after this data block
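
Summing the fields in this table gives each entry a fixed part of 101 bytes plus the filename bytes, which is what a decoder needs in order to step through a plaintext TOC. A sketch under that reading of the table (the function name `toc_entry_size` is illustrative):

```shell
# toc_entry_size <filename>: size in bytes of one file table entry
toc_entry_size() {
  local name_len
  name_len=$(printf '%s' "$1" | wc -c | tr -d ' ')
  # 2 name_length + name + 4x4 (original, compressed, encrypted sizes, data_offset)
  # + 16 iv + 32 hmac + 32 sha256 + 1 compression_flag + 2 padding_after
  echo $(( 2 + name_len + 16 + 16 + 32 + 32 + 1 + 2 ))
}

toc_entry_size "hello.txt"
toc_entry_size "test.apk"
```

For the two-file example this gives 110 and 109 bytes, so an unencrypted TOC placed directly after a 40-byte header would span 219 bytes and the first data block would start at offset 259.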

Critical Design Decisions for the Spec

TOC placement: After header, before data blocks. This allows the decoder to read the entire TOC first, then seek to individual data blocks. Shell decoder reads TOC with a single dd call.
Absolute offsets: data_offset in each file table entry is absolute from archive byte 0. This avoids cumulative offset calculation errors in the shell decoder.
IV stored in BOTH TOC and data block: The IV appears in the file table entry AND as the first 16 bytes of the data block. This is redundant but allows two decode strategies: (a) read IV from TOC (fast), or (b) read IV from data block (streaming). The spec should mandate both are identical.
HMAC covers IV + ciphertext: HMAC input is exactly: the 16-byte IV followed by the encrypted data (ciphertext including PKCS7 padding). The HMAC does NOT cover the HMAC field itself or any TOC metadata.
Magic bytes: Must NOT match any known file signature. Consult the Wikipedia list of file signatures and the Gary Kessler file signatures table. Use 4+ bytes that do not appear in any standard signature database. Starting with a null byte (0x00) is a good practice to signal "this is binary, not text."

Sources

Primary (HIGH confidence)

IETF draft-mcgrew-aead-aes-cbc-hmac-sha2-01 - AEAD construction with AES-CBC + HMAC, key splitting, MAC input specification
OpenSSL enc documentation (3.3) - -K, -iv, -nosalt flags for raw key mode, PKCS padding behavior
OpenSSL dgst documentation - HMAC computation with -mac HMAC -macopt hexkey:...
Android Cryptography reference - Supported ciphers on Android 13
Wikipedia: List of file signatures - Magic bytes collision avoidance
PKWARE ZIP specification - Reference for archive format structure patterns (header + central directory + entries)

Secondary (MEDIUM confidence)

Encrypt-then-MAC article (Erik Ringsmuth) - Practical encrypt-then-MAC construction walkthrough, verified against IETF draft
AES-CBC + HMAC best practices (ProAndroidDev) - Android-specific AES-CBC + HMAC patterns
BusyBox documentation - Applet availability (dd, xxd, openssl)
Designing File Formats (fadden.com) - General binary format design principles
GameDev binary format design - Custom binary format design patterns
XOR obfuscation analysis (SANS) - XOR detection methods (informs how to design XOR obfuscation that is less trivially breakable)

Tertiary (LOW confidence)

BusyBox xxd -e flag availability on Android 13 Qualcomm - Not verified for specific target device
BusyBox openssl dgst -mac HMAC support - Varies by build, not verified for target

Metadata

Confidence breakdown:

Standard stack: HIGH - AES-256-CBC, HMAC-SHA-256, PKCS7, gzip are well-established standards with extensive documentation
Architecture: HIGH - Archive format design patterns (header + TOC + data blocks) are well-understood; ZIP, TAR, and similar formats provide proven structural patterns
Pitfalls: HIGH - Known issues (HMAC ambiguity, PKCS7 off-by-one, LE parsing, filename encoding) are well-documented in cryptographic engineering literature
Shell compatibility: MEDIUM - Busybox applet availability varies by build configuration; the spec must accommodate fallbacks
Obfuscation details: MEDIUM - XOR obfuscation is well-understood but the specific parameters (key, byte range) are design choices without external standards to reference

Research date: 2026-02-24 Valid until: 2026-03-24 (stable domain, 30-day validity)
