android-encrypted-archiver/.planning/phases/01-format-specification/01-RESEARCH.md
2026-02-24 23:05:52 +03:00


Phase 1: Format Specification - Research

Researched: 2026-02-24
Domain: Binary format design, cryptographic data structures, cross-platform compatibility
Confidence: HIGH

Summary

Phase 1 has a single deliverable: a complete binary format specification document that three independent implementations (Rust CLI, Kotlin Android decoder, busybox shell script) will build against. The spec must define every byte offset, field size, and encoding so that implementers have zero ambiguity. This is a documentation phase with no code.

The critical design tensions are: (1) the format must be simple enough for a shell script using dd/xxd/openssl to parse, yet structured enough to hold per-file encryption metadata; (2) the format must accommodate Phase 6 obfuscation features (XOR headers, encrypted TOC, decoy padding) even though they are not implemented until later; (3) all three decoders share a single hardcoded 32-byte key, and the spec must resolve the open question of whether HMAC uses the same key or a derived subkey.

Primary recommendation: Write the spec as a single Markdown document with ASCII diagrams showing byte layouts, a field-reference table for every structure (header, file table entry, data block), and a complete worked example showing the hex dump of a 2-file archive. Address obfuscation as reserved fields and documented future behavior (version flag controls whether obfuscation is active).

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| FMT-05 | Specification of the format as a document (before any implementation begins) | All findings below directly support creating this spec: field layouts, crypto parameter placement, worked example patterns, shell compatibility constraints, and obfuscation placeholders |
</phase_requirements>

Standard Stack

This phase produces a document, not code. There are no library dependencies. The "stack" here is the set of standards and reference materials the spec must conform to.

Core Standards

| Standard | Reference | Purpose | Why It Governs the Spec |
|----------|-----------|---------|-------------------------|
| AES-256-CBC | NIST SP 800-38A | Block cipher mode | Defines 16-byte block size, IV requirements, PKCS7 padding behavior |
| HMAC-SHA-256 | RFC 2104, FIPS 198-1 | Message authentication | Defines 32-byte output, key requirements |
| PKCS7 padding | RFC 5652 Section 6.3 | Block alignment | Encrypted size = ceil((input_len + 1) / 16) * 16; always adds at least 1 byte |
| SHA-256 | FIPS 180-4 | File integrity checksum | 32-byte digest for post-decompression verification |
| DEFLATE/gzip | RFC 1952 | Compression | Standard gzip stream, decompressed by GZIPInputStream (Kotlin) and gunzip (shell) |
| Encrypt-then-MAC | Bellare & Namprempre 2000; IETF draft-mcgrew-aead-aes-cbc-hmac-sha2 | Authenticated encryption construction | HMAC computed over IV + ciphertext, verified before decryption |

Key Size Constants

| Parameter | Size | Notes |
|-----------|------|-------|
| AES key | 32 bytes (256 bits) | Hardcoded, shared across all decoders |
| AES block size | 16 bytes | Governs IV size and PKCS7 padding |
| IV | 16 bytes | Random per file, stored in cleartext |
| HMAC-SHA-256 output | 32 bytes | Appended after or stored alongside ciphertext |
| SHA-256 checksum | 32 bytes | Stored in file table, verified after decompression |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| AES-256-CBC + HMAC | AES-256-GCM (AEAD) | GCM is simpler (single operation), but the openssl enc command that the shell decoder relies on does NOT support GCM (it does not handle AEAD modes); CBC + HMAC is the only option compatible with all three decoders |
| HMAC-SHA-256 | HMAC-SHA-512 truncated | IETF AEAD spec uses SHA-512 for AES-256, but SHA-256 is simpler, sufficient for integrity, and natively available in all three target environments |
| Little-endian | Big-endian (network order) | Big-endian is traditional for network protocols, but the target CPUs (x86, ARM on Android) are little-endian natively and Rust reads LE values directly; shell od -t u4 reads host-endian, which is LE on ARM |
| PKCS7 | Zero-padding | PKCS7 is standard for AES-CBC, directly supported by openssl enc and javax.crypto PKCS5Padding (PKCS5 = PKCS7 for 16-byte blocks) |

Architecture Patterns

docs/
  FORMAT.md           # The format specification (THE deliverable)
    - Overview & design goals
    - Notation conventions
    - Archive structure diagram (ASCII art)
    - Header definition (byte-level table)
    - File table entry definition (byte-level table)
    - Data block layout
    - Encryption & authentication details
    - Compression details
    - Obfuscation features (Phase 6 preview)
    - Worked example with hex dump
    - Version compatibility rules

Pattern 1: Field Definition Table

What: Every binary structure is specified as a table with offset, size, type, endianness, and description.
When to use: For every fixed-layout structure in the format (header, file table entries).

Example:

### Archive Header (40 bytes)

| Offset | Size | Type    | Endian | Field            | Description                          |
|--------|------|---------|--------|------------------|--------------------------------------|
| 0x00   | 4    | bytes   | -      | magic            | Custom magic bytes (placeholder; final 4-byte value TBD) |
| 0x04   | 1    | u8      | -      | version          | Format version (1 for v1)            |
| 0x05   | 1    | u8      | -      | flags            | Bit field; see flags byte layout (Pattern 3) |
| 0x06   | 2    | u16     | LE     | file_count       | Number of files in archive           |
| 0x08   | 4    | u32     | LE     | toc_offset       | Offset to file table from file start |
| 0x0C   | 4    | u32     | LE     | toc_size         | Size of file table in bytes          |
| 0x10   | 16   | bytes   | -      | toc_iv           | IV for encrypted file table (Phase 6)|
| 0x20   | 8    | bytes   | -      | reserved         | Reserved for future use (zero-filled)|
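A field table like this is easy to get wrong by one byte, so it can help to derive the offsets from the sizes rather than typing them by hand. The following is a minimal sketch of that cross-check; `header_layout` is a hypothetical helper name, and the field list simply mirrors the example header above.

```shell
#!/bin/sh
# Sketch: derive header field offsets from field sizes so the table can be
# cross-checked. header_layout is a hypothetical helper, not part of the spec.
header_layout() {
  offset=0
  for field in magic:4 version:1 flags:1 file_count:2 \
               toc_offset:4 toc_size:4 toc_iv:16 reserved:8; do
    name=${field%%:*}   # text before the colon
    size=${field##*:}   # text after the colon
    printf '0x%02X %-10s %d\n' "$offset" "$name" "$size"
    offset=$((offset + size))
  done
  # The running total must equal the declared header size (40 bytes)
  echo "total=$offset"
}

header_layout
```

If the final line is not `total=40`, or a printed offset disagrees with the table, the table has an arithmetic error.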

Pattern 2: Encrypt-then-MAC Construction (per file)

What: The exact order of operations and data layout for each file's encrypted block.
When to use: Defines how every file is stored in the archive.

Pipeline per file:
  1. Read original file -> compute SHA-256 checksum -> store in file table
  2. Compress with gzip (if compression flag set) -> compressed_data
  3. Pad compressed_data with PKCS7 to AES block boundary
  4. Generate random 16-byte IV
  5. Encrypt padded data with AES-256-CBC using IV -> ciphertext
  6. Compute HMAC-SHA-256 over (IV || ciphertext) -> mac
  7. Store: IV (16) || ciphertext (variable) || HMAC (32)

Data block layout:
  [IV: 16 bytes][ciphertext: N bytes][HMAC: 32 bytes]
  Where N = (floor(compressed_size / 16) + 1) * 16  (PKCS7 always adds 1 to 16 bytes)
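The data block layout translates directly into the byte ranges a decoder has to extract. The sketch below computes them as absolute archive offsets; `block_ranges` is a hypothetical helper, and its two arguments are assumed to come from the file table (`data_offset` and `encrypted_size`).

```shell
#!/bin/sh
# Sketch: byte ranges inside one data block, as absolute archive offsets.
# block_ranges is a hypothetical helper; both arguments come from the file table.
# Usage: block_ranges <data_offset> <encrypted_size>
block_ranges() {
  iv_off=$1
  ct_off=$(( $1 + 16 ))            # ciphertext starts after the 16-byte IV
  hmac_off=$(( ct_off + $2 ))      # stored HMAC follows the ciphertext
  hmac_input_len=$(( 16 + $2 ))    # HMAC input is IV || ciphertext
  block_len=$(( 16 + $2 + 32 ))    # IV + ciphertext + HMAC
  echo "iv_off=$iv_off ct_off=$ct_off hmac_off=$hmac_off hmac_input_len=$hmac_input_len block_len=$block_len"
}

# Hypothetical block at offset 259 holding 16 bytes of ciphertext
block_ranges 259 16
```

With these values, the HMAC input would be read with `dd bs=1 skip=$iv_off count=$hmac_input_len`, and the stored HMAC with `skip=$hmac_off count=32`.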

Pattern 3: Version-Gated Features

What: Use the version byte and flags field to control which features are active, allowing the same format to work with and without obfuscation.
When to use: Phase 6 obfuscation features are defined in the spec now but activated by flag bits.

Flags byte (offset 0x05):
  Bit 0: Per-file compression enabled (0 = raw, 1 = gzip)
  Bit 1: TOC encryption enabled (0 = plaintext TOC, 1 = AES-encrypted TOC)
  Bit 2: XOR header obfuscation (0 = off, 1 = on)
  Bit 3: Decoy padding between blocks (0 = off, 1 = on)
  Bits 4-7: Reserved (must be 0)

Decoders MUST check flags and skip unsupported features gracefully.
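As an illustration of how a decoder might test these bits, here is a minimal shell sketch. The function name `decode_flags` and the output format are hypothetical, and treating any set reserved bit as a hard error is an assumption drawn from the "must be 0" rule, not a decided policy.

```shell
#!/bin/sh
# Sketch: decode the flags byte (offset 0x05) using shell arithmetic.
# decode_flags is a hypothetical helper; rejecting reserved bits is an assumption.
# Usage: decode_flags <flags_as_decimal_or_0xNN>
decode_flags() {
  flags=$(( $1 ))
  # Bits 4-7 are reserved and must be zero
  if [ $(( flags & 0xF0 )) -ne 0 ]; then
    echo "error: reserved flag bits set" >&2
    return 1
  fi
  compression=$(( flags & 1 ))
  toc_encrypted=$(( (flags >> 1) & 1 ))
  xor_header=$(( (flags >> 2) & 1 ))
  decoy_padding=$(( (flags >> 3) & 1 ))
  echo "compression=$compression toc_encrypted=$toc_encrypted xor_header=$xor_header decoy_padding=$decoy_padding"
}

decode_flags 0x01
```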

Pattern 4: Worked Example with Hex Dump

What: A concrete archive with known inputs showing every byte.
When to use: Mandatory per success criteria -- the spec must include at least one complete worked example.

Example archive: 2 files
  File 1: "hello.txt" (5 bytes: "Hello")
  File 2: "test.apk" (simulated 32 bytes)

Key: 0x00112233...EEFF (32 bytes, shown in full)
IV for file 1: 0xAABBCCDD... (16 bytes)
IV for file 2: 0x11223344... (16 bytes)

Complete hex dump:
  0000: CA FE xx xx 01 01 02 00  ...  <- header (magic, version, flags, count)
  ...
  [every byte annotated with field name]

Anti-Patterns to Avoid

  • Variable-length header without explicit size field: Shell decoders need to know exactly where to dd skip= to. Every variable-length region must have its size recorded in a preceding fixed-offset field.
  • Implicit padding assumptions: Never assume "the decoder will figure out padding." Explicitly state PKCS7 rules and encrypted size formula in the spec.
  • Mixing concerns in field table: Don't combine "offset within archive" with "offset within data block." Use absolute offsets from archive start everywhere.
  • Underspecifying endianness: Every multi-byte integer must state "LE" (little-endian) explicitly. The shell decoder reads bytes with dd and must know byte order.
  • Ambiguous HMAC scope: The spec must state EXACTLY which bytes are fed to HMAC. "HMAC of the ciphertext" is ambiguous (does it include IV? padding? length?). State: "HMAC-SHA-256(key, IV || ciphertext)" with byte ranges.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Authenticated encryption | Custom MAC scheme | Standard encrypt-then-MAC (HMAC-SHA-256 over IV+ciphertext) | Subtle errors (MAC-then-encrypt, encrypt-and-MAC) lead to padding oracle attacks |
| Key for HMAC vs encryption | Ad-hoc key splitting | Either: (a) use same 32-byte key for both (acceptable per v1 scope), or (b) HKDF with distinct labels | IETF AEAD spec splits the key; but for a hardcoded key with no key reuse across protocols, no known attack exploits using the same key for AES-CBC + HMAC-SHA-256 specifically |
| Block padding | Manual zero-padding | PKCS7 (built into openssl enc, javax.crypto) | Zero-padding is ambiguous for binary files; PKCS7 is unambiguous and universally supported |
| Compression framing | Custom compression headers | Standard gzip stream (RFC 1952) | GZIPInputStream and gunzip handle framing automatically |

Key insight: The format spec should use standard cryptographic constructions (encrypt-then-MAC, PKCS7, gzip) composed together, rather than inventing novel schemes. The "custom" part is the container format (header, TOC, block layout), not the cryptographic primitives inside it.

Common Pitfalls

Pitfall 1: Shell Decoder Byte Extraction Fragility

What goes wrong: The shell decoder uses dd bs=1 skip=N count=M to extract fields. If any offset or size in the spec is wrong by even 1 byte, the entire decode chain fails silently (produces garbage, not an error).
Why it happens: Off-by-one errors in offset calculations, or forgetting that PKCS7 adds a full block when input is already block-aligned.
How to avoid: The spec's worked example must include a step-by-step shell decode walkthrough: "To extract file 1 IV: dd if=archive.bin bs=1 skip=48 count=16". Test the worked example's offsets manually.
Warning signs: The worked example's offsets don't add up when you manually sum field sizes.

Pitfall 2: HMAC Input Ambiguity

What goes wrong: Rust computes HMAC over IV || ciphertext, Kotlin computes it over just ciphertext, shell computes it over IV || ciphertext || padding_length. All three produce different MACs for the same data.
Why it happens: The spec says "HMAC of the encrypted data" without defining the exact byte range.
How to avoid: Specify HMAC input as: "The 16-byte IV followed by the ciphertext bytes (including PKCS7 padding). Total HMAC input length = 16 + encrypted_size." Include the expected HMAC value in the worked example.
Warning signs: Any phrase like "HMAC of the data" without byte-range specification.

Pitfall 3: Encrypted Size Calculation Error

What goes wrong: The file table stores encrypted_size but the value is wrong because the spec doesn't account for PKCS7 padding correctly.
Why it happens: AES-CBC with PKCS7 always pads: if input is N bytes, output is (floor(N/16) + 1) * 16 bytes. A 16-byte input produces 32 bytes of ciphertext, not 16.
How to avoid: State the formula explicitly, using integer (floor) division: encrypted_size = ((compressed_size / 16) + 1) * 16. Include examples: 0 bytes -> 16, 1 byte -> 16, 15 bytes -> 16, 16 bytes -> 32, 17 bytes -> 32.
Warning signs: File table encrypted_size equals compressed_size rounded up to the next multiple of 16 (misses the always-add-block rule for already-aligned input).
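The formula maps directly onto shell integer arithmetic, which truncates on division. A minimal sketch follows; the function name `encrypted_size` is hypothetical and simply reuses the field name.

```shell
#!/bin/sh
# Sketch: PKCS7-padded ciphertext length for AES-CBC (16-byte blocks).
# Shell arithmetic uses integer division, which matches the floor() in the formula.
# Usage: encrypted_size <plaintext_length_in_bytes>
encrypted_size() {
  echo $(( ($1 / 16 + 1) * 16 ))
}

# Reproduce the boundary cases the spec should list
for n in 0 1 15 16 17; do
  echo "$n -> $(encrypted_size "$n")"
done
```

The loop prints 16 for inputs 0, 1, and 15, and 32 for inputs 16 and 17, matching the examples above.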

Pitfall 4: Little-Endian Parsing in Shell

What goes wrong: Shell script reads a 4-byte LE integer as big-endian, getting the wrong value.
Why it happens: The tools differ in how they present bytes: od's multi-byte formats (e.g., -t u4) use host byte order, while xxd -p prints bytes in file order, so a naive hex-to-decimal conversion treats them as big-endian. Busybox xxd may not support the -e flag.
How to avoid: The spec should include a reference shell function for reading LE integers: extract 4 bytes with dd, reverse byte order with a shell snippet, convert hex to decimal with printf. Document this in the spec appendix.
Warning signs: Testing only with values < 256 (where endianness doesn't matter).

Pitfall 5: XOR Obfuscation Key in Spec vs. Implementation

What goes wrong: Phase 6 implements XOR obfuscation but the key or XOR range wasn't specified, so each decoder uses different parameters.
Why it happens: Phase 1 defers obfuscation to Phase 6 and doesn't fully specify it.
How to avoid: The spec MUST define: XOR key bytes, which byte range is XORed (e.g., "bytes 0x00-0x27 of the header"), and what the XORed header looks like in the worked example (even if the v1 example shows flags=0 with obfuscation off).
Warning signs: Obfuscation section says "TBD" or "see Phase 6."

Pitfall 6: Filename Encoding

What goes wrong: Cyrillic filenames (requirement SHL-03) are garbled because the spec doesn't specify encoding.
Why it happens: UTF-8 vs. Latin-1 assumption mismatch between encoders.
How to avoid: Spec must state: "All filenames are UTF-8 encoded. The file table stores filename as a length-prefixed byte string: u16 length (in bytes) followed by that many UTF-8 bytes."
Warning signs: Spec shows filename field as "fixed N bytes, null-terminated."
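To make the length-prefix rule concrete, the sketch below emits a filename field the way the quoted wording describes it: a u16 LE byte count followed by the raw bytes. `encode_name` is a hypothetical helper; it assumes the name is already UTF-8 in the shell's input and counts bytes (not characters) with `wc -c`.

```shell
#!/bin/sh
# Sketch: emit a length-prefixed filename field (u16 LE byte length + raw bytes).
# encode_name is a hypothetical helper; the name is assumed to be UTF-8 already.
# Usage: encode_name <filename>
encode_name() {
  # wc -c counts bytes, so a 2-byte Cyrillic letter contributes 2 to the length
  len=$(( $(printf '%s' "$1" | wc -c) ))
  [ "$len" -le 65535 ] || return 1
  lo=$(printf '%03o' $(( len & 0xFF )))
  hi=$(printf '%03o' $(( len >> 8 )))
  # Little-endian: low byte first, written via octal escapes
  printf "\\${lo}\\${hi}"
  printf '%s' "$1"
}

# Show the encoded field for an ASCII name as hex bytes
encode_name "hello.txt" | od -A n -t x1
```

For `hello.txt` the field begins with `09 00`, followed by the nine name bytes; a Cyrillic name of the same visible length would carry a larger prefix because each letter occupies two bytes.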

Code Examples

This phase produces no code. However, the following reference patterns should appear IN the spec document itself:

Shell LE Integer Reading Function (Spec Appendix)

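# Read a little-endian u32 from binary file at offset
# Usage: read_le_u32 <file> <offset>
# Note: `local` and ${var:offset:length} are shell extensions (bash, and typically
# busybox ash), not POSIX sh; confirm they work in the target shell.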
read_le_u32() {
  local file="$1" offset="$2"
  local hex=$(dd if="$file" bs=1 skip="$offset" count=4 2>/dev/null | xxd -p)
  # Reverse bytes: abcdef01 -> 01efcdab
  local b0=${hex:0:2} b1=${hex:2:2} b2=${hex:4:2} b3=${hex:6:2}
  printf '%d' "0x${b3}${b2}${b1}${b0}"
}

# Read a little-endian u16 from binary file at offset
read_le_u16() {
  local file="$1" offset="$2"
  local hex=$(dd if="$file" bs=1 skip="$offset" count=2 2>/dev/null | xxd -p)
  local b0=${hex:0:2} b1=${hex:2:2}
  printf '%d' "0x${b1}${b0}"
}
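If xxd is missing from the target busybox build, the same readers can be written with od, and without the non-POSIX `local` and substring expansions. This is a minimal sketch of that fallback; the `_od` function names are hypothetical, and it assumes the target od accepts `-A n -t x1`.

```shell
#!/bin/sh
# Sketch: od-based fallback for reading little-endian integers without xxd.
# Function names are hypothetical. od prints the bytes in file order as
# space-separated hex pairs, which `set --` splits into $1..$4.

# Usage: read_le_u32_od <file> <offset>
read_le_u32_od() {
  set -- $(dd if="$1" bs=1 skip="$2" count=4 2>/dev/null | od -A n -t x1)
  [ $# -eq 4 ] || return 1          # short read: fewer than 4 bytes available
  echo $(( 0x$4$3$2$1 ))            # reverse byte order, then hex -> decimal
}

# Usage: read_le_u16_od <file> <offset>
read_le_u16_od() {
  set -- $(dd if="$1" bs=1 skip="$2" count=2 2>/dev/null | od -A n -t x1)
  [ $# -eq 2 ] || return 1
  echo $(( 0x$2$1 ))
}

# Demo: bytes 78 56 34 12 are the LE encoding of 0x12345678
demo=$(mktemp)
printf '\170\126\064\022' > "$demo"
read_le_u32_od "$demo" 0
rm -f "$demo"
```

The explicit argument-count check turns a truncated file into an error instead of a silently wrong number.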

Shell HMAC Verification (Spec Appendix)

# Verify HMAC-SHA256 of a data block
# Usage: verify_hmac <file> <data_offset> <data_length> <expected_hmac_hex> <key_hex>
verify_hmac() {
  local file="$1" offset="$2" length="$3" expected="$4" key="$5"
  local actual=$(dd if="$file" bs=1 skip="$offset" count="$length" 2>/dev/null \
    | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${key}" -hex 2>/dev/null \
    | awk '{print $NF}')
  [ "$actual" = "$expected" ]
}

Kotlin Decrypt Pattern (Spec Appendix)

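import javax.crypto.Cipher
import javax.crypto.spec.IvParameterSpec
import javax.crypto.spec.SecretKeySpec

// Reference decrypt for a single file entry.
// Call only after the HMAC over (IV || ciphertext) has been verified.
fun decryptFileEntry(data: ByteArray, iv: ByteArray, key: ByteArray): ByteArray {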
    val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
    val secretKey = SecretKeySpec(key, "AES")
    val ivSpec = IvParameterSpec(iv)
    cipher.init(Cipher.DECRYPT_MODE, secretKey, ivSpec)
    return cipher.doFinal(data)  // PKCS7 unpadding is automatic
}

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| MAC-then-encrypt | Encrypt-then-MAC | Formalized in 2000 (Bellare & Namprempre); mainstream practice since ~2010 | Prevents padding oracle attacks; HMAC verification can reject before decryption |
| Fixed filenames (8.3) | Length-prefixed UTF-8 | Standard practice | Supports Cyrillic/Unicode filenames (SHL-03) |
| Single IV for entire archive | Per-file random IV | Standard practice | Prevents cross-file pattern analysis |
| AEAD modes (GCM) | Still CBC+HMAC for shell compat | Ongoing | GCM is preferred when all consumers support it; busybox openssl does not support GCM |

Deprecated/outdated:

  • openssl enc with -salt and password-based key derivation: Not applicable here (we use raw key with -K/-iv/-nosalt)
  • PKCS5Padding vs PKCS7Padding naming confusion: In Java/Android, PKCS5Padding actually implements PKCS7 for 16-byte blocks. The spec should note this equivalence.

Open Questions

  1. HMAC Key: Same as encryption key or derived subkey?

    • What we know: IETF AEAD spec uses split keys (first 32 bytes for MAC, last 32 bytes for encryption from a 64-byte master key). Best practice recommends separate keys. However, for a hardcoded key scenario with no key reuse across protocols, using the same 32-byte key for both AES-CBC and HMAC-SHA-256 is considered safe in practice (AES and HMAC have different internal structures, and no known attack exploits key reuse between them).
    • What's unclear: The project STATE.md explicitly flags this as an open question.
    • Recommendation: Use the SAME 32-byte hardcoded key for both AES-256-CBC encryption and HMAC-SHA-256. Rationale: (a) simplifies all three decoders, (b) the shell decoder would need an HKDF implementation if keys differ (busybox has no HKDF), (c) no known attack applies to this specific combination, (d) v2 requirement SEC-01 already plans HKDF-derived per-file keys which will supersede this. Document in the spec that v1 uses a single key and v2 will derive subkeys.
  2. Busybox xxd availability on target device

    • What we know: BusyBox source includes xxd as a configurable applet (hexdump_xxd.c). The -p (plain hex dump) flag is widely supported. The -e (little-endian) flag may NOT be available in busybox xxd.
    • What's unclear: The exact busybox build on the target Android 13 Qualcomm device.
    • Recommendation: The spec should define shell operations using xxd -p (plain hex) only, and implement LE byte reversal manually in shell (as shown in code examples above). Fallback: od -A n -t x1 can replace xxd -p. Document both options in the spec appendix.
  3. Busybox openssl dgst -sha256 -mac HMAC availability

    • What we know: Standard OpenSSL supports -mac HMAC -macopt hexkey:.... Busybox builds vary; some include a stripped-down openssl.
    • What's unclear: Whether the target device's busybox-openssl supports the -mac/-macopt flags.
    • Recommendation: The spec should document the HMAC verification command and note that if busybox openssl lacks -mac support, the shell decoder may skip HMAC verification (degrade gracefully). This is acceptable for a fallback decoder. Document the exact command and the degraded path.
  4. Decoy padding size and placement

    • What we know: Phase 6 requires random data between blocks. The spec must define this now.
    • What's unclear: How much padding, whether it's fixed or variable per gap, how the decoder knows where real data starts.
    • Recommendation: Define in the file table entry: a padding_after field (u16, LE) indicating how many random bytes follow this file's data block. Starting from data_offset, the decoder skips 16 (IV) + encrypted_size + 32 (HMAC) + padding_after bytes to reach the next file's data block. When flags bit 3 is 0 (no decoy padding), padding_after is always 0.
  5. TOC (file table) encryption IV storage

    • What we know: FMT-07 requires encrypted file table with its own IV.
    • What's unclear: Where the TOC IV is stored if the header itself is XOR-obfuscated.
    • Recommendation: Store the TOC IV in the header at a fixed offset (e.g., bytes 0x10-0x1F). XOR obfuscation (FMT-06) is applied AFTER the header is fully constructed, including the TOC IV. The decoder de-XORs the header first, then reads the TOC IV, then decrypts the TOC. Order of operations: de-XOR header -> read TOC IV -> decrypt TOC -> read file entries -> for each file: verify HMAC -> decrypt -> decompress -> verify SHA-256.
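The skip arithmetic from the decoy padding recommendation (question 4) can double as a consistency check on the file table. The sketch below walks a list of entries and confirms each data_offset equals the end of the previous block plus its padding. `check_layout` and its input format are hypothetical, and it assumes data blocks are laid out back-to-back with no gaps other than declared padding.

```shell
#!/bin/sh
# Sketch: verify that file table offsets are consistent with block sizes.
# check_layout is a hypothetical helper. It reads lines of
#   <data_offset> <encrypted_size> <padding_after>
# on stdin and assumes blocks are contiguous apart from declared padding.
# Usage: check_layout <offset_of_first_data_block>
check_layout() {
  expected=$1
  while read -r off enc pad; do
    if [ "$off" -ne "$expected" ]; then
      echo "mismatch: data_offset=$off, expected $expected" >&2
      return 1
    fi
    # IV (16) + ciphertext + HMAC (32) + decoy padding
    expected=$(( off + 16 + enc + 32 + pad ))
  done
  echo "end=$expected"
}

# Two hypothetical entries; the second block is followed by 5 padding bytes
printf '259 16 0\n323 48 5\n' | check_layout 259
```

The reported end offset should equal the archive's total size, which gives the worked example one more number that can be checked by hand.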

Based on research, the following layout balances all constraints:

+==========================+
|     ARCHIVE HEADER       |  Fixed size (e.g., 40 bytes)
|  magic(4) | ver(1) |     |
|  flags(1) | count(2) |   |
|  toc_offset(4) |         |
|  toc_size(4) |           |
|  toc_iv(16) |            |
|  reserved(8)             |
+==========================+
|     FILE TABLE (TOC)     |  Variable size, optionally encrypted
|  Entry 1: name, sizes,   |
|    offset, iv, hmac,     |
|    sha256, flags         |
|  Entry 2: ...            |
|  ...                     |
+==========================+
|     DATA BLOCK 1         |  IV(16) + ciphertext(N) + HMAC(32)
+--------------------------+
|     [DECOY PADDING 1]    |  Optional random bytes (Phase 6)
+--------------------------+
|     DATA BLOCK 2         |  IV(16) + ciphertext(N) + HMAC(32)
+--------------------------+
|     [DECOY PADDING 2]    |  Optional random bytes (Phase 6)
+--------------------------+
|     ...                  |
+==========================+

File Table Entry Fields (per file)

| Field | Size | Type | Description |
|-------|------|------|-------------|
| name_length | 2 | u16 LE | Filename length in bytes |
| name | variable | UTF-8 bytes | Filename (not null-terminated) |
| original_size | 4 | u32 LE | Original file size before compression |
| compressed_size | 4 | u32 LE | Size after gzip compression |
| encrypted_size | 4 | u32 LE | Size after AES-CBC encryption (with PKCS7) |
| data_offset | 4 | u32 LE | Absolute offset of this file's data block |
| iv | 16 | bytes | AES-CBC IV for this file |
| hmac | 32 | bytes | HMAC-SHA-256 of (IV + ciphertext) |
| sha256 | 32 | bytes | SHA-256 of original (uncompressed) file |
| compression_flag | 1 | u8 | 0 = raw (no compression), 1 = gzip |
| padding_after | 2 | u16 LE | Bytes of decoy padding after this data block |
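Because the name is the only variable-length field, the size of each entry follows from the table: 101 fixed bytes plus the filename's byte length. The sketch below computes it; `toc_entry_size` is a hypothetical helper, and the resulting TOC size only applies to an unencrypted file table (an encrypted TOC would add PKCS7 padding).

```shell
#!/bin/sh
# Sketch: size in bytes of one file table entry, from the field table above.
# toc_entry_size is a hypothetical helper. Fixed fields sum to 101 bytes:
#   name_length 2 + four u32 fields 16 + iv 16 + hmac 32 + sha256 32
#   + compression_flag 1 + padding_after 2
# Usage: toc_entry_size <filename>
toc_entry_size() {
  name_len=$(( $(printf '%s' "$1" | wc -c) ))   # bytes, not characters
  echo $(( 2 + name_len + 4 + 4 + 4 + 4 + 16 + 32 + 32 + 1 + 2 ))
}

# Entry sizes for the two files named in the worked example
a=$(toc_entry_size "hello.txt")
b=$(toc_entry_size "test.apk")
echo "entry1=$a entry2=$b toc_size=$(( a + b )) first_data_offset=$(( 40 + a + b ))"
```

For the two example filenames this yields entries of 110 and 109 bytes, a 219-byte TOC, and a first data block at offset 259, assuming the 40-byte header and a plaintext TOC.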

Critical Design Decisions for the Spec

  1. TOC placement: After header, before data blocks. This allows the decoder to read the entire TOC first, then seek to individual data blocks. Shell decoder reads TOC with a single dd call.

  2. Absolute offsets: data_offset in each file table entry is absolute from archive byte 0. This avoids cumulative offset calculation errors in the shell decoder.

  3. IV stored in BOTH TOC and data block: The IV appears in the file table entry AND as the first 16 bytes of the data block. This is redundant but allows two decode strategies: (a) read IV from TOC (fast), or (b) read IV from data block (streaming). The spec should mandate both are identical.

  4. HMAC covers IV + ciphertext: HMAC input is exactly: the 16-byte IV followed by the encrypted data (ciphertext including PKCS7 padding). The HMAC does NOT cover the HMAC field itself or any TOC metadata.

  5. Magic bytes: Must NOT match any known file signature. Consult the Wikipedia list of file signatures and the Gary Kessler file signatures table. Use 4+ bytes that do not appear in any standard signature database. Starting with a null byte (0x00) is a good practice to signal "this is binary, not text."

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • BusyBox xxd -e flag availability on Android 13 Qualcomm - Not verified for specific target device
  • BusyBox openssl dgst -mac HMAC support - Varies by build, not verified for target

Metadata

Confidence breakdown:

  • Standard stack: HIGH - AES-256-CBC, HMAC-SHA-256, PKCS7, gzip are well-established standards with extensive documentation
  • Architecture: HIGH - Archive format design patterns (header + TOC + data blocks) are well-understood; ZIP, TAR, and similar formats provide proven structural patterns
  • Pitfalls: HIGH - Known issues (HMAC ambiguity, PKCS7 off-by-one, LE parsing, filename encoding) are well-documented in cryptographic engineering literature
  • Shell compatibility: MEDIUM - Busybox applet availability varies by build configuration; the spec must accommodate fallbacks
  • Obfuscation details: MEDIUM - XOR obfuscation is well-understood but the specific parameters (key, byte range) are design choices without external standards to reference

Research date: 2026-02-24
Valid until: 2026-03-24 (stable domain, 30-day validity)