docs(01-format-specification): create phase plan

2026-02-24 23:09:12 +03:00
parent 041a00913b
commit ca443768e0
2 changed files with 306 additions and 3 deletions
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -30,10 +30,10 @@ Decimal phases appear between their surrounding integers in numeric order.
  2. Spec defines magic bytes, version field, encryption parameters (AES-256-CBC, IV storage, HMAC placement), and compression flags
  3. Spec includes a worked example: a concrete archive with 2 files showing exact byte layout
  4. Spec addresses all obfuscation features (XOR headers, encrypted TOC, decoy padding) even though implementation is Phase 6
-**Plans**: TBD
+**Plans**: 1 plan

 Plans:
- [ ] 01-01: TBD
+- [ ] 01-01-PLAN.md -- Write complete binary format specification with byte-level field definitions, worked example, and shell reference appendix

 ### Phase 2: Core Archiver
 **Goal**: A working Rust CLI that takes input files and produces a valid encrypted archive
@@ -114,7 +114,7 @@ Phases execute in numeric order: 1 -> 2 -> 3 -> 4 -> 5 -> 6

 | Phase | Plans Complete | Status | Completed |
 |-------|----------------|--------|-----------|
-| 1. Format Specification | 0/1 | Not started | - |
+| 1. Format Specification | 0/1 | Planned | - |
 | 2. Core Archiver | 0/2 | Not started | - |
 | 3. Round-Trip Verification | 0/2 | Not started | - |
 | 4. Kotlin Decoder | 0/1 | Not started | - |
--- a/.planning/phases/01-format-specification/01-01-PLAN.md
+++ b/.planning/phases/01-format-specification/01-01-PLAN.md
@@ -0,0 +1,303 @@
+---
+phase: 01-format-specification
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - docs/FORMAT.md
+autonomous: true
+requirements:
+  - FMT-05
+
+must_haves:
+  truths:
+    - "Format spec document exists at docs/FORMAT.md with complete byte-level definitions"
+    - "Every binary structure (header, file table entry, data block) has a field table with offset, size, type, endianness, and description"
+    - "Encryption parameters are fully specified: AES-256-CBC, IV storage, HMAC-SHA-256 scope (IV || ciphertext), PKCS7 padding formula"
+    - "A worked example shows a concrete 2-file archive with every byte annotated"
+    - "Obfuscation features (XOR headers, encrypted TOC, decoy padding) are fully specified with byte ranges and activation flags"
+    - "Shell decoder reference functions (LE integer reading, HMAC verification) are included in appendix"
+  artifacts:
+    - path: "docs/FORMAT.md"
+      provides: "Complete binary format specification"
+      min_lines: 300
+      contains: "Archive Header"
+  key_links:
+    - from: "docs/FORMAT.md header definition"
+      to: "docs/FORMAT.md worked example"
+      via: "offset consistency"
+      pattern: "0x[0-9A-Fa-f]+"
+    - from: "docs/FORMAT.md file table entry"
+      to: "docs/FORMAT.md worked example"
+      via: "field sizes match TOC entry byte count"
+      pattern: "encrypted_size"
+---
+
+<objective>
+Write the complete binary format specification document for the encrypted archive format.
+
+Purpose: This is THE deliverable for Phase 1. All three implementations (Rust archiver, Kotlin decoder, shell decoder) will build against this spec. Every byte offset, field size, endianness, and encoding must be unambiguous.
+
+Output: `docs/FORMAT.md` -- a single, comprehensive Markdown document with ASCII diagrams, byte-level field tables, encryption/authentication details, obfuscation feature specs, a worked example with annotated hex dump, and a shell decoder reference appendix.
+</objective>
+
+<execution_context>
+@/home/nick/.claude/get-shit-done/workflows/execute-plan.md
+@/home/nick/.claude/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/REQUIREMENTS.md
+@.planning/phases/01-format-specification/01-RESEARCH.md
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Write format specification with byte-level field definitions</name>
+  <files>docs/FORMAT.md</files>
+  <action>
+Create `docs/FORMAT.md` with the following sections, drawing heavily from the research in `01-RESEARCH.md`:
+
+**1. Overview and Design Goals**
+- State the purpose: custom binary format unrecognizable by standard tools
+- List the three target decoders (Rust, Kotlin, shell)
+- State the core constraint: shell decoder must parse with `dd`/`xxd`/`openssl`
+
+**2. Notation Conventions**
+- All multi-byte integers are little-endian (LE)
+- All sizes in bytes unless stated otherwise
+- Offsets are absolute from archive byte 0
+- Filenames are UTF-8 encoded, length-prefixed (not null-terminated)
+
+**3. Archive Structure Diagram (ASCII art)**
+- Show the three sections: Header -> File Table (TOC) -> Data Blocks
+- Include optional decoy padding between data blocks
+
+**4. Archive Header Definition (fixed 40 bytes)**
+Create a field table following the research layout:
+
+| Offset | Size | Type | Endian | Field | Description |
+|--------|------|------|--------|-------|-------------|
+| 0x00 | 4 | bytes | - | magic | Custom magic bytes (choose 4 bytes NOT in any known file signature database; use 0x00 as first byte to signal binary; e.g. 0x00 0xEA 0x72 0x63) |
+| 0x04 | 1 | u8 | - | version | Format version (1 for v1) |
+| 0x05 | 1 | u8 | - | flags | Bit 0: per-file compression, Bit 1: TOC encryption, Bit 2: XOR header obfuscation, Bit 3: decoy padding. Bits 4-7: reserved (0) |
+| 0x06 | 2 | u16 | LE | file_count | Number of files |
+| 0x08 | 4 | u32 | LE | toc_offset | Absolute offset of file table |
+| 0x0C | 4 | u32 | LE | toc_size | Size of file table in bytes |
+| 0x10 | 16 | bytes | - | toc_iv | IV for encrypted TOC (zero-filled when TOC encryption is off) |
+| 0x20 | 8 | bytes | - | reserved | Reserved, zero-filled |
+
+**5. File Table Entry Definition**
+Variable-length per entry. Create a field table:
+
+| Field | Size | Type | Description |
+|-------|------|------|-------------|
+| name_length | 2 | u16 LE | Filename length in bytes |
+| name | name_length | UTF-8 bytes | Filename |
+| original_size | 4 | u32 LE | Original file size |
+| compressed_size | 4 | u32 LE | Size after gzip (equals original_size if compression off) |
+| encrypted_size | 4 | u32 LE | Size after AES-CBC with PKCS7: ((compressed_size / 16) + 1) * 16, using integer (floor) division |
+| data_offset | 4 | u32 LE | Absolute offset of data block |
+| iv | 16 | bytes | AES-CBC IV for this file |
+| hmac | 32 | bytes | HMAC-SHA-256(key, IV || ciphertext) |
+| sha256 | 32 | bytes | SHA-256 of original file |
+| compression_flag | 1 | u8 | 0 = raw, 1 = gzip |
+| padding_after | 2 | u16 LE | Bytes of decoy padding after data block (0 when flag off) |
+
+Explicitly state total entry size formula: `2 + name_length + 4 + 4 + 4 + 4 + 16 + 32 + 32 + 1 + 2 = 101 + name_length bytes`
+
+**6. Data Block Layout**
+Per file:
+```
+[ciphertext: encrypted_size bytes]
+```
+Note: IV is stored ONLY in the file table entry (not duplicated at data block start). The data block contains ONLY the ciphertext. HMAC is stored ONLY in the file table entry.
+
+IMPORTANT DESIGN DECISION (differs from research suggestion of storing IV in both places): Store IV only in TOC to keep data blocks simple for shell `dd` extraction. The research suggested dual storage but that adds complexity with no benefit since the TOC is always read first.
+
+**7. Encryption and Authentication Details**
+- Pipeline: original -> SHA-256 checksum -> gzip compress (if flag set) -> PKCS7 pad -> AES-256-CBC encrypt -> HMAC-SHA-256
+- AES-256-CBC: 32-byte hardcoded key, 16-byte random IV per file
+- PKCS7 padding: always adds at least 1 byte (between 1 and 16 bytes). Formula, using integer (floor) division: `encrypted_size = ((compressed_size / 16) + 1) * 16`. Include examples: 0->16, 1->16, 15->16, 16->32, 17->32, 31->32, 32->48
+- HMAC-SHA-256: key = same 32-byte key as AES (per research recommendation, v2 will use HKDF). Input = IV (16 bytes) || ciphertext (encrypted_size bytes). Total HMAC input = 16 + encrypted_size bytes. Output = 32 bytes.
+- Encrypt-then-MAC: HMAC is computed AFTER encryption. Decoder MUST verify HMAC BEFORE decrypting (reject tampered data without attempting decryption).
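Reviewer note: the PKCS7 size formula above can be checked with a minimal POSIX shell sketch. The function name `encrypted_size` is a hypothetical helper chosen for illustration; the plan does not define it. It relies on shell `$(( ))` arithmetic performing integer division.

```sh
#!/bin/sh
# Hypothetical helper (not defined by the plan): size after PKCS7 padding.
# PKCS7 always adds 1..16 bytes, so the result is the next multiple of 16
# strictly greater than the input size. $(( )) division is integer division.
encrypted_size() {
    compressed_size=$1
    echo $(( (compressed_size / 16 + 1) * 16 ))
}

# Print the example sizes listed in the plan.
for n in 0 1 15 16 17 31 32; do
    printf '%d -> %d\n' "$n" "$(encrypted_size "$n")"
done
```

The loop reproduces the listed examples, including the boundary cases 16 -> 32 and 32 -> 48 where a full padding block is added.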
+
+**8. Compression Details**
+- Standard gzip container (RFC 1952) wrapping a DEFLATE stream (RFC 1951)
+- Per-file flag: compression_flag in TOC entry
+- Already-compressed files (APK, ZIP, PNG) should use flag=0 (raw)
+- Decompression: `java.util.zip.GZIPInputStream` (Kotlin), `gunzip` (shell), `flate2` (Rust)
+
+**9. Obfuscation Features (Phase 6)**
+Define fully now, activated by flags bits:
+
+**9.1 XOR Header Obfuscation (flags bit 2)**
+- XOR key: define a specific 8-byte repeating key (e.g., 0xA5 0x3C 0x96 0x0F 0xE1 0x7B 0x4D 0xC8)
+- XOR range: bytes 0x00 through 0x27 (entire 40-byte header)
+- Applied AFTER header is fully constructed
+- Decoder de-XORs FIRST, then reads header fields
+- When flag bit 2 is 0, header is stored as-is (no XOR)
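Reviewer note: a minimal sketch of the repeating-key XOR over the 40-byte header, assuming the example key from the bullet above and a POSIX `od` that supports `-N`. The names `XOR_KEY` and `xor_header_hex` are hypothetical; the function prints hex rather than writing binary so that it stays within plain shell tools.

```sh
#!/bin/sh
# Example key from the plan, written as decimal bytes:
# 0xA5 0x3C 0x96 0x0F 0xE1 0x7B 0x4D 0xC8
XOR_KEY="165 60 150 15 225 123 77 200"

# Hypothetical helper: print the first 40 bytes of file $1, XORed with the
# repeating 8-byte key, as a hex string. XOR is its own inverse, so the same
# function both obfuscates and de-obfuscates.
xor_header_hex() {
    i=0
    out=""
    # od prints each byte as an unsigned decimal; -v disables "*" line folding.
    for b in $(od -A n -t u1 -v -N 40 "$1"); do
        # Key byte at index i mod 8 (cut fields are 1-based).
        k=$(printf '%s\n' "$XOR_KEY" | cut -d ' ' -f $(( i % 8 + 1 )))
        out="$out$(printf '%02x' $(( b ^ k )))"
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}
```

Because 40 is a multiple of 8, the key repeats exactly five times across the header; an all-zero header would therefore come out as the key itself repeated five times.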
+
+**9.2 TOC Encryption (flags bit 1)**
+- Uses AES-256-CBC with toc_iv from header
+- Same 32-byte key as file encryption
+- PKCS7 padding applied to entire serialized TOC
+- toc_size in header = encrypted TOC size (including PKCS7 padding)
+- When flag bit 1 is 0, TOC is stored as plaintext
+
+**9.3 Decoy Padding (flags bit 3)**
+- Random bytes inserted after each data block
+- Size stored in file table entry `padding_after` field
+- Decoder skips `padding_after` bytes after reading ciphertext
+- When flag bit 3 is 0, `padding_after` is always 0
+
+**10. Decode Order of Operations**
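+1. Detect XOR obfuscation and, if present, de-XOR header bytes 0x00-0x27. The flags byte (offset 0x05) lies inside the XOR range, so the spec must define a detection rule that does not depend on reading flags first (e.g. compare the stored first 4 bytes against the magic; on mismatch, de-XOR the header and compare again)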
+2. Read header fields (magic, version, flags, file_count, toc_offset, toc_size, toc_iv)
+3. Verify magic bytes
+4. Read TOC bytes from toc_offset, length toc_size
+5. If TOC encryption flag: decrypt TOC with AES-256-CBC using toc_iv
+6. Parse file table entries
+7. For each file:
+   a. Read ciphertext from data_offset, length encrypted_size
+   b. Verify HMAC-SHA-256(key, iv || ciphertext) matches stored hmac
+   c. Decrypt ciphertext with AES-256-CBC using entry's iv
+   d. Remove PKCS7 padding
+   e. If compression_flag: decompress with gunzip/GZIPInputStream
+   f. Verify SHA-256 of result matches stored sha256
+   g. Write to output file using stored filename
+
+**11. Version Compatibility Rules**
+- Version 1: supports all features described in this spec
+- Decoders MUST reject archives with version > supported
+- Unknown flags bits MUST cause rejection (not silent ignore)
+- Future versions may add fields after reserved bytes in header
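Reviewer note: the two rejection rules above can be sketched as a small check, assuming a decoder that supports only version 1 and the four flag bits defined in the header table. `check_compat`, `SUPPORTED_VERSION`, and `KNOWN_FLAGS_MASK` are hypothetical names used for illustration.

```sh
#!/bin/sh
SUPPORTED_VERSION=1
KNOWN_FLAGS_MASK=15   # 0x0F: bits 0-3 are defined, bits 4-7 are reserved

# Hypothetical helper: returns 0 if the decoder may proceed, 1 otherwise.
# $1 = version byte, $2 = flags byte (both as decimal integers)
check_compat() {
    version=$1
    flags=$2
    if [ "$version" -gt "$SUPPORTED_VERSION" ]; then
        echo "reject: unsupported version $version" >&2
        return 1
    fi
    # Any bit outside the known mask is an unknown flag and must be rejected.
    if [ $(( flags & ~KNOWN_FLAGS_MASK & 255 )) -ne 0 ]; then
        echo "reject: unknown flag bits set" >&2
        return 1
    fi
    return 0
}
```

For instance, `check_compat 1 1` succeeds for the worked-example header, while `check_compat 1 16` fails because bit 4 is reserved.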
+
+Do NOT include the worked example in this task -- that is Task 2.
+  </action>
+  <verify>
+    <automated>test -f docs/FORMAT.md && grep -c "Archive Header" docs/FORMAT.md && grep -c "File Table Entry" docs/FORMAT.md && grep -c "HMAC-SHA-256" docs/FORMAT.md && grep -c "XOR" docs/FORMAT.md && grep -c "PKCS7" docs/FORMAT.md && wc -l docs/FORMAT.md | awk '{print ($1 >= 200) ? "PASS: " $1 " lines" : "FAIL: only " $1 " lines"}'</automated>
+    <manual>Review that every structure has a complete field table with offsets, sizes, types, and endianness</manual>
+  </verify>
+  <done>docs/FORMAT.md exists with sections 1-11 covering header (40 bytes, field table with offsets), file table entry (variable length, field table), data block layout, AES-256-CBC + HMAC-SHA-256 encryption pipeline, PKCS7 formula with examples, gzip compression flags, XOR obfuscation with specific key and byte range, TOC encryption details, decoy padding, decode order of operations, and version compatibility rules. No ambiguous phrases like "HMAC of the data" -- all byte ranges are explicit.</done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Write worked example with annotated hex dump and shell reference appendix</name>
+  <files>docs/FORMAT.md</files>
+  <action>
+Append to `docs/FORMAT.md` (created in Task 1) two final sections:
+
+**Section 12: Worked Example**
+
+Create a concrete 2-file archive with ALL values computed from scratch. This is the most critical part of the spec -- if the offsets are wrong by even 1 byte, the shell decoder will produce garbage.
+
+Use these inputs:
+- File 1: "hello.txt" containing exactly "Hello" (5 bytes, 0x48 0x65 0x6C 0x6C 0x6F)
+- File 2: "data.bin" containing exactly 32 bytes of 0x01 repeated (simulates a binary file)
+- Key: 32 bytes, use a simple memorable pattern for the example, e.g. 0x00 0x01 0x02 ... 0x1F
+- Flags: 0x01 (compression enabled, no obfuscation)
+- Version: 1
+
+For each file, manually walk through the pipeline:
+1. Compute SHA-256 of original content (use `printf 'Hello' | sha256sum` and similar to get REAL values; `printf` avoids the non-portable `echo -n`. Actually run these commands to get correct hashes)
+2. Gzip compress (note that gzip output is non-deterministic due to timestamps, so state: "gzip output is implementation-dependent; this example uses representative values")
+3. Compute encrypted_size using PKCS7 formula
+4. Choose example IVs (document them explicitly)
+5. Show the AES-CBC encryption result (note: "actual ciphertext depends on IV and plaintext; values shown are representative")
+6. Compute HMAC-SHA-256 over IV || ciphertext
+
+Build the complete archive byte-by-byte:
+
+**Header (bytes 0x00-0x27, 40 bytes):**
+Show each field with its hex value and explanation.
+
+**File Table (starts at offset 0x28):**
+Walk through each TOC entry field by field:
+- Entry 1: name_length=9 (0x09 0x00), name="hello.txt" (9 UTF-8 bytes), original_size=5, compressed_size=XX, encrypted_size=YY, data_offset=ZZ, iv=..., hmac=..., sha256=..., compression_flag=1, padding_after=0
+- Entry 2: name_length=8 (0x08 0x00), name="data.bin" (8 UTF-8 bytes), ...
+
+Compute TOC total size: sum of both entries.
+Compute data block offsets: header_size + toc_size = first data block offset.
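Reviewer note: the TOC size and first data block offset for the two example filenames follow directly from the entry-size formula, and can be sketched as below. This assumes the TOC is stored as plaintext (flags 0x01, so no PKCS7 padding on the TOC) and starts immediately after the 40-byte header. The helper names are hypothetical, and the `32` passed to `next_data_offset` is an illustrative encrypted_size, not a value computed from real gzip output.

```sh
#!/bin/sh
HEADER_SIZE=40

# Hypothetical helper: one file table entry is 101 fixed bytes plus the name.
toc_entry_size() { echo $(( 101 + $1 )); }   # $1 = name_length

# Hypothetical helper: offset of the next data block, following the chain
# data_offset + encrypted_size + padding_after.
next_data_offset() {   # $1 = data_offset, $2 = encrypted_size, $3 = padding_after
    echo $(( $1 + $2 + $3 ))
}

# "hello.txt" is 9 bytes, "data.bin" is 8 bytes.
toc_size=$(( $(toc_entry_size 9) + $(toc_entry_size 8) ))
first_offset=$(( HEADER_SIZE + toc_size ))

printf 'toc_size=%d (0x%X)\n' "$toc_size" "$toc_size"
printf 'first data_offset=%d (0x%X)\n' "$first_offset" "$first_offset"
# Illustrative only: encrypted_size=32 is an assumed value for file 1.
printf 'second data_offset (if file 1 is 32 bytes)=%d\n' "$(next_data_offset "$first_offset" 32 0)"
```

With these inputs the entries are 110 and 109 bytes, giving toc_size = 219 (0xDB) and a first data_offset of 259 (0x103); the worked example in the spec should agree with these two numbers.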
+
+**Data Blocks:**
+Show each data block: ciphertext bytes at the computed offset.
+
+**Annotated hex dump:**
+Show the FULL archive as a hex dump with annotations on the right:
+```
+Offset  | Hex                                              | ASCII            | Annotation
+--------|--------------------------------------------------|------------------|---------------------
+0x0000  | 00 EA 72 63 01 01 02 00  28 00 00 00 XX XX XX XX | ..rc....(...???? | header: magic, ver, flags, count, toc_offset, toc_size
+0x0010  | ...                                              |                  | header: toc_iv (16 bytes)
+...
+```
+
+**Step-by-step shell decode walkthrough:**
+Show exact dd commands to extract each field:
+```sh
+# Read magic bytes
+dd if=archive.bin bs=1 skip=0 count=4 2>/dev/null | xxd -p
+# Expected: 00ea7263
+
+# Read file count
+dd if=archive.bin bs=1 skip=6 count=2 2>/dev/null | xxd -p
+# Expected: 0200 (LE u16 = 2)
+
+# Read TOC offset
+read_le_u32 archive.bin 8
+# Expected: 40 (0x28)
+```
+Continue for reading TOC entries and extracting/decrypting file 1.
+
+**IMPORTANT:** Run actual sha256sum commands during implementation to get REAL hash values for "Hello" and the 32-byte binary content. Use placeholder values ONLY for gzip output and AES ciphertext (which are non-deterministic without running actual crypto).
+
+**Section 13: Appendix -- Shell Decoder Reference**
+
+Include the reference shell functions from the research:
+1. `read_le_u16()` -- read a little-endian u16 from file at offset
+2. `read_le_u32()` -- read a little-endian u32 from file at offset
+3. `verify_hmac()` -- verify HMAC-SHA256 of a data block
+4. `decrypt_file()` -- decrypt a single file entry with openssl
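+5. Note on busybox compatibility: if `xxd` is unavailable, use `od -A n -t x1`; if `openssl dgst -mac HMAC` is unavailable, skip HMAC verification with an explicit warning (graceful degradation). This relaxes the verify-before-decrypt MUST in section 7, so the spec must state the exception explicitly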
+
+Also include a Kotlin reference decrypt snippet (from research).
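Reviewer note: one possible shape for the two little-endian readers, as a sketch only. The versions in `01-RESEARCH.md` may differ; this variant uses `od` instead of `xxd`, and assumes `od` supports the `-j` (skip) and `-N` (count) options. Both functions print the value in decimal.

```sh
#!/bin/sh
# Read a little-endian u16 from file $1 at byte offset $2; print it in decimal.
read_le_u16() {
    # od emits the bytes as unsigned decimals; set -- loads them into $1 $2.
    set -- $(od -A n -t u1 -v -j "$2" -N 2 "$1")
    echo $(( $1 + ($2 << 8) ))
}

# Read a little-endian u32 from file $1 at byte offset $2; print it in decimal.
read_le_u32() {
    set -- $(od -A n -t u1 -v -j "$2" -N 4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}
```

Against the example header bytes `00 EA 72 63 01 01 02 00 28 00 00 00`, `read_le_u16 archive.bin 6` would print 2 (file_count) and `read_le_u32 archive.bin 8` would print 40 (toc_offset).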
+  </action>
+  <verify>
+    <automated>grep -c "Worked Example" docs/FORMAT.md && grep -c "hello.txt" docs/FORMAT.md && grep -c "data.bin" docs/FORMAT.md && grep -c "read_le_u32" docs/FORMAT.md && grep -c "dd if=" docs/FORMAT.md && grep -c "sha256" docs/FORMAT.md | awk '{print ($1 >= 1) ? "PASS" : "FAIL: missing sha256 references"}'</automated>
+    <manual>Verify that the worked example offsets are internally consistent: header_size + toc_size = first data_offset, and each subsequent data_offset = previous data_offset + previous encrypted_size + previous padding_after</manual>
+  </verify>
+  <done>docs/FORMAT.md includes a complete worked example with 2 files ("hello.txt" and "data.bin"), showing every byte of the archive with correct offsets, real SHA-256 hashes, representative ciphertext, annotated hex dump, step-by-step shell decode commands, and an appendix with shell reference functions (read_le_u16, read_le_u32, verify_hmac, decrypt_file) and Kotlin decrypt snippet. Offsets are internally consistent (manually verifiable by summing field sizes).</done>
+</task>
+
+</tasks>
+
+<verification>
+1. `docs/FORMAT.md` exists and contains 300+ lines
+2. Header definition has a complete field table with 8 fields, totaling 40 bytes
+3. File table entry has a complete field table with 11 fields
+4. HMAC input is explicitly defined as "IV (16 bytes) || ciphertext (encrypted_size bytes)"
+5. PKCS7 formula is stated with at least 5 size examples
+6. Obfuscation features (XOR, encrypted TOC, decoy padding) are fully specified with activation flags
+7. Worked example includes 2 files with byte-by-byte annotation
+8. Worked example offsets are internally consistent (sum of field sizes matches stated offsets)
+9. Shell decoder reference functions are included in appendix
+10. No ambiguous phrases: no "HMAC of the data", no "padding as needed", no "TBD" or "see Phase 6"
+</verification>
+
+<success_criteria>
+- docs/FORMAT.md is a standalone, complete spec that a developer can implement against without asking questions
+- Every binary structure has exact byte offsets and sizes
+- Encryption parameters leave no room for interpretation
+- Worked example can be used as a golden reference for Phase 3 test vectors
+- Shell decoder reference functions demonstrate the spec is implementable with busybox commands
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/01-format-specification/01-01-SUMMARY.md`
+</output>