From cc7ff6db10b12677d805af6348af7a42cd9c976c Mon Sep 17 00:00:00 2001
From: NikitolProject
Date: Tue, 24 Feb 2026 23:44:54 +0300
Subject: [PATCH] =?UTF-8?q?docs(02):=20research=20phase=20domain=20?=
 =?UTF-8?q?=E2=80=94=20Rust=20crypto=20stack,=20binary=20format,=20CLI=20p?=
 =?UTF-8?q?atterns?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../phases/02-core-archiver/02-RESEARCH.md | 468 ++++++++++++++++++
 1 file changed, 468 insertions(+)
 create mode 100644 .planning/phases/02-core-archiver/02-RESEARCH.md

diff --git a/.planning/phases/02-core-archiver/02-RESEARCH.md b/.planning/phases/02-core-archiver/02-RESEARCH.md
new file mode 100644
index 0000000..6c196e2
--- /dev/null
+++ b/.planning/phases/02-core-archiver/02-RESEARCH.md
@@ -0,0 +1,468 @@
+# Phase 2: Core Archiver - Research
+
+**Researched:** 2026-02-24
+**Domain:** Rust CLI binary with custom binary format, AES-256-CBC encryption, gzip compression, HMAC-SHA-256 authentication
+**Confidence:** HIGH
+
+## Summary
+
+Phase 2 implements the core Rust CLI archiver from scratch (greenfield -- no existing source code). The tool must produce archives matching the FORMAT.md specification (v1) exactly: 40-byte fixed header, variable-length TOC with per-file metadata, and encrypted data blocks. The pipeline for each file is: SHA-256 hash -> gzip compress (optional) -> PKCS7 pad -> AES-256-CBC encrypt -> HMAC-SHA-256 authenticate.
+
+The Rust ecosystem has mature, well-tested crates for every component: `aes` + `cbc` for encryption, `hmac` + `sha2` for authentication and hashing, `flate2` for gzip, `clap` for CLI, `rand` for IV generation. All stable versions are compatible and compile together (verified). The full crypto pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify SHA-256) was validated as a working Rust program during this research.
+
+**Primary recommendation:** Use stable RustCrypto crates (aes 0.8, cbc 0.1, hmac 0.12, sha2 0.10) rather than the 0.9/0.2/0.13/0.11 release candidates. The stable versions are battle-tested, have extensive documentation, and all compile together with Rust 1.93. Structure the project with clear module separation: `cli`, `format`, `crypto`, `compression`, `archive` (pack/unpack/inspect logic).
+
+
+## Phase Requirements
+
+| ID | Description | Research Support |
+|----|-------------|-----------------|
+| FMT-01 | Custom binary format with non-standard magic bytes (not recognized by binwalk/file/7z) | Magic bytes `0x00 0xEA 0x72 0x63` defined in FORMAT.md; leading null byte prevents `file` recognition. Binary serialization uses Rust std `to_le_bytes()`/`from_le_bytes()` -- no external crate needed. |
+| FMT-02 | Version field (1 byte) for forward compatibility | Simple u8 at offset 0x04; reject version != 1. Trivial to implement. |
+| FMT-03 | File table with metadata: name, sizes, offset, IV, HMAC, SHA-256 | Variable-length TOC entries (101 + name_length bytes each). UTF-8 filenames, length-prefixed. All field types are standard Rust primitives. |
+| FMT-04 | Little-endian for all multi-byte fields | Rust std: `u16::to_le_bytes()`, `u32::to_le_bytes()`, `u16::from_le_bytes()`, `u32::from_le_bytes()`. No external crate needed. |
+| ENC-01 | AES-256-CBC encryption per file | `aes 0.8.4` + `cbc 0.1.2` crates. Type alias: `type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>`. Verified working. |
+| ENC-02 | HMAC-SHA-256 authentication (encrypt-then-MAC) per file | `hmac 0.12.1` + `sha2 0.10.9`. HMAC input = IV (16 bytes) \|\| ciphertext. Verified working. |
+| ENC-03 | Random 16-byte IV per file, stored in cleartext TOC | `rand 0.9.2`: `rand::rng().fill(&mut iv)`. ThreadRng is cryptographically secure (ChaCha-based with OS seeding). |
+| ENC-04 | Hardcoded 32-byte key | Const array `const KEY: [u8; 32] = [...]` in source. Same key for AES and HMAC in v1. |
+| ENC-05 | PKCS7 padding for AES-CBC | `cbc` crate handles PKCS7 via `encrypt_padded_mut::<Pkcs7>()`. Formula: `encrypted_size = ((compressed_size / 16) + 1) * 16`. Verified. |
+| CMP-01 | Gzip compression per file before encryption | `flate2 1.1.9`: `GzEncoder::new(Vec::new(), Compression::default())`. Use `GzBuilder::new().mtime(0)` for reproducible output in tests. |
+| CMP-02 | Per-file compression flag (skip for already-compressed files) | CLI `--no-compress` flag + extension-based auto-detection for `.apk`, `.zip`, `.png`, `.jpg`, `.jpeg`, `.gz`, `.bz2`, `.xz`, `.mp4`, `.mp3`. |
+| INT-01 | SHA-256 checksum per file (verify after decompression) | `sha2 0.10.9`: `Sha256::digest(&original_data)`. Computed BEFORE compression. Stored in TOC entry. |
+| CLI-01 | Rust CLI utility for archive creation (Linux/macOS) | `clap 4.5.60` with derive API. Binary target in `src/main.rs`. Standard cargo build. |
+| CLI-02 | Pack multiple files (text + APK) into one archive | `pack` subcommand accepts `Vec<PathBuf>` input files + `-o` output path. Reads files into memory (per Out of Scope: no streaming). |
+| CLI-03 | Subcommands: pack, unpack, inspect | Three subcommands via clap `#[derive(Subcommand)]`. `inspect` reads header + TOC only, displays metadata without decrypting data blocks. |
+
+
+## Standard Stack
+
+### Core
+| Library | Version | Purpose | Why Standard |
+|---------|---------|---------|--------------|
+| `aes` | 0.8.4 | AES-256 block cipher | RustCrypto official. 96M+ downloads. Pure Rust with hardware acceleration (AES-NI). |
+| `cbc` | 0.1.2 | CBC mode of operation | RustCrypto official. Handles PKCS7 padding natively via `block_padding::Pkcs7`. |
+| `hmac` | 0.12.1 | HMAC-SHA-256 computation | RustCrypto official. Constant-time comparison via `verify_slice()`. |
+| `sha2` | 0.10.9 | SHA-256 hashing | RustCrypto official. Both one-shot (`Sha256::digest()`) and streaming APIs. |
+| `flate2` | 1.1.9 | Gzip compression/decompression | De facto standard. Uses miniz_oxide (pure Rust) by default. |
+| `clap` | 4.5.60 | CLI argument parsing | Industry standard. Derive API for subcommands. |
+| `rand` | 0.9.2 | Cryptographic random IV generation | `rand::rng()` returns ChaCha-based CSPRNG with OS seeding. |
+| `anyhow` | 1.0.102 | Error handling | Ergonomic `Result` with context. Standard for CLI apps. |
+
+### Supporting
+| Library | Version | Purpose | When to Use |
+|---------|---------|---------|-------------|
+| (none -- std lib) | - | Little-endian serialization | `u16::to_le_bytes()`, `u32::from_le_bytes()` etc. Built into Rust std. |
+
+### Alternatives Considered
+| Instead of | Could Use | Tradeoff |
+|------------|-----------|----------|
+| `aes` 0.8 + `cbc` 0.1 (stable) | `aes` 0.9-rc + `cbc` 0.2-rc (RC) | RC versions have newer API but are pre-release. Stable versions are battle-tested and fully compatible. Use stable. |
+| `byteorder` crate | Rust std `to_le_bytes()`/`from_le_bytes()` | std is sufficient since Rust 1.32. No external crate needed. |
+| `ring` (Google) | RustCrypto stack | `ring` does not expose AES-CBC. It focuses on AEAD modes (AES-GCM). Not suitable for this format. |
+| `openssl` crate | RustCrypto stack | Links to C library. RustCrypto is pure Rust, no system dependencies. Simpler cross-compilation. |
+| `serde` + `bincode` | Manual binary serialization | Format spec requires exact byte layout. Manual serialization gives precise control over every byte. Serde/bincode add unnecessary abstraction for a fixed binary format. |
+
+**Installation:**
+```bash
+cargo init --name encrypted_archive
+cargo add aes@0.8 cbc@0.1 hmac@0.12 sha2@0.10 flate2@1.1 clap@4.5 --features clap/derive rand@0.9 anyhow@1.0
+```
+
+## Architecture Patterns
+
+### Recommended Project Structure
+```
+encrypted_archive/
+├── Cargo.toml
+├── src/
+│   ├── main.rs           # Entry point: clap CLI parsing, dispatch to commands
+│   ├── cli.rs            # Clap derive structs (Cli, Commands enum)
+│   ├── format.rs         # Binary format constants, header/TOC structs, serialization/deserialization
+│   ├── crypto.rs         # encrypt_file(), decrypt_file(), compute_hmac(), verify_hmac()
+│   ├── compression.rs    # compress(), decompress(), should_compress()
+│   ├── archive.rs        # pack(), unpack(), inspect() -- orchestration logic
+│   └── key.rs            # Hardcoded 32-byte key constant
+├── docs/
+│   └── FORMAT.md         # Binary format specification (already exists)
+└── tests/                # Integration tests (Phase 3)
+```
+
+### Pattern 1: Pipeline Processing per File
+**What:** Each file goes through a sequential pipeline: hash -> compress -> pad+encrypt -> HMAC
+**When to use:** Always during `pack` operation
+**Example:**
+```rust
+// Source: Verified working pipeline from research validation
+use aes::cipher::{block_padding::Pkcs7, BlockEncryptMut, KeyIvInit};
+use hmac::{Hmac, Mac};
+use sha2::{Sha256, Digest};
+use flate2::{write::GzEncoder, Compression};
+use rand::Rng; // brings `fill()` into scope for `rand::rng()`
+use std::io::Write;
+
+type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>;
+type HmacSha256 = Hmac<Sha256>;
+
+struct ProcessedFile {
+    name: String,
+    original_size: u32,
+    compressed_size: u32,
+    encrypted_size: u32,
+    iv: [u8; 16],
+    hmac: [u8; 32],
+    sha256: [u8; 32],
+    compression_flag: u8,
+    ciphertext: Vec<u8>,
+}
+
+fn process_file(name: &str, data: &[u8], key: &[u8; 32], compress: bool) -> ProcessedFile {
+    // Step 1: SHA-256 of original
+    let sha256: [u8; 32] = Sha256::digest(data).into();
+
+    // Step 2: Compress (optional)
+    let compressed = if compress {
+        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
+        encoder.write_all(data).unwrap();
+        encoder.finish().unwrap()
+    } else {
+        data.to_vec()
+    };
+
+    // Step 3: Generate random IV
+    let mut iv = [0u8; 16];
+    rand::rng().fill(&mut iv);
+
+    // Step 4: Encrypt with PKCS7 padding
+    let encrypted_size = ((compressed.len() / 16) + 1) * 16;
+    let mut buf = vec![0u8; encrypted_size];
+    buf[..compressed.len()].copy_from_slice(&compressed);
+    let ciphertext = Aes256CbcEnc::new(key.into(), &iv.into())
+        .encrypt_padded_mut::<Pkcs7>(&mut buf, compressed.len())
+        .unwrap()
+        .to_vec();
+
+    // Step 5: HMAC-SHA-256 over IV || ciphertext
+    let mut mac = HmacSha256::new_from_slice(key).unwrap();
+    mac.update(&iv);
+    mac.update(&ciphertext);
+    let hmac: [u8; 32] = mac.finalize().into_bytes().into();
+
+    ProcessedFile {
+        name: name.to_string(),
+        original_size: data.len() as u32,
+        compressed_size: compressed.len() as u32,
+        encrypted_size: encrypted_size as u32,
+        iv,
+        hmac,
+        sha256,
+        compression_flag: if compress { 1 } else { 0 },
+        ciphertext,
+    }
+}
+```
+
+### Pattern 2: Two-Pass Archive Writing
+**What:** First pass processes all files to compute sizes and offsets; second pass writes the archive sequentially.
+**When to use:** Always during `pack`. The TOC must contain `data_offset` for each file, but data blocks come after the TOC. You must know TOC size before writing data blocks.
+**Example:**
+```rust
+fn compute_offsets(files: &[ProcessedFile]) -> Vec<u32> {
+    let header_size: u32 = 40;
+
+    // Compute TOC size
+    let toc_size: u32 = files.iter()
+        .map(|f| 101 + f.name.len() as u32)
+        .sum();
+
+    let toc_offset = header_size;
+    let mut data_offset = toc_offset + toc_size;
+
+    // Collect one data offset per file, in archive order
+    files.iter().map(|file| {
+        let offset = data_offset;
+        data_offset += file.encrypted_size; // padding_after = 0 in Phase 2 (no decoy padding)
+        offset
+    }).collect()
+}
+```
+
+### Pattern 3: CLI Subcommand Dispatch
+**What:** Use clap derive API with an enum of subcommands
+**When to use:** Always for the CLI entry point
+**Example:**
+```rust
+// Source: Verified working clap derive pattern from research validation
+use clap::{Parser, Subcommand};
+use std::path::PathBuf;
+
+#[derive(Parser)]
+#[command(name = "encrypted_archive")]
+#[command(about = "Custom encrypted archive tool")]
+struct Cli {
+    #[command(subcommand)]
+    command: Commands,
+}
+
+#[derive(Subcommand)]
+enum Commands {
+    /// Pack files into an encrypted archive
+    Pack {
+        /// Input files to archive
+        #[arg(required = true)]
+        files: Vec<PathBuf>,
+        /// Output archive file
+        #[arg(short, long)]
+        output: PathBuf,
+        /// Disable compression for specified files
+        #[arg(long)]
+        no_compress: Vec<String>,
+    },
+    /// Unpack an encrypted archive (for testing)
+    Unpack {
+        /// Archive file to unpack
+        archive: PathBuf,
+        /// Output directory
+        #[arg(short, long, default_value = ".")]
+        output_dir: PathBuf,
+    },
+    /// Inspect archive metadata without decrypting
+    Inspect {
+        /// Archive file to inspect
+        archive: PathBuf,
+    },
+}
+```
+
+### Anti-Patterns to Avoid
+- **Streaming writes without knowing offsets:** The TOC contains `data_offset` for each file. You MUST compute all offsets before writing the TOC. Process all files first, then serialize.
+- **Using serde/bincode for binary format:** The format spec requires exact byte-level control. Manual serialization with `to_le_bytes()` is correct and simpler.
+- **Single large buffer for entire archive:** Process and encrypt files individually, write them sequentially. Each file should be processed independently.
+- **Reusing IVs:** Each file MUST have a unique random IV. Never reuse IVs across files or archive creations.
+- **MAC-then-encrypt:** The spec mandates encrypt-then-MAC. HMAC MUST be computed over `IV || ciphertext`, NOT over plaintext.
+
+## Don't Hand-Roll
+
+| Problem | Don't Build | Use Instead | Why |
+|---------|-------------|-------------|-----|
+| AES-256-CBC encryption | Custom AES implementation | `aes 0.8` + `cbc 0.1` crates | Side-channel resistance, hardware acceleration, audited |
+| PKCS7 padding | Manual padding logic | `cbc` crate's `Pkcs7` padding (via `block_padding`) | Off-by-one errors in padding are security-critical |
+| HMAC-SHA-256 | Manual HMAC construction | `hmac 0.12` crate | Constant-time comparison, correct key scheduling |
+| SHA-256 hashing | Custom hash | `sha2 0.10` crate | Correctness, performance, hardware acceleration |
+| Gzip compression | Custom deflate | `flate2 1.1` crate | RFC 1952 compliance, performance, battle-tested |
+| CLI argument parsing | Manual arg parsing | `clap 4.5` with derive | Validation, help text, error messages, subcommands |
+| Random IV generation | Custom RNG | `rand 0.9` with `rand::rng()` | CSPRNG with OS seeding, no bias |
+| Little-endian serialization | Manual byte shifting | Rust std `to_le_bytes()`/`from_le_bytes()` | Built-in, zero-cost, correct |
+
+**Key insight:** Every component in the encryption pipeline is security-sensitive. Using audited, well-tested crates for crypto operations is not optional -- hand-rolled crypto is the single highest-risk anti-pattern in this domain.
+
+## Common Pitfalls
+
+### Pitfall 1: Buffer Sizing for `encrypt_padded_mut`
+**What goes wrong:** `PadError` at runtime because the buffer is too small for PKCS7-padded output.
+**Why it happens:** PKCS7 ALWAYS adds at least 1 byte. When input is a multiple of 16, a full 16-byte padding block is added. Formula: `((input_len / 16) + 1) * 16`.
+**How to avoid:** Always allocate `encrypted_size = ((compressed_size / 16) + 1) * 16` bytes for the encryption buffer. Copy compressed data to the start, then call `encrypt_padded_mut` with `compressed_size` as the plaintext length.
+**Warning signs:** `PadError` or `unwrap()` panic during encryption.
+
+### Pitfall 2: Gzip Non-Determinism in Tests
+**What goes wrong:** Gzip output varies between runs (different `compressed_size`), making golden tests impossible.
+**Why it happens:** Gzip headers contain a timestamp (`mtime`) and OS byte that vary.
+**How to avoid:** Use `GzBuilder::new().mtime(0).write(Vec::new(), Compression::default())` to zero out the timestamp. The OS byte defaults to the build platform but is consistent on the same machine.
+**Warning signs:** `compressed_size` changes between test runs for identical input.
+
+### Pitfall 3: Incorrect HMAC Scope
+**What goes wrong:** HMAC computed over wrong data (just ciphertext, or including TOC metadata).
+**Why it happens:** Ambiguity about what "encrypt-then-MAC" covers.
+**How to avoid:** FORMAT.md is explicit: `HMAC_input = IV (16 bytes) || ciphertext (encrypted_size bytes)`. Nothing else. The IV from the TOC entry, concatenated with the ciphertext from the data block.
+**Warning signs:** HMAC verification failures in other decoders (Kotlin, shell).
+
+### Pitfall 4: TOC Offset Calculation Errors
+**What goes wrong:** Data blocks written at wrong offsets; decoders read garbage.
+**Why it happens:** Variable-length filename fields make TOC entry sizes differ. Off-by-one in offset arithmetic.
+**How to avoid:** Use the formula from FORMAT.md: `entry_size = 101 + name_length`. Total TOC size = sum of all entry sizes. First data block offset = `toc_offset + toc_size`. Each subsequent data block offset = previous offset + previous `encrypted_size`.
+**Warning signs:** `inspect` command shows corrupted filenames or impossible sizes.
+
+### Pitfall 5: Endianness Errors
+**What goes wrong:** Multi-byte fields written in big-endian or native-endian instead of little-endian.
+**Why it happens:** Forgetting to convert, or using wrong conversion function.
+**How to avoid:** Always use `value.to_le_bytes()` when writing and `u32::from_le_bytes([b0, b1, b2, b3])` when reading. Never use `to_ne_bytes()` or `to_be_bytes()`.
+**Warning signs:** Values look "swapped" when inspecting hex dump. Shell decoder reads wrong numbers.
+
+### Pitfall 6: UTF-8 Filename Length vs. Character Count
+**What goes wrong:** `name_length` field stores character count instead of byte count.
+**Why it happens:** Confusion between `str.len()` (byte count, correct) and `str.chars().count()` (character count, wrong).
+**How to avoid:** FORMAT.md specifies `name_length` as "Filename length in bytes (UTF-8 encoded byte count)". In Rust, `String::len()` returns byte count, which is correct.
+**Warning signs:** Non-ASCII filenames (Cyrillic) cause parsing errors in decoders.
+
+### Pitfall 7: Forgetting Flags Byte
+**What goes wrong:** Archive header has wrong flags, decoders misinterpret format features.
+**Why it happens:** Phase 2 uses only bit 0 (compression). Bits 1-7 must be zero.
+**How to avoid:** Set `flags = 0x01` when any file uses compression (global flag), `flags = 0x00` when no files use compression. Bits 1-3 are for Phase 6 obfuscation features. Bits 4-7 MUST be zero.
+**Warning signs:** Decoders reject archive due to unknown flags.
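
The size, offset, and endianness arithmetic behind Pitfalls 1, 4, 5, and 6 can be sanity-checked with a few std-only helpers. This is a minimal sketch under stated assumptions, not part of the planned modules: the names `encrypted_size`, `toc_entry_size`, `first_data_offset`, and `read_u32_le` are hypothetical, and the constants 40 and 101 are the header size and fixed TOC-entry size quoted above from FORMAT.md. It also assumes the TOC directly follows the header, as in Phase 2.

```rust
/// PKCS7-padded ciphertext size for AES-CBC: padding always adds 1..=16 bytes.
/// (Assumes the input is far below u32::MAX, so the arithmetic cannot overflow.)
fn encrypted_size(compressed_size: u32) -> u32 {
    (compressed_size / 16 + 1) * 16
}

/// One TOC entry: 101 fixed bytes plus the UTF-8 *byte* length of the name.
/// `str::len()` already counts bytes, not characters.
fn toc_entry_size(name: &str) -> u32 {
    101 + name.len() as u32
}

/// Offset of the first data block when the TOC starts right after the 40-byte header.
fn first_data_offset(names: &[&str]) -> u32 {
    40 + names.iter().map(|n| toc_entry_size(n)).sum::<u32>()
}

/// Read a little-endian u32 at `offset`; panics if the slice is too short.
fn read_u32_le(buf: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes([buf[offset], buf[offset + 1], buf[offset + 2], buf[offset + 3]])
}
```

For instance, a 16-byte input needs a 32-byte buffer (a full padding block is appended), and a Cyrillic name such as `файл.txt` occupies 12 bytes even though it has 8 characters, so its TOC entry is 113 bytes long.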
+
+## Code Examples
+
+Verified patterns from official sources and research validation:
+
+### Binary Format Serialization (Header)
+```rust
+// Source: FORMAT.md Section 4 + Rust std library
+fn write_header(
+    writer: &mut impl std::io::Write,
+    file_count: u16,
+    toc_offset: u32,
+    toc_size: u32,
+    flags: u8,
+) -> std::io::Result<()> {
+    // Magic bytes
+    writer.write_all(&[0x00, 0xEA, 0x72, 0x63])?;
+    // Version
+    writer.write_all(&[0x01])?;
+    // Flags
+    writer.write_all(&[flags])?;
+    // File count (LE)
+    writer.write_all(&file_count.to_le_bytes())?;
+    // TOC offset (LE)
+    writer.write_all(&toc_offset.to_le_bytes())?;
+    // TOC size (LE)
+    writer.write_all(&toc_size.to_le_bytes())?;
+    // TOC IV (zero-filled, TOC not encrypted in Phase 2)
+    writer.write_all(&[0u8; 16])?;
+    // Reserved
+    writer.write_all(&[0u8; 8])?;
+    Ok(())
+}
+```
+
+### TOC Entry Serialization
+```rust
+// Source: FORMAT.md Section 5
+fn write_toc_entry(
+    writer: &mut impl std::io::Write,
+    file: &ProcessedFile,
+    data_offset: u32,
+) -> std::io::Result<()> {
+    let name_bytes = file.name.as_bytes();
+    writer.write_all(&(name_bytes.len() as u16).to_le_bytes())?;
+    writer.write_all(name_bytes)?;
+    writer.write_all(&file.original_size.to_le_bytes())?;
+    writer.write_all(&file.compressed_size.to_le_bytes())?;
+    writer.write_all(&file.encrypted_size.to_le_bytes())?;
+    writer.write_all(&data_offset.to_le_bytes())?;
+    writer.write_all(&file.iv)?;
+    writer.write_all(&file.hmac)?;
+    writer.write_all(&file.sha256)?;
+    writer.write_all(&[file.compression_flag])?;
+    writer.write_all(&0u16.to_le_bytes())?; // padding_after = 0
+    Ok(())
+}
+```
+
+### Inspect Command (Read Header + TOC Only)
+```rust
+// Source: FORMAT.md Section 10, steps 1-4
+use std::io::{Read, Seek, SeekFrom};
+
+fn read_header(reader: &mut impl Read) -> anyhow::Result<Header> {
+    let mut buf = [0u8; 40];
+    reader.read_exact(&mut buf)?;
+
+    // Verify magic
+    anyhow::ensure!(
+        buf[0..4] == [0x00, 0xEA, 0x72, 0x63],
+        "Invalid magic bytes"
+    );
+
+    let version = buf[4];
+    anyhow::ensure!(version == 1, "Unsupported version: {}", version);
+
+    let flags = buf[5];
+    anyhow::ensure!(flags & 0xF0 == 0, "Unknown flags set: 0x{:02X}", flags);
+
+    let file_count = u16::from_le_bytes([buf[6], buf[7]]);
+    let toc_offset = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
+    let toc_size = u32::from_le_bytes([buf[12], buf[13], buf[14], buf[15]]);
+
+    Ok(Header { version, flags, file_count, toc_offset, toc_size })
+}
+```
+
+### Compression Decision Heuristic
+```rust
+// Source: FORMAT.md Section 8 recommendation
+fn should_compress(filename: &str, no_compress_list: &[String]) -> bool {
+    // Explicit exclusion from CLI
+    if no_compress_list.iter().any(|nc| filename.ends_with(nc) || filename == nc) {
+        return false;
+    }
+    // Auto-detect already-compressed formats
+    let ext = filename.rsplit('.').next().unwrap_or("").to_lowercase();
+    !matches!(
+        ext.as_str(),
+        "apk" | "zip" | "gz" | "bz2" | "xz" | "zst"
+        | "png" | "jpg" | "jpeg" | "gif" | "webp"
+        | "mp4" | "mp3" | "aac" | "ogg" | "flac"
+        | "7z" | "rar" | "jar"
+    )
+}
+```
+
+## State of the Art
+
+| Old Approach | Current Approach | When Changed | Impact |
+|--------------|------------------|--------------|--------|
+| `block-modes 0.8` crate | `cbc 0.1` crate (separate crate per mode) | 2022 | `block-modes` is deprecated. Use `cbc` directly. |
+| `rand::thread_rng()` | `rand::rng()` | rand 0.9 (2025) | Function renamed. Same underlying ChaCha CSPRNG. |
+| `GenericArray` for keys/IVs | `.into()` conversion from `[u8; N]` | aes/cbc 0.8/0.1 | Can pass `&key.into()` directly from fixed arrays. |
+| `byteorder` crate | Rust std `to_le_bytes()`/`from_le_bytes()` | Rust 1.32 (2018) | No external crate needed for endian conversion. |
+
+**Deprecated/outdated:**
+- `block-modes` crate: Replaced by individual mode crates (`cbc`, `ecb`, `cfb`, `ofb`). Do NOT use `block-modes`.
+- `rand::thread_rng()`: Renamed to `rand::rng()` in 0.9. The old name is deprecated; use `rand::rng()` in new code.
+- `crypto-mac` crate: Merged into `digest` 0.10. Use `hmac 0.12` which uses `digest 0.10` internally.
+
+## Open Questions
+
+1. **Hardcoded key value**
+   - What we know: The key is 32 bytes, hardcoded, shared across all decoders.
+   - What's unclear: The specific key bytes are not defined in FORMAT.md (only the worked example uses `00 01 02 ... 1F`).
+   - Recommendation: Define a non-trivial key constant in `src/key.rs`. The planner should decide the actual key bytes or generate them randomly once. The worked example key is fine for testing but should be replaced for production.
+
+2. **Error handling strategy for `unpack`**
+   - What we know: FORMAT.md says "MUST reject" on HMAC failure, "MUST fail" on bad version.
+   - What's unclear: Should `unpack` abort on first file error, or continue extracting other files?
+   - Recommendation: Abort on header/TOC errors. For per-file errors (HMAC mismatch, SHA-256 mismatch), report the error but continue extracting remaining files (with a non-zero exit code at the end).
+
+3. **Maximum file size constraint (u32)**
+   - What we know: `original_size`, `compressed_size`, `encrypted_size` are all u32 (max ~4 GB).
+   - What's unclear: Should the archiver check and reject files > 4 GB?
+   - Recommendation: Yes, validate file sizes during `pack` and produce a clear error if any file exceeds `u32::MAX`. This is acceptable given the Out of Scope note ("files fit in memory").
+
+## Sources
+
+### Primary (HIGH confidence)
+- `docs/FORMAT.md` v1.0 -- The normative specification for the binary format. All byte offsets, field sizes, and pipeline steps are from this document.
+- `docs.rs/aes/0.8.4` -- AES crate API documentation
+- `docs.rs/cbc/0.1.2` -- CBC mode crate API documentation and usage examples
+- `docs.rs/hmac/0.12.1` -- HMAC crate API documentation and usage examples
+- `docs.rs/sha2/0.10.9` -- SHA-2 crate API documentation
+- `docs.rs/flate2/1.1.9` -- flate2 crate API documentation (GzEncoder, GzDecoder, GzBuilder)
+- `docs.rs/clap/4.5.60` -- Clap CLI crate documentation
+- `docs.rs/rand/0.9.2` -- Rand crate documentation
+- **Research validation:** Full pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify) was compiled and executed successfully as a Rust program during this research.
+
+### Secondary (MEDIUM confidence)
+- `crates.io` version listings -- Latest stable versions verified via `cargo search` and crates.io API
+- `rust-random.github.io/book` -- Rand book confirming ThreadRng is ChaCha-based CSPRNG
+
+### Tertiary (LOW confidence)
+- None. All findings are verified against official documentation and compilation tests.
+
+## Metadata
+
+**Confidence breakdown:**
+- Standard stack: HIGH -- All crates verified via `cargo check`, full pipeline compiled and executed
+- Architecture: HIGH -- Follows standard Rust CLI patterns; FORMAT.md provides exact byte-level specification
+- Pitfalls: HIGH -- Common issues identified from official docs, GitHub issues, and practical validation
+
+**Research date:** 2026-02-24
+**Valid until:** 2026-04-24 (stable crates, slow-moving ecosystem)