Phase 2: Core Archiver - Research
Researched: 2026-02-24
Domain: Rust CLI binary with custom binary format, AES-256-CBC encryption, gzip compression, HMAC-SHA-256 authentication
Confidence: HIGH
Summary
Phase 2 implements the core Rust CLI archiver from scratch (greenfield -- no existing source code). The tool must produce archives matching the FORMAT.md specification (v1) exactly: 40-byte fixed header, variable-length TOC with per-file metadata, and encrypted data blocks. The pipeline for each file is: SHA-256 hash -> gzip compress (optional) -> PKCS7 pad -> AES-256-CBC encrypt -> HMAC-SHA-256 authenticate.
The Rust ecosystem has mature, well-tested crates for every component: aes + cbc for encryption, hmac + sha2 for authentication and hashing, flate2 for gzip, clap for CLI, rand for IV generation. All stable versions are compatible and compile together (verified). The full crypto pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify SHA-256) was validated as a working Rust program during this research.
Primary recommendation: Use stable RustCrypto crates (aes 0.8, cbc 0.1, hmac 0.12, sha2 0.10) rather than the 0.9/0.2/0.13/0.11 release candidates. The stable versions are battle-tested, have extensive documentation, and all compile together with Rust 1.93. Structure the project with clear module separation: cli, format, crypto, compression, archive (pack/unpack/inspect logic).
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| FMT-01 | Custom binary format with non-standard magic bytes (not recognized by binwalk/file/7z) | Magic bytes 0x00 0xEA 0x72 0x63 defined in FORMAT.md; leading null byte prevents file recognition. Binary serialization uses Rust std to_le_bytes()/from_le_bytes() -- no external crate needed. |
| FMT-02 | Version field (1 byte) for forward compatibility | Simple u8 at offset 0x04; reject version != 1. Trivial to implement. |
| FMT-03 | File table with metadata: name, sizes, offset, IV, HMAC, SHA-256 | Variable-length TOC entries (101 + name_length bytes each). UTF-8 filenames, length-prefixed. All field types are standard Rust primitives. |
| FMT-04 | Little-endian for all multi-byte fields | Rust std: u16::to_le_bytes(), u32::to_le_bytes(), u16::from_le_bytes(), u32::from_le_bytes(). No external crate needed. |
| ENC-01 | AES-256-CBC encryption per file | aes 0.8.4 + cbc 0.1.2 crates. Type alias: type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>. Verified working. |
| ENC-02 | HMAC-SHA-256 authentication (encrypt-then-MAC) per file | hmac 0.12.1 + sha2 0.10.9. HMAC input = IV (16 bytes) || ciphertext. Verified working. |
| ENC-03 | Random 16-byte IV per file, stored in cleartext TOC | rand 0.9.2: rand::rng().fill(&mut iv). ThreadRng is cryptographically secure (ChaCha-based with OS seeding). |
| ENC-04 | Hardcoded 32-byte key | Const array const KEY: [u8; 32] = [...] in source. Same key for AES and HMAC in v1. |
| ENC-05 | PKCS7 padding for AES-CBC | cbc crate handles PKCS7 via encrypt_padded_mut::<Pkcs7>(). Formula: encrypted_size = ((compressed_size / 16) + 1) * 16. Verified. |
| CMP-01 | Gzip compression per file before encryption | flate2 1.1.9: GzEncoder::new(Vec::new(), Compression::default()). Use GzBuilder::new().mtime(0) for reproducible output in tests. |
| CMP-02 | Per-file compression flag (skip for already-compressed files) | CLI --no-compress flag + extension-based auto-detection for .apk, .zip, .png, .jpg, .jpeg, .gz, .bz2, .xz, .mp4, .mp3. |
| INT-01 | SHA-256 checksum per file (verify after decompression) | sha2 0.10.9: Sha256::digest(&original_data). Computed BEFORE compression. Stored in TOC entry. |
| CLI-01 | Rust CLI utility for archive creation (Linux/macOS) | clap 4.5.60 with derive API. Binary target in src/main.rs. Standard cargo build. |
| CLI-02 | Pack multiple files (text + APK) into one archive | pack subcommand accepts Vec<PathBuf> input files + -o output path. Reads files into memory (per Out of Scope: no streaming). |
| CLI-03 | Subcommands: pack, unpack, inspect | Three subcommands via clap #[derive(Subcommand)]. inspect reads header + TOC only, displays metadata without decrypting data blocks. |
</phase_requirements>
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| aes | 0.8.4 | AES-256 block cipher | RustCrypto official. 96M+ downloads. Pure Rust with hardware acceleration (AES-NI). |
| cbc | 0.1.2 | CBC mode of operation | RustCrypto official. Handles PKCS7 padding natively via block_padding::Pkcs7. |
| hmac | 0.12.1 | HMAC-SHA-256 computation | RustCrypto official. Constant-time comparison via verify_slice(). |
| sha2 | 0.10.9 | SHA-256 hashing | RustCrypto official. Both one-shot (Sha256::digest()) and streaming APIs. |
| flate2 | 1.1.9 | Gzip compression/decompression | De facto standard. Uses miniz_oxide (pure Rust) by default. |
| clap | 4.5.60 | CLI argument parsing | Industry standard. Derive API for subcommands. |
| rand | 0.9.2 | Cryptographic random IV generation | rand::rng() returns ChaCha-based CSPRNG with OS seeding. |
| anyhow | 1.0.102 | Error handling | Ergonomic Result<T> with context. Standard for CLI apps. |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| (none -- std lib) | - | Little-endian serialization | u16::to_le_bytes(), u32::from_le_bytes() etc. Built into Rust std. |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| aes 0.8 + cbc 0.1 (stable) | aes 0.9-rc + cbc 0.2-rc (RC) | RC versions have newer API but are pre-release. Stable versions are battle-tested and fully compatible. Use stable. |
| byteorder crate | Rust std to_le_bytes()/from_le_bytes() | std is sufficient since Rust 1.32. No external crate needed. |
| ring (Google) | RustCrypto stack | ring does not expose AES-CBC. It focuses on AEAD modes (AES-GCM). Not suitable for this format. |
| openssl crate | RustCrypto stack | Links to C library. RustCrypto is pure Rust, no system dependencies. Simpler cross-compilation. |
| serde + bincode | Manual binary serialization | Format spec requires exact byte layout. Manual serialization gives precise control over every byte. Serde/bincode add unnecessary abstraction for a fixed binary format. |
Installation:
cargo init --name encrypted_archive
cargo add aes@0.8 cbc@0.1 hmac@0.12 sha2@0.10 flate2@1.1 clap@4.5 --features clap/derive rand@0.9 anyhow@1.0
Architecture Patterns
Recommended Project Structure
encrypted_archive/
├── Cargo.toml
├── src/
│ ├── main.rs # Entry point: clap CLI parsing, dispatch to commands
│ ├── cli.rs # Clap derive structs (Cli, Commands enum)
│ ├── format.rs # Binary format constants, header/TOC structs, serialization/deserialization
│ ├── crypto.rs # encrypt_file(), decrypt_file(), compute_hmac(), verify_hmac()
│ ├── compression.rs # compress(), decompress(), should_compress()
│ ├── archive.rs # pack(), unpack(), inspect() -- orchestration logic
│ └── key.rs # Hardcoded 32-byte key constant
├── docs/
│ └── FORMAT.md # Binary format specification (already exists)
└── tests/ # Integration tests (Phase 3)
Pattern 1: Pipeline Processing per File
What: Each file goes through a sequential pipeline: hash -> compress -> pad+encrypt -> HMAC
When to use: Always during pack operation
Example:
// Source: Verified working pipeline from research validation
use aes::cipher::{block_padding::Pkcs7, BlockEncryptMut, KeyIvInit};
use hmac::{Hmac, Mac};
use sha2::{Sha256, Digest};
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write;
// Rng must be in scope: fill() is a method of the rand::Rng trait
use rand::Rng;
type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>;
type HmacSha256 = Hmac<Sha256>;
struct ProcessedFile {
name: String,
original_size: u32,
compressed_size: u32,
encrypted_size: u32,
iv: [u8; 16],
hmac: [u8; 32],
sha256: [u8; 32],
compression_flag: u8,
ciphertext: Vec<u8>,
}
fn process_file(name: &str, data: &[u8], key: &[u8; 32], compress: bool) -> ProcessedFile {
// Step 1: SHA-256 of original
let sha256: [u8; 32] = Sha256::digest(data).into();
// Step 2: Compress (optional)
let compressed = if compress {
let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
encoder.write_all(data).unwrap();
encoder.finish().unwrap()
} else {
data.to_vec()
};
// Step 3: Generate random IV
let mut iv = [0u8; 16];
rand::rng().fill(&mut iv);
// Step 4: Encrypt with PKCS7 padding
let encrypted_size = ((compressed.len() / 16) + 1) * 16;
let mut buf = vec![0u8; encrypted_size];
buf[..compressed.len()].copy_from_slice(&compressed);
let ciphertext = Aes256CbcEnc::new(key.into(), &iv.into())
.encrypt_padded_mut::<Pkcs7>(&mut buf, compressed.len())
.unwrap()
.to_vec();
// Step 5: HMAC-SHA-256 over IV || ciphertext
let mut mac = HmacSha256::new_from_slice(key).unwrap();
mac.update(&iv);
mac.update(&ciphertext);
let hmac: [u8; 32] = mac.finalize().into_bytes().into();
ProcessedFile {
name: name.to_string(),
original_size: data.len() as u32,
compressed_size: compressed.len() as u32,
encrypted_size: encrypted_size as u32,
iv,
hmac,
sha256,
compression_flag: if compress { 1 } else { 0 },
ciphertext,
}
}
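process_file above converts lengths with `as u32`, which silently truncates for inputs of 4 GiB or more. A minimal sketch of a checked conversion follows. The helper name `checked_size` and the error wording are assumptions for illustration, not part of FORMAT.md.

```rust
/// Hypothetical helper: convert a byte length to the u32 stored in the TOC,
/// returning an error instead of truncating when the value does not fit.
fn checked_size(len: u64, what: &str) -> Result<u32, String> {
    u32::try_from(len)
        .map_err(|_| format!("{what} is {len} bytes, which exceeds u32::MAX"))
}
```

The same check would apply to encrypted_size, since PKCS7 padding can push a value just under the limit over it.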
Pattern 2: Two-Pass Archive Writing
What: First pass processes all files to compute sizes and offsets; second pass writes the archive sequentially.
When to use: Always during pack. The TOC must contain data_offset for each file, but data blocks come after the TOC. You must know TOC size before writing data blocks.
Example:
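// Returns (toc_size, data_offsets); data_offsets[i] is the absolute offset of files[i]'s data block.
// Offsets are returned rather than stored because ProcessedFile has no data_offset field.
fn compute_offsets(files: &[ProcessedFile]) -> (u32, Vec<u32>) {
    let header_size: u32 = 40;
    // Compute TOC size: each entry is 101 fixed bytes plus the filename bytes
    let toc_size: u32 = files.iter()
        .map(|f| 101 + f.name.len() as u32)
        .sum();
    let toc_offset = header_size;
    let mut data_offset = toc_offset + toc_size;
    // Assign data offsets in archive order
    let mut data_offsets = Vec::with_capacity(files.len());
    for file in files {
        data_offsets.push(data_offset);
        data_offset += file.encrypted_size;
        // padding_after = 0 in Phase 2 (no decoy padding)
    }
    (toc_size, data_offsets)
}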
Pattern 3: CLI Subcommand Dispatch
What: Use clap derive API with an enum of subcommands
When to use: Always for the CLI entry point
Example:
// Source: Verified working clap derive pattern from research validation
use clap::{Parser, Subcommand};
use std::path::PathBuf;
#[derive(Parser)]
#[command(name = "encrypted_archive")]
#[command(about = "Custom encrypted archive tool")]
struct Cli {
#[command(subcommand)]
command: Commands,
}
#[derive(Subcommand)]
enum Commands {
/// Pack files into an encrypted archive
Pack {
/// Input files to archive
#[arg(required = true)]
files: Vec<PathBuf>,
/// Output archive file
#[arg(short, long)]
output: PathBuf,
/// Disable compression for specified files
#[arg(long)]
no_compress: Vec<String>,
},
/// Unpack an encrypted archive (for testing)
Unpack {
/// Archive file to unpack
archive: PathBuf,
/// Output directory
#[arg(short, long, default_value = ".")]
output_dir: PathBuf,
},
/// Inspect archive metadata without decrypting
Inspect {
/// Archive file to inspect
archive: PathBuf,
},
}
Anti-Patterns to Avoid
- Streaming writes without knowing offsets: The TOC contains data_offset for each file. You MUST compute all offsets before writing the TOC. Process all files first, then serialize.
- Using serde/bincode for binary format: The format spec requires exact byte-level control. Manual serialization with to_le_bytes() is correct and simpler.
- Single large buffer for entire archive: Process and encrypt files individually, write them sequentially. Each file should be processed independently.
- Reusing IVs: Each file MUST have a unique random IV. Never reuse IVs across files or archive creations.
- MAC-then-encrypt: The spec mandates encrypt-then-MAC. HMAC MUST be computed over IV || ciphertext, NOT over plaintext.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| AES-256-CBC encryption | Custom AES implementation | aes 0.8 + cbc 0.1 crates | Side-channel resistance, hardware acceleration, audited |
| PKCS7 padding | Manual padding logic | cbc crate's Pkcs7 padding (via block_padding) | Off-by-one errors in padding are security-critical |
| HMAC-SHA-256 | Manual HMAC construction | hmac 0.12 crate | Constant-time comparison, correct key scheduling |
| SHA-256 hashing | Custom hash | sha2 0.10 crate | Correctness, performance, hardware acceleration |
| Gzip compression | Custom deflate | flate2 1.1 crate | RFC 1952 compliance, performance, battle-tested |
| CLI argument parsing | Manual arg parsing | clap 4.5 with derive | Validation, help text, error messages, subcommands |
| Random IV generation | Custom RNG | rand 0.9 with rand::rng() | CSPRNG with OS seeding, no bias |
| Little-endian serialization | Manual byte shifting | Rust std to_le_bytes()/from_le_bytes() | Built-in, zero-cost, correct |
Key insight: Every component in the encryption pipeline is security-sensitive. Using audited, well-tested crates for crypto operations is not optional -- hand-rolled crypto is the single highest-risk anti-pattern in this domain.
Common Pitfalls
Pitfall 1: Buffer Sizing for encrypt_padded_mut
What goes wrong: PadError at runtime because the buffer is too small for PKCS7-padded output.
Why it happens: PKCS7 ALWAYS adds at least 1 byte. When input is a multiple of 16, a full 16-byte padding block is added. Formula: ((input_len / 16) + 1) * 16.
How to avoid: Always allocate encrypted_size = ((compressed_size / 16) + 1) * 16 bytes for the encryption buffer. Copy compressed data to the start, then call encrypt_padded_mut with compressed_size as the plaintext length.
Warning signs: PadError or unwrap() panic during encryption.
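The sizing rule can be checked in isolation before touching any crypto code. A small sketch, assuming the helper names `pkcs7_padded_len` and `pkcs7_pad_len` (both hypothetical); it only restates the formula above.

```rust
const AES_BLOCK: usize = 16;

/// Ciphertext size for AES-CBC with PKCS7 padding: always the next
/// multiple of 16 strictly greater than the plaintext length.
fn pkcs7_padded_len(plaintext_len: usize) -> usize {
    (plaintext_len / AES_BLOCK + 1) * AES_BLOCK
}

/// Number of padding bytes PKCS7 appends (always in 1..=16).
fn pkcs7_pad_len(plaintext_len: usize) -> usize {
    AES_BLOCK - plaintext_len % AES_BLOCK
}
```

For example, a 15-byte input pads to 16 bytes (1 padding byte), while a 16-byte input pads to 32 bytes (a full padding block). The second case is the one that triggers PadError when the buffer is sized by rounding up instead.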
Pitfall 2: Gzip Non-Determinism in Tests
What goes wrong: Gzip output varies between runs (different compressed_size), making golden tests impossible.
Why it happens: Gzip headers contain a timestamp (mtime) and OS byte that vary.
How to avoid: Use GzBuilder::new().mtime(0).write(Vec::new(), Compression::default()) to zero out the timestamp. The OS byte defaults to the build platform but is consistent on the same machine.
Warning signs: compressed_size changes between test runs for identical input.
Pitfall 3: Incorrect HMAC Scope
What goes wrong: HMAC computed over wrong data (just ciphertext, or including TOC metadata).
Why it happens: Ambiguity about what "encrypt-then-MAC" covers.
How to avoid: FORMAT.md is explicit: HMAC_input = IV (16 bytes) || ciphertext (encrypted_size bytes). Nothing else. The IV from the TOC entry, concatenated with the ciphertext from the data block.
Warning signs: HMAC verification failures in other decoders (Kotlin, shell).
Pitfall 4: TOC Offset Calculation Errors
What goes wrong: Data blocks written at wrong offsets; decoders read garbage.
Why it happens: Variable-length filename fields make TOC entry sizes differ. Off-by-one in offset arithmetic.
How to avoid: Use the formula from FORMAT.md: entry_size = 101 + name_length. Total TOC size = sum of all entry sizes. First data block offset = toc_offset + toc_size. Each subsequent data block offset = previous offset + previous encrypted_size.
Warning signs: inspect command shows corrupted filenames or impossible sizes.
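A worked layout makes the arithmetic easy to verify by hand. The sketch below assumes the Phase 2 layout described here (TOC directly after the 40-byte header, padding_after = 0); the function name `layout` and the tuple input are illustrative only.

```rust
const HEADER_SIZE: u32 = 40;
const TOC_ENTRY_FIXED: u32 = 101;

/// Returns (toc_size, data_offsets) for entries given as
/// (filename length in bytes, encrypted_size), in archive order.
fn layout(entries: &[(u32, u32)]) -> (u32, Vec<u32>) {
    let toc_size: u32 = entries
        .iter()
        .map(|(name_len, _)| TOC_ENTRY_FIXED + name_len)
        .sum();
    // The first data block starts right after the TOC.
    let mut next = HEADER_SIZE + toc_size;
    let mut offsets = Vec::with_capacity(entries.len());
    for (_, encrypted_size) in entries {
        offsets.push(next);
        next += encrypted_size;
    }
    (toc_size, offsets)
}
```

With two files, a 9-byte name with 32 encrypted bytes and a 7-byte name with 48 encrypted bytes, the TOC is 110 + 108 = 218 bytes, so the data blocks start at 258 and 290.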
Pitfall 5: Endianness Errors
What goes wrong: Multi-byte fields written in big-endian or native-endian instead of little-endian.
Why it happens: Forgetting to convert, or using wrong conversion function.
How to avoid: Always use value.to_le_bytes() when writing and u32::from_le_bytes([b0, b1, b2, b3]) when reading. Never use to_ne_bytes() or to_be_bytes().
Warning signs: Values look "swapped" when inspecting hex dump. Shell decoder reads wrong numbers.
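One way to keep conversions consistent is to route every multi-byte field through a pair of small helpers. This is a sketch using only the standard library; the names `write_u32_le` and `read_u32_le` are assumptions, not functions from FORMAT.md.

```rust
/// Append a u32 in little-endian byte order.
fn write_u32_le(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

/// Read a little-endian u32 at `offset`, or None if the buffer is too short.
fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let bytes: [u8; 4] = buf.get(offset..end)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}
```

Writing 0x12345678 produces the bytes 78 56 34 12, which is the order a hex dump of a correct archive should show.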
Pitfall 6: UTF-8 Filename Length vs. Character Count
What goes wrong: name_length field stores character count instead of byte count.
Why it happens: Confusion between str.len() (byte count, correct) and str.chars().count() (character count, wrong).
How to avoid: FORMAT.md specifies name_length as "Filename length in bytes (UTF-8 encoded byte count)". In Rust, String::len() returns byte count, which is correct.
Warning signs: Non-ASCII filenames (Cyrillic) cause parsing errors in decoders.
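The difference only shows up with non-ASCII names, so it is worth pinning down with a Cyrillic example. A minimal sketch; the helper name `name_length_field` is hypothetical, and rejecting names that do not fit in the u16 field is an assumption about desired behavior.

```rust
/// Value for the TOC name_length field: the UTF-8 byte count of the name.
/// Returns None if the name does not fit in the u16 field.
fn name_length_field(name: &str) -> Option<u16> {
    u16::try_from(name.len()).ok()
}
```

For "привет.txt" the field is 16 (six 2-byte Cyrillic letters plus four ASCII bytes), while chars().count() would give 10.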
Pitfall 7: Forgetting Flags Byte
What goes wrong: Archive header has wrong flags, decoders misinterpret format features.
Why it happens: Phase 2 uses only bit 0 (compression). Bits 1-7 must be zero.
How to avoid: Set flags = 0x01 when any file uses compression (global flag), flags = 0x00 when no files use compression. Bits 1-3 are for Phase 6 obfuscation features. Bits 4-7 MUST be zero.
Warning signs: Decoders reject archive due to unknown flags.
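The flag rule can be expressed as two small functions, one for the writer and one for a self-check. This is a sketch under the Phase 2 assumption that only bit 0 may be set; the constant and function names are illustrative.

```rust
/// Bit 0: at least one file in the archive is gzip-compressed.
const FLAG_COMPRESSION: u8 = 0x01;
/// Bits a Phase 2 writer is allowed to set.
const FLAGS_PHASE2_ALLOWED: u8 = FLAG_COMPRESSION;

/// Header flags for a Phase 2 archive, given each file's compression_flag.
fn header_flags(compression_flags: &[u8]) -> u8 {
    if compression_flags.iter().any(|&c| c != 0) {
        FLAG_COMPRESSION
    } else {
        0x00
    }
}

/// True if only bits defined for Phase 2 are set.
fn is_valid_phase2_flags(flags: u8) -> bool {
    flags & !FLAGS_PHASE2_ALLOWED == 0
}
```

Note that this writer-side check is stricter than the reader check in read_header, which only rejects bits 4-7 so that later phases can use bits 1-3.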
Code Examples
Verified patterns from official sources and research validation:
Binary Format Serialization (Header)
// Source: FORMAT.md Section 4 + Rust std library
fn write_header(
writer: &mut impl std::io::Write,
file_count: u16,
toc_offset: u32,
toc_size: u32,
flags: u8,
) -> std::io::Result<()> {
// Magic bytes
writer.write_all(&[0x00, 0xEA, 0x72, 0x63])?;
// Version
writer.write_all(&[0x01])?;
// Flags
writer.write_all(&[flags])?;
// File count (LE)
writer.write_all(&file_count.to_le_bytes())?;
// TOC offset (LE)
writer.write_all(&toc_offset.to_le_bytes())?;
// TOC size (LE)
writer.write_all(&toc_size.to_le_bytes())?;
// TOC IV (zero-filled, TOC not encrypted in Phase 2)
writer.write_all(&[0u8; 16])?;
// Reserved
writer.write_all(&[0u8; 8])?;
Ok(())
}
TOC Entry Serialization
// Source: FORMAT.md Section 5
fn write_toc_entry(
writer: &mut impl std::io::Write,
file: &ProcessedFile,
data_offset: u32,
) -> std::io::Result<()> {
let name_bytes = file.name.as_bytes();
writer.write_all(&(name_bytes.len() as u16).to_le_bytes())?;
writer.write_all(name_bytes)?;
writer.write_all(&file.original_size.to_le_bytes())?;
writer.write_all(&file.compressed_size.to_le_bytes())?;
writer.write_all(&file.encrypted_size.to_le_bytes())?;
writer.write_all(&data_offset.to_le_bytes())?;
writer.write_all(&file.iv)?;
writer.write_all(&file.hmac)?;
writer.write_all(&file.sha256)?;
writer.write_all(&[file.compression_flag])?;
writer.write_all(&0u16.to_le_bytes())?; // padding_after = 0
Ok(())
}
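The two writers above imply the fixed sizes used elsewhere in this document (40-byte header, 101 fixed bytes per TOC entry). A quick derivation from the field widths, in write order, can guard against a field being added or dropped by mistake. The array names are assumptions for illustration.

```rust
/// Header field widths in write order: magic, version, flags, file_count,
/// toc_offset, toc_size, toc_iv, reserved.
const HEADER_FIELDS: [usize; 8] = [4, 1, 1, 2, 4, 4, 16, 8];

/// Fixed-width TOC entry fields in write order: name_length, original_size,
/// compressed_size, encrypted_size, data_offset, iv, hmac, sha256,
/// compression_flag, padding_after. The filename itself is variable-length.
const TOC_FIXED_FIELDS: [usize; 10] = [2, 4, 4, 4, 4, 16, 32, 32, 1, 2];

/// Sum of field widths in bytes.
fn total(fields: &[usize]) -> usize {
    fields.iter().sum()
}
```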
Inspect Command (Read Header + TOC Only)
// Source: FORMAT.md Section 10, steps 1-4
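use std::io::Read;
// Parsed header fields returned by read_header (field set assumed from its return value;
// the TOC IV and reserved bytes are not needed by inspect in Phase 2)
struct Header {
version: u8,
flags: u8,
file_count: u16,
toc_offset: u32,
toc_size: u32,
}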
fn read_header(reader: &mut impl Read) -> anyhow::Result<Header> {
let mut buf = [0u8; 40];
reader.read_exact(&mut buf)?;
// Verify magic
anyhow::ensure!(
buf[0..4] == [0x00, 0xEA, 0x72, 0x63],
"Invalid magic bytes"
);
let version = buf[4];
anyhow::ensure!(version == 1, "Unsupported version: {}", version);
let flags = buf[5];
anyhow::ensure!(flags & 0xF0 == 0, "Unknown flags set: 0x{:02X}", flags);
let file_count = u16::from_le_bytes([buf[6], buf[7]]);
let toc_offset = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
let toc_size = u32::from_le_bytes([buf[12], buf[13], buf[14], buf[15]]);
Ok(Header { version, flags, file_count, toc_offset, toc_size })
}
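inspect also has to read the TOC entries that follow the header. The sketch below parses one entry from a byte slice, mirroring the field order of write_toc_entry. The `TocEntry` struct and the `take*` helpers are hypothetical names, and returning None on any malformed input is a simplification; real code would likely report which field failed.

```rust
/// Hypothetical in-memory form of one TOC entry.
#[derive(Debug, PartialEq)]
struct TocEntry {
    name: String,
    original_size: u32,
    compressed_size: u32,
    encrypted_size: u32,
    data_offset: u32,
    iv: [u8; 16],
    hmac: [u8; 32],
    sha256: [u8; 32],
    compression_flag: u8,
    padding_after: u16,
}

/// Split `n` bytes off the front of `buf`, or None if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    if buf.len() < n {
        return None;
    }
    let (head, tail) = buf.split_at(n);
    *buf = tail;
    Some(head)
}

fn take_u16(buf: &mut &[u8]) -> Option<u16> {
    Some(u16::from_le_bytes(take(buf, 2)?.try_into().ok()?))
}

fn take_u32(buf: &mut &[u8]) -> Option<u32> {
    Some(u32::from_le_bytes(take(buf, 4)?.try_into().ok()?))
}

/// Parse one entry and advance `buf` past it. Fields are read in the
/// same order write_toc_entry emits them.
fn parse_toc_entry(buf: &mut &[u8]) -> Option<TocEntry> {
    let name_len = take_u16(buf)? as usize;
    let name = String::from_utf8(take(buf, name_len)?.to_vec()).ok()?;
    Some(TocEntry {
        name,
        original_size: take_u32(buf)?,
        compressed_size: take_u32(buf)?,
        encrypted_size: take_u32(buf)?,
        data_offset: take_u32(buf)?,
        iv: take(buf, 16)?.try_into().ok()?,
        hmac: take(buf, 32)?.try_into().ok()?,
        sha256: take(buf, 32)?.try_into().ok()?,
        compression_flag: take(buf, 1)?[0],
        padding_after: take_u16(buf)?,
    })
}
```

Calling parse_toc_entry file_count times on the toc_size bytes read at toc_offset should consume the slice exactly; leftover bytes would indicate a size mismatch.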
Compression Decision Heuristic
// Source: FORMAT.md Section 8 recommendation
fn should_compress(filename: &str, no_compress_list: &[String]) -> bool {
// Explicit exclusion from CLI
if no_compress_list.iter().any(|nc| filename.ends_with(nc) || filename == nc) {
return false;
}
// Auto-detect already-compressed formats
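// Use the real extension so that a dotless name such as "zip" is not mistaken for one
let ext = std::path::Path::new(filename).extension().and_then(|e| e.to_str()).unwrap_or("").to_lowercase();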
!matches!(
ext.as_str(),
"apk" | "zip" | "gz" | "bz2" | "xz" | "zst"
| "png" | "jpg" | "jpeg" | "gif" | "webp"
| "mp4" | "mp3" | "aac" | "ogg" | "flac"
| "7z" | "rar" | "jar"
)
}
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| block-modes 0.8 crate | cbc 0.1 crate (separate crate per mode) | 2022 | block-modes is deprecated. Use cbc directly. |
| rand::thread_rng() | rand::rng() | rand 0.9 (2025) | Function renamed. Same underlying ChaCha CSPRNG. |
| GenericArray for keys/IVs | .into() conversion from [u8; N] | aes/cbc 0.8/0.1 | Can pass &key.into() directly from fixed arrays. |
| byteorder crate | Rust std to_le_bytes()/from_le_bytes() | Rust 1.32 (2018) | No external crate needed for endian conversion. |
Deprecated/outdated:
- block-modes crate: Replaced by individual mode crates (cbc, ecb, cfb, ofb). Do NOT use block-modes.
- rand::thread_rng(): Renamed to rand::rng() in 0.9. The old name remains only as a deprecated alias; do not use it in new code.
- crypto-mac crate: Merged into digest 0.10. Use hmac 0.12 which uses digest 0.10 internally.
Open Questions
- Hardcoded key value
  - What we know: The key is 32 bytes, hardcoded, shared across all decoders.
  - What's unclear: The specific key bytes are not defined in FORMAT.md (only the worked example uses 00 01 02 ... 1F).
  - Recommendation: Define a non-trivial key constant in src/key.rs. The planner should decide the actual key bytes or generate them randomly once. The worked example key is fine for testing but should be replaced for production.
- Error handling strategy for unpack
  - What we know: FORMAT.md says "MUST reject" on HMAC failure, "MUST fail" on bad version.
  - What's unclear: Should unpack abort on first file error, or continue extracting other files?
  - Recommendation: Abort on header/TOC errors. For per-file errors (HMAC mismatch, SHA-256 mismatch), report the error but continue extracting remaining files (with a non-zero exit code at the end).
- Maximum file size constraint (u32)
  - What we know: original_size, compressed_size, encrypted_size are all u32 (max ~4 GB).
  - What's unclear: Should the archiver check and reject files > 4 GB?
  - Recommendation: Yes, validate file sizes during pack and produce a clear error if any file exceeds u32::MAX. This is acceptable given the Out of Scope note ("files fit in memory").
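The recommended unpack behavior (continue past per-file errors, then exit non-zero) could be sketched as follows. This is one possible shape, not a decision: the `FileError` enum, the `summarize` function, and the exit code value 1 are all assumptions for illustration.

```rust
/// Hypothetical per-file failure kinds; the names are illustrative only.
#[derive(Debug, PartialEq)]
enum FileError {
    HmacMismatch,
    Sha256Mismatch,
}

/// Collect per-file outcomes without stopping at the first failure.
/// Returns the process exit code (0 = all files ok, 1 = at least one failed)
/// together with the failed file names and their errors.
fn summarize(
    results: Vec<(String, Result<(), FileError>)>,
) -> (i32, Vec<(String, FileError)>) {
    let failures: Vec<(String, FileError)> = results
        .into_iter()
        .filter_map(|(name, res)| res.err().map(|e| (name, e)))
        .collect();
    let code = if failures.is_empty() { 0 } else { 1 };
    (code, failures)
}
```

Header and TOC errors would still abort before this point, since no per-file result can be trusted when the TOC itself is invalid.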
Sources
Primary (HIGH confidence)
- docs/FORMAT.md v1.0 -- The normative specification for the binary format. All byte offsets, field sizes, and pipeline steps are from this document.
- docs.rs/aes/0.8.4 -- AES crate API documentation
- docs.rs/cbc/0.1.2 -- CBC mode crate API documentation and usage examples
- docs.rs/hmac/0.12.1 -- HMAC crate API documentation and usage examples
- docs.rs/sha2/0.10.9 -- SHA-2 crate API documentation
- docs.rs/flate2/1.1.9 -- flate2 crate API documentation (GzEncoder, GzDecoder, GzBuilder)
- docs.rs/clap/4.5.60 -- Clap CLI crate documentation
- docs.rs/rand/0.9.2 -- Rand crate documentation
- Research validation: Full pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify) was compiled and executed successfully as a Rust program during this research.
Secondary (MEDIUM confidence)
- crates.io version listings -- Latest stable versions verified via cargo search and crates.io API
- rust-random.github.io/book -- Rand book confirming ThreadRng is ChaCha-based CSPRNG
Tertiary (LOW confidence)
- None. All findings are verified against official documentation and compilation tests.
Metadata
Confidence breakdown:
- Standard stack: HIGH -- All crates verified via cargo check, full pipeline compiled and executed
- Architecture: HIGH -- Follows standard Rust CLI patterns; FORMAT.md provides exact byte-level specification
- Pitfalls: HIGH -- Common issues identified from official docs, GitHub issues, and practical validation
Research date: 2026-02-24
Valid until: 2026-04-24 (stable crates, slow-moving ecosystem)