android-encrypted-archiver/.planning/phases/02-core-archiver/02-RESEARCH.md

Phase 2: Core Archiver - Research

Researched: 2026-02-24

Domain: Rust CLI binary with custom binary format, AES-256-CBC encryption, gzip compression, HMAC-SHA-256 authentication

Confidence: HIGH

Summary

Phase 2 implements the core Rust CLI archiver from scratch (greenfield -- no existing source code). The tool must produce archives matching the FORMAT.md specification (v1) exactly: 40-byte fixed header, variable-length TOC with per-file metadata, and encrypted data blocks. The pipeline for each file is: SHA-256 hash -> gzip compress (optional) -> PKCS7 pad -> AES-256-CBC encrypt -> HMAC-SHA-256 authenticate.

The Rust ecosystem has mature, well-tested crates for every component: aes + cbc for encryption, hmac + sha2 for authentication and hashing, flate2 for gzip, clap for CLI, rand for IV generation. All stable versions are compatible and compile together (verified). The full crypto pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify SHA-256) was validated as a working Rust program during this research.

Primary recommendation: Use stable RustCrypto crates (aes 0.8, cbc 0.1, hmac 0.12, sha2 0.10) rather than the 0.9/0.2/0.13/0.11 release candidates. The stable versions are battle-tested, have extensive documentation, and all compile together with Rust 1.93. Structure the project with clear module separation: cli, format, crypto, compression, archive (pack/unpack/inspect logic).

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| FMT-01 | Custom binary format with non-standard magic bytes (not recognized by binwalk/file/7z) | Magic bytes 0x00 0xEA 0x72 0x63 defined in FORMAT.md; leading null byte prevents file recognition. Binary serialization uses Rust std to_le_bytes()/from_le_bytes() -- no external crate needed. |
| FMT-02 | Version field (1 byte) for forward compatibility | Simple u8 at offset 0x04; reject version != 1. Trivial to implement. |
| FMT-03 | File table with metadata: name, sizes, offset, IV, HMAC, SHA-256 | Variable-length TOC entries (101 + name_length bytes each). UTF-8 filenames, length-prefixed. All field types are standard Rust primitives. |
| FMT-04 | Little-endian for all multi-byte fields | Rust std: u16::to_le_bytes(), u32::to_le_bytes(), u16::from_le_bytes(), u32::from_le_bytes(). No external crate needed. |
| ENC-01 | AES-256-CBC encryption per file | aes 0.8.4 + cbc 0.1.2 crates. Type alias: type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>. Verified working. |
| ENC-02 | HMAC-SHA-256 authentication (encrypt-then-MAC) per file | hmac 0.12.1 + sha2 0.10.9. HMAC input = IV (16 bytes) \|\| ciphertext. Verified working. |
| ENC-03 | Random 16-byte IV per file, stored in cleartext TOC | rand 0.9.2: rand::rng().fill(&mut iv). ThreadRng is cryptographically secure (ChaCha-based with OS seeding). |
| ENC-04 | Hardcoded 32-byte key | Const array const KEY: [u8; 32] = [...] in source. Same key for AES and HMAC in v1. |
| ENC-05 | PKCS7 padding for AES-CBC | cbc crate handles PKCS7 via encrypt_padded_mut::<Pkcs7>(). Formula: encrypted_size = ((compressed_size / 16) + 1) * 16. Verified. |
| CMP-01 | Gzip compression per file before encryption | flate2 1.1.9: GzEncoder::new(Vec::new(), Compression::default()). Use GzBuilder::new().mtime(0) for reproducible output in tests. |
| CMP-02 | Per-file compression flag (skip for already-compressed files) | CLI --no-compress flag + extension-based auto-detection for .apk, .zip, .png, .jpg, .jpeg, .gz, .bz2, .xz, .mp4, .mp3. |
| INT-01 | SHA-256 checksum per file (verify after decompression) | sha2 0.10.9: Sha256::digest(&original_data). Computed BEFORE compression. Stored in TOC entry. |
| CLI-01 | Rust CLI utility for archive creation (Linux/macOS) | clap 4.5.60 with derive API. Binary target in src/main.rs. Standard cargo build. |
| CLI-02 | Pack multiple files (text + APK) into one archive | pack subcommand accepts Vec<PathBuf> input files + -o output path. Reads files into memory (per Out of Scope: no streaming). |
| CLI-03 | Subcommands: pack, unpack, inspect | Three subcommands via clap #[derive(Subcommand)]. inspect reads header + TOC only, displays metadata without decrypting data blocks. |
</phase_requirements>

Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| aes | 0.8.4 | AES-256 block cipher | RustCrypto official. 96M+ downloads. Pure Rust with hardware acceleration (AES-NI). |
| cbc | 0.1.2 | CBC mode of operation | RustCrypto official. Handles PKCS7 padding natively via block_padding::Pkcs7. |
| hmac | 0.12.1 | HMAC-SHA-256 computation | RustCrypto official. Constant-time comparison via verify_slice(). |
| sha2 | 0.10.9 | SHA-256 hashing | RustCrypto official. Both one-shot (Sha256::digest()) and streaming APIs. |
| flate2 | 1.1.9 | Gzip compression/decompression | De facto standard. Uses miniz_oxide (pure Rust) by default. |
| clap | 4.5.60 | CLI argument parsing | Industry standard. Derive API for subcommands. |
| rand | 0.9.2 | Cryptographic random IV generation | rand::rng() returns ChaCha-based CSPRNG with OS seeding. |
| anyhow | 1.0.102 | Error handling | Ergonomic Result<T> with context. Standard for CLI apps. |

Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| (none -- std lib) | - | Little-endian serialization | u16::to_le_bytes(), u32::from_le_bytes() etc. Built into Rust std. |

Alternatives Considered

| Instead of (chosen) | Could Use (alternative) | Tradeoff |
|---------------------|-------------------------|----------|
| aes 0.8 + cbc 0.1 (stable) | aes 0.9-rc + cbc 0.2-rc (RC) | RC versions have newer API but are pre-release. Stable versions are battle-tested and fully compatible. Use stable. |
| Rust std to_le_bytes()/from_le_bytes() | byteorder crate | std is sufficient since Rust 1.32. No external crate needed. |
| RustCrypto stack | ring (Google) | ring does not expose AES-CBC. It focuses on AEAD modes (AES-GCM). Not suitable for this format. |
| RustCrypto stack | openssl crate | Links to C library. RustCrypto is pure Rust, no system dependencies. Simpler cross-compilation. |
| Manual binary serialization | serde + bincode | Format spec requires exact byte layout. Manual serialization gives precise control over every byte. Serde/bincode add unnecessary abstraction for a fixed binary format. |

Installation:

cargo init --name encrypted_archive
cargo add aes@0.8 cbc@0.1 hmac@0.12 sha2@0.10 flate2@1.1 clap@4.5 --features clap/derive rand@0.9 anyhow@1.0

Architecture Patterns

encrypted_archive/
├── Cargo.toml
├── src/
│   ├── main.rs              # Entry point: clap CLI parsing, dispatch to commands
│   ├── cli.rs               # Clap derive structs (Cli, Commands enum)
│   ├── format.rs            # Binary format constants, header/TOC structs, serialization/deserialization
│   ├── crypto.rs            # encrypt_file(), decrypt_file(), compute_hmac(), verify_hmac()
│   ├── compression.rs       # compress(), decompress(), should_compress()
│   ├── archive.rs           # pack(), unpack(), inspect() -- orchestration logic
│   └── key.rs               # Hardcoded 32-byte key constant
├── docs/
│   └── FORMAT.md            # Binary format specification (already exists)
└── tests/                   # Integration tests (Phase 3)

Pattern 1: Pipeline Processing per File

What: Each file goes through a sequential pipeline: hash -> compress -> pad+encrypt -> HMAC

When to use: Always during pack operation

Example:

// Source: Verified working pipeline from research validation
use aes::cipher::{block_padding::Pkcs7, BlockEncryptMut, KeyIvInit};
use hmac::{Hmac, Mac};
use sha2::{Sha256, Digest};
use flate2::write::GzEncoder;
use flate2::Compression;
use rand::Rng; // trait that provides fill() on the value returned by rand::rng()
use std::io::Write;

type Aes256CbcEnc = cbc::Encryptor<aes::Aes256>;
type HmacSha256 = Hmac<Sha256>;

struct ProcessedFile {
    name: String,
    original_size: u32,
    compressed_size: u32,
    encrypted_size: u32,
    iv: [u8; 16],
    hmac: [u8; 32],
    sha256: [u8; 32],
    compression_flag: u8,
    ciphertext: Vec<u8>,
}

fn process_file(name: &str, data: &[u8], key: &[u8; 32], compress: bool) -> ProcessedFile {
    // Step 1: SHA-256 of original
    let sha256: [u8; 32] = Sha256::digest(data).into();

    // Step 2: Compress (optional)
    let compressed = if compress {
        let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
        encoder.write_all(data).unwrap();
        encoder.finish().unwrap()
    } else {
        data.to_vec()
    };

    // Step 3: Generate random IV
    let mut iv = [0u8; 16];
    rand::rng().fill(&mut iv);

    // Step 4: Encrypt with PKCS7 padding
    let encrypted_size = ((compressed.len() / 16) + 1) * 16;
    let mut buf = vec![0u8; encrypted_size];
    buf[..compressed.len()].copy_from_slice(&compressed);
    let ciphertext = Aes256CbcEnc::new(key.into(), &iv.into())
        .encrypt_padded_mut::<Pkcs7>(&mut buf, compressed.len())
        .unwrap()
        .to_vec();

    // Step 5: HMAC-SHA-256 over IV || ciphertext
    let mut mac = HmacSha256::new_from_slice(key).unwrap();
    mac.update(&iv);
    mac.update(&ciphertext);
    let hmac: [u8; 32] = mac.finalize().into_bytes().into();

    ProcessedFile {
        name: name.to_string(),
        original_size: data.len() as u32,
        compressed_size: compressed.len() as u32,
        encrypted_size: encrypted_size as u32,
        iv,
        hmac,
        sha256,
        compression_flag: if compress { 1 } else { 0 },
        ciphertext,
    }
}
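
The `as u32` casts in process_file truncate silently when a length exceeds u32::MAX. A minimal sketch of a checked conversion follows, using only the standard library. The helper names to_toc_size and checked_encrypted_size are hypothetical and are not defined by FORMAT.md.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: narrow a byte count to the u32 stored in a TOC
/// entry, returning an error message instead of truncating.
fn to_toc_size(field: &str, len: u64) -> Result<u32, String> {
    u32::try_from(len)
        .map_err(|_| format!("{} = {} bytes does not fit in a u32 TOC field", field, len))
}

/// PKCS7-padded size for a given plaintext length. The arithmetic is done
/// in u64 so the padded size cannot wrap before the range check.
fn checked_encrypted_size(compressed_len: u64) -> Result<u32, String> {
    to_toc_size("encrypted_size", (compressed_len / 16 + 1) * 16)
}
```

Note that a payload just under the u32 limit can still fail after padding, so the padded size needs its own check rather than relying on the check of the unpadded length.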

Pattern 2: Two-Pass Archive Writing

What: First pass processes all files to compute sizes and offsets; second pass writes the archive sequentially.

When to use: Always during pack. The TOC must contain data_offset for each file, but data blocks come after the TOC. You must know TOC size before writing data blocks.

Example:

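// Returns the TOC size and the absolute data offset of each file, in input order.
fn compute_offsets(files: &[ProcessedFile]) -> (u32, Vec<u32>) {
    let header_size: u32 = 40;

    // Compute TOC size: each entry is 101 fixed bytes plus the UTF-8 filename
    let toc_size: u32 = files.iter()
        .map(|f| 101 + f.name.len() as u32)
        .sum();

    let toc_offset = header_size;
    let mut data_offset = toc_offset + toc_size;

    // Assign data offsets. ProcessedFile has no offset field, so the offsets
    // are returned separately and passed one by one to write_toc_entry().
    let mut offsets = Vec::with_capacity(files.len());
    for file in files {
        offsets.push(data_offset);
        data_offset += file.encrypted_size;
        // padding_after = 0 in Phase 2 (no decoy padding)
    }
    (toc_size, offsets)
}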

Pattern 3: CLI Subcommand Dispatch

What: Use clap derive API with an enum of subcommands

When to use: Always for the CLI entry point

Example:

// Source: Verified working clap derive pattern from research validation
use clap::{Parser, Subcommand};
use std::path::PathBuf;

#[derive(Parser)]
#[command(name = "encrypted_archive")]
#[command(about = "Custom encrypted archive tool")]
struct Cli {
    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    /// Pack files into an encrypted archive
    Pack {
        /// Input files to archive
        #[arg(required = true)]
        files: Vec<PathBuf>,
        /// Output archive file
        #[arg(short, long)]
        output: PathBuf,
        /// Disable compression for specified files
        #[arg(long)]
        no_compress: Vec<String>,
    },
    /// Unpack an encrypted archive (for testing)
    Unpack {
        /// Archive file to unpack
        archive: PathBuf,
        /// Output directory
        #[arg(short, long, default_value = ".")]
        output_dir: PathBuf,
    },
    /// Inspect archive metadata without decrypting
    Inspect {
        /// Archive file to inspect
        archive: PathBuf,
    },
}

Anti-Patterns to Avoid

  • Streaming writes without knowing offsets: The TOC contains data_offset for each file. You MUST compute all offsets before writing the TOC. Process all files first, then serialize.
  • Using serde/bincode for binary format: The format spec requires exact byte-level control. Manual serialization with to_le_bytes() is correct and simpler.
  • Single large buffer for entire archive: Process and encrypt files individually, write them sequentially. Each file should be processed independently.
  • Reusing IVs: Each file MUST have a unique random IV. Never reuse IVs across files or archive creations.
  • MAC-then-encrypt: The spec mandates encrypt-then-MAC. HMAC MUST be computed over IV || ciphertext, NOT over plaintext.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| AES-256-CBC encryption | Custom AES implementation | aes 0.8 + cbc 0.1 crates | Side-channel resistance, hardware acceleration, audited |
| PKCS7 padding | Manual padding logic | cbc crate's Pkcs7 padding (via block_padding) | Off-by-one errors in padding are security-critical |
| HMAC-SHA-256 | Manual HMAC construction | hmac 0.12 crate | Constant-time comparison, correct key scheduling |
| SHA-256 hashing | Custom hash | sha2 0.10 crate | Correctness, performance, hardware acceleration |
| Gzip compression | Custom deflate | flate2 1.1 crate | RFC 1952 compliance, performance, battle-tested |
| CLI argument parsing | Manual arg parsing | clap 4.5 with derive | Validation, help text, error messages, subcommands |
| Random IV generation | Custom RNG | rand 0.9 with rand::rng() | CSPRNG with OS seeding, no bias |
| Little-endian serialization | Manual byte shifting | Rust std to_le_bytes()/from_le_bytes() | Built-in, zero-cost, correct |

Key insight: Every component in the encryption pipeline is security-sensitive. Using audited, well-tested crates for crypto operations is not optional -- hand-rolled crypto is the single highest-risk anti-pattern in this domain.

Common Pitfalls

Pitfall 1: Buffer Sizing for encrypt_padded_mut

What goes wrong: PadError at runtime because the buffer is too small for PKCS7-padded output.

Why it happens: PKCS7 ALWAYS adds at least 1 byte. When input is a multiple of 16, a full 16-byte padding block is added. Formula: ((input_len / 16) + 1) * 16.

How to avoid: Always allocate encrypted_size = ((compressed_size / 16) + 1) * 16 bytes for the encryption buffer. Copy compressed data to the start, then call encrypt_padded_mut with compressed_size as the plaintext length.

Warning signs: PadError or unwrap() panic during encryption.
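
The formula can be checked at the block boundaries with a few lines of standard-library Rust. This is only an illustration of the arithmetic; the function names are hypothetical, and the padding itself should still be left to the cbc crate.

```rust
/// Size of the buffer needed for PKCS7-padded AES output (16-byte blocks).
/// Integer division rounds down, so an exact multiple of 16 gains a full
/// extra block.
fn pkcs7_padded_len(input_len: usize) -> usize {
    (input_len / 16 + 1) * 16
}

/// Number of padding bytes PKCS7 appends: always between 1 and 16.
fn pkcs7_pad_bytes(input_len: usize) -> usize {
    16 - (input_len % 16)
}
```

For example, a 16-byte input needs a 32-byte buffer (16 bytes of padding), while a 15-byte input needs 16 bytes (1 byte of padding).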

Pitfall 2: Gzip Non-Determinism in Tests

What goes wrong: Gzip output varies between runs (different compressed_size), making golden tests impossible.

Why it happens: Gzip headers contain a timestamp (mtime) and an OS byte that encoders may fill in differently.

How to avoid: Use GzBuilder::new().mtime(0).write(Vec::new(), Compression::default()) to pin the timestamp to zero. flate2 writes the same OS byte on every run unless it is overridden with GzBuilder::operating_system(), so the header is stable for a given flate2 version.

Warning signs: compressed_size changes between test runs for identical input.

Pitfall 3: Incorrect HMAC Scope

What goes wrong: HMAC computed over wrong data (just ciphertext, or including TOC metadata).

Why it happens: Ambiguity about what "encrypt-then-MAC" covers.

How to avoid: FORMAT.md is explicit: HMAC_input = IV (16 bytes) || ciphertext (encrypted_size bytes). Nothing else. The IV comes from the TOC entry and is concatenated with the ciphertext from the data block.

Warning signs: HMAC verification failures in other decoders (Kotlin, shell).

Pitfall 4: TOC Offset Calculation Errors

What goes wrong: Data blocks written at wrong offsets; decoders read garbage.

Why it happens: Variable-length filename fields make TOC entry sizes differ. Off-by-one in offset arithmetic.

How to avoid: Use the formula from FORMAT.md: entry_size = 101 + name_length. Total TOC size = sum of all entry sizes. First data block offset = toc_offset + toc_size. Each subsequent data block offset = previous offset + previous encrypted_size.

Warning signs: inspect command shows corrupted filenames or impossible sizes.
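
A worked instance of the offset arithmetic, as a sketch under the Phase 2 assumptions (40-byte header, TOC directly after the header, padding_after = 0). The function name layout and the (name, encrypted_size) tuple shape are hypothetical.

```rust
/// Returns (toc_size, data_offsets) for files described by
/// (name, encrypted_size) pairs, in archive order.
fn layout(entries: &[(&str, u32)]) -> (u32, Vec<u32>) {
    const HEADER_SIZE: u32 = 40;
    const ENTRY_FIXED_SIZE: u32 = 101;

    // Each TOC entry is 101 fixed bytes plus the UTF-8 byte length of the name.
    let toc_size: u32 = entries
        .iter()
        .map(|(name, _)| ENTRY_FIXED_SIZE + name.len() as u32)
        .sum();

    // The first data block starts right after the TOC; each later block
    // starts where the previous ciphertext ends.
    let mut next = HEADER_SIZE + toc_size;
    let mut offsets = Vec::with_capacity(entries.len());
    for (_, encrypted_size) in entries {
        offsets.push(next);
        next += encrypted_size;
    }
    (toc_size, offsets)
}
```

With "hello.txt" (9 name bytes, 32 encrypted bytes) and "привет.apk" (16 name bytes, 48 encrypted bytes), the entries are 110 and 117 bytes, the TOC is 227 bytes, and the data blocks start at offsets 267 and 299.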

Pitfall 5: Endianness Errors

What goes wrong: Multi-byte fields written in big-endian or native-endian instead of little-endian.

Why it happens: Forgetting to convert, or using wrong conversion function.

How to avoid: Always use value.to_le_bytes() when writing and u32::from_le_bytes([b0, b1, b2, b3]) when reading. Never use to_ne_bytes() or to_be_bytes().

Warning signs: Values look "swapped" when inspecting hex dump. Shell decoder reads wrong numbers.
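
A small sketch of a bounds-checked little-endian read, standard library only. The helper name read_u32_le is hypothetical.

```rust
use std::convert::TryInto;

/// Reads a little-endian u32 from the first four bytes of `bytes`.
/// Returns None when fewer than four bytes are available.
fn read_u32_le(bytes: &[u8]) -> Option<u32> {
    let raw: [u8; 4] = bytes.get(..4)?.try_into().ok()?;
    Some(u32::from_le_bytes(raw))
}
```

The header's toc_offset of 40 is stored as the bytes 28 00 00 00. Reading those same bytes as big-endian would give 0x28000000, which is the "swapped" value that shows up in a hex dump when the wrong conversion is used.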

Pitfall 6: UTF-8 Filename Length vs. Character Count

What goes wrong: name_length field stores character count instead of byte count.

Why it happens: Confusion between str.len() (byte count, correct) and str.chars().count() (character count, wrong).

How to avoid: FORMAT.md specifies name_length as "Filename length in bytes (UTF-8 encoded byte count)". In Rust, String::len() returns byte count, which is correct.

Warning signs: Non-ASCII filenames (Cyrillic) cause parsing errors in decoders.
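
The difference is easy to see with a Cyrillic filename. The sketch below also guards the u16 range of the field, which is an added assumption based on the 2-byte length prefix; the helper name name_length is hypothetical.

```rust
use std::convert::TryFrom;

/// Value for the TOC name_length field: the UTF-8 byte count of the name,
/// or None when it does not fit in the 2-byte length prefix.
fn name_length(name: &str) -> Option<u16> {
    u16::try_from(name.len()).ok()
}
```

For "привет.txt", len() is 16 (six 2-byte Cyrillic letters plus four ASCII bytes) while chars().count() is 10, so writing the character count would make a decoder stop reading the name 6 bytes early.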

Pitfall 7: Forgetting Flags Byte

What goes wrong: Archive header has wrong flags, so decoders misinterpret format features.

Why it happens: Phase 2 writes only bit 0 (compression); every other bit must stay zero.

How to avoid: Set flags = 0x01 when any file uses compression (global flag) and flags = 0x00 when no file does. Bits 1-3 are reserved for Phase 6 obfuscation features and are left at zero in Phase 2. Bits 4-7 MUST always be zero.

Warning signs: Decoders reject the archive due to unknown flags.
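
A sketch of deriving the header flags byte from the per-file compression flags, plus the reserved-bit check a reader would apply. The constant and function names are hypothetical; the mask matches the 0xF0 check used for bits 4-7.

```rust
const FLAG_COMPRESSION: u8 = 0x01; // bit 0: at least one file is compressed
const RESERVED_MASK: u8 = 0xF0;    // bits 4-7: must always be zero

/// Phase 2 header flags: bit 0 is set when any per-file compression_flag
/// is non-zero; all other bits stay zero.
fn header_flags(compression_flags: &[u8]) -> u8 {
    if compression_flags.iter().any(|&flag| flag != 0) {
        FLAG_COMPRESSION
    } else {
        0x00
    }
}

/// Reader-side check: reject archives that set any of bits 4-7.
fn flags_are_valid(flags: u8) -> bool {
    flags & RESERVED_MASK == 0
}
```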

Code Examples

Verified patterns from official sources and research validation:

Binary Format Serialization (Header)

// Source: FORMAT.md Section 4 + Rust std library
fn write_header(
    writer: &mut impl std::io::Write,
    file_count: u16,
    toc_offset: u32,
    toc_size: u32,
    flags: u8,
) -> std::io::Result<()> {
    // Magic bytes
    writer.write_all(&[0x00, 0xEA, 0x72, 0x63])?;
    // Version
    writer.write_all(&[0x01])?;
    // Flags
    writer.write_all(&[flags])?;
    // File count (LE)
    writer.write_all(&file_count.to_le_bytes())?;
    // TOC offset (LE)
    writer.write_all(&toc_offset.to_le_bytes())?;
    // TOC size (LE)
    writer.write_all(&toc_size.to_le_bytes())?;
    // TOC IV (zero-filled, TOC not encrypted in Phase 2)
    writer.write_all(&[0u8; 16])?;
    // Reserved
    writer.write_all(&[0u8; 8])?;
    Ok(())
}

TOC Entry Serialization

// Source: FORMAT.md Section 5
fn write_toc_entry(
    writer: &mut impl std::io::Write,
    file: &ProcessedFile,
    data_offset: u32,
) -> std::io::Result<()> {
    let name_bytes = file.name.as_bytes();
    writer.write_all(&(name_bytes.len() as u16).to_le_bytes())?;
    writer.write_all(name_bytes)?;
    writer.write_all(&file.original_size.to_le_bytes())?;
    writer.write_all(&file.compressed_size.to_le_bytes())?;
    writer.write_all(&file.encrypted_size.to_le_bytes())?;
    writer.write_all(&data_offset.to_le_bytes())?;
    writer.write_all(&file.iv)?;
    writer.write_all(&file.hmac)?;
    writer.write_all(&file.sha256)?;
    writer.write_all(&[file.compression_flag])?;
    writer.write_all(&0u16.to_le_bytes())?; // padding_after = 0
    Ok(())
}
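
For the inspect and unpack paths, the same field order has to be read back. The following is a minimal standard-library sketch of the inverse of write_toc_entry. The TocEntry struct and the take, take_array and read_toc_entry helpers are hypothetical names, and errors are collapsed to None for brevity.

```rust
use std::convert::TryInto;

/// Hypothetical in-memory form of one TOC entry, in on-disk field order.
#[derive(Debug, PartialEq)]
struct TocEntry {
    name: String,
    original_size: u32,
    compressed_size: u32,
    encrypted_size: u32,
    data_offset: u32,
    iv: [u8; 16],
    hmac: [u8; 32],
    sha256: [u8; 32],
    compression_flag: u8,
    padding_after: u16,
}

/// Splits `n` bytes off the front of `buf`, or returns None if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    if buf.len() < n {
        return None;
    }
    let (head, tail) = buf.split_at(n);
    *buf = tail;
    Some(head)
}

/// Takes exactly N bytes as a fixed-size array.
fn take_array<const N: usize>(buf: &mut &[u8]) -> Option<[u8; N]> {
    take(buf, N)?.try_into().ok()
}

/// Parses one entry and advances `buf` past it. Struct fields are
/// evaluated in the order written, which matches the on-disk order.
fn read_toc_entry(buf: &mut &[u8]) -> Option<TocEntry> {
    let name_len = u16::from_le_bytes(take_array(buf)?) as usize;
    let name = String::from_utf8(take(buf, name_len)?.to_vec()).ok()?;
    Some(TocEntry {
        name,
        original_size: u32::from_le_bytes(take_array(buf)?),
        compressed_size: u32::from_le_bytes(take_array(buf)?),
        encrypted_size: u32::from_le_bytes(take_array(buf)?),
        data_offset: u32::from_le_bytes(take_array(buf)?),
        iv: take_array(buf)?,
        hmac: take_array(buf)?,
        sha256: take_array(buf)?,
        compression_flag: take(buf, 1)?[0],
        padding_after: u16::from_le_bytes(take_array(buf)?),
    })
}
```

Calling read_toc_entry file_count times on the toc_size bytes read at toc_offset should leave the slice empty; leftover or missing bytes indicate a corrupted TOC.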

Inspect Command (Read Header + TOC Only)

// Source: FORMAT.md Section 10, steps 1-4
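use std::io::{Read, Seek, SeekFrom};

// Parsed fixed header fields used by inspect; the TOC IV and reserved bytes
// are read but not kept, since the TOC is not encrypted in Phase 2
struct Header {
    version: u8,
    flags: u8,
    file_count: u16,
    toc_offset: u32,
    toc_size: u32,
}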

fn read_header(reader: &mut impl Read) -> anyhow::Result<Header> {
    let mut buf = [0u8; 40];
    reader.read_exact(&mut buf)?;

    // Verify magic
    anyhow::ensure!(
        buf[0..4] == [0x00, 0xEA, 0x72, 0x63],
        "Invalid magic bytes"
    );

    let version = buf[4];
    anyhow::ensure!(version == 1, "Unsupported version: {}", version);

    let flags = buf[5];
    anyhow::ensure!(flags & 0xF0 == 0, "Unknown flags set: 0x{:02X}", flags);

    let file_count = u16::from_le_bytes([buf[6], buf[7]]);
    let toc_offset = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
    let toc_size = u32::from_le_bytes([buf[12], buf[13], buf[14], buf[15]]);

    Ok(Header { version, flags, file_count, toc_offset, toc_size })
}

Compression Decision Heuristic

// Source: FORMAT.md Section 8 recommendation
fn should_compress(filename: &str, no_compress_list: &[String]) -> bool {
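    // Explicit exclusion from CLI: exact name or suffix match; empty
    // patterns are ignored because every string ends with ""
    if no_compress_list.iter().any(|nc| !nc.is_empty() && filename.ends_with(nc.as_str())) {
        return false;
    }
    // Auto-detect already-compressed formats. Path::extension() returns None
    // for names without a dot, so a file literally named "zip" is still compressed.
    let ext = std::path::Path::new(filename)
        .extension()
        .and_then(|e| e.to_str())
        .unwrap_or("")
        .to_lowercase();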
    !matches!(
        ext.as_str(),
        "apk" | "zip" | "gz" | "bz2" | "xz" | "zst"
        | "png" | "jpg" | "jpeg" | "gif" | "webp"
        | "mp4" | "mp3" | "aac" | "ogg" | "flac"
        | "7z" | "rar" | "jar"
    )
}

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| block-modes 0.8 crate | cbc 0.1 crate (separate crate per mode) | 2022 | block-modes is deprecated. Use cbc directly. |
| rand::thread_rng() | rand::rng() | rand 0.9 (2025) | Function renamed. Same underlying ChaCha CSPRNG. |
| GenericArray for keys/IVs | .into() conversion from [u8; N] | aes/cbc 0.8/0.1 | Can pass &key.into() directly from fixed arrays. |
| byteorder crate | Rust std to_le_bytes()/from_le_bytes() | Rust 1.32 (2019) | No external crate needed for endian conversion. |

Deprecated/outdated:

  • block-modes crate: Replaced by individual mode crates (cbc, ecb, cfb, ofb). Do NOT use block-modes.
  • rand::thread_rng(): Renamed to rand::rng() in 0.9. The old name is deprecated in 0.9; use rand::rng() in new code.
  • crypto-mac crate: Merged into digest 0.10. Use hmac 0.12 which uses digest 0.10 internally.

Open Questions

  1. Hardcoded key value

    • What we know: The key is 32 bytes, hardcoded, shared across all decoders.
    • What's unclear: The specific key bytes are not defined in FORMAT.md (only the worked example uses 00 01 02 ... 1F).
    • Recommendation: Define a non-trivial key constant in src/key.rs. The planner should decide the actual key bytes or generate them randomly once. The worked example key is fine for testing but should be replaced for production.
  2. Error handling strategy for unpack

    • What we know: FORMAT.md says "MUST reject" on HMAC failure, "MUST fail" on bad version.
    • What's unclear: Should unpack abort on first file error, or continue extracting other files?
    • Recommendation: Abort on header/TOC errors. For per-file errors (HMAC mismatch, SHA-256 mismatch), report the error but continue extracting remaining files (with a non-zero exit code at the end).
  3. Maximum file size constraint (u32)

    • What we know: original_size, compressed_size, encrypted_size are all u32 (max ~4 GB).
    • What's unclear: Should the archiver check and reject files > 4 GB?
    • Recommendation: Yes, validate file sizes during pack and produce a clear error if any file exceeds u32::MAX. This is acceptable given the Out of Scope note ("files fit in memory").
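
If the recommendation for question 2 is adopted, the per-file loop could look like the sketch below: each failure is reported, extraction continues, and the exit code is non-zero if anything failed. The names unpack_all and FileResult are hypothetical, and the extract closure stands in for the real HMAC-verify, decrypt, decompress and SHA-256 steps.

```rust
/// Outcome of extracting one file; Err carries a human-readable reason
/// such as "HMAC mismatch" or "SHA-256 mismatch".
type FileResult = Result<(), String>;

/// Runs `extract` for every file name, reports each failure on stderr,
/// and returns the process exit code: 0 if all files succeeded, 1 otherwise.
fn unpack_all<F>(names: &[&str], mut extract: F) -> i32
where
    F: FnMut(&str) -> FileResult,
{
    let mut failures = 0usize;
    for &name in names {
        if let Err(reason) = extract(name) {
            eprintln!("error: {}: {}", name, reason);
            failures += 1; // keep going: later files may still be intact
        }
    }
    if failures == 0 { 0 } else { 1 }
}
```

Header and TOC errors would be handled before this loop and abort immediately, since no file can be located without them.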

Sources

Primary (HIGH confidence)

  • docs/FORMAT.md v1.0 -- The normative specification for the binary format. All byte offsets, field sizes, and pipeline steps are from this document.
  • docs.rs/aes/0.8.4 -- AES crate API documentation
  • docs.rs/cbc/0.1.2 -- CBC mode crate API documentation and usage examples
  • docs.rs/hmac/0.12.1 -- HMAC crate API documentation and usage examples
  • docs.rs/sha2/0.10.9 -- SHA-2 crate API documentation
  • docs.rs/flate2/1.1.9 -- flate2 crate API documentation (GzEncoder, GzDecoder, GzBuilder)
  • docs.rs/clap/4.5.60 -- Clap CLI crate documentation
  • docs.rs/rand/0.9.2 -- Rand crate documentation
  • Research validation: Full pipeline (compress -> encrypt -> HMAC -> verify -> decrypt -> decompress -> verify) was compiled and executed successfully as a Rust program during this research.

Secondary (MEDIUM confidence)

  • crates.io version listings -- Latest stable versions verified via cargo search and crates.io API
  • rust-random.github.io/book -- Rand book confirming ThreadRng is ChaCha-based CSPRNG

Tertiary (LOW confidence)

  • None. All findings are verified against official documentation and compilation tests.

Metadata

Confidence breakdown:

  • Standard stack: HIGH -- All crates verified via cargo check, full pipeline compiled and executed
  • Architecture: HIGH -- Follows standard Rust CLI patterns; FORMAT.md provides exact byte-level specification
  • Pitfalls: HIGH -- Common issues identified from official docs, GitHub issues, and practical validation

Research date: 2026-02-24

Valid until: 2026-04-24 (stable crates, slow-moving ecosystem)