diff --git a/.planning/phases/05-shell-decoder/05-RESEARCH.md b/.planning/phases/05-shell-decoder/05-RESEARCH.md new file mode 100644 index 0000000..b748ca9 --- /dev/null +++ b/.planning/phases/05-shell-decoder/05-RESEARCH.md @@ -0,0 +1,385 @@ +# Phase 5: Shell Decoder - Research + +**Researched:** 2026-02-25 +**Domain:** POSIX/busybox shell scripting, binary format parsing, AES-256-CBC decryption via CLI +**Confidence:** HIGH + +## Summary + +The shell decoder is a busybox-compatible shell script that extracts files from archives created by the Rust archiver. The script must parse the binary format (header + TOC) using `dd` and hex conversion tools, decrypt each file with `openssl enc -aes-256-cbc`, optionally decompress with `gunzip`, and verify integrity with `sha256sum`. The format spec (FORMAT.md Section 13) already provides reference functions for most operations. + +The main technical challenges are: (1) `openssl` is NOT a busybox applet -- it requires a separate `openssl` binary on the target system; (2) `xxd` was added to busybox in v1.28 (2017) but older versions lack it -- `od` must serve as fallback; (3) extracting UTF-8 filenames (Cyrillic) from binary data requires careful byte-range extraction with `dd` piped to raw output; (4) little-endian integer parsing in shell requires byte-swapping via hex string manipulation. + +**Primary recommendation:** Build a single self-contained `decode.sh` script using the reference functions from FORMAT.md Section 13 as a foundation, with `od`-based fallbacks for `xxd`, graceful degradation for HMAC verification if `openssl dgst` lacks `-mac` support, and a cross-validation test script modeled after `kotlin/test_decoder.sh`. + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|-----------------| +| SHL-01 | Shell script dearchivation via busybox (dd, xxd, openssl, gunzip) | FORMAT.md Section 13 provides complete reference functions. `dd` and `gunzip` are native busybox applets. 
`xxd` available in busybox >=1.28 with `od` fallback. `openssl` requires external binary. | +| SHL-02 | openssl enc -aes-256-cbc with -K/-iv/-nosalt for raw key mode | OpenSSL `enc` supports `-K` (hex key), `-iv` (hex IV), `-nosalt` and auto-removes PKCS7 padding on decryption. Standard across OpenSSL 1.x and 3.x. | +| SHL-03 | Support files with non-ASCII names (Cyrillic) | `dd` extracts raw bytes which are valid UTF-8. The filename bytes can be written directly to a variable using command substitution. Shell/filesystem handles UTF-8 natively if `LANG` is set properly. | + + +## Standard Stack + +### Core + +| Tool | Version | Purpose | Why Standard | +|------|---------|---------|--------------| +| `dd` | busybox built-in | Extract byte ranges from archive | Native applet, universal on all busybox builds | +| `openssl` | 1.1.1+ or 3.x (external) | AES-256-CBC decryption, HMAC-SHA-256 | Only CLI tool supporting raw-key AES-CBC decryption | +| `gunzip` | busybox built-in | Gzip decompression | Native applet, handles standard gzip streams | +| `sha256sum` | busybox built-in | SHA-256 integrity verification | Native applet, produces standard hash output | +| `sh` | busybox ash/sh | Script interpreter | POSIX-compatible shell, always available | + +### Supporting + +| Tool | Version | Purpose | When to Use | +|------|---------|---------|-------------| +| `xxd` | busybox >=1.28 (optional) | Binary-to-hex conversion | Primary hex encoder; added to busybox in 2017 | +| `od` | busybox built-in | Binary-to-hex conversion (fallback) | Fallback when `xxd` is unavailable | +| `awk` | busybox built-in | Text processing (parse openssl output) | Extract hash values from command output | +| `tr` | busybox built-in | Character deletion/translation | Remove whitespace/newlines from hex output | +| `printf` | shell built-in | Hex-to-decimal conversion | Convert `0xNN` strings to decimal integers | +| `mktemp` | busybox built-in | Create temporary files | Temporary storage for 
ciphertext/decrypted data | + +### Alternatives Considered + +| Instead of | Could Use | Tradeoff | +|------------|-----------|----------| +| `xxd -p` for hex | `od -A n -t x1` for hex | `od` is universally available but output needs more cleanup (spaces, newlines) | +| `openssl dgst -mac HMAC` for HMAC | Skip HMAC verification | Older/minimal openssl may not support `-mac HMAC -macopt`; graceful degradation is the spec's recommendation | +| `printf '%d' "0x..."` for hex-to-dec | `$(( 16#... ))` bash arithmetic | `printf` is more portable across sh implementations; bash arithmetic is not POSIX | +| `sha256sum` for SHA-256 | `openssl dgst -sha256` | Both work; `sha256sum` is a busybox applet (no external dependency) | + +## Architecture Patterns + +### Recommended Script Structure + +``` +shell/ +├── decode.sh # Main decoder script (single file, self-contained) +└── test_decoder.sh # Cross-validation test script (Rust pack -> Shell decode) +``` + +The decoder is a SINGLE self-contained script. No external libraries, no sourced files. This matches the project's deployment model: the script is copied alongside the archive to the target device. + +### Pattern 1: Detect-and-Fallback for xxd/od + +**What:** Auto-detect if `xxd` is available; if not, define wrapper functions using `od`. +**When to use:** At script startup, before any binary parsing. +**Example:** + +```sh +# Source: FORMAT.md Section 13.1 + 13.2 (od fallback) +if command -v xxd >/dev/null 2>&1; then + read_hex() { + dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | xxd -p | tr -d '\n' + } +else + read_hex() { + dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \ + | od -A n -t x1 | tr -d ' \n' + } +fi +``` + +### Pattern 2: Little-Endian Integer Parsing via Hex Byte Swap + +**What:** Read N bytes as hex, swap byte order, convert to decimal. +**When to use:** For every u16/u32 field in the header and TOC. 
+**Example:** + +```sh +# Source: FORMAT.md Section 13.1 +read_le_u16() { + local hex=$(read_hex "$1" "$2" 2) + local b0=${hex:0:2} b1=${hex:2:2} + printf '%d' "0x${b1}${b0}" +} + +read_le_u32() { + local hex=$(read_hex "$1" "$2" 4) + local b0=${hex:0:2} b1=${hex:2:2} b2=${hex:4:2} b3=${hex:6:2} + printf '%d' "0x${b3}${b2}${b1}${b0}" +} +``` + +### Pattern 3: Sequential TOC Parsing with Running Offset + +**What:** Parse variable-length TOC entries using a running byte offset. +**When to use:** When reading the file table (TOC), each entry has a variable-length filename. +**Example:** + +```sh +# Start at toc_offset +pos=$toc_offset + +for i in $(seq 0 $((file_count - 1))); do + name_length=$(read_le_u16 "$ARCHIVE" "$pos") + pos=$((pos + 2)) + + # Extract filename (raw UTF-8 bytes via dd) + filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null) + pos=$((pos + name_length)) + + original_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4)) + compressed_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4)) + encrypted_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4)) + data_offset=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4)) + + iv_hex=$(read_hex "$ARCHIVE" "$pos" 16); pos=$((pos + 16)) + hmac_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32)) + sha256_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32)) + + compression_flag=$(read_hex "$ARCHIVE" "$pos" 1); pos=$((pos + 1)) + padding_after=$(read_le_u16 "$ARCHIVE" "$pos"); pos=$((pos + 2)) + + # Process this file entry... +done +``` + +### Pattern 4: Pipe-Based Decryption (dd | openssl) + +**What:** Extract ciphertext with `dd` and pipe directly to `openssl enc -d` for decryption. +**When to use:** For each file's data block decryption. 
+**Example:** + +```sh +# Source: FORMAT.md Section 13.4 +dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" 2>/dev/null \ + | openssl enc -d -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$iv_hex" \ + > "$tmpfile" +``` + +### Pattern 5: Graceful HMAC Degradation + +**What:** Detect if `openssl dgst -mac HMAC` is supported; skip HMAC if not. +**When to use:** Before the file extraction loop. +**Example:** + +```sh +# Source: FORMAT.md Section 13.3 +SKIP_HMAC=0 +if ! echo -n "test" | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 >/dev/null 2>&1; then + echo "WARNING: openssl HMAC not available, skipping integrity verification" + SKIP_HMAC=1 +fi +``` + +### Anti-Patterns to Avoid + +- **Using bash-specific syntax:** The script must run in busybox `ash`/`sh`. No `[[ ]]`, no `$((16#FF))`, no arrays, no process substitution `<()`. Use `[ ]`, `printf '%d' "0x..."`, positional parameters or temp files. +- **Reading entire archive into memory:** Shell cannot handle binary data in variables. Always use `dd` to extract specific byte ranges to files or pipes. +- **Using `-e` flag with echo for binary:** Portability issues across shells. Use `printf` or `dd` instead. +- **Storing binary data in shell variables:** NULL bytes (`\0`) terminate strings in shell. Only store hex strings in variables, never raw binary. +- **Hardcoding `/tmp`:** Use `mktemp` for temporary files. Clean up with a trap. +- **Using `xxd -e` (little-endian mode):** Not supported in busybox xxd. Manual byte swapping is required. 
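The last anti-pattern leaves byte swapping to the script itself. As a minimal sketch (the name `le_hex_to_dec` is hypothetical and does not come from FORMAT.md), the swap can be written with POSIX parameter expansion alone, so it does not depend on the `${var:offset:length}` substring form, which is not part of POSIX:

```sh
# Hypothetical helper: convert a little-endian hex string (e.g. "78563412")
# to decimal by reversing its byte pairs. Uses only POSIX parameter expansion.
le_hex_to_dec() {
    _le=$1
    _be=""
    while [ -n "$_le" ]; do
        _rest=${_le#??}                      # everything after the first two hex digits
        [ "$_rest" = "$_le" ] && return 1    # fewer than two digits left: odd-length input
        _byte=${_le%"$_rest"}                # the first two hex digits (one byte)
        _be="${_byte}${_be}"                 # prepend, which reverses the byte order
        _le=$_rest
    done
    printf '%d\n' "0x${_be}"
}

le_hex_to_dec 0100        # a u16 field holding the value 1
le_hex_to_dec 78563412    # a u32 field holding 0x12345678
```

The helper takes the hex string as an argument, so it could be combined with any `read_hex` variant. The variables are global because `local` is also outside POSIX; the underscore prefix is only a naming convention to reduce collisions.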
+ +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| AES-256-CBC decryption | Custom decryption in shell | `openssl enc -d -aes-256-cbc` | Impossible to implement AES in pure shell; openssl handles PKCS7 removal automatically | +| Gzip decompression | Custom DEFLATE in shell | `gunzip -c` | Compression algorithms cannot be implemented in shell | +| SHA-256 hashing | Custom hash in shell | `sha256sum` (busybox) | Cryptographic hash requires proper implementation | +| HMAC-SHA-256 | Custom HMAC in shell | `openssl dgst -sha256 -mac HMAC` | HMAC construction is subtle; openssl handles it correctly | +| Hex-to-binary conversion | Manual byte construction | `xxd -r -p` or `printf '\xNN'` | Direct hex-to-binary tools already exist | + +**Key insight:** The shell decoder is fundamentally a "glue script" -- it orchestrates existing tools (`dd`, `openssl`, `gunzip`, `sha256sum`) to implement the decode pipeline. All cryptographic and compression operations are delegated to dedicated tools; the script only handles binary format parsing (offsets, lengths, byte swapping). + +## Common Pitfalls + +### Pitfall 1: openssl enc Output Contains Extra Bytes + +**What goes wrong:** When using `openssl enc -d` with piped input from `dd`, the `openssl` command may behave differently depending on whether input comes from a file or a pipe. Some versions have issues with incomplete reads on pipes. +**Why it happens:** Pipe buffering and EOF handling can differ from file I/O. +**How to avoid:** Always extract ciphertext to a temp file first, then decrypt from the file. Or use `dd ... | openssl enc -d ...` and verify the output size matches `compressed_size` (or `original_size` if uncompressed). +**Warning signs:** Decrypted output size doesn't match expected `compressed_size`. 
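The size check suggested for Pitfall 1 can be sketched as follows. `file_size` and `size_matches` are hypothetical helper names; in the decoder the expected size would come from the TOC fields parsed earlier.

```sh
# Hypothetical helpers for checking the size of decrypted output.
file_size() {
    # wc -c pads its output with spaces on some platforms; strip them.
    wc -c < "$1" | tr -d ' '
}

size_matches() {
    # $1 = file, $2 = expected size in bytes
    [ "$(file_size "$1")" -eq "$2" ]
}

# Demonstration on a throwaway 5-byte file.
demo=$(mktemp)
printf 'hello' > "$demo"
if size_matches "$demo" 5; then
    echo "size ok"
else
    echo "size mismatch" >&2
fi
rm -f "$demo"
```

In the decode loop the second argument would be `compressed_size` when `compression_flag` is `01` and `original_size` otherwise.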
+ +### Pitfall 2: Hex String Case Mismatch in HMAC Comparison + +**What goes wrong:** `openssl dgst` outputs lowercase hex, but the stored HMAC extracted via `xxd -p` or `od` may produce different case. +**Why it happens:** Different tools use different case conventions for hex output. +**How to avoid:** Normalize both sides to lowercase before comparison: `echo "$hex" | tr 'A-F' 'a-f'`. +**Warning signs:** HMAC verification always fails despite correct data. + +### Pitfall 3: Empty File (0 bytes) Causes gunzip Error + +**What goes wrong:** A 0-byte original file stored with `compression_flag=1` decrypts to a valid but tiny gzip stream, so `compressed_size` is small but nonzero while `original_size` is 0. Size checks and decompression logic can mishandle this edge case. +**Why it happens:** Gzip of empty input produces a minimal gzip stream (~20 bytes). PKCS7 pads that to the next 16-byte boundary, so `encrypted_size = 32`. The decrypted output is a valid gzip stream that decompresses to 0 bytes. +**How to avoid:** Check `original_size == 0` before decompression; if zero, just create an empty file. Alternatively, let `gunzip` handle it (it should produce empty output from a valid empty gzip stream). +**Warning signs:** Script crashes or produces garbage for 0-byte files. + +### Pitfall 4: Cyrillic Filename Extraction Corrupts Characters + +**What goes wrong:** The `dd`-extracted filename bytes are valid UTF-8, but the shell variable may be corrupted if the locale is not set or if intermediate processing strips high bytes. +**Why it happens:** Some busybox builds strip non-ASCII bytes, or the `LANG`/`LC_ALL` environment may not support UTF-8. +**How to avoid:** Set `export LC_ALL=C` (or `C.UTF-8` if available) at the top of the script. Use `dd` to extract raw bytes directly. Do NOT process the filename through `tr`, `sed`, or `awk` before writing. Verify that `printf '%s'` preserves the bytes. +**Warning signs:** Extracted files have garbled names (mojibake) or are named with `?` characters. 
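The "verify that `printf '%s'` preserves the bytes" step in Pitfall 4 can be checked mechanically. The sketch below uses hypothetical helper names and compares the hex of the raw bytes with the hex of the same bytes after a round trip through a shell variable. It assumes the local `od` accepts `-A n -v -t x1`.

```sh
# Hypothetical check: does a filename survive capture in a shell variable?
hex_of_stdin() {
    od -A n -v -t x1 | tr -d ' \n'
}

name_survives_capture() {
    # $1 = archive file, $2 = byte offset of the name, $3 = name length in bytes
    _raw_hex=$(dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | hex_of_stdin)
    _name=$(dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null)
    _var_hex=$(printf '%s' "$_name" | hex_of_stdin)
    [ "$_raw_hex" = "$_var_hex" ]
}

# Demonstration: two filler bytes followed by "привет.txt" as UTF-8 (16 bytes).
demo=$(mktemp)
printf 'XX\320\277\321\200\320\270\320\262\320\265\321\202.txt' > "$demo"
if name_survives_capture "$demo" 2 16; then
    echo "filename bytes preserved"
else
    echo "filename bytes altered" >&2
fi
rm -f "$demo"
```

Command substitution strips trailing newlines, so a name that ends in a newline byte would fail this check even in a correctly configured shell.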
+ +### Pitfall 5: Shell Arithmetic Overflow on Large Files + +**What goes wrong:** Shell arithmetic uses platform-native integers. In a shell with signed 32-bit arithmetic, values of 2^31 (2 GB) and above overflow. +**Why it happens:** The archive format uses u32 for sizes and offsets (max 4 GB), but shell arithmetic may be limited to signed 32-bit. +**How to avoid:** All fields are u32 LE (max ~4 GB). Busybox on ARM may use 32-bit arithmetic, depending on how it was built. The risk is LOW as long as offsets and sizes stay below 2 GB; only values between 2 GB and 4 GB are affected. If needed, use `awk` for arithmetic on large numbers. +**Warning signs:** Negative offsets or sizes in the script output for files larger than 2 GB. + +### Pitfall 6: dd stderr Noise Pollutes Output + +**What goes wrong:** `dd` writes transfer statistics to stderr (e.g., "32 bytes transferred"). If stderr is not suppressed, it may confuse piped commands or pollute user output. +**Why it happens:** `dd` always writes stats to stderr unless suppressed. +**How to avoid:** Always use `2>/dev/null` with `dd` commands. This is already shown in FORMAT.md Section 13 reference functions. +**Warning signs:** Unexpected text mixed into hex output or filenames. + +### Pitfall 7: openssl 3.x Changes HMAC Syntax + +**What goes wrong:** The `-mac HMAC -macopt hexkey:KEY` syntax works in OpenSSL 1.x and 3.x but is soft-deprecated in 3.x. The new `openssl mac` subcommand is preferred but has different syntax. +**Why it happens:** OpenSSL 3.x migrated to provider-based architecture; legacy options still work but may be removed. +**How to avoid:** Implement graceful degradation (already specified in FORMAT.md Section 13.3). Test with `echo -n "test" | openssl dgst -sha256 -mac HMAC -macopt hexkey:00` at script startup. If it fails, set `SKIP_HMAC=1`. +**Warning signs:** HMAC check produces error messages instead of hash values. 
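If the legacy `dgst` form is ever unavailable, the startup probe could try the `openssl mac` form before giving up. The sketch below is an assumption-laden illustration, not the FORMAT.md reference: the helper names are hypothetical, and the `openssl mac -digest SHA256 -macopt hexkey:... HMAC` argument order should be confirmed against the target's OpenSSL version. Both outputs are normalized to lowercase, which also addresses Pitfall 2.

```sh
# Hypothetical helpers: pick whichever HMAC invocation the local openssl accepts.
detect_hmac_mode() {
    if printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 >/dev/null 2>&1; then
        echo dgst
    elif printf 'test' | openssl mac -digest SHA256 -macopt hexkey:00 HMAC >/dev/null 2>&1; then
        echo mac
    else
        echo none
    fi
}

HMAC_MODE=$(detect_hmac_mode)

hmac_sha256_hex() {
    # $1 = key as hex; data is read from stdin; prints the MAC as lowercase hex
    [ "$HMAC_MODE" = none ] && return 1
    if [ "$HMAC_MODE" = dgst ]; then
        openssl dgst -sha256 -mac HMAC -macopt "hexkey:$1"
    else
        openssl mac -digest SHA256 -macopt "hexkey:$1" HMAC
    fi 2>/dev/null | awk '{print $NF}' | tr 'A-F' 'a-f'
}

if [ "$HMAC_MODE" = none ]; then
    echo "WARNING: no usable openssl HMAC command, skipping HMAC verification" >&2
else
    # Key and message from RFC 4231 test case 1, used here only as a smoke test.
    printf 'Hi There' | hmac_sha256_hex 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
fi
```

Detection runs once up front because the data arrives on stdin and cannot be replayed after a failed first attempt.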
+ +## Code Examples + +### Complete Decode Pipeline for One File + +```sh +# Verified pattern from FORMAT.md Section 13.4 + project decisions +# Runs inside the TOC loop from Pattern 3 (hence `continue`). Assumes ARCHIVE, +# OUTPUT_DIR and SKIP_HMAC are set, and that iv_toc_offset was saved as $pos +# immediately before iv_hex was read from the TOC entry. +KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550" +TMPDIR=$(mktemp -d) +trap 'rm -rf "$TMPDIR"' EXIT + +# Step 1: Extract ciphertext to temp file +dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" \ + of="$TMPDIR/ct.bin" 2>/dev/null + +# Step 2: Verify HMAC (if available) +if [ "$SKIP_HMAC" = "0" ]; then + computed_hmac=$( + { + dd if="$ARCHIVE" bs=1 skip="$iv_toc_offset" count=16 2>/dev/null # IV from TOC + cat "$TMPDIR/ct.bin" # ciphertext + } | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${KEY_HEX}" -hex 2>/dev/null \ + | awk '{print $NF}' + ) + if [ "$computed_hmac" != "$hmac_hex" ]; then + echo "HMAC failed for $filename, skipping" >&2 + continue + fi +fi + +# Step 3: Decrypt (openssl auto-removes PKCS7 padding) +openssl enc -d -aes-256-cbc -nosalt \ + -K "$KEY_HEX" -iv "$iv_hex" \ + -in "$TMPDIR/ct.bin" -out "$TMPDIR/dec.bin" + +# Step 4: Decompress if needed +if [ "$compression_flag" = "01" ]; then + gunzip -c "$TMPDIR/dec.bin" > "$TMPDIR/out.bin" +else + mv "$TMPDIR/dec.bin" "$TMPDIR/out.bin" +fi + +# Step 5: Verify SHA-256 +actual_sha=$(sha256sum "$TMPDIR/out.bin" | awk '{print $1}') +if [ "$actual_sha" != "$sha256_hex" ]; then + echo "WARNING: SHA-256 mismatch for $filename" >&2 +fi + +# Step 6: Write output +mv "$TMPDIR/out.bin" "$OUTPUT_DIR/$filename" +``` + +### Key Hex Constant (from src/key.rs) + +```sh +# Hardcoded 32-byte AES-256 key as hex string (matching src/key.rs) +KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550" +``` + +Derivation from `src/key.rs`: +``` +0x7A 0x35 0xC1 0xD9 0x4F 0xE8 0x2B 0x6A +0x91 0x0D 0xF3 0x58 0xBC 0x74 0xA6 0x1E +0x42 0x8F 0xD0 0x63 0xE5 0x17 0x9B 0x2C +0xFA 0x84 0x06 0xCD 0x3E 0x79 0xB5 0x50 +``` + +### HMAC Verification with IV from Archive (Not from TOC-parsed Variable) + +A subtle point: for HMAC 
verification, the IV bytes must come from the archive file (not from a hex variable). The HMAC is computed over raw `IV || ciphertext` bytes, not hex strings. The approach using `dd` to extract IV bytes and concatenating with ciphertext via subshell `{ dd ...; dd ...; }` is correct (as shown in FORMAT.md Section 13.3). + +However, for the HMAC *comparison*, we compare hex strings (both from `openssl dgst` output and from the TOC hex extraction). Both must be lowercase. + +### UTF-8 Filename Extraction + +```sh +# dd extracts raw bytes; if they are valid UTF-8, the shell preserves them +filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null) +# $filename now contains UTF-8 string, including Cyrillic characters +# Works because: (1) dd copies raw bytes, (2) $() captures them, (3) no null bytes in UTF-8 filenames +``` + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| `openssl dgst -mac HMAC -macopt` | `openssl mac -digest SHA256 -macopt hexkey:...` | OpenSSL 3.0 (2021) | Old syntax still works in 3.x but soft-deprecated | +| `xxd` not in busybox | `xxd` applet in busybox | BusyBox 1.28 (2017) | Available on newer builds, but `od` fallback still needed for older systems | +| `openssl enc` with `-md md5` default | `-md sha256` default | OpenSSL 1.1.0 (2016) | No impact for raw key mode (`-K`/`-iv`); `-md` only affects password-derived keys | + +**Deprecated/outdated:** +- `openssl dgst -hmac "key"` (string key): Still works but `-macopt hexkey:` is required for binary keys. The hex key mode is NOT deprecated. +- busybox builds without xxd: Still common on very old/minimal systems, hence `od` fallback is essential. + +## Open Questions + +1. **Does the target busybox have `openssl`?** + - What we know: openssl is NOT a busybox applet. It must be a separate binary on the target system. The project mentions "busybox-compatible" in requirements. 
+ - What's unclear: Whether the specific target device (Android-based car head unit) has the `openssl` CLI installed. + - Recommendation: The script MUST fail with a clear error message if `openssl` is not found. Document this as a prerequisite. The script already checks for tool availability at startup. + +2. **Does the target `openssl` support `-mac HMAC -macopt hexkey:`?** + - What we know: Standard OpenSSL 1.1.1+ and 3.x support this syntax. Busybox does not include openssl. Minimal/embedded openssl builds may lack HMAC support. + - What's unclear: Exact openssl version on target. + - Recommendation: Implement graceful degradation per FORMAT.md Section 13.3. Skip HMAC if unsupported, print warning. + +3. **Performance on large files with `dd bs=1`?** + - What we know: `dd bs=1` reads one byte at a time. For extracting large data blocks (megabytes), this is very slow. + - What's unclear: Whether the shell decoder needs to handle large files efficiently. + - Recommendation: Keep `dd bs=1 skip=OFFSET count=SIZE` for header and TOC fields, where byte-exact offsets matter and only a few bytes are read. Note that `bs=1` issues one read and one write per byte and `dd` does not buffer across blocks, so ciphertext extraction (Step 1 of decode) slows down in proportion to file size. For truly large files, consider `dd bs=4096` with calculated skip/count, but the added complexity may not be worth it for a fallback decoder. 
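One possible way to avoid per-byte I/O for data blocks without block-size arithmetic is to combine `tail -c` and `head -c`. This is a sketch under assumptions: the helper name `extract_range` is hypothetical, and `head -c` is not guaranteed on every busybox build, so its availability on the target would need checking.

```sh
# Hypothetical helper: write $3 bytes of file $1, starting at 0-based byte
# offset $2, to stdout. Assumes `tail -c +N` and `head -c N` are both supported.
extract_range() {
    # tail -c +N starts output at the Nth byte, counting from 1
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

# Demonstration on a throwaway file containing the digits 0-9.
demo=$(mktemp)
printf '0123456789' > "$demo"
extract_range "$demo" 3 4
echo
rm -f "$demo"
```

In the decoder this could stand in for the Step 1 `dd` call, for example `extract_range "$ARCHIVE" "$data_offset" "$encrypted_size" > "$TMPDIR/ct.bin"`, while the small header and TOC reads stay on `dd bs=1`.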
+ +## Sources + +### Primary (HIGH confidence) +- FORMAT.md Section 13 (Shell Decoder Reference) - Complete reference functions for all operations +- FORMAT.md Section 10 (Decode Order of Operations) - Mandatory decode pipeline +- FORMAT.md Section 4-5 (Header/TOC structure) - Binary layout +- `src/key.rs` - Actual hardcoded key bytes +- `src/format.rs` - Rust implementation of header/TOC parsing (reference) +- `src/crypto.rs` - Rust crypto implementation (HMAC scope, encrypt/decrypt) +- `kotlin/ArchiveDecoder.kt` - Working decoder implementation (behavioral reference) +- `kotlin/test_decoder.sh` - Cross-validation test pattern (structural reference) + +### Secondary (MEDIUM confidence) +- [OpenSSL enc documentation (3.3)](https://docs.openssl.org/3.3/man1/openssl-enc/) - `-K`, `-iv`, `-nosalt`, PKCS7 auto-removal +- [OpenSSL dgst documentation (3.3)](https://docs.openssl.org/3.3/man1/openssl-dgst/) - `-mac HMAC -macopt hexkey:` syntax +- [BusyBox xxd commit (2017)](https://lists.busybox.net/pipermail/busybox-cvs/2017-January/036600.html) - xxd applet added to busybox +- [BusyBox applet list](https://www.busybox.net/downloads/BusyBox.html) - dd, gunzip, sha256sum, od are native applets; openssl, xxd are not +- [BusyBox xxd options](https://www.boxmatrix.info/wiki/Property:xxd_(bbcmd)) - Supported flags: -p, -r, -l, -s, -g, -c, -u +- [docker-library/busybox#13](https://github.com/docker-library/busybox/issues/13) - UTF-8 support limitations in busybox + +### Tertiary (LOW confidence) +- Various shell scripting guides on binary file handling - General patterns, not project-specific + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH - All tools are well-documented CLI utilities with decades of stability. FORMAT.md Section 13 provides verified reference code. +- Architecture: HIGH - Single-file script pattern is proven by the existing `kotlin/test_decoder.sh`. Binary parsing pattern with dd+xxd/od is well-established. 
+- Pitfalls: HIGH - Identified from real tool behavior (openssl pipe handling, hex case, PKCS7, busybox limitations). FORMAT.md already anticipates several pitfalls (graceful HMAC degradation, od fallback). + +**Research date:** 2026-02-25 +**Valid until:** 2026-03-25 (stable domain -- shell tools and openssl CLI interface change very slowly)