docs(phase5): research shell decoder domain

NikitolProject
2026-02-25 01:29:50 +03:00
parent 4666a348fe
commit 79a7ce2010


@@ -0,0 +1,385 @@
# Phase 5: Shell Decoder - Research
**Researched:** 2026-02-25
**Domain:** POSIX/busybox shell scripting, binary format parsing, AES-256-CBC decryption via CLI
**Confidence:** HIGH
## Summary
The shell decoder is a busybox-compatible shell script that extracts files from archives created by the Rust archiver. The script must parse the binary format (header + TOC) using `dd` and hex conversion tools, decrypt each file with `openssl enc -aes-256-cbc`, optionally decompress with `gunzip`, and verify integrity with `sha256sum`. The format spec (FORMAT.md Section 13) already provides reference functions for most operations.
The main technical challenges are: (1) `openssl` is NOT a busybox applet -- it requires a separate `openssl` binary on the target system; (2) `xxd` was added to busybox in v1.28 (2017) but older versions lack it -- `od` must serve as fallback; (3) extracting UTF-8 filenames (Cyrillic) from binary data requires careful byte-range extraction with `dd` piped to raw output; (4) little-endian integer parsing in shell requires byte-swapping via hex string manipulation.
**Primary recommendation:** Build a single self-contained `decode.sh` script using the reference functions from FORMAT.md Section 13 as a foundation, with `od`-based fallbacks for `xxd`, graceful degradation for HMAC verification if `openssl dgst` lacks `-mac` support, and a cross-validation test script modeled after `kotlin/test_decoder.sh`.
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| SHL-01 | Shell script dearchivation via busybox (dd, xxd, openssl, gunzip) | FORMAT.md Section 13 provides complete reference functions. `dd` and `gunzip` are native busybox applets. `xxd` available in busybox >=1.28 with `od` fallback. `openssl` requires external binary. |
| SHL-02 | openssl enc -aes-256-cbc with -K/-iv/-nosalt for raw key mode | OpenSSL `enc` supports `-K` (hex key), `-iv` (hex IV), `-nosalt` and auto-removes PKCS7 padding on decryption. Standard across OpenSSL 1.x and 3.x. |
| SHL-03 | Support files with non-ASCII names (Cyrillic) | `dd` extracts raw bytes which are valid UTF-8. The filename bytes can be written directly to a variable using command substitution. Shell/filesystem handles UTF-8 natively if `LANG` is set properly. |
</phase_requirements>
## Standard Stack
### Core
| Tool | Version | Purpose | Why Standard |
|------|---------|---------|--------------|
| `dd` | busybox built-in | Extract byte ranges from archive | Native applet, universal on all busybox builds |
| `openssl` | 1.1.1+ or 3.x (external) | AES-256-CBC decryption, HMAC-SHA-256 | Only CLI tool supporting raw-key AES-CBC decryption |
| `gunzip` | busybox built-in | Gzip decompression | Native applet, handles standard gzip streams |
| `sha256sum` | busybox built-in | SHA-256 integrity verification | Native applet, produces standard hash output |
| `sh` | busybox ash/sh | Script interpreter | POSIX-compatible shell, always available |
### Supporting
| Tool | Version | Purpose | When to Use |
|------|---------|---------|-------------|
| `xxd` | busybox >=1.28 (optional) | Binary-to-hex conversion | Primary hex encoder; added to busybox in 2017 |
| `od` | busybox built-in | Binary-to-hex conversion (fallback) | Fallback when `xxd` is unavailable |
| `awk` | busybox built-in | Text processing (parse openssl output) | Extract hash values from command output |
| `tr` | busybox built-in | Character deletion/translation | Remove whitespace/newlines from hex output |
| `printf` | shell built-in | Hex-to-decimal conversion | Convert `0xNN` strings to decimal integers |
| `mktemp` | busybox built-in | Create temporary files | Temporary storage for ciphertext/decrypted data |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `xxd -p` for hex | `od -A n -t x1` for hex | `od` is universally available but output needs more cleanup (spaces, newlines) |
| `openssl dgst -mac HMAC` for HMAC | Skip HMAC verification | Older/minimal openssl may not support `-mac HMAC -macopt`; graceful degradation is the spec's recommendation |
| `printf '%d' "0x..."` for hex-to-dec | `$(( 16#... ))` bash arithmetic | `printf` is more portable across sh implementations; bash arithmetic is not POSIX |
| `sha256sum` for SHA-256 | `openssl dgst -sha256` | Both work; `sha256sum` is a busybox applet (no external dependency) |
## Architecture Patterns
### Recommended Script Structure
```
shell/
├── decode.sh # Main decoder script (single file, self-contained)
└── test_decoder.sh # Cross-validation test script (Rust pack -> Shell decode)
```
The decoder is a SINGLE self-contained script. No external libraries, no sourced files. This matches the project's deployment model: the script is copied alongside the archive to the target device.
### Pattern 1: Detect-and-Fallback for xxd/od
**What:** Auto-detect if `xxd` is available; if not, define wrapper functions using `od`.
**When to use:** At script startup, before any binary parsing.
**Example:**
```sh
# Source: FORMAT.md Section 13.1 + 13.2 (od fallback)
if command -v xxd >/dev/null 2>&1; then
read_hex() {
dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | xxd -p | tr -d '\n'
}
else
read_hex() {
dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \
| od -A n -v -t x1 | tr -d ' \n'   # -v: never collapse repeated lines into '*'
}
fi
```
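Because `openssl` is an external dependency, the same `command -v` probe can back a startup check that names every missing tool before any parsing begins. The following is a minimal sketch; `require_tools` is a hypothetical helper name, not something defined in FORMAT.md.

```sh
# Hypothetical helper: report all missing tools at once.
# Returns 0 if every named tool is on PATH, 1 otherwise.
# Note: uses the global variables "missing" and "tool".
require_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "ERROR: required tools not found:$missing" >&2
        return 1
    fi
    return 0
}

# Intended use at the top of decode.sh:
#   require_tools dd openssl gunzip sha256sum || exit 1
```

Collecting the full list before failing saves a round trip on a device where several tools are absent.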
### Pattern 2: Little-Endian Integer Parsing via Hex Byte Swap
**What:** Read N bytes as hex, swap byte order, convert to decimal.
**When to use:** For every u16/u32 field in the header and TOC.
**Example:**
```sh
# Source: FORMAT.md Section 13.1
read_le_u16() {
local hex=$(read_hex "$1" "$2" 2)
local b0=${hex:0:2} b1=${hex:2:2}
printf '%d' "0x${b1}${b0}"
}
read_le_u32() {
local hex=$(read_hex "$1" "$2" 4)
local b0=${hex:0:2} b1=${hex:2:2} b2=${hex:4:2} b3=${hex:6:2}
printf '%d' "0x${b3}${b2}${b1}${b0}"
}
```
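The reference functions rely on `local` and `${hex:0:2}`, which busybox `ash` accepts but POSIX does not define. If the script ever has to run under a stricter shell, one alternative is to let `od` print the bytes as decimal numbers and combine them arithmetically. This is a sketch under that assumption; `read_le` is a hypothetical name, and u32 values of 2^31 or more still need shell arithmetic wider than 32 bits.

```sh
# Hypothetical helper: print the little-endian unsigned integer stored in
# $3 bytes (1 to 4) at byte offset $2 of file $1.
# Note: uses the global variables "value", "mult" and "byte".
read_le() {
    # od -t u1 prints each byte as a decimal number; the unquoted command
    # substitution splits them into the positional parameters.
    set -- $(dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | od -A n -v -t u1)
    value=0
    mult=1
    for byte in "$@"; do
        value=$((value + byte * mult))   # least significant byte first
        mult=$((mult * 256))
    done
    printf '%d\n' "$value"
}

# Intended use:
#   name_length=$(read_le "$ARCHIVE" "$pos" 2)
#   original_size=$(read_le "$ARCHIVE" "$pos" 4)
```

This variant needs no hex string slicing at all, at the cost of one `od` call per field, the same as the `od` fallback already requires.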
### Pattern 3: Sequential TOC Parsing with Running Offset
**What:** Parse variable-length TOC entries using a running byte offset.
**When to use:** When reading the file table (TOC), each entry has a variable-length filename.
**Example:**
```sh
# Start at toc_offset
pos=$toc_offset
for i in $(seq 0 $((file_count - 1))); do
name_length=$(read_le_u16 "$ARCHIVE" "$pos")
pos=$((pos + 2))
# Extract filename (raw UTF-8 bytes via dd)
filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
pos=$((pos + name_length))
original_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
compressed_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
encrypted_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
data_offset=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
iv_hex=$(read_hex "$ARCHIVE" "$pos" 16); pos=$((pos + 16))
hmac_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))
sha256_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))
compression_flag=$(read_hex "$ARCHIVE" "$pos" 1); pos=$((pos + 1))
padding_after=$(read_le_u16 "$ARCHIVE" "$pos"); pos=$((pos + 2))
# Process this file entry...
done
```
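Summing the field widths read in the loop above gives the size of one TOC entry, which is useful as a sanity check on the running offset. The sketch below assumes the field list in Pattern 3 is complete; `TOC_FIXED_BYTES` and `toc_entry_size` are hypothetical names.

```sh
# Fixed bytes per TOC entry, taken from the reads in the loop:
#   name_length (2) + four u32 fields (16) + iv (16) + hmac (32)
#   + sha256 (32) + compression_flag (1) + padding_after (2) = 101
TOC_FIXED_BYTES=101

# Hypothetical helper: total size in bytes of one TOC entry whose
# filename is $1 bytes long.
toc_entry_size() {
    echo $(( TOC_FIXED_BYTES + $1 ))
}

# Intended use inside the loop, where entry_start is $pos saved at the
# top of the iteration:
#   [ $((pos - entry_start)) -eq "$(toc_entry_size "$name_length")" ] || exit 1
```

A filename length counts bytes, not characters, so a Cyrillic name occupies two bytes per letter in UTF-8.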
### Pattern 4: Pipe-Based Decryption (dd | openssl)
**What:** Extract ciphertext with `dd` and pipe directly to `openssl enc -d` for decryption.
**When to use:** For each file's data block decryption.
**Example:**
```sh
# Source: FORMAT.md Section 13.4
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" 2>/dev/null \
| openssl enc -d -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$iv_hex" \
> "$tmpfile"
```
### Pattern 5: Graceful HMAC Degradation
**What:** Detect if `openssl dgst -mac HMAC` is supported; skip HMAC if not.
**When to use:** Before the file extraction loop.
**Example:**
```sh
# Source: FORMAT.md Section 13.3
SKIP_HMAC=0
if ! printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 >/dev/null 2>&1; then
echo "WARNING: openssl HMAC not available, skipping integrity verification"
SKIP_HMAC=1
fi
```
### Anti-Patterns to Avoid
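- **Using bash-specific syntax:** The script must run in busybox `ash`/`sh`. No `[[ ]]`, no `$((16#FF))`, no arrays, no process substitution `<()`. Use `[ ]`, `printf '%d' "0x..."`, positional parameters or temp files. Note that `local` and `${var:offset:length}`, both used in the reference functions, are not POSIX either: busybox `ash` accepts them in its default configuration, but a strict POSIX shell such as `dash` rejects the substring form.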
- **Reading entire archive into memory:** Shell cannot handle binary data in variables. Always use `dd` to extract specific byte ranges to files or pipes.
- **Using `-e` flag with echo for binary:** Portability issues across shells. Use `printf` or `dd` instead.
- **Storing binary data in shell variables:** Shell variables cannot hold NUL bytes (`\0`); depending on the shell they are dropped or truncate the value. Only store hex strings in variables, never raw binary.
- **Hardcoding `/tmp`:** Use `mktemp` for temporary files. Clean up with a trap.
- **Using `xxd -e` (little-endian mode):** Not supported in busybox xxd. Manual byte swapping is required.
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| AES-256-CBC decryption | Custom decryption in shell | `openssl enc -d -aes-256-cbc` | Impossible to implement AES in pure shell; openssl handles PKCS7 removal automatically |
| Gzip decompression | Custom DEFLATE in shell | `gunzip -c` | Compression algorithms cannot be implemented in shell |
| SHA-256 hashing | Custom hash in shell | `sha256sum` (busybox) | Cryptographic hash requires proper implementation |
| HMAC-SHA-256 | Custom HMAC in shell | `openssl dgst -sha256 -mac HMAC` | HMAC construction is subtle; openssl handles it correctly |
| Hex-to-binary conversion | Manual byte construction | `xxd -r -p` or `printf` escapes (`\NNN` octal is POSIX; `\xNN` is not) | Direct hex-to-binary tools already exist |
**Key insight:** The shell decoder is fundamentally a "glue script" -- it orchestrates existing tools (`dd`, `openssl`, `gunzip`, `sha256sum`) to implement the decode pipeline. All cryptographic and compression operations are delegated to dedicated tools; the script only handles binary format parsing (offsets, lengths, byte swapping).
## Common Pitfalls
### Pitfall 1: openssl enc Output Contains Extra Bytes
**What goes wrong:** When using `openssl enc -d` with piped input from `dd`, the `openssl` command may behave differently depending on whether input comes from a file or a pipe. Some versions have issues with incomplete reads on pipes.
**Why it happens:** Pipe buffering and EOF handling can differ from file I/O.
**How to avoid:** Always extract ciphertext to a temp file first, then decrypt from the file. Or use `dd ... | openssl enc -d ...` and verify the output size matches `compressed_size` (or `original_size` if uncompressed).
**Warning signs:** Decrypted output size doesn't match expected `compressed_size`.
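The size check recommended above can be sketched with `wc -c`. The helper names `file_size` and `check_size` are hypothetical; the expected value follows the rule stated above (`compressed_size` for compressed entries, `original_size` otherwise).

```sh
# Hypothetical helper: byte size of file $1 as a bare integer.
file_size() {
    # Reading from stdin keeps the filename out of wc's output;
    # tr removes the padding some wc implementations add.
    wc -c < "$1" | tr -d ' \t\n'
}

# Hypothetical helper: returns 0 when file $1 is exactly $2 bytes long.
# Note: uses the global variable "actual".
check_size() {
    actual=$(file_size "$1")
    if [ "$actual" -ne "$2" ]; then
        echo "size mismatch: $1 is $actual bytes, expected $2" >&2
        return 1
    fi
    return 0
}

# Intended use after decryption:
#   if [ "$compression_flag" = "01" ]; then expected=$compressed_size
#   else expected=$original_size; fi
#   check_size "$tmpfile" "$expected" || echo "decrypt size check failed" >&2
```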
### Pitfall 2: Hex String Case Mismatch in HMAC Comparison
**What goes wrong:** `openssl dgst` outputs lowercase hex, but the stored HMAC extracted via `xxd -p` or `od` may produce different case.
**Why it happens:** Different tools use different case conventions for hex output.
**How to avoid:** Normalize both sides to lowercase before comparison: `echo "$hex" | tr 'A-F' 'a-f'`.
**Warning signs:** HMAC verification always fails despite correct data.
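A small comparison helper keeps the normalization in one place. This is a minimal sketch; `hex_lower` and `hex_equal` are hypothetical names.

```sh
# Hypothetical helper: print hex string $1 with A-F folded to a-f.
hex_lower() {
    printf '%s' "$1" | tr 'ABCDEF' 'abcdef'
}

# Hypothetical helper: returns 0 when two hex strings are equal
# ignoring letter case.
hex_equal() {
    [ "$(hex_lower "$1")" = "$(hex_lower "$2")" ]
}

# Intended use:
#   hex_equal "$computed_hmac" "$hmac_hex" || echo "HMAC mismatch" >&2
```

Using `printf '%s'` rather than `echo` avoids any shell-specific handling of backslashes or leading dashes in the argument.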
### Pitfall 3: Empty File (0 bytes) Causes gunzip Error
**What goes wrong:** A 0-byte original file stored with `compression_flag=1` does not decrypt to zero bytes; it decrypts to a tiny but valid gzip stream. Code that assumes an empty payload, or that treats very small compressed sizes as invalid, mishandles this case.
**Why it happens:** Gzip of empty input produces a minimal gzip stream (~20 bytes). PKCS7 pads that up to the next 16-byte boundary, so `encrypted_size = 32`. The decrypted output is a valid gzip stream that decompresses to 0 bytes.
**How to avoid:** Check `original_size == 0` before decompression; if zero, just create an empty file. Alternatively, let `gunzip` handle it (it should produce empty output from a valid empty gzip stream).
**Warning signs:** Script crashes or produces garbage for 0-byte files.
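The first option above (short-circuit on `original_size == 0`) can be sketched as a single post-decryption step. `restore_payload` is a hypothetical helper name, and the `01` flag value follows the hex form used elsewhere in this document.

```sh
# Hypothetical helper: turn the decrypted payload $1 into the final file $2.
#   $3 = original_size (decimal)
#   $4 = compression_flag as two hex digits ("01" = gzip)
restore_payload() {
    if [ "$3" -eq 0 ]; then
        : > "$2"                 # empty original: create empty file, skip gunzip
    elif [ "$4" = "01" ]; then
        gunzip -c "$1" > "$2"
    else
        cp "$1" "$2"
    fi
}

# Intended use:
#   restore_payload "$TMPDIR/dec.bin" "$TMPDIR/out.bin" \
#       "$original_size" "$compression_flag"
```

The SHA-256 check still runs on the resulting empty file, so skipping `gunzip` does not skip verification.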
### Pitfall 4: Cyrillic Filename Extraction Corrupts Characters
**What goes wrong:** The `dd`-extracted filename bytes are valid UTF-8, but the shell variable may be corrupted if the locale is not set or if intermediate processing strips high bytes.
**Why it happens:** Some busybox builds strip non-ASCII bytes, or the `LANG`/`LC_ALL` environment may not support UTF-8.
**How to avoid:** Set `export LC_ALL=C` (or `C.UTF-8` if available) at the top of the script. Use `dd` to extract raw bytes directly. Do NOT process the filename through `tr`, `sed`, or `awk` before writing. Verify that `printf '%s'` preserves the bytes.
**Warning signs:** Extracted files have garbled names (mojibake) or are named with `?` characters.
### Pitfall 5: Shell Arithmetic Overflow on Large Files
**What goes wrong:** Shell arithmetic uses platform-native integers. On 32-bit shells, values above 2^31 - 1 (about 2 GB) overflow.
**Why it happens:** The archive format uses u32 for sizes and offsets (max 4 GB), but shell arithmetic may be limited to signed 32-bit.
**How to avoid:** All fields are u32 LE (max ~4 GB), so only values between 2 GB and 4 GB are at risk. Busybox on ARM typically uses 32-bit arithmetic. For v1 this is LOW risk as long as archives stay under 2 GB. If needed, use `awk` for arithmetic on large numbers.
**Warning signs:** Negative offsets or sizes in the script output for files larger than 2 GB.
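Whether the running shell is affected can be probed at startup. The sketch below multiplies two values whose product is 2^32, which a shell limited to 32-bit integers cannot represent; `arith_is_wide` is a hypothetical helper name.

```sh
# Hypothetical helper: returns 0 if shell arithmetic can represent 2^32.
# 65536 * 65536 = 4294967296; a 32-bit shell yields a different value.
arith_is_wide() {
    [ "$((65536 * 65536))" = "4294967296" ]
}

if ! arith_is_wide; then
    echo "WARNING: 32-bit shell arithmetic; sizes or offsets of 2 GB or more may be misparsed" >&2
fi
```

The check costs nothing at runtime and turns a silent negative offset into an explicit warning.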
### Pitfall 6: dd stderr Noise Pollutes Output
**What goes wrong:** `dd` writes transfer statistics to stderr (e.g., "32 bytes transferred"). If stderr is not suppressed, it may confuse piped commands or pollute user output.
**Why it happens:** `dd` always writes stats to stderr unless suppressed.
**How to avoid:** Always use `2>/dev/null` with `dd` commands. This is already shown in FORMAT.md Section 13 reference functions.
**Warning signs:** Unexpected text mixed into hex output or filenames.
### Pitfall 7: openssl 3.x Changes HMAC Syntax
**What goes wrong:** The `-mac HMAC -macopt hexkey:KEY` syntax works in OpenSSL 1.x and 3.x but is soft-deprecated in 3.x. The new `openssl mac` subcommand is preferred but has different syntax.
**Why it happens:** OpenSSL 3.x migrated to provider-based architecture; legacy options still work but may be removed.
**How to avoid:** Implement graceful degradation (already specified in FORMAT.md Section 13.3). Test with `printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00` at script startup. If it fails, set `SKIP_HMAC=1`.
**Warning signs:** HMAC check produces error messages instead of hash values.
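FORMAT.md prescribes skipping HMAC when the legacy syntax fails. As an alternative that goes beyond the spec, the script could try the `openssl mac` subcommand before giving up. The following is a sketch under that assumption; `hmac_sha256_hex` is a hypothetical helper name, and `openssl mac` is only present in OpenSSL 3.0 and later.

```sh
# Hypothetical helper: HMAC-SHA-256 of stdin, printed as lowercase hex.
#   $1 = key as a hex string
# Tries the legacy dgst syntax first, then the OpenSSL 3.x mac subcommand.
# Note: uses the global variables "data" and "mac".
hmac_sha256_hex() {
    data=$(mktemp) || return 1
    cat > "$data"                       # buffer stdin so it can be read twice
    mac=$(openssl dgst -sha256 -mac HMAC -macopt "hexkey:$1" -hex "$data" \
            2>/dev/null | awk '{print $NF}')
    if [ -z "$mac" ]; then
        mac=$(openssl mac -digest SHA256 -macopt "hexkey:$1" \
                -in "$data" HMAC 2>/dev/null)
    fi
    rm -f "$data"
    [ -n "$mac" ] || return 1
    # openssl mac prints uppercase hex; normalize for comparison
    printf '%s\n' "$mac" | tr 'ABCDEF' 'abcdef'
}

# Intended use:
#   computed=$( { dd ...IV...; cat "$TMPDIR/ct.bin"; } | hmac_sha256_hex "$KEY_HEX" ) \
#       || SKIP_HMAC=1
```

If both invocations fail the function returns non-zero, so the caller can still fall back to `SKIP_HMAC=1` as the spec describes.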
## Code Examples
### Complete Decode Pipeline for One File
```sh
# Verified pattern from FORMAT.md Section 13.4 + project decisions
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
# Step 1: Extract ciphertext to temp file
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" \
of="$TMPDIR/ct.bin" 2>/dev/null
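# Step 2: Verify HMAC (if available)
# iv_toc_offset: archive offset of this entry's 16-byte IV, i.e. the value of
# $pos just before iv_hex is read in the TOC loop (not set by that loop as shown)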
if [ "$SKIP_HMAC" = "0" ]; then
computed_hmac=$(
{
dd if="$ARCHIVE" bs=1 skip="$iv_toc_offset" count=16 2>/dev/null # IV from TOC
cat "$TMPDIR/ct.bin" # ciphertext
} | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${KEY_HEX}" -hex 2>/dev/null \
| awk '{print $NF}'
)
if [ "$computed_hmac" != "$hmac_hex" ]; then
echo "HMAC failed for $filename, skipping" >&2
continue
fi
fi
# Step 3: Decrypt (openssl auto-removes PKCS7 padding)
openssl enc -d -aes-256-cbc -nosalt \
-K "$KEY_HEX" -iv "$iv_hex" \
-in "$TMPDIR/ct.bin" -out "$TMPDIR/dec.bin"
# Step 4: Decompress if needed
if [ "$compression_flag" = "01" ]; then
gunzip -c "$TMPDIR/dec.bin" > "$TMPDIR/out.bin"
else
mv "$TMPDIR/dec.bin" "$TMPDIR/out.bin"
fi
# Step 5: Verify SHA-256
actual_sha=$(sha256sum "$TMPDIR/out.bin" | awk '{print $1}')
if [ "$actual_sha" != "$sha256_hex" ]; then
echo "WARNING: SHA-256 mismatch for $filename" >&2
fi
# Step 6: Write output
mv "$TMPDIR/out.bin" "$OUTPUT_DIR/$filename"
```
### Key Hex Constant (from src/key.rs)
```sh
# Hardcoded 32-byte AES-256 key as hex string (matching src/key.rs)
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"
```
Derivation from `src/key.rs`:
```
0x7A 0x35 0xC1 0xD9 0x4F 0xE8 0x2B 0x6A
0x91 0x0D 0xF3 0x58 0xBC 0x74 0xA6 0x1E
0x42 0x8F 0xD0 0x63 0xE5 0x17 0x9B 0x2C
0xFA 0x84 0x06 0xCD 0x3E 0x79 0xB5 0x50
```
### HMAC Verification with IV from Archive (Not from TOC-parsed Variable)
A subtle point: for HMAC verification, the IV bytes must come from the archive file (not from a hex variable). The HMAC is computed over raw `IV || ciphertext` bytes, not hex strings. Extracting the IV bytes with `dd` and concatenating them with the ciphertext inside a command group `{ dd ...; dd ...; }` is correct (as shown in FORMAT.md Section 13.3).
However, for the HMAC *comparison*, we compare hex strings (both from `openssl dgst` output and from the TOC hex extraction). Both must be lowercase.
### UTF-8 Filename Extraction
```sh
# dd extracts raw bytes; if they are valid UTF-8, the shell preserves them
filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
# $filename now contains UTF-8 string, including Cyrillic characters
# Works because: (1) dd copies raw bytes, (2) $() captures them, (3) no null bytes in UTF-8 filenames
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `openssl dgst -mac HMAC -macopt` | `openssl mac -digest SHA256 -macopt hexkey:... HMAC` | OpenSSL 3.0 (2021) | Old syntax still works in 3.x but soft-deprecated |
| `xxd` not in busybox | `xxd` applet in busybox | BusyBox 1.28 (2017) | Available on newer builds, but `od` fallback still needed for older systems |
| `openssl enc` with `-md md5` default | `-md sha256` default | OpenSSL 1.1.0 (2016) | No impact for raw key mode (`-K`/`-iv`); `-md` only affects password-derived keys |
**Deprecated/outdated:**
- `openssl dgst -hmac "key"` (string key): Still works but `-macopt hexkey:` is required for binary keys. The hex key mode is NOT deprecated.
- busybox builds without xxd: Still common on very old/minimal systems, hence `od` fallback is essential.
## Open Questions
1. **Does the target busybox have `openssl`?**
- What we know: openssl is NOT a busybox applet. It must be a separate binary on the target system. The project mentions "busybox-compatible" in requirements.
- What's unclear: Whether the specific target device (Android-based car head unit) has the `openssl` CLI installed.
- Recommendation: The script MUST fail with a clear error message if `openssl` is not found. Document this as a prerequisite. The script already checks for tool availability at startup.
2. **Does the target `openssl` support `-mac HMAC -macopt hexkey:`?**
- What we know: Standard OpenSSL 1.1.1+ and 3.x support this syntax. Busybox does not include openssl. Minimal/embedded openssl builds may lack HMAC support.
- What's unclear: Exact openssl version on target.
- Recommendation: Implement graceful degradation per FORMAT.md Section 13.3. Skip HMAC if unsupported, print warning.
3. **Performance on large files with `dd bs=1`?**
- What we know: `dd bs=1` reads one byte at a time. For extracting large data blocks (megabytes), this is very slow.
- What's unclear: Whether the shell decoder needs to handle large files efficiently.
- Recommendation: Keep `bs=1` for header and TOC fields, which are only a few bytes each. For data block extraction (Step 1 of decode), note that `dd bs=1 skip=OFFSET count=SIZE` does no buffering: it still issues one read and one write per byte. Where `tail` and `head` support `-c`, `tail -c +$((OFFSET + 1)) FILE | head -c SIZE` extracts the same range with large reads. Alternatively, use `dd bs=4096` with calculated skip/count, but the added complexity may not be worth it for a fallback decoder.
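One way to avoid the per-byte cost of `dd bs=1` on data blocks is a `tail`/`head` pipeline. This is a sketch assuming both tools accept `-c` (busybox and GNU coreutils do; `head -c` is not required by POSIX); `extract_range` is a hypothetical helper name.

```sh
# Hypothetical helper: write $3 bytes starting at byte offset $2
# (0-based) of file $1 to stdout.
extract_range() {
    # tail -c +N counts from 1, so offset 0 corresponds to +1.
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

# Intended use in place of the dd call in Step 1:
#   extract_range "$ARCHIVE" "$data_offset" "$encrypted_size" > "$TMPDIR/ct.bin"
```

Header and TOC reads can stay on `dd bs=1`, since they move only a few bytes per call.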
## Sources
### Primary (HIGH confidence)
- FORMAT.md Section 13 (Shell Decoder Reference) - Complete reference functions for all operations
- FORMAT.md Section 10 (Decode Order of Operations) - Mandatory decode pipeline
- FORMAT.md Section 4-5 (Header/TOC structure) - Binary layout
- `src/key.rs` - Actual hardcoded key bytes
- `src/format.rs` - Rust implementation of header/TOC parsing (reference)
- `src/crypto.rs` - Rust crypto implementation (HMAC scope, encrypt/decrypt)
- `kotlin/ArchiveDecoder.kt` - Working decoder implementation (behavioral reference)
- `kotlin/test_decoder.sh` - Cross-validation test pattern (structural reference)
### Secondary (MEDIUM confidence)
- [OpenSSL enc documentation (3.3)](https://docs.openssl.org/3.3/man1/openssl-enc/) - `-K`, `-iv`, `-nosalt`, PKCS7 auto-removal
- [OpenSSL dgst documentation (3.3)](https://docs.openssl.org/3.3/man1/openssl-dgst/) - `-mac HMAC -macopt hexkey:` syntax
- [BusyBox xxd commit (2017)](https://lists.busybox.net/pipermail/busybox-cvs/2017-January/036600.html) - xxd applet added to busybox
- [BusyBox applet list](https://www.busybox.net/downloads/BusyBox.html) - dd, gunzip, sha256sum, od are native applets; openssl, xxd are not
- [BusyBox xxd options](https://www.boxmatrix.info/wiki/Property:xxd_(bbcmd)) - Supported flags: -p, -r, -l, -s, -g, -c, -u
- [docker-library/busybox#13](https://github.com/docker-library/busybox/issues/13) - UTF-8 support limitations in busybox
### Tertiary (LOW confidence)
- Various shell scripting guides on binary file handling - General patterns, not project-specific
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - All tools are well-documented CLI utilities with decades of stability. FORMAT.md Section 13 provides verified reference code.
- Architecture: HIGH - Single-file script pattern is proven by the existing `kotlin/test_decoder.sh`. Binary parsing pattern with dd+xxd/od is well-established.
- Pitfalls: HIGH - Identified from real tool behavior (openssl pipe handling, hex case, PKCS7, busybox limitations). FORMAT.md already anticipates several pitfalls (graceful HMAC degradation, od fallback).
**Research date:** 2026-02-25
**Valid until:** 2026-03-25 (stable domain -- shell tools and openssl CLI interface change very slowly)