Phase 5: Shell Decoder - Research

Researched: 2026-02-25
Domain: POSIX/busybox shell scripting, binary format parsing, AES-256-CBC decryption via CLI
Confidence: HIGH

Summary

The shell decoder is a busybox-compatible shell script that extracts files from archives created by the Rust archiver. The script must parse the binary format (header + TOC) using dd and hex conversion tools, decrypt each file with openssl enc -aes-256-cbc, optionally decompress with gunzip, and verify integrity with sha256sum. The format spec (FORMAT.md Section 13) already provides reference functions for most operations.

The main technical challenges are: (1) openssl is NOT a busybox applet -- it requires a separate openssl binary on the target system; (2) xxd was added to busybox in v1.28 (2017) but older versions lack it -- od must serve as fallback; (3) extracting UTF-8 filenames (Cyrillic) from binary data requires careful byte-range extraction with dd piped to raw output; (4) little-endian integer parsing in shell requires byte-swapping via hex string manipulation.

Primary recommendation: Build a single self-contained decode.sh script using the reference functions from FORMAT.md Section 13 as a foundation, with od-based fallbacks for xxd, graceful degradation for HMAC verification if openssl dgst lacks -mac support, and a cross-validation test script modeled after kotlin/test_decoder.sh.

<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| SHL-01 | Shell script dearchivation via busybox (dd, xxd, openssl, gunzip) | FORMAT.md Section 13 provides complete reference functions. dd and gunzip are native busybox applets. xxd available in busybox >=1.28 with od fallback. openssl requires external binary. |
| SHL-02 | openssl enc -aes-256-cbc with -K/-iv/-nosalt for raw key mode | OpenSSL enc supports -K (hex key), -iv (hex IV), -nosalt and auto-removes PKCS7 padding on decryption. Standard across OpenSSL 1.x and 3.x. |
| SHL-03 | Support files with non-ASCII names (Cyrillic) | dd extracts raw bytes which are valid UTF-8. The filename bytes can be written directly to a variable using command substitution. Shell/filesystem handles UTF-8 natively if LANG is set properly. |
</phase_requirements>

Standard Stack

Core

| Tool | Version | Purpose | Why Standard |
|------|---------|---------|--------------|
| dd | busybox built-in | Extract byte ranges from archive | Native applet, universal on all busybox builds |
| openssl | 1.1.1+ or 3.x (external) | AES-256-CBC decryption, HMAC-SHA-256 | Only CLI tool supporting raw-key AES-CBC decryption |
| gunzip | busybox built-in | Gzip decompression | Native applet, handles standard gzip streams |
| sha256sum | busybox built-in | SHA-256 integrity verification | Native applet, produces standard hash output |
| sh | busybox ash/sh | Script interpreter | POSIX-compatible shell, always available |

Supporting

| Tool | Version | Purpose | When to Use |
|------|---------|---------|-------------|
| xxd | busybox >=1.28 (optional) | Binary-to-hex conversion | Primary hex encoder; added to busybox in 2017 |
| od | busybox built-in | Binary-to-hex conversion (fallback) | Fallback when xxd is unavailable |
| awk | busybox built-in | Text processing (parse openssl output) | Extract hash values from command output |
| tr | busybox built-in | Character deletion/translation | Remove whitespace/newlines from hex output |
| printf | shell built-in | Hex-to-decimal conversion | Convert 0xNN strings to decimal integers |
| mktemp | busybox built-in | Create temporary files | Temporary storage for ciphertext/decrypted data |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| xxd -p for hex | od -A n -t x1 for hex | od is universally available but output needs more cleanup (spaces, newlines) |
| openssl dgst -mac HMAC for HMAC | Skip HMAC verification | Older/minimal openssl may not support -mac HMAC -macopt; graceful degradation is the spec's recommendation |
| printf '%d' "0x..." for hex-to-dec | $(( 16#... )) bash arithmetic | printf is more portable across sh implementations; bash arithmetic is not POSIX |
| sha256sum for SHA-256 | openssl dgst -sha256 | Both work; sha256sum is a busybox applet (no external dependency) |

Architecture Patterns

shell/
├── decode.sh           # Main decoder script (single file, self-contained)
└── test_decoder.sh     # Cross-validation test script (Rust pack -> Shell decode)

The decoder is a SINGLE self-contained script. No external libraries, no sourced files. This matches the project's deployment model: the script is copied alongside the archive to the target device.

Pattern 1: Detect-and-Fallback for xxd/od

What: Auto-detect if xxd is available; if not, define wrapper functions using od.
When to use: At script startup, before any binary parsing.
Example:

# Source: FORMAT.md Section 13.1 + 13.2 (od fallback)
if command -v xxd >/dev/null 2>&1; then
  read_hex() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | xxd -p | tr -d '\n'
  }
else
  read_hex() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \
      | od -A n -t x1 | tr -d ' \n'
  }
fi

Pattern 2: Little-Endian Integer Parsing via Hex Byte Swap

What: Read N bytes as hex, swap byte order, convert to decimal.
When to use: For every u16/u32 field in the header and TOC.
Example:

# Source: FORMAT.md Section 13.1
read_le_u16() {
  local hex=$(read_hex "$1" "$2" 2)
  local b0=${hex:0:2} b1=${hex:2:2}
  printf '%d' "0x${b1}${b0}"
}

read_le_u32() {
  local hex=$(read_hex "$1" "$2" 4)
  local b0=${hex:0:2} b1=${hex:2:2} b2=${hex:4:2} b3=${hex:6:2}
  printf '%d' "0x${b3}${b2}${b1}${b0}"
}
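As a worked check of the byte swap, the sketch below writes the two bytes 0x34 0x12 to a scratch file and parses them as a little-endian u16. It is a variant of the reference functions rather than a copy of FORMAT.md: the names hex_at and le_u16_posix are hypothetical, and cut -c stands in for the ${hex:0:2} substring expansion so that the same code also runs under shells that lack that extension.

```shell
# Hypothetical helper: hex dump of COUNT bytes at OFFSET, using od only.
hex_at() {
  dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | od -A n -t x1 | tr -d ' \n'
}

# Little-endian u16 -> decimal, without ${var:offset:length}.
le_u16_posix() {
  hex=$(hex_at "$1" "$2" 2)
  b0=$(printf '%s' "$hex" | cut -c1-2)   # low byte (stored first)
  b1=$(printf '%s' "$hex" | cut -c3-4)   # high byte (stored second)
  printf '%d' "0x${b1}${b0}"
}

sample=$(mktemp)
printf '\064\022' > "$sample"            # octal escapes for bytes 0x34 0x12
value=$(le_u16_posix "$sample" 0)
echo "$value"                            # 0x1234 = 4660
rm -f "$sample"
```

The cut-based variant spawns extra processes per field, so it trades speed for portability; on a target where busybox ash accepts substring expansion, the reference functions remain the simpler choice.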

Pattern 3: Sequential TOC Parsing with Running Offset

What: Parse variable-length TOC entries using a running byte offset.
When to use: When reading the file table (TOC), where each entry has a variable-length filename.
Example:

# Start at toc_offset
pos=$toc_offset

for i in $(seq 0 $((file_count - 1))); do
  name_length=$(read_le_u16 "$ARCHIVE" "$pos")
  pos=$((pos + 2))

  # Extract filename (raw UTF-8 bytes via dd)
  filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
  pos=$((pos + name_length))

  original_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  compressed_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  encrypted_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  data_offset=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))

  iv_toc_offset=$pos   # archive offset of the raw IV bytes, needed later as HMAC input
  iv_hex=$(read_hex "$ARCHIVE" "$pos" 16); pos=$((pos + 16))
  hmac_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))
  sha256_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))

  compression_flag=$(read_hex "$ARCHIVE" "$pos" 1); pos=$((pos + 1))
  padding_after=$(read_le_u16 "$ARCHIVE" "$pos"); pos=$((pos + 2))

  # Process this file entry...
done
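The running offset can be sanity-checked against the entry size implied by the field widths read in the loop above. The sketch below only adds up those widths; the function name toc_entry_size and the sample numbers are hypothetical and not part of FORMAT.md.

```shell
# Size in bytes of one TOC entry for a given filename length,
# derived from the field widths read in the parsing loop.
toc_entry_size() {
  # 2 (name_length) + name + 16 (four u32 fields) + 16 (IV)
  # + 32 (HMAC) + 32 (SHA-256) + 1 (compression flag) + 2 (padding_after)
  echo $((2 + $1 + 16 + 16 + 32 + 32 + 1 + 2))
}

entry_start=1000      # hypothetical offset of an entry
name_length=9         # hypothetical filename length in bytes
expected_end=$((entry_start + $(toc_entry_size "$name_length")))
echo "$expected_end"  # 1000 + 110 = 1110
```

Comparing pos against such an expected value after each entry would catch a skipped or double-counted field early, before it corrupts every subsequent entry.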

Pattern 4: Pipe-Based Decryption (dd | openssl)

What: Extract ciphertext with dd and pipe directly to openssl enc -d for decryption.
When to use: For each file's data block decryption.
Example:

# Source: FORMAT.md Section 13.4
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" 2>/dev/null \
  | openssl enc -d -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$iv_hex" \
  > "$tmpfile"

Pattern 5: Graceful HMAC Degradation

What: Detect if openssl dgst -mac HMAC is supported; skip HMAC if not.
When to use: Before the file extraction loop.
Example:

# Source: FORMAT.md Section 13.3
SKIP_HMAC=0
if ! printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 >/dev/null 2>&1; then
  echo "WARNING: openssl HMAC not available, skipping integrity verification" >&2
  SKIP_HMAC=1
fi

Anti-Patterns to Avoid

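  • Using bash-specific syntax: The script must run in busybox ash/sh. No [[ ]], no $((16#FF)), no arrays, no process substitution <(). Use [ ], printf '%d' "0x...", positional parameters or temp files. Note that the local keyword and the ${var:offset:length} substring expansion used by the reference functions are not POSIX either; busybox ash accepts them (the substring form depends on its bash-compatibility option, assumed here to be enabled on the target), but a strictly POSIX shell such as dash rejects the substring form.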
  • Reading entire archive into memory: Shell cannot handle binary data in variables. Always use dd to extract specific byte ranges to files or pipes.
  • Using echo -e (or echo -n) to emit binary or unterminated data: Both flags behave differently across shells. Use printf or dd instead.
  • Storing binary data in shell variables: NUL bytes (\0) cannot be stored in shell variables; depending on the shell they are dropped or truncate the value. Only store hex strings in variables, never raw binary.
  • Hardcoding /tmp: Use mktemp for temporary files. Clean up with a trap.
  • Using xxd -e (little-endian mode): Not supported in busybox xxd. Manual byte swapping is required.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| AES-256-CBC decryption | Custom decryption in shell | openssl enc -d -aes-256-cbc | Impossible to implement AES in pure shell; openssl handles PKCS7 removal automatically |
| Gzip decompression | Custom DEFLATE in shell | gunzip -c | Compression algorithms cannot be implemented in shell |
| SHA-256 hashing | Custom hash in shell | sha256sum (busybox) | Cryptographic hash requires proper implementation |
| HMAC-SHA-256 | Custom HMAC in shell | openssl dgst -sha256 -mac HMAC | HMAC construction is subtle; openssl handles it correctly |
| Hex-to-binary conversion | Manual byte construction | xxd -r -p or printf '\xNN' | Direct hex-to-binary tools already exist |

Key insight: The shell decoder is fundamentally a "glue script" -- it orchestrates existing tools (dd, openssl, gunzip, sha256sum) to implement the decode pipeline. All cryptographic and compression operations are delegated to dedicated tools; the script only handles binary format parsing (offsets, lengths, byte swapping).
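Because the script delegates everything to external tools, a startup check for those tools is a natural first step. The following is a minimal sketch, assuming the tool list from the Standard Stack section; the function names missing_tools and check_dependencies are hypothetical.

```shell
# Hypothetical helper: print the names of the given tools that are not on PATH.
missing_tools() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  printf '%s' "${missing# }"   # strip the leading separator space
}

# Report all missing required tools at once and return non-zero if any.
check_dependencies() {
  missing=$(missing_tools dd od tr awk gunzip sha256sum openssl)
  if [ -n "$missing" ]; then
    echo "ERROR: required tools not found: $missing" >&2
    return 1
  fi
}
```

A decoder would call check_dependencies || exit 1 before touching the archive. xxd is deliberately left out of the required list, since od covers its absence.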

Common Pitfalls

Pitfall 1: openssl enc Output Contains Extra Bytes

What goes wrong: When openssl enc -d reads piped input from dd, it may behave differently than when reading from a file. Some versions have issues with incomplete reads on pipes.
Why it happens: Pipe buffering and EOF handling can differ from file I/O.
How to avoid: Prefer extracting the ciphertext to a temp file first and decrypting from that file. If the dd ... | openssl enc -d ... pipe is used instead, verify that the output size matches compressed_size (or original_size if uncompressed).
Warning signs: Decrypted output size doesn't match the expected compressed_size.

Pitfall 2: Hex String Case Mismatch in HMAC Comparison

What goes wrong: openssl dgst outputs lowercase hex, but the stored HMAC extracted via xxd -p or od may use a different case.
Why it happens: Different tools use different case conventions for hex output.
How to avoid: Normalize both sides to lowercase before comparison: printf '%s' "$hex" | tr 'A-F' 'a-f'.
Warning signs: HMAC verification always fails despite correct data.
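A case-insensitive comparison can be wrapped in a small helper. This is a sketch only; hex_lower is a hypothetical name and the two sample strings are made up.

```shell
# Hypothetical helper: lowercase a hex string so case never affects comparison.
hex_lower() {
  printf '%s' "$1" | tr 'A-F' 'a-f'
}

stored="AB12CD"      # e.g. a value read from the TOC
computed="ab12cd"    # e.g. a value parsed from openssl dgst output
if [ "$(hex_lower "$stored")" = "$(hex_lower "$computed")" ]; then
  result="match"
else
  result="mismatch"
fi
echo "$result"
```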

Pitfall 3: Empty File (0 bytes) Causes gunzip Error

What goes wrong: A 0-byte original file stored with compression_flag=1 still decrypts to a valid (but tiny) gzip stream, and such a minimal stream is an edge case that is easy to mishandle.
Why it happens: Gzip of empty input produces a minimal gzip stream (~20 bytes). PKCS7 pads this to the next 16-byte boundary, so encrypted_size = 32. The decrypted output is a valid gzip stream that decompresses to 0 bytes.
How to avoid: Check original_size == 0 before decompression; if zero, just create an empty file. Alternatively, let gunzip handle it (it should produce empty output from a valid empty gzip stream).
Warning signs: Script crashes or produces garbage for 0-byte files.
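The 32-byte figure follows from PKCS7 arithmetic: the padded length is the next multiple of 16 strictly greater than the input length. A small sketch of that calculation, assuming the ~20-byte gzip stream mentioned above (pkcs7_padded_size is a hypothetical name):

```shell
# Ciphertext size for AES-CBC with PKCS7 padding: rounds up to the next
# multiple of 16 and adds a full block when the input is already aligned.
pkcs7_padded_size() {
  echo $(( ($1 / 16 + 1) * 16 ))
}

pkcs7_padded_size 20   # 20-byte gzip stream of an empty file -> 32
pkcs7_padded_size 16   # already block-aligned input -> 32
pkcs7_padded_size 0    # empty plaintext -> 16
```

The same helper could be used to cross-check encrypted_size against compressed_size for every TOC entry.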

Pitfall 4: Cyrillic Filename Extraction Corrupts Characters

What goes wrong: The dd-extracted filename bytes are valid UTF-8, but the shell variable may be corrupted if the locale is not set or if intermediate processing strips high bytes.
Why it happens: Some busybox builds strip non-ASCII bytes, or the LANG/LC_ALL environment may not support UTF-8.
How to avoid: Set export LC_ALL=C (or C.UTF-8 if available) at the top of the script. Use dd to extract raw bytes directly. Do NOT process the filename through tr, sed, or awk before writing. Verify that printf '%s' preserves the bytes.
Warning signs: Extracted files have garbled names (mojibake) or are named with ? characters.

Pitfall 5: Shell Arithmetic Overflow on Large Files

What goes wrong: Shell arithmetic uses platform-native integers. On shells limited to signed 32-bit arithmetic, values of 2^31 (2 GiB) and above overflow.
Why it happens: The archive format uses u32 LE for sizes and offsets (max ~4 GB), which exceeds the signed 32-bit range.
How to avoid: Busybox on ARM may use 32-bit arithmetic, depending on how it was built. For v1, file sizes under 4 GB are expected and only the 2-4 GB range is affected, so this is LOW risk. If needed, use awk for arithmetic on large numbers.
Warning signs: Negative offsets or sizes in the script output for files larger than 2 GB.
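Whether the running shell is affected can be probed at startup. The sketch below assumes that a 32-bit shell wraps to a negative value on overflow, which is typical but not guaranteed; the variable name ARITH_64 is hypothetical.

```shell
# Sketch: detect whether shell arithmetic can represent values >= 2^31.
# On a shell with signed 32-bit arithmetic the sum typically wraps negative.
if [ "$((2147483647 + 1))" -gt 0 ]; then
  ARITH_64=1
else
  ARITH_64=0
  echo "WARNING: 32-bit shell arithmetic; sizes or offsets >= 2 GiB may be misparsed" >&2
fi
```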

Pitfall 6: dd stderr Noise Pollutes Output

What goes wrong: dd writes transfer statistics to stderr (e.g., "32 bytes transferred"). If stderr is not suppressed, the statistics clutter user output and end up in captured data wherever stderr is merged into stdout (2>&1).
Why it happens: dd always writes stats to stderr unless suppressed.
How to avoid: Always use 2>/dev/null with dd commands. This is already shown in FORMAT.md Section 13 reference functions.
Warning signs: Unexpected text mixed into hex output or filenames.

Pitfall 7: openssl 3.x Changes HMAC Syntax

What goes wrong: The -mac HMAC -macopt hexkey:KEY syntax works in OpenSSL 1.x and 3.x but is soft-deprecated in 3.x. The new openssl mac subcommand is preferred but has different syntax.
Why it happens: OpenSSL 3.x migrated to a provider-based architecture; legacy options still work but may be removed.
How to avoid: Implement graceful degradation (already specified in FORMAT.md Section 13.3). Test with printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 at script startup. If it fails, set SKIP_HMAC=1.
Warning signs: HMAC check produces error messages instead of hash values.

Code Examples

Complete Decode Pipeline for One File

# Verified pattern from FORMAT.md Section 13.4 + project decisions

# Setup (once, before the TOC loop)
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"
# Named WORK_DIR rather than TMPDIR so the environment variable that mktemp
# and other tools consult is not overwritten.
WORK_DIR=$(mktemp -d)
trap 'rm -rf "$WORK_DIR"' EXIT

# Steps 1-6 run inside the per-file TOC loop (hence the use of `continue`).

# Step 1: Extract ciphertext to temp file
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" \
  of="$WORK_DIR/ct.bin" 2>/dev/null

# Step 2: Verify HMAC (if available)
# iv_toc_offset is the archive offset of this entry's IV, recorded during TOC parsing.
if [ "$SKIP_HMAC" = "0" ]; then
  computed_hmac=$(
    {
      dd if="$ARCHIVE" bs=1 skip="$iv_toc_offset" count=16 2>/dev/null  # raw IV from TOC
      cat "$WORK_DIR/ct.bin"                                              # ciphertext
    } | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${KEY_HEX}" -hex 2>/dev/null \
      | awk '{print $NF}'
  )
  if [ "$computed_hmac" != "$hmac_hex" ]; then
    echo "HMAC failed for $filename, skipping" >&2
    continue
  fi
fi

# Step 3: Decrypt (openssl auto-removes PKCS7 padding)
openssl enc -d -aes-256-cbc -nosalt \
  -K "$KEY_HEX" -iv "$iv_hex" \
  -in "$WORK_DIR/ct.bin" -out "$WORK_DIR/dec.bin" \
  || { echo "Decryption failed for $filename, skipping" >&2; continue; }

# Step 4: Decompress if needed
if [ "$compression_flag" = "01" ]; then
  gunzip -c "$WORK_DIR/dec.bin" > "$WORK_DIR/out.bin"
else
  mv "$WORK_DIR/dec.bin" "$WORK_DIR/out.bin"
fi

# Step 5: Verify SHA-256
actual_sha=$(sha256sum "$WORK_DIR/out.bin" | awk '{print $1}')
if [ "$actual_sha" != "$sha256_hex" ]; then
  echo "WARNING: SHA-256 mismatch for $filename" >&2
fi

# Step 6: Write output
mv "$WORK_DIR/out.bin" "$OUTPUT_DIR/$filename"

Key Hex Constant (from src/key.rs)

# Hardcoded 32-byte AES-256 key as hex string (matching src/key.rs)
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"

Derivation from src/key.rs:

0x7A 0x35 0xC1 0xD9 0x4F 0xE8 0x2B 0x6A
0x91 0x0D 0xF3 0x58 0xBC 0x74 0xA6 0x1E
0x42 0x8F 0xD0 0x63 0xE5 0x17 0x9B 0x2C
0xFA 0x84 0x06 0xCD 0x3E 0x79 0xB5 0x50

HMAC Verification with IV from Archive (Not from TOC-parsed Variable)

A subtle point: for HMAC verification, the IV bytes must come from the archive file (not from a hex variable). The HMAC is computed over the raw IV || ciphertext bytes, not over hex strings. Extracting the IV bytes with dd and concatenating them with the ciphertext inside a brace group { dd ...; dd ...; } is the correct approach (as shown in FORMAT.md Section 13.3). Note that { ...; } is a command group, not a subshell; it simply merges the output of both commands into one stream.

However, for the HMAC comparison, we compare hex strings (both from openssl dgst output and from the TOC hex extraction). Both must be lowercase.

UTF-8 Filename Extraction

# dd extracts raw bytes; if they are valid UTF-8, the shell preserves them
filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
# $filename now contains UTF-8 string, including Cyrillic characters
# Works because: (1) dd copies raw bytes, (2) $() captures them, (3) no null bytes in UTF-8 filenames
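The claim that command substitution preserves the bytes can be checked with a round trip. In this sketch the sample file, the word "тест" and the trailing X byte are made up for illustration; the extracted name is dumped back to hex and compared with the bytes that were written.

```shell
sample=$(mktemp)
# UTF-8 bytes of the Cyrillic word "тест" (d1 82 d0 b5 d1 81 d1 82), then "X"
printf '\321\202\320\265\321\201\321\202X' > "$sample"

# Extract only the 8 filename bytes, exactly as the TOC parser would.
name=$(dd if="$sample" bs=1 skip=0 count=8 2>/dev/null)
name_hex=$(printf '%s' "$name" | od -A n -t x1 | tr -d ' \n')
echo "$name_hex"        # d182d0b5d181d182 if no byte was altered
rm -f "$sample"
```

Running the same check on the target device would show whether its shell passes high bytes through unchanged.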

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| openssl dgst -mac HMAC -macopt | openssl mac -digest SHA256 -macopt hexkey:... | OpenSSL 3.0 (2021) | Old syntax still works in 3.x but soft-deprecated |
| xxd not in busybox | xxd applet in busybox | BusyBox 1.28 (2017) | Available on newer builds, but od fallback still needed for older systems |
| openssl enc with -md md5 default | -md sha256 default | OpenSSL 1.1.0 (2016) | No impact for raw key mode (-K/-iv); -md only affects password-derived keys |

Deprecated/outdated:

  • openssl dgst -hmac "key" (string key): Still works but -macopt hexkey: is required for binary keys. The hex key mode is NOT deprecated.
  • busybox builds without xxd: Still common on very old/minimal systems, hence od fallback is essential.

Open Questions

  1. Does the target busybox have openssl?

    • What we know: openssl is NOT a busybox applet. It must be a separate binary on the target system. The project mentions "busybox-compatible" in requirements.
    • What's unclear: Whether the specific target device (Android-based car head unit) has the openssl CLI installed.
    • Recommendation: The script MUST fail with a clear error message if openssl is not found. Document this as a prerequisite. The script already checks for tool availability at startup.
  2. Does the target openssl support -mac HMAC -macopt hexkey:?

    • What we know: Standard OpenSSL 1.1.1+ and 3.x support this syntax. Busybox does not include openssl. Minimal/embedded openssl builds may lack HMAC support.
    • What's unclear: Exact openssl version on target.
    • Recommendation: Implement graceful degradation per FORMAT.md Section 13.3. Skip HMAC if unsupported, print warning.
  3. Performance on large files with dd bs=1?

    • What we know: dd bs=1 reads one byte at a time. For extracting large data blocks (megabytes), this is very slow.
    • What's unclear: Whether the shell decoder needs to handle large files efficiently.
    • Recommendation: Keep dd bs=1 for the small header and TOC fields, where byte-exact offsets matter and the cost is negligible. Note that bs=1 does not buffer: dd issues one read and one write per byte, so extracting a data block this way (Step 1 of decode) scales poorly with size. For truly large files, consider dd with a larger block size and calculated skip/count, or a tail -c | head -c pipeline, but the added complexity may not be worth it for a fallback decoder.
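One way to avoid per-byte I/O for data blocks is a tail/head pipeline. This is a sketch under the assumption that the target provides tail -c +N and head -c (head -c is not required by POSIX, though GNU coreutils and busybox both provide it); the function name extract_range is hypothetical.

```shell
# Hypothetical helper: copy SIZE bytes starting at byte OFFSET (0-based)
# without dd bs=1. tail -c +N counts from 1, hence the +1.
extract_range() {
  tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

sample=$(mktemp)
printf 'abcdefghij' > "$sample"
chunk=$(extract_range "$sample" 3 4)
echo "$chunk"     # bytes 3..6 -> defg
rm -f "$sample"
```

In the decoder this would replace the dd call of Step 1, e.g. extract_range "$ARCHIVE" "$data_offset" "$encrypted_size" redirected to the ciphertext temp file.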

Sources

Primary (HIGH confidence)

  • FORMAT.md Section 13 (Shell Decoder Reference) - Complete reference functions for all operations
  • FORMAT.md Section 10 (Decode Order of Operations) - Mandatory decode pipeline
  • FORMAT.md Section 4-5 (Header/TOC structure) - Binary layout
  • src/key.rs - Actual hardcoded key bytes
  • src/format.rs - Rust implementation of header/TOC parsing (reference)
  • src/crypto.rs - Rust crypto implementation (HMAC scope, encrypt/decrypt)
  • kotlin/ArchiveDecoder.kt - Working decoder implementation (behavioral reference)
  • kotlin/test_decoder.sh - Cross-validation test pattern (structural reference)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • Various shell scripting guides on binary file handling - General patterns, not project-specific

Metadata

Confidence breakdown:

  • Standard stack: HIGH - All tools are well-documented CLI utilities with decades of stability. FORMAT.md Section 13 provides verified reference code.
  • Architecture: HIGH - Single-file script pattern is proven by the existing kotlin/test_decoder.sh. Binary parsing pattern with dd+xxd/od is well-established.
  • Pitfalls: HIGH - Identified from real tool behavior (openssl pipe handling, hex case, PKCS7, busybox limitations). FORMAT.md already anticipates several pitfalls (graceful HMAC degradation, od fallback).

Research date: 2026-02-25
Valid until: 2026-03-25 (stable domain -- shell tools and openssl CLI interface change very slowly)