NikitolProject/android-encrypted-archiver

NikitolProject 79a7ce2010 docs(phase5): research shell decoder domain

2026-02-25 01:29:50 +03:00

22 KiB

Phase 5: Shell Decoder - Research

Researched: 2026-02-25
Domain: POSIX/busybox shell scripting, binary format parsing, AES-256-CBC decryption via CLI
Confidence: HIGH

Summary

The shell decoder is a busybox-compatible shell script that extracts files from archives created by the Rust archiver. The script must parse the binary format (header + TOC) using dd and hex conversion tools, decrypt each file with openssl enc -aes-256-cbc, optionally decompress with gunzip, and verify integrity with sha256sum. The format spec (FORMAT.md Section 13) already provides reference functions for most operations.

The main technical challenges are: (1) openssl is NOT a busybox applet -- it requires a separate openssl binary on the target system; (2) xxd was added to busybox in v1.28 (2017) but older versions lack it -- od must serve as fallback; (3) extracting UTF-8 filenames (Cyrillic) from binary data requires careful byte-range extraction with dd piped to raw output; (4) little-endian integer parsing in shell requires byte-swapping via hex string manipulation.

Primary recommendation: Build a single self-contained decode.sh script using the reference functions from FORMAT.md Section 13 as a foundation, with od-based fallbacks for xxd, graceful degradation for HMAC verification if openssl dgst lacks -mac support, and a cross-validation test script modeled after kotlin/test_decoder.sh.

<phase_requirements>

Phase Requirements

ID	Description	Research Support
SHL-01	Shell script dearchivation via busybox (dd, xxd, openssl, gunzip)	FORMAT.md Section 13 provides complete reference functions. `dd` and `gunzip` are native busybox applets. `xxd` available in busybox >=1.28 with `od` fallback. `openssl` requires external binary.
SHL-02	openssl enc -aes-256-cbc with -K/-iv/-nosalt for raw key mode	OpenSSL `enc` supports `-K` (hex key), `-iv` (hex IV), `-nosalt` and auto-removes PKCS7 padding on decryption. Standard across OpenSSL 1.x and 3.x.
SHL-03	Support files with non-ASCII names (Cyrillic)	`dd` extracts raw bytes which are valid UTF-8. The filename bytes can be written directly to a variable using command substitution. Shell/filesystem handles UTF-8 natively if `LANG` is set properly.
</phase_requirements>

Standard Stack

Core

Tool	Version	Purpose	Why Standard
`dd`	busybox built-in	Extract byte ranges from archive	Native applet, universal on all busybox builds
`openssl`	1.1.1+ or 3.x (external)	AES-256-CBC decryption, HMAC-SHA-256	Only CLI tool supporting raw-key AES-CBC decryption
`gunzip`	busybox built-in	Gzip decompression	Native applet, handles standard gzip streams
`sha256sum`	busybox built-in	SHA-256 integrity verification	Native applet, produces standard hash output
`sh`	busybox ash/sh	Script interpreter	POSIX-compatible shell, always available

Supporting

Tool	Version	Purpose	When to Use
`xxd`	busybox >=1.28 (optional)	Binary-to-hex conversion	Primary hex encoder; added to busybox in 2017
`od`	busybox built-in	Binary-to-hex conversion (fallback)	Fallback when `xxd` is unavailable
`awk`	busybox built-in	Text processing (parse openssl output)	Extract hash values from command output
`tr`	busybox built-in	Character deletion/translation	Remove whitespace/newlines from hex output
`printf`	shell built-in	Hex-to-decimal conversion	Convert `0xNN` strings to decimal integers
`mktemp`	busybox built-in	Create temporary files	Temporary storage for ciphertext/decrypted data

Alternatives Considered

Instead of	Could Use	Tradeoff
`xxd -p` for hex	`od -A n -t x1` for hex	`od` is universally available but output needs more cleanup (spaces, newlines)
`openssl dgst -mac HMAC` for HMAC	Skip HMAC verification	Older/minimal openssl may not support `-mac HMAC -macopt`; graceful degradation is the spec's recommendation
`printf '%d' "0x..."` for hex-to-dec	`$(( 16#... ))` bash arithmetic	`printf` is more portable across sh implementations; bash arithmetic is not POSIX
`sha256sum` for SHA-256	`openssl dgst -sha256`	Both work; `sha256sum` is a busybox applet (no external dependency)
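
Since several of these tools may be missing on a given target, a startup check is one way to fail early with a clear message. The following is a minimal sketch, not the project's code; `check_tools` is a hypothetical helper name, and the tool list in the usage comment is an assumption based on the tables above.

```sh
# Hypothetical helper: report every required tool that is missing, then fail.
check_tools() {
  missing=""
  for tool in "$@"; do
    # command -v is the POSIX way to test whether a command can be found
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "ERROR: required tools not found:$missing" >&2
    return 1
  fi
}

# Possible call at the top of decode.sh (tool list is an assumption):
#   check_tools dd openssl gunzip sha256sum od awk tr mktemp || exit 1
```

Collecting all missing names before failing means the user sees the full list in one run instead of fixing one tool at a time.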

Architecture Patterns

Recommended Script Structure

shell/
├── decode.sh           # Main decoder script (single file, self-contained)
└── test_decoder.sh     # Cross-validation test script (Rust pack -> Shell decode)

The decoder is a SINGLE self-contained script. No external libraries, no sourced files. This matches the project's deployment model: the script is copied alongside the archive to the target device.

Pattern 1: Detect-and-Fallback for xxd/od

What: Auto-detect whether xxd is available; if not, define wrapper functions using od.
When to use: At script startup, before any binary parsing.
Example:

# Source: FORMAT.md Section 13.1 + 13.2 (od fallback)
if command -v xxd >/dev/null 2>&1; then
  read_hex() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | xxd -p | tr -d '\n'
  }
else
  read_hex() {
    # -v stops od from collapsing repeated identical output lines into "*"
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null \
      | od -v -A n -t x1 | tr -d ' \n'
  }
fi
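
One detail of the od path is worth checking on the target: by default od replaces repeated identical output lines with a single `*`, which would corrupt a hex string longer than 16 bytes if its lines happen to repeat. Passing `-v` is the safe choice. A minimal demonstration, using `/dev/zero` as a stand-in for archive data:

```sh
# 32 identical bytes produce two identical 16-byte od output lines.
hex_plain=$(dd if=/dev/zero bs=32 count=1 2>/dev/null | od -A n -t x1 | tr -d ' \n')
hex_v=$(dd if=/dev/zero bs=32 count=1 2>/dev/null | od -v -A n -t x1 | tr -d ' \n')

# With -v every byte is printed, so 32 bytes give 64 hex characters.
# Without -v the second line is typically collapsed, so the string is shorter.
echo "without -v: ${#hex_plain} chars"
echo "with -v:    ${#hex_v} chars"
```

Real IV, HMAC, and SHA-256 values are very unlikely to repeat this way, but the flag costs nothing and removes the edge case.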

Pattern 2: Little-Endian Integer Parsing via Hex Byte Swap

What: Read N bytes as hex, swap byte order, convert to decimal.
When to use: For every u16/u32 field in the header and TOC.
Example:

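# Source: FORMAT.md Section 13.1
# Note: `local` and ${var:offset:length} are not POSIX. They work in busybox ash
# (substring expansion is part of its bash-compatibility extensions).
read_le_u16() {
  local hex b0 b1
  hex=$(read_hex "$1" "$2" 2)
  b0=${hex:0:2}; b1=${hex:2:2}   # file order: b0 is the least significant byte
  printf '%d' "0x${b1}${b0}"
}

read_le_u32() {
  local hex b0 b1 b2 b3
  hex=$(read_hex "$1" "$2" 4)
  b0=${hex:0:2}; b1=${hex:2:2}; b2=${hex:4:2}; b3=${hex:6:2}
  printf '%d' "0x${b3}${b2}${b1}${b0}"
}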
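
If substring expansion turns out to be unavailable on a target shell, the same little-endian parsing can be sketched with POSIX features only, by letting od print each byte as an unsigned decimal. The function names below are hypothetical and not part of FORMAT.md; the sketch assumes `od -t u1` is supported and that the shell's arithmetic is wide enough for the value (a u32 of 2^31 or more needs 64-bit arithmetic).

```sh
# Hypothetical POSIX-only variant: od emits decimal bytes, so no hex slicing is needed.
# $1 = file, $2 = byte offset
read_le_u16_posix() {
  # Word splitting of od's output is intentional: each byte becomes $1, $2, ...
  set -- $(dd if="$1" bs=1 skip="$2" count=2 2>/dev/null | od -A n -t u1)
  echo $(( $1 + $2 * 256 ))
}

read_le_u32_posix() {
  set -- $(dd if="$1" bs=1 skip="$2" count=4 2>/dev/null | od -A n -t u1)
  echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}
```

A short read (fewer bytes than requested) makes the arithmetic expansion fail, which at least surfaces a truncated archive rather than returning a wrong number silently.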

Pattern 3: Sequential TOC Parsing with Running Offset

What: Parse variable-length TOC entries using a running byte offset.
When to use: When reading the file table (TOC), where each entry has a variable-length filename.
Example:

# Start at toc_offset
pos=$toc_offset

for i in $(seq 0 $((file_count - 1))); do
  name_length=$(read_le_u16 "$ARCHIVE" "$pos")
  pos=$((pos + 2))

  # Extract filename (raw UTF-8 bytes via dd)
  filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
  pos=$((pos + name_length))

  original_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  compressed_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  encrypted_size=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))
  data_offset=$(read_le_u32 "$ARCHIVE" "$pos"); pos=$((pos + 4))

  iv_toc_offset=$pos   # remember where the raw IV bytes live (needed as HMAC input)
  iv_hex=$(read_hex "$ARCHIVE" "$pos" 16); pos=$((pos + 16))
  hmac_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))
  sha256_hex=$(read_hex "$ARCHIVE" "$pos" 32); pos=$((pos + 32))

  compression_flag=$(read_hex "$ARCHIVE" "$pos" 1); pos=$((pos + 1))
  padding_after=$(read_le_u16 "$ARCHIVE" "$pos"); pos=$((pos + 2))

  # Process this file entry...
done
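
As a cross-check on the running offset, the field widths read in the loop above add up to a fixed 101 bytes per entry plus the filename: 2 (name_length) + 4 x 4 (sizes and offset) + 16 (IV) + 32 (HMAC) + 32 (SHA-256) + 1 (compression flag) + 2 (padding_after). A small sketch of that arithmetic, with `toc_entry_size` as a hypothetical helper name:

```sh
# Hypothetical helper: size in bytes of one TOC entry as parsed by the loop above.
# $1 = name_length
toc_entry_size() {
  # 2 + 16 + 16 + 32 + 32 + 1 + 2 = 101 fixed bytes, plus the filename bytes
  echo $(( 101 + $1 ))
}

# Example: an 8-byte filename gives a 109-byte entry.
toc_entry_size 8
```

Comparing `pos` before and after each iteration against this value is a cheap way to catch a skipped or double-counted field.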

Pattern 4: Pipe-Based Decryption (dd | openssl)

What: Extract ciphertext with dd and pipe it directly to openssl enc -d for decryption.
When to use: For each file's data block decryption.
Example:

# Source: FORMAT.md Section 13.4
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" 2>/dev/null \
  | openssl enc -d -aes-256-cbc -nosalt -K "$KEY_HEX" -iv "$iv_hex" \
  > "$tmpfile"

Pattern 5: Graceful HMAC Degradation

What: Detect whether openssl dgst -mac HMAC is supported; skip HMAC if not.
When to use: Before the file extraction loop.
Example:

# Source: FORMAT.md Section 13.3
SKIP_HMAC=0
# printf is used instead of the non-portable echo -n
if ! printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 >/dev/null 2>&1; then
  echo "WARNING: openssl HMAC not available, skipping integrity verification" >&2
  SKIP_HMAC=1
fi

Anti-Patterns to Avoid

Using bash-specific syntax: The script must run in busybox ash/sh. No [[ ]], no $((16#FF)), no arrays, no process substitution <(). Use [ ], printf '%d' "0x...", positional parameters or temp files.
Reading entire archive into memory: Shell cannot handle binary data in variables. Always use dd to extract specific byte ranges to files or pipes.
Using -e flag with echo for binary: Portability issues across shells. Use printf or dd instead.
Storing binary data in shell variables: Shell variables cannot hold NUL bytes (\0); depending on the shell they are dropped or truncate the value. Only store hex strings in variables, never raw binary.
Hardcoding /tmp: Use mktemp for temporary files. Clean up with a trap.
Using xxd -e (little-endian mode): Not supported in busybox xxd. Manual byte swapping is required.

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
AES-256-CBC decryption	Custom decryption in shell	`openssl enc -d -aes-256-cbc`	Impossible to implement AES in pure shell; openssl handles PKCS7 removal automatically
Gzip decompression	Custom DEFLATE in shell	`gunzip -c`	Compression algorithms cannot be implemented in shell
SHA-256 hashing	Custom hash in shell	`sha256sum` (busybox)	Cryptographic hash requires proper implementation
HMAC-SHA-256	Custom HMAC in shell	`openssl dgst -sha256 -mac HMAC`	HMAC construction is subtle; openssl handles it correctly
Hex-to-binary conversion	Manual byte construction	`xxd -r -p` or `printf '\xNN'`	Direct hex-to-binary tools already exist

Key insight: The shell decoder is fundamentally a "glue script" -- it orchestrates existing tools (dd, openssl, gunzip, sha256sum) to implement the decode pipeline. All cryptographic and compression operations are delegated to dedicated tools; the script only handles binary format parsing (offsets, lengths, byte swapping).

Common Pitfalls

Pitfall 1: openssl enc Output Contains Extra Bytes

What goes wrong: When openssl enc -d reads piped input from dd, it may behave differently than when it reads from a file. Some versions have issues with incomplete reads on pipes.
Why it happens: Pipe buffering and EOF handling can differ from file I/O.
How to avoid: Prefer extracting the ciphertext to a temp file first, then decrypting from that file. If you do pipe dd ... | openssl enc -d ..., verify that the output size matches compressed_size (or original_size if uncompressed).
Warning signs: Decrypted output size doesn't match the expected compressed_size.

Pitfall 2: Hex String Case Mismatch in HMAC Comparison

What goes wrong: openssl dgst outputs lowercase hex, but the stored HMAC extracted via xxd -p or od may use a different case.
Why it happens: Different tools use different case conventions for hex output.
How to avoid: Normalize both sides to lowercase before comparison: printf '%s' "$hex" | tr 'A-F' 'a-f'.
Warning signs: HMAC verification always fails despite correct data.
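
The normalization can be wrapped in two small helpers so that every digest comparison in the script goes through the same path. This is a sketch; `to_lower_hex` and `hex_equal` are hypothetical names, not functions from FORMAT.md.

```sh
# Hypothetical helper: lowercase the hex digits A-F, leave everything else alone.
to_lower_hex() {
  printf '%s' "$1" | tr 'A-F' 'a-f'
}

# Hypothetical helper: succeed (exit 0) when two hex strings match ignoring case.
hex_equal() {
  [ "$(to_lower_hex "$1")" = "$(to_lower_hex "$2")" ]
}

# Possible use in the decode loop:
#   hex_equal "$computed_hmac" "$hmac_hex" || { echo "HMAC failed" >&2; continue; }
```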

Pitfall 3: Empty File (0 bytes) Causes gunzip Error

What goes wrong: A 0-byte original file can still have compression_flag=1, so decryption yields a valid but tiny gzip stream. Script logic that assumes non-empty data may mishandle this edge case.
Why it happens: Gzip of empty input produces a minimal gzip stream (~20 bytes). PKCS7 pads that up to the next 16-byte boundary, so encrypted_size = 32. The decrypted output is a valid gzip stream that decompresses to 0 bytes.
How to avoid: Check original_size == 0 before decompression; if zero, just create an empty file. Alternatively, let gunzip handle it (it should produce empty output from a valid empty gzip stream).
Warning signs: Script crashes or produces garbage for 0-byte files.
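
A minimal sketch of the first option, assuming the TOC values are already parsed into shell variables. `finalize_entry` is a hypothetical helper name, and the argument order is an illustration only.

```sh
# Hypothetical helper: turn a decrypted blob into the final file contents.
# $1 = decrypted file, $2 = output file,
# $3 = compression flag as hex ("00" or "01"), $4 = original_size from the TOC
finalize_entry() {
  if [ "$4" -eq 0 ]; then
    : > "$2"                 # original was empty: create the file, skip gunzip
  elif [ "$3" = "01" ]; then
    gunzip -c "$1" > "$2"    # compressed entry
  else
    cp "$1" "$2"             # stored uncompressed
  fi
}
```

Checking original_size first keeps the empty-file case independent of how a particular gunzip build treats a minimal stream.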

Pitfall 4: Cyrillic Filename Extraction Corrupts Characters

What goes wrong: The dd-extracted filename bytes are valid UTF-8, but the shell variable may be corrupted if the locale is not set or if intermediate processing strips high bytes.
Why it happens: Some busybox builds strip non-ASCII bytes, or the LANG/LC_ALL environment may not support UTF-8.
How to avoid: Set export LC_ALL=C (or C.UTF-8 if available) at the top of the script. Use dd to extract raw bytes directly. Do NOT process the filename through tr, sed, or awk before writing. Verify that printf '%s' preserves the bytes.
Warning signs: Extracted files have garbled names (mojibake) or are named with ? characters.
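
Byte preservation can be verified on the target with a small round trip that compares hex rather than relying on how a terminal renders the name. The sketch below builds a throwaway file containing the UTF-8 bytes of "тест" (an arbitrary example, not a name from the project) and extracts them the same way the TOC parser would.

```sh
# Build a test file: "AB", then the 8 UTF-8 bytes of a Cyrillic word, then "ZZ".
# Octal escapes are used because printf '\xNN' is not guaranteed by POSIX.
demo=$(mktemp)
printf 'AB\321\202\320\265\321\201\321\202ZZ' > "$demo"

# Extract the 8 filename bytes at offset 2 into a shell variable.
name=$(dd if="$demo" bs=1 skip=2 count=8 2>/dev/null)

# Dump the variable back to hex to confirm no byte was altered or dropped.
name_hex=$(printf '%s' "$name" | od -v -A n -t x1 | tr -d ' \n')
echo "$name_hex"
rm -f "$demo"
```

If the printed hex differs from d182d0b5d181d182, the shell or one of the tools on that device is altering high bytes.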

Pitfall 5: Shell Arithmetic Overflow on Large Files

What goes wrong: Shell arithmetic uses platform-native integers. In a shell with 32-bit signed arithmetic, values of 2^31 (2 GB) and above overflow.
Why it happens: The archive format uses u32 for sizes and offsets (max ~4 GB), but shell arithmetic may be limited to signed 32-bit, depending on how busybox was built.
How to avoid: Only sizes and offsets of 2 GB or more are affected. For v1, files that large are not expected, so this is LOW risk. If needed, use awk for arithmetic on large numbers.
Warning signs: Negative offsets or sizes in the script output for files larger than 2 GB.
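
A sketch of the awk route: awk computes in double precision, which represents integers exactly up to 2^53, far beyond any u32 sum. `add_big` is a hypothetical helper name.

```sh
# Hypothetical helper: add two non-negative integers without shell arithmetic.
add_big() {
  # %.0f prints the double as a plain integer (no exponent, no decimals)
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a + b }'
}

# Example: offset + size near the top of the u32 range
add_big 4294967295 4294967295
```

Each call spawns a process, so this is only worth using for values that may actually exceed the 32-bit signed range.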

Pitfall 6: dd stderr Noise Pollutes Output

What goes wrong: dd writes transfer statistics to stderr (e.g., "32 bytes transferred"). If stderr is not suppressed, it may confuse piped commands or pollute user output.
Why it happens: dd always writes stats to stderr unless suppressed.
How to avoid: Always use 2>/dev/null with dd commands. This is already shown in the FORMAT.md Section 13 reference functions.
Warning signs: Unexpected text mixed into hex output or filenames.

Pitfall 7: openssl 3.x Changes HMAC Syntax

What goes wrong: The -mac HMAC -macopt hexkey:KEY syntax works in OpenSSL 1.x and 3.x but is soft-deprecated in 3.x. The new openssl mac subcommand is preferred but has different syntax.
Why it happens: OpenSSL 3.x migrated to a provider-based architecture; legacy options still work but may be removed.
How to avoid: Implement graceful degradation (already specified in FORMAT.md Section 13.3). Test with printf 'test' | openssl dgst -sha256 -mac HMAC -macopt hexkey:00 at script startup. If it fails, set SKIP_HMAC=1.
Warning signs: HMAC check produces error messages instead of hash values.

Code Examples

Complete Decode Pipeline for One File

# Verified pattern from FORMAT.md Section 13.4 + project decisions
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT

# Step 1: Extract ciphertext to temp file
dd if="$ARCHIVE" bs=1 skip="$data_offset" count="$encrypted_size" \
  of="$TMPDIR/ct.bin" 2>/dev/null

# Step 2: Verify HMAC (if available)
if [ "$SKIP_HMAC" = "0" ]; then
  computed_hmac=$(
    {
      dd if="$ARCHIVE" bs=1 skip="$iv_toc_offset" count=16 2>/dev/null  # IV from TOC
      cat "$TMPDIR/ct.bin"                                                # ciphertext
    } | openssl dgst -sha256 -mac HMAC -macopt "hexkey:${KEY_HEX}" -hex 2>/dev/null \
      | awk '{print $NF}'
  )
  if [ "$computed_hmac" != "$hmac_hex" ]; then
    echo "HMAC failed for $filename, skipping" >&2
    continue
  fi
fi

# Step 3: Decrypt (openssl auto-removes PKCS7 padding)
openssl enc -d -aes-256-cbc -nosalt \
  -K "$KEY_HEX" -iv "$iv_hex" \
  -in "$TMPDIR/ct.bin" -out "$TMPDIR/dec.bin"

# Step 4: Decompress if needed
if [ "$compression_flag" = "01" ]; then
  gunzip -c "$TMPDIR/dec.bin" > "$TMPDIR/out.bin"
else
  mv "$TMPDIR/dec.bin" "$TMPDIR/out.bin"
fi

# Step 5: Verify SHA-256
actual_sha=$(sha256sum "$TMPDIR/out.bin" | awk '{print $1}')
if [ "$actual_sha" != "$sha256_hex" ]; then
  echo "WARNING: SHA-256 mismatch for $filename" >&2
fi

# Step 6: Write output
mv "$TMPDIR/out.bin" "$OUTPUT_DIR/$filename"

Key Hex Constant (from src/key.rs)

# Hardcoded 32-byte AES-256 key as hex string (matching src/key.rs)
KEY_HEX="7a35c1d94fe82b6a910df358bc74a61e428fd063e5179b2cfa8406cd3e79b550"

Derivation from src/key.rs:

0x7A 0x35 0xC1 0xD9 0x4F 0xE8 0x2B 0x6A
0x91 0x0D 0xF3 0x58 0xBC 0x74 0xA6 0x1E
0x42 0x8F 0xD0 0x63 0xE5 0x17 0x9B 0x2C
0xFA 0x84 0x06 0xCD 0x3E 0x79 0xB5 0x50

HMAC Verification with IV from Archive (Not from TOC-parsed Variable)

A subtle point: for HMAC verification, the IV bytes must come from the archive file (not from a hex variable). The HMAC is computed over the raw IV || ciphertext bytes, not over hex strings. Extracting the IV bytes with dd and concatenating them with the ciphertext in a brace group { dd ...; dd ...; } is correct (as shown in FORMAT.md Section 13.3).

However, for the HMAC comparison, we compare hex strings (both from openssl dgst output and from the TOC hex extraction). Both must be lowercase.

UTF-8 Filename Extraction

# dd extracts raw bytes; if they are valid UTF-8, the shell preserves them
filename=$(dd if="$ARCHIVE" bs=1 skip="$pos" count="$name_length" 2>/dev/null)
# $filename now contains UTF-8 string, including Cyrillic characters
# Works because: (1) dd copies raw bytes, (2) $() captures them, (3) no null bytes in UTF-8 filenames

State of the Art

Old Approach	Current Approach	When Changed	Impact
`openssl dgst -mac HMAC -macopt`	`openssl mac -digest SHA256 -macopt hexkey:... HMAC`	OpenSSL 3.0 (2021)	Old syntax still works in 3.x but soft-deprecated
`xxd` not in busybox	`xxd` applet in busybox	BusyBox 1.28 (2017)	Available on newer builds, but `od` fallback still needed for older systems
`openssl enc` with `-md md5` default	`-md sha256` default	OpenSSL 1.1.0 (2016)	No impact for raw key mode (`-K`/`-iv`); `-md` only affects password-derived keys

Deprecated/outdated:

openssl dgst -hmac "key" (string key): Still works but -macopt hexkey: is required for binary keys. The hex key mode is NOT deprecated.
busybox builds without xxd: Still common on very old/minimal systems, hence od fallback is essential.

Open Questions

Does the target busybox have openssl?
- What we know: openssl is NOT a busybox applet. It must be a separate binary on the target system. The project mentions "busybox-compatible" in requirements.
- What's unclear: Whether the specific target device (Android-based car head unit) has the openssl CLI installed.
- Recommendation: The script MUST fail with a clear error message if openssl is not found. Document this as a prerequisite. The script should check for tool availability at startup.
Does the target openssl support -mac HMAC -macopt hexkey:?
- What we know: Standard OpenSSL 1.1.1+ and 3.x support this syntax. Busybox does not include openssl. Minimal/embedded openssl builds may lack HMAC support.
- What's unclear: Exact openssl version on target.
- Recommendation: Implement graceful degradation per FORMAT.md Section 13.3. Skip HMAC if unsupported, print warning.
Performance on large files with dd bs=1?
- What we know: dd bs=1 reads one byte at a time. For extracting large data blocks (megabytes), this is very slow.
- What's unclear: Whether the shell decoder needs to handle large files efficiently.
- Recommendation: dd bs=1 issues one read and one write per byte regardless of count, so dd bs=1 skip=OFFSET count=SIZE does not speed up data block extraction (Step 1 of decode); it only remains convenient for small header and TOC fields. For truly large files, consider dd bs=4096 with calculated skip/count, but the added complexity may not be worth it for a fallback decoder.
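
One way to avoid per-byte I/O without block-size arithmetic is to combine tail -c and head -c. This is a sketch under assumptions: `extract_range` is a hypothetical helper, and `head -c` is not required by POSIX, although GNU coreutils and typical busybox builds provide it.

```sh
# Hypothetical helper: write $3 bytes starting at byte offset $2 of file $1 to stdout.
extract_range() {
  # tail -c +N counts from 1, so a 0-based offset needs +1
  tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}

# Possible use for Step 1 of the decode pipeline:
#   extract_range "$ARCHIVE" "$data_offset" "$encrypted_size" > "$TMPDIR/ct.bin"
```

Both tools read in large buffers, so the cost scales with the data size rather than with one system call per byte.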

Sources

Primary (HIGH confidence)

FORMAT.md Section 13 (Shell Decoder Reference) - Complete reference functions for all operations
FORMAT.md Section 10 (Decode Order of Operations) - Mandatory decode pipeline
FORMAT.md Section 4-5 (Header/TOC structure) - Binary layout
src/key.rs - Actual hardcoded key bytes
src/format.rs - Rust implementation of header/TOC parsing (reference)
src/crypto.rs - Rust crypto implementation (HMAC scope, encrypt/decrypt)
kotlin/ArchiveDecoder.kt - Working decoder implementation (behavioral reference)
kotlin/test_decoder.sh - Cross-validation test pattern (structural reference)

Secondary (MEDIUM confidence)

OpenSSL enc documentation (3.3) - -K, -iv, -nosalt, PKCS7 auto-removal
OpenSSL dgst documentation (3.3) - -mac HMAC -macopt hexkey: syntax
BusyBox xxd commit (2017) - xxd applet added to busybox
BusyBox applet list - dd, gunzip, sha256sum, od are native applets; openssl is not, and xxd is present only in builds >=1.28
BusyBox xxd options - Supported flags: -p, -r, -l, -s, -g, -c, -u
docker-library/busybox#13 - UTF-8 support limitations in busybox

Tertiary (LOW confidence)

Various shell scripting guides on binary file handling - General patterns, not project-specific

Metadata

Confidence breakdown:

Standard stack: HIGH - All tools are well-documented CLI utilities with decades of stability. FORMAT.md Section 13 provides verified reference code.
Architecture: HIGH - Single-file script pattern is proven by the existing kotlin/test_decoder.sh. Binary parsing pattern with dd+xxd/od is well-established.
Pitfalls: HIGH - Identified from real tool behavior (openssl pipe handling, hex case, PKCS7, busybox limitations). FORMAT.md already anticipates several pitfalls (graceful HMAC degradation, od fallback).

Research date: 2026-02-25
Valid until: 2026-03-25 (stable domain -- shell tools and openssl CLI interface change very slowly)
