diff --git a/docs/FORMAT.md b/docs/FORMAT.md index a4b745e..d1510b5 100644 --- a/docs/FORMAT.md +++ b/docs/FORMAT.md @@ -1,7 +1,7 @@ # Encrypted Archive Binary Format Specification -**Version:** 1.0 -**Date:** 2026-02-24 +**Version:** 1.1 +**Date:** 2026-02-26 **Status:** Normative --- @@ -12,7 +12,7 @@ 2. [Notation Conventions](#2-notation-conventions) 3. [Archive Structure Diagram](#3-archive-structure-diagram) 4. [Archive Header Definition](#4-archive-header-definition) -5. [File Table Entry Definition](#5-file-table-entry-definition) +5. [Table of Contents (TOC) Entry Definition](#5-table-of-contents-toc-entry-definition) 6. [Data Block Layout](#6-data-block-layout) 7. [Encryption and Authentication Details](#7-encryption-and-authentication-details) 8. [Compression Details](#8-compression-details) @@ -63,7 +63,7 @@ The shell decoder must be able to parse the archive format using `dd` (for byte - All multi-byte integers are **little-endian (LE)**. - All sizes are in **bytes** unless stated otherwise. - All offsets are **absolute** from archive byte 0 (the first byte of the file). -- Filenames are **UTF-8 encoded**, length-prefixed with a u16 byte count (NOT null-terminated). +- Entry names are **UTF-8 encoded** relative paths using `/` as the path separator (e.g., `dir/subdir/file.txt`). Names MUST NOT start with `/` or contain `..` components. For top-level files, the name is just the filename (e.g., `readme.txt`). Names are length-prefixed with a u16 byte count (NOT null-terminated). - Reserved fields are **zero-filled** and MUST be written as `0x00` bytes. 
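The entry-name rules above (UTF-8, `/`-separated relative path, no leading `/`, no `..` components, u16 little-endian length prefix) can be sketched as a small encoder-side check. This is a minimal illustration, not part of the specification: the function name `encode_entry_name` is hypothetical, and the spec does not say how empty names should be handled, so the sketch leaves them alone.

```python
import struct


def encode_entry_name(name: str) -> bytes:
    """Validate an entry name and return u16-LE length prefix + UTF-8 bytes.

    Hypothetical helper for illustration; only the rules quoted in the
    notation conventions are enforced here.
    """
    # Names MUST NOT start with "/" (no absolute paths).
    if name.startswith("/"):
        raise ValueError("entry name must be a relative path")
    # Names MUST NOT contain ".." as a path component.
    # A name such as "a..b" is fine; only a whole component equal to ".." is rejected.
    if ".." in name.split("/"):
        raise ValueError("entry name must not contain '..' components")
    raw = name.encode("utf-8")
    # The length prefix is a u16, so the byte count must fit in 16 bits.
    if len(raw) > 0xFFFF:
        raise ValueError("entry name too long for u16 length prefix")
    # "<H" = little-endian unsigned 16-bit integer; no null terminator is appended.
    return struct.pack("<H", len(raw)) + raw
```

For example, `encode_entry_name("readme.txt")` yields the two bytes `0x0A 0x00` followed by the ten name bytes. The prefix counts encoded bytes, not characters, which matters for non-ASCII names.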
--- @@ -74,13 +74,14 @@ The shell decoder must be able to parse the archive format using `dd` (for byte +=======================================+ | ARCHIVE HEADER | Fixed 40 bytes | magic(4) | ver(1) | flags(1) | -| file_count(2) | toc_offset(4) | +| entry_count(2) | toc_offset(4) | | toc_size(4) | toc_iv(16) | | reserved(8) | +=======================================+ | FILE TABLE (TOC) | Variable size -| Entry 1: name, sizes, offset, | Optionally encrypted -| iv, hmac, sha256, flags | (see Section 9.2) +| Entry 1: name, type, perms, | Optionally encrypted +| sizes, offset, iv, hmac, | Files AND directories +| sha256, flags | (see Section 9.2) | Entry 2: ... | | ... | | Entry N: ... | @@ -102,8 +103,8 @@ The shell decoder must be able to parse the archive format using `dd` (for byte The archive consists of three contiguous regions: 1. **Header** (fixed 40 bytes) -- contains magic bytes, version, flags, and a pointer to the file table. -2. **File Table (TOC)** (variable size) -- contains one entry per archived file with all metadata needed for extraction. -3. **Data Blocks** (variable size) -- contains the encrypted (and optionally compressed) file contents, one block per file, optionally separated by decoy padding. +2. **File Table (TOC)** (variable size) -- contains one entry per archived file or directory with all metadata needed for extraction. +3. **Data Blocks** (variable size) -- contains the encrypted (and optionally compressed) file contents, one block per file entry (directory entries have no data block), optionally separated by decoy padding. --- @@ -114,11 +115,11 @@ The header is a fixed-size 40-byte structure at offset 0x00. | Offset | Size | Type | Endian | Field | Description | |--------|------|------|--------|-------|-------------| | `0x00` | 4 | bytes | - | `magic` | Custom magic bytes: `0x00 0xEA 0x72 0x63`. The leading `0x00` signals binary content; the remaining bytes (`0xEA 0x72 0x63`) do not match any known file signature. 
| -| `0x04` | 1 | u8 | - | `version` | Format version. Value `1` for this specification (v1). | +| `0x04` | 1 | u8 | - | `version` | Format version. Value `2` for this specification (v1.1). Value `1` for legacy v1.0 (no directory support). | | `0x05` | 1 | u8 | - | `flags` | Feature flags bitfield (see below). | -| `0x06` | 2 | u16 | LE | `file_count` | Number of files stored in the archive. | -| `0x08` | 4 | u32 | LE | `toc_offset` | Absolute byte offset of the file table from archive start. | -| `0x0C` | 4 | u32 | LE | `toc_size` | Size of the file table in bytes (if TOC encryption is on, this is the encrypted size including PKCS7 padding). | +| `0x06` | 2 | u16 | LE | `entry_count` | Number of entries (files and directories) stored in the archive. | +| `0x08` | 4 | u32 | LE | `toc_offset` | Absolute byte offset of the entry table from archive start. | +| `0x0C` | 4 | u32 | LE | `toc_size` | Size of the entry table in bytes (if TOC encryption is on, this is the encrypted size including PKCS7 padding). | | `0x10` | 16 | bytes | - | `toc_iv` | Initialization vector for encrypted TOC. Zero-filled (`0x00` x 16) when TOC encryption flag (bit 1) is off. | | `0x20` | 8 | bytes | - | `reserved` | Reserved for future use. MUST be zero-filled. | @@ -136,33 +137,64 @@ The header is a fixed-size 40-byte structure at offset 0x00. --- -## 5. File Table Entry Definition +## 5. Table of Contents (TOC) Entry Definition -The file table (TOC) is a contiguous sequence of variable-length entries, one per file. Entries are stored in the order files were added to the archive. There is no per-entry delimiter; entries are read sequentially using the `name_length` field to determine where each entry's variable-length name ends. +The file table (TOC) is a contiguous sequence of variable-length entries, one per file or directory. Entries are stored so that directory entries appear before any files within them (parent-before-child ordering). 
There is no per-entry delimiter; entries are read sequentially using the `name_length` field to determine where each entry's variable-length name ends. ### Entry Field Table | Field | Size | Type | Endian | Description | |-------|------|------|--------|-------------| -| `name_length` | 2 | u16 | LE | Filename length in bytes (UTF-8 encoded byte count). | -| `name` | `name_length` | bytes | - | Filename as UTF-8 bytes. NOT null-terminated. May contain path separators (`/`). | -| `original_size` | 4 | u32 | LE | Original file size in bytes (before compression). | -| `compressed_size` | 4 | u32 | LE | Size after gzip compression. Equals `original_size` if `compression_flag` is 0 (no compression). | -| `encrypted_size` | 4 | u32 | LE | Size after AES-256-CBC encryption with PKCS7 padding. Formula: `((compressed_size / 16) + 1) * 16`. | -| `data_offset` | 4 | u32 | LE | Absolute byte offset of this file's data block from archive start. | -| `iv` | 16 | bytes | - | Random AES-256-CBC initialization vector for this file. | -| `hmac` | 32 | bytes | - | HMAC-SHA-256 over `iv || ciphertext`. See Section 7 for details. | -| `sha256` | 32 | bytes | - | SHA-256 hash of the original file content (before compression and encryption). | -| `compression_flag` | 1 | u8 | - | `0` = raw (no compression), `1` = gzip compressed. | +| `name_length` | 2 | u16 | LE | Entry name length in bytes (UTF-8 encoded byte count). | +| `name` | `name_length` | bytes | - | Entry name as UTF-8 bytes. NOT null-terminated. Relative path using `/` as separator (see Entry Name Semantics below). | +| `entry_type` | 1 | u8 | - | Entry type: `0x00` = regular file, `0x01` = directory. Directories have `original_size`, `compressed_size`, and `encrypted_size` all set to 0 and no corresponding data block. | +| `permissions` | 2 | u16 | LE | Unix permission bits (lower 12 bits of POSIX `mode_t`). Bit layout: `[suid(1)][sgid(1)][sticky(1)][owner_rwx(3)][group_rwx(3)][other_rwx(3)]`. 
Example: `0o755` = `0x01ED` = owner rwx, group r-x, other r-x. Stored as u16 LE. | +| `original_size` | 4 | u32 | LE | Original file size in bytes (before compression). For directories: 0. | +| `compressed_size` | 4 | u32 | LE | Size after gzip compression. Equals `original_size` if `compression_flag` is 0 (no compression). For directories: 0. | +| `encrypted_size` | 4 | u32 | LE | Size after AES-256-CBC encryption with PKCS7 padding. Formula: `((compressed_size / 16) + 1) * 16`. For directories: 0. | +| `data_offset` | 4 | u32 | LE | Absolute byte offset of this entry's data block from archive start. For directories: 0. | +| `iv` | 16 | bytes | - | Random AES-256-CBC initialization vector for this file. For directories: zero-filled. | +| `hmac` | 32 | bytes | - | HMAC-SHA-256 over `iv || ciphertext`. See Section 7 for details. For directories: zero-filled. | +| `sha256` | 32 | bytes | - | SHA-256 hash of the original file content (before compression and encryption). For directories: zero-filled. | +| `compression_flag` | 1 | u8 | - | `0` = raw (no compression), `1` = gzip compressed. For directories: 0. | | `padding_after` | 2 | u16 | LE | Number of decoy padding bytes after this file's data block. Always `0` when flags bit 3 (decoy_padding) is off. | +### Entry Type Values + +| Value | Name | Description | +|-------|------|-------------| +| `0x00` | File | Regular file. Has associated data block with ciphertext. All size fields and data_offset are meaningful. | +| `0x01` | Directory | Directory entry. `original_size`, `compressed_size`, `encrypted_size` are all 0. `data_offset` is 0. `iv` is zero-filled. `hmac` is zero-filled. `sha256` is zero-filled. `compression_flag` is 0. No data block exists for this entry. 
| + +### Permission Bits Layout + +| Bits | Mask | Name | Description | +|------|------|------|-------------| +| 11 | `0o4000` | setuid | Set user ID on execution | +| 10 | `0o2000` | setgid | Set group ID on execution | +| 9 | `0o1000` | sticky | Sticky bit | +| 8-6 | `0o0700` | owner | Owner read(4)/write(2)/execute(1) | +| 5-3 | `0o0070` | group | Group read(4)/write(2)/execute(1) | +| 2-0 | `0o0007` | other | Other read(4)/write(2)/execute(1) | + +Common examples: `0o755` (rwxr-xr-x) = `0x01ED`, `0o644` (rw-r--r--) = `0x01A4`, `0o700` (rwx------) = `0x01C0`. + +### Entry Name Semantics + +- Names are relative paths from the archive root, using `/` as separator. +- Example: a file at `project/src/main.rs` has name `project/src/main.rs`. +- A directory entry for `project/src/` has name `project/src` (no trailing slash). +- Names MUST NOT start with `/` (no absolute paths). +- Names MUST NOT contain `..` components (no directory traversal). +- The encoder MUST sort entries so that directory entries appear before any files within them (parent-before-child ordering). This allows the decoder to `mkdir -p` or create directories in a single sequential pass. + ### Entry Size Formula -Each file table entry has a total size of: +Each TOC entry has a total size of: ``` -entry_size = 2 + name_length + 4 + 4 + 4 + 4 + 16 + 32 + 32 + 1 + 2 - = 101 + name_length bytes +entry_size = 2 + name_length + 1 + 2 + 4 + 4 + 4 + 4 + 16 + 32 + 32 + 1 + 2 + = 104 + name_length bytes ``` ### File Table Total Size @@ -170,7 +202,7 @@ entry_size = 2 + name_length + 4 + 4 + 4 + 4 + 16 + 32 + 32 + 1 + 2 The total file table size is the sum of all entry sizes: ``` -toc_size = SUM(101 + name_length_i) for i in 0..file_count-1 +toc_size = SUM(104 + name_length_i) for i in 0..entry_count-1 ``` When TOC encryption (flags bit 1) is active, the encrypted TOC size includes PKCS7 padding: @@ -185,7 +217,7 @@ The `toc_size` field in the header stores the **actual size on disk** (encrypted ## 6. 
Data Block Layout -Each file has a single contiguous data block containing **only the ciphertext** (the AES-256-CBC encrypted output). +Each file entry has a single contiguous data block containing **only the ciphertext** (the AES-256-CBC encrypted output). Directory entries (`entry_type = 0x01`) have no data block. The decoder MUST skip directory entries when processing data blocks. ``` [ciphertext: encrypted_size bytes] @@ -402,10 +434,10 @@ The following steps MUST be followed in order by all decoders: 3. Parse header fields: - Verify magic == 0x00 0xEA 0x72 0x63 - - Read version (must be 1) + - Read version (must be 2 for v1.1) - Read flags - Check for unknown flag bits (bits 4-7 must be 0; reject if not) - - Read file_count + - Read entry_count - Read toc_offset, toc_size, toc_iv 4. Read TOC: @@ -414,54 +446,67 @@ The following steps MUST be followed in order by all decoders: c. If flags bit 1 (toc_encrypted) is set: - Decrypt TOC with AES-256-CBC using toc_iv and the 32-byte key. - Remove PKCS7 padding. - d. Parse file_count entries sequentially from the (decrypted) TOC bytes. + d. Parse entry_count entries sequentially from the (decrypted) TOC bytes. -5. For each file entry (i = 0 to file_count - 1): - a. Read ciphertext: +5. For each entry (i = 0 to entry_count - 1): + a. Check entry_type. If 0x01 (directory): create the directory using the entry + name as a relative path, apply permissions from the `permissions` field, + and skip to the next entry (no ciphertext to read). + + b. Read ciphertext (file entries only): - Seek to data_offset. - Read encrypted_size bytes. - b. Verify HMAC: + c. Verify HMAC: - Compute HMAC-SHA-256(key, iv || ciphertext). - Compare with stored hmac (32 bytes). - If mismatch: REJECT this file. Do NOT attempt decryption. - c. Decrypt: + d. Decrypt: - Decrypt ciphertext with AES-256-CBC using entry's iv and the 32-byte key. - Remove PKCS7 padding. - Result = compressed_data (or raw data if compression_flag = 0). - d. 
Decompress (if compression_flag = 1): + e. Decompress (if compression_flag = 1): - Decompress with gzip. - Result = original file content. - e. Verify integrity: + f. Verify integrity: - Compute SHA-256 of the decompressed/raw result. - Compare with stored sha256 (32 bytes). - If mismatch: WARN (data corruption or wrong key). - f. Write to output: + g. Write to output: + - Create parent directories as needed (using the path components of the entry name). - Create output file using stored name. - Write the verified content. + - Apply permissions from the entry's `permissions` field. ``` --- ## 11. Version Compatibility Rules -1. **Version field:** The `version` field at offset `0x04` identifies the format version. This specification defines version `1`. +1. **Version field:** The `version` field at offset `0x04` identifies the format version. This specification defines version `2` (v1.1). Version `1` was the original v1.0 format (no directory support, no entry_type/permissions fields). -2. **Forward compatibility:** Decoders MUST reject archives with `version` greater than their supported version. A v1 decoder encountering `version = 2` MUST fail with a clear error message. +2. **Version 2 changes from version 1:** + - TOC entries now include `entry_type` (1 byte) and `permissions` (2 bytes) fields after `name` and before `original_size`. + - Entry size formula changed from `101 + name_length` to `104 + name_length`. + - `file_count` header field renamed to `entry_count` (same offset, same type; directories count as entries). + - Entry names are relative paths with `/` separator (not filename-only). + - Entries are ordered parent-before-child (directories before their contents). -3. **Unknown flags:** Decoders MUST reject archives that have any reserved flag bits (bits 4-7) set to `1`. Unknown flags indicate features the decoder does not understand and cannot safely skip. Silent ignoring of unknown flags is prohibited. +3. 
**Forward compatibility:** Decoders MUST reject archives with `version` greater than their supported version. A v2 decoder encountering `version = 3` MUST fail with a clear error message. -4. **Future versions:** Version 2+ MAY: +4. **Unknown flags:** Decoders MUST reject archives that have any reserved flag bits (bits 4-7) set to `1`. Unknown flags indicate features the decoder does not understand and cannot safely skip. Silent ignoring of unknown flags is prohibited. + +5. **Future versions:** Version 3+ MAY: - Add fields after the `reserved` bytes in the header (growing header size). - Define new flag bits (bits 4-7). - Change the `reserved` field to carry metadata. - Introduce HKDF-derived per-file keys (replacing single shared key). -5. **Backward compatibility:** Future versions SHOULD maintain the same magic bytes and the same position of the `version` field (offset `0x04`) so that decoders can read the version before deciding how to proceed. +6. **Backward compatibility:** Future versions SHOULD maintain the same magic bytes and the same position of the `version` field (offset `0x04`) so that decoders can read the version before deciding how to proceed. ---
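
The version and flag rules above can be sketched as a header check that runs before any TOC parsing. This is a hedged illustration under stated assumptions: `parse_header`, `SUPPORTED_VERSION`, and `RESERVED_FLAG_MASK` are hypothetical names, and rejecting every version other than `2` follows the decode order in Section 10 (a decoder that also accepts legacy version `1` would need a separate code path not shown here).

```python
import struct

MAGIC = b"\x00\xEA\x72\x63"
SUPPORTED_VERSION = 2       # this specification (v1.1)
RESERVED_FLAG_MASK = 0xF0   # bits 4-7 are reserved and must be 0
HEADER_SIZE = 40


def parse_header(buf: bytes) -> dict:
    """Parse the fixed 40-byte header and apply the compatibility rules."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("header is truncated (need 40 bytes)")
    # Layout: magic(4) ver(1) flags(1) entry_count(2) toc_offset(4)
    #         toc_size(4) toc_iv(16) reserved(8), all integers little-endian.
    (magic, version, flags, entry_count,
     toc_offset, toc_size, toc_iv, _reserved) = struct.unpack(
        "<4sBBHII16s8s", buf[:HEADER_SIZE])
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    # The version sits at offset 0x04, so it can be checked before anything else.
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported format version {version}")
    # Unknown flags must be rejected, never silently ignored.
    if flags & RESERVED_FLAG_MASK:
        raise ValueError(f"reserved flag bits set: 0x{flags:02X}")
    return {
        "version": version,
        "flags": flags,
        "entry_count": entry_count,
        "toc_offset": toc_offset,
        "toc_size": toc_size,
        "toc_iv": toc_iv,
    }
```

Failing early with a distinct message for each case (magic, version, flags) gives the "clear error message" the forward-compatibility rule asks for, and keeps the decoder from interpreting a TOC whose entry layout it does not understand.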