Storage Engine
The storage engine is an append-only WAL (write-ahead log) where the log IS the database. Every write appends a new entry. The file only grows (until garbage collection reclaims unreachable entries). This design gives you crash recovery, versioning, and integrity verification as structural properties rather than bolted-on features.
Entry Format
Every entry on disk shares the same header format:
[Entry Header - 31 bytes fixed + hash_length variable]
magic: u32 (0x0AE012DB - marks the start of a valid entry)
entry_version: u8 (format version, starting at 1)
entry_type: u8 (Chunk, FileRecord, DirectoryIndex, etc.)
flags: u8 (operational flags)
hash_algo: u16 (BLAKE3_256 = 0x0001, SHA256 = 0x0002, etc.)
compression_algo: u8 (None = 0x00, Zstd = 0x01)
encryption_algo: u8 (None = 0x00, reserved for future use)
key_length: u32 (length of the key field)
value_length: u32 (length of the value field)
timestamp: i64 (UTC milliseconds since epoch)
total_length: u32 (total bytes including header, for jump-scanning)
hash: [u8; N] (integrity hash, N determined by hash_algo)
[Key - key_length bytes]
[Value - value_length bytes]
Key properties:
- magic (0x0AE012DB) enables recovery scanning: find entry boundaries even in a corrupted file by scanning for magic bytes
- total_length enables jump-scanning: skip to the next entry without reading the full key/value
- hash covers entry_type + key + value: re-hash and compare to detect corruption
- entry_version enables format evolution: the engine selects the correct parser based on this byte
For BLAKE3-256 (the default), the hash is 32 bytes, making the full header 63 bytes.
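The layout above can be sketched with Python's `struct` module. This is a minimal illustration, not the engine's actual code: little-endian byte order, the numeric value of `ENTRY_TYPE_CHUNK`, and the single-byte encoding of `entry_type` inside the hash input are all assumptions, and SHA-256 (hash_algo 0x0002) stands in for BLAKE3 because BLAKE3 is not in the Python standard library.

```python
import hashlib
import struct

MAGIC = 0x0AE012DB
# Fixed header fields in the order listed above. Little-endian with no
# padding is an assumption; the text does not state the byte order.
FIXED_HEADER = struct.Struct("<IBBBHBBIIqI")
HASH_ALGO_SHA256 = 0x0002   # stand-in for the default BLAKE3_256 (0x0001)
HASH_LEN = 32               # both SHA-256 and BLAKE3-256 produce 32 bytes
ENTRY_TYPE_CHUNK = 1        # hypothetical numeric value for the Chunk type


def entry_hash(entry_type: int, key: bytes, value: bytes) -> bytes:
    # The hash covers entry_type + key + value. Encoding entry_type as a
    # single byte is an assumption of this sketch.
    return hashlib.sha256(bytes([entry_type]) + key + value).digest()


def encode_entry(entry_type: int, key: bytes, value: bytes, timestamp_ms: int) -> bytes:
    total_length = FIXED_HEADER.size + HASH_LEN + len(key) + len(value)
    fixed = FIXED_HEADER.pack(
        MAGIC, 1, entry_type, 0,      # magic, entry_version, entry_type, flags
        HASH_ALGO_SHA256, 0, 0,       # hash_algo, compression_algo, encryption_algo
        len(key), len(value), timestamp_ms, total_length,
    )
    return fixed + entry_hash(entry_type, key, value) + key + value


def decode_entry(buf: bytes) -> dict:
    fields = FIXED_HEADER.unpack_from(buf, 0)
    magic, entry_type, key_len, value_len = fields[0], fields[2], fields[7], fields[8]
    if magic != MAGIC:
        raise ValueError("bad magic")
    hash_start = FIXED_HEADER.size
    key_start = hash_start + HASH_LEN
    value_start = key_start + key_len
    stored_hash = buf[hash_start:key_start]
    key = buf[key_start:value_start]
    value = buf[value_start:value_start + value_len]
    # Re-hash and compare to detect corruption.
    if entry_hash(entry_type, key, value) != stored_hash:
        raise ValueError("hash mismatch")
    return {"entry_type": entry_type, "key": key, "value": value,
            "total_length": fields[10]}


entry = encode_entry(ENTRY_TYPE_CHUNK, b"k", b"value", 0)
decoded = decode_entry(entry)
```

With a 1-byte key and a 5-byte value, the encoded entry is 63 + 1 + 5 = 69 bytes, matching the 63-byte header size stated above.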
Content-Addressed Hashing
Every piece of data is identified by its BLAKE3 hash. Hash inputs are prefixed by type (domain separation) to prevent collisions between different entry types:
| Entry Type | Hash Input | Example |
|---|---|---|
| Chunk | chunk: + raw bytes | BLAKE3("chunk:" + file_bytes) |
| FileRecord (path key) | file: + path | BLAKE3("file:/users/alice.json") |
| FileRecord (content key) | filec: + serialized record | BLAKE3("filec:" + record_bytes) |
| DirectoryIndex (path key) | dir: + path | BLAKE3("dir:/users/") |
| DirectoryIndex (content key) | dirc: + serialized data | BLAKE3("dirc:" + dir_bytes) |
The domain prefix makes the hash inputs differ, so a chunk’s raw data does not hash to the same value as a file path, even if the bytes are identical.
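A minimal sketch of domain-separated hashing, assuming the prefix is simply concatenated in front of the payload as the table suggests. The function names are hypothetical, and SHA-256 stands in for BLAKE3 so the example runs with the standard library alone.

```python
import hashlib


def domain_hash(prefix: str, payload: bytes) -> bytes:
    """Hash a payload under a type prefix. SHA-256 is a stand-in for BLAKE3."""
    return hashlib.sha256(prefix.encode("utf-8") + payload).digest()


def chunk_key(raw: bytes) -> bytes:
    return domain_hash("chunk:", raw)


def file_path_key(path: str) -> bytes:
    return domain_hash("file:", path.encode("utf-8"))


def dir_path_key(path: str) -> bytes:
    return domain_hash("dir:", path.encode("utf-8"))


# A chunk whose raw bytes happen to spell out a path still gets its own key.
same_bytes = b"/users/alice.json"
chunk_digest = chunk_key(same_bytes)
path_digest = file_path_key("/users/alice.json")
```

Here `chunk_digest` and `path_digest` differ even though the payload bytes are the same, because the hashed inputs start with different prefixes.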
Chunking
Files are split into 256KB chunks for storage. Each chunk is content-addressed independently:
Original file (700KB):
[Chunk 1: 256KB] -> hash_a
[Chunk 2: 256KB] -> hash_b
[Chunk 3: 188KB] -> hash_c
FileRecord:
path: "/docs/report.pdf"
chunk_hashes: [hash_a, hash_b, hash_c]
total_size: 700KB
Chunking provides:
- Deduplication: Two files sharing identical 256KB blocks store those blocks only once
- Efficient updates: Overwriting 3 bytes of a 10GB file in place creates one new chunk (two if the edit straddles a chunk boundary), not a new copy of the entire file
- Streaming reads: Read a file by iterating its chunk hashes and fetching each chunk
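The three properties above can be seen in a small sketch. It assumes fixed-size splitting at 256KB boundaries and uses an in-memory dict as a hypothetical chunk store, with SHA-256 standing in for BLAKE3.

```python
import hashlib

CHUNK_SIZE = 256 * 1024


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_file(store: dict, data: bytes) -> list:
    """Store each chunk under its content hash and return the ordered hashes."""
    hashes = []
    for chunk in split_into_chunks(data):
        digest = hashlib.sha256(b"chunk:" + chunk).digest()
        store.setdefault(digest, chunk)  # an identical chunk is stored only once
        hashes.append(digest)
    return hashes


def read_file(store: dict, hashes: list) -> bytes:
    """Streaming read: fetch each chunk in order by its hash."""
    return b"".join(store[h] for h in hashes)


store = {}
# A 700KB file: two full chunks and a 188KB tail.
original = b"a" * CHUNK_SIZE + b"b" * CHUNK_SIZE + b"c" * (188 * 1024)
hashes = store_file(store, original)

# Overwrite 3 bytes inside the second chunk and store the result again.
edited = bytearray(original)
edited[300 * 1024:300 * 1024 + 3] = b"xyz"
new_hashes = store_file(store, bytes(edited))
```

After the second `store_file` call the store holds four chunks rather than six: the first and third chunks are shared between both versions, and only the modified second chunk is new.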
Dual-Key FileRecords
FileRecords are stored at two keys to support both current reads and historical versioning:
- Path key (file:/path): mutable, always points to the latest version. Used for reads, metadata, indexing, and deletion. O(1) lookup.
- Content key (filec: + serialized record): immutable, content-addressed. The directory tree’s ChildEntry.hash points to this key.
When the version manager walks a snapshot’s directory tree, it follows ChildEntry.hash to the content key, which resolves to the FileRecord as it existed at snapshot time – not the current version. This is what makes historical reads correct.
Directories use the same pattern: dir:/path (mutable) and dirc: + data (immutable content key).
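To illustrate why both keys are needed, here is a minimal sketch using an in-memory dict. `RecordStore`, its method names, and the JSON serialization are hypothetical stand-ins (the real record format is binary), and SHA-256 stands in for BLAKE3.

```python
import hashlib
import json


def key_for(prefix: str, payload: bytes) -> bytes:
    return hashlib.sha256(prefix.encode("utf-8") + payload).digest()


class RecordStore:
    def __init__(self):
        self.entries = {}

    def put(self, path: str, record: dict) -> bytes:
        # JSON is a placeholder serialization for this sketch.
        record_bytes = json.dumps(record, sort_keys=True).encode("utf-8")
        content_key = key_for("filec:", record_bytes)
        # Content key: written once, never overwritten by later versions.
        self.entries[content_key] = record_bytes
        # Path key: always replaced, so it points at the latest version.
        self.entries[key_for("file:", path.encode("utf-8"))] = record_bytes
        return content_key

    def read_latest(self, path: str) -> dict:
        return json.loads(self.entries[key_for("file:", path.encode("utf-8"))])

    def read_version(self, content_key: bytes) -> dict:
        return json.loads(self.entries[content_key])


store = RecordStore()
v1_key = store.put("/users/alice.json", {"total_size": 10, "chunks": ["aa"]})
snapshot_child_hash = v1_key  # what a snapshot's ChildEntry.hash would hold
store.put("/users/alice.json", {"total_size": 12, "chunks": ["bb"]})

latest = store.read_latest("/users/alice.json")
historical = store.read_version(snapshot_child_hash)
```

`latest` reflects the second write, while `historical` still resolves to the record as it was when the snapshot captured its content key.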
FileRecord Format
[FileRecord Value]
path_length: u16
path: [u8; path_length] (full file path)
content_type_len: u16
content_type: [u8; content_type_len] (MIME type)
total_size: u64 (file size in bytes)
created_at: i64 (UTC milliseconds)
updated_at: i64 (UTC milliseconds)
metadata_length: u32
metadata: [u8; metadata_length] (arbitrary JSON metadata)
chunk_count: u32
chunk_hashes: [u8; chunk_count * 32] (ordered BLAKE3 hashes)
Metadata fields come first so you can read file metadata without skipping past the chunk list. Chunk hashes form the tail of the record, so a streaming reader can consume them in order once the metadata is parsed.
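The field order can be exercised with a short encoder and a metadata-only decoder. This is a sketch under assumptions: little-endian byte order and UTF-8 strings are not stated in the layout above, and the function names are hypothetical.

```python
import struct

HASH_LEN = 32  # per the layout above: chunk_count * 32 bytes


def encode_file_record(path, content_type, total_size, created_at,
                       updated_at, metadata, chunk_hashes) -> bytes:
    path_b = path.encode("utf-8")
    ctype_b = content_type.encode("utf-8")
    out = bytearray()
    out += struct.pack("<H", len(path_b)) + path_b
    out += struct.pack("<H", len(ctype_b)) + ctype_b
    out += struct.pack("<Qqq", total_size, created_at, updated_at)
    out += struct.pack("<I", len(metadata)) + metadata
    out += struct.pack("<I", len(chunk_hashes)) + b"".join(chunk_hashes)
    return bytes(out)


def decode_metadata(buf: bytes) -> dict:
    """Parse only the fields ahead of the chunk list."""
    off = 0
    (path_len,) = struct.unpack_from("<H", buf, off)
    off += 2
    path = buf[off:off + path_len].decode("utf-8")
    off += path_len
    (ctype_len,) = struct.unpack_from("<H", buf, off)
    off += 2
    content_type = buf[off:off + ctype_len].decode("utf-8")
    off += ctype_len
    total_size, created_at, updated_at = struct.unpack_from("<Qqq", buf, off)
    off += 24
    (meta_len,) = struct.unpack_from("<I", buf, off)
    off += 4
    metadata = buf[off:off + meta_len]
    off += meta_len
    return {"path": path, "content_type": content_type, "total_size": total_size,
            "created_at": created_at, "updated_at": updated_at,
            "metadata": metadata, "chunks_offset": off}


def iter_chunk_hashes(buf: bytes, chunks_offset: int):
    """Yield chunk hashes one at a time from the tail of the record."""
    (count,) = struct.unpack_from("<I", buf, chunks_offset)
    start = chunks_offset + 4
    for i in range(count):
        yield buf[start + i * HASH_LEN:start + (i + 1) * HASH_LEN]


record = encode_file_record(
    "/docs/report.pdf", "application/pdf", 700 * 1024, 1, 2,
    b'{"owner":"alice"}', [b"\x01" * 32, b"\x02" * 32, b"\x03" * 32],
)
meta = decode_metadata(record)
chunk_hashes = list(iter_chunk_hashes(record, meta["chunks_offset"]))
```

`decode_metadata` never touches the chunk list; it only reports where the list starts, so a reader that needs just the MIME type or size stops there.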
Directory Propagation
When a file is stored or deleted, the change propagates up the directory tree:
Store /users/alice.json:
1. Store chunks -> [hash_a, hash_b]
2. Store FileRecord at path key + content key
3. Update /users/ DirectoryIndex (new ChildEntry for alice.json)
4. Update / root DirectoryIndex (new ChildEntry for users/)
5. Update HEAD in file header (new root hash)
Each directory gets a new content hash because one of its children changed. This chain of updates from leaf to root is what maintains the Merkle tree and makes versioning work.
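The leaf-to-root chain can be sketched as follows. The directory representation (a dict of child name to child hash), the JSON serialization, and the function names are assumptions made for illustration, and SHA-256 stands in for BLAKE3.

```python
import hashlib
import json


def dir_hash(children: dict) -> bytes:
    # Placeholder serialization: sorted JSON of child name -> hex child hash.
    data = json.dumps({name: children[name].hex() for name in sorted(children)})
    return hashlib.sha256(b"dirc:" + data.encode("utf-8")).digest()


def parent_dirs(path: str):
    """Yield (dir_path, child_name) pairs from the leaf directory up to the root."""
    parts = path.strip("/").split("/")
    for i in range(len(parts) - 1, -1, -1):
        dir_path = "/" + "".join(p + "/" for p in parts[:i])
        # Directory children carry a trailing slash, as in "users/".
        child = parts[i] + ("/" if i < len(parts) - 1 else "")
        yield dir_path, child


def store_file_record(dirs: dict, path: str, record_hash: bytes) -> bytes:
    """Update every ancestor directory and return the new root hash (HEAD)."""
    child_hash = record_hash
    for dir_path, child in parent_dirs(path):
        children = dirs.setdefault(dir_path, {})
        children[child] = child_hash     # new ChildEntry for the changed child
        child_hash = dir_hash(children)  # this directory's new content hash
    return child_hash


dirs = {}
alice = hashlib.sha256(b"filec:alice-v1").digest()
bob = hashlib.sha256(b"filec:bob-v1").digest()
head_1 = store_file_record(dirs, "/users/alice.json", alice)
head_2 = store_file_record(dirs, "/users/bob.json", bob)
```

Adding `bob.json` changes the hash of `/users/`, which changes the root's ChildEntry for `users/`, which changes the root hash, so `head_1` and `head_2` identify two distinct versions of the tree.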
Void Management
When garbage collection reclaims an entry, the space becomes a Void – a marker for reclaimable space. Voids are tracked by size using deterministic hash keys:
Key: BLAKE3("::aeordb:void:262144")
Value: [list of file offsets where 262144-byte voids exist]
When a new entry needs to be written, the engine checks for a void of sufficient size before appending to the end of the file. If a void is larger than needed, it is split: the entry occupies the front, and a smaller void is created for the remainder, provided the remainder is at least 63 bytes (the minimum entry header size with the default BLAKE3-256 hash).
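A minimal sketch of the reuse-and-split rule. `VoidIndex` is a hypothetical in-memory structure keyed directly by size (the engine stores these lists under hashed keys as shown above), and the best-fit selection policy is an assumption; the text only says a void of sufficient size is used.

```python
MIN_VOID = 63  # minimum entry header size with a 32-byte hash


class VoidIndex:
    def __init__(self):
        self.by_size = {}  # void size -> list of file offsets

    def add(self, offset: int, size: int) -> None:
        self.by_size.setdefault(size, []).append(offset)

    def allocate(self, needed: int):
        """Return the offset of a reusable void, or None to append at end of file."""
        candidates = [size for size, offsets in self.by_size.items()
                      if size >= needed and offsets]
        if not candidates:
            return None
        size = min(candidates)  # best fit; the real policy is not specified
        offset = self.by_size[size].pop()
        if not self.by_size[size]:
            del self.by_size[size]
        remainder = size - needed
        if remainder >= MIN_VOID:
            # The entry occupies the front; the rest becomes a smaller void.
            self.add(offset + needed, remainder)
        # How a remainder under 63 bytes is handled is not covered by the
        # text, so this sketch simply does not track it.
        return offset


index = VoidIndex()
index.add(1000, 262144)
first = index.allocate(1000)
```

After this allocation the entry lands at offset 1000 and the index holds a single 261144-byte void starting at offset 2000.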
Compression
Compression is a post-hash transform:
Write: raw data -> hash -> compress -> store
Read: load -> decompress -> verify hash -> return
The hash is always computed on the raw uncompressed data. This preserves deduplication (same content = same hash regardless of compression) and integrity verification.
Each entry carries its own compression_algo byte, so compressed and uncompressed entries coexist in the same file. Currently, zstd is the only supported compression algorithm.
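The write and read paths can be sketched as below. Note the substitutions: `zlib` stands in for zstd and SHA-256 for BLAKE3, since neither zstd nor BLAKE3 is assumed to be available in the standard library. The function names and the tuple-based store are hypothetical.

```python
import hashlib
import zlib

COMPRESSION_NONE = 0x00
COMPRESSION_ON = 0x01  # 0x01 is Zstd in the entry header; zlib stands in here


def write_chunk(store: dict, raw: bytes, compress: bool) -> bytes:
    # Hash first, on the raw bytes, so the key does not depend on compression.
    key = hashlib.sha256(b"chunk:" + raw).digest()
    if compress:
        store[key] = (COMPRESSION_ON, zlib.compress(raw))
    else:
        store[key] = (COMPRESSION_NONE, raw)
    return key


def read_chunk(store: dict, key: bytes) -> bytes:
    algo, stored = store[key]
    raw = zlib.decompress(stored) if algo == COMPRESSION_ON else stored
    # Verify after decompressing: the hash was computed on uncompressed data.
    if hashlib.sha256(b"chunk:" + raw).digest() != key:
        raise ValueError("integrity check failed")
    return raw


raw = b"hello " * 1000
compressed_store, plain_store = {}, {}
key_compressed = write_chunk(compressed_store, raw, compress=True)
key_plain = write_chunk(plain_store, raw, compress=False)
```

Both writes produce the same key, which is what keeps deduplication working when some entries are compressed and others are not.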
fsync Strategy
Not all entries are equally important for durability:
| Data | fsync | Rationale |
|---|---|---|
| Chunks, FileRecords, DeletionRecords | Immediate | The truth – not rebuildable from other data |
| KV store, NVT, DirectoryIndex, Snapshots | Deferred | Derived data – can be rebuilt from a full entry scan |
This gives durability where it matters and performance where it doesn’t.
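One way to picture the split is a writer that syncs immediately for source-of-truth entries and defers the sync for derived ones. `LogWriter`, its `checkpoint` method, and the string entry-type names are hypothetical; when deferred entries are actually flushed is not described in the text.

```python
import os
import tempfile

# Entry types from the table above that cannot be rebuilt from other data.
IMMEDIATE = {"Chunk", "FileRecord", "DeletionRecord"}


class LogWriter:
    def __init__(self, path: str):
        self.file = open(path, "ab")
        self.dirty = False
        self.fsync_count = 0

    def append(self, entry_type: str, payload: bytes) -> None:
        self.file.write(payload)
        if entry_type in IMMEDIATE:
            self._sync()
        else:
            self.dirty = True  # derived data: sync later

    def _sync(self) -> None:
        self.file.flush()             # push Python's buffer to the OS
        os.fsync(self.file.fileno())  # ask the OS to persist it to disk
        self.fsync_count += 1
        self.dirty = False

    def checkpoint(self) -> None:
        if self.dirty:
            self._sync()

    def close(self) -> None:
        self.checkpoint()
        self.file.close()


with tempfile.TemporaryDirectory() as tmp:
    writer = LogWriter(os.path.join(tmp, "data.log"))
    writer.append("DirectoryIndex", b"derived")
    after_deferred = writer.fsync_count
    writer.append("Chunk", b"truth")
    after_immediate = writer.fsync_count
    writer.append("Snapshot", b"derived")
    writer.checkpoint()
    after_checkpoint = writer.fsync_count
    writer.close()
```

Because all entries share one file, an immediate sync also persists any deferred entries written before it, which is why `_sync` clears the dirty flag.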
Crash Recovery
The recovery hierarchy, from least to most damage:
| What’s Lost | Recovery Method |
|---|---|
| Nothing | Read HEAD from KV store, load directory index, ready |
| KV store only | Entry-by-entry scan, rebuild KV store, load latest directory index |
| Directory index only | Scan FileRecords + DeletionRecords, reconstruct from paths + timestamps |
| KV store + directory | Full entry scan, rebuild KV, reconstruct directory |
| Only chunks + FileRecords survive | Full data recovery, version history reconstructed via DeletionRecords |
The magic bytes at the start of every entry enable boundary detection even in partially corrupted files. The total_length field in each header enables efficient forward scanning.
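A recovery scan combining both mechanisms might look like the sketch below. It assumes the header layout from the Entry Format section with little-endian byte order, encodes `entry_type` as one byte in the hash input, and uses SHA-256 in place of BLAKE3; `make_entry` and `scan` are hypothetical names.

```python
import hashlib
import struct

MAGIC = 0x0AE012DB
FIXED = struct.Struct("<IBBBHBBIIqI")   # 31-byte fixed header, assumed little-endian
HASH_LEN = 32
MAGIC_BYTES = struct.pack("<I", MAGIC)


def make_entry(entry_type: int, key: bytes, value: bytes) -> bytes:
    total = FIXED.size + HASH_LEN + len(key) + len(value)
    digest = hashlib.sha256(bytes([entry_type]) + key + value).digest()
    header = FIXED.pack(MAGIC, 1, entry_type, 0, 2, 0, 0,
                        len(key), len(value), 0, total)
    return header + digest + key + value


def scan(buf: bytes):
    """Yield (offset, entry_type, key, value) for every entry that verifies."""
    pos = 0
    while True:
        pos = buf.find(MAGIC_BYTES, pos)
        if pos < 0 or pos + FIXED.size + HASH_LEN > len(buf):
            return
        fields = FIXED.unpack_from(buf, pos)
        entry_type, key_len, value_len, total = fields[2], fields[7], fields[8], fields[10]
        body = pos + FIXED.size + HASH_LEN
        end = pos + total
        if total == FIXED.size + HASH_LEN + key_len + value_len and end <= len(buf):
            digest = buf[pos + FIXED.size:body]
            key = buf[body:body + key_len]
            value = buf[body + key_len:end]
            if hashlib.sha256(bytes([entry_type]) + key + value).digest() == digest:
                yield pos, entry_type, key, value
                pos = end  # jump-scan: skip straight to the next entry
                continue
        pos += 1  # false magic or damaged entry: resume the byte-wise search


e1 = make_entry(1, b"a", b"first")
e2 = make_entry(1, b"b", b"second")
e3 = make_entry(1, b"c", b"third")
log = bytearray(e1 + b"\x00" * 10 + e2 + e3)
log[len(e1) + 10 + len(e2) - 1] ^= 0xFF   # damage the last byte of e2's value
recovered = [key for _, _, key, _ in scan(bytes(log))]
```

The damaged second entry fails its hash check and is skipped, while the entries on either side are recovered: the scanner falls back to searching for the next magic bytes whenever `total_length` cannot be trusted.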
Next Steps
- Architecture – high-level system overview
- Versioning – snapshots, forks, and the Merkle tree