A minimal, single-process key–value storage engine built to explore durability, crash behaviour, concurrency correctness, and storage semantics from first principles.
This project intentionally avoids distribution, networking, and performance optimisations, in order to focus more on correctness, failure modes, and reasoning about state under concurrency.
The system exposes a simple key–value API:
- `get(key)` → value | `None`
- `put(key, value)`
- `delete(key)`
Internally, the store maintains an in-memory map and persists state to disk to survive restarts.
The model I've chosen is essentially a single-owner concurrency model: many threads can interact with the same process, but only one thread has exclusive access to the database at any given time. This protects the in-memory state from concurrent writes, and makes operations appear to execute sequentially even when they are issued concurrently.
Additionally, both reads and writes are synchronised, so no operation can observe a state that is mid-write.
- Data stored entirely in memory
- Defines stable API semantics:
  - missing keys return `None`
  - deleting a missing key is a no-op
- No persistence
- Restart loses all data
This stage exists to lock down behavioural guarantees before introducing storage.
- Entire in-memory state is serialized to disk after each mutation
- State is loaded from disk on startup
- Clean shutdown + restart restores all data
- Disk is treated as a mirror of memory (memory is authoritative)
This stage provides persistence but is not crash-safe.
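A minimal sketch of this stage could look like the following. The class name `SnapshotStore` and the JSON-on-disk format are illustrative assumptions, not the project's exact implementation:

```python
import json
import os


class SnapshotStore:
    """Sketch of the snapshot stage: memory is authoritative, disk mirrors it."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):          # load persisted state on startup
            with open(path) as f:
                self.data = json.load(f)

    def _persist(self):
        # The entire in-memory state is serialized after each mutation.
        with open(self.path, "w") as f:
            json.dump(self.data, f)

    def get(self, key):
        return self.data.get(key)         # missing keys return None

    def put(self, key, value):
        self.data[key] = value
        self._persist()

    def delete(self, key):
        self.data.pop(key, None)          # deleting a missing key is a no-op
        self._persist()
```

Note that a crash mid-`_persist` can leave a partially written file on disk, which is exactly the gap the next stage closes.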
- Writes are performed to a temporary file, never directly to the main database file
- Each write is made durable with `flush` + `fsync` before becoming visible
- The main database file is replaced atomically using a filesystem rename
- After a crash, the database is always in a previously committed state
- Temporary files are ignored on startup and never required for recovery
This stage guarantees atomicity and durability, but not write ordering or history preservation.
- All operations (get, put, delete) happen with exclusive ownership of the database
- Concurrent access behaves as if operations were executed sequentially in some total order
- Both read and write operations are blocked while another operation is in progress, preventing any operation from seeing a partial or uncommitted state
- Operations appear to happen instantaneously somewhere between their start and end times
This stage guarantees linearizable behaviour under concurrent access within a single process.
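A sketch of the single-owner model, assuming a coarse `threading.Lock` around every operation (class name hypothetical):

```python
import threading


class LockedStore:
    """One lock serializes every operation, so concurrent calls behave as if
    executed in some total order (linearizable within a single process)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:              # reads also take the lock:
            return self._data.get(key)  # no mid-write state is ever visible

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)
```

The trade-off is throughput: readers and writers all contend on one lock, which is exactly the "fine-grained or lock-free concurrency" listed later as a non-goal.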
- Snapshots are no longer trusted as the sole source of truth; the WAL file is relied on to guarantee ordering and durability
- A WAL (write-ahead log) file records every write to the database (`put` or `delete`), each made durable using `flush()` followed by `fsync()`
- On restart, the database is restored change by change: the most recent snapshot is loaded, then the WAL is replayed on top of it, stopping at the first invalid record (assumed to be the result of a crash or power loss)
This stage focuses on durability, write ordering and recovery from a crash without needing to write an entire snapshot each time.
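The append and replay paths can be sketched as below, assuming one JSON record per WAL line (the record layout is an assumption, not the project's exact format):

```python
import json
import os


def wal_append(wal, op, key, value=None):
    """Append one record and make it durable before acknowledging the write."""
    record = json.dumps({"op": op, "key": key, "value": value})
    wal.write(record + "\n")
    wal.flush()                 # Python buffer -> OS
    os.fsync(wal.fileno())      # OS -> disk: an acknowledged write survives a crash


def wal_replay(path, state):
    """Re-apply records in order, stopping at the first invalid record
    (a torn tail left by a crash mid-append)."""
    try:
        with open(path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break                              # truncated tail: ignore the rest
                if rec["op"] == "put":
                    state[rec["key"]] = rec["value"]
                elif rec["op"] == "delete":
                    state.pop(rec["key"], None)        # replay is idempotent
    except FileNotFoundError:
        pass                                           # no WAL yet: nothing to replay
    return state
```

Because each record fully specifies its effect, replaying it twice leaves the same state, which is what makes recovery after a partial replay safe.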
- Snapshots are now treated as explicit checkpoints that end up being written atomically
- A checkpoint metadata file records the WAL byte offset after the most recent snapshot
- On restart, WAL replay begins after the last checkpoint offset rather than from the start, so recovery time is proportional to how much the WAL has grown since the last checkpoint, not to the entire history
- Corrupted WAL suffixes beyond the checkpoint are safely ignored
- WAL replays are idempotent
This stage focuses on decreasing recovery time.
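The checkpoint mechanics can be sketched as follows. The metadata filename, its JSON layout, and the `wal_offset` field are illustrative assumptions:

```python
import json
import os


def load_checkpoint(meta_path):
    """Read the WAL byte offset recorded when the last snapshot was taken."""
    try:
        with open(meta_path) as f:
            return json.load(f)["wal_offset"]
    except FileNotFoundError:
        return 0                          # no checkpoint yet: replay the whole WAL


def replay_from_checkpoint(wal_path, meta_path, state):
    """Seek past records already captured by the snapshot, so recovery cost
    tracks WAL growth since the last checkpoint, not total history."""
    offset = load_checkpoint(meta_path)
    try:
        with open(wal_path) as f:
            f.seek(offset)
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break                 # corrupted suffix beyond checkpoint: ignore
                if rec["op"] == "put":
                    state[rec["key"]] = rec["value"]
                else:
                    state.pop(rec["key"], None)
    except FileNotFoundError:
        pass
    return state
```

Writing the metadata file with the same atomic temp-file-plus-rename pattern as snapshots keeps the checkpoint itself crash-safe.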
- Measuring end-to-end latency using a monotonic clock
- Recording put latency distributions (p50, p95, p99)
- No changes to actual functionality
This stage focuses on measuring performance.
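A sketch of the measurement approach, assuming any store object exposing `put` (the helper name and percentile selection are illustrative; `time.perf_counter_ns` is Python's monotonic high-resolution clock):

```python
import statistics
import time


def measure_put_latency(store, n=1000):
    """Time each put with a monotonic clock (immune to wall-clock jumps)
    and report percentile latencies in nanoseconds."""
    samples = []
    for i in range(n):
        start = time.perf_counter_ns()        # monotonic, nanosecond resolution
        store.put(f"key-{i}", i)
        samples.append(time.perf_counter_ns() - start)
    samples.sort()
    q = statistics.quantiles(samples, n=100)  # q[k-1] ~ the k-th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98], "max": samples[-1]}
```

Using a monotonic clock matters here: wall-clock time (`time.time()`) can jump backwards under NTP adjustments, producing negative or skewed latency samples.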
Below you can see two tables regarding the performance of the program. The first table shows the latency of each key operation; the second shows the p1, p50, and p99 of each operation.
| Name | Min (ns) | Max (ns) | Mean (ns) | StdDev (ns) | Median (ns) | IQR (ns) | Outliers | OPS (Kops/s) | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_get_latency | 174.3707 (1.0) | 50,026.2206 (1.0) | 224.5733 (1.0) | 126.0082 (1.0) | 220.6652 (1.0) | 18.5532 (1.0) | 2130;3935 | 4,452.8888 | 195,122 | 27 |
| test_put_latency | 25,458.9831 (146.00) | 1,790,000.0094 (35.78) | 214,571.3537 (955.46) | 387,432.6197 (>1000.0) | 36,625.0169 (165.98) | 10,499.9635 (565.94) | 650;787 | 4.6605 | 3,910 | 1 |
| test_delete_latency | 29,625.0219 (169.90) | 38,867,915.9805 (776.95) | 257,314.1452 (>1000.0) | 715,234.9094 (>1000.0) | 36,999.9907 (167.67) | 7,583.0030 (408.72) | 1631;3975 | 3.8863 | 19,786 | 1 |
Legend:
- Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
- OPS: Operations Per Second, computed as 1 / Mean
These tests can be found in `tests/test_benchmarking.py`.
| Operation | p1 | p50 | p99 | Max | Iterations |
|---|---|---|---|---|---|
| GET | 211 ns | 229 ns | 260 ns | 1,489 ns | 198,373 |
| PUT | 28,208 ns | 37,042 ns | 1,042,404 ns | 1,850,166 ns | 2,336 |
| DELETE | 32,874 ns | 37,833 ns | 3,111,150 ns | 42,740,790 ns | 22,202 |
Total processing time: around 0.219 seconds
More information can be found in DESIGN.md.
| Category | Guaranteed at this stage? | Notes |
|---|---|---|
| API semantics | ✅ | get returns None for missing keys; put overwrites; delete is idempotent |
| In-memory correctness | ✅ | In-memory state reflects the latest committed WAL entry |
| Persistence across clean restart | ✅ | Snapshot + WAL replay restores full state |
| Disk ↔ memory consistency | ✅ | Memory is reconstructed from snapshot + WAL, not snapshots alone |
| Valid on-disk format | ✅ | WAL is append-only; snapshots are atomically replaced |
| Crash safety | ✅ | Recovery proceeds up to the last valid WAL record |
| Atomicity under crash | ✅ | A change/mutation is either fully applied or not applied after restart |
| Concurrency safety | ✅ | Thread-safe, linearizable access within a single process |
| Linearizable access | ✅ | All operations appear totally ordered |
| Write ordering guarantees | ✅ | WAL preserves write order |
| History preservation | ✅ | All acknowledged writes survive crashes |
| Partial-write tolerance | ✅ | Truncated or corrupted WAL suffixes are safely ignored |
| Reduced recovery time | ✅ | WAL replay starts after last checkpoint |
| Performance guarantees | ❌ | WAL fsync on every write; no batching |
| Transactions | ❌ | Single-key operations only |
| Asynchronous writes | ❌ | Writes are synchronous and blocking |
| Background flushing | ❌ | Snapshots are not generated in the background incrementally |
- Persistence is implemented via Write-ahead logging. Snapshots act as checkpoints, not as the primary source of truth
- On startup:
- Snapshot is loaded from disk
- WAL is replayed from the snapshot offset to the end
- In-memory state is reconstructed from the snapshot + WAL
- The order of writing to the store is: call `put`/`delete` -> append to the WAL file -> flush/fsync -> modify the tempfile -> rename the tempfile to the main file -> update the checkpoint if needed
```python
db = KVStore("db.json")
db.put("A", 0)
db.delete("A")
print(db.get("A"))  # None
```
- Fine-grained or lock-free concurrency
  - We've built a single-owner, coarse-grained locking system instead
- Asynchronous writes
- Truncating or compacting the WAL, which currently grows without bound
- Transactions
- Distributed systems
  - The system is intentionally a single process on a single node, so it is not distributed by definition
- Replication or sharding
- Query languages
- https://www.geeksforgeeks.org/python/python-os-fsync-method/ to learn about fsync() and how to use it with file objects
- https://www.geeksforgeeks.org/python/conftest-in-pytest/ to learn about conftest.py
- https://www.geeksforgeeks.org/python/multithreading-python-set-1/ to learn about threads
- https://stackoverflow.com/questions/3310049/proper-use-of-mutexes-in-python to learn about mutexes
- https://maxnilz.com/docs/006-arch/003-concurrency-protocol/#:~:text=Q:%20Compare%20different%20Concurrency%20models,conditions%20and%20ensure%20data%20integrity to learn about the different types of concurrency models
- https://www.architecture-weekly.com/p/the-write-ahead-log-a-foundation to learn about write-ahead logging
- https://medium.com/@vinciabhinav7/write-ahead-logs-but-why-494c3efd722d to understand why use WAL in the first place
- System Design Interview An Insider’s Guide by Alex Yu.pdf, chapter 6 in particular, to learn about standards for building a key-value store
- https://medium.com/@jatinumamtora/a-deep-dive-into-write-ahead-logging-wal-in-database-engines-recovery-71f6d98f0e23 and https://blog.algomaster.io/p/designing-a-distributed-key-value-store to better understand how to use checkpointing/ARIES
- https://www.systemdesignhandbook.com/guides/design-a-key-value-store/#2-snapshots-and-checkpointing to gain a better understanding of how a key-value store is meant to work, including snapshots and checkpointing
- Pytest docs to learn about using fixtures and pytest-benchmark