A minimal, single-process key–value storage engine built to explore durability, crash behaviour, concurrency correctness, and storage semantics from first principles.
This project intentionally avoids distribution, networking, and performance optimisations, in order to focus more on correctness, failure modes, and reasoning about state under concurrency.
The system exposes a simple key–value API:
- `get(key)` → value | `None`
- `put(key, value)`
- `delete(key)`
Internally, the store maintains an in-memory map and persists state to disk to survive restarts.
The model I've chosen is essentially a single-owner concurrency model: many threads can interact with the same process, but only one thread has exclusive access to the database at any given time. This protects the in-memory state from concurrent writes, and makes operations appear to execute sequentially even when they are issued concurrently.
Additionally, both reads and writes are synchronised, so no operation can observe a state that is mid-write.
- Data stored entirely in memory
- Defines stable API semantics:
  - missing keys return `None`
  - deleting a missing key is a no-op
- No persistence
- Restart loses all data
This stage exists to lock down behavioural guarantees before introducing storage.
- Entire in-memory state is serialized to disk after each mutation
- State is loaded from disk on startup
- Clean shutdown + restart restores all data
- Disk is treated as a mirror of memory (memory is authoritative)
This stage provides persistence but is not crash-safe.
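A minimal sketch of this stage could look like the following. The class name `SnapshotStore` and the JSON-on-disk format are illustrative assumptions, not the project's exact implementation:

```python
import json
import os


class SnapshotStore:
    """Sketch of the snapshot stage: memory is authoritative, disk mirrors it."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):          # load persisted state on startup
            with open(path) as f:
                self.data = json.load(f)

    def _persist(self):
        # The entire in-memory state is serialized after each mutation.
        with open(self.path, "w") as f:
            json.dump(self.data, f)

    def get(self, key):
        return self.data.get(key)         # missing keys return None

    def put(self, key, value):
        self.data[key] = value
        self._persist()

    def delete(self, key):
        self.data.pop(key, None)          # deleting a missing key is a no-op
        self._persist()
```

Note that a crash mid-`_persist` can leave a partially written file on disk, which is exactly the gap the next stage closes.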
- Writes are performed to a temporary file, never directly to the main database file
- Each write is made durable with `flush` + `fsync` before becoming visible
- The main database file is replaced atomically using a filesystem rename
- After a crash, the database is always in a previously committed state
- Temporary files are ignored on startup and never required for recovery
This stage guarantees atomicity and durability, but not write ordering or history preservation.
- All operations (get, put, delete) happen with exclusive ownership of the database
- Concurrent access behaves as if operations were executed sequentially in some total order
- Both read and write operations are blocked while another operation is in progress, preventing any operation from seeing a partial or uncommitted state
- Operations appear to happen instantaneously somewhere between their start and end times
This stage guarantees linearizable behaviour under concurrent access within a single process.
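A sketch of the single-owner model, assuming a coarse `threading.Lock` around every operation (class name hypothetical):

```python
import threading


class LockedStore:
    """One lock serializes every operation, so concurrent calls behave as if
    executed in some total order (linearizable within a single process)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:              # reads also take the lock:
            return self._data.get(key)  # no mid-write state is ever visible

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)
```

The trade-off is throughput: readers and writers all contend on one lock, which is exactly the "fine-grained or lock-free concurrency" listed later as a non-goal.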
- Snapshots are no longer trusted as the sole source of truth; the WAL file is relied on to guarantee ordering and durability
- A WAL (write-ahead log) file records every write to the database (`put` or `delete`), each made durable using `flush()` followed by `fsync()`
- On restart, the database is restored change by change: the most recent snapshot is loaded, then the WAL is replayed on top of it, stopping at the first invalid record (assumed to be the result of a crash or power loss)
This stage focuses on durability, write ordering and recovery from a crash without needing to write an entire snapshot each time.
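The append and replay paths can be sketched as below, assuming one JSON record per WAL line (the record layout is an assumption, not the project's exact format):

```python
import json
import os


def wal_append(wal, op, key, value=None):
    """Append one record and make it durable before acknowledging the write."""
    record = json.dumps({"op": op, "key": key, "value": value})
    wal.write(record + "\n")
    wal.flush()                 # Python buffer -> OS
    os.fsync(wal.fileno())      # OS -> disk: an acknowledged write survives a crash


def wal_replay(path, state):
    """Re-apply records in order, stopping at the first invalid record
    (a torn tail left by a crash mid-append)."""
    try:
        with open(path) as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break                              # truncated tail: ignore the rest
                if rec["op"] == "put":
                    state[rec["key"]] = rec["value"]
                elif rec["op"] == "delete":
                    state.pop(rec["key"], None)        # replay is idempotent
    except FileNotFoundError:
        pass                                           # no WAL yet: nothing to replay
    return state
```

Because each record fully specifies its effect, replaying it twice leaves the same state, which is what makes recovery after a partial replay safe.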
- Snapshots are now treated as explicit checkpoints that end up being written atomically
- A checkpoint metadata file records the WAL byte offset after the most recent snapshot
- On restart, WAL replay begins after the last checkpoint offset rather than from the start, so recovery time is proportional to how much the WAL has grown since the last checkpoint, not to the entire history
- Corrupted WAL suffixes beyond the checkpoint are safely ignored
- WAL replays are idempotent
This stage focuses on decreasing recovery time.
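The checkpoint mechanics can be sketched as follows. The metadata filename, its JSON layout, and the `wal_offset` field are illustrative assumptions:

```python
import json
import os


def load_checkpoint(meta_path):
    """Read the WAL byte offset recorded when the last snapshot was taken."""
    try:
        with open(meta_path) as f:
            return json.load(f)["wal_offset"]
    except FileNotFoundError:
        return 0                          # no checkpoint yet: replay the whole WAL


def replay_from_checkpoint(wal_path, meta_path, state):
    """Seek past records already captured by the snapshot, so recovery cost
    tracks WAL growth since the last checkpoint, not total history."""
    offset = load_checkpoint(meta_path)
    try:
        with open(wal_path) as f:
            f.seek(offset)
            for line in f:
                try:
                    rec = json.loads(line)
                except ValueError:
                    break                 # corrupted suffix beyond checkpoint: ignore
                if rec["op"] == "put":
                    state[rec["key"]] = rec["value"]
                else:
                    state.pop(rec["key"], None)
    except FileNotFoundError:
        pass
    return state
```

Writing the metadata file with the same atomic temp-file-plus-rename pattern as snapshots keeps the checkpoint itself crash-safe.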
- Measuring end-to-end latency using a monotonic clock
- Recording put latency distributions (p50, p95, p99)
- No changes to actual functionality
This stage focuses on measuring performance.
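A sketch of the measurement approach, assuming any store object exposing `put` (the helper name and percentile selection are illustrative; `time.perf_counter_ns` is Python's monotonic high-resolution clock):

```python
import statistics
import time


def measure_put_latency(store, n=1000):
    """Time each put with a monotonic clock (immune to wall-clock jumps)
    and report percentile latencies in nanoseconds."""
    samples = []
    for i in range(n):
        start = time.perf_counter_ns()        # monotonic, nanosecond resolution
        store.put(f"key-{i}", i)
        samples.append(time.perf_counter_ns() - start)
    samples.sort()
    q = statistics.quantiles(samples, n=100)  # q[k-1] ~ the k-th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98], "max": samples[-1]}
```

Using a monotonic clock matters here: wall-clock time (`time.time()`) can jump backwards under NTP adjustments, producing negative or skewed latency samples.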
Below you can see two tables regarding the performance of the program. The first table shows the latency of each key operation; the second shows the p1, p50, and p99 of each operation.
| Name | Min (ns) | Max (ns) | Mean (ns) | StdDev (ns) | Median (ns) | IQR (ns) | Outliers | OPS (Kops/s) | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_get_latency | 174.3707 (1.0) | 50,026.2206 (1.0) | 224.5733 (1.0) | 126.0082 (1.0) | 220.6652 (1.0) | 18.5532 (1.0) | 2130;3935 | 4,452.8888 | 195,122 | 27 |
| test_put_latency | 25,458.9831 (146.00) | 1,790,000.0094 (35.78) | 214,571.3537 (955.46) | 387,432.6197 (>1000.0) | 36,625.0169 (165.98) | 10,499.9635 (565.94) | 650;787 | 4.6605 | 3,910 | 1 |
| test_delete_latency | 29,625.0219 (169.90) | 38,867,915.9805 (776.95) | 257,314.1452 (>1000.0) | 715,234.9094 (>1000.0) | 36,999.9907 (167.67) | 7,583.0030 (408.72) | 1631;3975 | 3.8863 | 19,786 | 1 |
Legend:
- Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
- OPS: Operations Per Second, computed as 1 / Mean
These tests can be found in `tests/test_benchmarking.py`.
| Operation | p1 | p50 | p99 | Max | Iterations |
|---|---|---|---|---|---|
| GET | 211 ns | 229 ns | 260 ns | 1,489 ns | 198,373 |
| PUT | 28,208 ns | 37,042 ns | 1,042,404 ns | 1,850,166 ns | 2,336 |
| DELETE | 32,874 ns | 37,833 ns | 3,111,150 ns | 42,740,790 ns | 22,202 |
Total processing time: around 0.219 seconds
More information can be found in DESIGN.md.
| Category | Guaranteed at this stage? | Notes |
|---|---|---|
| API semantics | ✅ | get returns None for missing keys; put overwrites; delete is idempotent |
| In-memory correctness | ✅ | In-memory state reflects the latest committed WAL entry |
| Persistence across clean restart | ✅ | Snapshot + WAL replay restores full state |
| Disk ↔ memory consistency | ✅ | Memory is reconstructed from snapshot + WAL, not snapshots alone |
| Valid on-disk format | ✅ | WAL is append-only; snapshots are atomically replaced |
| Crash safety | ✅ | Recovery proceeds up to the last valid WAL record |
| Atomicity under crash | ✅ | A change/mutation is either fully applied or not applied after restart |
| Concurrency safety | ✅ | Thread-safe, linearizable access within a single process |
| Linearizable access | ✅ | All operations appear totally ordered |
| Write ordering guarantees | ✅ | WAL preserves write order |
| History preservation | ✅ | All acknowledged writes survive crashes |
| Partial-write tolerance | ✅ | Truncated or corrupted WAL suffixes are safely ignored |
| Reduced recovery time | ✅ | WAL replay starts after last checkpoint |
| Performance guarantees | ❌ | WAL fsync on every write; no batching |
| Transactions | ❌ | Single-key operations only |
| Asynchronous writes | ❌ | Writes are synchronous and blocking |
| Background flushing | ❌ | Snapshots are not generated in the background incrementally |
- Persistence is implemented via Write-ahead logging. Snapshots act as checkpoints, not as the primary source of truth
- On startup:
- Snapshot is loaded from disk
- WAL is replayed from the snapshot offset to the end
- In-memory state is reconstructed from the snapshot + WAL
- The order of writing to the store is: call `put`/`delete` -> append to the WAL file -> flush/fsync -> modify the tempfile -> rename the tempfile to the main file -> update the checkpoint if needed
```python
db = KVStore("db.json")
db.put("A", 0)
db.delete("A")
print(db.get("A"))  # None
```
- Fine-grained or lock-free concurrency
  - We've built a single-owner, coarse-grained locking system instead
- Asynchronous writes
- Truncating or compacting the WAL, which currently grows without bound
- Transactions
- Distributed systems
  - The system is intentionally a single process on a single node, so it is not distributed by definition
- Replication or sharding
- Query languages
- https://www.geeksforgeeks.org/python/python-os-fsync-method/ to learn about fsync() and how to use it with file objects
- https://www.geeksforgeeks.org/python/conftest-in-pytest/ to learn about conftest.py
- https://www.geeksforgeeks.org/python/multithreading-python-set-1/ to learn about threads
- https://stackoverflow.com/questions/3310049/proper-use-of-mutexes-in-python to learn about mutexes
- https://maxnilz.com/docs/006-arch/003-concurrency-protocol/#:~:text=Q:%20Compare%20different%20Concurrency%20models,conditions%20and%20ensure%20data%20integrity to learn about the different types of concurrency models
- https://www.architecture-weekly.com/p/the-write-ahead-log-a-foundation to learn about write-ahead logging
- https://medium.com/@vinciabhinav7/write-ahead-logs-but-why-494c3efd722d to understand why use WAL in the first place
- System Design Interview An Insider’s Guide by Alex Yu.pdf, chapter 6 in particular, to learn about standards for building a key-value store
- https://medium.com/@jatinumamtora/a-deep-dive-into-write-ahead-logging-wal-in-database-engines-recovery-71f6d98f0e23 and https://blog.algomaster.io/p/designing-a-distributed-key-value-store to better understand how to use checkpointing/ARIES
- https://www.systemdesignhandbook.com/guides/design-a-key-value-store/#2-snapshots-and-checkpointing to gain a better understanding of how a key-value store is meant to work, including snapshots and checkpointing
- Pytest docs to learn about using fixtures and pytest-benchmark