
kv-engine

A minimal, single-process key–value storage engine built to explore durability, crash behaviour, concurrency correctness, and storage semantics from first principles.

This project intentionally avoids distribution, networking, and performance optimisations in order to focus on correctness, failure modes, and reasoning about state under concurrency.


Overview

The system exposes a simple key–value API:

  • get(key) → value | None
  • put(key, value)
  • delete(key)

Internally, the store maintains an in-memory map and persists state to disk to survive restarts.

Concurrency Model

The model I've chosen is essentially a single-owner concurrency model: many threads can interact with the same process, but only one thread has exclusive access to the database at any given time. As a result, operations appear to execute sequentially even when they are issued concurrently, and the in-memory state is protected from concurrent writes.

Additionally, both reads and writes are synchronised, so no operation can observe a state that is mid-write.
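The single-owner model can be sketched with one coarse lock guarding every operation. This is a minimal illustration; the class and method names here are assumptions, not the repo's actual code:

```python
import threading

class LockedKVStore:
    """Single-owner sketch: one lock guards every operation, so reads and
    writes appear to execute in some total order and never observe a
    mid-write state."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:                   # readers are excluded during writes
            return self._data.get(key)     # missing keys return None

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)      # deleting a missing key is a no-op
```

Because every method takes the same lock, any interleaving of threads produces a result consistent with some sequential execution of the same calls.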


Stages Implemented

Stage 0 — In-memory semantics

  • Data stored entirely in memory

  • Defines stable API semantics:

    • missing keys return None
    • deleting a missing key is a no-op
  • No persistence

  • Restart loses all data

This stage exists to lock down behavioural guarantees before introducing storage.

Stage 1 — Naive full-state persistence

  • Entire in-memory state is serialized to disk after each mutation
  • State is loaded from disk on startup
  • Clean shutdown + restart restores all data
  • Disk is treated as a mirror of memory (memory is authoritative)

This stage provides persistence but is not crash-safe.
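The naive approach can be sketched as follows. The class name, file format (JSON) and layout are illustrative assumptions, not the project's actual implementation:

```python
import json
import os

class NaiveKVStore:
    """Stage-1 style store: rewrite the entire file after every mutation.
    Survives a clean restart, but a crash mid-write can corrupt the file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)   # disk mirrors memory on startup

    def _persist(self):
        # Writing the main file in place is the crash-safety flaw:
        # a crash here leaves a truncated or partially written file.
        with open(self.path, "w") as f:
            json.dump(self.data, f)

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value             # memory is authoritative
        self._persist()

    def delete(self, key):
        self.data.pop(key, None)
        self._persist()
```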

Stage 2 — Crash-safe snapshot persistence

  • Writes are performed to a temporary file, never directly to the main database file
  • Each write is made durable with flush + fsync before becoming visible
  • The main database file is replaced atomically using filesystem rename
  • After a crash, the database is always in a previously committed state
  • Temporary files are ignored on startup and never required for recovery

This stage guarantees atomicity and durability, but not write ordering or history preservation.
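The temp-file-plus-rename pattern can be sketched like this (an illustrative helper, assuming a JSON snapshot format; not the repo's actual function):

```python
import json
import os
import tempfile

def atomic_snapshot(path, state):
    """Write state to a temp file in the same directory, fsync it, then
    atomically rename it over the main file. A crash at any point leaves
    either the old snapshot or the new one, never a half-written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # durable before becoming visible
        os.replace(tmp_path, path)     # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)            # leftover temp files are never needed
        raise
```

The temp file must live in the same directory as the main file so the rename does not cross filesystems, which would make it non-atomic.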

Stage 3 — Thread-safe, linearizable access

  • All operations (get, put, delete) happen with exclusive ownership of the database
  • Concurrent access behaves as if operations were executed sequentially in some total order
  • Both read and write operations are blocked while another operation is in progress, preventing any operation from seeing a partial or uncommitted state
  • Operations appear to happen instantaneously somewhere between their start and end times

This stage guarantees linearizable behaviour under concurrent access within a single process.

Stage 4 — Write-ahead logging and crash recovery

  • Snapshots are no longer the sole source of truth; the WAL file now guarantees ordering and durability
  • A WAL file records every write to the database (put or delete) and is made durable using flush() followed by fsync()
  • Upon restart, the database is restored change by change on top of the most recent snapshot by replaying the WAL, stopping at the first invalid record (assumed to be the result of a crash or power loss)

This stage focuses on durability, write ordering and recovery from a crash without needing to write an entire snapshot each time.
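A minimal WAL can be sketched as an append-only file of one JSON record per mutation, fsynced before the write is acknowledged. The record format and class name are assumptions for illustration:

```python
import json
import os

class WriteAheadLog:
    """Sketch of a WAL: append one JSON record per mutation and fsync
    before acknowledging. Replay stops at the first unparsable record,
    which is assumed to be a torn write from a crash."""

    def __init__(self, path):
        self.f = open(path, "a+")

    def append(self, op, key, value=None):
        self.f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())    # durable before the write is acknowledged

    def replay(self, state):
        self.f.seek(0)
        for line in self.f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break                # truncated tail: stop at first bad record
            if rec["op"] == "put":
                state[rec["key"]] = rec["value"]
            elif rec["op"] == "delete":
                state.pop(rec["key"], None)
        return state
```

Because records are replayed in file order, the WAL preserves write ordering; because each append is fsynced, an acknowledged write survives a crash.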

Stage 5 — Snapshot checkpointing

  • Snapshots are now treated as explicit checkpoints that end up being written atomically
  • A checkpoint metadata file records the WAL byte offset after the most recent snapshot
  • On restart, WAL replay begins at the last checkpoint offset rather than at the start, so recovery time is proportional to how much the WAL has grown since the last checkpoint, not to the entire history
    • Corrupted WAL suffixes beyond the checkpoint are safely ignored
    • WAL replays are idempotent

This stage focuses on decreasing recovery time.
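Checkpointing can be sketched as recording the WAL byte offset alongside each snapshot, then seeking past it on recovery. File names, the metadata format, and the record format are illustrative assumptions:

```python
import json
import os

def write_checkpoint(meta_path, wal_path):
    """After taking a snapshot, record the current WAL byte offset so
    replay can start there instead of at byte 0."""
    offset = os.path.getsize(wal_path)
    with open(meta_path, "w") as f:
        json.dump({"wal_offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    return offset

def replay_from_checkpoint(wal_path, meta_path, state):
    """Replay only the WAL suffix written after the last checkpoint."""
    offset = 0
    if os.path.exists(meta_path):
        with open(meta_path) as f:
            offset = json.load(f)["wal_offset"]
    with open(wal_path) as f:
        f.seek(offset)               # records before this are in the snapshot
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break                # corrupted suffix beyond the checkpoint
            if rec["op"] == "put":
                state[rec["key"]] = rec["value"]
            else:
                state.pop(rec["key"], None)
    return state
```

Replaying a put or delete that is already reflected in the snapshot produces the same final state, which is why replay is idempotent even if the checkpoint offset is slightly stale.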

Stage 6 — Performance Testing

  • Measuring end-to-end latency using a monotonic clock
  • Recording put latency distributions (p50, p95, p99)
  • No changes to actual functionality

This stage focuses on measuring performance.
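The measurement approach can be sketched as timing each operation with a monotonic nanosecond clock and reporting percentiles. This is a simplified sketch, not the repo's benchmark code, and it uses a plain dict as a stand-in store:

```python
import statistics
import time

def measure_put_latency(store, n=1000):
    """Time each put with time.perf_counter_ns (monotonic, ns resolution)
    and report p50/p95/p99 latencies in nanoseconds."""
    samples = []
    for i in range(n):
        start = time.perf_counter_ns()
        store[f"key-{i}"] = i                      # the operation under test
        samples.append(time.perf_counter_ns() - start)
    cuts = statistics.quantiles(samples, n=100)    # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A monotonic clock matters here: wall-clock time can jump backwards (NTP adjustments), which would corrupt latency samples.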

Performance results

Below are two tables describing the performance of the program. The first shows pytest-benchmark latency statistics for each key operation; the second shows the p1, p50 and p99 latencies.

Pytest Benchmarking

| Name | Min (ns) | Max (ns) | Mean (ns) | StdDev (ns) | Median (ns) | IQR (ns) | Outliers | OPS (Kops/s) | Rounds | Iterations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test_get_latency | 174.3707 (1.0) | 50,026.2206 (1.0) | 224.5733 (1.0) | 126.0082 (1.0) | 220.6652 (1.0) | 18.5532 (1.0) | 2130;3935 | 4,452.8888 | 195,122 | 27 |
| test_put_latency | 25,458.9831 (146.00) | 1,790,000.0094 (35.78) | 214,571.3537 (955.46) | 387,432.6197 (>1000.0) | 36,625.0169 (165.98) | 10,499.9635 (565.94) | 650;787 | 4.6605 | 3,910 | 1 |
| test_delete_latency | 29,625.0219 (169.90) | 38,867,915.9805 (776.95) | 257,314.1452 (>1000.0) | 715,234.9094 (>1000.0) | 36,999.9907 (167.67) | 7,583.0030 (408.72) | 1631;3975 | 3.8863 | 19,786 | 1 |

Legend:

  • Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  • OPS: Operations Per Second, computed as 1 / Mean

These tests can be found in tests/test_benchmarking.py

Latency Percentiles (nanoseconds)

| Operation | p1 | p50 | p99 | Max | Iterations |
| --- | --- | --- | --- | --- | --- |
| GET | 211 ns | 229 ns | 260 ns | 1,489 ns | 198,373 |
| PUT | 28,208 ns | 37,042 ns | 1,042,404 ns | 1,850,166 ns | 2,336 |
| DELETE | 32,874 ns | 37,833 ns | 3,111,150 ns | 42,740,790 ns | 22,202 |

Total processing time: around 0.219 seconds

More information can be found in DESIGN.md.

Current Guarantees and limitations

| Category | Guaranteed at this stage? | Notes |
| --- | --- | --- |
| API semantics | ✅ | get returns None for missing keys; put overwrites; delete is idempotent |
| In-memory correctness | ✅ | In-memory state reflects the latest committed WAL entry |
| Persistence across clean restart | ✅ | Snapshot + WAL replay restores full state |
| Disk ↔ memory consistency | ✅ | Memory is reconstructed from snapshot + WAL, not snapshots alone |
| Valid on-disk format | ✅ | WAL is append-only; snapshots are atomically replaced |
| Crash safety | ✅ | Recovery proceeds up to the last valid WAL record |
| Atomicity under crash | ✅ | A mutation is either fully applied or not applied after restart |
| Concurrency safety | ✅ | Thread-safe, linearizable access within a single process |
| Linearizable access | ✅ | All operations appear totally ordered |
| Write ordering guarantees | ✅ | WAL preserves the order of changes |
| History preservation | ✅ | All acknowledged writes survive crashes |
| Partial-write tolerance | ✅ | Truncated or corrupted WAL suffixes are safely ignored |
| Reduced recovery time | ✅ | WAL replay starts after the last checkpoint |
| Performance guarantees | ❌ | WAL fsync on every write; no batching |
| Transactions | ❌ | Single-key operations only |
| Asynchronous writes | ❌ | Writes are synchronous and blocking |
| Background flushing | ❌ | Snapshots are not generated incrementally in the background |

Design Notes (Persistence & Storage)

  • Persistence is implemented via Write-ahead logging. Snapshots act as checkpoints, not as the primary source of truth
  • On startup:
    • Snapshot is loaded from disk
    • WAL is replayed from the snapshot offset to the end
    • In-memory state is reconstructed from the snapshot + WAL
  • The order of writing to the store is: call put/delete → append to the WAL → flush/fsync → write the temp snapshot file → rename the temp file over the main file → update the checkpoint if needed

Example

```python
db = KVStore("db.json")
db.put("A", 0)
db.delete("A")
print(db.get("A"))  # None
```

Out of Scope

  • Fine-grained or lock-free concurrency
    • We use a single-owner, coarse-grained locking scheme instead
  • Asynchronous writes
  • WAL truncation or compaction (the log currently grows without bound)
  • Transactions
  • Distributed systems
    • The system is intentionally a single process on a single node, so it is not distributed by definition
  • Replication or sharding
  • Query languages
