NoKV

package module
v0.7.0
Published: Mar 3, 2026 License: Apache-2.0 Imports: 27 Imported by: 0

README

🚀 NoKV – High-Performance Distributed KV Engine


LSM Tree • ValueLog • MVCC • Multi-Raft Regions • Redis-Compatible

NoKV is a Go-native storage engine that mixes RocksDB-style manifest discipline with Badger-inspired value separation. You can embed it locally, drive it via multi-Raft regions, or front it with a Redis protocol gateway—all from a single topology file.


✨ Feature Highlights

  • 🚀 Dual runtime modes – call NoKV.Open inside your process or launch nokv serve for a distributed deployment, no code changes required.
  • 🔁 Hybrid LSM + ValueLog – WAL → MemTable → SST pipeline for latency, with a ValueLog to keep large payloads off the hot path.
  • MVCC + Percolator transaction path – distributed 2PC flows use MVCC versioned keys with snapshot-style reads and lock-based commits.
  • 🧠 Multi-Raft regions – raftstore manages per-region raft groups, WAL/manifest pointers, and tick-driven leader elections.
  • 🛰️ Redis gateway – cmd/nokv-redis exposes RESP commands (SET/GET/MGET/NX/XX/TTL/INCR...) on top of raft-backed storage.
  • 🧪 Pebble-inspired VFS – a unified vfs layer with deterministic fault injection (FaultFS) for sync/close/truncate rollback testing.
  • 🔍 Observability first – nokv stats, expvar endpoints, hot key tracking, RECOVERY/TRANSPORT metrics, and ready-to-use recovery scripts.
  • 🧰 Single-source config – raft_config.json feeds local scripts, Docker Compose, Redis gateway, and CI so there’s zero drift.
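The value-separation idea in the Hybrid LSM + ValueLog bullet can be sketched with a tiny predicate. The ValueThreshold name mirrors the option exposed by this package, but the helper itself is illustrative, not NoKV's actual code:

```go
package main

import "fmt"

// shouldSeparate is an illustrative sketch: values at or above the
// threshold go to the value log and the LSM tree keeps only a pointer,
// so large payloads stay off the compaction hot path.
func shouldSeparate(value []byte, threshold int64) bool {
	return int64(len(value)) >= threshold
}

func main() {
	threshold := int64(32)
	small := []byte("inline me")
	large := make([]byte, 4096)
	fmt.Println(shouldSeparate(small, threshold)) // false: stored inline in the SST
	fmt.Println(shouldSeparate(large, threshold)) // true: stored in the value log
}
```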

🚦 Quick Start

Start an end-to-end playground with either the local script or Docker Compose. Both spin up a three-node Raft cluster with a PD-lite service and expose the Redis-compatible gateway.

# Option A: local processes
./scripts/run_local_cluster.sh --config ./raft_config.example.json
# In another shell: launch the Redis gateway on top of the running cluster
go run ./cmd/nokv-redis --addr 127.0.0.1:6380 --raft-config raft_config.example.json

# Option B: Docker Compose (cluster + gateway + PD)
docker compose up --build
# Tear down
docker compose down -v

Once the cluster is running you can point any Redis client at 127.0.0.1:6380 (or the address exposed by Compose).

For quick CLI checks:

# Inspect stats from an existing workdir
go run ./cmd/nokv stats --workdir ./artifacts/cluster/store-1

Minimal embedded snippet:

package main

import (
	"fmt"
	"log"

	NoKV "github.com/feichai0017/NoKV"
)

func main() {
	opt := NoKV.NewDefaultOptions()
	opt.WorkDir = "./workdir-demo"

	db := NoKV.Open(opt)
	defer db.Close()

	key := []byte("hello")
	if err := db.Set(key, []byte("world")); err != nil {
		log.Fatalf("set failed: %v", err)
	}

	entry, err := db.Get(key)
	if err != nil {
		log.Fatalf("get failed: %v", err)
	}
	fmt.Printf("value=%s\n", entry.Value)
}

Note: Public read APIs (DB.Get, DB.GetCF, DB.GetVersionedEntry) return detached entries. Do not call DecrRef on them.

ℹ️ run_local_cluster.sh rebuilds nokv and nokv-config, seeds manifests via nokv-config manifest, starts PD-lite (nokv pd), and parks logs under artifacts/cluster/store-<id>/server.log. Use Ctrl+C to exit cleanly; if the process crashes, wipe the workdir (rm -rf ./artifacts/cluster) before restarting to avoid WAL replay errors.


🧭 Topology & Configuration

Everything hangs off a single file: raft_config.example.json.

"pd": { "addr": "127.0.0.1:2379", "docker_addr": "nokv-pd:2379" },
"stores": [
  { "store_id": 1, "listen_addr": "127.0.0.1:20170", ... },
  { "store_id": 2, "listen_addr": "127.0.0.1:20171", ... },
  { "store_id": 3, "listen_addr": "127.0.0.1:20172", ... }
],
"regions": [
  { "id": 1, "range": [-inf,"m"), peers: 101/201/301, leader: store 1 },
  { "id": 2, "range": ["m",+inf), peers: 102/202/302, leader: store 2 }
]
  • Local scripts (run_local_cluster.sh, serve_from_config.sh, bootstrap_from_config.sh) ingest the same JSON, so local runs match production layouts.
  • Docker Compose mounts the file into each container; manifests, transports, and Redis gateway all stay in sync.
  • Need more stores or regions? Update the JSON and re-run the script/Compose—no code changes required.
  • Programmatic access: import github.com/feichai0017/NoKV/config and call config.LoadFile / Validate for a single source of truth across tools.
🧬 Tech Stack Snapshot
| Layer | Tech/Package | Why it matters |
| --- | --- | --- |
| Storage Core | lsm/, wal/, vlog/ | Hybrid log-structured design with manifest-backed durability and value separation. |
| Concurrency | percolator/, raftstore/client | Distributed 2PC, lock management, and MVCC version semantics in raft mode. |
| Replication | raftstore/* + pd/* | Multi-Raft data plane plus PD-backed control plane (routing, TSO, heartbeats). |
| Tooling | cmd/nokv, cmd/nokv-config, cmd/nokv-redis | CLI, config helper, and Redis-compatible gateway share the same topology file. |
| Observability | stats, hotring, expvar | Built-in metrics, hot-key analytics, and crash recovery traces. |

🧱 Architecture Overview

graph TD
    Client[Client API] -->|Set/Get| DBCore
    DBCore -->|Append| WAL
    DBCore -->|Insert| MemTable
    DBCore -->|ValuePtr| ValueLog
    MemTable -->|Flush Task| FlushMgr
    FlushMgr -->|Build SST| SSTBuilder
    SSTBuilder -->|LogEdit| Manifest
    Manifest -->|Version| LSMLevels
    LSMLevels -->|Compaction| Compactor
    FlushMgr -->|Discard Stats| ValueLog
    ValueLog -->|GC updates| Manifest
    DBCore -->|Stats/HotKeys| Observability

Key ideas:

  • Durability path – WAL first, memtable second. ValueLog writes occur before WAL append so crash replay can fully rebuild state.
  • Metadata – manifest stores SST topology, WAL checkpoints, and vlog head/deletion metadata.
  • Background workers – flush manager handles Prepare → Build → Install → Release, compaction reduces level overlap, and value log GC rewrites segments based on discard stats.
  • Distributed transactions – Percolator 2PC runs in raft mode; embedded mode exposes non-transactional DB APIs.
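The durability ordering in the first bullet (value log, then WAL, then memtable) can be illustrated with a toy in-memory write path. This is a conceptual sketch of the ordering only, not NoKV's actual pipeline:

```go
package main

import "fmt"

// toyDB sketches the write ordering: the value log is appended first,
// the WAL records the operation next, and the memtable is updated last,
// so a crash after any step can be repaired by replay.
type toyDB struct {
	vlog     [][]byte // value-log entries (toy)
	wal      []string // WAL records (toy)
	memtable map[string]string
}

func (db *toyDB) set(key, value string) {
	ptr := len(db.vlog)
	db.vlog = append(db.vlog, []byte(value))                          // 1. value log append
	db.wal = append(db.wal, fmt.Sprintf("SET %s @vlog:%d", key, ptr)) // 2. WAL append
	db.memtable[key] = value                                          // 3. memtable insert
}

func main() {
	db := &toyDB{memtable: map[string]string{}}
	db.set("hello", "world")
	fmt.Println(db.memtable["hello"], len(db.wal), len(db.vlog))
}
```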

Dive deeper in docs/architecture.md.


🧩 Module Breakdown

| Module | Responsibilities | Source | Docs |
| --- | --- | --- | --- |
| WAL | Append-only segments with CRC, rotation, replay (wal.Manager). | wal/ | WAL internals |
| LSM | MemTable, flush pipeline, leveled compactions, iterator merging. | lsm/ | Memtable, Flush pipeline, Cache |
| Manifest | VersionEdit log + CURRENT handling, WAL/vlog checkpoints, Region metadata. | manifest/ | Manifest semantics |
| ValueLog | Large value storage, GC, discard stats integration. | vlog.go, vlog/ | Value log design |
| Percolator | Distributed MVCC 2PC primitives (prewrite/commit/rollback/resolve/status). | percolator/ | Percolator transactions |
| RaftStore | Multi-Raft Region management, hooks, metrics, transport. | raftstore/ | RaftStore overview |
| HotRing | Hot key tracking, throttling helpers. | hotring/ | HotRing overview |
| Observability | Periodic stats, hot key tracking, CLI integration. | stats.go, cmd/nokv | Stats & observability, CLI reference |
| Filesystem | Pebble-inspired vfs abstraction + mmap-backed file helpers shared by SST/vlog, WAL, and manifest. | vfs/, file/ | VFS, File abstractions |

Each module has a dedicated document under docs/ describing APIs, diagrams, and recovery notes.


📡 Observability & CLI

  • Stats.StartStats publishes metrics via expvar (flush backlog, WAL segments, value log GC stats, raft/region/cache/hot metrics).
  • cmd/nokv gives you:
    • nokv stats --workdir <dir> [--json] [--no-region-metrics]
    • nokv manifest --workdir <dir>
    • nokv regions --workdir <dir> [--json]
    • nokv vlog --workdir <dir>
  • hotring continuously surfaces hot keys in stats + CLI so you can pre-warm caches or debug skewed workloads.

More in docs/cli.md and docs/testing.md.


🔌 Redis Gateway

  • cmd/nokv-redis exposes a RESP-compatible endpoint. In embedded mode (--workdir) commands execute through regular DB APIs; in distributed mode (--raft-config) calls are routed through raftstore/client and committed with TwoPhaseCommit.
  • TTL metadata is stored under !redis:ttl!<key> and is automatically cleaned up when reads detect expiration.
  • --metrics-addr exposes Redis gateway metrics under NoKV.Stats.redis via expvar. In raft mode, --pd-addr can override config.pd when you need a non-default PD endpoint.
  • A ready-to-use cluster configuration is available at raft_config.example.json, matching both scripts/run_local_cluster.sh and the Docker Compose setup.

For the complete command matrix, configuration and deployment guides, see docs/nokv-redis.md.


📄 License

Apache-2.0. See LICENSE.

Documentation

Overview

Package NoKV provides the embedded database API and engine wiring.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CacheStatsSnapshot added in v0.6.0

type CacheStatsSnapshot struct {
	BlockL0HitRate float64 `json:"block_l0_hit_rate"`
	BlockL1HitRate float64 `json:"block_l1_hit_rate"`
	BloomHitRate   float64 `json:"bloom_hit_rate"`
	IndexHitRate   float64 `json:"index_hit_rate"`
	IteratorReused uint64  `json:"iterator_reused"`
}

CacheStatsSnapshot captures block/index/bloom hit-rate indicators.

type ColumnFamilySnapshot

type ColumnFamilySnapshot struct {
	Writes uint64 `json:"writes"`
	Reads  uint64 `json:"reads"`
}

ColumnFamilySnapshot aggregates read/write counters for a single column family.

type CompactionStatsSnapshot added in v0.6.0

type CompactionStatsSnapshot struct {
	Backlog              int64   `json:"backlog"`
	MaxScore             float64 `json:"max_score"`
	LastDurationMs       float64 `json:"last_duration_ms"`
	MaxDurationMs        float64 `json:"max_duration_ms"`
	Runs                 uint64  `json:"runs"`
	IngestRuns           int64   `json:"ingest_runs"`
	MergeRuns            int64   `json:"ingest_merge_runs"`
	IngestMs             float64 `json:"ingest_ms"`
	MergeMs              float64 `json:"ingest_merge_ms"`
	IngestTables         int64   `json:"ingest_tables"`
	MergeTables          int64   `json:"ingest_merge_tables"`
	ValueWeight          float64 `json:"value_weight"`
	ValueWeightSuggested float64 `json:"value_weight_suggested,omitempty"`
}

CompactionStatsSnapshot summarizes compaction backlog, runtime, and ingest behavior.

type CoreAPI

type CoreAPI interface {
	Set(key, value []byte) error
	SetWithTTL(key, value []byte, expiresAt uint64) error
	Get(key []byte) (*kv.Entry, error)
	Del(key []byte) error
	SetCF(cf kv.ColumnFamily, key, value []byte) error
	GetCF(cf kv.ColumnFamily, key []byte) (*kv.Entry, error)
	DelCF(cf kv.ColumnFamily, key []byte) error
	NewIterator(opt *utils.Options) utils.Iterator
	Info() *Stats
	Close() error
}

CoreAPI describes the externally exposed NoKV operations.

type DB

type DB struct {
	sync.RWMutex
	// contains filtered or unexported fields
}

DB is the global handle for the engine and owns shared resources.

func Open

func Open(opt *Options) *DB

Open opens a DB using the provided options.

func (*DB) Close

func (db *DB) Close() error

Close stops background workers and flushes in-memory state before releasing all resources.

func (*DB) Del

func (db *DB) Del(key []byte) error

Del removes a key from the default column family by writing a tombstone.

func (*DB) DelCF

func (db *DB) DelCF(cf kv.ColumnFamily, key []byte) error

DelCF deletes a key from the specified column family.

func (*DB) DeleteVersionedEntry added in v0.2.0

func (db *DB) DeleteVersionedEntry(cf kv.ColumnFamily, key []byte, version uint64) error

DeleteVersionedEntry marks the specified version as deleted by writing a tombstone record.

func (*DB) Get

func (db *DB) Get(key []byte) (*kv.Entry, error)

Get reads the latest visible value for key from the default column family.

func (*DB) GetCF

func (db *DB) GetCF(cf kv.ColumnFamily, key []byte) (*kv.Entry, error)

GetCF reads a key from the specified column family. The returned entry is detached from internal pools. Callers must not call DecrRef.

func (*DB) GetVersionedEntry added in v0.2.0

func (db *DB) GetVersionedEntry(cf kv.ColumnFamily, key []byte, version uint64) (*kv.Entry, error)

GetVersionedEntry retrieves the value stored at the provided MVCC version. The returned entry is detached from internal pools. Callers must not call DecrRef.

func (*DB) Info

func (db *DB) Info() *Stats

Info returns the live stats collector for snapshot/diagnostic access.

func (*DB) IsClosed

func (db *DB) IsClosed() bool

IsClosed reports whether Close has finished and the DB no longer accepts work.

func (*DB) Manifest

func (db *DB) Manifest() *manifest.Manager

Manifest exposes the manifest manager for coordination components.

func (*DB) NewInternalIterator added in v0.5.0

func (db *DB) NewInternalIterator(opt *utils.Options) utils.Iterator

NewInternalIterator returns an iterator over internal keys (CF marker + user key + timestamp). Callers must interpret kv.Entry.Key using kv.SplitInternalKey.

func (*DB) NewIterator

func (db *DB) NewIterator(opt *utils.Options) utils.Iterator

NewIterator creates a DB-level iterator over user keys in the default column family.

func (*DB) RunValueLogGC

func (db *DB) RunValueLogGC(discardRatio float64) error

RunValueLogGC triggers a value log garbage collection.

func (*DB) Set

func (db *DB) Set(key, value []byte) error

Set writes a key/value pair into the default column family.

func (*DB) SetCF

func (db *DB) SetCF(cf kv.ColumnFamily, key, value []byte) error

SetCF writes a key/value pair into the specified column family.

func (*DB) SetRegionMetrics

func (db *DB) SetRegionMetrics(rm *metrics.RegionMetrics)

SetRegionMetrics attaches a region metrics recorder so that Stats snapshots and expvar output include region state counts.

func (*DB) SetVersionedEntry added in v0.2.0

func (db *DB) SetVersionedEntry(cf kv.ColumnFamily, key []byte, version uint64, value []byte, meta byte) error

SetVersionedEntry writes a value to the specified column family using the provided version. It mirrors SetCF but allows callers to control the MVCC timestamp embedded in the internal key.

func (*DB) SetWithTTL added in v0.7.0

func (db *DB) SetWithTTL(key, value []byte, expiresAt uint64) error

SetWithTTL writes a key/value pair into the default column family with an explicit expiry timestamp.

func (*DB) WAL

func (db *DB) WAL() *wal.Manager

WAL exposes the underlying WAL manager.

type DBIterator

type DBIterator struct {
	// contains filtered or unexported fields
}

DBIterator wraps the merged LSM iterators and optionally resolves value-log pointers.

func (*DBIterator) Close

func (iter *DBIterator) Close() error

Close releases underlying iterators and returns pooled iterator context.

func (*DBIterator) Item

func (iter *DBIterator) Item() utils.Item

Item returns the currently materialized item, or nil when the iterator is invalid.

func (*DBIterator) Next

func (iter *DBIterator) Next()

Next advances to the next visible key/value pair.

func (*DBIterator) Rewind

func (iter *DBIterator) Rewind()

Rewind positions the iterator at the first or last key based on scan direction.

func (*DBIterator) Seek

func (iter *DBIterator) Seek(key []byte)

Seek positions the iterator at the first key >= key in default column family order.

func (*DBIterator) Valid

func (iter *DBIterator) Valid() bool

Valid reports whether the iterator currently points at a valid item.

type FlushStatsSnapshot added in v0.6.0

type FlushStatsSnapshot struct {
	Pending       int64   `json:"pending"`
	QueueLength   int64   `json:"queue_length"`
	Active        int64   `json:"active"`
	WaitMs        float64 `json:"wait_ms"`
	LastWaitMs    float64 `json:"last_wait_ms"`
	MaxWaitMs     float64 `json:"max_wait_ms"`
	BuildMs       float64 `json:"build_ms"`
	LastBuildMs   float64 `json:"last_build_ms"`
	MaxBuildMs    float64 `json:"max_build_ms"`
	ReleaseMs     float64 `json:"release_ms"`
	LastReleaseMs float64 `json:"last_release_ms"`
	MaxReleaseMs  float64 `json:"max_release_ms"`
	Completed     int64   `json:"completed"`
}

FlushStatsSnapshot summarizes flush queue depth and stage timing.

type HotKeyStat

type HotKeyStat struct {
	Key   string `json:"key"`
	Count int32  `json:"count"`
}

HotKeyStat represents one hot key and its observed touch count.

type HotStatsSnapshot added in v0.6.0

type HotStatsSnapshot struct {
	ReadKeys  []HotKeyStat   `json:"read_keys,omitempty"`
	ReadRing  *hotring.Stats `json:"read_ring,omitempty"`
	WriteKeys []HotKeyStat   `json:"write_keys,omitempty"`
	WriteRing *hotring.Stats `json:"write_ring,omitempty"`
}

HotStatsSnapshot contains top read/write keys and optional ring internals.

type Item

type Item struct {
	// contains filtered or unexported fields
}

Item is the user-facing iterator item backed by an entry and optional vlog reader.

func (*Item) Entry

func (it *Item) Entry() *kv.Entry

Entry returns the current entry view for this iterator item.

func (*Item) ValueCopy

func (it *Item) ValueCopy(dst []byte) ([]byte, error)

ValueCopy returns a copy of the current value into dst (if provided). Mirrors Badger's semantics to aid callers expecting defensive copies.

type LSMLevelStats added in v0.4.0

type LSMLevelStats struct {
	Level              int     `json:"level"`
	TableCount         int     `json:"tables"`
	SizeBytes          int64   `json:"size_bytes"`
	ValueBytes         int64   `json:"value_bytes"`
	StaleBytes         int64   `json:"stale_bytes"`
	IngestTables       int     `json:"ingest_tables"`
	IngestSizeBytes    int64   `json:"ingest_size_bytes"`
	IngestValueBytes   int64   `json:"ingest_value_bytes"`
	ValueDensity       float64 `json:"value_density"`
	IngestValueDensity float64 `json:"ingest_value_density"`
	IngestRuns         int64   `json:"ingest_runs"`
	IngestMs           float64 `json:"ingest_ms"`
	IngestTablesCount  int64   `json:"ingest_tables_compacted"`
	MergeRuns          int64   `json:"ingest_merge_runs"`
	MergeMs            float64 `json:"ingest_merge_ms"`
	MergeTables        int64   `json:"ingest_merge_tables"`
}

LSMLevelStats captures aggregated metrics per LSM level.

type LSMStatsSnapshot added in v0.6.0

type LSMStatsSnapshot struct {
	Levels            []LSMLevelStats                 `json:"levels,omitempty"`
	ValueBytesTotal   int64                           `json:"value_bytes_total"`
	ValueDensityMax   float64                         `json:"value_density_max"`
	ValueDensityAlert bool                            `json:"value_density_alert"`
	ColumnFamilies    map[string]ColumnFamilySnapshot `json:"column_families,omitempty"`
}

LSMStatsSnapshot summarizes per-level storage shape and value-density signals.

type MemTableEngine added in v0.4.2

type MemTableEngine string

MemTableEngine selects the in-memory index implementation used by memtables.

const (
	MemTableEngineSkiplist MemTableEngine = "skiplist"
	MemTableEngineART      MemTableEngine = "art"
)

type Options

type Options struct {
	// FS provides the filesystem implementation used by DB runtime components.
	// Nil defaults to vfs.OSFS.
	FS vfs.FS

	ValueThreshold     int64
	WorkDir            string
	MemTableSize       int64
	MemTableEngine     MemTableEngine
	SSTableMaxSz       int64
	MaxBatchCount      int64
	MaxBatchSize       int64 // max batch size in bytes
	ValueLogFileSize   int
	ValueLogMaxEntries uint32
	// ValueLogBucketCount controls how many hash buckets the value log uses.
	// Values <= 1 disable bucketization.
	ValueLogBucketCount int
	// ValueLogHotBucketCount reserves this many buckets for hot keys when
	// HotRing-based routing is enabled. Values <= 0 disable hot/cold splitting.
	ValueLogHotBucketCount int
	// ValueLogHotKeyThreshold marks a key as hot once its HotRing counter reaches
	// this value. Values <= 0 disable HotRing-based routing.
	ValueLogHotKeyThreshold int32

	// ValueLogGCInterval specifies how frequently to trigger a check for value
	// log garbage collection. Zero or negative values disable automatic GC.
	ValueLogGCInterval time.Duration
	// ValueLogGCDiscardRatio is the discard ratio for a value log file to be
	// considered for garbage collection. It must be in the range (0.0, 1.0).
	ValueLogGCDiscardRatio float64
	// ValueLogGCParallelism controls how many value-log GC tasks can run in
	// parallel. Values <= 0 auto-tune based on compaction workers.
	ValueLogGCParallelism int
	// ValueLogGCReduceScore lowers GC parallelism when compaction max score meets
	// or exceeds this threshold. Values <= 0 use defaults.
	ValueLogGCReduceScore float64
	// ValueLogGCSkipScore skips GC when compaction max score meets or exceeds this
	// threshold. Values <= 0 use defaults.
	ValueLogGCSkipScore float64
	// ValueLogGCReduceBacklog lowers GC parallelism when compaction backlog meets
	// or exceeds this threshold. Values <= 0 use defaults.
	ValueLogGCReduceBacklog int
	// ValueLogGCSkipBacklog skips GC when compaction backlog meets or exceeds this
	// threshold. Values <= 0 use defaults.
	ValueLogGCSkipBacklog int

	// Value log GC sampling parameters. Ratios <= 0 fall back to defaults.
	ValueLogGCSampleSizeRatio  float64
	ValueLogGCSampleCountRatio float64
	ValueLogGCSampleFromHead   bool

	// ValueLogVerbose enables verbose logging across value-log operations.
	ValueLogVerbose bool

	WriteBatchMaxCount int
	WriteBatchMaxSize  int64

	DetectConflicts bool
	HotRingEnabled  bool
	HotRingBits     uint8
	HotRingTopK     int
	// HotRingRotationInterval enables dual-ring rotation for hotness tracking.
	// Zero disables rotation.
	HotRingRotationInterval time.Duration
	// HotRingNodeCap caps the number of tracked keys per ring. Zero disables the cap.
	HotRingNodeCap uint64
	// HotRingNodeSampleBits controls stable sampling once the cap is reached.
	// A value of 0 enforces a strict cap; larger values sample 1/2^N keys.
	HotRingNodeSampleBits uint8
	// HotRingDecayInterval controls how often HotRing halves its global counters.
	// Zero disables periodic decay.
	HotRingDecayInterval time.Duration
	// HotRingDecayShift determines how aggressively counters decay (count >>= shift).
	HotRingDecayShift uint32
	// HotRingWindowSlots controls the number of sliding-window buckets tracked per key.
	// Zero disables the sliding window.
	HotRingWindowSlots int
	// HotRingWindowSlotDuration sets the duration of each sliding-window bucket.
	HotRingWindowSlotDuration time.Duration
	// ValueLogHotRingOverride uses the dedicated ValueLogHotRing* settings instead
	// of the global HotRing configuration when routing hot value-log keys.
	ValueLogHotRingOverride bool
	// ValueLogHotRingBits controls the hash bucket count for the value-log ring.
	// Zero uses the default HotRing bucket count.
	ValueLogHotRingBits uint8
	// ValueLogHotRingRotationInterval enables rotation for the value-log ring.
	// Zero disables rotation.
	ValueLogHotRingRotationInterval time.Duration
	// ValueLogHotRingNodeCap caps the number of tracked keys per value-log ring.
	ValueLogHotRingNodeCap uint64
	// ValueLogHotRingNodeSampleBits controls stable sampling for value-log keys.
	// A value of 0 enforces a strict cap; larger values sample 1/2^N keys.
	ValueLogHotRingNodeSampleBits uint8
	// ValueLogHotRingDecayInterval controls how often the value-log ring decays counters.
	ValueLogHotRingDecayInterval time.Duration
	// ValueLogHotRingDecayShift determines decay aggressiveness for the value-log ring.
	ValueLogHotRingDecayShift uint32
	// ValueLogHotRingWindowSlots controls the number of sliding-window buckets for the value-log ring.
	ValueLogHotRingWindowSlots int
	// ValueLogHotRingWindowSlotDuration sets the duration of each value-log window bucket.
	ValueLogHotRingWindowSlotDuration time.Duration

	SyncWrites   bool
	ManifestSync bool
	// ManifestRewriteThreshold triggers a manifest rewrite when the active
	// MANIFEST file grows beyond this size (bytes). Values <= 0 disable rewrites.
	ManifestRewriteThreshold int64
	// WriteHotKeyLimit caps how many consecutive writes a single key can issue
	// before the DB returns utils.ErrHotKeyWriteThrottle. Zero disables write-path
	// throttling.
	WriteHotKeyLimit int32
	// HotWriteBurstThreshold marks a key as "hot" for batching when its write
	// frequency exceeds this count; zero disables hot write batching.
	HotWriteBurstThreshold int32
	// HotWriteBatchMultiplier scales write batch limits when a hot key is
	// detected, allowing short-term coalescing of repeated writes.
	HotWriteBatchMultiplier int
	// WriteBatchWait adds an optional coalescing delay when the commit queue is
	// momentarily empty, letting small bursts share one WAL fsync/apply pass.
	// Zero disables the delay.
	WriteBatchWait time.Duration

	// Block cache configuration for read path optimization. Cached blocks
	// target L0/L1; colder data relies on the OS page cache.
	BlockCacheSize int
	BloomCacheSize int

	// RaftLagWarnSegments determines how many WAL segments a follower can lag
	// behind the active segment before stats surfaces a warning. Zero disables
	// the alert.
	RaftLagWarnSegments int64

	// EnableWALWatchdog enables the background WAL backlog watchdog which
	// surfaces typed-record warnings and optionally runs automated segment GC.
	EnableWALWatchdog bool
	// WALAutoGCInterval controls how frequently the watchdog evaluates WAL
	// backlog for automated garbage collection.
	WALAutoGCInterval time.Duration
	// WALAutoGCMinRemovable is the minimum number of removable WAL segments
	// required before an automated GC pass will run.
	WALAutoGCMinRemovable int
	// WALAutoGCMaxBatch bounds how many WAL segments are removed during a single
	// automated GC pass.
	WALAutoGCMaxBatch int
	// WALTypedRecordWarnRatio triggers a typed-record warning when raft records
	// constitute at least this fraction of WAL writes. Zero disables ratio-based
	// warnings.
	WALTypedRecordWarnRatio float64
	// WALTypedRecordWarnSegments triggers a typed-record warning when the number
	// of WAL segments containing raft records exceeds this threshold. Zero
	// disables segment-count warnings.
	WALTypedRecordWarnSegments int64

	// DiscardStatsFlushThreshold controls how many discard-stat updates must be
	// accumulated before they are flushed back into the LSM. Zero keeps the
	// default threshold.
	DiscardStatsFlushThreshold int

	// NumCompactors controls how many background compaction workers are spawned.
	// Zero uses an auto value derived from the host CPU count.
	NumCompactors int
	// NumLevelZeroTables controls when write throttling kicks in and feeds into
	// the compaction priority calculation. Zero falls back to the legacy default.
	NumLevelZeroTables int
	// IngestCompactBatchSize decides how many L0 tables to promote into the
	// ingest buffer per compaction cycle. Zero falls back to the legacy default.
	IngestCompactBatchSize int
	// IngestBacklogMergeScore triggers an ingest-merge task when the ingest
	// backlog score exceeds this threshold. Zero keeps the default (2.0).
	IngestBacklogMergeScore float64

	// CompactionValueWeight adjusts how aggressively the scheduler prioritises
	// levels whose entries reference large value log payloads. Higher values
	// make the compaction picker favour levels with high ValuePtr density.
	CompactionValueWeight float64

	// CompactionValueAlertThreshold triggers stats alerts when a level's
	// value-density (value bytes / total bytes) exceeds this ratio.
	CompactionValueAlertThreshold float64

	// IngestShardParallelism caps how many ingest shards can be compacted in a
	// single ingest-only pass. A value <= 0 falls back to 1 (sequential).
	IngestShardParallelism int
}

Options holds the top-level database configuration.

func NewDefaultOptions

func NewDefaultOptions() *Options

NewDefaultOptions returns the default option set.

type RaftStatsSnapshot added in v0.6.0

type RaftStatsSnapshot struct {
	GroupCount       int    `json:"group_count"`
	LaggingGroups    int    `json:"lagging_groups"`
	MinLogSegment    uint32 `json:"min_log_segment"`
	MaxLogSegment    uint32 `json:"max_log_segment"`
	MaxLagSegments   int64  `json:"max_lag_segments"`
	LagWarnThreshold int64  `json:"lag_warn_threshold"`
	LagWarning       bool   `json:"lag_warning"`
}

RaftStatsSnapshot summarizes raft log lag across tracked groups.

type RegionStatsSnapshot added in v0.6.0

type RegionStatsSnapshot struct {
	Total     int64 `json:"total"`
	New       int64 `json:"new"`
	Running   int64 `json:"running"`
	Removing  int64 `json:"removing"`
	Tombstone int64 `json:"tombstone"`
	Other     int64 `json:"other"`
}

RegionStatsSnapshot reports region counts grouped by region state.

type Stats

type Stats struct {
	// contains filtered or unexported fields
}

Stats owns periodic runtime metric collection and snapshot publication.

func (*Stats) SetRegionMetrics

func (s *Stats) SetRegionMetrics(rm *metrics.RegionMetrics)

SetRegionMetrics attaches region metrics recorder used in snapshots.

func (*Stats) Snapshot

func (s *Stats) Snapshot() StatsSnapshot

Snapshot returns a point-in-time metrics snapshot without mutating state.

func (*Stats) StartStats

func (s *Stats) StartStats()

StartStats runs periodic collection of internal backlog metrics.

type StatsSnapshot

type StatsSnapshot struct {
	Entries    int64                             `json:"entries"`
	Flush      FlushStatsSnapshot                `json:"flush"`
	Compaction CompactionStatsSnapshot           `json:"compaction"`
	ValueLog   ValueLogStatsSnapshot             `json:"value_log"`
	WAL        WALStatsSnapshot                  `json:"wal"`
	Raft       RaftStatsSnapshot                 `json:"raft"`
	Write      WriteStatsSnapshot                `json:"write"`
	Region     RegionStatsSnapshot               `json:"region"`
	Hot        HotStatsSnapshot                  `json:"hot"`
	Cache      CacheStatsSnapshot                `json:"cache"`
	LSM        LSMStatsSnapshot                  `json:"lsm"`
	Transport  transportpkg.GRPCTransportMetrics `json:"transport"`
	Redis      metrics.RedisSnapshot             `json:"redis"`
}

StatsSnapshot captures a point-in-time view of internal backlog metrics.

type ValueLogStatsSnapshot added in v0.6.0

type ValueLogStatsSnapshot struct {
	Segments       int                        `json:"segments"`
	PendingDeletes int                        `json:"pending_deletes"`
	DiscardQueue   int                        `json:"discard_queue"`
	Heads          map[uint32]kv.ValuePtr     `json:"heads,omitempty"`
	GC             metrics.ValueLogGCSnapshot `json:"gc"`
}

ValueLogStatsSnapshot reports value-log segment status and GC counters.

type WALStatsSnapshot added in v0.6.0

type WALStatsSnapshot struct {
	ActiveSegment           int64             `json:"active_segment"`
	SegmentCount            int64             `json:"segment_count"`
	ActiveSize              int64             `json:"active_size"`
	SegmentsRemoved         uint64            `json:"segments_removed"`
	RecordCounts            wal.RecordMetrics `json:"record_counts"`
	SegmentsWithRaftRecords int               `json:"segments_with_raft_records"`
	RemovableRaftSegments   int               `json:"removable_raft_segments"`
	TypedRecordRatio        float64           `json:"typed_record_ratio"`
	TypedRecordWarning      bool              `json:"typed_record_warning"`
	TypedRecordReason       string            `json:"typed_record_reason,omitempty"`
	AutoGCRuns              uint64            `json:"auto_gc_runs"`
	AutoGCRemoved           uint64            `json:"auto_gc_removed"`
	AutoGCLastUnix          int64             `json:"auto_gc_last_unix"`
}

WALStatsSnapshot captures WAL head position, record mix, and watchdog status.

type WriteStatsSnapshot added in v0.6.0

type WriteStatsSnapshot struct {
	QueueDepth       int64   `json:"queue_depth"`
	QueueEntries     int64   `json:"queue_entries"`
	QueueBytes       int64   `json:"queue_bytes"`
	AvgBatchEntries  float64 `json:"avg_batch_entries"`
	AvgBatchBytes    float64 `json:"avg_batch_bytes"`
	AvgRequestWaitMs float64 `json:"avg_request_wait_ms"`
	AvgValueLogMs    float64 `json:"avg_vlog_ms"`
	AvgApplyMs       float64 `json:"avg_apply_ms"`
	BatchesTotal     int64   `json:"batches_total"`
	ThrottleActive   bool    `json:"throttle_active"`
	HotKeyLimited    uint64  `json:"hot_key_limited"`
}

WriteStatsSnapshot tracks write-path queue pressure, latency, and throttling.

Directories

Path – Synopsis
cmd
    nokv (command)
    nokv-config (command)
    nokv-redis (command)
file – Package file provides low-level file and mmap primitives shared by WAL, vlog, and SST layers.
lsm
manifest – Package manifest persists SST, WAL checkpoint, vlog, and raft metadata.
pd
    tso
kv
vfs – Package vfs provides a tiny filesystem abstraction and fault-injection wrapper.
vlog – Package vlog implements the value-log segment manager and IO helpers.
wal – Package wal implements the write-ahead log manager and replay logic.
