codedb 0.2.5825

@justrach released this 12 Jun 15:48 · 1 commit to release/0.2.5825 since this release

A retrieval-quality, capability, and speed cut. 0.2.5825 closes out a long audit cycle (133 commits since 0.2.5824) and ships a sustained latency pass driven by 2,467 real production query-log calls — the search hot path is ~4–8× faster, repeat searches return in microseconds, and the single biggest production-tail bug (whole-repo tier-3 scans after a snapshot restore) is gone: negative searches drop 9.2 ms → 0.5–0.9 ms.

⚡ How much faster?

Every number below is a real measurement from this cycle (commit messages carry the full methodology).

Search latency

| Path | Before | After | Change |
| --- | --- | --- | --- |
| searchContent hot path (#611) | 65–107 µs/query | 7.4–28.7 µs/query | ~4–8× |
| Repeat search (result LRU hit, #613) | 20.7 µs | 2.0 µs | ~10× |
| First MCP search call after startup (warmup, #613) | 21.8 ms (21–40 ms variance) | 6.3 ms (6.2–6.5 ms) | ~3.5×, stable |
| Fall-through / negative search after snapshot restore (#615) | 9.2 ms (whole-repo scan, recall_complete=false) | 0.5–0.9 ms (recall_complete=true) | ~10–18× |
| Symbol lookup with a complete index (#613) | ~6 ms/call | 50–100 ns | ~60,000× |
| Zero-hit queries, 20k-file corpus, CODEDB_TRIGRAM_CAP uncapped (#615) | 7.1 ms | 1.4 ms | ~5× (opt-in: +110 MB peak RSS, +300 ms index time) |

Per-query micro-benchmarks (codedb repo, c_allocator, min-of-N, uncached — the benchmark pins CODEDB_NO_SEARCH_CACHE=1 so rows stay comparable across versions):

| Query | 0.2.5824 | 0.2.5825 | Speedup |
| --- | --- | --- | --- |
| middleware | 88 µs | 10.2 µs | 8.6× |
| database | 65 µs | 7.4 µs | 8.8× |
| error | 107 µs | 19.6 µs | 5.5× |
| authentication | 50 µs (mid-cycle) | 28.7 µs | 1.7× |
| error (cache hit) | 20.7 µs | 2.0 µs | 10× |
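
For readers unfamiliar with min-of-N timing, here is a minimal Python sketch of the idea: run the same query many times and keep the fastest run, which filters out scheduler and cache noise. The names `min_of_n` and `speedup` and the toy corpus are assumptions for illustration; the actual benchmark harness is not shown in these notes.

```python
import time
from typing import Callable


def min_of_n(fn: Callable[[], object], n: int = 200) -> int:
    """Return the fastest of n timed runs of fn, in nanoseconds."""
    best = None
    for _ in range(n):
        start = time.perf_counter_ns()
        fn()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def speedup(before: float, after: float) -> float:
    """Ratio of the old latency to the new one (same unit for both)."""
    return before / after


# Toy stand-in for a search: a substring scan over an in-memory corpus.
corpus = ["def middleware(): pass", "database = connect()", "raise error"] * 1000
fastest = min_of_n(lambda: [line for line in corpus if "middleware" in line])
print(f"fastest run: {fastest / 1000:.1f} µs")

# Reproducing the 'middleware' row: 88 µs before, 10.2 µs after.
print(f"{speedup(88.0, 10.2):.1f}x")
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since noise only ever adds time.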

How: a line-offset cache instead of per-query line rescans, doc_id-grouped postings with a contiguous-run fast path (per-hit work drops to a doc_id compare), packed-u64-key sorts (no string compares or 40-byte struct moves inside the sort), rare-byte SIMD scan anchors (the scan no longer stops to verify "authentication" at every "a"), direct-address doc slots, symbol-length bitmasks that skip whole files, init-time path classification (previously ~10 path tokenizations per path per rerank), memoized per-path rerank facts, and one outline fetch per candidate.
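
As an illustration of the first item, a line-offset cache can be sketched in a few lines of Python: record the byte offset of every line start once, then map any match offset to a line number with a binary search instead of rescanning the file. This is a sketch of the general technique, not codedb's implementation; the class name `LineOffsetCache` is hypothetical.

```python
from bisect import bisect_right


class LineOffsetCache:
    """Byte offsets of every line start, computed once per file version."""

    def __init__(self, content: bytes) -> None:
        self.starts = [0]
        pos = content.find(b"\n")
        while pos != -1:
            self.starts.append(pos + 1)
            pos = content.find(b"\n", pos + 1)

    def line_of(self, offset: int) -> int:
        """1-based line number of the line containing the byte at offset."""
        return bisect_right(self.starts, offset)


content = b"fn a() {}\nfn middleware() {}\nfn b() {}\n"
cache = LineOffsetCache(content)
hit = content.find(b"middleware")
print(cache.line_of(hit))  # 2
```

Each lookup costs O(log lines) after a single O(bytes) pass, which is where the saving over a per-query rescan comes from.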

Memory & load path

| Path | Before | After |
| --- | --- | --- |
| Snapshot fast-load, openclaw 13,654 files (#564) | 60 ms | 40 ms (−33%) |
| Pass C heap during load (#564) | +62.5 MB | +20.5 MB (−67%) |
| One-shot search physical footprint (#564) | 132.7 MB | 89.2 MB (−33%) |
| Max RSS, one-shot search (#564) | 244 MB | 200 MB |
| codedb <dir> status (#553) | full index materialized: a multi-GB resident process that never exited | metadata-only (reported by @lekt9 🙏) |
| Background warmup steady-state cost (#613) | | ~70 ms one-time background CPU, +4.4 MB RSS (caches hard-capped at 4 MB each) |

The production numbers that drove it

A 2,467-call production query log showed: search p90 30 ms with occasional 2-second outliers, codedb_find median 4.5 ms / p90 17.7 ms, and 62% of calls being exact repeats of an earlier (tool, query) pair. All three tails are addressed: the p90/outliers traced to the #615 scan-set bug plus the 50 ms–2 s word-index rebuild that used to land on an innocent first query (now pre-paid by the warmup thread), the codedb_find tail was the O(files × symbols) safety scan (now gated), and the repeats now hit microsecond caches.
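
The two statistics quoted above (percentile latencies and the share of exact repeats) are straightforward to compute from any call log. The Python sketch below assumes a log already parsed into (tool, query) pairs and a list of latencies; the real queries.log format is not described here, and the sample values are made up.

```python
import math
from typing import Iterable, List, Tuple


def repeat_fraction(calls: Iterable[Tuple[str, str]]) -> float:
    """Share of calls whose (tool, query) pair was already seen earlier."""
    seen = set()
    repeats = 0
    total = 0
    for pair in calls:
        total += 1
        if pair in seen:
            repeats += 1
        else:
            seen.add(pair)
    return repeats / total if total else 0.0


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile (p between 1 and 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample data, for illustration only.
calls = [("search", "error"), ("search", "error"), ("find", "main"),
         ("search", "error"), ("find", "init")]
latencies_ms = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

print(repeat_fraction(calls))        # 0.4
print(percentile(latencies_ms, 50))  # 5.0
print(percentile(latencies_ms, 90))  # 9.0
```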

🔥 The big one: tier-3 scan-set reconciliation (#615)

Snapshot restore parks every file in skip_trigram_files (it can't know what the disk trigram index covers), and two compounding failures meant the set never emptied on the standard serve/mcp/cli-daemon startup path:

  1. Nothing pruned the set when the disk trigram index was later mmap-loaded.
  2. The snapshot freshness pass reindexes changed files into the heap trigram before the disk-load gate runs — and that gate early-returned on any heap entry. One dirty file blocked the disk trigram load for the whole repo.

Net effect: tier 3 content-scanned the entire project on every fall-through query, with recall_complete=false. Measured live: 613/616 files in the scan set. After the fix: 0.

All trigram replacement now funnels through adoptTrigramIndex / adoptTrigramBase (swap, bump the search generation, prune the skip set), and the mmap load keeps freshness-reindexed files as a masking overlay so their newer content wins over stale base entries.
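
The "single funnel" idea can be sketched as follows. This is a simplified Python model under stated assumptions, not the project's code: `TrigramState`, its fields, and `adopt_base` are hypothetical stand-ins for the adoptTrigramIndex / adoptTrigramBase paths, and the index is reduced to a path-to-trigram-set mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TrigramState:
    base: Dict[str, Set[str]] = field(default_factory=dict)     # disk (mmap) index
    overlay: Dict[str, Set[str]] = field(default_factory=dict)  # freshly reindexed files
    skip_trigram_files: Set[str] = field(default_factory=set)   # files tier 3 must scan
    generation: int = 0

    def adopt_base(self, new_base: Dict[str, Set[str]]) -> None:
        """Single entry point: swap, bump the generation, prune the skip set."""
        self.base = dict(new_base)
        self.generation += 1
        self.skip_trigram_files -= self.base.keys() | self.overlay.keys()

    def trigrams_for(self, path: str) -> Set[str]:
        """Overlay entries mask stale base entries for the same path."""
        if path in self.overlay:
            return self.overlay[path]
        return self.base.get(path, set())


# Snapshot restore parks every file in the skip set; one file is dirty.
state = TrigramState(skip_trigram_files={"a", "b", "c"}, overlay={"b": {"new"}})
state.adopt_base({"a": {"abc"}, "b": {"old"}, "c": {"xyz"}})
print(len(state.skip_trigram_files))  # 0
print(state.trigrams_for("b"))        # {'new'}
```

Because every replacement goes through one method, the skip set cannot be left stale by a code path that forgot to prune it, which is the failure mode described above.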

⚡ Result caches + background warmup (#613)

  • Whole-query result LRUs for searchContent, renderPlainSearch (MCP fast path), and the BM25 ranked path — 64 entries / 4 MB each, validated against both the search generation and a fingerprint of the nine ranking kill-switch env vars. CODEDB_NO_SEARCH_CACHE=1 disables.
  • Background warmup: serve/mcp/cli-daemon build + persist the word index off the query path and replay the most-repeated queries from queries.log — 62% of production calls are exact repeats of an earlier (tool, query) pair, so the caches are warm before your first real call. CODEDB_NO_WARMUP=1 disables; skipped under CODEDB_LOW_MEMORY.
  • Race fix: generation bumps moved inside the exclusive lock — a concurrent search can no longer cache pre-mutation results under the post-mutation generation.
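
A generation-validated result LRU of the kind described in the first bullet might look like the Python sketch below. It is an illustration under assumptions: `ResultCache` and `env_fingerprint` are hypothetical names, and since the notes mention nine kill-switch variables without listing them all, only two that appear in these notes are fingerprinted.

```python
from collections import OrderedDict
from typing import Mapping, Optional, Tuple

# Assumption: a two-variable subset stands in for the nine kill switches.
KILL_SWITCHES = ("CODEDB_NO_GRAPH_DISTANCE", "CODEDB_NO_COCHANGE")


def env_fingerprint(env: Mapping[str, str]) -> Tuple[str, ...]:
    """Snapshot of the env vars that can change ranking output."""
    return tuple(env.get(name, "") for name in KILL_SWITCHES)


class ResultCache:
    """LRU bounded by entry count and total bytes; stale entries never hit."""

    def __init__(self, max_entries: int = 64, max_bytes: int = 4 * 1024 * 1024) -> None:
        self.max_entries = max_entries
        self.max_bytes = max_bytes
        self._entries: OrderedDict = OrderedDict()
        self._bytes = 0

    def __len__(self) -> int:
        return len(self._entries)

    def _drop(self, query: str) -> None:
        _, _, _, size = self._entries.pop(query)
        self._bytes -= size

    def get(self, query: str, generation: int, fingerprint: tuple) -> Optional[str]:
        entry = self._entries.get(query)
        if entry is None:
            return None
        cached_gen, cached_fp, result, _ = entry
        if cached_gen != generation or cached_fp != fingerprint:
            self._drop(query)  # index or ranking config changed since caching
            return None
        self._entries.move_to_end(query)  # mark as most recently used
        return result

    def put(self, query: str, generation: int, fingerprint: tuple, result: str) -> None:
        size = len(result.encode("utf-8"))
        if size > self.max_bytes:
            return  # too large to ever fit
        if query in self._entries:
            self._drop(query)
        self._entries[query] = (generation, fingerprint, result, size)
        self._bytes += size
        while len(self._entries) > self.max_entries or self._bytes > self.max_bytes:
            self._drop(next(iter(self._entries)))  # evict least recently used


cache = ResultCache()
fp = env_fingerprint({})
cache.put("error", 1, fp, "cached hits for 'error'")
print(cache.get("error", 1, fp))  # cached hits for 'error'
print(cache.get("error", 2, fp))  # None: the generation moved on
```

Validating on read keeps invalidation cheap: a mutation only has to bump one counter rather than walk the cache.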

🧠 Ranking: query-specific graph signals (#550, #546, #554)

  • Call-graph distance (#608) — files near the matched symbols in the resolved call graph get a query-specific boost (CODEDB_NO_GRAPH_DISTANCE opts out).
  • Git co-change (#609) — a bounded history pass (500 commits, ≤32-file commits, top-8 partners) boosts files that historically change together (CODEDB_NO_COCHANGE opts out).
  • Negative lexical file-frequency penalty (#554) — mention-everywhere terms stop dragging hub files up.
  • Multi-word CLI search is ranked end-to-end (#546) — incl. the first cold run; tooling paths (bench/scripts/website/install) rank below src implementation (#557), basename test files get the test penalty (#580), and mention-dense tooling files can't saturate past the path prior (#598).
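
To make the co-change bounds concrete, here is a rough Python sketch of a bounded history pass. It assumes the history is already available as a newest-first list of per-commit file lists, and it reads "≤32-file commits" as "skip commits touching more than 32 files"; both are assumptions, as is the function name `cochange_partners`.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List


def cochange_partners(
    commits: List[List[str]],
    max_commits: int = 500,
    max_files: int = 32,
    top_k: int = 8,
) -> Dict[str, List[str]]:
    """For each file, the files it most often changed together with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for files in commits[:max_commits]:
        unique = sorted(set(files))
        if len(unique) > max_files:
            continue  # very large commits say little about coupling
        for a, b in combinations(unique, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {
        path: [partner for partner, _ in partners.most_common(top_k)]
        for path, partners in counts.items()
    }


history = [["a", "b"], ["a", "b"], ["a", "c"], ["d"]]
print(cochange_partners(history)["a"])  # ['b', 'c']
```

Capping both the number of commits and the commit size keeps the pairwise counting cost bounded regardless of repository age.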

🆕 Features

  • codedb_callpath — shortest resolved call chain between two symbols, each hop as path:name@line (#531).
  • PageRank graph centrality in ranked search (replaces in-degree; CODEDB_IN_DEGREE_CENTRALITY reverts) (#531).
  • codedb_context max_tokens — value-ordered section packing under a token budget, byte-identical output without the arg (#610).
  • Richer codedb_symbol — kind / prefix / glob / fuzzy filters, optional source body per hit.
  • format=json + paths_only + path_glob on search — structured output with provenance meta, ~50% fewer tokens for broad surveys.
  • codedb_changes in the CLI (#578), CODEDB_TRIGRAM_CAP for big-corpus operators (#615), CODEDB_ALLOW_TEMP for CI harnesses on temp checkouts (#538).
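
Conceptually, a shortest call chain like the one codedb_callpath reports can be found with a breadth-first search over the resolved call graph. The Python sketch below shows that general technique only; whether codedb uses BFS is not stated in these notes, and the graph, node labels, and `call_path` function are made up for illustration (the labels merely follow the path:name@line shape).

```python
from collections import deque
from typing import Dict, List, Optional


def call_path(graph: Dict[str, List[str]], source: str, target: str) -> Optional[List[str]]:
    """Shortest caller-to-callee chain from source to target, or None."""
    if source == target:
        return [source]
    parent: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for callee in graph.get(node, ()):
            if callee in parent:
                continue  # already reached by an equal or shorter chain
            parent[callee] = node
            if callee == target:
                path = [callee]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(callee)
    return None


# Hypothetical call graph: caller -> callees.
graph = {
    "main.src:main@10": ["server.src:serve@42", "cli.src:parse@7"],
    "server.src:serve@42": ["search.src:search@120"],
    "cli.src:parse@7": [],
}
print(call_path(graph, "main.src:main@10", "search.src:search@120"))
```

BFS visits nodes in order of hop count, so the first time the target is reached the chain is guaranteed to be a shortest one.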

🛡️ Correctness & hardening

  • Search recall after a snapshot load (#537, #539): restored files are searchable again; call-graph edges into restored files are back (#537b).
  • Store hardening (#597, #603): no unlocked diff writes, data-log compaction, clean failure paths.
  • mmap overlay (#593, #600): overlay edits mask stale base entries; writeToDisk persists merged state.
  • Word index (#583, #585, #606): stale postings dropped on disk load; doc_id slots reused — bounded memory in long-lived daemons.
  • ContentCache (#584, #596): probe-window reachability + byte budget.
  • OOM-safe indexing (#594), per-project flock for cli-daemon spawn (#592), comment/string-aware call-site extraction (#562, #572).
  • Secret filtering (#589, #572): id_ecdsa / id_dsa / *_sk FIDO keys, *.env variants, .git-credentials blocked from indexing and search.
  • TS/JS dependency graph (#540–#543, #548): multi-line + re-export imports, relative-path resolution, no bogus deps from strings.
  • A dozen CLI/tool UX fixes (#558, #560, #566, #568–#570, #573, #576, #588); every one landed with a failing test first.

🙏 Contributors

  • @nsxdavid — TS/JS dependency-graph fixes (#542, #543)
  • @lekt9 — reported the resident-status-process leak (#553), now metadata-only
  • @idea404 — PR #535 (local fallback when api.wiki.codes is unreachable), under review for the next cut

Full details in the CHANGELOG.

Install

curl -fsSL https://codedb.codegraff.com/install.sh | sh

or npx -y codedeebee mcp

| Platform | Asset | Signed |
| --- | --- | --- |
| macOS ARM64 (Apple Silicon) | codedb-darwin-arm64 | ✅ codesigned + notarized |
| macOS x86_64 (Intel) | codedb-darwin-x86_64 | temporarily unsigned (#504) |
| Linux ARM64 | codedb-linux-arm64 | |
| Linux x86_64 | codedb-linux-x86_64 | |