find: faster + smarter fuzzy file ranking (prefilter + SIMD + symbol fast-path)#526
Conversation
…eval-identical) codedb_find ran the O(query*path) Smith-Waterman DP in fuzzyScore once per file — ~250ms for long, symbol-like queries that match no filename (e.g. "getTokenProvider" over openclaw's 13.6k files); real filename queries that hit the exact-match fast path are already ~2ms. Two retrieval-identical changes (verified recall@10/@30 = 1.00, exact-order, vs the release baseline over 16 representative queries): - Presence pre-filter: issue #518 made a >=60% in-order LCS a hard requirement; unordered char presence is an upper bound on that LCS, so a file missing >40% of the query's chars can be rejected in O(path) without the DP. Skips the DP for files the LCS floor would reject anyway — no result changes. - Tightened DP: hoist the per-row lowercased query char, precompute path lowercase + word-boundary flags once, swap the prev/curr rows by pointer instead of two @memcpy per row, and fold the matched-row scan into the main loop. Same arithmetic, ~1.1x. Factored the per-file multi-part scoring into scoreAllParts (shared, no behavior change). 699 tests green. Note: a cheap candidate-cap variant (greedy proxy -> DP top-N) was prototyped and rejected — it reached ~10x but dropped recall@10 to 0.85 (and 0.00 on "getTokenProvider"/"TokenProvider"), because the greedy proxy can't model the DP's mismatch tolerance. A retrieval-faithful big speedup needs SIMD-across-files (next commit). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…val-identical) fuzzyFindFiles ran the Smith-Waterman DP one file at a time. Refactor the scalar path into fuzzyDP (the hot loop) + fuzzyFinalize (floor/LCS/bonus tail), then add fuzzyScoreBatch: the same DP vectorized so each SIMD lane scores a *different* file against the same query (inter-sequence parallelism, FZ_LANES=8). Shorter paths are length-masked per column so padding never affects best_score; the shared fuzzyFinalize tail keeps post-processing identical. Single-part queries (the common case) collect presence-prefilter survivors and score them in batches of 8; multi-part falls back to scalar. All DP values are sums of exactly-representable small integers, so the SIMD result is bit-identical to scalar — verified by a new test (fuzzyScoreBatch == fuzzyScore over a matrix of queries x paths, incl. a full 8-lane mixed-length batch) and end-to-end on openclaw (13.6k files, 16 queries): recall@10 = recall@30 = 1.00, exact result order preserved vs the release baseline. ReleaseFast, openclaw, the long symbol-like queries that miss the exact-filename fast path (the only slow ones; real filename queries are already <0.5ms): getTokenProvider 41 -> 23ms (1.80x), handleRequest 33 -> 19ms (1.78x), embedded 22 -> 9ms (2.39x), subscribe 25 -> 14ms (1.82x). 8 lanes beat 16 on NEON (register pressure). 699+1 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
…orrect) The slow codedb_find queries were symbol-like identifiers (getTokenProvider, handleRequest) typed into a *fuzzy filename* search — they match no filename, so they paid the full fuzzy scan AND returned useless filename guesses instead of the definition the caller wanted. Add a symbol fast-path in handleFind, after the exact-filename check and before the fuzzy scan: when the query is a compound identifier (camelCase/PascalCase or snake_case — never ALL-CAPS, dotted, slashed, spaced, or <4 chars, so real filename searches are untouched), resolve it against the symbol index and return the definition site(s); a non-symbol identifier falls through to the (now SIMD-accelerated) fuzzy scan. renderSymbolDefsFast uses findAllSymbols (the COMPLETE lookup incl. the outline safety scan), not symbol_index alone — symbol_index is incomplete for some languages (indexes ~no TypeScript functions; symbol_index-only scored 0/40 real openclaw symbols). Definition kinds rank above import/comment usages; usages show only when no real def is indexed. Returns false without writing on a miss. Measured (ReleaseFast, openclaw 13.6k files): 40/40 real symbols resolve via the fast-path at ~1.3ms (def location returned) vs ~29ms for the same queries on the fuzzy path — ~22x faster and more useful. Filename queries (auth, config) keep the <0.5ms exact/fuzzy path. New test: classifier truth table + renderSymbolDefsFast resolves a def and returns false (no output) on a miss. 700+ tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Added: symbol fast-path (3rd commit, 41f79d4)The slow
Uses the complete Measured (ReleaseFast, openclaw 13.6k files): 40/40 real symbols resolve via the fast-path at ~1.3 ms (returning the def location) vs ~29 ms on the fuzzy path — ~22× faster and more useful. Filename queries ( New test: classifier truth table + Net effect of this PR on
|
Benchmark Regression ReportThresholds: 10.00% and 50,000 ns absolute delta
|
What makes it faster (and what's unchanged)Four mechanisms, in the order
Verification
|
…, CLI hardening, ReScript - release_info.semver 0.2.5823 -> 0.2.5824 (the version codedb reports and update.zig compares against; build.zig.zon was already 0.2.5824). - CHANGELOG: document the warm CLI daemon (#525), faster fuzzy find (#526), CLI hardening (#529), ReScript .res/.resi (#532), and audit fixes (#530) alongside the existing code-graph / snapshot-load sections. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
codedb_find(fuzzy filename search) ran a per-file Smith-Waterman DP over every file in the tree. On large repos the long, symbol-like queries that don't hit the exact-filename fast path were slow (openclaw, 13.6k files: ~250ms Debug / ~41ms ReleaseFast forgetTokenProvider). Real filename queries (auth,config,index) were already <0.5ms — only the degenerate "used find as a symbol lookup" case was slow.This branch speeds up that case with zero change to which files come back or their order, verified on both a unit matrix and end-to-end on openclaw (recall@10 = recall@30 = 1.00, exact result order preserved vs the release baseline across 16 representative queries).
Two commits
1. Sound presence prefilter + tightened scalar DP (
55df894)@memcpy), folded the matched-row scan into the main loop.2. SIMD-across-files Smith-Waterman (
0e954b6)fuzzyDP(hot loop) +fuzzyFinalize(floor/LCS/bonus tail).fuzzyScoreBatch: the same DP vectorized so each of 8 SIMD lanes scores a different file against the same query (inter-sequence parallelism). Shorter paths are length-masked per column so padding never affectsbest_score/matched_chars; the sharedfuzzyFinalizetail keeps post-processing identical.Measured (ReleaseFast, openclaw 13.6k files, interleaved vs release baseline)
8 lanes beat 16 on Apple-Silicon NEON (register pressure). The remaining gap from a theoretical 4× is the per-batch transpose plus the DP's serial per-column dependency.
Tests
New
test "fuzzy SIMD batch scorer matches scalar fuzzyScore exactly"comparesfuzzyScoreBatchtofuzzyScoreover a query×path matrix (incl. a full 8-lane mixed-length batch). Full suite green.Investigated and rejected
A cheap candidate cap (greedy proxy → DP top-N) reached ~10× but dropped recall@10 to 0.85 (0.00 on
getTokenProvider) — the greedy proxy can't model Smith-Waterman's mismatch tolerance. Dropped in favor of the retrieval-identical SIMD path.🤖 Generated with Claude Code