Skip to content

find: faster + smarter fuzzy file ranking (prefilter + SIMD + symbol fast-path)#526

Merged
justrach merged 3 commits into
release/0.2.5824from
fix/codedb-find-fuzzy-perf
Jun 3, 2026
Merged

find: faster + smarter fuzzy file ranking (prefilter + SIMD + symbol fast-path)#526
justrach merged 3 commits into
release/0.2.5824from
fix/codedb-find-fuzzy-perf

Conversation

@justrach

@justrach justrach commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Summary

codedb_find (fuzzy filename search) ran a per-file Smith-Waterman DP over every file in the tree. On large repos the long, symbol-like queries that don't hit the exact-filename fast path were slow (openclaw, 13.6k files: ~250ms Debug / ~41ms ReleaseFast for getTokenProvider). Real filename queries (auth, config, index) were already <0.5ms — only the degenerate "used find as a symbol lookup" case was slow.

This branch speeds up that case with zero change to which files come back or their order, verified on both a unit matrix and end-to-end on openclaw (recall@10 = recall@30 = 1.00, exact result order preserved vs the release baseline across 16 representative queries).

Two commits

1. Sound presence prefilter + tightened scalar DP (55df894)

2. SIMD-across-files Smith-Waterman (0e954b6)

  • Refactored the scalar scorer into fuzzyDP (hot loop) + fuzzyFinalize (floor/LCS/bonus tail).
  • Added fuzzyScoreBatch: the same DP vectorized so each of 8 SIMD lanes scores a different file against the same query (inter-sequence parallelism). Shorter paths are length-masked per column so padding never affects best_score/matched_chars; the shared fuzzyFinalize tail keeps post-processing identical.
  • Single-part queries (the common case) batch presence-survivors 8 at a time; multi-part falls back to scalar.
  • All DP values are sums of exactly-representable small integers → the SIMD result is bit-identical to scalar.

Measured (ReleaseFast, openclaw 13.6k files, interleaved vs release baseline)

query baseline this PR speedup recall@10 order
getTokenProvider 41 ms 23 ms 1.80× 1.00 exact
handleRequest 33 ms 19 ms 1.78× 1.00 exact
embedded 22 ms 9 ms 2.39× 1.00 exact
subscribe 25 ms 14 ms 1.82× 1.00 exact
auth / config / index <0.5 ms <0.5 ms 1.00 exact

8 lanes beat 16 on Apple-Silicon NEON (register pressure). The remaining gap from a theoretical 4× is the per-batch transpose plus the DP's serial per-column dependency.

Tests

New test "fuzzy SIMD batch scorer matches scalar fuzzyScore exactly" compares fuzzyScoreBatch to fuzzyScore over a query×path matrix (incl. a full 8-lane mixed-length batch). Full suite green.

Investigated and rejected

A cheap candidate cap (greedy proxy → DP top-N) reached ~10× but dropped recall@10 to 0.85 (0.00 on getTokenProvider) — the greedy proxy can't model Smith-Waterman's mismatch tolerance. Dropped in favor of the retrieval-identical SIMD path.

🤖 Generated with Claude Code

justrach and others added 2 commits June 3, 2026 16:21
…eval-identical)

codedb_find ran the O(query*path) Smith-Waterman DP in fuzzyScore once per
file — ~250ms for long, symbol-like queries that match no filename (e.g.
"getTokenProvider" over openclaw's 13.6k files); real filename queries that
hit the exact-match fast path are already ~2ms.

Two retrieval-identical changes (verified recall@10/@30 = 1.00, exact-order,
vs the release baseline over 16 representative queries):
- Presence pre-filter: issue #518 made a >=60% in-order LCS a hard requirement;
  unordered char presence is an upper bound on that LCS, so a file missing >40%
  of the query's chars can be rejected in O(path) without the DP. Skips the DP
  for files the LCS floor would reject anyway — no result changes.
- Tightened DP: hoist the per-row lowercased query char, precompute path
  lowercase + word-boundary flags once, swap the prev/curr rows by pointer
  instead of two @memcpy per row, and fold the matched-row scan into the main
  loop. Same arithmetic, ~1.1x.

Factored the per-file multi-part scoring into scoreAllParts (shared, no
behavior change). 699 tests green.

Note: a cheap candidate-cap variant (greedy proxy -> DP top-N) was prototyped
and rejected — it reached ~10x but dropped recall@10 to 0.85 (and 0.00 on
"getTokenProvider"/"TokenProvider"), because the greedy proxy can't model the
DP's mismatch tolerance. A retrieval-faithful big speedup needs SIMD-across-files
(next commit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…val-identical)

fuzzyFindFiles ran the Smith-Waterman DP one file at a time. Refactor the scalar
path into fuzzyDP (the hot loop) + fuzzyFinalize (floor/LCS/bonus tail), then add
fuzzyScoreBatch: the same DP vectorized so each SIMD lane scores a *different*
file against the same query (inter-sequence parallelism, FZ_LANES=8). Shorter
paths are length-masked per column so padding never affects best_score; the
shared fuzzyFinalize tail keeps post-processing identical.

Single-part queries (the common case) collect presence-prefilter survivors and
score them in batches of 8; multi-part falls back to scalar. All DP values are
sums of exactly-representable small integers, so the SIMD result is bit-identical
to scalar — verified by a new test (fuzzyScoreBatch == fuzzyScore over a matrix
of queries x paths, incl. a full 8-lane mixed-length batch) and end-to-end on
openclaw (13.6k files, 16 queries): recall@10 = recall@30 = 1.00, exact result
order preserved vs the release baseline.

ReleaseFast, openclaw, the long symbol-like queries that miss the exact-filename
fast path (the only slow ones; real filename queries are already <0.5ms):
  getTokenProvider 41 -> 23ms (1.80x), handleRequest 33 -> 19ms (1.78x),
  embedded 22 -> 9ms (2.39x), subscribe 25 -> 14ms (1.82x).
8 lanes beat 16 on NEON (register pressure). 699+1 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 78913 80714 +2.28% +1801 OK
codedb_changes 8458 8820 +4.28% +362 OK
codedb_context 962620 903301 -6.16% -59319 OK
codedb_deps 181 188 +3.87% +7 OK
codedb_edit 39871 38400 -3.69% -1471 OK
codedb_find 7970 8253 +3.55% +283 OK
codedb_hot 21067 20768 -1.42% -299 OK
codedb_outline 28328 27507 -2.90% -821 OK
codedb_read 13854 13765 -0.64% -89 OK
codedb_search 28844 30683 +6.38% +1839 OK
codedb_snapshot 64129 53804 -16.10% -10325 OK
codedb_status 7386 7728 +4.63% +342 OK
codedb_symbol 21224 22723 +7.06% +1499 OK
codedb_tree 32153 17767 -44.74% -14386 OK
codedb_word 9886 10966 +10.92% +1080 NOISE

…orrect)

The slow codedb_find queries were symbol-like identifiers (getTokenProvider,
handleRequest) typed into a *fuzzy filename* search — they match no filename, so
they paid the full fuzzy scan AND returned useless filename guesses instead of
the definition the caller wanted.

Add a symbol fast-path in handleFind, after the exact-filename check and before
the fuzzy scan: when the query is a compound identifier (camelCase/PascalCase or
snake_case — never ALL-CAPS, dotted, slashed, spaced, or <4 chars, so real
filename searches are untouched), resolve it against the symbol index and return
the definition site(s); a non-symbol identifier falls through to the (now
SIMD-accelerated) fuzzy scan.

renderSymbolDefsFast uses findAllSymbols (the COMPLETE lookup incl. the outline
safety scan), not symbol_index alone — symbol_index is incomplete for some
languages (indexes ~no TypeScript functions; symbol_index-only scored 0/40 real
openclaw symbols). Definition kinds rank above import/comment usages; usages show
only when no real def is indexed. Returns false without writing on a miss.

Measured (ReleaseFast, openclaw 13.6k files): 40/40 real symbols resolve via the
fast-path at ~1.3ms (def location returned) vs ~29ms for the same queries on the
fuzzy path — ~22x faster and more useful. Filename queries (auth, config) keep
the <0.5ms exact/fuzzy path. New test: classifier truth table + renderSymbolDefsFast
resolves a def and returns false (no output) on a miss. 700+ tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@justrach justrach changed the title find: faster fuzzy file ranking — sound prefilter + SIMD-across-files Smith-Waterman (retrieval-identical) find: faster + smarter fuzzy file ranking (prefilter + SIMD + symbol fast-path) Jun 3, 2026
@justrach

justrach commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

Added: symbol fast-path (3rd commit, 41f79d4)

The slow codedb_find queries were symbol-like identifiers (getTokenProvider, handleRequest) typed into a fuzzy filename search — they match no filename, so they paid the full scan and returned useless filename guesses rather than the definition the caller wanted.

handleFind now, after the exact-filename check and before the fuzzy scan, detects a compound identifier (camelCase/PascalCase or snake_case; never ALL-CAPS, dotted, slashed, spaced, or <4 chars) and resolves it against the symbol index, returning the definition site(s). Non-symbol identifiers fall through to the (SIMD-accelerated) fuzzy scan, so real filename searches are unaffected.

Uses the complete findAllSymbols lookup (incl. the outline safety scan), not symbol_index alone — symbol_index is incomplete for some languages (it indexes ~no TypeScript functions; a symbol_index-only version scored 0/40 real openclaw symbols). Definition kinds rank above import/comment usages.

Measured (ReleaseFast, openclaw 13.6k files): 40/40 real symbols resolve via the fast-path at ~1.3 ms (returning the def location) vs ~29 ms on the fuzzy path — ~22× faster and more useful. Filename queries (auth, config) keep the <0.5 ms exact/fuzzy path.

New test: classifier truth table + renderSymbolDefsFast resolves a def and returns false (no output) on a miss. 700+ tests green.

Net effect of this PR on codedb_find

  • Real symbol typed into find (the main slow/misused case): ~29 ms fuzzy guess → ~1.3 ms exact definition (~22×).
  • Compound query that isn't a symbol: fuzzy scan, now SIMD-accelerated (~1.8× on the heavy ones).
  • Filename query: unchanged, <0.5 ms.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

Tool Base (ns) Head (ns) Delta Abs Delta (ns) Status
codedb_bundle 65274 64794 -0.74% -480 OK
codedb_changes 6506 6816 +4.76% +310 OK
codedb_context 1028771 1040324 +1.12% +11553 OK
codedb_deps 266 235 -11.65% -31 OK
codedb_edit 36401 36543 +0.39% +142 OK
codedb_find 5397 4919 -8.86% -478 OK
codedb_hot 13758 14031 +1.98% +273 OK
codedb_outline 25291 25610 +1.26% +319 OK
codedb_read 11685 11554 -1.12% -131 OK
codedb_search 30013 30667 +2.18% +654 OK
codedb_snapshot 63285 63252 -0.05% -33 OK
codedb_status 5041 5003 -0.75% -38 OK
codedb_symbol 16055 15817 -1.48% -238 OK
codedb_tree 24076 16026 -33.44% -8050 OK
codedb_word 6882 7202 +4.65% +320 OK

@justrach

justrach commented Jun 3, 2026

Copy link
Copy Markdown
Owner Author

What makes it faster (and what's unchanged)

Four mechanisms, in the order handleFind applies them:

  1. Symbol fast-path (the big structural win). A compound-identifier query (camelCase/snake_case) is resolved against the symbol table and returns the definition site — it skips the per-file fuzzy scan entirely. This is an algorithm change (O(files) Smith-Waterman → O(symbol lookup)), not a constant-factor tweak: ~30–50× for those queries, and a better answer (the def, not filename guesses). Uses the complete findAllSymbols (incl. the outline safety scan) because symbol_index alone is incomplete for some languages (0/40 real TS symbols).
  2. Sound presence pre-filter. Issue codedb 0.2.5823: a few correctness/UX findings (non-ASCII outline, codedb_find false hits, kind labels, search cap, snapshot staleness) #518 makes a ≥60% in-order LCS a hard requirement; unordered char-presence is an upper bound on that LCS, so files missing >40% of the query's chars are rejected in O(path) — skipping the O(query×path) DP for files that would fail the gate anyway. Retrieval-identical.
  3. SIMD-across-files Smith-Waterman. The fuzzy fallback scores 8 files per vector pass (one file per lane). All DP values are sums of exactly-representable small ints → bit-identical to scalar. ~2× on a cool machine.
  4. Tightened scalar DP. Hoisted the per-row lowercased query char, precomputed path-lowercase + word-boundary flags, pointer-swapped the DP rows (no @memcpy), folded the matched-row scan.

Verification

  • 1:1 with everything else. Data output of all 10 non-find tools (search, callers, symbol, word, deps, tree, outline, glob, context) is byte-identical baseline vs this branch — find is the only behavior change, and its fuzzy path is retrieval-identical (recall@10/@30 = 1.00, exact order).
  • RSS flat. 157 MB vs baseline 158 MB after 200 mixed find queries — the SIMD lane arrays + candidate list are transient per-query (freed each call).
  • Tests: 700+ green.
  • Both surfaces: lives in handleFind, shared by MCP codedb_find and CLI file (warm daemon → runCliToolhandleFind). Measured CLI symbol query: ~40–60 ms → <4 ms.
  • Measurement caveat: the per-query fuzzy speedup is sensitive to this box's documented macOS E-core/thermal variance; using config/auth (untouched code) as a control, every fuzzy query's new/base ratio sits at or below the control penalty, confirming the SIMD path is not a regression.

@justrach justrach merged commit 7e5a2bb into release/0.2.5824 Jun 3, 2026
1 check passed
justrach added a commit that referenced this pull request Jun 4, 2026
…, CLI hardening, ReScript

- release_info.semver 0.2.5823 -> 0.2.5824 (the version codedb reports and
  update.zig compares against; build.zig.zon was already 0.2.5824).
- CHANGELOG: document the warm CLI daemon (#525), faster fuzzy find (#526),
  CLI hardening (#529), ReScript .res/.resi (#532), and audit fixes (#530)
  alongside the existing code-graph / snapshot-load sections.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant