Skip to content

Improve nav-to fuzzy pre-filtering with q-gram bigram bitset and corrected length thresholds#82457

Merged
JoeRobich merged 16 commits into
dotnet:mainfrom
CyrusNajmabadi:features/navigate-to-fuzzy-bigram-bitset
Feb 25, 2026
Merged

Improve nav-to fuzzy pre-filtering with q-gram bigram bitset and corrected length thresholds#82457
JoeRobich merged 16 commits into
dotnet:mainfrom
CyrusNajmabadi:features/navigate-to-fuzzy-bigram-bitset

Conversation

@CyrusNajmabadi

@CyrusNajmabadi CyrusNajmabadi commented Feb 19, 2026

Copy link
Copy Markdown
Contributor

Followup to #82431

Summary

The NavigateTo pre-filter gates whether fuzzy (edit-distance) matching is attempted for a document. Previously it used only a symbol-name-length bitset with a fixed ±2 delta — too coarse in practice. In a large file with many 6-character symbols, searching for "FooBar" (also 6 characters) would pass the length check for every document, forcing an expensive full-scan fuzzy match against every symbol in those documents even when the character content is completely different.

This PR improves the fuzzy pre-filter in three ways:

  1. Correct the length check thresholds to match what WordSimilarityChecker actually enforces: ±1 for pattern lengths 3–4, ±2 for 5+, and no fuzzy matching at all for patterns shorter than 3. Previously, the pre-filter used a blanket ±2 for all lengths and didn't reject short patterns, which meant it was more permissive than the actual fuzzy matcher.

  2. Add an exact bigram bitset (38×38 = 1444 bits, 184 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use the q-gram count lemma (Ukkonen, 1992) to require a minimum number of shared bigrams: min_shared = |pattern| - 1 - 2k, where k is the edit distance threshold. If fewer pattern bigrams match, the document is skipped for fuzzy matching entirely.

  3. Share threshold logic by extracting WordSimilarityChecker.GetThreshold(int) and MinFuzzyLength so the pre-filter and the actual fuzzy matcher use the same constants instead of duplicating magic numbers.

Motivation

Consider a codebase with thousands of documents, many containing 6-character symbol names. When the user types "FooBar" in NavigateTo:

  • Before: The length check passes for every document that has any symbol of length 4–8 (±2). The caller then creates a PatternMatcher with fuzzy matching enabled and runs it against every symbol in those documents. Most of these comparisons compute edit distance only to conclude there's no match.

  • After: The length check passes (length 6, threshold ±2, checks 4–8). But the bigram check then asks: do at least 1 of the 5 bigrams ("fo","oo","ob","ba","ar") exist in this document? For documents with symbols like "XyzWvq", none of these bigrams are stored, so the document is skipped without ever creating a PatternMatcher.

Bigram bitset design

Characters are mapped to a 38-element alphabet:

  • az → 0..25
  • 09 → 26..35
  • _ → 36
  • everything else (Unicode) → 37 ("other" bucket)

This gives exact membership for the 37 most common identifier characters and a single overflow bucket for rare Unicode characters. The bitset is a ulong[23] (184 bytes) — compact enough to store per-document with negligible memory overhead.

Filtering effectiveness by pattern length

Pattern length k (threshold) min_shared bigrams Filtering power
< 3 Fuzzy disabled entirely
3 1 0 No filtering (always passes)
4 1 1 Need ≥ 1 of 3 bigrams
5 2 0 No filtering (always passes)
6 2 1 Need ≥ 1 of 5 bigrams
7 2 2 Need ≥ 2 of 6 bigrams
8 2 3 Need ≥ 3 of 7 bigrams
10 2 5 Need ≥ 5 of 9 bigrams

Lengths 3 and 5 get no bigram filtering benefit (min_shared = 0). For length 3, we should strongly consider whether fuzzy matching itself is too permissive to even support — with an edit distance threshold of 1, a 3-character pattern like "abc" would fuzzy-match "xbc", "axc", "abx", "ab", "abcd", etc. This is a potential follow-up.

Reference

Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." Theoretical Computer Science, 92(1), 191–211. DOI: 10.1016/0304-3975(92)90143-4

The q-gram count lemma states: each edit operation can destroy at most q q-grams from a string. Therefore, if edit_distance(s, t) ≤ k, then at least |s| - 1 - q·k of s's q-gram positions must have a matching q-gram in t. For bigrams (q=2): min_shared = |pattern| - 1 - 2k.

Benchmarks:

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
'SameLen: LengthCheck pass (false positive)' 0.9901 ns 0.1123 ns 0.0292 ns - - - -
'VariedLen: LengthCheck pass' 0.1914 ns 0.0710 ns 0.0184 ns - - - -
'SameLen: BigramCheck reject (true negative!)' 7.1073 ns 0.5302 ns 0.1377 ns - - - -
'SameLen: BigramCheck pass (true positive)' 7.1716 ns 0.4179 ns 0.0647 ns - - - -
'SameLen: Combined reject (bigram saves)' 28.2908 ns 2.1554 ns 0.3336 ns - - - -
'SameLen: Combined pass' 28.7838 ns 1.3179 ns 0.2039 ns - - - -
'SameLen: BigramCheck reject len=8' 10.4243 ns 1.4520 ns 0.2247 ns - - - -
'VariedLen: BigramCheck reject len=10' 14.9580 ns 0.7782 ns 0.1204 ns - - - -

Cyrus Najmabadi and others added 11 commits February 19, 2026 10:31
…e container matching always non-fuzzy

Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since
container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first
then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove
CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now
accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the
prefilter tells the caller whether fuzzy matching is worth attempting.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor:
- DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false
- DocumentOutlineViewModel: add missing using on PatternMatcher
- PatternMatcherTests: explicit allowFuzzyMatching: false

Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts,
AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based
processing in the new NavigateTo prefilter code without allocating substrings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads.
Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and
ProbablyContains(char) to delegate to the span overloads. Add doc remark on
ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching
optimization from the string overload. This enables allocation-free bloom filter
operations in the NavigateTo prefilter.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own
NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass).
This allows the lightweight filter data to be loaded independently of the heavyweight
TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the
filter never need to load the full index.

Bump serialization format checksum from "51" to "52" to invalidate stale indices.

- New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs
- TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code

Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset

The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex.
Contains five bloom filters and a length bitset for fast document-level rejection:

- _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar")
- _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for
  "GooBarQuux"), enabling non-contiguous CamelCase matching
- _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar"
  for "GooBar"), used by a DP algorithm for all-lowercase patterns
- _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in
  "Readline")
- _containerFilter: hump chars from fully-qualified container names
- _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering

For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match
a stored hump prefix, avoiding the exponential capitalization enumeration.

Includes extensive documentation with examples throughout.

Co-authored-by: Cursor <cursoragent@cursor.com>
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and
cached document search paths:

- InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync,
  call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex.
  Pass the allowFuzzyMatching signal from the filter to the PatternMatcher.
- CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the
  full index in the parallel ForEachAsync loop.
- Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs.

Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter
behaviors. Organized into regions:

- Positive tests: verify CouldContainNavigateToMatch returns true for all supported
  PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous
  variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case
  and all-lowercase patterns
- Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths
  don't match
- CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed
- Multiple symbols: document-level filter matches any symbol
- Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal
- Individual filter checks via TestAccessor (hump, DP, trigram, length)
- Container matching tests

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows
matches through the full NavigateTo pipeline:

- CamelCase hump bigram (GB -> GooBar)
- All-lowercase DP hump prefix (goo -> GooBar)
- Trigram substring (line -> Readline)
- Fuzzy match enabled by length check (ToEror -> ToError)
- No match when all checks fail (XyzXyzXyzXyz)

Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance
across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified).

Co-authored-by: Cursor <cursoragent@cursor.com>
…th thresholds

The existing fuzzy pre-filter used only a symbol-name-length bitset with a
fixed ±2 delta, which was too coarse — in large files, many symbols share
similar lengths, causing false positives that force expensive full-scan
fuzzy matching.

This commit improves the pre-filter in three ways:

1. Fix LengthCheckPasses to use WordSimilarityChecker.GetThreshold (±1 for
   pattern lengths 3–4, ±2 for 5+) and reject patterns < MinFuzzyLength (3),
   matching the actual fuzzy matching behavior.

2. Add a 37×37 exact bigram bitset (176 bytes per document) storing all
   lowercased 2-character sliding windows of symbol names. At query time,
   use Ukkonen's q-gram count lemma to compute a minimum shared bigram
   count: min_shared = |pattern| - 1 - 2k. If fewer pattern bigrams match,
   fuzzy matching is skipped for that document.

3. Extract WordSimilarityChecker.GetThreshold(int) overload and MinFuzzyLength
   constant so both the pre-filter and the actual fuzzy matcher share the
   same threshold logic.

Reference: Ukkonen, E. (1992). "Approximate string-matching with q-grams
and maximal matches." Theoretical Computer Science, 92(1), 191–211.
https://doi.org/10.1016/0304-3975(92)90143-4

Co-authored-by: Cursor <cursoragent@cursor.com>
@dotnet-policy-service dotnet-policy-service Bot added Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode labels Feb 19, 2026
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun @dibarbet this can be reviewed starting from commit 635cbc2.

@CyrusNajmabadi

CyrusNajmabadi commented Feb 19, 2026

Copy link
Copy Markdown
Contributor Author

Future consideration: fuzzy pre-filter effectiveness for short patterns

The bigram pre-filter (q-gram count lemma) has blind spots for short patterns due to the relationship between pattern length, edit distance threshold k, and the minimum shared bigrams formula (|pattern| - 1 - 2k):

Current behavior (GetThreshold: k=1 for length ≤ 4, k=2 for length ≥ 5)

Pattern length k Length window min_shared bigrams Filtering power
3 1 ±1 (2–4) 0 None — always passes
4 1 ±1 (3–5) 1 Weak — need ≥1 of 3
5 2 ±2 (3–7) 0 None — always passes
6 2 ±2 (4–8) 1 Weak — need ≥1 of 5
7 2 ±2 (5–9) 2 Moderate
8+ 2 ±2 3+ Good

Proposed: k=1 for length ≤ 6, k=2 for length ≥ 7

Pattern length k Length window min_shared bigrams Filtering power
3 1 ±1 (2–4) 0 None — always passes
4 1 ±1 (3–5) 1 Weak — need ≥1 of 3
5 1 ±1 (4–6) 2 Moderate — need ≥2 of 4
6 1 ±1 (5–7) 3 Strong — need ≥3 of 5
7 2 ±2 (5–9) 2 Moderate
8+ 2 ±2 3+ Good

The key wins are at lengths 5 and 6, which go from zero/weak filtering to moderate/strong. The trade-off is less fuzzy tolerance (1 edit instead of 2) for 5–6 character patterns, but 2 edits on a 5-character string is 40% different — arguably too aggressive for useful fuzzy matching anyway.

Additionally, we should consider raising MinFuzzyLength from 3 to 4 (or even 5). A 3-letter fuzzy match with k=1 means "Foo" matches "Goo", "Boo", "For", "Fo", "Food", etc. — extremely permissive with zero bigram selectivity. Raising it would reduce noise with minimal loss of useful matches.

These are independent improvements and don't need to block this PR — just noting them for future work.

--

Note: i've implemented this change here: #82459

@davkean

davkean commented Feb 19, 2026

Copy link
Copy Markdown
Member

Be warned, this contributor is a known troll.

/// Maximum allowed edit distance for a fuzzy match given the source length. Shorter strings
/// get a tighter threshold (1) to avoid excessive spurious hits; longer strings get a looser
/// threshold (2) to tolerate more typos.
/// </summary>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this.

@CyrusNajmabadi CyrusNajmabadi force-pushed the features/navigate-to-fuzzy-bigram-bitset branch 2 times, most recently from 52b2bf4 to f9b3637 Compare February 19, 2026 15:30
@CyrusNajmabadi CyrusNajmabadi force-pushed the features/navigate-to-fuzzy-bigram-bitset branch from f9b3637 to 80b1594 Compare February 19, 2026 15:32
public static NavigateToSearchInfo Create(IReadOnlyList<DeclaredSymbolInfo> infos)
{
if (infos.Count == 0)
return default;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decided that this optimization just wasn't worth it. esp. as it meant i needed null checks all over the place.

@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review February 20, 2026 19:59
@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner February 20, 2026 19:59
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun @dibarbet @JoeRobich ptal :)

@CyrusNajmabadi CyrusNajmabadi force-pushed the features/navigate-to-fuzzy-bigram-bitset branch from 7625fb1 to b90368d Compare February 20, 2026 20:28
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun @dibarbet @JoeRobich ptal :)

Thanks!

@dibarbet dibarbet self-assigned this Feb 24, 2026
var idx = FuzzyBigramCharIndex(char.ToLowerInvariant(pattern[i])) * FuzzyBigramAlphabetSize
+ FuzzyBigramCharIndex(char.ToLowerInvariant(pattern[i + 1]));
if ((_fuzzyBigramBitset[idx >> 6] & (1UL << (idx & 63))) != 0)
count++;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - could return early here if count >= minShared, and outside the loop return false

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Will move into followup.

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun @dibarbet @JoeRobich @roslyn-ide please merge. Tnx!

@JoeRobich JoeRobich merged commit a80bb70 into dotnet:main Feb 25, 2026
25 checks passed
@dotnet-policy-service dotnet-policy-service Bot added this to the Next milestone Feb 25, 2026
JoeRobich pushed a commit that referenced this pull request Feb 27, 2026
Followup to #82457

## Summary (part 1)

Changes `WordSimilarityChecker.GetThreshold` from `length <= 4 ? 1 : 2`
to `length <= 5 ? 1 : 2`, so that 5-character patterns now require edit
distance ≤ 1 instead of ≤ 2 for a fuzzy match.

## Motivation

The original threshold tiers (`≤4 → 1`, `≥5 → 2`) were chosen by me
10-15 years ago based on gut feeling and a small amount of manual
testing. They work well in practice, but the jump to k=2 at length 5
creates a blind spot in the bigram pre-filter introduced in #82457: the
q-gram count lemma gives `min_shared = 5 - 1 - 2*2 = 0`, meaning the
bigram filter *cannot eliminate any candidates* for 5-character
patterns.

The industry standard for this exact problem is Elasticsearch/Lucene's
[`fuzziness:
AUTO`](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness)
setting, which uses:

| Pattern length | Max edit distance |
|---|---|
| 1–2 | 0 (no fuzzy matching) |
| 3–5 | 1 |
| 6+ | 2 |

This is grounded in [Damerau's
finding](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
that approximately 80% of human misspellings are edit distance 1, and
that allowing 2 edits on short strings produces too many false positives
(e.g., two edits can transform a 5-character word into something
completely unrelated).

## What changes

Only the length-5 case is affected:

| Pattern length | Before (k) | After (k) |
|---|---|---|
| 1–2 | N/A (`MinFuzzyLength = 3`) | N/A (unchanged) |
| 3–4 | 1 | 1 (unchanged) |
| **5** | **2** | **1** |
| 6+ | 2 | 2 (unchanged) |

`MinFuzzyLength = 3` is deliberately kept — below that length, fuzzy
matching produces too many spurious hits regardless of threshold.

## Effect on pre-filter

With the new threshold, the bigram pre-filter becomes effective at
length 5. The `min_shared` column shows the minimum number of pattern
bigrams that must appear in the candidate for the q-gram count lemma to
permit a match (`|pattern| - 1 - 2k`). Higher values mean stronger
filtering:

| Pattern length | Before: k, min_shared | After: k, min_shared |
Filtering strength |
|---|---|---|---|
| 3 | k=1, 0 | k=1, 0 | None (always passes) |
| 4 | k=1, 1 (of 3 bigrams) | k=1, 1 (of 3 bigrams) | Weak |
| **5** | **k=2, 0** | **k=1, 2 (of 4 bigrams)** | **None -> Moderate**
|
| 6 | k=2, 1 (of 5 bigrams) | k=2, 1 (of 5 bigrams) | Weak |
| 7 | k=2, 2 (of 6 bigrams) | k=2, 2 (of 6 bigrams) | Moderate |
| 8 | k=2, 3 (of 7 bigrams) | k=2, 3 (of 7 bigrams) | Strong |
| 10+ | k=2, increasingly strong | k=2, increasingly strong | Very
strong |

## Risk

Exceptionally low. This is a one-line behavioral change that tightens a
fuzzy match heuristic for a single pattern length. Fuzzy matching is
already a best-effort "did you mean?" feature — slightly fewer fuzzy
matches at length 5 is the expected and desired outcome.

## Summary (part 2)

The `allowFuzzyMatching` boolean was threaded through 7 locations in the
pattern matching pipeline, yet only 2 of 6 callers ever enabled it.
`ContainerPatternMatcher` hardcoded it to `false` everywhere. The
interleaving of fuzzy and non-fuzzy logic made the code harder to follow
than necessary.

## Design

Each matcher now has a single responsibility:

- **`SimplePatternMatcher`**: exact, prefix, camelCase, substring
(non-fuzzy)
- **`FuzzyPatternMatcher`** (new): edit-distance only via
`WordSimilarityChecker`
- **`CompoundPatternMatcher`** (new): composes sub-matchers, tries each
in order, short-circuits on first match
- **`ContainerPatternMatcher`**: unchanged

A new `[Flags] enum PatternMatcherKind { None, Standard, Fuzzy }`
controls which strategies to use. The factory returns the appropriate
matcher (or compound) based on the flags. Callers just pass the enum:

```csharp
using var matcher = PatternMatcher.CreatePatternMatcher(
    pattern, includeMatchedSpans: true, PatternMatcherKind.Standard | PatternMatcherKind.Fuzzy);
```

`NavigateToSearchIndex.CouldContainNavigateToMatch` now returns
`PatternMatcherKind` (instead of `bool` + two out-params), which flows
directly into the factory.

The base `PatternMatcher.AddMatches` is now non-abstract and handles
`SkipMatch` centrally; subclasses implement `AddMatchesWorker`.

---------

Co-authored-by: Cyrus Najmabadi <Cyrus Najmabadi>
Co-authored-by: Cursor <cursoragent@cursor.com>
@jjonescz jjonescz modified the milestones: Next, 18.6 Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants