Tighten fuzzy edit-distance threshold for length-5 patterns by CyrusNajmabadi · Pull Request #82459 · dotnet/roslyn

CyrusNajmabadi · 2026-02-19T13:13:23Z

Followup to #82457

Summary (part 1)

Changes WordSimilarityChecker.GetThreshold from length <= 4 ? 1 : 2 to length <= 5 ? 1 : 2, so that 5-character patterns now require edit distance ≤ 1 instead of ≤ 2 for a fuzzy match.

Motivation

The original threshold tiers (≤4 → 1, ≥5 → 2) were chosen by me 10-15 years ago based on gut feeling and a small amount of manual testing. They work well in practice, but the jump to k=2 at length 5 creates a blind spot in the bigram pre-filter introduced in #82457: the q-gram count lemma gives min_shared = 5 - 1 - 2*2 = 0, meaning the bigram filter cannot eliminate any candidates for 5-character patterns.

The industry standard for this exact problem is Elasticsearch/Lucene's fuzziness: AUTO setting, which uses:

Pattern length	Max edit distance
1–2	0 (no fuzzy matching)
3–5	1
6+	2

This is grounded in Damerau's finding that approximately 80% of human misspellings are edit distance 1, and that allowing 2 edits on short strings produces too many false positives (e.g., two edits can transform a 5-character word into something completely unrelated).

What changes

Only the length-5 case is affected:

Pattern length	Before (k)	After (k)
1–2	N/A (`MinFuzzyLength = 3`)	N/A (unchanged)
3–4	1	1 (unchanged)
5	2	1
6+	2	2 (unchanged)

MinFuzzyLength = 3 is deliberately kept — below that length, fuzzy matching produces too many spurious hits regardless of threshold.

Effect on pre-filter

With the new threshold, the bigram pre-filter becomes effective at length 5. The min_shared column shows the minimum number of pattern bigrams that must appear in the candidate for the q-gram count lemma to permit a match (|pattern| - 1 - 2k). Higher values mean stronger filtering:

Pattern length	Before: k, min_shared	After: k, min_shared	Filtering strength
3	k=1, 0	k=1, 0	None (always passes)
4	k=1, 1 (of 3 bigrams)	k=1, 1 (of 3 bigrams)	Weak
5	k=2, 0	k=1, 2 (of 4 bigrams)	None -> Moderate
6	k=2, 1 (of 5 bigrams)	k=2, 1 (of 5 bigrams)	Weak
7	k=2, 2 (of 6 bigrams)	k=2, 2 (of 6 bigrams)	Moderate
8	k=2, 3 (of 7 bigrams)	k=2, 3 (of 7 bigrams)	Strong
10+	k=2, increasingly strong	k=2, increasingly strong	Very strong

Risk

Exceptionally low. This is a one-line behavioral change that tightens a fuzzy match heuristic for a single pattern length. Fuzzy matching is already a best-effort "did you mean?" feature — slightly fewer fuzzy matches at length 5 is the expected and desired outcome.

Summary (part 2)

The allowFuzzyMatching boolean was threaded through 7 locations in the pattern matching pipeline, yet only 2 of 6 callers ever enabled it. ContainerPatternMatcher hardcoded it to false everywhere. The interleaving of fuzzy and non-fuzzy logic made the code harder to follow than necessary.

Design

Each matcher now has a single responsibility:

SimplePatternMatcher: exact, prefix, camelCase, substring (non-fuzzy)
FuzzyPatternMatcher (new): edit-distance only via WordSimilarityChecker
CompoundPatternMatcher (new): composes sub-matchers, tries each in order, short-circuits on first match
ContainerPatternMatcher: unchanged

A new [Flags] enum PatternMatcherKind { None, Standard, Fuzzy } controls which strategies to use. The factory returns the appropriate matcher (or compound) based on the flags. Callers just pass the enum:

using var matcher = PatternMatcher.CreatePatternMatcher(
    pattern, includeMatchedSpans: true, PatternMatcherKind.Standard | PatternMatcherKind.Fuzzy);

NavigateToSearchIndex.CouldContainNavigateToMatch now returns PatternMatcherKind (instead of bool + two out-params), which flows directly into the factory.

The base PatternMatcher.AddMatches is now non-abstract and handles SkipMatch centrally; subclasses implement AddMatchesWorker.

…e container matching always non-fuzzy Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the prefilter tells the caller whether fuzzy matching is worth attempting. Co-authored-by: Cursor <cursoragent@cursor.com>

Mechanical update of all call sites following the PatternMatcher refactor: - DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false - DocumentOutlineViewModel: add missing using on PatternMatcher - PatternMatcherTests: explicit allowFuzzyMatching: false Co-authored-by: Cursor <cursoragent@cursor.com>

Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts, AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based processing in the new NavigateTo prefilter code without allocating substrings. Co-authored-by: Cursor <cursoragent@cursor.com>

Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads. Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and ProbablyContains(char) to delegate to the span overloads. Add doc remark on ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching optimization from the string overload. This enables allocation-free bloom filter operations in the NavigateTo prefilter. Co-authored-by: Cursor <cursoragent@cursor.com>

Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass). This allows the lightweight filter data to be loaded independently of the heavyweight TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the filter never need to load the full index. Bump serialization format checksum from "51" to "52" to invalidate stale indices. - New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs - TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code Co-authored-by: Cursor <cursoragent@cursor.com>

…grams, and length bitset The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex. Contains five bloom filters and a length bitset for fast document-level rejection: - _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar") - _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for "GooBarQuux"), enabling non-contiguous CamelCase matching - _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar" for "GooBar"), used by a DP algorithm for all-lowercase patterns - _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in "Readline") - _containerFilter: hump chars from fully-qualified container names - _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match a stored hump prefix, avoiding the exponential capitalization enumeration. Includes extensive documentation with examples throughout. Co-authored-by: Cursor <cursoragent@cursor.com>

Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and cached document search paths: - InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync, call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex. Pass the allowFuzzyMatching signal from the filter to the PatternMatcher. - CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the full index in the parallel ForEachAsync loop. - Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs. Co-authored-by: Cursor <cursoragent@cursor.com>

~1030 lines of declarative theory-based tests covering all match kinds and pre-filter behaviors. Organized into regions: - Positive tests: verify CouldContainNavigateToMatch returns true for all supported PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case and all-lowercase patterns - Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths don't match - CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed - Multiple symbols: document-level filter matches any symbol - Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal - Individual filter checks via TestAccessor (hump, DP, trigram, length) - Container matching tests Co-authored-by: Cursor <cursoragent@cursor.com>

End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows matches through the full NavigateTo pipeline: - CamelCase hump bigram (GB -> GooBar) - All-lowercase DP hump prefix (goo -> GooBar) - Trigram substring (line -> Readline) - Fuzzy match enabled by length check (ToEror -> ToError) - No match when all checks fail (XyzXyzXyzXyz) Co-authored-by: Cursor <cursoragent@cursor.com>

BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>

…th thresholds The existing fuzzy pre-filter used only a symbol-name-length bitset with a fixed ±2 delta, which was too coarse — in large files, many symbols share similar lengths, causing false positives that force expensive full-scan fuzzy matching. This commit improves the pre-filter in three ways: 1. Fix LengthCheckPasses to use WordSimilarityChecker.GetThreshold (±1 for pattern lengths 3–4, ±2 for 5+) and reject patterns < MinFuzzyLength (3), matching the actual fuzzy matching behavior. 2. Add a 37×37 exact bigram bitset (176 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use Ukkonen's q-gram count lemma to compute a minimum shared bigram count: min_shared = |pattern| - 1 - 2k. If fewer pattern bigrams match, fuzzy matching is skipped for that document. 3. Extract WordSimilarityChecker.GetThreshold(int) overload and MinFuzzyLength constant so both the pre-filter and the actual fuzzy matcher share the same threshold logic. Reference: Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." Theoretical Computer Science, 92(1), 191–211. https://doi.org/10.1016/0304-3975(92)90143-4 Co-authored-by: Cursor <cursoragent@cursor.com>

Extract edit-distance (fuzzy) matching from the shared PatternMatcher pipeline into a standalone FuzzyPatternMatcher class. Previously the allowFuzzyMatching bool threaded through 7 locations (factory, constructor, PatternSegment, TextChunk, AddMatches, MatchPatternSegment, MatchPatternChunk). Now each matcher has a single responsibility: - SimplePatternMatcher: exact, prefix, camelCase, substring (non-fuzzy) - FuzzyPatternMatcher: edit-distance only (WordSimilarityChecker) - ContainerPatternMatcher: dot-separated container matching (non-fuzzy) Callers compose them: try non-fuzzy first, fuzzy only as fallback — no redundant work. NavigateToSearchInfo now returns two independent signals (couldNonFuzzyMatch, couldFuzzyMatch) instead of a single bool. Also refactors base PatternMatcher: AddMatches is now non-abstract and handles SkipMatch centrally, delegating to abstract AddMatchesWorker. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace the two-boolean prefilter signal (couldNonFuzzyMatch, couldFuzzyMatch) and manual matcher composition with a [Flags] enum PatternMatcherKind { None, Standard, Fuzzy } that flows from the prefilter through the factory. CreatePatternMatcher now accepts a PatternMatcherKind parameter: - Standard only -> SimplePatternMatcher - Fuzzy only -> FuzzyPatternMatcher - Standard | Fuzzy -> CompoundPatternMatcher (tries each in order, short-circuits on first match) CompoundPatternMatcher takes ReadOnlySpan<PatternMatcher> and manages an internal ArrayBuilder, freed on dispose. Callers no longer need to manually compose matchers — they pass the enum and get the right thing back. NavigateToSearchIndex.CouldContainNavigateToMatch now returns PatternMatcherKind instead of bool + two out params. Co-authored-by: Cursor <cursoragent@cursor.com>

Change GetThreshold from `length <= 4 ? 1 : 2` to `length <= 5 ? 1 : 2`, aligning with Elasticsearch/Lucene's fuzziness:AUTO tiers (0 edits for 1-2 chars, 1 edit for 3-5 chars, 2 edits for 6+). This eliminates the bigram pre-filter blind spot at length 5 where min_shared was 0.

…o-fuzzy-bigram-bitset

…/separate-fuzzy-pattern-matcher

…e-fuzzy-pattern-matcher

…uzzy-pattern-matcher

…zzy-threshold

CyrusNajmabadi · 2026-02-26T07:51:13Z

@ToddGrun @dibarbet @JoeRobich this is ready for review.

CyrusNajmabadi · 2026-02-27T08:22:36Z

Thanks @JoeRobich this is also the last pr I expect for this area.

Cyrus Najmabadi and others added 14 commits February 19, 2026 10:31

Add NavigateTo pre-filter benchmarks

3c67904

BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>

Remove 'default' state. Refactor for clarity

cbe4330

github-actions Bot added the Area-IDE label Feb 19, 2026

dotnet-policy-service Bot added Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode labels Feb 19, 2026

CyrusNajmabadi force-pushed the features/tighten-fuzzy-threshold branch 2 times, most recently from 5293f5b to 978b7f2 Compare February 19, 2026 13:20

CyrusNajmabadi force-pushed the features/tighten-fuzzy-threshold branch from 978b7f2 to ecc518a Compare February 19, 2026 13:22

CyrusNajmabadi mentioned this pull request Feb 19, 2026

Improve nav-to fuzzy pre-filtering with q-gram bigram bitset and corrected length thresholds #82457

Merged

Cyrus Najmabadi and others added 8 commits February 19, 2026 16:31

Remove 'default' state. Refactor for clarity

80b1594

Deal with VB verbatim strings

cb3134a

Merge remote-tracking branch 'upstream/main' into features/navigate-t…

74e1f8c

…o-fuzzy-bigram-bitset

Move to immutable arrays

f189662

Simplify nrt

a944a52

reorder

b90368d

Merge branch 'features/navigate-to-fuzzy-bigram-bitset' into features…

c8c9653

…/separate-fuzzy-pattern-matcher

Merge branch 'features/tighten-fuzzy-threshold' into features/separat…

19d90a9

…e-fuzzy-pattern-matcher

CyrusNajmabadi added 5 commits February 25, 2026 09:36

Merge remote-tracking branch 'upstream/main' into features/separate-f…

e0aca37

…uzzy-pattern-matcher

Renames and algorithm tweak

046901b

Merge remote-tracking branch 'upstream/main' into features/tighten-fu…

aa9340f

…zzy-threshold

rename

3508442

Revert

3cc7d8c

CyrusNajmabadi marked this pull request as ready for review February 26, 2026 07:50

CyrusNajmabadi requested a review from a team as a code owner February 26, 2026 07:51

JoeRobich approved these changes Feb 26, 2026

View reviewed changes

JoeRobich merged commit 74d07c0 into dotnet:main Feb 27, 2026
25 checks passed

dotnet-policy-service Bot added this to the Next milestone Feb 27, 2026

This was referenced Feb 27, 2026

[release/10.0.3xx] Source code updates from dotnet/roslyn dotnet/dotnet#5091

Merged

[main] Source code updates from dotnet/roslyn dotnet/dotnet#5006

Merged

dotnet-bot mentioned this pull request Mar 3, 2026

[Automated] PRs inserted in VS build main-11602.225 #82592

Closed

dotnet-maestro Bot mentioned this pull request Mar 5, 2026

[release/11.0.1xx-preview2] Source code updates from dotnet/roslyn dotnet/dotnet#5237

Merged

dotnet-bot mentioned this pull request Mar 14, 2026

[Automated] PRs inserted in VS build feature.debugger.wix5-11613.132 #82758

Closed

jjonescz modified the milestones: Next, 18.6 Mar 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tighten fuzzy edit-distance threshold for length-5 patterns#82459

Tighten fuzzy edit-distance threshold for length-5 patterns#82459
JoeRobich merged 28 commits into
dotnet:mainfrom
CyrusNajmabadi:features/tighten-fuzzy-threshold

CyrusNajmabadi commented Feb 19, 2026 •

edited

Loading

Uh oh!

CyrusNajmabadi commented Feb 26, 2026

Uh oh!

Uh oh!

CyrusNajmabadi commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

CyrusNajmabadi commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary (part 1)

Motivation

What changes

Effect on pre-filter

Risk

Summary (part 2)

Design

Uh oh!

CyrusNajmabadi commented Feb 26, 2026

Uh oh!

Uh oh!

CyrusNajmabadi commented Feb 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CyrusNajmabadi commented Feb 19, 2026 •

edited

Loading