Skip to content

Improve navigate-to filtering to avoid scanning documents unnecessarily.#82431

Merged
JoeRobich merged 10 commits into
dotnet:mainfrom
CyrusNajmabadi:features/navigate-to-bloom-filter
Feb 20, 2026
Merged

Improve navigate-to filtering to avoid scanning documents unnecessarily.#82431
JoeRobich merged 10 commits into
dotnet:mainfrom
CyrusNajmabadi:features/navigate-to-bloom-filter

Conversation

@CyrusNajmabadi

@CyrusNajmabadi CyrusNajmabadi commented Feb 17, 2026

Copy link
Copy Markdown
Contributor

NavigateTo: Add document pre-filtering to skip non-matching files early

Summary

Adds a new lightweight NavigateToSearchIndex that lets NavigateTo skip documents that can't match the user's query — without ever loading the full declared-symbol data. For documents that pass, the filter also controls whether fuzzy matching is enabled, avoiding expensive edit-distance work when no symbol is close in length.

Speedup is anywhere from 15,000x better in the worst case, to 250,000x better in the best case. This is for when we can filter the doc out. When we can't filter the doc out, we spend at worst 300ns on an initial check, which is then dwarfed by millions of ns of actual searching cost. So there's no appreciable cost for documents where there is a hit.

The worst case (only 15,000x speedup) is for users with an all-lowercase NavTo pattern (sorry @dibarbet). Users with any capital letters in their pattern will get the 250,000x improvement.

Why

Today NavigateTo loads the full TopLevelSyntaxTreeIndex for every document and runs PatternMatcher against every symbol. Most documents don't match, so this is wasted work. By putting a small pre-filter in a separate, independently-loaded index, we can reject the vast majority of documents for the cost of loading a few bloom filters and frozen sets.

Two-index architecture

NavigateToSearchIndex extends AbstractSyntaxIndex and is persisted/loaded independently from TopLevelSyntaxTreeIndex. The search pipeline becomes:

  1. Load NavigateToSearchIndex (tiny: bloom filters, frozen sets, a 64-bit bitset).
  2. If the filter rejects, skip — never load TopLevelSyntaxTreeIndex.
  3. On pass, load TopLevelSyntaxTreeIndex (large: all DeclaredSymbolInfos) and do real matching.

Both indices are built from the same syntax tree walk but stored separately. Both search paths (in-process and cached/pre-solution-load) follow this pattern.

Filter data

Five pieces of data per document:

Filter Type Purpose
Hump set FrozenSet<string> Hump initials + ordered bigrams. Rejects mixed-case CamelCase queries like GB, WrLi.
Hump prefix filter BloomFilter Lowercased hump prefixes. Handles all-lowercase queries like wrli via a DP pass.
Trigram filter BloomFilter 3-char sliding windows. Rejects substring queries like line.
Container char set FrozenSet<char> Container hump initials. Rejects container-qualified queries.
Length bitset ulong Which symbol name lengths exist. Gates fuzzy matching (±2).

The hump set and container set use exact FrozenSets — the domain is small (for English: ~700 hump strings, a handful of container chars; Unicode adds more but it stays bounded by actual document content). Bloom filters are used for hump prefixes and trigrams where the domain is unbounded.

Key design choices

Bigrams over single chars: Storing ordered pairs of hump initials (GB, GQ, BQ for GooBarQuux) is far more selective than single characters. A lone G would match too many documents.

DP for all-lowercase: Instead of enumerating 2^n capitalizations, we store lowercased hump prefixes and run an O(n × max_hump_length) DP to check if the pattern can be split into valid hump prefixes. Inspired by the Metals bloom filter approach.

Split fuzzy signal: ProbablyContainsMatch returns both "should we search?" and "should we allow fuzzy?" separately. When no symbol has a name length within ±2, fuzzy matching is disabled for that document.

Testing

300 unit tests in NavigateToSearchIndexTests cover each filter individually (including guaranteed-rejection cases for exact sets, cross-symbol behavior, the DP break optimization) and the top-level ProbablyContainsMatch entry point (cross-validated against PatternMatcher). 5 integration tests in NavigateToTests.cs exercise the full pipeline.

Benchmark

BenchmarkDotNet micro-benchmarks for many scenarios.

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
'FullScan (10k symbols, no pre-filter)' 4,679,488.12 ns 253,728.513 ns 39,264.768 ns 1.000 - - - 320 B
'Realistic: CamelCase miss' 19.44 ns 0.736 ns 0.191 ns 0.000 - - - -
'Realistic: CamelCase hit' 22.17 ns 1.971 ns 0.305 ns 0.000 - - - -
'Realistic: lowercase hit (hump-prefix DP)' 299.48 ns 12.928 ns 2.001 ns 0.000 - - - -
'Realistic: lowercase miss (hump-prefix DP)' 147.36 ns 3.946 ns 1.025 ns 0.000 - - - -
'Realistic: lowercase trigram hit' 185.78 ns 4.812 ns 0.745 ns 0.000 - - - -
'Realistic: CamelCase trigram hit' 20.72 ns 2.973 ns 0.772 ns 0.000 - - - -
'Realistic: lowercase trigram miss' 154.92 ns 9.421 ns 2.447 ns 0.000 - - - -
'Realistic: CamelCase trigram miss' 17.21 ns 0.157 ns 0.024 ns 0.000 - - - -
'Realistic: container hit' 40.27 ns 2.326 ns 0.604 ns 0.000 - - - -
'Realistic: container miss' 33.53 ns 1.688 ns 0.438 ns 0.000 - - - -
'Realistic: lowercase verbatim hit' 299.95 ns 9.907 ns 2.573 ns 0.000 - - - -
'Realistic: CamelCase verbatim hit' 23.09 ns 1.002 ns 0.155 ns 0.000 - - - -
'Realistic: lowercase multi-word hit' 238.93 ns 8.594 ns 1.330 ns 0.000 - - - -
'Realistic: CamelCase multi-word hit' 34.16 ns 2.226 ns 0.578 ns 0.000 - - - -
'StressAll: hit' 21.53 ns 1.637 ns 0.425 ns 0.000 - - - -
'StressAll: miss' 21.24 ns 0.560 ns 0.146 ns 0.000 - - - -
'StressHumpSet: CamelCase hit' 21.10 ns 1.054 ns 0.274 ns 0.000 - - - -
'StressHumpSet: CamelCase miss' 21.10 ns 1.152 ns 0.299 ns 0.000 - - - -
'StressHumpPrefix: lowercase hit' 186.21 ns 9.798 ns 1.516 ns 0.000 - - - -
'StressHumpPrefix: lowercase miss' 219.83 ns 3.192 ns 0.829 ns 0.000 - - - -
'StressTrigram: lowercase hit' 177.90 ns 2.970 ns 0.771 ns 0.000 - - - -
'StressTrigram: lowercase miss' 156.88 ns 3.827 ns 0.994 ns 0.000 - - - -
'StressContainer: container hit' 29.41 ns 1.238 ns 0.322 ns 0.000 - - - -
'StressContainer: container miss' 32.27 ns 1.977 ns 0.513 ns 0.000 - - - -
dotnet run --project src/Tools/IdeCoreBenchmarks/IdeCoreBenchmarks.csproj -f net10.0 -c Release -- --filter '*PreFilter*'

@dotnet-policy-service dotnet-policy-service Bot added Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode labels Feb 17, 2026
@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review February 17, 2026 16:03
@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner February 17, 2026 16:03
public void TestCachingOfPriorResult()
{
using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true);
using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true, PatternMatchKinds.All);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched away from optional paraemters. and made teh bool into an enum as we want to control both if we do fuzzy matching or non-fuzzy matching.

@CyrusNajmabadi CyrusNajmabadi marked this pull request as draft February 17, 2026 16:50
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

Not ready for review quite yet.

public void TestCachingOfPriorResult()
{
using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true);
using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a core aspect of this change was removing the flag from the creation of the matcher(where it served as a filter), and just making it an option that the callee pases in when asking for matches.

return;

if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var nameMatchKinds))
return;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thrust of the change. we do a VERY quick initial filter to not even compute/load the exensive syntactic-tree-index for a doc.

// First, load the lightweight filter index to check if this document could possibly match.
// This avoids loading the much larger TopLevelSyntaxTreeIndex for non-matching documents.
var filterIndex = await NavigateToSearchIndex.GetRequiredIndexAsync(document, cancellationToken).ConfigureAwait(false);
if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var allowFuzzyMatching))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: when checking the core index, we are told if we need to bother doing expensive 'fuzzy searching' when searching the full index. this helps us avoid all the work on edit distances when not necessary.

{
var patternMatcher = GetPatternMatcher(culture, includeMatchSpans);
var match = patternMatcher.GetFirstMatch(text);
var match = patternMatcher.GetFirstMatch(text, allowFuzzyMatching: false);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches value previoulsy passed to constructor below.

/// <item>A length bitset indicating which symbol name lengths exist in the document.</item>
/// </list>
/// </summary>
private readonly struct NavigateToSearchInfo

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New index designed to be extremely fast, even for the types of WrLi searches that are performed in navigate to.

/// how to break on dots and match subportions of that container. Note: all matching is done in a non-fuzzy way. Fuzzy
/// matching is only performed by the <see cref="SimplePatternMatcher"/>.
/// </summary>
public sealed partial class ContainerPatternMatcher : PatternMatcher

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ended up making tehse public. trying to have a common abstract entrypoint was leading to too much confusion, esp. since the Container matcher and the Simpler matcher have entirely different logic around fuzzy matching.

string pattern,
CultureInfo? culture,
bool includeMatchedSpans,
bool allowFuzzyMatching)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed defaults to make callsites clearer. note: previous default here was to not allow fuzzy matching (for either simple or container pattern matching). fuzzy is always opted into.

// Ok, we had a dotted pattern. Have to see if the symbol's container matches the
// pattern as well.
using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart);
using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart, includeMatchedSpans: false);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'false' matches defaults from before (commented below).

// pattern they passed in. Otherwise, make a pattern matcher just for the part after
// the dot.
using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false);
using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false, allowFuzzyMatching: false);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'false' matches defaults from before (commented below).

/// Murmur hash is public domain. Actual code is included below as reference.
/// </summary>
private static int ComputeHash(string key, int seed, bool isCaseSensitive)
=> ComputeHash(key.AsSpan(), seed, isCaseSensitive);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spanified a lot of this code. the new index makes use of a lot of checks preferring to use spans over allocations.

Comment thread src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs
});
}

#region Pre-filter integration tests

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: i only added a handful of navigate-to tests, since we don't really need end-to-end validatino. THe majority of testing is actually at the PatternMatcher/NavToIndex level directly so we can do fine grained testing of the new indices in a wealth of scenarios.


using var matches = TemporaryArray<PatternMatch>.Empty;
PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef());
PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: true).AddMatches(candidate, ref matches.AsRef());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i made these options non-optional to make it much clearer what behavior is expected.


using var matches = TemporaryArray<PatternMatch>.Empty;
PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef());
PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: false).AddMatches(candidate, ref matches.AsRef());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i made these options non-optional to make it much clearer what behavior is expected.

@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review February 18, 2026 13:05
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@jasonmalinowski @dibarbet ptal.

/// Breaks an identifier string into constituent parts.
/// </summary>
public static void AddWordParts(string identifier, ref TemporaryArray<TextSpan> parts)
public static void AddWordParts(ReadOnlySpan<char> identifier, ref TemporaryArray<TextSpan> parts)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this needed to work on strings. And by moving to ROS, we make it so a lot of other code ni pattern matching doesn't need to allocate substrings (it can just oprate on subspans of a string).

Comment thread src/Workspaces/Core/Portable/PatternMatching/SimplePatternMatcher.cs Outdated

// Store individual hump-initial characters (uppercased).
foreach (var part in charParts)
AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]);

Could this be moved inside the next loop? Or, even just merge these three loops into something like:

for (var i = 0; i < charParts.Count; i++)
{
    var part = charParts[i];
    var ci = char.ToUpperInvariant(name[part.Start]);

    // Store individual hump-initial characters (uppercased).
    AddToSet(humpStrings, [ci]);

    // Store all ordered pairs of hump-initial characters (uppercased).
    // For "GooBarQuux" (humps G, B, Q): stores "GB", "GQ", "BQ".
    // Storing ALL pairs (not just adjacent) enables non-contiguous CamelCase matching
    // like "GQ" matching "GooBarQuux" by skipping "Bar".
    for (var j = i + 1; j < charParts.Count; j++)
    {
        var cj = char.ToUpperInvariant(name[charParts[j].Start]);
        AddToSet(humpStrings, [ci, cj]);
    }

    // Store lowercased prefixes of each character-part (hump).
    // For "GooBar" → parts "Goo", "Bar" → stores "g", "go", "goo", "b", "ba", "bar".
    // Used by the all-lowercase DP to check whether a pattern can be split into segments
    // that each match a hump prefix.
    for (var prefixLen = 1; prefixLen <= part.Length; prefixLen++)
        AddToSet(humpPrefixStrings, loweredName.Slice(part.Start, prefixLen));
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could merge. But i find it much clearer as three separate passes. That way you don't hav to think about all of together. i would far prefer keeping as is if that's ok with you. it's not going to be expensive. This is enumerating a TempArray (which will alsmost always be on the stack). So it's extremely fast :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: in a followup i actually break thse into separate helpers, to keep core logic clean.

Comment thread src/Workspaces/CoreTest/FindSymbols/NavigateToSearchIndexTests.cs
[Fact]
public void ContainerMatcher_ExactMatchReportsExact()
{
using var matcher = PatternMatcher.CreateDotSeparatedContainerMatcher(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test also only testing the pattern matcher? Should it be testing the container char set index instead (or is it intended to be testing the matcher)? Same with the other container tests below

@CyrusNajmabadi CyrusNajmabadi Feb 19, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, removed.

@CyrusNajmabadi CyrusNajmabadi force-pushed the features/navigate-to-bloom-filter branch from 1c78dd0 to 5c6094c Compare February 19, 2026 09:25
Cyrus Najmabadi and others added 10 commits February 19, 2026 10:31
…e container matching always non-fuzzy

Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since
container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first
then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove
CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now
accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the
prefilter tells the caller whether fuzzy matching is worth attempting.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor:
- DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false
- DocumentOutlineViewModel: add missing using on PatternMatcher
- PatternMatcherTests: explicit allowFuzzyMatching: false

Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts,
AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based
processing in the new NavigateTo prefilter code without allocating substrings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads.
Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and
ProbablyContains(char) to delegate to the span overloads. Add doc remark on
ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching
optimization from the string overload. This enables allocation-free bloom filter
operations in the NavigateTo prefilter.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own
NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass).
This allows the lightweight filter data to be loaded independently of the heavyweight
TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the
filter never need to load the full index.

Bump serialization format checksum from "51" to "52" to invalidate stale indices.

- New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs
- TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code

Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset

The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex.
Contains five bloom filters and a length bitset for fast document-level rejection:

- _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar")
- _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for
  "GooBarQuux"), enabling non-contiguous CamelCase matching
- _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar"
  for "GooBar"), used by a DP algorithm for all-lowercase patterns
- _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in
  "Readline")
- _containerFilter: hump chars from fully-qualified container names
- _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering

For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match
a stored hump prefix, avoiding the exponential capitalization enumeration.

Includes extensive documentation with examples throughout.

Co-authored-by: Cursor <cursoragent@cursor.com>
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and
cached document search paths:

- InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync,
  call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex.
  Pass the allowFuzzyMatching signal from the filter to the PatternMatcher.
- CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the
  full index in the parallel ForEachAsync loop.
- Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs.

Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter
behaviors. Organized into regions:

- Positive tests: verify CouldContainNavigateToMatch returns true for all supported
  PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous
  variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case
  and all-lowercase patterns
- Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths
  don't match
- CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed
- Multiple symbols: document-level filter matches any symbol
- Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal
- Individual filter checks via TestAccessor (hump, DP, trigram, length)
- Container matching tests

Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows
matches through the full NavigateTo pipeline:

- CamelCase hump bigram (GB -> GooBar)
- All-lowercase DP hump prefix (goo -> GooBar)
- Trigram substring (line -> Readline)
- Fuzzy match enabled by length check (ToEror -> ToError)
- No match when all checks fail (XyzXyzXyzXyz)

Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance
across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified).

Co-authored-by: Cursor <cursoragent@cursor.com>
@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@dibarbet @ToddGrun Have rebased teh PR to make it easier to review. Feel free to merge when satisfied.

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun @dibarbet @JoeRobich this broke in:

{ "HelixJobId": "73ea6201-4363-4d62-9772-554fb48cb404", "HelixWorkItemName": "workitem_21" }


Error message
Assert.Empty() Failure: Collection was not empty
Collection: [System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
at System.Collections.Generic.Dictionary`2.Enumerator.MoveNext()
at System.Linq.Enumerable.Any[TSource](IEnumerable`1 source, Func`2 predicate)
at Microsoft.VisualStudio.LanguageServices.Implementation.ProjectSystem.VisualStudioWorkspaceImpl.<RefreshProjectExistsUIContextForLanguageAsync>d__116.MoveNext() in //src/VisualStudio/Core/Def/ProjectSystem/VisualStudioWorkspaceImpl.cs:line 1596

Doesn't seem like me. Can you massage CI plz. Thanks!

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@davkean you might be interested in this as well.

@ToddGrun ToddGrun left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun can you kick CI? it looks like it's failing in tests designed to find flakeyness elsewhere.

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@ToddGrun plzmrge. Tnx :-)

@CyrusNajmabadi

Copy link
Copy Markdown
Contributor Author

@dibarbet @JoeRobich @ToddGrun @phil-allen-msft or @roslyn-ide plzmrgthnx :)

@JoeRobich JoeRobich merged commit 31befaa into dotnet:main Feb 20, 2026
25 checks passed
@dotnet-policy-service dotnet-policy-service Bot added this to the Next milestone Feb 20, 2026
@akhera99 akhera99 modified the milestones: Next, 18.5 Feb 24, 2026
JoeRobich pushed a commit that referenced this pull request Feb 25, 2026
…ected length thresholds (#82457)

Followup to #82431

## Summary

The NavigateTo pre-filter gates whether fuzzy (edit-distance) matching
is attempted for a document. Previously it used only a
symbol-name-length bitset with a fixed ±2 delta — too coarse in
practice. In a large file with many 6-character symbols, searching for
"FooBar" (also 6 characters) would pass the length check for every
document, forcing an expensive full-scan fuzzy match against every
symbol in those documents even when the character content is completely
different.

This PR improves the fuzzy pre-filter in three ways:

1. **Correct the length check thresholds** to match what
`WordSimilarityChecker` actually enforces: ±1 for pattern lengths 3–4,
±2 for 5+, and no fuzzy matching at all for patterns shorter than 3.
Previously, the pre-filter used a blanket ±2 for all lengths and didn't
reject short patterns, which meant it was more permissive than the
actual fuzzy matcher.

2. **Add an exact bigram bitset** (38×38 = 1444 bits, 184 bytes per
document) storing all lowercased 2-character sliding windows of symbol
names. At query time, use the q-gram count lemma (Ukkonen, 1992) to
require a minimum number of shared bigrams: `min_shared = |pattern| - 1
- 2k`, where `k` is the edit distance threshold. If fewer pattern
bigrams match, the document is skipped for fuzzy matching entirely.

3. **Share threshold logic** by extracting
`WordSimilarityChecker.GetThreshold(int)` and `MinFuzzyLength` so the
pre-filter and the actual fuzzy matcher use the same constants instead
of duplicating magic numbers.

## Motivation

Consider a codebase with thousands of documents, many containing
6-character symbol names. When the user types "FooBar" in NavigateTo:

- **Before**: The length check passes for every document that has any
symbol of length 4–8 (±2). The caller then creates a `PatternMatcher`
with fuzzy matching enabled and runs it against every symbol in those
documents. Most of these comparisons compute edit distance only to
conclude there's no match.

- **After**: The length check passes (length 6, threshold ±2, checks
4–8). But the bigram check then asks: do at least 1 of the 5 bigrams
("fo","oo","ob","ba","ar") exist in this document? For documents with
symbols like "XyzWvq", none of these bigrams are stored, so the document
is skipped without ever creating a `PatternMatcher`.

## Bigram bitset design

Characters are mapped to a 38-element alphabet:
- `a`–`z` → 0..25
- `0`–`9` → 26..35
- `_` → 36
- everything else (Unicode) → 37 ("other" bucket)

This gives exact membership for the 37 most common identifier characters
and a single overflow bucket for rare Unicode characters. The bitset is
a `ulong[23]` (184 bytes) — compact enough to store per-document with
negligible memory overhead.

## Filtering effectiveness by pattern length

| Pattern length | k (threshold) | min_shared bigrams | Filtering power
|
|:-:|:-:|:-:|:--|
| < 3 | — | — | Fuzzy disabled entirely |
| 3 | 1 | 0 | No filtering (always passes) |
| 4 | 1 | 1 | Need ≥ 1 of 3 bigrams |
| 5 | 2 | 0 | No filtering (always passes) |
| 6 | 2 | 1 | Need ≥ 1 of 5 bigrams |
| 7 | 2 | 2 | Need ≥ 2 of 6 bigrams |
| 8 | 2 | 3 | Need ≥ 3 of 7 bigrams |
| 10 | 2 | 5 | Need ≥ 5 of 9 bigrams |

Lengths 3 and 5 get no bigram filtering benefit (min_shared = 0). For
length 3, we should strongly consider whether fuzzy matching itself is
too permissive to even support — with an edit distance threshold of 1, a
3-character pattern like "abc" would fuzzy-match "xbc", "axc", "abx",
"ab", "abcd", etc. This is a potential follow-up.

## Reference

Ukkonen, E. (1992). "Approximate string-matching with q-grams and
maximal matches." *Theoretical Computer Science*, 92(1), 191–211. [DOI:
10.1016/0304-3975(92)90143-4](https://doi.org/10.1016/0304-3975(92)90143-4)

The q-gram count lemma states: each edit operation can destroy at most
*q* q-grams from a string. Therefore, if `edit_distance(s, t) ≤ k`, then
at least `|s| - 1 - q·k` of s's q-gram positions must have a matching
q-gram in t. For bigrams (q=2): `min_shared = |pattern| - 1 - 2k`.

## Benchmarks:

| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-----------------------------------------------
|-----------:|----------:|----------:|------:|------:|------:|----------:|
| 'SameLen: LengthCheck pass (false positive)' | 0.9901 ns | 0.1123 ns |
0.0292 ns | - | - | - | - |
| 'VariedLen: LengthCheck pass' | 0.1914 ns | 0.0710 ns | 0.0184 ns | -
| - | - | - |
| 'SameLen: BigramCheck reject (true negative!)' | 7.1073 ns | 0.5302 ns
| 0.1377 ns | - | - | - | - |
| 'SameLen: BigramCheck pass (true positive)' | 7.1716 ns | 0.4179 ns |
0.0647 ns | - | - | - | - |
| 'SameLen: Combined reject (bigram saves)' | 28.2908 ns | 2.1554 ns |
0.3336 ns | - | - | - | - |
| 'SameLen: Combined pass' | 28.7838 ns | 1.3179 ns | 0.2039 ns | - | -
| - | - |
| 'SameLen: BigramCheck reject len=8' | 10.4243 ns | 1.4520 ns | 0.2247
ns | - | - | - | - |
| 'VariedLen: BigramCheck reject len=10' | 14.9580 ns | 0.7782 ns |
0.1204 ns | - | - | - | - |

---------

Co-authored-by: Cyrus Najmabadi <Cyrus Najmabadi>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants