Improve navigate-to filtering to avoid scanning documents unnecessarily. by CyrusNajmabadi · Pull Request #82431 · dotnet/roslyn

CyrusNajmabadi · 2026-02-17T14:02:15Z

NavigateTo: Add document pre-filtering to skip non-matching files early

Summary

Adds a new lightweight NavigateToSearchIndex that lets NavigateTo skip documents that can't match the user's query — without ever loading the full declared-symbol data. For documents that pass, the filter also controls whether fuzzy matching is enabled, avoiding expensive edit-distance work when no symbol is close in length.

Speedup is anywhere from 15,000x better in the worst case, to 250,000x better in the best case. This is for when we can filter the doc out. When we can't filter the doc out, we spend at worst 300ns on an initial check, which is then dwarfed by millions of ns of actual searching cost. So there's no appreciable cost for documents where there is a hit.

The worst case (only 15,000x speedup) is for users with an all-lowercase NavTo pattern (sorry @dibarbet). Users with any capital letters in their pattern will get the 250,000x improvement.

Why

Today NavigateTo loads the full TopLevelSyntaxTreeIndex for every document and runs PatternMatcher against every symbol. Most documents don't match, so this is wasted work. By putting a small pre-filter in a separate, independently-loaded index, we can reject the vast majority of documents for the cost of loading a few bloom filters and frozen sets.

Two-index architecture

NavigateToSearchIndex extends AbstractSyntaxIndex and is persisted/loaded independently from TopLevelSyntaxTreeIndex. The search pipeline becomes:

Load NavigateToSearchIndex (tiny: bloom filters, frozen sets, a 64-bit bitset).
If the filter rejects, skip — never load TopLevelSyntaxTreeIndex.
On pass, load TopLevelSyntaxTreeIndex (large: all DeclaredSymbolInfos) and do real matching.

Both indices are built from the same syntax tree walk but stored separately. Both search paths (in-process and cached/pre-solution-load) follow this pattern.

Filter data

Five pieces of data per document:

Filter	Type	Purpose
Hump set	`FrozenSet<string>`	Hump initials + ordered bigrams. Rejects mixed-case CamelCase queries like `GB`, `WrLi`.
Hump prefix filter	`BloomFilter`	Lowercased hump prefixes. Handles all-lowercase queries like `wrli` via a DP pass.
Trigram filter	`BloomFilter`	3-char sliding windows. Rejects substring queries like `line`.
Container char set	`FrozenSet<char>`	Container hump initials. Rejects container-qualified queries.
Length bitset	`ulong`	Which symbol name lengths exist. Gates fuzzy matching (±2).

The hump set and container set use exact FrozenSets — the domain is small (for English: ~700 hump strings, a handful of container chars; Unicode adds more but it stays bounded by actual document content). Bloom filters are used for hump prefixes and trigrams where the domain is unbounded.

Key design choices

Bigrams over single chars: Storing ordered pairs of hump initials (GB, GQ, BQ for GooBarQuux) is far more selective than single characters. A lone G would match too many documents.

DP for all-lowercase: Instead of enumerating 2^n capitalizations, we store lowercased hump prefixes and run an O(n × max_hump_length) DP to check if the pattern can be split into valid hump prefixes. Inspired by the Metals bloom filter approach.

Split fuzzy signal: ProbablyContainsMatch returns both "should we search?" and "should we allow fuzzy?" separately. When no symbol has a name length within ±2, fuzzy matching is disabled for that document.

Testing

300 unit tests in NavigateToSearchIndexTests cover each filter individually (including guaranteed-rejection cases for exact sets, cross-symbol behavior, the DP break optimization) and the top-level ProbablyContainsMatch entry point (cross-validated against PatternMatcher). 5 integration tests in NavigateToTests.cs exercise the full pipeline.

Benchmark

BenchmarkDotNet micro-benchmarks for many scenarios.

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
'FullScan (10k symbols, no pre-filter)'	4,679,488.12 ns	253,728.513 ns	39,264.768 ns	1.000	-	-	-	320 B
'Realistic: CamelCase miss'	19.44 ns	0.736 ns	0.191 ns	0.000	-	-	-	-
'Realistic: CamelCase hit'	22.17 ns	1.971 ns	0.305 ns	0.000	-	-	-	-
'Realistic: lowercase hit (hump-prefix DP)'	299.48 ns	12.928 ns	2.001 ns	0.000	-	-	-	-
'Realistic: lowercase miss (hump-prefix DP)'	147.36 ns	3.946 ns	1.025 ns	0.000	-	-	-	-
'Realistic: lowercase trigram hit'	185.78 ns	4.812 ns	0.745 ns	0.000	-	-	-	-
'Realistic: CamelCase trigram hit'	20.72 ns	2.973 ns	0.772 ns	0.000	-	-	-	-
'Realistic: lowercase trigram miss'	154.92 ns	9.421 ns	2.447 ns	0.000	-	-	-	-
'Realistic: CamelCase trigram miss'	17.21 ns	0.157 ns	0.024 ns	0.000	-	-	-	-
'Realistic: container hit'	40.27 ns	2.326 ns	0.604 ns	0.000	-	-	-	-
'Realistic: container miss'	33.53 ns	1.688 ns	0.438 ns	0.000	-	-	-	-
'Realistic: lowercase verbatim hit'	299.95 ns	9.907 ns	2.573 ns	0.000	-	-	-	-
'Realistic: CamelCase verbatim hit'	23.09 ns	1.002 ns	0.155 ns	0.000	-	-	-	-
'Realistic: lowercase multi-word hit'	238.93 ns	8.594 ns	1.330 ns	0.000	-	-	-	-
'Realistic: CamelCase multi-word hit'	34.16 ns	2.226 ns	0.578 ns	0.000	-	-	-	-
'StressAll: hit'	21.53 ns	1.637 ns	0.425 ns	0.000	-	-	-	-
'StressAll: miss'	21.24 ns	0.560 ns	0.146 ns	0.000	-	-	-	-
'StressHumpSet: CamelCase hit'	21.10 ns	1.054 ns	0.274 ns	0.000	-	-	-	-
'StressHumpSet: CamelCase miss'	21.10 ns	1.152 ns	0.299 ns	0.000	-	-	-	-
'StressHumpPrefix: lowercase hit'	186.21 ns	9.798 ns	1.516 ns	0.000	-	-	-	-
'StressHumpPrefix: lowercase miss'	219.83 ns	3.192 ns	0.829 ns	0.000	-	-	-	-
'StressTrigram: lowercase hit'	177.90 ns	2.970 ns	0.771 ns	0.000	-	-	-	-
'StressTrigram: lowercase miss'	156.88 ns	3.827 ns	0.994 ns	0.000	-	-	-	-
'StressContainer: container hit'	29.41 ns	1.238 ns	0.322 ns	0.000	-	-	-	-
'StressContainer: container miss'	32.27 ns	1.977 ns	0.513 ns	0.000	-	-	-	-

dotnet run --project src/Tools/IdeCoreBenchmarks/IdeCoreBenchmarks.csproj -f net10.0 -c Release -- --filter '*PreFilter*'

CyrusNajmabadi · 2026-02-17T16:24:37Z

    public void TestCachingOfPriorResult()
    {
-        using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true);
+        using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true, PatternMatchKinds.All);


switched away from optional paraemters. and made teh bool into an enum as we want to control both if we do fuzzy matching or non-fuzzy matching.

CyrusNajmabadi · 2026-02-17T16:51:01Z

Not ready for review quite yet.

CyrusNajmabadi · 2026-02-18T09:57:26Z

    public void TestCachingOfPriorResult()
    {
-        using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true);
+        using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true);


a core aspect of this change was removing the flag from the creation of the matcher(where it served as a filter), and just making it an option that the callee pases in when asking for matches.

CyrusNajmabadi · 2026-02-18T10:01:57Z

+                        return;
+
+                    if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var nameMatchKinds))
+                        return;


thrust of the change. we do a VERY quick initial filter to not even compute/load the exensive syntactic-tree-index for a doc.

CyrusNajmabadi · 2026-02-18T10:03:05Z

+        // First, load the lightweight filter index to check if this document could possibly match.
+        // This avoids loading the much larger TopLevelSyntaxTreeIndex for non-matching documents.
+        var filterIndex = await NavigateToSearchIndex.GetRequiredIndexAsync(document, cancellationToken).ConfigureAwait(false);
+        if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var allowFuzzyMatching))


note: when checking the core index, we are told if we need to bother doing expensive 'fuzzy searching' when searching the full index. this helps us avoid all the work on edit distances when not necessary.

CyrusNajmabadi · 2026-02-18T10:13:05Z

    {
        var patternMatcher = GetPatternMatcher(culture, includeMatchSpans);
-        var match = patternMatcher.GetFirstMatch(text);
+        var match = patternMatcher.GetFirstMatch(text, allowFuzzyMatching: false);


matches value previoulsy passed to constructor below.

CyrusNajmabadi · 2026-02-18T10:18:42Z

+    /// <item>A length bitset indicating which symbol name lengths exist in the document.</item>
+    /// </list>
+    /// </summary>
+    private readonly struct NavigateToSearchInfo


New index designed to be extremely fast, even for the types of WrLi searches that are performed in navigate to.

CyrusNajmabadi · 2026-02-18T10:21:01Z

+    /// how to break on dots and match subportions of that container.  Note: all matching is done in a non-fuzzy way.  Fuzzy
+    /// matching is only performed by the <see cref="SimplePatternMatcher"/>.
+    /// </summary>
+    public sealed partial class ContainerPatternMatcher : PatternMatcher


ended up making tehse public. trying to have a common abstract entrypoint was leading to too much confusion, esp. since the Container matcher and the Simpler matcher have entirely different logic around fuzzy matching.

CyrusNajmabadi · 2026-02-18T10:44:50Z

+        string pattern,
+        CultureInfo? culture,
+        bool includeMatchedSpans,
+        bool allowFuzzyMatching)


removed defaults to make callsites clearer. note: previous default here was to not allow fuzzy matching (for either simple or container pattern matching). fuzzy is always opted into.

CyrusNajmabadi · 2026-02-18T10:45:07Z

        // Ok, we had a dotted pattern.  Have to see if the symbol's container matches the 
        // pattern as well.
-        using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart);
+        using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart, includeMatchedSpans: false);


'false' matches defaults from before (commented below).

CyrusNajmabadi · 2026-02-18T10:45:17Z

        // pattern they passed in.  Otherwise, make a pattern matcher just for the part after
        // the dot.
-        using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false);
+        using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false, allowFuzzyMatching: false);


'false' matches defaults from before (commented below).

CyrusNajmabadi · 2026-02-18T10:49:08Z

    /// Murmur hash is public domain.  Actual code is included below as reference.
    /// </summary>
    private static int ComputeHash(string key, int seed, bool isCaseSensitive)
+        => ComputeHash(key.AsSpan(), seed, isCaseSensitive);


spanified a lot of this code. the new index makes use of a lot of checks preferring to use spans over allocations.

CyrusNajmabadi · 2026-02-18T12:17:02Z

        });
    }
+
+    #region Pre-filter integration tests


Note: i only added a handful of navigate-to tests, since we don't really need end-to-end validatino. THe majority of testing is actually at the PatternMatcher/NavToIndex level directly so we can do fine grained testing of the new indices in a wealth of scenarios.

CyrusNajmabadi · 2026-02-18T12:17:26Z


        using var matches = TemporaryArray<PatternMatch>.Empty;
-        PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef());
+        PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: true).AddMatches(candidate, ref matches.AsRef());


i made these options non-optional to make it much clearer what behavior is expected.

CyrusNajmabadi · 2026-02-18T12:18:51Z


        using var matches = TemporaryArray<PatternMatch>.Empty;
-        PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef());
+        PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: false).AddMatches(candidate, ref matches.AsRef());


i made these options non-optional to make it much clearer what behavior is expected.

CyrusNajmabadi · 2026-02-18T15:28:23Z

@jasonmalinowski @dibarbet ptal.

CyrusNajmabadi · 2026-02-18T16:22:56Z

    /// Breaks an identifier string into constituent parts.
    /// </summary>
-    public static void AddWordParts(string identifier, ref TemporaryArray<TextSpan> parts)
+    public static void AddWordParts(ReadOnlySpan<char> identifier, ref TemporaryArray<TextSpan> parts)


None of this needed to work on strings. And by moving to ROS, we make it so a lot of other code ni pattern matching doesn't need to allocate substrings (it can just oprate on subspans of a string).

ToddGrun · 2026-02-18T19:16:41Z

+
+                // Store individual hump-initial characters (uppercased).
+                foreach (var part in charParts)
+                    AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]);


AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]);

Could this be moved inside the next loop? Or, even just merge these three loops into something like:

for (var i = 0; i < charParts.Count; i++) { var part = charParts[i]; var ci = char.ToUpperInvariant(name[part.Start]); // Store individual hump-initial characters (uppercased). AddToSet(humpStrings, [ci]); // Store all ordered pairs of hump-initial characters (uppercased). // For "GooBarQuux" (humps G, B, Q): stores "GB", "GQ", "BQ". // Storing ALL pairs (not just adjacent) enables non-contiguous CamelCase matching // like "GQ" matching "GooBarQuux" by skipping "Bar". for (var j = i + 1; j < charParts.Count; j++) { var cj = char.ToUpperInvariant(name[charParts[j].Start]); AddToSet(humpStrings, [ci, cj]); } // Store lowercased prefixes of each character-part (hump). // For "GooBar" → parts "Goo", "Bar" → stores "g", "go", "goo", "b", "ba", "bar". // Used by the all-lowercase DP to check whether a pattern can be split into segments // that each match a hump prefix. for (var prefixLen = 1; prefixLen <= part.Length; prefixLen++) AddToSet(humpPrefixStrings, loweredName.Slice(part.Start, prefixLen)); }

We could merge. But i find it much clearer as three separate passes. That way you don't hav to think about all of together. i would far prefer keeping as is if that's ok with you. it's not going to be expensive. This is enumerating a TempArray (which will alsmost always be on the stack). So it's extremely fast :)

Note: in a followup i actually break thse into separate helpers, to keep core logic clean.

dibarbet · 2026-02-18T22:56:40Z

+    [Fact]
+    public void ContainerMatcher_ExactMatchReportsExact()
+    {
+        using var matcher = PatternMatcher.CreateDotSeparatedContainerMatcher(


is this test also only testing the pattern matcher? Should it be testing the container char set index instead (or is it intended to be testing the matcher)? Same with the other container tests below

Fair point, removed.

…e container matching always non-fuzzy Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the prefilter tells the caller whether fuzzy matching is worth attempting. Co-authored-by: Cursor <cursoragent@cursor.com>

Mechanical update of all call sites following the PatternMatcher refactor: - DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false - DocumentOutlineViewModel: add missing using on PatternMatcher - PatternMatcherTests: explicit allowFuzzyMatching: false Co-authored-by: Cursor <cursoragent@cursor.com>

Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts, AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based processing in the new NavigateTo prefilter code without allocating substrings. Co-authored-by: Cursor <cursoragent@cursor.com>

Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads. Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and ProbablyContains(char) to delegate to the span overloads. Add doc remark on ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching optimization from the string overload. This enables allocation-free bloom filter operations in the NavigateTo prefilter. Co-authored-by: Cursor <cursoragent@cursor.com>

Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass). This allows the lightweight filter data to be loaded independently of the heavyweight TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the filter never need to load the full index. Bump serialization format checksum from "51" to "52" to invalidate stale indices. - New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs - TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code Co-authored-by: Cursor <cursoragent@cursor.com>

…grams, and length bitset The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex. Contains five bloom filters and a length bitset for fast document-level rejection: - _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar") - _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for "GooBarQuux"), enabling non-contiguous CamelCase matching - _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar" for "GooBar"), used by a DP algorithm for all-lowercase patterns - _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in "Readline") - _containerFilter: hump chars from fully-qualified container names - _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match a stored hump prefix, avoiding the exponential capitalization enumeration. Includes extensive documentation with examples throughout. Co-authored-by: Cursor <cursoragent@cursor.com>

Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and cached document search paths: - InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync, call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex. Pass the allowFuzzyMatching signal from the filter to the PatternMatcher. - CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the full index in the parallel ForEachAsync loop. - Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs. Co-authored-by: Cursor <cursoragent@cursor.com>

~1030 lines of declarative theory-based tests covering all match kinds and pre-filter behaviors. Organized into regions: - Positive tests: verify CouldContainNavigateToMatch returns true for all supported PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case and all-lowercase patterns - Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths don't match - CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed - Multiple symbols: document-level filter matches any symbol - Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal - Individual filter checks via TestAccessor (hump, DP, trigram, length) - Container matching tests Co-authored-by: Cursor <cursoragent@cursor.com>

End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows matches through the full NavigateTo pipeline: - CamelCase hump bigram (GB -> GooBar) - All-lowercase DP hump prefix (goo -> GooBar) - Trigram substring (line -> Readline) - Fuzzy match enabled by length check (ToEror -> ToError) - No match when all checks fail (XyzXyzXyzXyz) Co-authored-by: Cursor <cursoragent@cursor.com>

BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>

CyrusNajmabadi · 2026-02-19T12:28:21Z

@dibarbet @ToddGrun Have rebased teh PR to make it easier to review. Feel free to merge when satisfied.

CyrusNajmabadi · 2026-02-19T13:19:20Z

@ToddGrun @dibarbet @JoeRobich this broke in:

{ "HelixJobId": "73ea6201-4363-4d62-9772-554fb48cb404", "HelixWorkItemName": "workitem_21" }


Error message
Assert.Empty() Failure: Collection was not empty
Collection: [System.AggregateException: One or more errors occurred. ---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
at System.Collections.Generic.Dictionary`2.Enumerator.MoveNext()
at System.Linq.Enumerable.Any[TSource](IEnumerable`1 source, Func`2 predicate)
at Microsoft.VisualStudio.LanguageServices.Implementation.ProjectSystem.VisualStudioWorkspaceImpl.<RefreshProjectExistsUIContextForLanguageAsync>d__116.MoveNext() in //src/VisualStudio/Core/Def/ProjectSystem/VisualStudioWorkspaceImpl.cs:line 1596

Doesn't seem like me. Can you massage CI plz. Thanks!

CyrusNajmabadi · 2026-02-19T14:47:56Z

@davkean you might be interested in this as well.

ToddGrun

CyrusNajmabadi · 2026-02-19T16:09:06Z

@ToddGrun can you kick CI? it looks like it's failing in tests designed to find flakeyness elsewhere.

CyrusNajmabadi · 2026-02-19T22:59:25Z

@ToddGrun plzmrge. Tnx :-)

CyrusNajmabadi · 2026-02-20T08:30:44Z

@dibarbet @JoeRobich @ToddGrun @phil-allen-msft or @roslyn-ide plzmrgthnx :)

…ected length thresholds (#82457) Followup to #82431 ## Summary The NavigateTo pre-filter gates whether fuzzy (edit-distance) matching is attempted for a document. Previously it used only a symbol-name-length bitset with a fixed ±2 delta — too coarse in practice. In a large file with many 6-character symbols, searching for "FooBar" (also 6 characters) would pass the length check for every document, forcing an expensive full-scan fuzzy match against every symbol in those documents even when the character content is completely different. This PR improves the fuzzy pre-filter in three ways: 1. **Correct the length check thresholds** to match what `WordSimilarityChecker` actually enforces: ±1 for pattern lengths 3–4, ±2 for 5+, and no fuzzy matching at all for patterns shorter than 3. Previously, the pre-filter used a blanket ±2 for all lengths and didn't reject short patterns, which meant it was more permissive than the actual fuzzy matcher. 2. **Add an exact bigram bitset** (38×38 = 1444 bits, 184 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use the q-gram count lemma (Ukkonen, 1992) to require a minimum number of shared bigrams: `min_shared = |pattern| - 1 - 2k`, where `k` is the edit distance threshold. If fewer pattern bigrams match, the document is skipped for fuzzy matching entirely. 3. **Share threshold logic** by extracting `WordSimilarityChecker.GetThreshold(int)` and `MinFuzzyLength` so the pre-filter and the actual fuzzy matcher use the same constants instead of duplicating magic numbers. ## Motivation Consider a codebase with thousands of documents, many containing 6-character symbol names. When the user types "FooBar" in NavigateTo: - **Before**: The length check passes for every document that has any symbol of length 4–8 (±2). The caller then creates a `PatternMatcher` with fuzzy matching enabled and runs it against every symbol in those documents. Most of these comparisons compute edit distance only to conclude there's no match. - **After**: The length check passes (length 6, threshold ±2, checks 4–8). But the bigram check then asks: do at least 1 of the 5 bigrams ("fo","oo","ob","ba","ar") exist in this document? For documents with symbols like "XyzWvq", none of these bigrams are stored, so the document is skipped without ever creating a `PatternMatcher`. ## Bigram bitset design Characters are mapped to a 38-element alphabet: - `a`–`z` → 0..25 - `0`–`9` → 26..35 - `_` → 36 - everything else (Unicode) → 37 ("other" bucket) This gives exact membership for the 37 most common identifier characters and a single overflow bucket for rare Unicode characters. The bitset is a `ulong[23]` (184 bytes) — compact enough to store per-document with negligible memory overhead. ## Filtering effectiveness by pattern length | Pattern length | k (threshold) | min_shared bigrams | Filtering power | |:-:|:-:|:-:|:--| | < 3 | — | — | Fuzzy disabled entirely | | 3 | 1 | 0 | No filtering (always passes) | | 4 | 1 | 1 | Need ≥ 1 of 3 bigrams | | 5 | 2 | 0 | No filtering (always passes) | | 6 | 2 | 1 | Need ≥ 1 of 5 bigrams | | 7 | 2 | 2 | Need ≥ 2 of 6 bigrams | | 8 | 2 | 3 | Need ≥ 3 of 7 bigrams | | 10 | 2 | 5 | Need ≥ 5 of 9 bigrams | Lengths 3 and 5 get no bigram filtering benefit (min_shared = 0). For length 3, we should strongly consider whether fuzzy matching itself is too permissive to even support — with an edit distance threshold of 1, a 3-character pattern like "abc" would fuzzy-match "xbc", "axc", "abx", "ab", "abcd", etc. This is a potential follow-up. ## Reference Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." *Theoretical Computer Science*, 92(1), 191–211. [DOI: 10.1016/0304-3975(92)90143-4](https://doi.org/10.1016/0304-3975(92)90143-4) The q-gram count lemma states: each edit operation can destroy at most *q* q-grams from a string. Therefore, if `edit_distance(s, t) ≤ k`, then at least `|s| - 1 - q·k` of s's q-gram positions must have a matching q-gram in t. For bigrams (q=2): `min_shared = |pattern| - 1 - 2k`. ## Benchmarks: | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |----------------------------------------------- |-----------:|----------:|----------:|------:|------:|------:|----------:| | 'SameLen: LengthCheck pass (false positive)' | 0.9901 ns | 0.1123 ns | 0.0292 ns | - | - | - | - | | 'VariedLen: LengthCheck pass' | 0.1914 ns | 0.0710 ns | 0.0184 ns | - | - | - | - | | 'SameLen: BigramCheck reject (true negative!)' | 7.1073 ns | 0.5302 ns | 0.1377 ns | - | - | - | - | | 'SameLen: BigramCheck pass (true positive)' | 7.1716 ns | 0.4179 ns | 0.0647 ns | - | - | - | - | | 'SameLen: Combined reject (bigram saves)' | 28.2908 ns | 2.1554 ns | 0.3336 ns | - | - | - | - | | 'SameLen: Combined pass' | 28.7838 ns | 1.3179 ns | 0.2039 ns | - | - | - | - | | 'SameLen: BigramCheck reject len=8' | 10.4243 ns | 1.4520 ns | 0.2247 ns | - | - | - | - | | 'VariedLen: BigramCheck reject len=10' | 14.9580 ns | 0.7782 ns | 0.1204 ns | - | - | - | - | --------- Co-authored-by: Cyrus Najmabadi <Cyrus Najmabadi> Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot added the Area-IDE label Feb 17, 2026

dotnet-policy-service Bot added Community The pull request was submitted by a contributor who is not a Microsoft employee. VSCode labels Feb 17, 2026

CyrusNajmabadi marked this pull request as ready for review February 17, 2026 16:03

CyrusNajmabadi requested a review from a team as a code owner February 17, 2026 16:03

CyrusNajmabadi commented Feb 17, 2026

View reviewed changes

CyrusNajmabadi marked this pull request as draft February 17, 2026 16:50

CyrusNajmabadi commented Feb 18, 2026

View reviewed changes

Comment thread src/Workspaces/Core/Portable/FindSymbols/Shared/AbstractSyntaxIndex_Persistence.cs Outdated

CyrusNajmabadi commented Feb 18, 2026

View reviewed changes

Comment thread src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

CyrusNajmabadi commented Feb 18, 2026

View reviewed changes

CyrusNajmabadi marked this pull request as ready for review February 18, 2026 13:05

CyrusNajmabadi commented Feb 18, 2026

View reviewed changes

This was referenced Feb 18, 2026

The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#3008

Open

[android-arm64] The Operation will be canceled. The next steps may not contain expected logs. dotnet/dnceng#6408

Closed

ToddGrun reviewed Feb 18, 2026

View reviewed changes

Comment thread src/Workspaces/Core/Portable/PatternMatching/SimplePatternMatcher.cs Outdated

ToddGrun reviewed Feb 18, 2026

View reviewed changes

dibarbet approved these changes Feb 18, 2026

View reviewed changes

CyrusNajmabadi force-pushed the features/navigate-to-bloom-filter branch from 1c78dd0 to 5c6094c Compare February 19, 2026 09:25

Cyrus Najmabadi and others added 10 commits February 19, 2026 10:31

Add NavigateTo pre-filter benchmarks

3c67904

BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>

CyrusNajmabadi force-pushed the features/navigate-to-bloom-filter branch from 5c6094c to 3c67904 Compare February 19, 2026 09:34

CyrusNajmabadi mentioned this pull request Feb 19, 2026

Improve nav-to fuzzy pre-filtering with q-gram bigram bitset and corrected length thresholds #82457

Merged

ToddGrun approved these changes Feb 19, 2026

View reviewed changes

JoeRobich merged commit 31befaa into dotnet:main Feb 20, 2026
25 checks passed

dotnet-policy-service Bot added this to the Next milestone Feb 20, 2026

This was referenced Feb 20, 2026

[release/10.0.3xx] Source code updates from dotnet/roslyn dotnet/dotnet#5010

Merged

[main] Source code updates from dotnet/roslyn dotnet/dotnet#5006

Merged

dotnet-bot mentioned this pull request Feb 24, 2026

[Automated] PRs inserted in VS build main-11523.58 #82501

Closed

akhera99 modified the milestones: Next, 18.5 Feb 24, 2026

dotnet-maestro Bot mentioned this pull request Feb 25, 2026

[release/11.0.1xx-preview2] Source code updates from dotnet/roslyn dotnet/dotnet#5090

Merged

Conversation

CyrusNajmabadi commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

NavigateTo: Add document pre-filtering to skip non-matching files early

Summary

Why

Two-index architecture

Filter data

Key design choices

Testing

Benchmark

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Feb 17, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Feb 18, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Feb 19, 2026

Uh oh!

CyrusNajmabadi commented Feb 19, 2026

Uh oh!

CyrusNajmabadi commented Feb 19, 2026

Uh oh!

ToddGrun left a comment

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Feb 19, 2026

Uh oh!

CyrusNajmabadi commented Feb 19, 2026

Uh oh!

CyrusNajmabadi commented Feb 20, 2026

Uh oh!

Uh oh!

CyrusNajmabadi commented Feb 17, 2026 •

edited

Loading

CyrusNajmabadi Feb 19, 2026 •

edited

Loading