Improve navigate-to filtering to avoid scanning documents unnecessarily.#82431
Conversation
| public void TestCachingOfPriorResult() | ||
| { | ||
| using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true); | ||
| using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true, PatternMatchKinds.All); |
There was a problem hiding this comment.
switched away from optional paraemters. and made teh bool into an enum as we want to control both if we do fuzzy matching or non-fuzzy matching.
|
Not ready for review quite yet. |
| public void TestCachingOfPriorResult() | ||
| { | ||
| using var matcher = PatternMatcher.CreatePatternMatcher("Goo", includeMatchedSpans: true, allowFuzzyMatching: true); | ||
| using var matcher = PatternMatcher.CreatePatternMatcher("Goo", culture: null, includeMatchedSpans: true); |
There was a problem hiding this comment.
a core aspect of this change was removing the flag from the creation of the matcher(where it served as a filter), and just making it an option that the callee pases in when asking for matches.
| return; | ||
|
|
||
| if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var nameMatchKinds)) | ||
| return; |
There was a problem hiding this comment.
thrust of the change. we do a VERY quick initial filter to not even compute/load the exensive syntactic-tree-index for a doc.
| // First, load the lightweight filter index to check if this document could possibly match. | ||
| // This avoids loading the much larger TopLevelSyntaxTreeIndex for non-matching documents. | ||
| var filterIndex = await NavigateToSearchIndex.GetRequiredIndexAsync(document, cancellationToken).ConfigureAwait(false); | ||
| if (!filterIndex.ProbablyContainsNavigateToMatch(patternName, patternContainer, out var allowFuzzyMatching)) |
There was a problem hiding this comment.
note: when checking the core index, we are told if we need to bother doing expensive 'fuzzy searching' when searching the full index. this helps us avoid all the work on edit distances when not necessary.
| { | ||
| var patternMatcher = GetPatternMatcher(culture, includeMatchSpans); | ||
| var match = patternMatcher.GetFirstMatch(text); | ||
| var match = patternMatcher.GetFirstMatch(text, allowFuzzyMatching: false); |
There was a problem hiding this comment.
matches value previoulsy passed to constructor below.
| /// <item>A length bitset indicating which symbol name lengths exist in the document.</item> | ||
| /// </list> | ||
| /// </summary> | ||
| private readonly struct NavigateToSearchInfo |
There was a problem hiding this comment.
New index designed to be extremely fast, even for the types of WrLi searches that are performed in navigate to.
| /// how to break on dots and match subportions of that container. Note: all matching is done in a non-fuzzy way. Fuzzy | ||
| /// matching is only performed by the <see cref="SimplePatternMatcher"/>. | ||
| /// </summary> | ||
| public sealed partial class ContainerPatternMatcher : PatternMatcher |
There was a problem hiding this comment.
ended up making tehse public. trying to have a common abstract entrypoint was leading to too much confusion, esp. since the Container matcher and the Simpler matcher have entirely different logic around fuzzy matching.
| string pattern, | ||
| CultureInfo? culture, | ||
| bool includeMatchedSpans, | ||
| bool allowFuzzyMatching) |
There was a problem hiding this comment.
removed defaults to make callsites clearer. note: previous default here was to not allow fuzzy matching (for either simple or container pattern matching). fuzzy is always opted into.
| // Ok, we had a dotted pattern. Have to see if the symbol's container matches the | ||
| // pattern as well. | ||
| using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart); | ||
| using var containerPatternMatcher = PatternMatcher.CreateDotSeparatedContainerMatcher(containerPart, includeMatchedSpans: false); |
There was a problem hiding this comment.
'false' matches defaults from before (commented below).
| // pattern they passed in. Otherwise, make a pattern matcher just for the part after | ||
| // the dot. | ||
| using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false); | ||
| using var nameMatcher = PatternMatcher.CreatePatternMatcher(namePart, includeMatchedSpans: false, allowFuzzyMatching: false); |
There was a problem hiding this comment.
'false' matches defaults from before (commented below).
| /// Murmur hash is public domain. Actual code is included below as reference. | ||
| /// </summary> | ||
| private static int ComputeHash(string key, int seed, bool isCaseSensitive) | ||
| => ComputeHash(key.AsSpan(), seed, isCaseSensitive); |
There was a problem hiding this comment.
spanified a lot of this code. the new index makes use of a lot of checks preferring to use spans over allocations.
| }); | ||
| } | ||
|
|
||
| #region Pre-filter integration tests |
There was a problem hiding this comment.
Note: i only added a handful of navigate-to tests, since we don't really need end-to-end validatino. THe majority of testing is actually at the PatternMatcher/NavToIndex level directly so we can do fine grained testing of the new indices in a wealth of scenarios.
|
|
||
| using var matches = TemporaryArray<PatternMatch>.Empty; | ||
| PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef()); | ||
| PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: true).AddMatches(candidate, ref matches.AsRef()); |
There was a problem hiding this comment.
i made these options non-optional to make it much clearer what behavior is expected.
|
|
||
| using var matches = TemporaryArray<PatternMatch>.Empty; | ||
| PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true).AddMatches(candidate, ref matches.AsRef()); | ||
| PatternMatcher.CreatePatternMatcher(pattern, includeMatchedSpans: true, allowFuzzyMatching: false).AddMatches(candidate, ref matches.AsRef()); |
There was a problem hiding this comment.
i made these options non-optional to make it much clearer what behavior is expected.
|
@jasonmalinowski @dibarbet ptal. |
| /// Breaks an identifier string into constituent parts. | ||
| /// </summary> | ||
| public static void AddWordParts(string identifier, ref TemporaryArray<TextSpan> parts) | ||
| public static void AddWordParts(ReadOnlySpan<char> identifier, ref TemporaryArray<TextSpan> parts) |
There was a problem hiding this comment.
None of this needed to work on strings. And by moving to ROS, we make it so a lot of other code ni pattern matching doesn't need to allocate substrings (it can just oprate on subspans of a string).
|
|
||
| // Store individual hump-initial characters (uppercased). | ||
| foreach (var part in charParts) | ||
| AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]); |
There was a problem hiding this comment.
AddToSet(humpStrings, [char.ToUpperInvariant(name[part.Start])]);
Could this be moved inside the next loop? Or, even just merge these three loops into something like:
for (var i = 0; i < charParts.Count; i++)
{
var part = charParts[i];
var ci = char.ToUpperInvariant(name[part.Start]);
// Store individual hump-initial characters (uppercased).
AddToSet(humpStrings, [ci]);
// Store all ordered pairs of hump-initial characters (uppercased).
// For "GooBarQuux" (humps G, B, Q): stores "GB", "GQ", "BQ".
// Storing ALL pairs (not just adjacent) enables non-contiguous CamelCase matching
// like "GQ" matching "GooBarQuux" by skipping "Bar".
for (var j = i + 1; j < charParts.Count; j++)
{
var cj = char.ToUpperInvariant(name[charParts[j].Start]);
AddToSet(humpStrings, [ci, cj]);
}
// Store lowercased prefixes of each character-part (hump).
// For "GooBar" → parts "Goo", "Bar" → stores "g", "go", "goo", "b", "ba", "bar".
// Used by the all-lowercase DP to check whether a pattern can be split into segments
// that each match a hump prefix.
for (var prefixLen = 1; prefixLen <= part.Length; prefixLen++)
AddToSet(humpPrefixStrings, loweredName.Slice(part.Start, prefixLen));
}
There was a problem hiding this comment.
We could merge. But i find it much clearer as three separate passes. That way you don't hav to think about all of together. i would far prefer keeping as is if that's ok with you. it's not going to be expensive. This is enumerating a TempArray (which will alsmost always be on the stack). So it's extremely fast :)
There was a problem hiding this comment.
Note: in a followup i actually break thse into separate helpers, to keep core logic clean.
| [Fact] | ||
| public void ContainerMatcher_ExactMatchReportsExact() | ||
| { | ||
| using var matcher = PatternMatcher.CreateDotSeparatedContainerMatcher( |
There was a problem hiding this comment.
is this test also only testing the pattern matcher? Should it be testing the container char set index instead (or is it intended to be testing the matcher)? Same with the other container tests below
There was a problem hiding this comment.
Fair point, removed.
1c78dd0 to
5c6094c
Compare
…e container matching always non-fuzzy Move _allowFuzzyMatching from the base PatternMatcher class to SimplePatternMatcher, since container matching is always non-fuzzy. Refactor MatchPatternChunk to try non-fuzzy first then fuzzy as a fallback, removing the two-pass pattern from AddMatches. Remove CreateContainerPatternMatcher (replaced by CreateDotSeparatedContainerMatcher which now accepts nullable pattern). This is prerequisite for the NavigateTo prefilter work where the prefilter tells the caller whether fuzzy matching is worth attempting. Co-authored-by: Cursor <cursoragent@cursor.com>
Mechanical update of all call sites following the PatternMatcher refactor: - DeclarationFinder: explicit allowFuzzyMatching: false, includeMatchedSpans: false - DocumentOutlineViewModel: add missing using on PatternMatcher - PatternMatcherTests: explicit allowFuzzyMatching: false Co-authored-by: Cursor <cursoragent@cursor.com>
Change StringBreaker's public API from string to ReadOnlySpan<char> parameters (AddWordParts, AddCharacterParts, AddParts, GenerateSpan) and all internal helpers. This enables span-based processing in the new NavigateTo prefilter code without allocating substrings. Co-authored-by: Cursor <cursoragent@cursor.com>
Add ComputeHash(ReadOnlySpan<char>) and GetCharacter(ReadOnlySpan<char>) overloads. Refactor ComputeHash(string) to delegate to the span overload. Simplify Add(char) and ProbablyContains(char) to delegate to the span overloads. Add doc remark on ProbablyContains(ReadOnlySpan<char>) explaining why it cannot share the caching optimization from the string overload. This enables allocation-free bloom filter operations in the NavigateTo prefilter. Co-authored-by: Cursor <cursoragent@cursor.com>
Move the NavigateTo pre-filter data out of TopLevelSyntaxTreeIndex into its own NavigateToSearchIndex (a new AbstractSyntaxIndex<NavigateToSearchIndex> subclass). This allows the lightweight filter data to be loaded independently of the heavyweight TopLevelSyntaxTreeIndex containing all declared symbols. Documents rejected by the filter never need to load the full index. Bump serialization format checksum from "51" to "52" to invalidate stale indices. - New: NavigateToSearchIndex.cs, _Create.cs, _Persistence.cs - TopLevelSyntaxTreeIndex: remove _navigateToSearchInfo field and related code Co-authored-by: Cursor <cursoragent@cursor.com>
…grams, and length bitset The core pre-filtering logic for NavigateTo, stored per-document in the NavigateToSearchIndex. Contains five bloom filters and a length bitset for fast document-level rejection: - _humpCharFilter: individual uppercased hump-initial characters (e.g. 'G','B' for "GooBar") - _humpBigramFilter: all C(k,2) ordered pairs of hump initials (e.g. "GB","GQ","BQ" for "GooBarQuux"), enabling non-contiguous CamelCase matching - _humpPrefixFilter: lowercased prefixes of each hump (e.g. "g","go","goo","b","ba","bar" for "GooBar"), used by a DP algorithm for all-lowercase patterns - _trigramFilter: 3-char sliding windows for LowercaseSubstring matching (e.g. "line" in "Readline") - _containerFilter: hump chars from fully-qualified container names - _symbolNameLengthBitset: 64-bit bitset for fuzzy match length pre-filtering For all-lowercase patterns, a DP algorithm splits the pattern into segments that each match a stored hump prefix, avoiding the exponential capitalization enumeration. Includes extensive documentation with examples throughout. Co-authored-by: Cursor <cursoragent@cursor.com>
Integrate the lightweight NavigateToSearchIndex pre-filter into both the in-process and cached document search paths: - InProcess: load the filter index first via NavigateToSearchIndex.GetRequiredIndexAsync, call CouldContainNavigateToMatch to decide whether to load the full TopLevelSyntaxTreeIndex. Pass the allowFuzzyMatching signal from the filter to the PatternMatcher. - CachedDocumentSearch: add s_cachedFilterIndexMap for the filter index, load it before the full index in the parallel ForEachAsync loop. - Add LowercaseSubstring -> Fuzzy mapping to s_kindPairs. Co-authored-by: Cursor <cursoragent@cursor.com>
~1030 lines of declarative theory-based tests covering all match kinds and pre-filter behaviors. Organized into regions: - Positive tests: verify CouldContainNavigateToMatch returns true for all supported PatternMatchKinds (Exact, Prefix, CamelCaseExact/Prefix/Substring, NonContiguous variants, StartOfWordSubstring, LowercaseSubstring, Fuzzy) with both mixed-case and all-lowercase patterns - Negative tests: verify rejection when hump chars, bigrams, trigrams, and lengths don't match - CrossHumpSubstring: documents NonLowercaseSubstring as intentionally not guaranteed - Multiple symbols: document-level filter matches any symbol - Fuzzy/non-fuzzy split: verify allowFuzzyMatching output signal - Individual filter checks via TestAccessor (hump, DP, trigram, length) - Container matching tests Co-authored-by: Cursor <cursoragent@cursor.com>
End-to-end tests verifying that the NavigateToSearchIndex pre-filter correctly allows matches through the full NavigateTo pipeline: - CamelCase hump bigram (GB -> GooBar) - All-lowercase DP hump prefix (goo -> GooBar) - Trigram substring (line -> Readline) - Fuzzy match enabled by length check (ToEror -> ToError) - No match when all checks fail (XyzXyzXyzXyz) Co-authored-by: Cursor <cursoragent@cursor.com>
BenchmarkDotNet benchmarks for measuring NavigateToSearchIndex pre-filter performance across various pattern types (CamelCase, all-lowercase, trigram, fuzzy, container-qualified). Co-authored-by: Cursor <cursoragent@cursor.com>
5c6094c to
3c67904
Compare
|
@ToddGrun @dibarbet @JoeRobich this broke in: Doesn't seem like me. Can you massage CI plz. Thanks! |
|
@davkean you might be interested in this as well. |
|
@ToddGrun can you kick CI? it looks like it's failing in tests designed to find flakeyness elsewhere. |
|
@ToddGrun plzmrge. Tnx :-) |
|
@dibarbet @JoeRobich @ToddGrun @phil-allen-msft or @roslyn-ide plzmrgthnx :) |
…ected length thresholds (#82457) Followup to #82431 ## Summary The NavigateTo pre-filter gates whether fuzzy (edit-distance) matching is attempted for a document. Previously it used only a symbol-name-length bitset with a fixed ±2 delta — too coarse in practice. In a large file with many 6-character symbols, searching for "FooBar" (also 6 characters) would pass the length check for every document, forcing an expensive full-scan fuzzy match against every symbol in those documents even when the character content is completely different. This PR improves the fuzzy pre-filter in three ways: 1. **Correct the length check thresholds** to match what `WordSimilarityChecker` actually enforces: ±1 for pattern lengths 3–4, ±2 for 5+, and no fuzzy matching at all for patterns shorter than 3. Previously, the pre-filter used a blanket ±2 for all lengths and didn't reject short patterns, which meant it was more permissive than the actual fuzzy matcher. 2. **Add an exact bigram bitset** (38×38 = 1444 bits, 184 bytes per document) storing all lowercased 2-character sliding windows of symbol names. At query time, use the q-gram count lemma (Ukkonen, 1992) to require a minimum number of shared bigrams: `min_shared = |pattern| - 1 - 2k`, where `k` is the edit distance threshold. If fewer pattern bigrams match, the document is skipped for fuzzy matching entirely. 3. **Share threshold logic** by extracting `WordSimilarityChecker.GetThreshold(int)` and `MinFuzzyLength` so the pre-filter and the actual fuzzy matcher use the same constants instead of duplicating magic numbers. ## Motivation Consider a codebase with thousands of documents, many containing 6-character symbol names. When the user types "FooBar" in NavigateTo: - **Before**: The length check passes for every document that has any symbol of length 4–8 (±2). The caller then creates a `PatternMatcher` with fuzzy matching enabled and runs it against every symbol in those documents. Most of these comparisons compute edit distance only to conclude there's no match. - **After**: The length check passes (length 6, threshold ±2, checks 4–8). But the bigram check then asks: do at least 1 of the 5 bigrams ("fo","oo","ob","ba","ar") exist in this document? For documents with symbols like "XyzWvq", none of these bigrams are stored, so the document is skipped without ever creating a `PatternMatcher`. ## Bigram bitset design Characters are mapped to a 38-element alphabet: - `a`–`z` → 0..25 - `0`–`9` → 26..35 - `_` → 36 - everything else (Unicode) → 37 ("other" bucket) This gives exact membership for the 37 most common identifier characters and a single overflow bucket for rare Unicode characters. The bitset is a `ulong[23]` (184 bytes) — compact enough to store per-document with negligible memory overhead. ## Filtering effectiveness by pattern length | Pattern length | k (threshold) | min_shared bigrams | Filtering power | |:-:|:-:|:-:|:--| | < 3 | — | — | Fuzzy disabled entirely | | 3 | 1 | 0 | No filtering (always passes) | | 4 | 1 | 1 | Need ≥ 1 of 3 bigrams | | 5 | 2 | 0 | No filtering (always passes) | | 6 | 2 | 1 | Need ≥ 1 of 5 bigrams | | 7 | 2 | 2 | Need ≥ 2 of 6 bigrams | | 8 | 2 | 3 | Need ≥ 3 of 7 bigrams | | 10 | 2 | 5 | Need ≥ 5 of 9 bigrams | Lengths 3 and 5 get no bigram filtering benefit (min_shared = 0). For length 3, we should strongly consider whether fuzzy matching itself is too permissive to even support — with an edit distance threshold of 1, a 3-character pattern like "abc" would fuzzy-match "xbc", "axc", "abx", "ab", "abcd", etc. This is a potential follow-up. ## Reference Ukkonen, E. (1992). "Approximate string-matching with q-grams and maximal matches." *Theoretical Computer Science*, 92(1), 191–211. [DOI: 10.1016/0304-3975(92)90143-4](https://doi.org/10.1016/0304-3975(92)90143-4) The q-gram count lemma states: each edit operation can destroy at most *q* q-grams from a string. Therefore, if `edit_distance(s, t) ≤ k`, then at least `|s| - 1 - q·k` of s's q-gram positions must have a matching q-gram in t. For bigrams (q=2): `min_shared = |pattern| - 1 - 2k`. ## Benchmarks: | Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | |----------------------------------------------- |-----------:|----------:|----------:|------:|------:|------:|----------:| | 'SameLen: LengthCheck pass (false positive)' | 0.9901 ns | 0.1123 ns | 0.0292 ns | - | - | - | - | | 'VariedLen: LengthCheck pass' | 0.1914 ns | 0.0710 ns | 0.0184 ns | - | - | - | - | | 'SameLen: BigramCheck reject (true negative!)' | 7.1073 ns | 0.5302 ns | 0.1377 ns | - | - | - | - | | 'SameLen: BigramCheck pass (true positive)' | 7.1716 ns | 0.4179 ns | 0.0647 ns | - | - | - | - | | 'SameLen: Combined reject (bigram saves)' | 28.2908 ns | 2.1554 ns | 0.3336 ns | - | - | - | - | | 'SameLen: Combined pass' | 28.7838 ns | 1.3179 ns | 0.2039 ns | - | - | - | - | | 'SameLen: BigramCheck reject len=8' | 10.4243 ns | 1.4520 ns | 0.2247 ns | - | - | - | - | | 'VariedLen: BigramCheck reject len=10' | 14.9580 ns | 0.7782 ns | 0.1204 ns | - | - | - | - | --------- Co-authored-by: Cyrus Najmabadi <Cyrus Najmabadi> Co-authored-by: Cursor <cursoragent@cursor.com>
NavigateTo: Add document pre-filtering to skip non-matching files early
Summary
Adds a new lightweight
NavigateToSearchIndexthat lets NavigateTo skip documents that can't match the user's query — without ever loading the full declared-symbol data. For documents that pass, the filter also controls whether fuzzy matching is enabled, avoiding expensive edit-distance work when no symbol is close in length.Speedup is anywhere from 15,000x better in the worst case, to 250,000x better in the best case. This is for when we can filter the doc out. When we can't filter the doc out, we spend at worst 300ns on an initial check, which is then dwarfed by millions of ns of actual searching cost. So there's no appreciable cost for documents where there is a hit.
The worst case (only 15,000x speedup) is for users with an all-lowercase NavTo pattern (sorry @dibarbet). Users with any capital letters in their pattern will get the 250,000x improvement.
Why
Today NavigateTo loads the full
TopLevelSyntaxTreeIndexfor every document and runsPatternMatcheragainst every symbol. Most documents don't match, so this is wasted work. By putting a small pre-filter in a separate, independently-loaded index, we can reject the vast majority of documents for the cost of loading a few bloom filters and frozen sets.Two-index architecture
NavigateToSearchIndexextendsAbstractSyntaxIndexand is persisted/loaded independently fromTopLevelSyntaxTreeIndex. The search pipeline becomes:NavigateToSearchIndex(tiny: bloom filters, frozen sets, a 64-bit bitset).TopLevelSyntaxTreeIndex.TopLevelSyntaxTreeIndex(large: allDeclaredSymbolInfos) and do real matching.Both indices are built from the same syntax tree walk but stored separately. Both search paths (in-process and cached/pre-solution-load) follow this pattern.
Filter data
Five pieces of data per document:
FrozenSet<string>GB,WrLi.BloomFilterwrlivia a DP pass.BloomFilterline.FrozenSet<char>ulongThe hump set and container set use exact
FrozenSets — the domain is small (for English: ~700 hump strings, a handful of container chars; Unicode adds more but it stays bounded by actual document content). Bloom filters are used for hump prefixes and trigrams where the domain is unbounded.Key design choices
Bigrams over single chars: Storing ordered pairs of hump initials (
GB,GQ,BQforGooBarQuux) is far more selective than single characters. A loneGwould match too many documents.DP for all-lowercase: Instead of enumerating 2^n capitalizations, we store lowercased hump prefixes and run an O(n × max_hump_length) DP to check if the pattern can be split into valid hump prefixes. Inspired by the Metals bloom filter approach.
Split fuzzy signal:
ProbablyContainsMatchreturns both "should we search?" and "should we allow fuzzy?" separately. When no symbol has a name length within ±2, fuzzy matching is disabled for that document.Testing
Benchmark
BenchmarkDotNet micro-benchmarks for many scenarios.