
Conversation

@glepnir
Member

@glepnir glepnir commented Jun 20, 2025

Introduces an optimized fuzzy matching algorithm using the Smith-Waterman algorithm with affine gap penalties, accelerated via SIMD instructions (SSE2 for x86 and NEON for ARM). The implementation features:

  1. Full affine gap penalty model using three state matrices:

    • M matrix: Tracks match/mismatch scores
    • X matrix: Tracks gaps in the haystack (insertions in needle)
    • Y matrix: Tracks gaps in the needle (deletions in needle)
  2. Position-based scoring bonuses:

    • Prefix bonus (+15) for start of haystack
    • Delimiter bonus (+30) after separator chars
    • Capitalization bonus (+30) for a capital after a lowercase letter
    • Case match bonus (+10) for exact case matches
    • Exact match bonus (+50) for perfect matches
  3. SIMD optimizations:

    • Diagonal strip processing with 8-element parallelization
    • Vectorized gap penalty calculations
    • Batched character comparison with bonus application
    • Fallback to scalar computation for edge cases

The algorithm improves on classic Smith-Waterman by:

  • Using affine gap penalties (open: -3, extend: -1) instead of linear
  • Adding position-aware bonuses for natural language patterns
  • Leveraging SIMD for 4-8x speedup on modern CPUs
  • Implementing traceback with a state machine for efficient path recovery
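
To make the three-matrix recurrence above concrete, here is a minimal scalar sketch of the scoring step. The constants and bonus rules are illustrative approximations of this description, not the PR's actual code; traceback and the exact-match bonus are omitted for brevity.

```c
/* Minimal scalar sketch of Smith-Waterman with affine gaps (three matrices).
 * Constants and bonus rules approximate the description above and are purely
 * illustrative; traceback and the exact-match bonus are omitted. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum {
    MATCH = 16, MISMATCH = -4, GAP_OPEN = -3, GAP_EXTEND = -1,
    BONUS_PREFIX = 15, BONUS_DELIMITER = 30, BONUS_CAMEL = 30, BONUS_CASE = 10,
    MAXLEN = 128
};

static int max2(int a, int b) { return a > b ? a : b; }
static int max3(int a, int b, int c) { return max2(max2(a, b), c); }

/* Positional bonus for matching the haystack character at 0-based index pos. */
static int position_bonus(const char *hay, int pos)
{
    if (pos == 0)
        return BONUS_PREFIX;                              /* start of haystack */
    char prev = hay[pos - 1];
    if (prev == '/' || prev == '\\' || prev == '_' || prev == '-' || prev == '.' || prev == ' ')
        return BONUS_DELIMITER;                           /* after a separator */
    if (islower((unsigned char)prev) && isupper((unsigned char)hay[pos]))
        return BONUS_CAMEL;                               /* camelCase boundary */
    return 0;
}

/* Best local alignment score of needle against hay (score only, no traceback). */
static int fuzzy_score(const char *needle, const char *hay)
{
    static int M[MAXLEN + 1][MAXLEN + 1];                 /* match/mismatch */
    static int X[MAXLEN + 1][MAXLEN + 1];                 /* gap in haystack */
    static int Y[MAXLEN + 1][MAXLEN + 1];                 /* gap in needle */
    int n = (int)strlen(needle), m = (int)strlen(hay), best = 0;

    if (n == 0 || m == 0 || n > MAXLEN || m > MAXLEN)
        return 0;
    for (int j = 0; j <= m; j++)
        M[0][j] = X[0][j] = Y[0][j] = 0;
    for (int i = 0; i <= n; i++)
        M[i][0] = X[i][0] = Y[i][0] = 0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int same = tolower((unsigned char)needle[i - 1])
                       == tolower((unsigned char)hay[j - 1]);
            int sub = same ? MATCH + position_bonus(hay, j - 1) : MISMATCH;
            if (same && isalpha((unsigned char)hay[j - 1]) && needle[i - 1] == hay[j - 1])
                sub += BONUS_CASE;                        /* exact-case match */

            /* Local alignment: the match state never drops below zero. */
            M[i][j] = max2(0, max3(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub);
            /* Affine gaps: opening costs more than extending. */
            X[i][j] = max2(M[i - 1][j] + GAP_OPEN, X[i - 1][j] + GAP_EXTEND);
            Y[i][j] = max2(M[i][j - 1] + GAP_OPEN, Y[i][j - 1] + GAP_EXTEND);
            best = max2(best, M[i][j]);
        }
    }
    return best;
}

int main(void)
{
    printf("%d\n", fuzzy_score("normal", "tools/vim/src/normal.c"));
    printf("%d\n", fuzzy_score("normal", "op-markplaceholder.d/args.ctags"));
    return 0;
}
```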

References:

- relates to neovim/neovim#34101
- CMU affine gap lecture: https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf
- Nucleo's SIMD approach: https://github.com/helix-editor/nucleo
- Frizbee's scoring model: https://github.com/Saghen/frizbee

@glepnir glepnir marked this pull request as draft June 20, 2025 07:26
@glepnir
Member Author

glepnir commented Jun 20, 2025

cc @ychin

@glepnir glepnir force-pushed the simd_sw branch 2 times, most recently from 79a7744 to 1459f84 on June 20, 2025 07:39
@habamax
Contributor

habamax commented Jun 20, 2025

Does it have enhanced camelcase support or "it just works"?

@glepnir
Member Author

glepnir commented Jun 20, 2025

Capitalization bonus for camelcase

@habamax
Contributor

habamax commented Jun 20, 2025

So the algorithm is different, and the results as well?

I need to try it out with my use cases.

@habamax
Contributor

habamax commented Jun 20, 2025

Oh, it is not a drop-in replacement for the existing fuzzymatch functions.

@glepnir
Member Author

glepnir commented Jun 20, 2025

It is based on Smith-Waterman with fully affine gaps. Unlike nucleo, three matrices are used here instead of two; fzf uses one matrix. This is just a preliminary implementation of the algorithm and needs further testing. I have not replaced the fuzzy-related functions yet; currently it is only used in fuzzy_match_str, which means it can be tested with omnifunc.

@glepnir glepnir force-pushed the simd_sw branch 6 times, most recently from 5e1e241 to ff3678f on June 20, 2025 11:15
@chrisbra
Member

Thanks, but I am still not sure what you want to achieve here. I don't really think it makes sense to use SIMD here.
Also, why have another fuzzy algorithm here? I don't think any user will be aware of the specifics of the fuzzy algorithm, so they wouldn't even know that they could decide to use a different one.

@glepnir
Member Author

glepnir commented Jun 21, 2025

This algorithm is much better and is used in many successful projects that I have listed in the references. The current fuzzy algorithm is not ideal and not efficient. At present, the overall implementation is complete. After some testing, it can replace the current algorithm; it uses SIMD acceleration on platforms that support it and falls back to a scalar implementation otherwise.
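
(For what it's worth, the usual pattern for "SIMD on platforms that support it, scalar otherwise" is a compile-time dispatch. Below is a hedged sketch of that pattern, not the PR's actual code: the function names and the tiny row operation are placeholders, and a NEON path guarded by __ARM_NEON would mirror the SSE2 branch.)

```c
/* Illustrative compile-time dispatch between a SIMD and a scalar code path.
 * Placeholder names, not the PR's symbols; a __ARM_NEON branch would mirror
 * the SSE2 one with vaddq_s32/vmaxq_s32. */
#include <stdio.h>

#if defined(__SSE2__)
# include <emmintrin.h>
#endif

/* Portable scalar fallback: always compiled, always correct. */
static int add_gap_scalar(const int *prev, int *cur, int len, int gap)
{
    int best = 0;
    for (int j = 0; j < len; j++) {
        cur[j] = prev[j] + gap;
        if (cur[j] > best)
            best = cur[j];
    }
    return best;
}

/* One vectorizable inner step: extend a gap penalty across a row, track the max. */
static int add_gap_row(const int *prev, int *cur, int len, int gap)
{
#if defined(__SSE2__)
    __m128i vgap = _mm_set1_epi32(gap);
    __m128i vbest = _mm_setzero_si128();
    int j = 0;
    for (; j + 4 <= len; j += 4) {
        __m128i v = _mm_add_epi32(_mm_loadu_si128((const __m128i *)(prev + j)), vgap);
        _mm_storeu_si128((__m128i *)(cur + j), v);
        /* SSE2 has no 32-bit max, so emulate it with compare + blend. */
        __m128i gt = _mm_cmpgt_epi32(v, vbest);
        vbest = _mm_or_si128(_mm_and_si128(gt, v), _mm_andnot_si128(gt, vbest));
    }
    int lanes[4], best;
    _mm_storeu_si128((__m128i *)lanes, vbest);
    best = lanes[0];
    for (int k = 1; k < 4; k++)
        if (lanes[k] > best)
            best = lanes[k];
    int tail = add_gap_scalar(prev + j, cur + j, len - j, gap);
    return tail > best ? tail : best;
#else
    return add_gap_scalar(prev, cur, len, gap);  /* no SIMD available: fall back */
#endif
}

int main(void)
{
    int prev[8] = { 3, 9, 1, 7, 2, 8, 4, 6 }, cur[8];
    printf("best = %d\n", add_gap_row(prev, cur, 8, -1));  /* prints 8 */
    return 0;
}
```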

@ychin
Contributor

ychin commented Jun 22, 2025

Is there concrete analysis, or at least examples, of 1) the matching quality of the new algorithm, and 2) the performance improvements enabled by SIMD, and how it performs compared to the old algorithm?

For (1), while it may sound better, I think it would be easier to illustrate the point with examples of common use cases where we can see how the old and new compare. I'm sure it performs better in some cases, but do we know it generally gives better matches on average?

For (2), I would imagine some basic performance benchmarks would be useful, to be able to compare both the scalar performance of the old versus the new algorithm, and the scalar versus SIMD performance of the new one. SIMD code adds maintenance cost, but if we can make concrete statements like "this <insert_string> costs 6 seconds to match on <insert_modern_cpu> but 0.5 seconds with SIMD", the impact is easier to understand. I think I mentioned a similar sentiment in #15837.


Also, do you know what the current algorithm was based on again?
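
(As an illustration of the kind of benchmark being asked for here, a minimal C harness could look like the sketch below. The two scorers are trivial stand-ins, not real implementations; an actual comparison would plug in the old and new fuzzy matchers, a SIMD vs. scalar build of the new one, and a realistic candidate list such as the output of `rg --files`.)

```c
/* Sketch of a micro-benchmark for comparing fuzzy scorers.  The scorers below
 * are stand-ins; replace them with the implementations under test. */
#include <stdio.h>
#include <string.h>
#include <time.h>

static int score_old(const char *needle, const char *hay)   /* stand-in */
{
    return strstr(hay, needle) != NULL;
}

static int score_new(const char *needle, const char *hay)   /* stand-in */
{
    return strstr(hay, needle) != NULL;
}

/* Time `reps` passes over `n` candidate lines; returns elapsed milliseconds. */
static double bench_ms(int (*score)(const char *, const char *),
                       const char *needle, const char **lines, int n, int reps)
{
    struct timespec t0, t1;
    volatile int sink = 0;                /* keep calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++)
            sink += score(needle, lines[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    /* A realistic run would load tens of thousands of paths instead. */
    static const char *lines[] = { "src/normal.c", "src/quickfix.c", "runtime/doc/eval.txt" };
    int n = (int)(sizeof(lines) / sizeof(lines[0]));

    printf("old: %8.2f ms\n", bench_ms(score_old, "normal", lines, n, 200000));
    printf("new: %8.2f ms\n", bench_ms(score_new, "normal", lines, n, 200000));
    return 0;
}
```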

@glepnir
Member Author

glepnir commented Jun 22, 2025

I'll definitely provide it, just not at the moment. I've only completed an initial implementation of the algorithm and want to evaluate whether it meets the requirements and whether there are any aspects I might have overlooked. Once it's stable, I'll include benchmark data as well. There's also a fallback scalar implementation. The current algorithm is ported from https://github.com/forrestthewoods/lib_fts/tree/master/code and the accompanying blog post.

@glepnir glepnir force-pushed the simd_sw branch 2 times, most recently from d3035ec to 5dae211 on June 22, 2025 08:35
@ubaldot
Contributor

ubaldot commented Jul 19, 2025

I second @ychin's comment: we need to define a sound, measurable comparison metric.

And on top of that, we should keep its maintenance in mind: will maintaining the feature require a PhD in math, along with being a guru of the C language and knowing every corner of Vim? If so, in my experience, a sub-optimal solution would be way better.

EDIT: ... and also, out of curiosity, why did you choose this algorithm instead of Levenshtein and its variants, BK-trees, etc.? :)

@glepnir
Member Author

glepnir commented Jul 20, 2025

EDIT: ... and also, out of curiosity, why did you choose this algorithm instead of Levenshtein and its variants, BK-trees, etc.? :)

Because this algorithm is used in the referenced projects. It has been implemented in completion projects, and those projects are very successful. So why do we need another algorithm...

This PR fully implements the three-matrix affine gaps. I am considering removing SIMD to make it simpler, but I have other things to do at the moment. Maybe I will come back to these PRs after next month.

@ubaldot
Contributor

ubaldot commented Jul 20, 2025

Because this algorithm is used in the referenced projects. It has been implemented in completion projects, and those projects are very successful. So why do we need another algorithm...

Well, I am pretty sure that other algorithms are used in other successful projects as well :)
I am fairly confident that each algorithm has its pros and cons, and that the choice depends on what one wants to achieve and which features have greater priority: speed? accuracy? other metrics?

Generally, before choosing a method, it would be prudent to perform a literature review and evaluate the best approach for the given use case, rather than going with the first method found. ;-)

But I have other things to do at the moment. Maybe I will come back to these PRs after next month.

No worries. Keep in mind that we are all volunteers here, and we don't/can't expect anything from anyone. We rely on our common passion for software and reciprocal trust.

@glepnir
Member Author

glepnir commented Jul 21, 2025

Well, I am pretty sure that other algorithms are used in other successful projects as well :)
I am fairly confident that each algorithm has its pros and cons, and that the choice depends on what one wants to achieve and which features have greater priority: speed? accuracy? other metrics?

The main references are https://github.com/helix-editor/nucleo and fzf. They do a very good job here. Helix is an editor; fzf integrates with many tools, including Vim/Neovim.

@lifepillar
Contributor

I'm glad that someone is looking at Smith-Waterman for this.

@glepnir Your implementation takes quadratic space in the input size. Do you think that memory usage can become a bottleneck for the intended usage, also considering that it is using three matrices instead of one? Implementing SW in linear space is possible, although backtracking becomes a bit more difficult (and I'm not suggesting that it should be done in this PR, eh!).

Perhaps, this PR would be accepted more easily if the SIMD optimization were left for a separate PR. Would doing that be too much work?
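
(On the linear-space point above: when only the score is needed, the standard trick is to keep two rolling rows per matrix instead of the full n x m matrices. A rough sketch with illustrative constants follows; it deliberately omits traceback, which is exactly the part that becomes harder in linear space, and it is not a drop-in for the PR's code, which also needs match positions.)

```c
/* Sketch of the linear-space idea: each DP row depends only on the previous
 * one, so two rows per matrix replace the full n x m matrices.  Constants are
 * illustrative only; no traceback, score only. */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MATCH = 16, MISMATCH = -4, GAP_OPEN = -3, GAP_EXTEND = -1 };

static int imax(int a, int b) { return a > b ? a : b; }

static int fuzzy_score_linear_space(const char *needle, const char *hay)
{
    int n = (int)strlen(needle), m = (int)strlen(hay), best = 0;
    /* Two rows per matrix: *p = row i-1, *c = row i (zero-initialized). */
    int *Mp = calloc((size_t)m + 1, sizeof(int)), *Mc = calloc((size_t)m + 1, sizeof(int));
    int *Xp = calloc((size_t)m + 1, sizeof(int)), *Xc = calloc((size_t)m + 1, sizeof(int));
    int *Yp = calloc((size_t)m + 1, sizeof(int)), *Yc = calloc((size_t)m + 1, sizeof(int));

    if (Mp && Mc && Xp && Xc && Yp && Yc) {
        for (int i = 1; i <= n; i++) {
            Mc[0] = Xc[0] = Yc[0] = 0;
            for (int j = 1; j <= m; j++) {
                int same = tolower((unsigned char)needle[i - 1])
                           == tolower((unsigned char)hay[j - 1]);
                int sub = same ? MATCH : MISMATCH;
                Mc[j] = imax(0, imax(Mp[j - 1], imax(Xp[j - 1], Yp[j - 1])) + sub);
                Xc[j] = imax(Mp[j] + GAP_OPEN, Xp[j] + GAP_EXTEND);         /* gap in haystack */
                Yc[j] = imax(Mc[j - 1] + GAP_OPEN, Yc[j - 1] + GAP_EXTEND); /* gap in needle */
                best = imax(best, Mc[j]);
            }
            int *t;                                   /* swap prev and cur rows */
            t = Mp; Mp = Mc; Mc = t;
            t = Xp; Xp = Xc; Xc = t;
            t = Yp; Yp = Yc; Yc = t;
        }
    }
    free(Mp); free(Mc); free(Xp); free(Xc); free(Yp); free(Yc);
    return best;
}

int main(void)
{
    printf("%d\n", fuzzy_score_linear_space("normal", "tools/vim/src/normal.c"));
    return 0;
}
```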

@glepnir
Member Author

glepnir commented Jul 22, 2025

Perhaps, this PR would be accepted more easily if the SIMD optimization were left for a separate PR. Would doing that be too much work?

Yes, that's my plan: remove SIMD and do a benchmark. If the performance is OK, then SIMD is not necessary.

@glepnir glepnir force-pushed the simd_sw branch 2 times, most recently from dcb33a1 to 034f1e3 on July 29, 2025 12:41
@glepnir
Member Author

glepnir commented Jul 29, 2025

@lifepillar I've already replaced the current algorithm (though it still needs some cleanup; this is just for early testing). You can compile it directly and use this affine-gaps Smith-Waterman implementation. I've removed the SIMD part for now. You can try testing it with your completion plugin.

This implementation reflects my personal understanding after reading the affine-gaps paper and studying the Smith-Waterman algorithm. There isn't a full C version of the three-matrix approach that can be ported directly 😢. There are still some parts I'm uncertain about; it might take more time to look into them further.

I also noticed someone is trying to port fzy, which is another option. I haven't looked into the details of fzy's algorithm yet, but it probably works great too. I might check it out when I have time.

Give this a try first; we can revisit things later, since I might be a bit busy for a while. 😩

@gcanat
Contributor

gcanat commented Aug 1, 2025

Got a compilation error:

quickfix.c: In function ‘vgr_match_buflines’:
quickfix.c:6491:20: error: too many arguments to function ‘fuzzy_match’
 6491 |             while (fuzzy_match(str + col, spat, FALSE, &score,
      |                    ^~~~~~~~~~~
In file included from proto.h:188,
                 from vim.h:2464,
                 from quickfix.c:14:
proto/search.pro:40:5: note: declared here
   40 | int fuzzy_match(char_u *str, char_u *pat_arg, int matchseq, int *outScore, int_u *matches, int maxMatches);
      |     ^~~~~~~~~~~

It works with the following patch:

diff --git a/src/quickfix.c b/src/quickfix.c
index ab595d7bb..11a4a5d3f 100644
--- a/src/quickfix.c
+++ b/src/quickfix.c
@@ -6489,7 +6489,7 @@ vgr_match_buflines(
            // Fuzzy string match
            CLEAR_FIELD(matches);
            while (fuzzy_match(str + col, spat, FALSE, &score,
-                       matches, sz, TRUE) > 0)
+                       matches, sz) > 0)
            {
                // Pass the buffer number so that it gets used even for a
                // dummy buffer, unless duplicate_name is set, then the

@gcanat
Contributor

gcanat commented Aug 1, 2025

Not sure if it is the best way to test, but I tried it like this:
vim --clean, then sourced a file containing this:

func! Fuzzytest(pattern)
	let file_list = systemlist("rg --files")
	echom file_list->len()
	let starttime = reltime()
	let result = matchfuzzy(file_list, a:pattern)
	echom (starttime->reltime()->reltimefloat() * 1000)
endfunc

then :call Fuzzytest("diffusion")
The file_list length is approximately 52,000.
The time is 85 ms on the master branch, whereas it is 500 ms with this branch, so it appears to be much slower.

@glepnir
Member Author

glepnir commented Aug 2, 2025

Thanks. Are the results correct? There are indeed some performance issues in certain areas. I want to confirm whether the algorithm produces better results.

@gcanat
Contributor

gcanat commented Aug 2, 2025

Thanks. Are the results correct? There are indeed some performance issues in certain areas. I want to confirm whether the algorithm produces better results.

I don't know; it's probably a case-by-case thing. Maybe some results are better in some cases, and the other way around in other cases... 🤷‍♂️
But I can give you an example of "bad" results with this branch.
The top 3 results for Fuzzytest("normal.c"):

tools/ctags/Units/optscript.r/op-matchloc2line.d/args.ctags
tools/ctags/Units/optscript.r/op-markplaceholder.d/args.ctags
tools/vim/src/normal.c

whereas with the master branch it is:

tools/vim/src/normal.c
tools/wezterm/deps/harfbuzz/harfbuzz/src/hb-ot-shape-normalize.cc
tools/ctags/Tmain/common-prelude.d/normalize_spaces.expected

@benknoble
Contributor

Likely conflicts with #17900

@glepnir
Member Author

glepnir commented Aug 5, 2025

I don't know fzy's algorithm, so I'm not sure if it's better. I just don't have much time to optimize this PR, because my goal was to introduce SIMD, but it seems there is not much appetite for SIMD in core.

@glepnir glepnir closed this Aug 5, 2025
@przepompownia
Contributor

Or maybe give users the ability to choose the algorithm via some option? I do not strongly encourage this, due to the cost of maintaining each one individually.

@glepnir
Member Author

glepnir commented Aug 5, 2025

So, there's only one best algorithm. I used the SIMD Smith-Waterman + affine-gap algorithm from Helix and Blink.cmp as a reference. It already works well in completion plugins and editors. I haven't compared the specific algorithm with fzy, so I don't know which one is better, and I don't have time to evaluate it yet. A pull request for fzy has already been created, so let's give it a try.

@glepnir glepnir deleted the simd_sw branch August 5, 2025 09:47
@benknoble
Contributor

So, there's only one best algorithm.

That seems unlikely ;) best in which cases? Along which axes (time, memory, user experience, etc.)?

@glepnir
Member Author

glepnir commented Aug 6, 2025

So, there's only one best algorithm.

That seems unlikely ;) best in which cases? Along which axes (time, memory, user experience, etc.)?

Aha... I mean there can only be one optimal algorithm in core, not multiple. 😢

@ychin
Contributor

ychin commented Aug 7, 2025

But it seems there is not much appetite for SIMD in core

Personally, I don't think SIMD is an issue per se. I just think it's better to only do it if we feel there is a need in specific spots first, rather than introducing SIMD because we want to. It's fun to add SIMD, though, so I get the allure. The other thing is that SIMD mostly provides a constant-factor improvement, so it's not going to turn a fundamentally slow algorithm into a fast one.

