-
-
Notifications
You must be signed in to change notification settings - Fork 5.9k
Feat: Replace fuzzy algorithm with improved fzy-based implementation #17900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
First impressions: it is really good for the usecases I have -- matches what I believe it should match, no surprises like current implementation has. I didn't measure the performance -- it would be interesting to see the difference with 100k lines of regular text with long lines (80+ characters). I remember it was noticeably slow. |
|
Same test with the book: https://www.gutenberg.org/ebooks/4300 (30k+ lines) War and Peace 66k+ lines, https://www.gutenberg.org/ebooks/2600 |
You mention an original implementation (presumably of the fzy algorithm), but I don't see any mention of original source code (or a license)—if we are adapting existing code, understanding copyright requirements might be essential. Likely conflicts with #17581 |
| if (positions) | ||
| for (int i = 0; i < n; i++) | ||
| positions[i] = i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (positions) | |
| for (int i = 0; i < n; i++) | |
| positions[i] = i; | |
| if (positions) | |
| { | |
| for (int i = 0; i < n; i++) | |
| positions[i] = i; | |
| } |
src/fuzzy.c
Outdated
| { | ||
| match_row(&match, i, D[i], M[i], D[i - 1], M[i - 1]); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | |
| match_row(&match, i, D[i], M[i], D[i - 1], M[i - 1]); | |
| } | |
| match_row(&match, i, D[i], M[i], D[i - 1], M[i - 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR replaces the current fuzzy matching algorithm with an improved fzy-based implementation that addresses several issues with the previous algorithm, particularly with CamelCase matching and prioritization of matches at the beginning of strings. The new implementation is based on the fzy algorithm but enhanced with multibyte character support.
Key changes:
- Complete replacement of the fuzzy matching algorithm with a fzy-based implementation
- Removal of the
camelcaseoption frommatchfuzzy()andmatchfuzzypos()functions - Performance improvements with ~2x faster matching speed
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/vim.h | Updated fuzzy matching constants and removed old limits |
| src/fuzzy.c | New file containing the fzy-based fuzzy matching implementation |
| src/search.c | Removed old fuzzy matching algorithm implementation |
| src/testdir/test_matchfuzzy.vim | Updated tests to reflect new algorithm behavior and removed camelcase option tests |
| src/quickfix.c | Updated to use new fuzzy matching constants and function signature |
| src/proto/search.pro | Removed fuzzy matching function prototypes |
| src/proto/fuzzy.pro | Added new fuzzy matching function prototypes |
| Multiple files | Updated function calls to use new fuzzy score constants and removed camelcase parameter |
|
@chrisbra, should fzy MIT license be added as well? |
|
As I understood it, merely the fzy algorithm was picked but no code was shared, in which case we don't need to apply the MIT license. Correct @girishji ? |
|
@chrisbra, I intentionally copied 2 functions (~130 lines) from https://github.com/jhawthorn/fzy/blob/master/src/match.c, and made minor changes to support multibyte codepoints. Do you think we should also reimplement the copied functions? (Not a big deal). |
|
The license does look ugly when you open fuzzy.c. Maybe I'll just reimplement the algorithm from ground up? Let me know either way. |
|
Okay, it should be fine to just mention the license like in iscygpty.c. I don't care if it looks ugly ;) |
When I implemented the current fuzzy feature, I reached out to Forrest Smith and received his |
Ok, I'll reach out to the author. |
|
Great, can you please squash this in? diff --git a/src/fuzzy.c b/src/fuzzy.c
index eb37effad..6ebfa5b21 100644
--- a/src/fuzzy.c
+++ b/src/fuzzy.c
@@ -171,7 +171,7 @@ fuzzy_match_item_compare(const void *s1, const void *s2)
{
int exact_match1 = FALSE, exact_match2 = FALSE;
char_u *pat = ((fuzzyItem_T *)s1)->pat;
- int patlen = STRLEN(pat);
+ int patlen = (int)STRLEN(pat);
int startpos = ((fuzzyItem_T *)s1)->startpos;
exact_match1 = (startpos >= 0) && STRNCMP(pat,
((fuzzyItem_T *)s1)->itemstr + startpos, patlen) == 0;Otherwise, this causes failure on Appveyor build |
This version is based on the original fzy algorithm, with one key enhancement: multibyte character support. The original implementation supports only ASCII via table lookups for uppercase/lowercase scoring. This patch replaces those lookups with function calls, making it compatible with multibyte character sets. Core logic (match_row() and match_positions()) remains faithful to the original, but now operates on codepoints rather than single-byte characters.
Done. |
|
Thanks, I include it. Can you please verify, if this description is still correct? |
|
Thank you! |
Not quite but almost. I'll edit it later. |
Problem: matchfuzzy() leaks allocated lists (after 9.1.1627).
Solution: Restore the "retmatchpos" condition.
The memory leak cannot be caught by ASAN as the garbage collector still
tracks these lists, but it can be reproduced by the following script,
which increases Vim's memory usage to over 1 GiB without this fix:
```vim
for i in range(100000)
call matchfuzzy([repeat('a', 300)], repeat('a', 257))
endfor
```
related: vim#17900
Problem: matchfuzzy() leaks allocated lists (after 9.1.1627).
Solution: Restore the "retmatchpos" condition.
The memory leak cannot be caught by ASAN as the garbage collector still
tracks these lists, but it can be reproduced by the following script,
which increases Vim's memory usage to over 1 GiB without this fix:
```vim
for i in range(100000)
call matchfuzzy([repeat('a', 300)], repeat('a', 257))
endfor
```
related: vim#17900
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim#17531 (comment))
and
[here](vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim/neovim#34101
fixes vim#17531
closes: vim#17900
Signed-off-by: Girish Palya <girishji@gmail.com>
Signed-off-by: Christian Brabandt <cb@256bit.org>
Problem: fuzzy-matching can be improved
Solution: Implement a better fuzzy matching algorithm
(Girish Palya)
Replace fuzzy matching algorithm with improved fzy-based implementation
The
[current](https://www.forrestthewoods.com/blog/reverse_engineering_sublime_texts_fuzzy_match/)
fuzzy matching algorithm has several accuracy issues:
* It struggles with CamelCase
* It fails to prioritize matches at the beginning of strings, often
ranking middle matches higher.
After evaluating alternatives (see my comments
[here](vim/vim#17531 (comment))
and
[here](vim/vim#17531 (comment))),
I chose to adopt the [fzy](https://github.com/jhawthorn/fzy) algorithm,
which:
* Resolves the aforementioned issues.
* Performs better.
Implementation details
This version is based on the original fzy
[algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c),
with one key enhancement: **multibyte character support**.
* The original implementation supports only ASCII.
* This patch replaces ascii lookup tables with function calls, making it
compatible with multibyte character sets.
* Core logic (`match_row()` and `match_positions()`) remains faithful to
the original, but now operates on codepoints rather than single-byte
characters.
Performance
Tested against a dataset of **90,000 Linux kernel filenames**. Results
(in milliseconds) show a **\~2x performance improvement** over the
current fuzzy matching algorithm.
```
Search String Current Algo FZY Algo
-------------------------------------------------
init 131.759 66.916
main 83.688 40.861
sig 98.348 39.699
index 109.222 30.738
ab 72.222 44.357
cd 83.036 54.739
a 58.94 62.242
b 43.612 43.442
c 64.39 67.442
k 40.585 36.371
z 34.708 22.781
w 38.033 30.109
cpa 82.596 38.116
arz 84.251 23.964
zzzz 35.823 22.75
dimag 110.686 29.646
xa 43.188 29.199
nha 73.953 31.001
nedax 94.775 29.568
dbue 79.846 25.902
fp 46.826 31.641
tr 90.951 55.883
kw 38.875 23.194
rp 101.575 55.775
kkkkkkkkkkkkkkkkkkkkkkkkkkkkk 48.519 30.921
```
```vim
vim9script
var haystack = readfile('/Users/gp/linux.files')
var needles = ['init', 'main', 'sig', 'index', 'ab', 'cd', 'a', 'b',
'c', 'k',
'z', 'w', 'cpa', 'arz', 'zzzz', 'dimag', 'xa', 'nha', 'nedax',
'dbue',
'fp', 'tr', 'kw', 'rp', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkk']
for needle in needles
var start = reltime()
var tmp = matchfuzzy(haystack, needle)
echom $'{needle}' (start->reltime()->reltimefloat() * 1000)
endfor
```
Additional changes
* Removed the "camelcase" option from both matchfuzzy() and
matchfuzzypos(), as it's now obsolete with the improved algorithm.
related: neovim#34101
fixes vim/vim#17531
closes: vim/vim#17900
vim/vim@7e0df5e
Co-authored-by: Girish Palya <girishji@gmail.com>


Feat: Replace fuzzy matching algorithm with improved fzy-based implementation
The current fuzzy matching algorithm has several accuracy issues:
You can observe these issues in action via the demo app — (try typing into the input box).
Why fzy?
After evaluating alternatives (see my comments here and here), I chose to adopt the fzy algorithm, which:
Implementation details
This version is based on the original fzy algorithm, with one key enhancement: multibyte character support.
match_row()andmatch_positions()) remains faithful to the original, but now operates on codepoints rather than single-byte characters.Performance
Tested against a dataset of 90,000 Linux kernel filenames. Results (in milliseconds) show a ~2x performance improvement over the current fuzzy matching algorithm.
Additional changes
camelcaseoption from bothmatchfuzzy()andmatchfuzzypos(), as it's now obsolete with the improved algorithm.Resolves #17531 and neovim/neovim#34101.