Full-Text Index ASCII Folding (Normalization) #7408
Merged
timvisee merged 8 commits into qdrant:dev on Oct 16, 2025
Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.
Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.
Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.
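A minimal sketch of what an optional folding step could look like. This is not the code from this PR: `SketchProcessor`, `fold_char`, and the tiny mapping table are hypothetical names made up for illustration, and treating `None` as "folding disabled" is an assumption of the sketch rather than a statement about the actual defaults.

```rust
/// Hypothetical stand-in for a tokens processor with an optional folding flag.
/// The real `TokensProcessor` in the PR may be shaped differently.
#[derive(Default)]
struct SketchProcessor {
    /// `None` is treated as "disabled" here (an assumption of this sketch).
    ascii_folding: Option<bool>,
}

/// Illustrative mapping that covers only a handful of Latin characters.
/// Characters without a mapping are passed through unchanged.
fn fold_char(c: char, out: &mut String) {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' | 'å' => out.push('a'),
        'ç' => out.push('c'),
        'è' | 'é' | 'ê' | 'ë' => out.push('e'),
        'ì' | 'í' | 'î' | 'ï' => out.push('i'),
        'ñ' => out.push('n'),
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => out.push('o'),
        'ù' | 'ú' | 'û' | 'ü' => out.push('u'),
        // Some characters fold to more than one ASCII character.
        'ß' => out.push_str("ss"),
        'æ' => out.push_str("ae"),
        other => out.push(other),
    }
}

impl SketchProcessor {
    /// Returns the token unchanged unless folding is explicitly enabled.
    fn process(&self, token: &str) -> String {
        if !self.ascii_folding.unwrap_or(false) {
            return token.to_string();
        }
        let mut folded = String::with_capacity(token.len());
        for c in token.chars() {
            fold_char(c, &mut folded);
        }
        folded
    }
}
```

With folding enabled, a token such as `café` would be indexed as `cafe`, so a query typed without the accent can still match; with the flag left as `None`, the token passes through untouched.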
timvisee (Member) reviewed on Oct 15, 2025 and left a comment:
Thank you very much for your effort on implementing this!
We agree that this is a useful feature, and would like to include it.
I triggered our CI jobs, and they show some errors. Could you look into them and adjust your PR as necessary?
I've also left some review remarks. Please see below.
Review threads (all resolved):
lib/segment/src/index/field_index/full_text_index/tokenizers/ascii_folding.rs
lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs (outdated)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (outdated)
coszio approved these changes on Oct 15, 2025
timvisee approved these changes on Oct 16, 2025
timvisee added a commit that referenced this pull request on Nov 14, 2025:
* Add ASCII folding to tokenization process
  Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.
* Refactor tokenization code for improved readability and maintainability
  Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.
* Update test cases to reflect optional tokenizer settings changes
  Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.
* address review remarks
* fix codespell
* thx coderabbit
* Don't copy tokens that are already ASCII
* Shrink folded string to fit

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: timvisee <tim@visee.me>
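The two commit notes "Don't copy tokens that are already ASCII" and "Shrink folded string to fit" suggest an allocation-conscious folding path. The sketch below shows one way such a path could be written; it is an assumption about the general technique, not the merged implementation. `fold_token` and `fold_one` are hypothetical names, and the mapping is deliberately tiny.

```rust
use std::borrow::Cow;

/// Hypothetical mapping; returns `None` when this sketch has no ASCII equivalent.
fn fold_one(c: char) -> Option<&'static str> {
    match c {
        'è' | 'é' | 'ê' | 'ë' => Some("e"),
        'ù' | 'ú' | 'û' | 'ü' => Some("u"),
        'ß' => Some("ss"),
        _ => None,
    }
}

/// Folds a token, borrowing the input when no work is needed.
fn fold_token(token: &str) -> Cow<'_, str> {
    // Fast path: a pure-ASCII token is returned as-is, without allocating.
    if token.is_ascii() {
        return Cow::Borrowed(token);
    }
    // Reserve by UTF-8 byte length; folding usually yields fewer bytes than that.
    let mut folded = String::with_capacity(token.len());
    for c in token.chars() {
        match fold_one(c) {
            Some(ascii) => folded.push_str(ascii),
            None => folded.push(c),
        }
    }
    // Release the capacity that was reserved but not used.
    folded.shrink_to_fit();
    Cow::Owned(folded)
}
```

Returning `Cow<str>` lets callers treat both cases uniformly while paying for an allocation only when a token actually contains non-ASCII characters, which in mostly-English text is the minority of tokens.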
All Submissions:
- `dev` branch. Did you create your branch from `dev`?

New Feature Submissions:
- `cargo +nightly fmt --all` command prior to submission?
- `cargo clippy --all --all-features` command?

Summary
Why this matters
Implementation details
Configuration
Practical examples
Tests