Full-Text Index ASCII Folding (Normalization)#7408

Merged
timvisee merged 8 commits into qdrant:dev from eltu:ascii_folding
Oct 16, 2025

Conversation

@eltu
Contributor

@eltu eltu commented Oct 15, 2025

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Summary

  • Implements ASCII folding for string normalization in Qdrant's full_text_index.
  • The algorithm is based on the Apache Lucene implementation (Apache 2.0 license), adapted for Qdrant.

Why this matters

  • In Latin-based languages like Portuguese, terms with diacritics (e.g., "ação", "coração", "café") benefit from normalization to improve information retrieval.
  • With ASCII folding, queries like "acao" can match documents containing "ação"; similarly, "coracao" → "coração" and "cafe" → "café".
  • Normalization reduces friction between how users type queries and how data is indexed, increasing recall in text search.

Implementation details

  • Added the ascii_folding configuration option to TextIndexParams for full_text_index.
  • When enabled, the tokenization/normalization pipeline applies folding of accented characters and equivalent symbols to their ASCII counterparts.
  • The algorithm is inspired by Lucene's implementation (Apache 2.0); comments and references in the code point to the origin and license.
  • Behavior is opt-in to preserve compatibility with existing indexes; by default it remains disabled unless explicitly configured.
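As a rough illustration of the folding step described above, here is a minimal sketch. It is not the code from this PR: `fold_char` and `fold_to_ascii` are hypothetical names, and the mapping table is a tiny hand-picked subset rather than the Lucene-derived table. The early return for ASCII-only tokens and the final `shrink_to_fit` are only in the spirit of the later commits "Don't copy tokens that are already ASCII" and "Shrink folded string to fit".

```rust
use std::borrow::Cow;

/// Hypothetical, heavily truncated mapping for illustration only.
/// A Lucene-style table covers far more characters than this.
fn fold_char(c: char) -> Option<&'static str> {
    Some(match c {
        'á' | 'à' | 'â' | 'ã' | 'ä' => "a",
        'ç' => "c",
        'é' | 'è' | 'ê' | 'ë' => "e",
        'í' | 'ì' | 'î' | 'ï' => "i",
        'ó' | 'ò' | 'ô' | 'õ' | 'ö' => "o",
        'ú' | 'ù' | 'û' | 'ü' => "u",
        'ß' => "ss",
        _ => return None,
    })
}

/// Folds known accented characters to ASCII.
/// Tokens that are already ASCII are returned without copying.
fn fold_to_ascii(token: &str) -> Cow<'_, str> {
    if token.is_ascii() {
        return Cow::Borrowed(token);
    }
    let mut out = String::with_capacity(token.len());
    for c in token.chars() {
        match fold_char(c) {
            Some(replacement) => out.push_str(replacement),
            // Characters without a known mapping are kept unchanged.
            None => out.push(c),
        }
    }
    // Folding usually produces fewer bytes than the UTF-8 input.
    out.shrink_to_fit();
    Cow::Owned(out)
}

fn main() {
    for word in ["ação", "coração", "café", "leite"] {
        println!("{word} -> {}", fold_to_ascii(word));
    }
}
```

Returning `Cow` lets ASCII-only tokens, which are the common case in many corpora, pass through without an allocation.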

Configuration

  • Parameter: ascii_folding: Option<bool>
    • true: enables normalization/folding during indexing and querying.
    • false: disables it.
    • absent: current behavior (backward compatible), keeping it disabled by default.
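The three states above can be sketched as follows. `TextIndexParamsSketch` and `folding_enabled` are hypothetical stand-ins used only for illustration; the real `TextIndexParams` has more fields, and how it resolves the option internally is an assumption here.

```rust
/// Hypothetical stand-in for the relevant part of `TextIndexParams`.
#[derive(Default)]
struct TextIndexParamsSketch {
    /// `None` means the option was not specified by the user.
    ascii_folding: Option<bool>,
}

impl TextIndexParamsSketch {
    /// An absent option resolves to `false`, so existing indexes keep their behavior.
    fn folding_enabled(&self) -> bool {
        self.ascii_folding.unwrap_or(false)
    }
}

fn main() {
    let absent = TextIndexParamsSketch::default();
    let enabled = TextIndexParamsSketch { ascii_folding: Some(true) };
    let disabled = TextIndexParamsSketch { ascii_folding: Some(false) };

    println!("absent   -> {}", absent.folding_enabled());
    println!("enabled  -> {}", enabled.folding_enabled());
    println!("disabled -> {}", disabled.folding_enabled());
}
```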

Practical examples

  • Document: "ação no coração"; Query: "acao" → matches when ascii_folding = true.
  • Document: "café com leite"; Query: "cafe" → matches when ascii_folding = true.
  • Queries with original accents ("ação", "coração", "café") continue to work regardless of the setting.
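The matching behavior in these examples follows from applying the same normalization to both the indexed text and the query. The sketch below shows this with a toy whitespace tokenizer and a `HashSet` as the "index"; all names (`fold`, `normalize`, `build_index`, `query_matches`) are hypothetical, and the folding table is a tiny subset chosen for the examples, not the one from this PR.

```rust
use std::collections::HashSet;

/// Hypothetical, tiny folding table for illustration only.
fn fold(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ã' => 'a',
        'ç' => 'c',
        'é' | 'ê' => 'e',
        other => other,
    }
}

/// Applies folding only when the option is enabled.
fn normalize(token: &str, ascii_folding: bool) -> String {
    if ascii_folding {
        token.chars().map(fold).collect()
    } else {
        token.to_string()
    }
}

/// Toy index: the set of normalized whitespace-separated tokens.
fn build_index(doc: &str, ascii_folding: bool) -> HashSet<String> {
    doc.split_whitespace()
        .map(|token| normalize(token, ascii_folding))
        .collect()
}

/// The query goes through the same normalization as the document.
fn query_matches(index: &HashSet<String>, query: &str, ascii_folding: bool) -> bool {
    index.contains(&normalize(query, ascii_folding))
}

fn main() {
    for folding in [false, true] {
        let index = build_index("ação no coração", folding);
        println!(
            "ascii_folding={folding}: \"acao\" matches: {}, \"ação\" matches: {}",
            query_matches(&index, "acao", folding),
            query_matches(&index, "ação", folding),
        );
    }
}
```

Because the accented query is folded to the same form as the indexed token when folding is on, and compared verbatim when it is off, it matches in both modes; the ASCII-only query matches only when folding is on.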

Tests

  • Tests were added in full_text_index covering behavior with ascii_folding on and off.
    • Reference file: lib/segment/src/index/field_index/full_text_index/tests/mod.rs
    • Example test: test_ascii_folding_in_full_text_index_word
      • Ensures ASCII-only queries (e.g., "acao") return results when folding is enabled and do not when disabled.
      • Ensures queries with diacritics (e.g., "ação") continue to work in both modes.

eltu added 3 commits October 14, 2025 19:39
Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.
Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.
Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Oct 15, 2025
Member

@timvisee timvisee left a comment

Thank you very much for your effort on implementing this!

We agree that this is a useful feature, and would like to include it.

I triggered our CI jobs, and they show some errors. Could you look into them and adjust your PR as necessary?

I've also left some review remarks. Please see below.

@timvisee timvisee requested a review from coszio October 15, 2025 12:37
@qdrant qdrant deleted a comment from coderabbitai bot Oct 15, 2025
Contributor

@coszio coszio left a comment

I have taken the liberty of addressing the existing and some additional remarks directly in abc7bc8

coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Oct 15, 2025
@eltu
Contributor Author

eltu commented Oct 15, 2025

Hey @coszio and @timvisee,
Thanks a lot for the review, the fixes, and for accepting the pull request. =)

@qdrant qdrant deleted a comment from coderabbitai bot Oct 16, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Oct 16, 2025
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Oct 16, 2025
@timvisee timvisee requested a review from generall October 16, 2025 09:01
Member

@timvisee timvisee left a comment

Thanks once again @eltu!

Merging this now. This will be part of Qdrant 1.16.0.

@timvisee timvisee merged commit aa0f497 into qdrant:dev Oct 16, 2025
15 checks passed
timvisee added a commit that referenced this pull request Nov 14, 2025
* Add ASCII folding to tokenization process

Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.

* Refactor tokenization code for improved readability and maintainability

Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.

* Update test cases to reflect optional tokenizer settings changes

Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.

* address review remarks

* fix codespell

* thx coderabbit

* Don't copy tokens that are already ASCII

* Shrink folded string to fit

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: timvisee <tim@visee.me>
@timvisee timvisee mentioned this pull request Nov 14, 2025