Full-Text Index ASCII Folding (Normalization) by eltu · Pull Request #7408 · qdrant/qdrant

eltu · 2025-10-15T01:05:35Z

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --all --all-features command?

Summary

Implements ASCII folding for string normalization in Qdrant's full_text_index.
The algorithm is based on the Apache Lucene implementation (Apache 2.0 license), adapted for Qdrant.

Why this matters

In Latin-based languages like Portuguese, terms with diacritics (e.g., "ação", "coração", "café") benefit from normalization to improve information retrieval.
With ASCII folding, queries like "acao" can match documents containing "ação"; similarly, "coracao" → "coração" and "cafe" → "café".
Normalization reduces friction between how users type queries and how data is indexed, increasing recall in text search.

Implementation details

Added the ascii_folding configuration option to TextIndexParams for full_text_index.
When enabled, the tokenization/normalization pipeline applies folding of accented characters and equivalent symbols to their ASCII counterparts.
The algorithm is inspired by Lucene's implementation (Apache 2.0); comments and references in the code point to the origin and license.
Behavior is opt-in to preserve compatibility with existing indexes; by default it remains disabled unless explicitly configured.
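
To make the folding step concrete, here is a minimal sketch in Rust. It is not the code from this PR: `fold_char` and `fold_to_ascii` are hypothetical names, and the table below covers only a handful of characters, whereas a Lucene-derived mapping is far larger.

```rust
/// Illustrative subset of a folding table (assumption: only these
/// characters are mapped; a full table covers many more).
fn fold_char(c: char) -> Option<&'static str> {
    match c {
        'à' | 'á' | 'â' | 'ã' | 'ä' | 'å' => Some("a"),
        'ç' => Some("c"),
        'è' | 'é' | 'ê' | 'ë' => Some("e"),
        'ì' | 'í' | 'î' | 'ï' => Some("i"),
        'ñ' => Some("n"),
        'ò' | 'ó' | 'ô' | 'õ' | 'ö' => Some("o"),
        'ù' | 'ú' | 'û' | 'ü' => Some("u"),
        // Some characters fold to more than one ASCII character.
        'ß' => Some("ss"),
        'æ' => Some("ae"),
        _ => None,
    }
}

/// Hypothetical helper: replaces every mapped character with its ASCII
/// counterpart and copies all other characters through unchanged.
fn fold_to_ascii(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match fold_char(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

With this sketch, `fold_to_ascii("coração")` yields `"coracao"`, and text that is already ASCII passes through unchanged.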

Configuration

Parameter: ascii_folding: Option<bool>
- true: enables normalization/folding during indexing and querying.
- false: disables it.
- absent: current behavior (backward compatible), keeping it disabled by default.
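
The three states can be sketched as follows. `TextIndexParamsSketch` is a hypothetical, trimmed-down stand-in for illustration only (the real `TextIndexParams` has more fields), and the accessor name is an assumption.

```rust
/// Hypothetical stand-in for the text index parameters; only the
/// field discussed here is modeled.
#[derive(Debug, Default, Clone)]
struct TextIndexParamsSketch {
    ascii_folding: Option<bool>,
}

impl TextIndexParamsSketch {
    /// An absent value is treated as `false`, so configurations written
    /// before the option existed keep their current behavior.
    fn ascii_folding_enabled(&self) -> bool {
        self.ascii_folding.unwrap_or(false)
    }
}
```

Using `Option<bool>` rather than a plain `bool` lets "not specified" be distinguished from an explicit `false`, while both resolve to folding being disabled.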

Practical examples

Document: "ação no coração"; Query: "acao" → matches when ascii_folding = true.
Document: "café com leite"; Query: "cafe" → matches when ascii_folding = true.
Queries with original accents ("ação", "coração", "café") continue to work regardless of the setting.
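
The examples above can be reproduced with a small self-contained sketch. It assumes whitespace tokenization and a folding table limited to the characters in these examples; `fold`, `tokens`, and `matches` are hypothetical helpers, not Qdrant functions. The key point is that the same processing is applied to the document and to the query.

```rust
use std::collections::HashSet;

/// Minimal folding covering only the characters used in the examples
/// (assumption: a real table is much larger).
fn fold(token: &str) -> String {
    token
        .chars()
        .map(|c| match c {
            'ã' => 'a',
            'ç' => 'c',
            'é' => 'e',
            other => other,
        })
        .collect()
}

/// Splits on whitespace and folds each token when folding is enabled.
fn tokens(text: &str, ascii_folding: bool) -> HashSet<String> {
    text.split_whitespace()
        .map(|t| if ascii_folding { fold(t) } else { t.to_string() })
        .collect()
}

/// Hypothetical matcher: every query token must be present among the
/// document tokens, with identical processing on both sides.
fn matches(document: &str, query: &str, ascii_folding: bool) -> bool {
    let doc = tokens(document, ascii_folding);
    tokens(query, ascii_folding).iter().all(|t| doc.contains(t))
}
```

Because the query is folded too, an accented query such as "ação" is reduced to "acao" and still finds the folded document token when the option is on; when it is off, the accented query matches the unmodified token directly.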

Tests

Tests were added in full_text_index covering behavior with ascii_folding on and off.
- Reference file: lib/segment/src/index/field_index/full_text_index/tests/mod.rs
- Example test: test_ascii_folding_in_full_text_index_word
  - Ensures ASCII-only queries (e.g., "acao") return results when folding is enabled and do not when disabled.
  - Ensures queries with diacritics (e.g., "ação") continue to work in both modes.

timvisee

Thank you very much for your effort on implementing this!

We agree that this is a useful feature, and would like to include it.

I triggered our CI jobs, and they show some errors. Could you look into them and adjust your PR as necessary?

I've also left some review remarks. Please see below.

lib/api/src/grpc/proto/collections.proto

lib/segment/src/index/field_index/full_text_index/tokenizers/ascii_folding.rs

lib/segment/src/index/field_index/full_text_index/tokenizers/tokens_processor.rs

lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs

coszio

I have taken the liberty of addressing the existing remarks, and some additional ones, directly in abc7bc8.

eltu · 2025-10-15T21:54:25Z

Hey @coszio and @timvisee,
Thanks a lot for the review, the fixes, and for accepting the pull request. =)

timvisee

Thanks once again @eltu!

Merging this now. This will be part of Qdrant 1.16.0.

* Add ASCII folding to tokenization process

  Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.

* Refactor tokenization code for improved readability and maintainability

  Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.

* Update test cases to reflect optional tokenizer settings changes

  Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.

* address review remarks
* fix codespell
* thx coderabbit
* Don't copy tokens that are already ASCII
* Shrink folded string to fit

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: timvisee <tim@visee.me>

eltu added 3 commits October 14, 2025 19:39

Add ASCII folding to tokenization process

ff231f8

Introduced an optional ASCII folding feature within the `TokensProcessor` to normalize non-ASCII characters to their ASCII equivalents. Updated tests and documentation to reflect the changes.

Refactor tokenization code for improved readability and maintainability

f27d051

Reorganized and reformatted the tokenization module, including `TokensProcessor` initialization and ASCII folding mappings for better clarity. Updated tests to align with the changes.

Update test cases to reflect optional tokenizer settings changes

5327a21

Adjusted `ascii_folding`, `lowercase`, and `phrase_matching` settings in tests to `None` where applicable, aligning with updates in tokenizer configuration defaults.

qdrant deleted a comment from coderabbitai bot Oct 15, 2025

timvisee reviewed Oct 15, 2025

timvisee requested a review from coszio October 15, 2025 12:37

coszio added 2 commits October 15, 2025 16:55

address review remarks

abc7bc8

fix codespell

372d201

qdrant deleted a comment from coderabbitai bot Oct 15, 2025

coszio reviewed Oct 15, 2025

thx coderabbit

626c4c3

coszio force-pushed the ascii_folding branch from 14247d9 to 626c4c3 Compare October 15, 2025 20:05

qdrant deleted a comment from coderabbitai bot Oct 15, 2025

coszio approved these changes Oct 15, 2025

qdrant deleted a comment from coderabbitai bot Oct 16, 2025

timvisee added 2 commits October 16, 2025 10:44

Don't copy tokens that are already ASCII

21bde3a

Shrink folded string to fit

f5dc67c
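
The last two commit titles describe allocation-related optimizations. One way such optimizations could look is sketched below; this is an assumption based on the commit titles alone, not the code from those commits. `fold_token` is a hypothetical name and the mapping is reduced to three characters.

```rust
use std::borrow::Cow;

/// Hypothetical folding entry point that avoids copying ASCII-only tokens.
fn fold_token(token: &str) -> Cow<'_, str> {
    // ASCII-only tokens need no folding, so borrow the input instead of
    // allocating a new string.
    if token.is_ascii() {
        return Cow::Borrowed(token);
    }

    // The input byte length serves as the initial capacity; for the
    // one-to-one mappings below the output is never longer than the input.
    let mut folded = String::with_capacity(token.len());
    for c in token.chars() {
        folded.push(match c {
            'ã' => 'a',
            'ç' => 'c',
            'é' => 'e',
            other => other,
        });
    }

    // Multi-byte characters were replaced by single bytes, so release the
    // capacity that is no longer needed.
    folded.shrink_to_fit();
    Cow::Owned(folded)
}
```

Returning `Cow<'_, str>` lets callers treat borrowed and owned results uniformly, while a corpus that is mostly ASCII pays no allocation cost for folding.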

qdrant deleted a comment from coderabbitai bot Oct 16, 2025

timvisee approved these changes Oct 16, 2025

qdrant deleted a comment from coderabbitai bot Oct 16, 2025

timvisee requested a review from generall October 16, 2025 09:01

timvisee approved these changes Oct 16, 2025

timvisee merged commit aa0f497 into qdrant:dev Oct 16, 2025
15 checks passed

abdonpijpelink mentioned this pull request Nov 11, 2025

[v1.16.0] Documentation for ASCII folding (and lowercasing) qdrant/landing_page#1984

Merged

6 tasks

timvisee mentioned this pull request Nov 14, 2025

Bump version to 1.16.0 #7535

Merged

timvisee merged 8 commits into qdrant:dev from eltu:ascii_folding
