
Refactor token processing in TokenizerConfig #6912

Merged
JojiiOfficial merged 2 commits into uncouple_tokenizer_and_textindex_params from centralize-tokenizer-processing
Aug 1, 2025

Conversation


@coszio coszio commented Jul 21, 2025

Depends on #6891

This is a nice-to-have refactor of token processing: it centralizes the handling of all tokenizer options in a single place, which applies every configured option.
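A minimal sketch of what "a single place which applies every configured option" can look like. The struct fields and method name below are illustrative assumptions, not Qdrant's actual `TokenizerConfig` API:

```rust
// Hypothetical, simplified sketch of centralized token processing.
// Field and method names are illustrative, not Qdrant's real struct.
struct TokenizerConfig {
    lowercase: bool,
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
}

impl TokenizerConfig {
    /// Apply every configured option in one pass.
    /// Returns `None` when the token should be dropped.
    fn process(&self, token: &str) -> Option<String> {
        // Length is measured in characters, not bytes.
        let char_count = token.chars().count();
        if self.min_token_len.map_or(false, |min| char_count < min) {
            return None;
        }
        if self.max_token_len.map_or(false, |max| char_count > max) {
            return None;
        }
        Some(if self.lowercase {
            token.to_lowercase()
        } else {
            token.to_string()
        })
    }
}
```

Keeping all filters in one method means callers cannot accidentally apply only a subset of the configured options.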

Comment on lines -220 to -235
if self
.config
.min_token_len
.map(|min_len| token.len() < min_len && token.chars().count() < min_len)
.unwrap_or(false)
{
return;
}
if self
.config
.max_token_len
.map(|max_len| token.len() > max_len && token.chars().count() > max_len)
.unwrap_or(false)
{
return;
}
@coszio (Contributor, Author):

The only functional thing that is being changed here is that this PR checks token length only with token.chars().count(), instead of checking token.len() first.
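The difference matters for non-ASCII tokens, where `token.len()` counts UTF-8 bytes while `token.chars().count()` counts characters. A small sketch (the helper name is hypothetical) of a character-based length check:

```rust
// Hypothetical helper illustrating the change described above:
// token length is measured in characters, not in UTF-8 bytes.
fn within_len(token: &str, min_len: Option<usize>, max_len: Option<usize>) -> bool {
    let chars = token.chars().count();
    min_len.map_or(true, |min| chars >= min) && max_len.map_or(true, |max| chars <= max)
}
```

For a token like `"žluť"`, `token.len()` is 6 (bytes) while `chars().count()` is 4, so a byte-based check over-counts the length of multi-byte UTF-8 tokens.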

@coszio coszio requested review from JojiiOfficial and generall July 21, 2025 16:44
@JojiiOfficial JojiiOfficial merged commit 09bd866 into uncouple_tokenizer_and_textindex_params Aug 1, 2025
15 checks passed
@JojiiOfficial JojiiOfficial deleted the centralize-tokenizer-processing branch August 1, 2025 07:09
generall pushed a commit that referenced this pull request Aug 3, 2025
* Refactor token processing in TokenizerConfig

* fix max length checking
generall added a commit that referenced this pull request Aug 4, 2025
* Uncouple tokenizer and TextIndexParams

* Refactor token processing in `TokenizerConfig` (#6912)

* Refactor token processing in TokenizerConfig

* fix max length checking

* [Bm25] Execution implementation (#6939)

* Add Bm25

* Execute BM25 if config available

* cargo format

* Add mocked tests for inference and bm25

* Properly apply InferenceType

* Fix tests

* Review remarks

* Review remarks

* Add overwritten model name fix again

* use enum instead of multiple constants

* ensure handling all fields of InferenceInput in infer_local

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>

* review fixes

* fmt

* spell-check

* deduplicate code

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
timvisee pushed a commit that referenced this pull request Aug 11, 2025