
Refactor token processing in TokenizerConfig #6912

Merged
JojiiOfficial merged 2 commits into uncouple_tokenizer_and_textindex_params from centralize-tokenizer-processing
Aug 1, 2025

Conversation


@coszio coszio commented Jul 21, 2025

Depends on #6891

This is a nice-to-have refactor of token processing: it centralizes the handling of all tokenizer options in a single place, which applies every configured option.
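A minimal sketch of what "a single place which applies every configured option" can look like. The struct fields and method name below are illustrative assumptions, not Qdrant's actual `TokenizerConfig` API:

```rust
// Hypothetical, simplified sketch of centralized token processing.
// Field and method names are illustrative, not Qdrant's real struct.
struct TokenizerConfig {
    lowercase: bool,
    min_token_len: Option<usize>,
    max_token_len: Option<usize>,
}

impl TokenizerConfig {
    /// Apply every configured option in one pass.
    /// Returns `None` when the token should be dropped.
    fn process(&self, token: &str) -> Option<String> {
        // Length is measured in characters, not bytes.
        let char_count = token.chars().count();
        if self.min_token_len.map_or(false, |min| char_count < min) {
            return None;
        }
        if self.max_token_len.map_or(false, |max| char_count > max) {
            return None;
        }
        Some(if self.lowercase {
            token.to_lowercase()
        } else {
            token.to_string()
        })
    }
}
```

Keeping all filters in one method means callers cannot accidentally apply only a subset of the configured options.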

Comment on lines -220 to -235
if self
.config
.min_token_len
.map(|min_len| token.len() < min_len && token.chars().count() < min_len)
.unwrap_or(false)
{
return;
}
if self
.config
.max_token_len
.map(|max_len| token.len() > max_len && token.chars().count() > max_len)
.unwrap_or(false)
{
return;
}
@coszio (Contributor, Author):

The only functional thing that is being changed here is that this PR checks token length only with token.chars().count(), instead of checking token.len() first.
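The difference matters for non-ASCII tokens, where `token.len()` counts UTF-8 bytes while `token.chars().count()` counts characters. A small sketch (the helper name is hypothetical) of a character-based length check:

```rust
// Hypothetical helper illustrating the change described above:
// token length is measured in characters, not in UTF-8 bytes.
fn within_len(token: &str, min_len: Option<usize>, max_len: Option<usize>) -> bool {
    let chars = token.chars().count();
    min_len.map_or(true, |min| chars >= min) && max_len.map_or(true, |max| chars <= max)
}
```

For a token like `"žluť"`, `token.len()` is 6 (bytes) while `chars().count()` is 4, so a byte-based check over-counts the length of multi-byte UTF-8 tokens.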

@coszio coszio requested review from JojiiOfficial and generall July 21, 2025 16:44
@JojiiOfficial JojiiOfficial merged commit 09bd866 into uncouple_tokenizer_and_textindex_params Aug 1, 2025
15 checks passed
@JojiiOfficial JojiiOfficial deleted the centralize-tokenizer-processing branch August 1, 2025 07:09
generall pushed a commit that referenced this pull request Aug 3, 2025
* Refactor token processing in TokenizerConfig

* fix max length checking
generall added a commit that referenced this pull request Aug 4, 2025
* Uncouple tokenizer and TextIndexParams

* Refactor token processing in `TokenizerConfig` (#6912)

* Refactor token processing in TokenizerConfig

* fix max length checking

* [Bm25] Execution implementation (#6939)

* Add Bm25

* Execute BM25 if config available

* cargo format

* Add mocked tests for inference and bm25

* Properly apply InferenceType

* Fix tests

* Review remarks

* Review remarks

* Add overwritten model name fix again

* use enum instead of multiple constants

* ensure handling all fields of InferenceInput in infer_local

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>

* review fixes

* fmt

* spell-check

* deduplicate code

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
Co-authored-by: Andrey Vasnetsov <andrey@vasnetsov.com>
timvisee pushed a commit that referenced this pull request Aug 11, 2025