
Add new multilingual tokenizer #6678

Merged
JojiiOfficial merged 13 commits into dev from better_multilingual_tokenizer on Jun 25, 2025

Conversation

@JojiiOfficial (Contributor)

Adds a first implementation of the new multilingual tokenizer. It is not yet enabled; enabling it will follow further adjustments in a separate PR.

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch 3 times, most recently from 5a173cd to d4898ba, on June 12, 2025 09:12
@qdrant qdrant deleted a comment from coderabbitai bot Jun 12, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Jun 12, 2025
@JojiiOfficial JojiiOfficial requested review from coszio and generall on June 12, 2025 09:38
Several comments from @coderabbitai (coderabbitai bot) were marked as outdated or off-topic and are hidden.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

1-2: Complete the TODO and remove dead code allowance

Once the tokenizer is properly integrated into the dispatch logic, remove the #![allow(dead_code)] attribute and complete the TODO to ensure all code is actively used.


18-19: Use configuration parameter for stemming instead of hardcoding

The stem flag is hardcoded to false. Implement the TODO by reading the stemming configuration from the config parameter.


21-26: 🛠️ Refactor suggestion

Latin branch incorrectly skips stemming

The comment mentions using rust_stemmers for Latin script, but the code passes no language to tokenize_charabia, disabling stemming. This contradicts the intended behavior.

Apply this diff to enable stemming for Latin scripts:

-        // If the script of the input is latin and we don't need to stem early tokenize to skip language detection
-        // to reduce overhead and improve performance.
-        if script_is_latin(script) && !stem {
-            Self::tokenize_charabia(input, cb);
-            return;
-        }
+        // For Latin script without stemming, skip language detection for performance
+        if script_is_latin(script) && !stem {
+            Self::tokenize_charabia(input, cb);
+            return;
+        }

Additionally, update the comment to accurately reflect that stemming is skipped in this branch.

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

184-193: Consider adding tests for stemming configuration

Once the stemming configuration is properly read from TextIndexParams, add test cases that verify tokenization with stemming enabled.
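
A sketch of what such a test might look like. The tokenizer and stemmer here are hypothetical stand-ins, not the real MultilingualV2 API: the toy suffix-stripper only mimics what rust_stemmers would do, so the test structure, not the names, is the point.

```rust
// Hypothetical stand-ins for the real tokenizer and stemmer, to illustrate
// a table-driven test of the stemming flag.
fn toy_stem(token: &str) -> String {
    // Crude suffix stripping as a placeholder for a real stemmer.
    for suffix in ["ing", "ed", "s"] {
        if let Some(stripped) = token.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    token.to_string()
}

fn tokenize(input: &str, stem: bool) -> Vec<String> {
    input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .map(|t| if stem { toy_stem(&t) } else { t })
        .collect()
}

#[test]
fn stemming_flag_changes_output() {
    assert_eq!(tokenize("Searching documents", false), vec!["searching", "documents"]);
    assert_eq!(tokenize("Searching documents", true), vec!["search", "document"]);
}
```

The real test would construct a TextIndexParams with stemming enabled and compare against the known output of the configured stemmer rather than a toy one.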

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 997a88a and ec779ad.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-low-resources
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: lint
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

32-35: Japanese detection now works correctly

Good fix! Adding Language::Jpn to SUPPORTED_LANGUAGES ensures the Japanese tokenization path can be triggered.


49-65: Stemming implementation looks good

The stemmer is now properly used with stem_cow(), resolving the previous TODO about lifetime issues. The whitespace filtering is also a good addition.

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch 2 times, most recently from 1e07a3f to 46656e0, on June 12, 2025 15:59
@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch from 46656e0 to a77e951 on June 13, 2025 06:40
@generall generall force-pushed the better_multilingual_tokenizer branch from d563039 to 13d5fe8 on June 17, 2025 14:24
@coderabbitai coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

♻️ Duplicate comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

1-2: #![allow(dead_code)] is still present – please drop it once the tokenizer is wired in

Previous feedback already requested the removal of the dead-code allowance after integration.
Leaving the attribute makes it harder for CI / clippy to catch real dead code later on.
Remove the attr (and the TODO) as soon as the dispatcher uses MultilingualV2.


18-21: Hard-coded stem = false ignores TextIndexParams – hook it up now or gate behind a feature flag

The whole point of the _config argument is to decide whether stemming should be applied, yet the flag is fixed to false.
Either wire it to a real field (e.g. config.enable_stemming) or keep the hard-coded default behind an explicit // FIXME so the behaviour is obvious.

-        // TODO(multilingual): Replace this with a value from `config`!
-        let stem = false;
+        // Respect user configuration
+        let stem = _config.stemming_enabled;

(Adjust the field name to match the actual struct.)

🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

16-29: Avoid running language detection when it is not needed

You already have the script; for non-Latin scripts where stem is false and script ≠ Jpan, calling detect_language burns extra CPU without changing the control flow (it ultimately falls through to tokenize_charabia).
Short-circuit early to save a second pass over the string:

-        // Detect language to know if we're dealing with Japanese text, or what stemming algorithm to apply.
-        let language = detect_language(input);
-
-        // If the script of the input is Japanese, use vaporetto to segment.
-        if language == Some(Language::Jpn) {
+        // Only run language detection if we still need it.
+        let language = detect_language(input);
+
+        if language == Some(Language::Jpn) {

Even better: branch on script == Script::Jpan instead of language detection for Japanese to avoid a false negative on short strings.


50-55: Instantiate Stemmer once per language, not per call

Creating a Stemmer allocates and loads the rule tables every time tokenize_charabia_and_stem runs.
Consider caching the Stemmer in a lazy_static / once_cell keyed by Algorithm or by embedding it in TextIndexParams to avoid repeated allocations in high-throughput scenarios.


59-61: Filtering on is_alphabetic silently drops digits & emoji

char::is_alphabetic() returns false for digits, symbols, and many emoji.
If the index is supposed to capture numeric identifiers (e.g. “Q4”, “123”) or emoji, consider a broader predicate (e.g. is_alphanumeric) or making the filter configurable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13d5fe8 and b27e32a.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch from b27e32a to 2f2b551 on June 25, 2025 07:22
@JojiiOfficial JojiiOfficial merged commit 4959deb into dev Jun 25, 2025
18 checks passed
@JojiiOfficial JojiiOfficial deleted the better_multilingual_tokenizer branch June 25, 2025 09:39
@generall generall added this to the Multilingual Tokenizer milestone Jul 17, 2025
generall added a commit that referenced this pull request Jul 17, 2025
* Add new tokenizer for multilingual

* Make codespell happy

* improve handling of when to apply stemming

* Fix stemming implementation

* Fix Japanese language detection. Add tests for MultilingualV2

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Make init() private

* move model into segment, filter non-alphanumeric tokens

* fmt

* exclude punctuation from tokenizer result

* fm

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: generall <andrey@vasnetsov.com>