
Add new multilingual tokenizer #6678

Merged
JojiiOfficial merged 13 commits into dev from better_multilingual_tokenizer on Jun 25, 2025

Conversation

@JojiiOfficial (Contributor)

Adds a first implementation of the new multilingual tokenizer. It is not yet enabled; enabling it will follow further adjustments in a separate PR.

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch 3 times, most recently from 5a173cd to d4898ba, on June 12, 2025 09:12
@qdrant qdrant deleted a comment from coderabbitai bot Jun 12, 2025
@qdrant qdrant deleted a comment from coderabbitai bot Jun 12, 2025
@JojiiOfficial JojiiOfficial requested review from coszio and generall on June 12, 2025 09:38
Several comments from @coderabbitai (coderabbitai bot) were marked as outdated or off-topic and are hidden.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

1-2: Complete the TODO and remove dead code allowance

Once the tokenizer is properly integrated into the dispatch logic, remove the #![allow(dead_code)] attribute and complete the TODO to ensure all code is actively used.


18-19: Use configuration parameter for stemming instead of hardcoding

The stem flag is hardcoded to false. Implement the TODO by reading the stemming configuration from the config parameter.


21-26: 🛠️ Refactor suggestion

Latin branch incorrectly skips stemming

The comment mentions using rust_stemmers for Latin script, but the code passes no language to tokenize_charabia, disabling stemming. This contradicts the intended behavior.

Apply this diff to enable stemming for Latin scripts:

-        // If the script of the input is latin and we don't need to stem early tokenize to skip language detection
-        // to reduce overhead and improve performance.
-        if script_is_latin(script) && !stem {
-            Self::tokenize_charabia(input, cb);
-            return;
-        }
+        // For Latin script without stemming, skip language detection for performance
+        if script_is_latin(script) && !stem {
+            Self::tokenize_charabia(input, cb);
+            return;
+        }

Additionally, update the comment to accurately reflect that stemming is skipped in this branch.

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

184-193: Consider adding tests for stemming configuration

Once the stemming configuration is properly read from TextIndexParams, add test cases that verify tokenization with stemming enabled.
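
A sketch of what such a test might look like. The tokenizer and stemmer here are hypothetical stand-ins, not the real MultilingualV2 API: the toy suffix-stripper only mimics what rust_stemmers would do, so the test structure, not the names, is the point.

```rust
// Hypothetical stand-ins for the real tokenizer and stemmer, to illustrate
// a table-driven test of the stemming flag.
fn toy_stem(token: &str) -> String {
    // Crude suffix stripping as a placeholder for a real stemmer.
    for suffix in ["ing", "ed", "s"] {
        if let Some(stripped) = token.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    token.to_string()
}

fn tokenize(input: &str, stem: bool) -> Vec<String> {
    input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .map(|t| if stem { toy_stem(&t) } else { t })
        .collect()
}

#[test]
fn stemming_flag_changes_output() {
    assert_eq!(tokenize("Searching documents", false), vec!["searching", "documents"]);
    assert_eq!(tokenize("Searching documents", true), vec!["search", "document"]);
}
```

The real test would construct a TextIndexParams with stemming enabled and compare against the known output of the configured stemmer rather than a toy one.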

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 997a88a and ec779ad.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: storage-compat-test
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-low-resources
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: lint
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

32-35: Japanese detection now works correctly

Good fix! Adding Language::Jpn to SUPPORTED_LANGUAGES ensures the Japanese tokenization path can be triggered.


49-65: Stemming implementation looks good

The stemmer is now properly used with stem_cow(), resolving the previous TODO about lifetime issues. The whitespace filtering is also a good addition.

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch 2 times, most recently from 1e07a3f to 46656e0, on June 12, 2025 15:59
@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch from 46656e0 to a77e951 on June 13, 2025 06:40
@generall generall force-pushed the better_multilingual_tokenizer branch from d563039 to 13d5fe8 on June 17, 2025 14:24
@coderabbitai coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

♻️ Duplicate comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)

1-2: #![allow(dead_code)] is still present – please drop it once the tokenizer is wired in

Previous feedback already requested the removal of the dead-code allowance after integration.
Leaving the attribute makes it harder for CI / clippy to catch real dead code later on.
Remove the attr (and the TODO) as soon as the dispatcher uses MultilingualV2.


18-21: Hard-coded stem = false ignores TextIndexParams – hook it up now or gate behind a feature flag

The whole point of the _config argument is to decide whether stemming should be applied, yet the flag is fixed to false.
Either wire it to a real field (e.g. config.enable_stemming) or keep the hard-coded default behind an explicit // FIXME so the behaviour is obvious.

-        // TODO(multilingual): Replace this with a value from `config`!
-        let stem = false;
+        // Respect user configuration
+        let stem = _config.stemming_enabled;

(Adjust the field name to match the actual struct.)

🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

16-29: Avoid running language detection when it is not needed

You already have the script; for non-Latin scripts where stem is false and script ≠ Jpan, calling detect_language burns extra CPU without changing the control flow (it ultimately falls through to tokenize_charabia).
Short-circuit early to save a second pass over the string:

-        // Detect language to know if we're dealing with Japanese text, or what stemming algorithm to apply.
-        let language = detect_language(input);
-
-        // If the script of the input is Japanese, use vaporetto to segment.
-        if language == Some(Language::Jpn) {
+        // Only run language detection if we still need it.
+        let language = detect_language(input);
+
+        if language == Some(Language::Jpn) {

Even better: branch on script == Script::Jpan instead of language detection for Japanese to avoid a false negative on short strings.


50-55: Instantiate Stemmer once per language, not per call

Creating a Stemmer allocates and loads the rule tables every time tokenize_charabia_and_stem runs.
Consider caching the Stemmer in a lazy_static / once_cell keyed by Algorithm or by embedding it in TextIndexParams to avoid repeated allocations in high-throughput scenarios.


59-61: Filtering on is_alphabetic silently drops digits & emoji

char::is_alphabetic() returns false for digits, symbols, and many emoji.
If the index is supposed to capture numeric identifiers (e.g. “Q4”, “123”) or emoji, consider a broader predicate (e.g. is_alphanumeric) or making the filter configurable.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13d5fe8 and b27e32a.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)

@JojiiOfficial JojiiOfficial force-pushed the better_multilingual_tokenizer branch from b27e32a to 2f2b551 on June 25, 2025 07:22
@JojiiOfficial JojiiOfficial merged commit 4959deb into dev Jun 25, 2025
18 checks passed
@JojiiOfficial JojiiOfficial deleted the better_multilingual_tokenizer branch June 25, 2025 09:39
@generall generall added this to the Multilingual Tokenizer milestone Jul 17, 2025
generall added a commit that referenced this pull request Jul 17, 2025
* Add new tokenizer for multilingual

* Make codespell happy

* improve handling of when to apply stemming

* Fix stemming implementation

* Fix Japanese language detection. Add tests for MultilingualV2

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Update lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>

* Make init() private

* move model into segment, filter non-alphanumeric tokens

* fmt

* exclude punctuation from tokenizer result

* fm

---------

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: generall <andrey@vasnetsov.com>