Conversation
Actionable comments posted: 0
♻️ Duplicate comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)
1-2: Complete the TODO and remove the dead-code allowance

Once the tokenizer is properly integrated into the dispatch logic, remove the `#![allow(dead_code)]` attribute and complete the TODO so that all code is actively used.
18-19: Use the configuration parameter for stemming instead of hardcoding

The `stem` flag is hardcoded to `false`. Implement the TODO by reading the stemming configuration from the `config` parameter.
21-26: 🛠️ Refactor suggestion — Latin branch skips stemming

The comment mentions using rust_stemmers for Latin script, but this branch passes no language to `tokenize_charabia`, so stemming is disabled here, which contradicts the stated intent. At minimum, update the comment to accurately reflect that stemming is skipped in this branch:

```diff
-    // If the script of the input is latin and we don't need to stem early tokenize to skip language detection
-    // to reduce overhead and improve performance.
-    if script_is_latin(script) && !stem {
-        Self::tokenize_charabia(input, cb);
-        return;
-    }
+    // For Latin script without stemming, skip language detection for performance.
+    if script_is_latin(script) && !stem {
+        Self::tokenize_charabia(input, cb);
+        return;
+    }
```
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
184-193: Consider adding tests for stemming configuration

Once the stemming configuration is properly read from `TextIndexParams`, add test cases that verify tokenization with stemming enabled.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: storage-compat-test
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: test-low-resources
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: lint
- GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)
32-35: Japanese detection now works correctly

Good fix! Adding `Language::Jpn` to `SUPPORTED_LANGUAGES` ensures the Japanese tokenization path can be triggered.
49-65: Stemming implementation looks good

The stemmer is now properly used with `stem_cow()`, resolving the previous TODO about lifetime issues. The whitespace filtering is also a good addition.
Resolved (outdated) comments on:
- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (2)
- lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
Actionable comments posted: 0
♻️ Duplicate comments (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (2)
1-2: `#![allow(dead_code)]` is still present – please drop it once the tokenizer is wired in

Previous feedback already requested the removal of the dead-code allowance after integration. Leaving the attribute makes it harder for CI / clippy to catch real dead code later on. Remove the attribute (and the TODO) as soon as the dispatcher uses `MultilingualV2`.
18-21: Hard-coded `stem = false` ignores `TextIndexParams` – hook it up now or gate behind a feature flag

The whole point of the `_config` argument is to decide whether stemming should be applied, yet the flag is fixed to `false`. Either wire it to a real field (e.g. `config.enable_stemming`) or keep the hard-coded default behind an explicit `// FIXME` so the behaviour is obvious.

```diff
-    // TODO(multilingual): Replace this with a value from `config`!
-    let stem = false;
+    // Respect user configuration
+    let stem = _config.stemming_enabled;
```

(Adjust the field name to match the actual struct.)
🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)
16-29: Avoid running language detection when it is not needed

You already have the script; for non-Latin scripts where `stem` is `false` and script ≠ Jpan, calling `detect_language` burns extra CPU without changing the control flow (it ultimately falls through to `tokenize_charabia`). Short-circuit early to save a second pass over the string:

```diff
-    // Detect language to know if we're dealing with Japanese text, or what stemming algorithm to apply.
-    let language = detect_language(input);
-
-    // If the script of the input is Japanese, use vaporetto to segment.
-    if language == Some(Language::Jpn) {
+    // Only run language detection if we still need it.
+    let language = detect_language(input);
+
+    if language == Some(Language::Jpn) {
```

Even better: branch on `script == Script::Jpan` instead of language detection for Japanese to avoid a false negative on short strings.
50-55: Instantiate `Stemmer` once per language, not per call

Creating a `Stemmer` allocates and loads the rule tables every time `tokenize_charabia_and_stem` runs. Consider caching the `Stemmer` in a `lazy_static` / `once_cell` keyed by `Algorithm`, or embedding it in `TextIndexParams`, to avoid repeated allocations in high-throughput scenarios.
59-61: Filtering on `is_alphabetic` silently drops digits & emoji

`char::is_alphabetic()` returns `false` for digits, symbols, and many emoji. If the index is supposed to capture numeric identifiers (e.g. "Q4", "123") or emoji, consider a broader predicate (e.g. `is_alphanumeric`) or making the filter configurable.
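A quick illustration of the difference, using hypothetical whole-token helpers rather than the per-char filter in the file:

```rust
// `is_alphabetic` rejects digits, so tokens like "Q4" and "123" are dropped;
// `is_alphanumeric` keeps them but still rejects punctuation and most emoji.
fn keep_alphabetic(token: &str) -> bool {
    !token.is_empty() && token.chars().all(char::is_alphabetic)
}

fn keep_alphanumeric(token: &str) -> bool {
    !token.is_empty() && token.chars().all(char::is_alphanumeric)
}
```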
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: integration-tests-consensus
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
* Add new tokenizer for multilingual
* Make codespell happy
* improve handling of when to apply stemming
* Fix stemming implementation
* Fix Japanese language detection. Add tests for MultilingualV2
* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>)
* Update lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>)
* Update lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>)
* Make init() private
* move model into segment, filter non-alphanumeric tokens
* fmt
* exclude punctuation from tokenizer result
* fm

Co-authored-by: Luis Cossío <luis.cossio@qdrant.com>
Co-authored-by: generall <andrey@vasnetsov.com>
Adds a first implementation of the new multilingual tokenizer. It is not yet enabled; enabling it will follow further adjustments in a separate PR.