📝 Walkthrough

This change introduces stemming algorithm support for text indexing throughout the codebase and API. New types and configuration fields are added to represent stemming algorithms, specifically the Snowball stemmer with language selection, in both internal data structures and public API schemas (OpenAPI and gRPC/protobuf). The text indexing parameters now accept an optional stemmer configuration. Tokenizer modules are refactored to use a unified configuration struct that includes stemming, and stemming is conditionally applied during tokenization.
Actionable comments posted: 3
🧹 Nitpick comments (11)
docs/grpc/docs.md (4)
69-75: Fix list indentation to satisfy markdown-lint (MD007).

The newly added bullet points are indented with four spaces, whereas the rest of the table of contents uses two. Keeping a consistent indentation prevents future CI noise.

```diff
-    - [SnowballStemmingParams](#qdrant-SnowballStemmingParams)
     ...
-    - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
+  - [SnowballStemmingParams](#qdrant-SnowballStemmingParams)
     ...
+  - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
```
1404-1413: Consider enumerating supported languages instead of a free-form string.

`language` is typed as `string`; this pushes the validation burden onto every caller and onto server-side runtime checks. A small enum (e.g. `SnowballLanguage`) keeps the API self-documenting and prevents typos at compile time for generated clients. If a strict enum is not feasible, at least add an inline note listing accepted values (e.g. `"english" | "german" | …`).
1484-1492: `oneof` with a single variant looks odd; add a placeholder or comment.

Today the `StemmingAlgorithm` `oneof` only contains `snowball_params`. Either:

- add a short comment such as `// Future algorithms go here`, or
- introduce a `Noop`/`Disabled` variant,

so that generated code does not expose empty union types that confuse users.
1651-1654: Great to expose stemming, but clarify precedence with `lowercase` and tokenizers.

The docs should say whether stemming runs before or after lower-casing and how it interacts with the different `TokenizerType`s (e.g. `Multilingual`). A one-sentence note here avoids divergent client assumptions.

lib/api/src/grpc/qdrant.rs (2)
873-879: Consider deriving `Deserialize` alongside `Serialize`.

Throughout this file most protobuf structs derive only `Serialize`, but REST/OpenAPI flows (e.g. `POST /collections`) rely on both directions when the same Rust types are reused for JSON (this happens in several endpoints). Deriving `Deserialize` here avoids asymmetry and future surprises.

```diff
 #[derive(serde::Serialize)]
+#[derive(serde::Deserialize)]
```
885-889: Reserve tags for future algorithms.

`StemmingParams` currently exposes only `SnowballParams` with tag `1`. It's advisable to pre-allocate a small range (e.g. 1–10) for future variants, to prevent accidental collisions later or the need for tag re-use, which protobuf forbids.

```proto
oneof stemming_params {
  SnowballStemmingParams snowball_params = 1;
  // reserved 2 to 10 for future algorithms
}
```

lib/api/src/grpc/proto/collections.proto (1)
492-494: Consider adding validation constraints for the language field.

The `SnowballStemmingParams` message design is clean, but consider whether the `language` field should have validation constraints or an enum type instead of a string, to prevent invalid language codes. Consider using an enum for supported languages:

```diff
 message SnowballStemmingParams {
-  string language = 1; // Which language the algorithm should stem.
+  SnowballLanguage language = 1; // Which language the algorithm should stem.
 }
+
+enum SnowballLanguage {
+  ENGLISH = 0;
+  SPANISH = 1;
+  FRENCH = 2;
+  // ... other supported languages
+}
```

docs/redoc/master/openapi.json (2)
7771-7793: Redundant `oneOf` wrapper with a single option.

`StemmingAlgorithm` currently defines only one variant yet wraps it in `oneOf`. This adds schema depth with no benefit and complicates client validation; consider inlining the object until another algorithm is introduced.

```diff
-"StemmingAlgorithm": {
-  "description": "Different stemming algorithms with their configs.",
-  "oneOf": [
-    { "type": "object", ... }
-  ]
-}
+"StemmingAlgorithm": {
+  "description": "Different stemming algorithms with their configs.",
+  "type": "object",
+  "required": ["stemmer", "language"],
+  "properties": { ... }
+}
```
7781-7786: The `stemmer` field conveys a constant; mark it explicitly.

If the intent is to indicate the algorithm kind, the `"stemmer"` property will always be `"Snowball"` for this object. Mark it as a constant to help generators produce stricter types:

```diff
 "stemmer": {
   "type": "string",
-  "enum": ["Snowball"]
+  "enum": ["Snowball"],
+  "example": "Snowball"
 }
```

Alternatively, rename the property to `algorithm` to avoid the tautology (`StemmingAlgorithm.stemmer`), but that is stylistic.

lib/segment/src/data_types/index.rs (1)
295-308: Optimize the Display and FromStr implementations.

Using `serde_json` for a simple enum's Display/FromStr is inefficient. Consider using direct match statements for better performance and clarity.

```diff
 impl fmt::Display for SnowballLanguage {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
-        let json_string = serde_json::to_string(self).map_err(|_| fmt::Error)?;
-        f.write_str(json_string.trim_matches('"'))
+        let name = match self {
+            SnowballLanguage::Arabic => "Arabic",
+            SnowballLanguage::Armenian => "Armenian",
+            SnowballLanguage::Danish => "Danish",
+            SnowballLanguage::Dutch => "Dutch",
+            SnowballLanguage::English => "English",
+            SnowballLanguage::Finnish => "Finnish",
+            SnowballLanguage::French => "French",
+            SnowballLanguage::German => "German",
+            SnowballLanguage::Greek => "Greek",
+            SnowballLanguage::Hungarian => "Hungarian",
+            SnowballLanguage::Italian => "Italian",
+            SnowballLanguage::Norwegian => "Norwegian",
+            SnowballLanguage::Portuguese => "Portuguese",
+            SnowballLanguage::Romanian => "Romanian",
+            SnowballLanguage::Russian => "Russian",
+            SnowballLanguage::Spanish => "Spanish",
+            SnowballLanguage::Swedish => "Swedish",
+            SnowballLanguage::Tamil => "Tamil",
+            SnowballLanguage::Turkish => "Turkish",
+        };
+        f.write_str(name)
     }
 }

 impl FromStr for SnowballLanguage {
-    type Err = serde_json::Error;
+    type Err = String;

     fn from_str(s: &str) -> Result<Self, Self::Err> {
-        serde_json::from_str(&format!("\"{s}\""))
+        match s {
+            "Arabic" | "ar" => Ok(SnowballLanguage::Arabic),
+            "Armenian" | "hy" => Ok(SnowballLanguage::Armenian),
+            "Danish" | "da" => Ok(SnowballLanguage::Danish),
+            "Dutch" | "nl" => Ok(SnowballLanguage::Dutch),
+            "English" | "en" => Ok(SnowballLanguage::English),
+            "Finnish" | "fi" => Ok(SnowballLanguage::Finnish),
+            "French" | "fr" => Ok(SnowballLanguage::French),
+            "German" | "de" => Ok(SnowballLanguage::German),
+            "Greek" | "el" => Ok(SnowballLanguage::Greek),
+            "Hungarian" | "hu" => Ok(SnowballLanguage::Hungarian),
+            "Italian" | "it" => Ok(SnowballLanguage::Italian),
+            "Norwegian" | "no" => Ok(SnowballLanguage::Norwegian),
+            "Portuguese" | "pt" => Ok(SnowballLanguage::Portuguese),
+            "Romanian" | "ro" => Ok(SnowballLanguage::Romanian),
+            "Russian" | "ru" => Ok(SnowballLanguage::Russian),
+            "Spanish" | "es" => Ok(SnowballLanguage::Spanish),
+            "Swedish" | "sv" => Ok(SnowballLanguage::Swedish),
+            "Tamil" | "ta" => Ok(SnowballLanguage::Tamil),
+            "Turkish" | "tr" => Ok(SnowballLanguage::Turkish),
+            _ => Err(format!("Unknown language: {}", s)),
+        }
     }
 }
```

lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
273-273: Track the temporary Clone requirement.

The TODO indicates that the `Clone` derive is temporary and should be removed once rocksdb is removed from the codebase. Would you like me to create an issue to track the removal of this `Clone` requirement once the rocksdb migration is complete?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- docs/grpc/docs.md (4 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (8 hunks)
- lib/api/src/grpc/proto/collections.proto (17 hunks)
- lib/api/src/grpc/qdrant.rs (1 hunks)
- lib/segment/src/data_types/index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/mod.rs (2 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (2)

- lib/segment/src/data_types/index.rs (3): `fmt` (296-299), `fmt` (408-411), `from` (270-292)
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1): `new` (179-207)

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

- lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (2): `tokenize` (40-71), `tokenize` (75-82)
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (3): `tokenize` (15-37), `tokenize` (43-65), `tokenize` (71-104)
- lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1): `from_algorithm` (24-30)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md
69-69: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
70-70: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
71-71: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
72-72: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
73-73: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
74-74: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-consensus-compose
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-low-resources
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (25)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (2)
166-166: LGTM! Correct field addition for test compatibility.

The addition of `stemmer: None` appropriately maintains existing test behavior while accommodating the new optional stemming field in `TextIndexParams`.

213-213: LGTM! Consistent field addition for test compatibility.

The `stemmer: None` field addition is consistent with the other test update and correctly maintains the focus on phrase-matching functionality without introducing stemming behavior.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)

406-406: LGTM! Appropriate field addition for test compatibility.

The addition of `stemmer: None` correctly maintains existing test behavior for general full-text indexing functionality while accommodating the new optional stemming field in `TextIndexParams`.

lib/api/src/grpc/qdrant.rs (1)
869-871: Tag addition looks safe, but double-check wire compatibility.

Adding an optional field with tag `8` follows protobuf's forwards-compatibility rules, so existing clients won't break. Please still verify that:

- no other message re-uses tag `8` inside `TextIndexParams` in older `.proto` versions;
- downstream SDKs that ship pre-compiled stubs are regenerated; otherwise they will silently drop the new field.
lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (4)
8-13: LGTM - Clean enum design with future extensibility.

The `Stemmer` enum is well-designed for extensibility, allowing easy addition of other stemming algorithms in the future. The `Arc` wrapper is appropriate for thread-safe sharing across tokenizers.

15-21: Good practice - Custom Debug implementation maintains abstraction.

The custom `Debug` implementation correctly hides the internal `Arc<rust_stemmers::Stemmer>` details, maintaining clean debug output and abstraction boundaries.

32-37: Verify stem_cow method availability.

The `stem` method delegates to `stem_cow` on the underlying stemmer. Ensure this method exists and handles the `Cow<str>` type correctly.

```shell
#!/bin/bash
# Description: Verify that rust_stemmers::Stemmer has a stem_cow method
# Expected: Should find the method definition or usage

# Search for stem_cow method usage
rg -A 3 -B 3 "stem_cow" --type rust

# Check if there are any other stem methods being used
rg -A 3 -B 3 "\.stem\(" --type rust
```

24-30: How does `rust_stemmers::Stemmer::create` handle invalid `Algorithm` parameters? Does it panic or return a default value?

lib/api/src/grpc/proto/collections.proto (2)
483-484: LGTM - Consistent field naming and numbering.

The addition of the `stemming` field to `TextIndexParams` follows the established pattern, with proper optional marking and sequential field numbering.

486-490: Good extensible design with the oneof pattern.

The `StemmingAlgorithm` message uses the `oneof` pattern effectively, allowing future addition of other stemming algorithms while maintaining backward compatibility. The field naming follows protobuf conventions.

docs/redoc/master/openapi.json (1)
7795-7816: Language enum looks good; consider case consistency.

All language identifiers are Title-case, while API payloads elsewhere are often lower-case (`"english"`). Double-check server deserialization logic to ensure case-sensitivity issues don't surface.

lib/segment/src/data_types/index.rs (1)
214-217: LGTM! The optional stemmer field is well-placed.

The addition of the optional `stemmer` field to `TextIndexParams` follows the established pattern and uses appropriate serde attributes.

lib/api/src/grpc/conversions.rs (3)
14-14: LGTM! Proper handling of stemming fields in the TextIndexParams conversion.

The conversion correctly extracts and transforms the stemmer field between internal and gRPC representations.

Also applies to: 36-37, 220-221, 227-228, 238-238

360-373: LGTM! Clean conversion from internal to gRPC StemmingAlgorithm.

The implementation properly maps the internal enum to the gRPC structure.

539-553: LGTM! Robust error handling in the stemming parameter conversion.

The implementation properly validates the language parameter and provides a clear error message for invalid values.
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)
19-26: LGTM! Clean refactoring to use TokenizerConfig.

The signature change improves API consistency and reduces the parameter count while maintaining all functionality.

30-60: LGTM! Proper integration of stemming into the tokenization flow.

The refactored logic correctly:

- optimizes for Latin script without stemming
- delegates stemming to the config
- maintains backward compatibility with the Japanese tokenizer

187-201: LGTM! Good test coverage for stemming functionality.

The test properly verifies that the English Snowball stemmer correctly stems "Testing" to "test" while leaving "this" unchanged.
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (7)
4-4: LGTM! Clean module organization.

The addition of the `stemmer` module and its import follows the existing pattern and maintains good separation of concerns.

Also applies to: 7-7

15-38: Correct tokenization pipeline implementation.

The WhiteSpaceTokenizer correctly implements the tokenization pipeline:
- Lowercase transformation (if enabled)
- Stopword filtering
- Stemming application
This order ensures efficient processing by filtering stopwords before applying the potentially expensive stemming operation.
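The pipeline order described in this comment can be sketched in a few lines. This is an illustrative toy, not Qdrant's actual Rust implementation: the stopword set and the naive suffix-stripping "stemmer" are invented stand-ins for the real stopword filter and Snowball stemmer.

```python
# Toy sketch of the pipeline order: lowercase -> stopword filtering -> stemming.
# STOPWORDS and toy_stem are invented for illustration only.
STOPWORDS = {"the", "is", "a"}

def toy_stem(token: str) -> str:
    # Naive suffix stripping, standing in for a real Snowball stemmer.
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text: str, lowercase: bool = True) -> list[str]:
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Filter stopwords first, so the (potentially expensive) stemmer
    # only runs on tokens that will actually be indexed.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]

print(tokenize("The indexing is working"))  # → ['index', 'work']
```

Filtering before stemming is what makes the ordering efficient: stopwords never reach the stemmer at all.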
43-66: Consistent tokenizer implementations.

Both WordTokenizer and PrefixTokenizer correctly follow the same tokenization pipeline pattern established in WhiteSpaceTokenizer. The uniform use of `TokenizerConfig` improves code maintainability.

Also applies to: 71-104

122-147: Well-documented query tokenization logic.

The implementation correctly applies stemming without stopword filtering for prefix queries. The detailed comment effectively explains the rationale with a concrete example, which is valuable for future maintainers.

170-206: Clean refactoring of the Tokenizer struct.

The consolidation of configuration parameters into a single `TokenizerConfig` field improves encapsulation and reduces the number of parameters passed around. The constructor properly initializes all configuration options, including the optional stemmer.
281-291: Efficient stemming implementation.

The `stem_if_applicable` method is well implemented:

- uses the idiomatic let-else pattern for optional handling
- preserves `Cow` semantics to minimize allocations
- has clear documentation of its behavior
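For readers unfamiliar with the pattern being praised here, a minimal sketch of the idea follows. This is not Qdrant's actual code: the stemmer is a stand-in closure rather than the real `rust_stemmers` type, and the signature is simplified for illustration.

```rust
use std::borrow::Cow;

// Sketch of a conditional stemming helper in the spirit of `stem_if_applicable`.
// When no stemmer is configured, the token passes through with no allocation;
// when the stemmer declines to change the token, the original Cow is kept.
fn stem_if_applicable<'a>(
    stemmer: Option<&dyn Fn(&str) -> Option<String>>,
    token: Cow<'a, str>,
) -> Cow<'a, str> {
    // Idiomatic let-else: bail out early when no stemmer is configured.
    let Some(stem) = stemmer else {
        return token;
    };
    match stem(&token) {
        Some(stemmed) => Cow::Owned(stemmed), // allocate only when the token changed
        None => token,                        // keep the original Cow untouched
    }
}
```

A caller with no stemmer configured simply passes `None` and pays no allocation cost, which is the point of keeping `Cow` in the signature.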
735-758: Comprehensive stemming test coverage.

The new `test_stemming_snowball` test effectively verifies stemming functionality across all tokenizer types. The test cases use good examples ("interestingly" → "interest", "proceeding" → "proceed", "living" → "live") that clearly demonstrate the stemming behavior.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
35-35: Minor style: Remove extra blank lines for consistency.

There are extra blank lines after the stemming application in the PrefixTokenizer methods that are inconsistent with the other tokenizers.

```diff
-        callback(config.stem_if_applicable(token_cow));
-
+        callback(config.stem_if_applicable(token_cow));
```

Apply the same fix to lines 137-138 in `tokenize_query`.

Also applies to: 63-63, 91-92, 137-138
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- lib/segment/src/data_types/index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/segment/src/data_types/index.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: integration-tests
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: storage-compat-test
🔇 Additional comments (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (5)
4-7: LGTM! Clean integration of the stemming module.

The new stemmer module import and Stemmer struct integration are well-placed and follow the existing module structure.

175-206: Excellent refactoring with TokenizerConfig.

The consolidation of tokenizer parameters into a unified `TokenizerConfig` struct improves maintainability and makes the stemming integration clean. The constructor properly handles the optional stemmer parameter from `TextIndexParams`.

274-291: Well-designed TokenizerConfig with proper stemming abstraction.

The `stem_if_applicable` method provides a clean abstraction for conditional stemming, safely handling the optional stemmer case. The struct design allows for easy extension with additional tokenizer features.

744-767: Comprehensive stemming test coverage.

The test effectively verifies stemming behavior across all tokenizer types with appropriate expected outputs. The test cases demonstrate that stemming is applied correctly to produce root forms ("interestingly" → "interest", "proceeding" → "proceed", "living" → "live").

17-38: Consistent tokenizer interface updates.

All tokenizer implementations have been consistently updated to use the new `TokenizerConfig` parameter. The order of operations (lowercase → stopwords → stemming) is logical and maintained across all implementations.

Also applies to: 45-66, 73-104, 124-147
What is

lib/segment/src/data_types/index.rs (Outdated)

```rust
/// Different stemming algorithms with their configs.
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Hash, Eq)]
#[serde(tag = "type", rename_all = "snake_case")]
```
Using a serde tag will generate an anonymous object, which is very hard to use in the Python client. Instead, we use e.g. `pub r#type: FloatIndexType` and an untagged enum.
Actionable comments posted: 1
♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
30-30: Address the previous review concern about the stemmer check.

This check was previously questioned by reviewers, who noted it might not be necessary. The optimization skips language detection for Latin script when no stemming is configured, but this adds complexity that may not provide significant benefit.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (8 hunks)
- lib/segment/src/data_types/index.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs
- lib/segment/src/data_types/index.rs
- lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
- lib/api/src/grpc/conversions.rs
- docs/redoc/master/openapi.json
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: test-low-resources
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consensus-compose
- GitHub Check: storage-compat-test
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)
187-203: Excellent test coverage for stemming functionality.

The test properly validates that stemming works correctly by checking that "Testing" is stemmed to "test". The test setup with the `TokenizerConfig` and `StemmingAlgorithm::Snowball` configuration is comprehensive.

22-22: Good refactoring to a unified configuration.

The migration from individual parameters to `TokenizerConfig` improves maintainability and extensibility. This design makes it easier to add new tokenization options without changing method signatures.

58-58: Clean encapsulation of stemming logic.

The use of `config.stem_if_applicable(cased_token)` properly encapsulates the stemming decision within the configuration object, maintaining separation of concerns.
It's a leftover from the previous approach that I accidentally copied in here. This is not part of this PR and can be ignored.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- lib/api/src/grpc/conversions.rs (8 hunks)
- lib/api/src/grpc/proto/collections.proto (17 hunks)
- lib/api/src/grpc/qdrant.rs (1 hunks)
- lib/segment/Cargo.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- lib/api/src/grpc/conversions.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/api/src/grpc/qdrant.rs
- lib/api/src/grpc/proto/collections.proto
🧰 Additional context used
🧠 Learnings (1)
lib/segment/Cargo.toml (6)
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/tests.rs:44-47
Timestamp: 2025-05-15T22:54:30.292Z
Learning: The rand crate version 0.9.0 and newer uses method names with `random_*` prefix (like `random_range`), while versions 0.8.x and older use `gen_*` prefix (like `gen_range`). This naming change was introduced in rand 0.9.0-alpha.1.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/tests.rs:44-47
Timestamp: 2025-05-15T22:54:30.292Z
Learning: The rand crate version 0.9.0 and higher changed method names from `gen_*` to `random_*` (e.g., `gen_range()` became `random_range()`). Code using rand 0.9.x should use the `random_*` method names, while code using rand 0.8.x and earlier should use the `gen_*` method names.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/tests.rs:44-47
Timestamp: 2025-05-15T22:54:30.292Z
Learning: The rand crate version 0.9.0 and newer uses method names with `random_*` prefix (like `random_range`), while versions 0.8.x and older use `gen_*` prefix (like `gen_range`). This was part of an API redesign in rand 0.9.0 released in February 2024.
Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: test-low-resources
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
Actionable comments posted: 1
🧹 Nitpick comments (3)
docs/grpc/docs.md (3)
69-74: Fix wrong list indentation in the TOC.

`markdownlint` flags the six newly-added list items because their bullet indentation (4 spaces) deviates from the existing 2-space pattern used throughout the document. Keeping a consistent indent avoids broken anchors when other tooling re-writes or lints the file.

```diff
-    - [SnowballParameters](#qdrant-SnowballParameters)
     ...
-    - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
+  - [SnowballParameters](#qdrant-SnowballParameters)
     ...
+  - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
```
1483-1492: Consider wrapping algorithm variants in a `oneof`.

`StemmingAlgorithm` currently contains a single optional field `snowball`. If additional algorithms are added later (e.g., Porter, Lancaster), you will want mutual exclusivity. Declaring the variants inside a `oneof` from day one avoids a breaking proto change down the road. Example:

```proto
message StemmingAlgorithm {
  oneof algorithm {
    SnowballParameters snowball = 1;
    // future variants...
  }
}
```
1653-1654: Clarify default behaviour and interaction with other flags.

The new `stemming` field is optional, but the docs don't specify:

- What happens when it's omitted: is stemming disabled, or is there a collection-level default?
- How does it interact with `lowercase` and `stopwords`? (order matters)

Add a short note (as done for `lowercase`, `phrase_matching`, etc.) so users understand the defaults.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/grpc/docs.md (4 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md
69-69: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
70-70: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
71-71: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
72-72: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
73-73: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
74-74: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms (14)
- GitHub Check: storage-compat-test
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: integration-tests
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
<a name="qdrant-SnowballParameters"></a>

### SnowballParameters

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| language | [string](#string) | | Which language the algorithm should stem. |
🛠️ Refactor suggestion
Document the allowed languages or change the type to an enum.

`language` is declared as a free-form string, but the implementation (rust-stemmers) only accepts a finite set of ISO-639 codes. Exposing it as a raw string in the public API invites invalid input and inconsistent error handling.

Actionable options:

- Replace `string language` with an enum that lists the supported Snowball languages (preferred: compile-time safety, self-documenting).
- Keep `string` but add a note like "Allowed values: `en`, `de`, …" and link to the Snowball spec.

Failing to do either will shift the validation burden to runtime and onto every client.
🤖 Prompt for AI Agents
In docs/grpc/docs.md around lines 1404 to 1413, the `language` field is
currently a free-form string but should be restricted to valid Snowball stemmer
languages. To fix this, either replace the `string` type with an enum listing
all supported ISO-639 language codes for compile-time validation, or if keeping
it as a string, add a clear note specifying the allowed language codes and link
to the Snowball specification. This will prevent invalid inputs and improve API
clarity.
* Implement stemming
* minor fix
* Add openapi specs
* Fix import
* Fix Test, rename API type
* update openapi
* fix rest models
* Review remarks
* Minor code cleanup
* make grpc naming aligned with Rest + use master branch for dependencies
* grpc docs
* use tagged version

---------

Co-authored-by: generall <andrey@vasnetsov.com>
Add stemming support for all tokenizers.
Configuring the API now looks like this:
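The example itself appears to have been lost in extraction. Based on the REST schema discussed in the review above, a text-index field schema with stemming could look roughly like the following sketch; the field names and values (`stemmer`, `type`, `language`, the lower-case `"english"`, and the `body` field name) are illustrative assumptions and may not match the final API exactly:

```json
{
  "field_name": "body",
  "field_schema": {
    "type": "text",
    "tokenizer": "word",
    "lowercase": true,
    "stemmer": {
      "type": "snowball",
      "language": "english"
    }
  }
}
```

Such a body would be sent to the payload-index creation endpoint; when `stemmer` is omitted, no stemming is applied.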