Implement stemming#6770

Merged
JojiiOfficial merged 12 commits into dev from add_stemming on Jul 2, 2025

Conversation


@JojiiOfficial JojiiOfficial commented Jun 27, 2025

Add stemming support for all tokenizers.

Configuring the API now looks like this:

PUT collections/benchmark/index
{
  "field_name": "t",
  "field_schema": {
    "type": "text",
    "stemmer": {
      "type": "snowball",
      "language": "en"
    }
  }
}

@JojiiOfficial JojiiOfficial marked this pull request as ready for review June 27, 2025 13:06
@JojiiOfficial JojiiOfficial requested review from coszio and generall June 27, 2025 13:06

coderabbitai bot commented Jun 27, 2025

📝 Walkthrough

This change introduces stemming algorithm support for text indexing throughout the codebase and API. New types and configuration fields are added to represent stemming algorithms, specifically the Snowball stemmer with language selection, in both internal data structures and public API schemas (OpenAPI and gRPC/protobuf). The text indexing parameters now accept an optional stemmer configuration. Tokenizer modules are refactored to use a unified configuration struct that includes stemming, and stemming is conditionally applied during tokenization. A new Stemmer abstraction wraps stemming functionality. Conversions between internal and external representations are updated accordingly. Tests are modified to explicitly set the stemmer parameter, and new tests verify stemming behavior. The dependency on rust-stemmers is updated to track a fixed tag version.

Suggested reviewers

  • coszio
  • JojiiOfficial

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 42f1f38 and 3e07a27.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • lib/segment/Cargo.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/Cargo.toml
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (11)
docs/grpc/docs.md (4)

69-75: Fix list-indentation to satisfy markdown-lint (MD007).

The newly added bullet points are indented with four spaces, whereas the rest of the table of contents uses two.
Keeping a consistent indentation prevents future CI noise.

-    - [SnowballStemmingParams](#qdrant-SnowballStemmingParams)
...
-    - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
+  - [SnowballStemmingParams](#qdrant-SnowballStemmingParams)
+  - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)

1404-1413: Consider enumerating supported languages instead of a free-form string.

language is typed as string; this pushes validation burden onto every caller and server-side runtime checks.
A small enum (e.g. SnowballLanguage) keeps the API self-documenting and prevents typos at compile time for generated clients.

If a strict enum is not feasible, at least add an inline note listing accepted values (e.g. "english" | "german" | …).


1484-1492: oneof with a single variant looks odd — add a placeholder or comment.

Today the StemmingAlgorithm oneof only contains snowball_params.
Either:

  1. Add a short comment such as // Future algorithms go here, or
  2. Introduce a Noop / Disabled variant,

so that generated code does not expose empty union types that confuse users.


1651-1654: Great to expose stemming, but clarify precedence with lowercase & tokenizers.

Docs should say whether stemming runs before/after lower-casing and how it interacts with different TokenizerTypes (e.g. Multilingual).
A one-sentence note here avoids divergent client assumptions.

lib/api/src/grpc/qdrant.rs (2)

873-879: Consider deriving Deserialize alongside Serialize

Throughout this file most protobuf structs derive only Serialize, but note that REST/OpenAPI flows (e.g. POST /collections) rely on both directions when the same Rust types are reused for JSON (this happens in several endpoints). Deriving Deserialize here avoids asymmetry and future surprises.

 #[derive(serde::Serialize)]
+#[derive(serde::Deserialize)]

885-889: Reserve tags for future algorithms

StemmingParams currently exposes only SnowballParams with tag 1. It’s advisable to pre-allocate a small range (e.g. 1–10) for future variants to prevent accidental collisions later or the need for tag re-use, which protobuf forbids.

oneof stemming_params {
    SnowballStemmingParams snowball_params = 1;
    // reserved 2 to 10 for future algorithms
}
lib/api/src/grpc/proto/collections.proto (1)

492-494: Consider adding validation constraints for language field.

The SnowballStemmingParams message design is clean, but consider if the language field should have validation constraints or an enum type instead of a string to prevent invalid language codes.

Consider using an enum for supported languages instead of a string:

message SnowballStemmingParams{
-    string language = 1; // Which language the algorithm should stem.
+    SnowballLanguage language = 1; // Which language the algorithm should stem.
}

+enum SnowballLanguage {
+    ENGLISH = 0;
+    SPANISH = 1;
+    FRENCH = 2;
+    // ... other supported languages
+}
docs/redoc/master/openapi.json (2)

7771-7793: Redundant oneOf wrapper with a single option

StemmingAlgorithm currently defines only one variant yet wraps it in oneOf. This adds schema depth with no benefit and complicates client validation; consider inlining the object until another algorithm is introduced.

-"StemmingAlgorithm": {
-  "description": "Different stemming algorithms with their configs.",
-  "oneOf": [
-    { "type": "object", ... }
-  ]
-}
+"StemmingAlgorithm": {
+  "description": "Different stemming algorithms with their configs.",
+  "type": "object",
+  "required": ["stemmer","language"],
+  "properties": { ... }
+}

7781-7786: stemmer field conveys a constant – mark it explicitly

If the intent is to indicate the algorithm kind, the "stemmer" property will always be "Snowball" for this object. Mark it as a constant to help generators produce stricter types:

-"stemmer": {
-  "type": "string",
-  "enum": ["Snowball"]
-}
+"stemmer": {
+  "type": "string",
+  "enum": ["Snowball"],
+  "example": "Snowball"
+}

Alternatively, rename the property to algorithm to avoid tautology (StemmingAlgorithm.stemmer), but that is stylistic.

lib/segment/src/data_types/index.rs (1)

295-308: Optimize Display and FromStr implementations.

Using serde_json for simple enum Display/FromStr is inefficient. Consider using direct match statements for better performance and clarity.

 impl fmt::Display for SnowballLanguage {
     fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
-        let json_string = serde_json::to_string(self).map_err(|_| fmt::Error)?;
-        f.write_str(json_string.trim_matches('"'))
+        let name = match self {
+            SnowballLanguage::Arabic => "Arabic",
+            SnowballLanguage::Armenian => "Armenian",
+            SnowballLanguage::Danish => "Danish",
+            SnowballLanguage::Dutch => "Dutch",
+            SnowballLanguage::English => "English",
+            SnowballLanguage::Finnish => "Finnish",
+            SnowballLanguage::French => "French",
+            SnowballLanguage::German => "German",
+            SnowballLanguage::Greek => "Greek",
+            SnowballLanguage::Hungarian => "Hungarian",
+            SnowballLanguage::Italian => "Italian",
+            SnowballLanguage::Norwegian => "Norwegian",
+            SnowballLanguage::Portuguese => "Portuguese",
+            SnowballLanguage::Romanian => "Romanian",
+            SnowballLanguage::Russian => "Russian",
+            SnowballLanguage::Spanish => "Spanish",
+            SnowballLanguage::Swedish => "Swedish",
+            SnowballLanguage::Tamil => "Tamil",
+            SnowballLanguage::Turkish => "Turkish",
+        };
+        f.write_str(name)
     }
 }

 impl FromStr for SnowballLanguage {
-    type Err = serde_json::Error;
+    type Err = String;

     fn from_str(s: &str) -> Result<Self, Self::Err> {
-        serde_json::from_str(&format!("\"{s}\""))
+        match s {
+            "Arabic" | "ar" => Ok(SnowballLanguage::Arabic),
+            "Armenian" | "hy" => Ok(SnowballLanguage::Armenian),
+            "Danish" | "da" => Ok(SnowballLanguage::Danish),
+            "Dutch" | "nl" => Ok(SnowballLanguage::Dutch),
+            "English" | "en" => Ok(SnowballLanguage::English),
+            "Finnish" | "fi" => Ok(SnowballLanguage::Finnish),
+            "French" | "fr" => Ok(SnowballLanguage::French),
+            "German" | "de" => Ok(SnowballLanguage::German),
+            "Greek" | "el" => Ok(SnowballLanguage::Greek),
+            "Hungarian" | "hu" => Ok(SnowballLanguage::Hungarian),
+            "Italian" | "it" => Ok(SnowballLanguage::Italian),
+            "Norwegian" | "no" => Ok(SnowballLanguage::Norwegian),
+            "Portuguese" | "pt" => Ok(SnowballLanguage::Portuguese),
+            "Romanian" | "ro" => Ok(SnowballLanguage::Romanian),
+            "Russian" | "ru" => Ok(SnowballLanguage::Russian),
+            "Spanish" | "es" => Ok(SnowballLanguage::Spanish),
+            "Swedish" | "sv" => Ok(SnowballLanguage::Swedish),
+            "Tamil" | "ta" => Ok(SnowballLanguage::Tamil),
+            "Turkish" | "tr" => Ok(SnowballLanguage::Turkish),
+            _ => Err(format!("Unknown language: {}", s)),
+        }
     }
 }
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

273-273: Track the temporary Clone requirement.

The TODO indicates that the Clone derive is temporary and should be removed once rocksdb is removed from the codebase.

Would you like me to create an issue to track the removal of this Clone requirement once rocksdb migration is complete?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfc5706 and 222ccf6.

📒 Files selected for processing (11)
  • docs/grpc/docs.md (4 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (8 hunks)
  • lib/api/src/grpc/proto/collections.proto (17 hunks)
  • lib/api/src/grpc/qdrant.rs (1 hunks)
  • lib/segment/src/data_types/index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/mod.rs (2 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (1)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (2)
lib/segment/src/data_types/index.rs (3)
  • fmt (296-299)
  • fmt (408-411)
  • from (270-292)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)
  • new (179-207)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/japanese.rs (2)
  • tokenize (40-71)
  • tokenize (75-82)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (3)
  • tokenize (15-37)
  • tokenize (43-65)
  • tokenize (71-104)
lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1)
  • from_algorithm (24-30)
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

69-69: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


70-70: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


71-71: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


72-72: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


73-73: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


74-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: test-consensus-compose
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-low-resources
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (25)
lib/segment/src/index/field_index/full_text_index/tests/mod.rs (2)

166-166: LGTM! Correct field addition for test compatibility.

The addition of stemmer: None appropriately maintains existing test behavior while accommodating the new optional stemming field in TextIndexParams.


213-213: LGTM! Consistent field addition for test compatibility.

The stemmer: None field addition is consistent with the other test update and correctly maintains the focus on phrase matching functionality without introducing stemming behavior.

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (1)

406-406: LGTM! Appropriate field addition for test compatibility.

The addition of stemmer: None correctly maintains existing test behavior for general full-text indexing functionality while accommodating the new optional stemming field in TextIndexParams.

lib/api/src/grpc/qdrant.rs (1)

869-871: Tag addition looks safe but double-check wire compatibility

Adding an optional field with tag 8 follows protobuf’s forwards-compatibility rules, so existing clients won’t break.
Please still verify that:

  1. No other message re-uses tag 8 inside TextIndexParams in older .proto versions.
  2. Down-stream SDKs that ship pre-compiled stubs are regenerated; otherwise they will silently drop the new field.
lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (4)

8-13: LGTM - Clean enum design with future extensibility.

The Stemmer enum is well-designed for extensibility, allowing easy addition of other stemming algorithms in the future. The Arc wrapper is appropriate for thread-safe sharing across tokenizers.


15-21: Good practice - Custom Debug implementation maintains abstraction.

The custom Debug implementation correctly hides the internal Arc<rust_stemmers::Stemmer> details, maintaining clean debug output and abstraction boundaries.
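As a rough illustration of that pattern, here is a minimal stdlib-only sketch: `SnowballInner` is a hypothetical stand-in for `rust_stemmers::Stemmer` (which the real code wraps in an `Arc`), and the exact variant names are assumptions for the example:

```rust
use std::fmt;
use std::sync::Arc;

// Hypothetical stand-in for rust_stemmers::Stemmer, which the real code hides.
struct SnowballInner;

#[derive(Clone)]
enum Stemmer {
    Snowball(Arc<SnowballInner>),
}

// Custom Debug hides the Arc-wrapped internals behind a stable name.
impl fmt::Debug for Stemmer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Stemmer::Snowball(_) => f.write_str("Stemmer::Snowball"),
        }
    }
}

fn main() {
    let stemmer = Stemmer::Snowball(Arc::new(SnowballInner));
    // Cloning only bumps the Arc refcount, so sharing across tokenizers is cheap.
    let shared = stemmer.clone();
    println!("{:?}", shared); // Stemmer::Snowball
}
```

Deriving `Clone` works here because `Arc<T>` is `Clone` for any `T`, so the inner stemmer never needs to be duplicated.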


32-37: Verify stem_cow method availability.

The stem method delegates to stem_cow on the underlying stemmer. Ensure this method exists and handles the Cow<str> type correctly.

#!/bin/bash
# Description: Verify that rust_stemmers::Stemmer has stem_cow method
# Expected: Should find the method definition or usage

# Search for stem_cow method usage
rg -A 3 -B 3 "stem_cow" --type rust

# Check if there are any other stem methods being used
rg -A 3 -B 3 "\.stem\(" --type rust

24-30: Verify how rust_stemmers::Stemmer::create handles an invalid Algorithm parameter: does it panic or fall back to a default?
lib/api/src/grpc/proto/collections.proto (2)

483-484: LGTM - Consistent field naming and numbering.

The addition of the stemming field to TextIndexParams follows the established pattern with proper optional marking and sequential field numbering.


486-490: Good extensible design with oneof pattern.

The StemmingAlgorithm message uses the oneof pattern effectively, allowing for future addition of other stemming algorithms while maintaining backward compatibility. The field naming follows protobuf conventions.

docs/redoc/master/openapi.json (1)

7795-7816: Language enum looks good – consider case consistency

All language identifiers are Title-case while API payloads elsewhere are often lower-case ("english"). Double-check server deserialization logic to ensure case-sensitivity issues don’t surface.

lib/segment/src/data_types/index.rs (1)

214-217: LGTM! The optional stemmer field is well-placed.

The addition of the optional stemmer field to TextIndexParams follows the established pattern and uses appropriate serde attributes.

lib/api/src/grpc/conversions.rs (3)

14-14: LGTM! Proper handling of stemming fields in TextIndexParams conversion.

The conversion correctly extracts and transforms the stemmer field between internal and gRPC representations.

Also applies to: 36-37, 220-221, 227-228, 238-238


360-373: LGTM! Clean conversion from internal to gRPC StemmingAlgorithm.

The implementation properly maps the internal enum to the gRPC structure.


539-553: LGTM! Robust error handling in stemming parameter conversion.

The implementation properly validates the language parameter and provides a clear error message for invalid values.

lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

19-26: LGTM! Clean refactoring to use TokenizerConfig.

The signature change improves API consistency and reduces parameter count while maintaining all functionality.


30-60: LGTM! Proper integration of stemming into the tokenization flow.

The refactored logic correctly:

  • Optimizes for Latin script without stemming
  • Delegates stemming to the config
  • Maintains backward compatibility with the Japanese tokenizer

187-201: LGTM! Good test coverage for stemming functionality.

The test properly verifies that the English Snowball stemmer correctly stems "Testing" to "test" while leaving "this" unchanged.

lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (7)

4-4: LGTM! Clean module organization.

The addition of the stemmer module and its import follows the existing pattern and maintains good separation of concerns.

Also applies to: 7-7


15-38: Correct tokenization pipeline implementation.

The WhiteSpaceTokenizer correctly implements the tokenization pipeline:

  1. Lowercase transformation (if enabled)
  2. Stopword filtering
  3. Stemming application

This order ensures efficient processing by filtering stopwords before applying the potentially expensive stemming operation.
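The three-stage order above can be sketched in a self-contained way. This is not the PR's code: `naive_stem` is a hypothetical suffix-stripper standing in for the real Snowball stemmer, and the stopword set is invented for the example; only the lowercase → stopwords → stemming ordering is taken from the review:

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the Snowball stemmer: strips a few common suffixes.
fn naive_stem(token: &str) -> String {
    for suffix in ["ing", "ly", "ed", "s"] {
        if let Some(stripped) = token.strip_suffix(suffix) {
            if stripped.len() >= 3 {
                return stripped.to_string();
            }
        }
    }
    token.to_string()
}

// Pipeline order as described: lowercase -> stopword filter -> stem.
fn tokenize(text: &str, lowercase: bool, stopwords: &HashSet<&str>) -> Vec<String> {
    text.split_whitespace()
        .map(|t| if lowercase { t.to_lowercase() } else { t.to_string() })
        .filter(|t| !stopwords.contains(t.as_str()))
        .map(|t| naive_stem(&t))
        .collect()
}

fn main() {
    let stopwords: HashSet<&str> = ["the", "a"].into_iter().collect();
    // Stopwords are dropped before the (comparatively expensive) stemming step runs.
    let tokens = tokenize("The dog was Running", true, &stopwords);
    println!("{:?}", tokens);
}
```

Filtering before stemming matters because stopwords are frequent: stemming them first would waste work on tokens that are discarded anyway.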


43-66: Consistent tokenizer implementations.

Both WordTokenizer and PrefixTokenizer correctly follow the same tokenization pipeline pattern established in WhiteSpaceTokenizer. The uniform use of TokenizerConfig improves code maintainability.

Also applies to: 71-104


122-147: Well-documented query tokenization logic.

The implementation correctly applies stemming without stopword filtering for prefix queries. The detailed comment effectively explains the rationale with a concrete example, which is valuable for future maintainers.


170-206: Clean refactoring of Tokenizer struct.

The consolidation of configuration parameters into a single TokenizerConfig field improves encapsulation and reduces the number of parameters passed around. The constructor properly initializes all configuration options including the optional stemmer.


281-291: Efficient stemming implementation.

The stem_if_applicable method is well-implemented:

  • Uses the idiomatic let-else pattern for optional handling
  • Preserves Cow semantics to minimize allocations
  • Clear documentation of its behavior
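The let-else-plus-Cow shape described above can be sketched as follows. The `TokenizerConfig` and `stem_if_applicable` names mirror the PR, but the bodies are assumptions: the `Stemmer` here is a dummy suffix-stripper, not rust-stemmers:

```rust
use std::borrow::Cow;

// Hypothetical stand-in for the configured stemmer.
struct Stemmer;

impl Stemmer {
    fn stem<'a>(&self, token: Cow<'a, str>) -> Cow<'a, str> {
        if let Some(stripped) = token.strip_suffix("ing") {
            if stripped.len() >= 3 {
                return Cow::Owned(stripped.to_string());
            }
        }
        token // unchanged input passes through without allocating
    }
}

struct TokenizerConfig {
    stemmer: Option<Stemmer>,
}

impl TokenizerConfig {
    /// Applies stemming only when a stemmer is configured; the let-else
    /// falls through with the original Cow, so the no-stemmer path is free.
    fn stem_if_applicable<'a>(&self, token: Cow<'a, str>) -> Cow<'a, str> {
        let Some(stemmer) = &self.stemmer else {
            return token;
        };
        stemmer.stem(token)
    }
}

fn main() {
    let config = TokenizerConfig { stemmer: Some(Stemmer) };
    let out = config.stem_if_applicable(Cow::Borrowed("proceeding"));
    println!("{out}"); // proceed
}
```

Returning `Cow` means a token that the stemmer leaves untouched is handed back as the original borrow, so the common case allocates nothing.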

735-758: Comprehensive stemming test coverage.

The new test_stemming_snowball test effectively verifies stemming functionality across all tokenizer types. The test cases use good examples ("interestingly" → "interest", "proceeding" → "proceed", "living" → "live") that clearly demonstrate the stemming behavior.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (1)

35-35: Minor style: Remove extra blank lines for consistency.

There are extra blank lines after the stemming application in PrefixTokenizer methods that are inconsistent with other tokenizers.

-            callback(config.stem_if_applicable(token_cow));
-
+            callback(config.stem_if_applicable(token_cow));

Apply the same fix to lines 137-138 in tokenize_query.

Also applies to: 63-63, 91-92, 137-138

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7f691b5 and eca6adf.

📒 Files selected for processing (2)
  • lib/segment/src/data_types/index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/segment/src/data_types/index.rs
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: storage-compat-test
🔇 Additional comments (5)
lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (5)

4-7: LGTM! Clean integration of stemming module.

The new stemmer module import and Stemmer struct integration are well-placed and follow the existing module structure.


175-206: Excellent refactoring with TokenizerConfig.

The consolidation of tokenizer parameters into a unified TokenizerConfig struct improves maintainability and makes stemming integration clean. The constructor properly handles the optional stemmer parameter from TextIndexParams.


274-291: Well-designed TokenizerConfig with proper stemming abstraction.

The stem_if_applicable method provides a clean abstraction for conditional stemming, safely handling the optional stemmer case. The struct design allows for easy extension with additional tokenizer features.


744-767: Comprehensive stemming test coverage.

The test effectively verifies stemming behavior across all tokenizer types with appropriate expected outputs. The test cases demonstrate that stemming is applied correctly to produce root forms ("interestingly" → "interest", "proceeding" → "proceed", "living" → "live").


17-38: Consistent tokenizer interface updates.

All tokenizer implementations have been consistently updated to use the new TokenizerConfig parameter. The order of operations (lowercase → stopwords → stemming) is logical and maintained across all implementations.

Also applies to: 45-66, 73-104, 124-147

@generall

PUT collections/benchmark/index
{
  "field_name": "t",
  "field_schema": {
    "type": "text",
    "normalize": true,
    "stemmer": {
      "type": "snowball",
      "language": "en"
    }
  }
}

What is "normalize": true,?


/// Different stemming algorithms with their configs.
#[derive(Debug, Deserialize, Serialize, JsonSchema, Clone, PartialEq, Hash, Eq)]
#[serde(tag = "type", rename_all = "snake_case")]

@generall generall Jun 28, 2025


Using a serde tag will generate an anonymous object, which is very hard to use in the Python client. Instead, we use e.g. pub r#type: FloatIndexType and an untagged enum.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)

30-30: Address the previous review concern about the stemmer check.

This check was previously questioned by reviewers who noted it might not be necessary. The optimization skips language detection for Latin script when no stemming is configured, but this adds complexity that may not provide significant benefit.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eca6adf and 6486d4c.

📒 Files selected for processing (6)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (8 hunks)
  • lib/segment/src/data_types/index.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs (27 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • lib/segment/src/index/field_index/full_text_index/tokenizers/stemmer.rs
  • lib/segment/src/data_types/index.rs
  • lib/segment/src/index/field_index/full_text_index/tokenizers/mod.rs
  • lib/api/src/grpc/conversions.rs
  • docs/redoc/master/openapi.json
🧰 Additional context used
🧠 Learnings (1)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (1)
Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.
⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: test-low-resources
  • GitHub Check: lint
  • GitHub Check: integration-tests
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: storage-compat-test
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (3)
lib/segment/src/index/field_index/full_text_index/tokenizers/multilingual.rs (3)

187-203: Excellent test coverage for stemming functionality.

The test properly validates that stemming works correctly by checking that "Testing" is stemmed to "test". The test setup with TokenizerConfig and StemmingAlgorithm::Snowball configuration is comprehensive.


22-22: Good refactoring to unified configuration.

The migration from individual parameters to TokenizerConfig improves maintainability and extensibility. This design makes it easier to add new tokenization options without changing method signatures.


58-58: Clean encapsulation of stemming logic.

The use of config.stem_if_applicable(cased_token) properly encapsulates the stemming decision within the configuration object, maintaining separation of concerns.
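The `TokenizerConfig` / `stem_if_applicable` pattern praised above can be sketched in isolation. This is an illustrative stand-in only: the real implementation delegates to the rust-stemmers crate, so the trivial suffix-stripper below is a dependency-free substitute, and the struct fields shown are assumptions rather than the actual Qdrant types.

```rust
use std::borrow::Cow;

// Stand-in for the Snowball wrapper; crude English-only suffix
// stripping, NOT real Snowball stemming.
struct Stemmer;

impl Stemmer {
    fn stem<'a>(&self, token: &'a str) -> Cow<'a, str> {
        for suffix in ["ing", "ed", "s"] {
            if let Some(stripped) = token.strip_suffix(suffix) {
                if stripped.len() >= 3 {
                    return Cow::Owned(stripped.to_string());
                }
            }
        }
        Cow::Borrowed(token)
    }
}

// Assumed shape of a unified tokenizer configuration.
struct TokenizerConfig {
    lowercase: bool,
    stemmer: Option<Stemmer>,
}

impl TokenizerConfig {
    // Mirrors the stem_if_applicable idea: stem only when a stemmer
    // is configured, otherwise pass the token through unchanged.
    fn stem_if_applicable<'a>(&self, token: &'a str) -> Cow<'a, str> {
        match &self.stemmer {
            Some(stemmer) => stemmer.stem(token),
            None => Cow::Borrowed(token),
        }
    }
}

fn main() {
    let config = TokenizerConfig { lowercase: true, stemmer: Some(Stemmer) };
    let token = if config.lowercase { "Testing".to_lowercase() } else { "Testing".into() };
    // "Testing" -> lowercase -> "testing" -> stem -> "test"
    assert_eq!(config.stem_if_applicable(&token), "test");
}
```

Keeping the decision inside the config object means tokenizers never need to know which algorithm (if any) is active.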

@JojiiOfficial
Copy link
Contributor Author

JojiiOfficial commented Jul 2, 2025

What is "normalize": true?

It's a leftover from the previous approach that I accidentally copied in here. It is not part of this PR and can be ignored.


@JojiiOfficial JojiiOfficial requested review from coszio and generall July 2, 2025 09:38
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 184a52b and 3e00fae.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • lib/api/src/grpc/conversions.rs (8 hunks)
  • lib/api/src/grpc/proto/collections.proto (17 hunks)
  • lib/api/src/grpc/qdrant.rs (1 hunks)
  • lib/segment/Cargo.toml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • lib/api/src/grpc/conversions.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/api/src/grpc/qdrant.rs
  • lib/api/src/grpc/proto/collections.proto
🧰 Additional context used
🧠 Learnings (1)
lib/segment/Cargo.toml (6)
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/tests.rs:44-47
Timestamp: 2025-05-15T22:54:30.292Z
Learning: The rand crate version 0.9.0 and newer uses method names with `random_*` prefix (like `random_range`), while versions 0.8.x and older use `gen_*` prefix (like `gen_range`). This naming change was introduced in rand 0.9.0-alpha.1.
Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5954
File: lib/segment/src/index/field_index/bool_index/simple_bool_index.rs:219-228
Timestamp: 2025-03-12T13:40:56.980Z
Learning: In Rust 1.80.0 and later, `std::mem::size_of` is included in the prelude and doesn't need to be explicitly imported.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
docs/grpc/docs.md (3)

69-74: Fix wrong list indentation in the TOC

markdownlint flags the six newly-added list items because their bullet indentation (4 spaces) deviates from the existing 2-space pattern used throughout the document.
Keeping a consistent indent avoids broken anchors when other tooling re-writes or lints the file.

-    - [SnowballParameters](#qdrant-SnowballParameters)
...
-    - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)
+  - [SnowballParameters](#qdrant-SnowballParameters)
...
+  - [StemmingAlgorithm](#qdrant-StemmingAlgorithm)

1483-1492: Consider wrapping algorithm variants in a oneof

StemmingAlgorithm currently contains a single optional field snowball.
If additional algorithms are added later (e.g., Porter, Lancaster) you will want mutual exclusivity. Declaring the variants inside a oneof from day one avoids a breaking proto change down the road.

Example:

message StemmingAlgorithm {
  oneof algorithm {
    SnowballParameters snowball = 1;
    // future variants...
  }
}
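A oneof maps to a Rust enum in generated code, which makes the mutual exclusivity explicit. The following is a hand-written sketch of the resulting shape, not actual prost output (naming and module layout differ in generated code):

```rust
// Hand-written sketch of how a oneof carries over to a Rust enum.
struct SnowballParameters {
    language: String,
}

enum StemmingAlgorithm {
    Snowball(SnowballParameters),
    // Future variants (e.g. Porter) slot in here without breaking
    // existing ones, since exactly one variant can be set at a time.
}

fn describe(algo: &StemmingAlgorithm) -> String {
    match algo {
        StemmingAlgorithm::Snowball(p) => format!("snowball({})", p.language),
    }
}

fn main() {
    let algo = StemmingAlgorithm::Snowball(SnowballParameters { language: "en".into() });
    assert_eq!(describe(&algo), "snowball(en)");
}
```

The exhaustive `match` is the payoff: adding a variant later forces every call site to handle it, which a message with several optional fields cannot guarantee.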

1653-1654: Clarify default behaviour and interaction with other flags

The new stemming field is optional, but the docs don’t specify:
• What happens when it’s omitted – is stemming disabled or is there a collection-level default?
• How does it interact with lowercase and stopwords? (order matters)

Add a short note (as done for lowercase, phrase_matching etc.) so users understand the defaults.
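The ordering question can be made concrete with a minimal sketch, assuming the common lowercase → stopword-filter → stem pipeline. The order shown here is an assumption for illustration, not taken from the Qdrant source, and the toy stemmer is a stand-in for a real Snowball implementation:

```rust
use std::collections::HashSet;

// Hypothetical per-token pipeline; parameter names follow the
// TextIndexParams options discussed above.
fn process_token(
    token: &str,
    lowercase: bool,
    stopwords: &HashSet<String>,
    stem: Option<&dyn Fn(&str) -> String>,
) -> Option<String> {
    // 1. Lowercase first, so stopword lookup and stemming see canonical input.
    let token = if lowercase { token.to_lowercase() } else { token.to_string() };
    // 2. Filter stopwords before stemming: stopword lists hold surface forms,
    //    and stemming first could merge a stopword with a content word.
    if stopwords.contains(&token) {
        return None;
    }
    // 3. Stem last, if a stemmer is configured.
    Some(match stem {
        Some(f) => f(&token),
        None => token,
    })
}

fn main() {
    let stopwords: HashSet<String> = ["the".to_string()].into_iter().collect();
    let stem = |t: &str| t.strip_suffix("ing").unwrap_or(t).to_string();
    // Stopword is dropped before it can be stemmed.
    assert_eq!(process_token("The", true, &stopwords, Some(&stem)), None);
    // Content word survives and is stemmed.
    assert_eq!(process_token("Testing", true, &stopwords, Some(&stem)), Some("test".to_string()));
}
```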

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e00fae and 42f1f38.

📒 Files selected for processing (1)
  • docs/grpc/docs.md (4 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/grpc/docs.md

69-74: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms (14)
  • GitHub Check: storage-compat-test
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)

Comment on lines +1404 to +1413
<a name="qdrant-SnowballParameters"></a>

### SnowballParameters



| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| language | [string](#string) | | Which language the algorithm should stem. |


🛠️ Refactor suggestion

Document the allowed languages or change the type to an enum

language is declared as a free-form string, but the implementation (rust-stemmers) only accepts a finite set of ISO-639 codes.
Exposing it as a raw string in the public API invites invalid input and inconsistent error handling.

Actionable options:

  1. Replace string language with an enum that lists supported Snowball languages (preferred – compile-time safety, self-documenting).
  2. Keep string but add a note like “Allowed values: en, de, …” and link to the Snowball spec.

Failing to do either shifts the validation burden to runtime and onto every client.

🤖 Prompt for AI Agents
In docs/grpc/docs.md around lines 1404 to 1413, the `language` field is
currently a free-form string but should be restricted to valid Snowball stemmer
languages. To fix this, either replace the `string` type with an enum listing
all supported ISO-639 language codes for compile-time validation, or if keeping
it as a string, add a clear note specifying the allowed language codes and link
to the Snowball specification. This will prevent invalid inputs and improve API
clarity.
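Option 2 above can be sketched as a runtime check. The language list below is illustrative only, not the authoritative set rust-stemmers accepts:

```rust
// Hypothetical validator for the free-form `language` string;
// the supported set shown here is an assumption for illustration.
fn validate_snowball_language(language: &str) -> Result<(), String> {
    const SUPPORTED: &[&str] = &[
        "ar", "da", "de", "en", "es", "fi", "fr", "hu",
        "it", "nl", "no", "pt", "ro", "ru", "sv", "tr",
    ];
    if SUPPORTED.contains(&language) {
        Ok(())
    } else {
        Err(format!(
            "unsupported snowball language '{language}', expected one of: {}",
            SUPPORTED.join(", ")
        ))
    }
}

fn main() {
    assert!(validate_snowball_language("en").is_ok());
    assert!(validate_snowball_language("klingon").is_err());
}
```

Centralizing the check at the API boundary gives clients one consistent error instead of each caller discovering invalid codes at index time.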

@JojiiOfficial JojiiOfficial merged commit b00a908 into dev Jul 2, 2025
18 checks passed
@JojiiOfficial JojiiOfficial deleted the add_stemming branch July 2, 2025 13:29
@generall generall added this to the Stemmer milestone Jul 17, 2025
generall added a commit that referenced this pull request Jul 17, 2025
* Implement stemming

* minor fix

* Add openapi specs

* Fix import

* Fix Test, rename API type

* update openapi

* fix rest models

* Review remarks

* Minor code cleanup

* make grpc naming aligned with Rest + use master branch for dependencies

* grpc docs

* use tagged version

---------

Co-authored-by: generall <andrey@vasnetsov.com>
@coderabbitai coderabbitai bot mentioned this pull request Aug 3, 2025

3 participants