filter text any #7100

Merged

generall merged 8 commits into dev from filter-text-any on Aug 24, 2025
Conversation

Member

@generall generall commented Aug 19, 2025

  • internal implementation for filter any
  • implement api for match text_any

ToDo:

  • Tests
  • Examples

@generall generall requested a review from coszio August 19, 2025 23:05
Contributor

coderabbitai bot commented Aug 19, 2025

📝 Walkthrough

Adds a new TextAny full-text match variant (MatchTextAny / Match::TextAny) across the API surface: proto, gRPC conversions, OpenAPI/docs, and public types. Introduces an explicit AllTokens variant (renamed from Tokens) alongside a new AnyTokens variant with OR-style matching; implements union merging for postings in the mutable, mmap, and immutable inverted indexes; adds mmap→immutable conversion; updates query parsing (parse_text_any_query), condition conversion (TextQueryType), payload matching, cardinality estimation (expected_should_estimation), and index inference; and adds an integration test verifying text_any behavior.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • timvisee
  • coszio
  • agourlay

Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)

582-601: Filter for Match::TextAny is missing from FullTextIndex::filter (FieldIndex path)

FieldIndex::filter currently supports Text and Phrase only. Without TextAny, the index cannot prefilter candidates for text_any conditions, potentially degrading performance.

Proposed change:

-        let parsed_query_opt = match &condition.r#match {
-            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
-            Some(Match::Phrase(MatchPhrase { phrase })) => {
-                self.parse_phrase_query(phrase, hw_counter)
-            }
-            _ => return None,
-        };
+        let parsed_query_opt = match &condition.r#match {
+            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
+            Some(Match::TextAny(MatchTextAny { text_any })) => {
+                self.parse_text_any_query(text_any, hw_counter)
+            }
+            Some(Match::Phrase(MatchPhrase { phrase })) => {
+                self.parse_phrase_query(phrase, hw_counter)
+            }
+            _ => return None,
+        };

Note: you’ll need to import MatchTextAny at the top:

  • use crate::types::{FieldCondition, Match, MatchPhrase, MatchText, MatchTextAny, PayloadKeyType};

602-619: Cardinality estimation for TextAny is not wired

Same gap exists in estimate_cardinality. Wire it to parse_text_any_query so the estimator can use estimate_has_any_cardinality.

Proposed change:

-        let parsed_query_opt = match &condition.r#match {
-            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
-            Some(Match::Phrase(MatchPhrase { phrase })) => {
-                self.parse_phrase_query(phrase, hw_counter)
-            }
-            _ => return None,
-        };
+        let parsed_query_opt = match &condition.r#match {
+            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
+            Some(Match::TextAny(MatchTextAny { text_any })) => {
+                self.parse_text_any_query(text_any, hw_counter)
+            }
+            Some(Match::Phrase(MatchPhrase { phrase })) => {
+                self.parse_phrase_query(phrase, hw_counter)
+            }
+            _ => return None,
+        };
🧹 Nitpick comments (11)
lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1)

205-208: Add AnyTokens coverage to full-text index congruence tests

No lingering references to the old ParsedQuery::Tokens variant were found. However, the existing test_congruence.rs suite only exercises the AllTokens (AND) and Phrase paths. To fully validate the new split, please add a test case for ParsedQuery::AnyTokens (OR semantics) in:

  • lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs

This will ensure OR-query behavior is covered alongside the existing cases.

docs/grpc/docs.md (1)

3370-3371: Good addition; clarify semantics for users

Consider clarifying that text_any:

  • Matches any token after the text pipeline (tokenizer, lowercase, stopwords, stemming if configured).
  • Is OR semantics vs match.text, which is AND semantics.

This will help users choose between text vs text_any correctly.
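The AND/OR distinction above can be sketched with plain token sets. This is an illustrative sketch only, assuming simple whitespace tokenization and lowercasing; real Qdrant matching goes through the configured text pipeline and the inverted index, and these function names are hypothetical:

```rust
use std::collections::HashSet;

fn tokenize(s: &str) -> HashSet<String> {
    s.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// `text` semantics: every query token must appear in the document.
fn matches_text(doc: &str, query: &str) -> bool {
    let doc = tokenize(doc);
    tokenize(query).iter().all(|t| doc.contains(t))
}

/// `text_any` semantics: at least one query token must appear.
fn matches_text_any(doc: &str, query: &str) -> bool {
    let doc = tokenize(doc);
    tokenize(query).iter().any(|t| doc.contains(t))
}
```

A query like "quick wolf" fails `text` (not all tokens present) but passes `text_any` against "the quick brown fox".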

lib/api/src/grpc/qdrant.rs (1)

6724-6727: Add clearer docs for TextAny semantics (OR vs AND) in the proto.

The comment “Match any word in the text” is terse. Given “Text” and “Phrase” already exist, this should explicitly state OR-semantics across tokenized terms (vs. Text’s AND-semantics, if that’s the case), and note it applies to text-indexed fields, with normalizer/stopwords/stemming behavior identical to Text.

Consider updating the proto docstring so it propagates into generated code and public docs:

-        /// Match any word in the text
+        /// Match any token in the text (OR semantics).
+        /// Uses the same tokenization/normalization/stopwords/stemming as `text`.
+        /// Contrast with:
+        ///  - `text`: all tokens must match (AND semantics).
+        ///  - `phrase`: exact phrase match.

If this is not the intended behavior, please clarify in the comment accordingly.

Also add tests (in the relevant layer) to cover:

  • simple single-token match,
  • multi-token with OR semantics,
  • interaction with stopwords/stemming,
  • contrast cases where text fails but text_any matches, and vice versa.
docs/redoc/master/openapi.json (1)

8680-8691: Clarify semantics and tighten schema for MatchTextAny

Two small improvements:

  • Make OR-semantics explicit in the description to contrast it with MatchText.
  • Optionally guard against empty strings with minLength: 1 and add an example. This mirrors typical full-text usage and improves generated SDK hints. Backward-compatible and low-risk.

Apply this diff:

       "MatchTextAny": {
-        "description": "Full-text match of at least one token of the string.",
+        "description": "Full-text match of at least one token of the string (OR semantics).",
         "type": "object",
         "required": [
           "text_any"
         ],
         "properties": {
           "text_any": {
-            "type": "string"
+            "type": "string",
+            "minLength": 1,
+            "example": "quick fox"
           }
         }
       },
lib/segment/src/payload_storage/condition_checker.rs (1)

168-173: TextAny fallback semantics: OK, but confirm alignment with tokenizer behavior.

This adds substring-based OR over whitespace-split tokens for string payloads. It matches the existing substring approach for Text and Phrase in this ValueChecker, so behavior is consistent here. However, the index-backed path likely uses normalized/tokenized terms; on unindexed fields, substring matching may yield slightly different results (case, punctuation, CJK). If this divergence is acceptable for unindexed checks, we’re good. Otherwise, consider reusing the same tokenization/normalization used by the full-text index for parity.

Would you like me to wire in the tokenizer used in the text index here for closer parity on unindexed fields?
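The substring-based fallback described above can be sketched as follows. This is a hedged sketch, not the actual ValueChecker code: the query is split on whitespace and the condition passes if any token occurs as a substring of the payload string.

```rust
// Sketch of the unindexed-field fallback: OR over whitespace-split
// query tokens, each checked as a raw substring of the payload.
// Note this is substring matching, not tokenized matching, so it can
// diverge from index-backed results (case, punctuation, CJK).
fn text_any_fallback(payload: &str, query: &str) -> bool {
    query.split_whitespace().any(|token| payload.contains(token))
}
```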

lib/segment/src/types.rs (1)

2169-2175: Adding Match::TextAny: API shape and conversions LGTM; consider convenience constructor.

  • New struct MatchTextAny{text_any} and TextAny variants in MatchInterface/Match look good.
  • From mapping and FieldCondition::input_size update are correct.
  • Serialization fields are unambiguous among Text/TextAny/Phrase, which is important for untagged enums.

Nice-to-haves:

  • Add a helper constructor for ergonomics, mirroring new_text/new_phrase:
 impl Match {
+    pub fn new_text_any(text_any: &str) -> Self {
+        Self::TextAny(MatchTextAny { text_any: text_any.into() })
+    }
 }
  • Optionally implement From<&str> for MatchTextAny, if that’s a pattern you like elsewhere.

One cross-cutting check: PayloadSchemaType::supports_match for Text fields remains false (unchanged). That’s intentional in this codepath; the planner/condition converter should route TextAny through the text index or flag unindexed usage separately (which your PR mentions). Just ensure the unindexed-field diagnostics and text index inference cover TextAny as they do Text/Phrase.

Also applies to: 2217-2218, 2229-2233, 2270-2273, 2790-2790

lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)

284-311: Unwrap on postings.get for check_has_any: safe under current construction, but worth a note.

As with check_has_subset, this assumes TokenSet only contains known token IDs. That’s consistent with how ParsedQuery is built (unknown tokens are dropped) and matches the comment above. No change requested.

If there’s any path that can produce TokenSet with unknown IDs, switch to if let Some(pl) = postings.get(*token_id) { … } to be defensive.

lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (2)

27-34: OR-merge yields duplicates; dedup consecutive ids

kmerge_by will output the same point id multiple times when it appears in multiple postings. For OR semantics we typically want unique point ids. Add a dedup step since streams are sorted.

Proposed change:

 pub fn merge_postings_iterator<'a>(
     postings: Vec<&'a PostingList>,
 ) -> impl Iterator<Item = PointOffsetType> + 'a {
     postings
         .into_iter()
         .map(PostingList::iter)
         .kmerge_by(|a, b| a < b)
+        .dedup()
 }

Please confirm downstream consumers of InvertedIndex::filter expect unique ids (most filters do). If uniqueness is required, we should also dedup in the compressed variant below.


70-85: Compressed OR-merge also outputs duplicates; dedup them

Same duplicate issue as uncompressed merge. Because kmerge_by orders by id, consecutive duplicates can be removed with dedup.

Proposed change:

 pub fn merge_compressed_postings_iterator<'a, V: PostingValue + 'a>(
     postings: Vec<PostingListView<'a, V>>,
     is_active: impl Fn(PointOffsetType) -> bool + 'a,
 ) -> impl Iterator<Item = PointOffsetType> + 'a {
     postings
         .into_iter()
         .map(PostingListView::into_iter)
         .kmerge_by(|a, b| a.id < b.id)
         .filter_map(move |elem| {
             if is_active(elem.id) {
                 Some(elem.id)
             } else {
                 None
             }
         })
+        .dedup()
 }
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)

309-322: parse_text_any_query: consider returning None when no known tokens

Current implementation returns Some(AnyTokens(empty)) when nothing is recognized. That ultimately yields no matches, but returning None would be consistent with the “unseen tokens -> no matches” contract and allow upstream to short-circuit earlier.

Proposed change:

 pub fn parse_text_any_query(
     &self,
     text: &str,
     hw_counter: &HardwareCounterCell,
 ) -> Option<ParsedQuery> {
     let mut tokens = AHashSet::new();
     self.get_tokenizer().tokenize_query(text, |token| {
         if let Some(token_id) = self.get_token(token.as_ref(), hw_counter) {
             tokens.insert(token_id);
         }
     });
-    let tokens = tokens.into_iter().collect::<TokenSet>();
-    Some(ParsedQuery::AnyTokens(tokens))
+    let tokens = tokens.into_iter().collect::<TokenSet>();
+    if tokens.is_empty() {
+        None
+    } else {
+        Some(ParsedQuery::AnyTokens(tokens))
+    }
 }
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (1)

286-325: Has-any estimator: bounds/expectation look good; consider min bound nuance

  • min: using largest individual posting length as a lower bound is safe (union can’t be smaller).
  • max: min(sum, points_count) is correct.
  • exp: expected_should_estimation is a good fit.

No functional issues. You can drop the unused variable name comment if you like; otherwise LGTM.
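The three bounds described above can be sketched in one function. This is a hypothetical illustration of the estimator's logic, assuming only posting-list lengths and a total point count; the names and exact clamping are not the actual Qdrant API:

```rust
// min = largest single posting (the union cannot be smaller),
// max = min(sum of posting lengths, total points),
// exp = probabilistic OR estimate, clamped into [min, max].
fn has_any_bounds(posting_lens: &[usize], total: usize) -> (usize, usize, usize) {
    if total == 0 {
        return (0, 0, 0);
    }
    let min = posting_lens.iter().copied().max().unwrap_or(0).min(total);
    let max = posting_lens.iter().sum::<usize>().min(total);
    // Probability a point hits none of the postings, assuming independence.
    let not_hit: f64 = posting_lens
        .iter()
        .map(|&x| (total - x.min(total)) as f64 / total as f64)
        .product();
    let exp = (((1.0 - not_hit) * total as f64).round() as usize).clamp(min, max);
    (min, max, exp)
}
```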

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 2b4f19f and 5700c54.

📒 Files selected for processing (17)
  • docs/grpc/docs.md (1 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (2 hunks)
  • lib/api/src/grpc/proto/points.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (2 hunks)
  • lib/collection/src/problems/unindexed_field.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)
  • lib/segment/src/index/query_estimator.rs (2 hunks)
  • lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
  • lib/segment/src/types.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code Graph Analysis (7)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)
  • iter (82-86)
  • iter (146-146)
  • into_iter (139-141)
lib/segment/src/index/field_index/full_text_index/inverted_index/posting_list.rs (1)
  • iter (34-36)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
  • is_active (164-171)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
  • merge_compressed_postings_iterator (70-85)
  • postings (49-52)
  • postings (118-121)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (2)
lib/segment/src/index/query_estimator.rs (1)
  • expected_should_estimation (72-78)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (1)
  • new (27-35)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (1)
  • merge_postings_iterator (27-34)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
  • tokens (40-42)
  • tokens (111-113)
  • iter (82-86)
  • iter (146-146)
lib/segment/src/index/field_index/full_text_index/inverted_index/posting_list.rs (1)
  • iter (34-36)
lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
  • tokens (40-42)
  • tokens (111-113)
  • new (99-101)
  • query (423-426)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
  • merge_compressed_postings_iterator (70-85)
  • postings (49-52)
  • postings (118-121)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (1)
lib/segment/src/index/field_index/field_index_base.rs (2)
  • full_text_index (187-187)
  • full_text_index (190-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: integration-tests
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (27)
lib/collection/src/problems/unindexed_field.rs (2)

228-236: Mapping TextAny to Text index is appropriate

Treating Match::TextAny as requiring a Text index is consistent with the existing Text/Phrase split. No changes needed here.


213-236: TextAny branch coverage verified across all Match handlers

All exhaustive match Match statements now account for Match::TextAny:

  • lib/collection/src/problems/unindexed_field.rs (infer_index_from_field_condition) handles Match::TextAny
  • lib/segment/src/payload_storage/condition_checker.rs includes a Match::TextAny arm
  • lib/segment/src/index/query_optimization/condition_converter/match_converter.rs includes Match::TextAny and calls parse_text_any_query
  • The full-text index parser implements parse_text_any_query, ensuring TextAny is supported end-to-end

No remaining match Match blocks omit the TextAny branch.

lib/api/src/grpc/proto/points.proto (1)

1117-1118: Additive oneof field verified: conversions in place and tag is unique

  • lib/api/src/grpc/conversions.rs: covers MatchValue::TextAny ↔ internal Match::TextAny in both directions
  • lib/segment/src/types.rs, payload_storage/condition_checker.rs & query_optimization/match_converter.rs: all handle Match::TextAny
  • lib/api/src/grpc/proto/points.proto: tag 10 is only used by text_any (no collisions)

Please rebuild (cargo build) to regenerate Prost/Tonic stubs and double-check your OpenAPI spec/docs include this new field.

lib/api/src/grpc/conversions.rs (2)

1898-1900: TextAny gRPC → internal conversion looks correct

Maps MatchValue::TextAny(text_any) to segment::types::Match::TextAny(MatchTextAny { text_any }). This aligns with the new segment-side type and keeps error paths intact for malformed matches.


1941-1943: ✅ TextAny round-trip verified

All paths for the new TextAny variant are wired correctly:

  • Proto oneof (lib/api/src/grpc/proto/points.proto:1117): string text_any = 10;
  • Generated gRPC enum (lib/api/src/grpc/qdrant.rs:6726): TextAny(::prost::alloc::string::String)
  • Conversions (lib/api/src/grpc/conversions.rs:1898–1900 & 1941–1943): both directions handled
  • Segment types (lib/segment/src/types.rs:2172 & 2217): MatchTextAny struct and Match::TextAny variant present
  • JSON/REST surfaces (docs/grpc/docs.md:3370 & docs/redoc/master/openapi.json:8680): text_any and MatchTextAny referenced

LGTM—approving.

lib/api/src/grpc/qdrant.rs (1)

6688-6690: Oneof tag extension to include 10 fully verified and backward-compatible.

The new text_any = 10 field is defined in points.proto, the gRPC⇄internal conversions include TextAny, and both the OpenAPI schema and gRPC docs expose text_any. This change is wire-safe and ready to merge.

docs/redoc/master/openapi.json (1)

8628-8630: MatchTextAny wiring verified — all layers updated consistently

All occurrences of text_any / MatchTextAny have been added as expected across the codebase:

• docs/redoc/master/openapi.json – component schema and union variant
• docs/grpc/docs.md – documentation table
• lib/api/src/grpc/proto/points.proto – oneof addition
• lib/api/src/grpc/conversions.rs & qdrant.rs – gRPC ↔ segment conversions
• lib/segment/src/types.rs – MatchTextAny struct and enum variants
• lib/segment/src/payload_storage/condition_checker.rs – payload matching logic
• lib/segment/src/index/query_optimization/condition_converter/match_converter.rs – query converter
• lib/segment/src/index/field_index/full_text_index/text_index.rs – parse_text_any_query implementation
• lib/collection/src/problems/unindexed_field.rs – index inference for TextAny

No missing references or inconsistencies detected. Approving additive API change.

lib/segment/src/payload_storage/condition_checker.rs (1)

9-11: Import of MatchTextAny looks correct and consistent.

New enum variant and struct are imported alongside existing Match types. No issues spotted.

lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)

449-452: ParsedQuery dispatch extended correctly for AllTokens and AnyTokens.

Filter and check_match routing are clear and maintain previous behavior for phrases and AND-subset queries.

Also applies to: 474-477

lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2)

231-235: Filter dispatch extended with AllTokens/AnyTokens: looks good.

The OR path is boxed; the phrase path unchanged. No issues spotted.


251-275: check_match AnyTokens branch: LGTM.

Uses doc.has_any(query) mirroring has_subset/has_phrase patterns. Consistent and clear.

lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (1)

36-68: Intersection logic is sound and leverages PostingIterator semantics

This uses the smallest posting optimization plus advance_until_greater_or_equal, which by design may return the same value multiple times; your equality check handles that correctly. Looks good.
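The smallest-posting intersection strategy praised above can be sketched std-only, with binary search standing in for `advance_until_greater_or_equal` (the actual iterator is stateful and more efficient; this is only the shape of the algorithm):

```rust
// Intersect sorted posting lists by iterating the smallest list and
// probing each remaining list for membership. Driving the smallest
// list minimizes the number of probes.
fn intersect_postings(mut postings: Vec<&[u32]>) -> Vec<u32> {
    if postings.is_empty() {
        return Vec::new();
    }
    postings.sort_by_key(|p| p.len());
    let (smallest, rest) = postings.split_first().unwrap();
    smallest
        .iter()
        .copied()
        .filter(|id| rest.iter().all(|p| p.binary_search(id).is_ok()))
        .collect()
}
```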

lib/segment/src/index/field_index/full_text_index/text_index.rs (2)

306-307: Rename to AllTokens is consistent

Switching parse_text_query to return ParsedQuery::AllTokens maintains previous AND semantics. No issues.


376-387: Payload checker paths cover AllTokens and AnyTokens correctly

Local payload matching now mirrors set semantics via has_subset and has_any. This aligns with new ParsedQuery variants.

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)

62-69: TokenSet::has_any is correct and efficient

Binary-search membership with sorted unique tokens is appropriate. Early return on empty subset preserves existing API semantics.
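The binary-search membership check described above can be sketched like this. Field and method names mirror the review's description, not the actual struct; the empty-query behavior here is an assumption:

```rust
// A document's token ids kept sorted and deduplicated, so membership
// is a binary search per query token.
struct TokenSet {
    tokens: Vec<u32>,
}

impl TokenSet {
    /// True if at least one query token is present in this set.
    /// An empty query matches nothing (assumed semantics).
    fn has_any(&self, query: &[u32]) -> bool {
        !query.is_empty()
            && query.iter().any(|t| self.tokens.binary_search(t).is_ok())
    }
}
```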


157-161: ParsedQuery gains AnyTokens while preserving AllTokens semantics

The new AnyTokens variant is a clean extension and keeps intent obvious.


229-238: Cardinality dispatch includes AnyTokens

Routing AnyTokens to estimate_has_any_cardinality is the right split.

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6)

15-17: Import merge_compressed_postings_iterator is necessary for OR semantics

Good decomposition; keeps iterator utilities centralized.


98-135: filter_has_any: union via k-way merge is correct; dedup still needed upstream

This correctly:

  • filters out deleted points (point_to_tokens_count > 0),
  • collects available postings for tokens,
  • returns empty for empty query,
  • merges via merge_compressed_postings_iterator.

Once dedup is added in merge_compressed_postings_iterator, this will produce unique ids for OR semantics.


167-193: check_has_any: fast per-point verification

Short-circuits on empty tokens and absent documents and uses postings.visitor().contains. Looks correct.


285-288: Filter dispatch wired for AnyTokens

Correctly routes AnyTokens to filter_has_any.


303-308: check_match dispatch wired for AnyTokens

Correctly routes AnyTokens to check_has_any.


450-465: Positions compression refactor: clearer and equivalent

The reflowed map/collect preserves semantics and reads better. Good assert remains.

lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (4)

10-12: Types import updated to include MatchTextAny

Necessary for new variant handling. Good.


21-29: Dispatch adds TextAny; aligns with new query type enum

This integrates TextAny into condition conversion without affecting existing cases.


258-263: Local TextQueryType is a clean replacement for const-generic

Keeps runtime dispatch simple and avoids template bloat.


265-277: get_match_text_checker: TextAny path correctly uses parse_text_any_query

On parse failure, returning a checker that’s always false avoids None propagation and simplifies callers. Good choice.

Comment on lines 66 to 79
/// What this function does:
///
/// * For each condition, it calculates the probability that an item does not match it: ((total - x) / total).
/// * It multiplies these probabilities to get the probability that an item matches none of the conditions.
/// * Subtracts this from 1 to get the probability that an item matches at least one condition.
/// * Multiplies this probability by the total number of items and rounds to get the expected count.
pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
    let element_not_hit_prob: f64 = estimations
        .map(|x| (total - x) as f64 / (total as f64))
        .product();
    let element_hit_prob = 1.0 - element_not_hit_prob;
    (element_hit_prob * (total as f64)).round() as usize
}

Contributor

⚠️ Potential issue

Fix division by zero and clamp probabilities in expected_should_estimation.

If total == 0, current code divides by zero. Also, floating rounding could push exp slightly beyond total. Guard these cases.

Apply this diff:

-pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
-    let element_not_hit_prob: f64 = estimations
-        .map(|x| (total - x) as f64 / (total as f64))
-        .product();
-    let element_hit_prob = 1.0 - element_not_hit_prob;
-    (element_hit_prob * (total as f64)).round() as usize
-}
+pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
+    if total == 0 {
+        return 0;
+    }
+    let element_not_hit_prob: f64 = estimations
+        .map(|x| {
+            let x = x.min(total);
+            (total - x) as f64 / (total as f64)
+        })
+        .product::<f64>()
+        .clamp(0.0, 1.0);
+    let element_hit_prob = 1.0 - element_not_hit_prob;
+    ((element_hit_prob * (total as f64)).round() as usize).min(total)
+}

Optional: also clamp exp to max in combine_should_estimations for additional safety.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// What this function does:
///
/// * For each condition, it calculates the probability that an item does not match it: ((total - x) / total).
/// * It multiplies these probabilities to get the probability that an item matches none of the conditions.
/// * Subtracts this from 1 to get the probability that an item matches at least one condition.
/// * Multiplies this probability by the total number of items and rounds to get the expected count.
pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
let element_not_hit_prob: f64 = estimations
.map(|x| (total - x) as f64 / (total as f64))
.product();
let element_hit_prob = 1.0 - element_not_hit_prob;
(element_hit_prob * (total as f64)).round() as usize
}
pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
if total == 0 {
return 0;
}
let element_not_hit_prob: f64 = estimations
.map(|x| {
let x = x.min(total);
(total - x) as f64 / (total as f64)
})
.product::<f64>()
.clamp(0.0, 1.0);
let element_hit_prob = 1.0 - element_not_hit_prob;
((element_hit_prob * (total as f64)).round() as usize).min(total)
}
🤖 Prompt for AI Agents
In lib/segment/src/index/query_estimator.rs around lines 66 to 79, fix
division-by-zero and out-of-range rounding: if total == 0 return 0 immediately;
when computing per-condition probability ensure you treat (total - x)/total
safely by using total as f64 only after checking total>0; after multiplying to
get element_not_hit_prob clamp it to [0.0, 1.0], compute element_hit_prob = 1.0
- element_not_hit_prob, clamp element_hit_prob to [0.0,1.0], then compute exp =
(element_hit_prob * total as f64).round() as isize and finally clamp the
returned usize between 0 and total (cast back to usize). Also optionally clamp
results in combine_should_estimations to not exceed total for extra safety.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/segment/src/index/query_estimator.rs (1)

94-100: Clamp exp to [min, max] to preserve CardinalityEstimation invariants

expected_should_estimation can exceed realistic bounds. Ensure exp is between computed min and max to avoid inconsistent estimations.

-    let expected_count = expected_should_estimation(estimations.iter().map(|x| x.exp), total);
-    CardinalityEstimation {
-        primary_clauses: clauses,
-        min: estimations.iter().map(|x| x.min).max().unwrap_or(0),
-        exp: expected_count,
-        max: min(estimations.iter().map(|x| x.max).sum(), total),
-    }
+    let min_count = estimations.iter().map(|x| x.min).max().unwrap_or(0);
+    let max_count = min(estimations.iter().map(|x| x.max).sum(), total);
+    let expected_count =
+        expected_should_estimation(estimations.iter().map(|x| x.exp), total)
+            .min(max_count)
+            .max(min_count);
+    CardinalityEstimation {
+        primary_clauses: clauses,
+        min: min_count,
+        exp: expected_count,
+        max: max_count,
+    }
♻️ Duplicate comments (3)
lib/segment/src/index/query_estimator.rs (1)

66-78: Harden expected_should_estimation against total == 0 and out-of-range probabilities

Division by zero and rounding drift can produce invalid estimates. Clamp probabilities and bound the result.

Apply this diff:

-pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
-    let element_not_hit_prob: f64 = estimations
-        .map(|x| (total - x) as f64 / (total as f64))
-        .product();
-    let element_hit_prob = 1.0 - element_not_hit_prob;
-    (element_hit_prob * (total as f64)).round() as usize
-}
+pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
+    if total == 0 {
+        return 0;
+    }
+    let element_not_hit_prob: f64 = estimations
+        .map(|x| {
+            let x = x.min(total);
+            (total - x) as f64 / (total as f64)
+        })
+        .product::<f64>()
+        .clamp(0.0, 1.0);
+    let element_hit_prob = (1.0 - element_not_hit_prob).clamp(0.0, 1.0);
+    ((element_hit_prob * (total as f64)).round() as usize).min(total)
+}
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)

19-21: Correct import for OR-merge path

Bringing in merge_compressed_postings_iterator alongside intersection is right for AnyTokens. No external dedup is needed; merge handles it in this codebase as discussed.
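For readers outside the codebase, the dedup-inside-merge property can be sketched with a naive stand-in (not the actual compressed k-merge iterator):

```rust
// Naive stand-in for an OR-merge of sorted posting lists that deduplicates
// internally, so call sites never need a trailing .dedup().
fn merge_postings(lists: &[Vec<u32>]) -> Vec<u32> {
    let mut out: Vec<u32> = lists.iter().flatten().copied().collect();
    out.sort_unstable();
    out.dedup(); // dedup happens inside the merge helper, not at call sites
    out
}

fn main() {
    // Point 3 appears in both lists but is emitted once.
    assert_eq!(merge_postings(&[vec![1, 3, 5], vec![2, 3, 6]]), vec![1, 2, 3, 5, 6]);
}
```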

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)

98-135: Immutable AnyTokens filtering looks correct

  • Drops unseen tokens for OR semantics.
  • Empty query short-circuits.
  • Uses merge_compressed_postings_iterator with active-point filter to exclude deleted docs.
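Those three properties can be modeled in a few lines (a simplified sketch over plain vectors; `is_active` stands in for the id-tracker check, and this is not the compressed-postings implementation):

```rust
// Simplified model of filter_has_any: drop tokens with no postings, merge the
// rest (OR semantics), and exclude deleted points via `is_active`.
fn filter_has_any(
    postings_by_token: &[Option<Vec<u32>>],
    is_active: impl Fn(u32) -> bool,
) -> Vec<u32> {
    let mut merged: Vec<u32> = postings_by_token
        .iter()
        .filter_map(|p| p.as_ref()) // unseen tokens are simply skipped
        .flatten()
        .copied()
        .collect();
    if merged.is_empty() {
        return Vec::new(); // empty query / no known tokens short-circuits
    }
    merged.sort_unstable();
    merged.dedup();
    merged.into_iter().filter(|&p| is_active(p)).collect()
}

fn main() {
    let postings = [Some(vec![1, 4]), None, Some(vec![2, 4])];
    // Point 4 is "deleted"; points 1 and 2 survive.
    assert_eq!(filter_has_any(&postings, |p| p != 4), vec![1, 2]);
}
```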
🧹 Nitpick comments (3)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (2)

258-263: Derive Copy/Clone (and Debug/Eq) for TextQueryType.

Saves moves and eases debugging without cost. Minor ergonomics win.

-enum TextQueryType {
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+enum TextQueryType {

264-269: Consider taking &str instead of String for flexibility (optional).

You’re consuming String from Match payloads, so this isn’t a hot path concern. However, if future callers want to reuse borrowed strings, accepting &str (or impl AsRef&lt;str&gt;) would avoid forcing ownership at the API boundary.

If you want to keep ownership for safety against parse lifetimes, ensure parse_* returns an owned query that does not borrow from the input text. If that’s guaranteed, you can switch to &str here safely.

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (1)

427-428: Tests only cover AllTokens; add AnyTokens coverage

Given the new AnyTokens semantics, add congruence tests for OR queries across Mutable/Immutable/Mmap and assertion tests for estimate_has_any_cardinality to lock in min/exp/max invariants.

I can draft rstest cases mirroring the existing congruence tests but constructing ParsedQuery::AnyTokens and random token sets. Want me to add them to this PR?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5700c54 and ff3f67c.

📒 Files selected for processing (17)
  • docs/grpc/docs.md (1 hunks)
  • docs/redoc/master/openapi.json (2 hunks)
  • lib/api/src/grpc/conversions.rs (2 hunks)
  • lib/api/src/grpc/proto/points.proto (1 hunks)
  • lib/api/src/grpc/qdrant.rs (2 hunks)
  • lib/collection/src/problems/unindexed_field.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (5 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
  • lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
  • lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)
  • lib/segment/src/index/query_estimator.rs (2 hunks)
  • lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3 hunks)
  • lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
  • lib/segment/src/types.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • lib/api/src/grpc/proto/points.proto
  • lib/collection/src/problems/unindexed_field.rs
  • docs/grpc/docs.md
  • lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs
  • lib/api/src/grpc/conversions.rs
  • lib/api/src/grpc/qdrant.rs
  • docs/redoc/master/openapi.json
  • lib/segment/src/payload_storage/condition_checker.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
  • lib/segment/src/index/field_index/full_text_index/text_index.rs
  • lib/segment/src/types.rs
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-20T15:03:18.501Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:82-100
Timestamp: 2025-08-20T15:03:18.501Z
Learning: The merge_postings_iterator function in lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs handles deduplication internally, so there's no need to add .dedup() when using it for OR-merge operations in inverted indexes.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-08-20T15:03:25.618Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs:214-247
Timestamp: 2025-08-20T15:03:25.618Z
Learning: The merge_compressed_postings_iterator function in the full text inverted index already handles deduplication internally, so external deduplication calls like .dedup() are not needed when using this function.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
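The distinction described in this learning, as a type sketch (aliases per the learning; the postings layout and function body are illustrative only):

```rust
type TokenId = u32;          // indexes into the postings lists (a word)
type PointOffsetType = u32;  // identifies a point/document

// iter_point_ids: TokenId in, points containing that token out —
// not the other way around.
fn iter_point_ids(
    postings: &[Vec<PointOffsetType>],
    token: TokenId,
) -> impl Iterator<Item = PointOffsetType> + '_ {
    postings[token as usize].iter().copied()
}

fn main() {
    let postings = vec![vec![10, 20], vec![20, 30]];
    assert_eq!(iter_point_ids(&postings, 1).collect::<Vec<_>>(), vec![20, 30]);
}
```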
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code Graph Analysis (4)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4)
  • intersect_compressed_postings_iterator (36-68)
  • merge_compressed_postings_iterator (70-85)
  • postings (49-52)
  • postings (118-121)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5)
  • filter_has_any (99-135)
  • filter (279-289)
  • merge (110-127)
  • check_has_any (167-193)
  • check_any (177-187)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3)
  • filter_has_any (82-99)
  • filter (225-235)
  • new (27-35)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (2)
lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (2)
  • index (273-278)
  • index (444-446)
lib/segment/src/index/field_index/field_index_base.rs (2)
  • full_text_index (187-187)
  • full_text_index (190-190)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
lib/segment/src/index/query_estimator.rs (1)
  • expected_should_estimation (72-78)
lib/segment/src/index/field_index/full_text_index/text_index.rs (5)
  • tokens (305-305)
  • tokens (320-320)
  • points_count (139-145)
  • new (457-478)
  • new (638-644)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
  • points_count (324-326)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
  • points_count (515-517)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4)
  • intersect_compressed_postings_iterator (36-68)
  • merge_compressed_postings_iterator (70-85)
  • postings (49-52)
  • postings (118-121)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5)
  • filter_has_any (215-247)
  • filter (443-453)
  • merge (222-239)
  • check_has_any (284-310)
  • check_any (294-304)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2)
  • filter_has_any (82-99)
  • filter (225-235)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (16)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3)

10-12: Importing MatchTextAny is correct and consistent with new API surface.

The import updates align with the new Match::TextAny variant and payload type. No issues here.


21-29: Routing Match::Text/TextAny/Phrase via TextQueryType is clean and extensible.

Nice refactor from const generics to a runtime enum; it simplifies branching and keeps call sites tidy. The new Match::TextAny arm is correctly wired to the shared text checker path.


273-281: Confirm parsing fallback semantics
The current implementation in get_match_text_checker returns an always-false checker on any parse_* failure, rather than propagating None. Since parse_text_query/parse_phrase_query (and likely parse_text_any_query) return None not only for empty input but also for unseen tokens, this change suppresses any higher-level fallback (e.g., payload scanning) and may drop legitimate matches.

• File: lib/segment/src/index/query_optimization/condition_converter/match_converter.rs
Lines 273–281: revert to returning None on parse failure to preserve downstream fallbacks.

Suggested diff:

-            let Some(parsed_query) = query_opt else {
-                return Some(Box::new(|_| false));
-            };
+            let Some(parsed_query) = query_opt else {
+                return None;
+            };
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (4)

214-247: AnyTokens iterator wiring looks good

  • Using filter_map to drop unknown tokens is correct for OR semantics.
  • Empty postings short-circuits to empty iterator.
  • Using self.is_active for filtering aligns with mmap’s “postings include deleted” model.

284-310: Per-point AnyTokens check is sound

Early exits on empty query or empty values and per-token visitor checks are consistent with the rest of the implementation. Unwraps are safe given token IDs come from the index’s vocab.


449-452: Filter enum wiring: AllTokens/AnyTokens routing is correct

AllTokens -> subset, AnyTokens -> OR-merge, Phrase -> phrase iterator. Matches the intended semantics.


472-477: check_match wiring for AnyTokens is correct

The new branch routes to check_has_any as expected.

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)

62-69: TokenSet::has_any API is appropriate

Symmetric with has_subset: returns false on empty subset and uses binary-search membership. Good.
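A minimal sketch of that contract (sorted token vector, binary-search membership, empty candidate set returns false; the representation is an assumption for illustration):

```rust
// Hypothetical has_any over a sorted token vector: true if any candidate
// token is present; an empty candidate set yields false by construction.
fn has_any(sorted_tokens: &[u32], candidates: &[u32]) -> bool {
    candidates
        .iter()
        .any(|t| sorted_tokens.binary_search(t).is_ok())
}

fn main() {
    assert!(has_any(&[1, 4, 9], &[3, 9]));
    assert!(!has_any(&[1, 4, 9], &[])); // empty input intentionally false
}
```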


152-161: ParsedQuery variants: AllTokens and AnyTokens split is clear

The semantics are now explicit (AND via AllTokens, OR via AnyTokens). This should simplify downstream dispatch.
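The AND/OR split can be illustrated with a toy dispatch (variant names from the PR; the token-set representation is simplified to Vec&lt;u32&gt;):

```rust
// Illustrative model of the two query variants and their match semantics.
enum ParsedQuery {
    AllTokens(Vec<u32>),
    AnyTokens(Vec<u32>),
}

fn matches(query: &ParsedQuery, doc_tokens: &[u32]) -> bool {
    match query {
        // AND: every query token must appear in the document.
        ParsedQuery::AllTokens(ts) => ts.iter().all(|t| doc_tokens.contains(t)),
        // OR: at least one query token must appear; empty query matches nothing.
        ParsedQuery::AnyTokens(ts) => {
            !ts.is_empty() && ts.iter().any(|t| doc_tokens.contains(t))
        }
    }
}

fn main() {
    let doc = [1, 2, 3];
    assert!(matches(&ParsedQuery::AllTokens(vec![1, 3]), &doc));
    assert!(!matches(&ParsedQuery::AllTokens(vec![1, 7]), &doc));
    assert!(matches(&ParsedQuery::AnyTokens(vec![7, 2]), &doc));
    assert!(!matches(&ParsedQuery::AnyTokens(vec![]), &doc));
}
```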


229-237: estimate_cardinality dispatch updated correctly

AllTokens -> subset estimator; AnyTokens -> new has-any estimator. Good separation of concerns.

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6)

15-17: Correctly importing merge_compressed_postings_iterator

The additional import is needed for OR semantics; good.


167-193: check_has_any implementation is consistent and efficient

Early exits and per-token visitor checks are fine. Indexing into postings is acceptable here given vocab consistency.


279-288: Filter routing: AllTokens/AnyTokens

Wiring is correct and mirrors the mmap/mutable variants.


305-308: check_match routing updated for AnyTokens

Matches the new predicate methods. Good.


449-465: Formatting-only change in create_compressed_postings_with_positions

No semantic changes detected.


467-503: From<&MmapInvertedIndex> conversion is solid

  • Postings copied via PostingListView::to_owned.
  • Vocab reconstructed correctly.
  • point_to_tokens_count and points_count preserved.
  • Debug assertion ensures postings/vocab sizes match.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (4)

181-186: Nit: Use PointOffsetType in the filter signature for consistency.

The local intersection helper uses impl Fn(u32) -> bool; elsewhere in this file and codebase we use the alias PointOffsetType. Minor readability/consistency fix:

-            filter: impl Fn(u32) -> bool + 'a,
+            filter: impl Fn(PointOffsetType) -> bool + 'a,

214-247: Optional: Align return type with filter_has_all to simplify call sites.

filter_has_any returns impl Iterator (materialized as Either<_, _>), while filter_has_all returns Box<dyn Iterator>. Unifying both to Box<dyn Iterator> keeps the public surface consistent and avoids exposing the Either shape through impl Iterator. Suggested change:

-    fn filter_has_any<'a>(
-        &'a self,
-        tokens: TokenSet,
-    ) -> impl Iterator<Item = PointOffsetType> + 'a {
+    fn filter_has_any<'a>(
+        &'a self,
+        tokens: TokenSet,
+    ) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
@@
-        match &self.storage.postings {
-            MmapPostingsEnum::Ids(postings) => Either::Left(merge(postings, tokens, is_active)),
-            MmapPostingsEnum::WithPositions(postings) => {
-                Either::Right(merge(postings, tokens, is_active))
-            }
-        }
+        match &self.storage.postings {
+            MmapPostingsEnum::Ids(postings) => Box::new(merge(postings, tokens, is_active)),
+            MmapPostingsEnum::WithPositions(postings) => {
+                Box::new(merge(postings, tokens, is_active))
+            }
+        }

If you apply this, also adjust the filter() call site (see comment on Lines 443-453).


284-310: Defensive check: avoid unwrap on postings.get(token_id).

While TokenSet should contain only known tokens, an upstream invariant break would panic here. Add a debug assertion and a safe fallback:

-            tokens.tokens().iter().any(|token_id| {
-                let posting_list = postings.get(*token_id).unwrap();
-                posting_list.visitor().contains(point_id)
-            })
+            tokens.tokens().iter().any(|token_id| {
+                debug_assert!(
+                    postings.get(*token_id).is_some(),
+                    "TokenId {} not found in postings",
+                    token_id
+                );
+                postings
+                    .get(*token_id)
+                    .map(|pl| pl.visitor().contains(point_id))
+                    .unwrap_or(false)
+            })

Would you like a quick repo scan to confirm that all TokenSet constructors indeed filter out unknown tokens before reaching Mmap index checks?


443-453: Dispatch wiring looks correct; update call if you box filter_has_any.

  • AllTokens → filter_has_all and AnyTokens → filter_has_any is correct.
  • If you adopt the optional change to return Box<dyn Iterator> from filter_has_any, adjust this arm to avoid double boxing:
-            ParsedQuery::AnyTokens(tokens) => Box::new(self.filter_has_any(tokens)),
+            ParsedQuery::AnyTokens(tokens) => self.filter_has_any(tokens),
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (2)

167-193: Defensive check: direct indexing can panic; add a debug assert (or use get).

postings[*token_id as usize] assumes all token IDs are valid. To guard invariants without cost in release builds:

-            tokens.tokens().iter().any(|token_id| {
-                let posting_list = &postings[*token_id as usize];
-                posting_list.visitor().contains(point_id)
-            })
+            tokens.tokens().iter().any(|token_id| {
+                debug_assert!(
+                    (*token_id as usize) < postings.len(),
+                    "TokenId {} out of bounds",
+                    token_id
+                );
+                let posting_list = &postings[*token_id as usize];
+                posting_list.visitor().contains(point_id)
+            })

Alternatively, switch to postings.get(..).map(...).unwrap_or(false) if you prefer a non-panicking path.

Confirm that all TokenSet producers filter out unknown tokens for the immutable index as well.


285-288: Nit: Rename the pattern variable to reflect its type.

In ParsedQuery::Phrase(tokens), the binding is a phrase/document, not tokens. Improves readability:

-            ParsedQuery::Phrase(tokens) => Box::new(self.filter_has_phrase(tokens)),
+            ParsedQuery::Phrase(phrase) => Box::new(self.filter_has_phrase(phrase)),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ff3f67c and 2fde6af.

📒 Files selected for processing (4)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (6 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (7 hunks)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-20T15:03:18.501Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:82-100
Timestamp: 2025-08-20T15:03:18.501Z
Learning: The merge_postings_iterator function in lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs handles deduplication internally, so there's no need to add .dedup() when using it for OR-merge operations in inverted indexes.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-08-20T15:03:25.618Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs:214-247
Timestamp: 2025-08-20T15:03:25.618Z
Learning: The merge_compressed_postings_iterator function in the full text inverted index already handles deduplication internally, so external deduplication calls like .dedup() are not needed when using this function.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (4)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4)
  • intersect_compressed_postings_iterator (37-71)
  • merge_compressed_postings_iterator (73-85)
  • postings (50-53)
  • postings (119-122)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3)
  • filter_has_all (58-80)
  • filter_has_any (82-99)
  • new (27-35)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (6)
  • filter_has_all (174-212)
  • filter_has_any (215-247)
  • is_active (164-171)
  • merge (222-239)
  • check_has_any (284-310)
  • check_any (294-304)
lib/posting_list/src/posting_list.rs (2)
  • view (88-105)
  • iter (112-114)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4)
  • intersect_compressed_postings_iterator (37-71)
  • merge_compressed_postings_iterator (73-85)
  • postings (50-53)
  • postings (119-122)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2)
  • filter_has_all (58-80)
  • filter_has_any (82-99)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5)
  • filter_has_all (53-96)
  • filter_has_any (99-135)
  • merge (110-127)
  • check_has_any (167-193)
  • check_any (177-187)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: lint
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: storage-compat-test
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (2)

19-21: LGTM: Correctly imports merge_compressed_postings_iterator; no extra dedup needed.

The OR-merge path is wired to merge_compressed_postings_iterator, which already performs k-merge plus in-stream dedup. No external .dedup() is necessary (and would be redundant).


472-478: LGTM: check_match dispatch covers AnyTokens.

The new AnyTokens path correctly leverages check_has_any. Phrase and AllTokens routes remain intact.

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5)

15-17: LGTM: Imported merge_compressed_postings_iterator for OR semantics.

Matches the new AnyTokens path and relies on built-in dedup in the merge iterator.


53-96: LGTM: All-tokens intersection is correct and efficiently prunes on missing postings.

Use of Option<Vec<_>> to enforce “all tokens must have postings” aligns with AND semantics.
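The AND-path logic described here can be modeled with a short Python sketch (the function name and the plain-set representation are illustrative stand-ins for the real compressed-postings implementation):

```python
def filter_has_all(postings_by_token, query_tokens):
    """AND semantics: a point matches only if every query token has a
    posting list and the point appears in all of them. A single token
    with no postings prunes the whole query to an empty result."""
    lists = []
    for token in query_tokens:
        posting = postings_by_token.get(token)
        if posting is None:
            return []  # mirrors the Option<Vec<_>> early-out on a missing posting
        lists.append(set(posting))
    if not lists:
        return []
    return sorted(set.intersection(*lists))

postings = {"robot": [1, 3, 5], "star": [3, 5, 7]}
print(filter_has_all(postings, ["robot", "star"]))   # -> [3, 5]
print(filter_has_all(postings, ["robot", "xyzzy"]))  # -> []
```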


98-135: LGTM: Any-tokens merge uses compressed merge with dedup and active-point filter.

The OR path filters out inactive points and returns empty for an empty postings set. Implementation mirrors mmap and mutable variants.

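The OR path can be sketched the same way: a k-way merge over sorted posting lists, in-stream dedup, then an active-point filter. This is a simplified Python stand-in for `merge_compressed_postings_iterator`; the function name and list-based inputs are illustrative:

```python
import heapq

def merge_postings_any(postings, is_active):
    """OR semantics: yield every point id that appears in at least one
    sorted posting list, deduplicated inside the merged stream (so no
    external .dedup() is needed), skipping inactive points."""
    last = None
    for point_id in heapq.merge(*postings):
        if point_id == last:
            continue  # dedup after the k-merge
        last = point_id
        if is_active(point_id):
            yield point_id

postings = [[1, 3, 5], [2, 3, 6], [3, 5, 7]]
active = {1, 2, 3, 5, 7}
print(list(merge_postings_any(postings, active.__contains__)))  # -> [1, 2, 3, 5, 7]
```

An empty postings set merges to an empty stream, matching the "returns empty" behavior above.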


305-308: LGTM: check_match dispatch includes AnyTokens via check_has_any.

Matches the filter-side wiring and preserves existing Phrase/AllTokens behavior.


450-465: LGTM: Positions materialization is consistent with remapped token IDs.

Enumerating new-token IDs over optimized postings and pulling positions via point_to_tokens_positions[id].remove(&token) ensures correctness; the expect is appropriate for invariant violations.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)

62-69: Minor naming/doc tweak for clarity in has_any.

The behavior is correct. Consider renaming subset to candidates (or similar) to better reflect “any-of” semantics, and expand the doc comment to note that empty input intentionally returns false (aligns with how empty queries are treated upstream).

-    /// Checks if the current set contains any of the given tokens.
-    /// Returns false if the subset is empty
-    pub fn has_any(&self, subset: &TokenSet) -> bool {
-        if subset.is_empty() {
+    /// Checks if the current set contains any of the given tokens.
+    /// Returns `false` if `candidates` is empty (empty query is treated as no-op/unsatisfied).
+    pub fn has_any(&self, candidates: &TokenSet) -> bool {
+        if candidates.is_empty() {
             return false;
         }
-        subset.0.iter().any(|token| self.contains(token))
+        candidates.0.iter().any(|token| self.contains(token))
     }

286-324: Clamp exp into [min, max] to preserve invariants and guard rounding noise.

The independence-based estimator can be susceptible to floating-point rounding. While largest_posting <= points_count holds by design (per prior team guidance), clamping the expected value to [min, max] keeps min <= exp <= max invariant intact even under numerical noise.

Using the earlier learning that posting lengths are naturally bounded by the total number of points, we intentionally do NOT clamp min against points_count.

Apply:

-        let exp = expected_should_estimation(posting_lengths.into_iter(), points_count);
+        let exp_raw = expected_should_estimation(posting_lengths.into_iter(), points_count);
+        let min_val = largest_posting;
+        let max_val = min(sum, points_count);
+        let exp = exp_raw.clamp(min_val, max_val);
 
         CardinalityEstimation {
             primary_clauses: vec![PrimaryCondition::Condition(Box::new(condition.clone()))],
-            min: largest_posting,
-            exp,
-            max: min(sum, points_count),
+            min: min_val,
+            exp,
+            max: max_val,
         }

Optional follow-up: add a tiny invariant test that checks min <= exp <= max for both AllTokens and AnyTokens on synthetic data.
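Numerically, the bounds and the independence-based expectation behave as in this Python sketch (function names follow the review discussion; the formula assumes token occurrences are independent, which makes it an estimate, not an exact count):

```python
from math import prod

def expected_should_estimation(posting_lengths, points_count):
    """Expected OR cardinality under independence: a point misses all
    tokens with probability prod(1 - p_i), where p_i = len_i / n."""
    if points_count == 0:
        return 0
    miss = prod(1 - length / points_count for length in posting_lengths)
    return round(points_count * (1 - miss))

def any_tokens_cardinality(posting_lengths, points_count):
    largest = max(posting_lengths, default=0)        # min bound
    upper = min(sum(posting_lengths), points_count)  # max bound
    exp = expected_should_estimation(posting_lengths, points_count)
    # clamp keeps min <= exp <= max even under floating-point rounding noise
    return largest, max(largest, min(exp, upper)), upper

print(any_tokens_cardinality([50, 30, 20], 100))  # -> (50, 72, 100)
```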


430-439: Make to_parsed_query_any resilient to unknown tokens (keep known ones).

For OR-semantics, unknown tokens should simply be ignored, not invalidate the query. This improves realism and widens test coverage, matching the estimator’s filter_map behavior.

-    fn to_parsed_query_any(
+    fn to_parsed_query_any(
         query: Vec<String>,
         token_to_id: impl Fn(String) -> Option<TokenId>,
     ) -> Option<ParsedQuery> {
-        let tokens = query
-            .into_iter()
-            .map(token_to_id)
-            .collect::<Option<TokenSet>>()?;
-        Some(ParsedQuery::AnyTokens(tokens))
+        let tokens: TokenSet = query
+            .into_iter()
+            .filter_map(token_to_id)
+            .collect();
+        if tokens.is_empty() {
+            return None; // all unknown -> no-op
+        }
+        Some(ParsedQuery::AnyTokens(tokens))
     }
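A Python model of the suggested behavior (the tuple-based `ParsedQuery` stand-in is illustrative; the real code builds `ParsedQuery::AnyTokens` over a `TokenSet`):

```python
def to_parsed_query_any(query_tokens, token_to_id):
    """OR-semantics parsing: unknown tokens are dropped instead of
    invalidating the whole query; if nothing is known, the query is a no-op."""
    token_ids = {tid for tid in map(token_to_id, query_tokens) if tid is not None}
    if not token_ids:
        return None  # all tokens unknown -> no-op
    return ("AnyTokens", token_ids)

vocab = {"robot": 0, "star": 1}
print(to_parsed_query_any(["robot", "xyzzy"], vocab.get))  # -> ('AnyTokens', {0})
print(to_parsed_query_any(["xyzzy"], vocab.get))           # -> None
```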
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9a7bd65 and e502936.

📒 Files selected for processing (1)
  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (8 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T20:26:48.745Z
Learnt from: coszio
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs:286-324
Timestamp: 2025-08-20T20:26:48.745Z
Learning: In inverted index implementations, posting list lengths are naturally bounded by the total number of points/documents, so `largest_posting <= points_count` is guaranteed by the data structure's design and doesn't need explicit clamping.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.

Applied to files:

  • lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code graph analysis (1)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
lib/segment/src/index/query_estimator.rs (1)
  • expected_should_estimation (75-87)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
  • points_count (515-517)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
  • points_count (324-326)
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (1)
  • points_count (287-289)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-consistency
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
🔇 Additional comments (8)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (8)

10-10: LGTM: min import is appropriate for capping the upper bound.

Used to cap max with min(sum, points_count) in OR-cardinality. No concerns.


20-20: Good reuse of expected_should_estimation for OR semantics.

Pulling the shared estimator keeps logic centralized and consistent across estimations.


229-237: Dispatch in estimate_cardinality looks correct.

AllTokens reuses subset estimator; AnyTokens uses the new OR estimator; Phrase unchanged. Good separation.


427-428: Tests updated to use AllTokens variant — good.

Keeps tests aligned with the enum rename.


607-613: Nice: exercising both AllTokens and AnyTokens in mutable index queries.

Once to_parsed_query_any ignores unknown tokens, this will cover more realistic OR cases.


617-625: Mmap path dual-query coverage looks good.

Mirrors mutable path; congruence checks will benefit from the proposed OR-parser tweak.


630-639: Immutable-mmap path also covered — congruent triplet maintained.

No issues.


157-161: All AnyTokens arms accounted for across indexes

I’ve verified that every implementation of InvertedIndex::filter and the related estimation helpers now includes the AnyTokens branch:

  • Parsing in text_index.rs produces ParsedQuery::AnyTokens and uses has_any on the token set.
  • mutable_inverted_index.rs – both filter (line 233) and the point-level check (line 267) match on ParsedQuery::AnyTokens.
  • immutable_inverted_index.rs – the filter method (line 287) and check_has_any call (line 307) cover AnyTokens.
  • mmap_inverted_index/mod.rs – the AnyTokens arm is handled in both filter (line 451) and check (line 476).
  • In inverted_index/mod.rs, the enum, parser (Some(ParsedQuery::AnyTokens) at line 438), and estimate_has_any_cardinality (line 236) are all present.

No missing cases detected.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/openapi/test_fts.py (2)

253-254: Don’t request payload you don’t use.

You only read ids; skip payload to reduce response size and speed up the test.

-            "with_payload": True,
+            "with_payload": False,

Also applies to: 278-279


268-303: Strengthen the assertion with a non-empty check to make equality meaningful.

If both queries return empty sets, the equality assertion would trivially pass without exercising text_any. Ensure the result is non-empty.

-    response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    assert len(response_any_ids) > 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between e502936 and f72c02c.

📒 Files selected for processing (1)
  • tests/openapi/test_fts.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/openapi/test_fts.py (3)
tests/openapi/test_filter.py (1)
  • test_match_any (14-42)
tests/openapi/conftest.py (1)
  • collection_name (50-51)
tests/openapi/helpers/helpers.py (1)
  • request_with_validation (41-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: e2e-tests
  • GitHub Check: test-consistency
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (1)
tests/openapi/test_fts.py (1)

227-243: Nice: explicit word tokenizer with lowercase + English stopwords.

Good call to include stopword removal so “robot of the star” reduces to the meaningful tokens for the equivalence check.

Comment on lines +225 to +243
def test_match_any(collection_name):

    # Create index
    response = request_with_validation(
        api='/collections/{collection_name}/index',
        method="PUT",
        path_params={'collection_name': collection_name},
        query_params={'wait': 'true'},
        body={
            "field_name": "title",
            "field_schema": {
                "type": "text",
                "tokenizer": "word",
                "lowercase": True,
                "stopwords": "english"
            }
        }
    )
    assert response.ok
Contributor

@coderabbitai coderabbitai bot Aug 23, 2025

🛠️ Refactor suggestion

Avoid cross-test side effects: this test reconfigures the shared title text index. Restore it to prefix for subsequent tests.

Module autouse fixture sets up title with tokenizer: "prefix". This test changes it to word (lowercase + stopwords) and leaves it that way, which can break test_scroll_with_prefix that relies on prefix. Restore the original schema at the end of the test.

Minimal addition at the end of the test:

     assert response_any_ids == response_split_ids
+
+    # Restore original text index expected by other tests in this module (prefix tokenizer)
+    response = request_with_validation(
+        api='/collections/{collection_name}/index',
+        method="PUT",
+        path_params={'collection_name': collection_name},
+        query_params={'wait': 'true'},
+        body={
+            "field_name": "title",
+            "field_schema": {
+                "type": "text",
+                "tokenizer": "prefix",
+            }
+        }
+    )
+    assert response.ok

Alternative (cleaner isolation): create a temporary collection inside this test via basic_collection_setup(collection_name=collection_name + "_text_any", ...), use it for the assertions, and drop it at the end. Happy to draft that if you prefer stricter isolation.

Also applies to: 302-303

Member Author

this is valid

Contributor

@generall Thank you for confirming! The suggested fix will help prevent the test interference issue where test_match_any modifies the shared index configuration and could affect test_scroll_with_prefix.

Comment on lines +255 to +263
"filter": {
"must": {
"key": "title",
"match": {
"text_any": "robot of the star",
}
}
}
}
Contributor

@coderabbitai coderabbitai bot Aug 23, 2025

⚠️ Potential issue

Fix schema: must must be an array of conditions, not a single object.

The OpenAPI schema expects filter.must to be a list. As written, this request will fail local request-body validation in request_with_validation before it even hits the server.

Apply this diff:

-            "filter": {
-                "must": {
-                    "key": "title",
-                    "match": {
-                        "text_any": "robot of the star",
-                    }
-                }
-            }
+            "filter": {
+                "must": [
+                    {
+                        "key": "title",
+                        "match": {
+                            "text_any": "robot of the star",
+                        }
+                    }
+                ]
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"filter": {
"must": {
"key": "title",
"match": {
"text_any": "robot of the star",
}
}
}
}
"filter": {
"must": [
{
"key": "title",
"match": {
"text_any": "robot of the star",
}
}
]
}
}
🤖 Prompt for AI Agents
In tests/openapi/test_fts.py around lines 255 to 263, the test builds a filter
where `must` is a single object but the OpenAPI schema requires `filter.must` to
be an array; update the request payload so `must` is a list containing the
existing condition object (i.e., replace the object with an array containing
that object) so it passes request-body validation.

Member Author

nope, it can be single object

Contributor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Let's examine the OpenAPI schema to understand how 'must' filters are defined
jq '.components.schemas.Filter.properties.must' docs/redoc/master/openapi.json

# Also check if there are any schema definitions that show must can be a single object
rg -A5 -B5 "must.*oneOf\|must.*anyOf" docs/redoc/master/openapi.json

# Look for filter validation logic in the codebase
rg -n "must.*\[\]\|must.*array" --type rust

Length of output: 486


@generall Thank you for the correction! You're absolutely right. Looking at the OpenAPI schema, filter.must is defined with anyOf allowing either:

  1. A single Condition object
  2. An array of Condition objects
  3. null

So the test code is perfectly valid as written. I apologize for the incorrect suggestion - the API is more flexible than I initially assessed.

🐰
oops!
(・_・;)
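The anyOf acceptance described above can be illustrated with a small client-side normalizer (a hypothetical helper, not part of Qdrant's client libraries):

```python
def normalize_must(must):
    """filter.must accepts a single condition object, a list of conditions,
    or null; normalize all three forms to a list of conditions."""
    if must is None:
        return []
    if isinstance(must, dict):
        return [must]
    return list(must)

single = {"key": "title", "match": {"text_any": "robot of the star"}}
print(normalize_must(single) == [single])    # -> True
print(normalize_must([single]) == [single])  # -> True
print(normalize_must(None))                  # -> []
```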


🧬 Code Graph Analysis Results
  • File: tests/openapi/helpers/collection_setup.py, lines 8-14
def drop_collection(collection_name='test_collection'):
    response = request_with_validation(
        api='/collections/{collection_name}',
        method="DELETE",
        path_params={'collection_name': collection_name},
    )
    assert response.ok
  • File: tests/openapi/helpers/collection_setup.py, lines 122-226

    def basic_collection_setup(collection_name='test_collection',
                               on_disk_payload=False,
                               on_disk_vectors=False,
                               wal_capacity=None) -> None:

    Summary:
      • Drops any existing collection with the given name
      • Creates a new collection with:
        vectors: size 4, distance "Dot", on_disk based on on_disk_vectors
        sparse_vectors: supports sparse-text
        on_disk_payload: based on parameter
        wal_config: includes wal_capacity_mb if provided
      • GETs the collection to verify creation
      • PUTs 10 points into /collections/{collection_name}/points with IDs 1..10, various vectors and payloads
      • Waits for the operation to complete

  • File: tests/openapi/conftest.py, lines 50-51

    def collection_name(request):
        return request.node.name.removesuffix(".py")

  • File: tests/openapi/conftest.py, lines 11-12

    def on_disk_vectors(request):
        return request.param

  • File: tests/openapi/conftest.py, lines 16-17

    def on_disk_payload(request):
        return request.param

  • File: tests/openapi/test_filter.py, lines 14-42

    def test_match_any(collection_name):
        response = request_with_validation(
            api='/collections/{collection_name}/points/search',
            method="POST",
            path_params={'collection_name': collection_name},
            body={
                "vector": [0.2, 0.1, 0.9, 0.7],
                "limit": 3,
                "filter": {
                    "must": [
                        {
                            "key": "city",
                            "match": {
                                "any": ["London", "Moscow"]
                            }
                        }
                    ]
                }
            }
        )
        assert response.ok

        json = response.json()
        assert len(json['result']) == 3

        ids = [x['id'] for x in json['result']]
        assert 2 in ids
        assert 3 in ids
        assert 4 in ids
</details>



---

<details>
<summary>✏️ Learnings added</summary>

Learnt from: generall
PR: #7100
File: tests/openapi/test_fts.py:255-263
Timestamp: 2025-08-23T22:24:44.249Z
Learning: In Qdrant's OpenAPI schema, the filter.must field accepts either a single Condition object or an array of Condition objects, not just arrays. The schema uses anyOf to allow both formats.


> Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure [review instructions](https://docs.coderabbit.ai/guides/review-instructions). You can manage existing learnings in the [app](https://app.coderabbit.ai/learnings).

</details>




Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/openapi/test_fts.py (4)

231-249: Prevent flakiness: fetch all matches for both queries instead of comparing two top-10 slices

Comparing sets after limiting each scroll to 10 can diverge if there are >10 matches or ordering differs between text_any and the manual should variant. Make the test deterministic by fetching all potentially matching points.

Minimal change (use the known dataset size):

-            "limit": 10,
+            "limit": len(texts),

Apply in both requests (first at Lines 237-238, second at Lines 262-263).

Alternatively, make it future-proof regardless of dataset size by scrolling until exhaustion (helper outside the changed hunk for clarity):

def scroll_all_ids(collection_name, filter_body):
    all_ids, offset = set(), None
    while True:
        resp = request_with_validation(
            api='/collections/{collection_name}/points/scroll',
            method="POST",
            path_params={'collection_name': collection_name},
            body={
                "offset": offset,
                "limit": 256,
                "with_payload": False,
                "with_vector": False,
                "filter": filter_body,
            }
        )
        assert resp.ok
        page = resp.json()['result']
        for p in page['points']:
            all_ids.add(p['id'])
        offset = page.get('next_page_offset')
        if not offset:
            break
    return all_ids

Use it for both variants before asserting equality.

Also applies to: 255-283


253-287: Strengthen the assertion to avoid a vacuous pass when both sides are empty

If both queries return zero results, the equality assertion would pass. Add a quick non-empty check to ensure the test actually exercises the feature.

     assert response_any.ok
     response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    assert response_any_ids, "text_any query unexpectedly returned no points"

     # Same request, but manually split into words

212-227: Optional: factor index creation into a tiny helper to reduce duplication

Both tests create an index for the same field with different tokenizers. A helper reduces repetition and makes future schema tweaks (e.g., adding stemming) one-line changes.

Example helper (outside this hunk):

def create_title_index(collection_name, tokenizer, **kwargs):
    body = {
        "field_name": "title",
        "field_schema": {"type": "text", "tokenizer": tokenizer, **kwargs}
    }
    resp = request_with_validation(
        api='/collections/{collection_name}/index',
        method="PUT",
        path_params={'collection_name': collection_name},
        query_params={'wait': 'true'},
        body=body,
    )
    assert resp.ok

Then call:

create_title_index(collection_name, "word", lowercase=True, stopwords="english")

332-355: Consider mirroring the text_any vs. split-word check on /points/search as well

You already verify the behavior for /points/scroll. Duplicating the equivalence check for /points/search (same filter, no vector relevance impact) would catch endpoint-specific regressions in matching logic.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f72c02c and 72d54e3.

📒 Files selected for processing (1)
  • tests/openapi/test_fts.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-23T22:24:44.249Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: tests/openapi/test_fts.py:255-263
Timestamp: 2025-08-23T22:24:44.249Z
Learning: In Qdrant's OpenAPI schema, the `filter.must` field accepts either a single Condition object or an array of Condition objects, not just arrays. The schema uses `anyOf` to allow both formats.

Applied to files:

  • tests/openapi/test_fts.py
🧬 Code graph analysis (1)
tests/openapi/test_fts.py (3)
tests/openapi/test_filter.py (1)
  • test_match_any (14-42)
tests/openapi/conftest.py (1)
  • collection_name (50-51)
tests/openapi/helpers/helpers.py (1)
  • request_with_validation (41-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test-consistency
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: integration-tests
  • GitHub Check: lint
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (2)
tests/openapi/test_fts.py (2)

241-247: filter.must as a single object is valid per OpenAPI anyOf

Using a single Condition object (not an array) for filter.must is allowed and matches the schema. No change needed.

Note: This aligns with our prior learning that must accepts either a single object or an array.


291-305: Good isolation: the test creates its own prefix index

Creating the prefix tokenizer index inside the test removes cross-test coupling. This addresses earlier concerns about shared mutable index configuration within the module.

@generall generall merged commit bf60e96 into dev Aug 24, 2025
16 checks passed
@generall generall deleted the filter-text-any branch August 24, 2025 10:09
timvisee pushed a commit that referenced this pull request Aug 26, 2025
* internal implementation for filter any

* implement api for match `text_any`

* allow naive implementation of text match any

* dedup after kmerge + renames

* improve `expected_should_estimation`

* congruence test

* openapi test

* setup index in each test

---------

Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
@timvisee timvisee mentioned this pull request Nov 14, 2025

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants