Conversation
📝 Walkthrough

Adds a new TextAny full-text match variant (`MatchTextAny` / `Match::TextAny`) across the API surface: proto, gRPC conversions, OpenAPI/docs, and public types. Introduces an explicit `AllTokens` variant (previously `Tokens`) and new `AnyTokens` semantics with OR-style matching, implements union merging for postings (mutable, mmap, and immutable inverted indexes), adds mmap→immutable conversion, updates query parsing (`parse_text_any_query`), condition conversion (`TextQueryType`), payload matching, cardinality estimation (`expected_should_estimation`), and index inference, and adds an integration test verifying `text_any` behavior.

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)
582-601: Filter for `Match::TextAny` is missing from `FullTextIndex::filter` (FieldIndex path)

`FieldIndex::filter` currently supports `Text` and `Phrase` only. Without `TextAny`, the index cannot prefilter candidates for `text_any` conditions, potentially degrading performance.
Proposed change:
```diff
-        let parsed_query_opt = match &condition.r#match {
-            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
-            Some(Match::Phrase(MatchPhrase { phrase })) => {
-                self.parse_phrase_query(phrase, hw_counter)
-            }
-            _ => return None,
-        };
+        let parsed_query_opt = match &condition.r#match {
+            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
+            Some(Match::TextAny(MatchTextAny { text_any })) => {
+                self.parse_text_any_query(text_any, hw_counter)
+            }
+            Some(Match::Phrase(MatchPhrase { phrase })) => {
+                self.parse_phrase_query(phrase, hw_counter)
+            }
+            _ => return None,
+        };
```

Note: you'll need to import `MatchTextAny` at the top:

```diff
-use crate::types::{FieldCondition, Match, MatchPhrase, MatchText, PayloadKeyType};
+use crate::types::{FieldCondition, Match, MatchPhrase, MatchText, MatchTextAny, PayloadKeyType};
```
602-619: Cardinality estimation for `TextAny` is not wired

The same gap exists in `estimate_cardinality`. Wire it to `parse_text_any_query` so the estimator can use `estimate_has_any_cardinality`.
Proposed change:
```diff
-        let parsed_query_opt = match &condition.r#match {
-            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
-            Some(Match::Phrase(MatchPhrase { phrase })) => {
-                self.parse_phrase_query(phrase, hw_counter)
-            }
-            _ => return None,
-        };
+        let parsed_query_opt = match &condition.r#match {
+            Some(Match::Text(MatchText { text })) => self.parse_text_query(text, hw_counter),
+            Some(Match::TextAny(MatchTextAny { text_any })) => {
+                self.parse_text_any_query(text_any, hw_counter)
+            }
+            Some(Match::Phrase(MatchPhrase { phrase })) => {
+                self.parse_phrase_query(phrase, hw_counter)
+            }
+            _ => return None,
+        };
```
🧹 Nitpick comments (11)
lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1)
205-208: Add `AnyTokens` coverage to full-text index congruence tests

No lingering references to the old `ParsedQuery::Tokens` variant were found. However, the existing `test_congruence.rs` suite only exercises the `AllTokens` (AND) and `Phrase` paths. To fully validate the new split, please add a test case for `ParsedQuery::AnyTokens` (OR semantics) in:

- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs

This will ensure OR-query behavior is covered alongside the existing cases.
docs/grpc/docs.md (1)
3370-3371: Good addition; clarify semantics for users

Consider clarifying that `text_any`:
- Matches any token after the text pipeline (tokenizer, lowercase, stopwords, stemming if configured).
- Uses OR semantics, in contrast to `match.text`, which uses AND semantics.
This will help users choose between text vs text_any correctly.
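To make the contrast concrete, here is a minimal, self-contained sketch of the two semantics, assuming a plain whitespace tokenizer instead of the real text pipeline (no lowercasing, stopwords, or stemming):

```rust
use std::collections::HashSet;

// AND semantics of `text`: every query token must appear in the document.
fn matches_all(doc: &str, query: &str) -> bool {
    let tokens: HashSet<&str> = doc.split_whitespace().collect();
    query.split_whitespace().all(|t| tokens.contains(t))
}

// OR semantics of `text_any`: at least one query token must appear.
fn matches_any(doc: &str, query: &str) -> bool {
    let tokens: HashSet<&str> = doc.split_whitespace().collect();
    query.split_whitespace().any(|t| tokens.contains(t))
}

fn main() {
    let doc = "the quick brown fox";
    assert!(matches_all(doc, "quick fox")); // all tokens present
    assert!(!matches_all(doc, "quick dog")); // "dog" missing, AND fails
    assert!(matches_any(doc, "quick dog")); // "quick" present, OR succeeds
    assert!(!matches_any(doc, "lazy dog")); // no query token present
    println!("ok");
}
```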
lib/api/src/grpc/qdrant.rs (1)
6724-6727: Add clearer docs for `TextAny` semantics (OR vs AND) in the proto

The comment "Match any word in the text" is terse. Given that `Text` and `Phrase` already exist, this should explicitly state OR semantics across tokenized terms (vs. `Text`'s AND semantics, if that's the case), and note that it applies to text-indexed fields, with normalizer/stopwords/stemming behavior identical to `Text`.
Consider updating the proto docstring so it propagates into generated code and public docs:
```diff
-    /// Match any word in the text
+    /// Match any token in the text (OR semantics).
+    /// Uses the same tokenization/normalization/stopwords/stemming as `text`.
+    /// Contrast with:
+    /// - `text`: all tokens must match (AND semantics).
+    /// - `phrase`: exact phrase match.
```

If this is not the intended behavior, please clarify in the comment accordingly.
Also add tests (in the relevant layer) to cover:
- simple single-token match,
- multi-token with OR semantics,
- interaction with stopwords/stemming,
- contrast cases where `text` fails but `text_any` matches, and vice versa.

docs/redoc/master/openapi.json (1)
8680-8691: Clarify semantics and tighten schema for `MatchTextAny`

Two small improvements:

- Make OR semantics explicit in the description to contrast it with `MatchText`.
- Optionally guard against empty strings with `minLength: 1` and add an example.

This mirrors typical full-text usage and improves generated SDK hints. Backward-compatible and low-risk. Apply this diff:

```diff
     "MatchTextAny": {
-      "description": "Full-text match of at least one token of the string.",
+      "description": "Full-text match of at least one token of the string (OR semantics).",
       "type": "object",
       "required": [
         "text_any"
       ],
       "properties": {
         "text_any": {
-          "type": "string"
+          "type": "string",
+          "minLength": 1,
+          "example": "quick fox"
         }
       }
     },
```

lib/segment/src/payload_storage/condition_checker.rs (1)
168-173: TextAny fallback semantics: OK, but confirm alignment with tokenizer behavior

This adds substring-based OR over whitespace-split tokens for string payloads. It matches the existing substring approach for Text and Phrase in this ValueChecker, so behavior is consistent here. However, the index-backed path likely uses normalized/tokenized terms; on unindexed fields, substring matching may yield slightly different results (case, punctuation, CJK). If this divergence is acceptable for unindexed checks, we're good. Otherwise, consider reusing the same tokenization/normalization used by the full-text index for parity.
Would you like me to wire in the tokenizer used in the text index here for closer parity on unindexed fields?
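For reference, the fallback described above can be sketched like this (hypothetical helper name; plain whitespace split plus case-sensitive substring checks, which is exactly where it can diverge from the tokenized index path):

```rust
// OR over whitespace-split query tokens, each checked as a raw substring
// of the payload string (no lowercasing or punctuation handling).
fn text_any_fallback(payload: &str, query: &str) -> bool {
    query.split_whitespace().any(|token| payload.contains(token))
}

fn main() {
    assert!(text_any_fallback("The Quick-Brown fox", "fox dog"));
    assert!(!text_any_fallback("The Quick-Brown fox", "dog cat"));
    // Case sensitivity shows the divergence from a normalizing tokenizer:
    // lowercase "the" is not a raw substring of "The Quick-Brown fox".
    assert!(!text_any_fallback("The Quick-Brown fox", "the"));
    println!("ok");
}
```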
lib/segment/src/types.rs (1)
2169-2175: Adding Match::TextAny: API shape and conversions LGTM; consider convenience constructor.
- New struct MatchTextAny{text_any} and TextAny variants in MatchInterface/Match look good.
- From mapping and FieldCondition::input_size update are correct.
- Serialization fields are unambiguous among Text/TextAny/Phrase, which is important for untagged enums.
Nice-to-haves:
- Add a helper constructor for ergonomics, mirroring new_text/new_phrase:
```diff
 impl Match {
+    pub fn new_text_any(text_any: &str) -> Self {
+        Self::TextAny(MatchTextAny { text_any: text_any.into() })
+    }
 }
```
- Optionally implement From<&str> for MatchTextAny, if that’s a pattern you like elsewhere.
One cross-cutting check: PayloadSchemaType::supports_match for Text fields remains false (unchanged). That’s intentional in this codepath; the planner/condition converter should route TextAny through the text index or flag unindexed usage separately (which your PR mentions). Just ensure the unindexed-field diagnostics and text index inference cover TextAny as they do Text/Phrase.
Also applies to: 2217-2218, 2229-2233, 2270-2273, 2790-2790
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
284-311: Unwrap on `postings.get` for `check_has_any`: safe under current construction, but worth a note

As with `check_has_subset`, this assumes `TokenSet` only contains known token IDs. That's consistent with how `ParsedQuery` is built (unknown tokens are dropped) and matches the comment above. No change requested.
If there's any path that can produce a `TokenSet` with unknown IDs, switch to `if let Some(pl) = postings.get(*token_id) { … }` to be defensive.
27-34: OR-merge yields duplicates; dedup consecutive ids

`kmerge_by` will output the same point id multiple times when it appears in multiple postings. For OR semantics we typically want unique point ids. Add a dedup step since the streams are sorted.
Proposed change:
```diff
 pub fn merge_postings_iterator<'a>(
     postings: Vec<&'a PostingList>,
 ) -> impl Iterator<Item = PointOffsetType> + 'a {
     postings
         .into_iter()
         .map(PostingList::iter)
         .kmerge_by(|a, b| a < b)
+        .dedup()
 }
```

Please confirm downstream consumers of `InvertedIndex::filter` expect unique ids (most filters do). If uniqueness is required, we should also dedup in the compressed variant below.
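A std-only illustration of why the dedup matters for an OR-union (the real code k-merges already-sorted posting streams lazily via itertools; this sketch sorts eagerly for brevity):

```rust
// Union of several postings: ids appearing in multiple lists would be
// emitted multiple times without the dedup step.
fn merge_or(postings: &[Vec<u32>]) -> Vec<u32> {
    let mut merged: Vec<u32> = postings.iter().flatten().copied().collect();
    merged.sort_unstable();
    merged.dedup(); // collapse ids shared between postings
    merged
}

fn main() {
    let a = vec![1, 3, 5];
    let b = vec![3, 4, 5];
    // Without dedup the sorted merge would yield [1, 3, 3, 4, 5, 5].
    assert_eq!(merge_or(&[a, b]), vec![1, 3, 4, 5]);
    println!("ok");
}
```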
70-85: Compressed OR-merge also outputs duplicates; dedup them

The same duplicate issue exists as in the uncompressed merge. Because `kmerge_by` orders by id, consecutive duplicates can be removed with `dedup`.
Proposed change:
```diff
 pub fn merge_compressed_postings_iterator<'a, V: PostingValue + 'a>(
     postings: Vec<PostingListView<'a, V>>,
     is_active: impl Fn(PointOffsetType) -> bool + 'a,
 ) -> impl Iterator<Item = PointOffsetType> + 'a {
     postings
         .into_iter()
         .map(PostingListView::into_iter)
         .kmerge_by(|a, b| a.id < b.id)
         .filter_map(move |elem| {
             if is_active(elem.id) {
                 Some(elem.id)
             } else {
                 None
             }
         })
+        .dedup()
 }
```

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
309-322: `parse_text_any_query`: consider returning `None` when no known tokens

The current implementation returns `Some(AnyTokens(empty))` when nothing is recognized. That ultimately yields no matches, but returning `None` would be consistent with the "unseen tokens -> no matches" contract and allow upstream to short-circuit earlier.
Proposed change:
```diff
 pub fn parse_text_any_query(
     &self,
     text: &str,
     hw_counter: &HardwareCounterCell,
 ) -> Option<ParsedQuery> {
     let mut tokens = AHashSet::new();
     self.get_tokenizer().tokenize_query(text, |token| {
         if let Some(token_id) = self.get_token(token.as_ref(), hw_counter) {
             tokens.insert(token_id);
         }
     });
-    let tokens = tokens.into_iter().collect::<TokenSet>();
-    Some(ParsedQuery::AnyTokens(tokens))
+    let tokens = tokens.into_iter().collect::<TokenSet>();
+    if tokens.is_empty() {
+        None
+    } else {
+        Some(ParsedQuery::AnyTokens(tokens))
+    }
 }
```

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (1)
286-325: Has-any estimator: bounds/expectation look good; consider min bound nuance
- min: using largest individual posting length as a lower bound is safe (union can’t be smaller).
- max: min(sum, points_count) is correct.
- exp: expected_should_estimation is a good fit.
No functional issues. You can drop the unused variable name comment if you like; otherwise LGTM.
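A quick numeric check of those bounds, with hypothetical posting lengths [4, 3] over 10 points:

```rust
// min: the union cannot be smaller than the largest single posting.
// max: the union cannot exceed the sum of lengths, nor the point count.
// exp: the expected_should_estimation formula quoted later in this review.
fn has_any_bounds(posting_lens: &[usize], points: usize) -> (usize, usize, usize) {
    let min = posting_lens.iter().copied().max().unwrap_or(0);
    let max = posting_lens.iter().sum::<usize>().min(points);
    let not_hit: f64 = posting_lens
        .iter()
        .map(|&x| (points - x) as f64 / points as f64)
        .product();
    let exp = ((1.0 - not_hit) * points as f64).round() as usize;
    (min, exp, max)
}

fn main() {
    // (1 - (6/10) * (7/10)) * 10 = 5.8, which rounds to 6, and 4 <= 6 <= 7.
    assert_eq!(has_any_bounds(&[4, 3], 10), (4, 6, 7));
    println!("ok");
}
```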
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (17)
- docs/grpc/docs.md (1 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (2 hunks)
- lib/api/src/grpc/proto/points.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (2 hunks)
- lib/collection/src/problems/unindexed_field.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)
- lib/segment/src/index/query_estimator.rs (2 hunks)
- lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3 hunks)
- lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
- lib/segment/src/types.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code Graph Analysis (7)
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)
  - iter (82-86), iter (146-146), into_iter (139-141)
- lib/segment/src/index/field_index/full_text_index/inverted_index/posting_list.rs (1)
  - iter (34-36)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
  - is_active (164-171)

lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
  - merge_compressed_postings_iterator (70-85), postings (49-52), postings (118-121)

lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (2)
- lib/segment/src/index/query_estimator.rs (1)
  - expected_should_estimation (72-78)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (1)
  - new (27-35)

lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (1)
  - merge_postings_iterator (27-34)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
  - tokens (40-42), tokens (111-113), iter (82-86), iter (146-146)
- lib/segment/src/index/field_index/full_text_index/inverted_index/posting_list.rs (1)
  - iter (34-36)

lib/segment/src/index/field_index/full_text_index/text_index.rs (1)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
  - tokens (40-42), tokens (111-113), new (99-101), query (423-426)

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3)
  - merge_compressed_postings_iterator (70-85), postings (49-52), postings (118-121)

lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (1)
- lib/segment/src/index/field_index/field_index_base.rs (2)
  - full_text_index (187-187), full_text_index (190-190)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: integration-tests
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: storage-compat-test
- GitHub Check: lint
🔇 Additional comments (27)
lib/collection/src/problems/unindexed_field.rs (2)
228-236: Mapping TextAny to Text index is appropriate

Treating `Match::TextAny` as requiring a Text index is consistent with the existing Text/Phrase split. No changes needed here.
213-236: TextAny branch coverage verified across all Match handlers

All exhaustive `match Match` statements now account for `Match::TextAny`:

- lib/collection/src/problems/unindexed_field.rs (`infer_index_from_field_condition`) handles `Match::TextAny`
- lib/segment/src/payload_storage/condition_checker.rs includes a `Match::TextAny` arm
- lib/segment/src/index/query_optimization/condition_converter/match_converter.rs includes `Match::TextAny` and calls `parse_text_any_query`
- The full-text index parser implements `parse_text_any_query`, ensuring `TextAny` is supported end-to-end

No remaining `match Match` blocks omit the `TextAny` branch.

lib/api/src/grpc/proto/points.proto (1)
1117-1118: Additive oneof field verified: conversions in place and tag is unique
- lib/api/src/grpc/conversions.rs: covers MatchValue::TextAny ↔ internal Match::TextAny in both directions
- lib/segment/src/types.rs, payload_storage/condition_checker.rs & query_optimization/match_converter.rs: all handle Match::TextAny
- lib/api/src/grpc/proto/points.proto: tag 10 is only used by text_any (no collisions)
Please rebuild (cargo build) to regenerate Prost/Tonic stubs and double-check your OpenAPI spec/docs include this new field.
lib/api/src/grpc/conversions.rs (2)
1898-1900: TextAny gRPC → internal conversion looks correct

Maps `MatchValue::TextAny(text_any)` to `segment::types::Match::TextAny(MatchTextAny { text_any })`. This aligns with the new segment-side type and keeps error paths intact for malformed matches.
1941-1943: ✅ TextAny round-trip verified

All paths for the new `TextAny` variant are wired correctly:

- Proto oneof (lib/api/src/grpc/proto/points.proto:1117): `string text_any = 10;`
- Generated gRPC enum (lib/api/src/grpc/qdrant.rs:6726): `TextAny(::prost::alloc::string::String)`
- Conversions (lib/api/src/grpc/conversions.rs:1898–1900 & 1941–1943): both directions handled
- Segment types (lib/segment/src/types.rs:2172 & 2217): `MatchTextAny` struct and `Match::TextAny` variant present
- JSON/REST surfaces (docs/grpc/docs.md:3370 & docs/redoc/master/openapi.json:8680): `text_any` and `MatchTextAny` referenced

LGTM, approving.
lib/api/src/grpc/qdrant.rs (1)
6688-6690: Oneof tag extension to include 10 fully verified and backward-compatible

The new `text_any = 10` field is defined in points.proto, the gRPC ⇄ internal conversions include `TextAny`, and both the OpenAPI schema and gRPC docs expose `text_any`. This change is wire-safe and ready to merge.

docs/redoc/master/openapi.json (1)
8628-8630: MatchTextAny wiring verified — all layers updated consistently

All occurrences of `text_any` / `MatchTextAny` have been added as expected across the codebase:

- docs/redoc/master/openapi.json – component schema and union variant
- docs/grpc/docs.md – documentation table
- lib/api/src/grpc/proto/points.proto – oneof addition
- lib/api/src/grpc/conversions.rs & qdrant.rs – gRPC ↔ segment conversions
- lib/segment/src/types.rs – `MatchTextAny` struct and enum variants
- lib/segment/src/payload_storage/condition_checker.rs – payload matching logic
- lib/segment/src/index/query_optimization/condition_converter/match_converter.rs – query converter
- lib/segment/src/index/field_index/full_text_index/text_index.rs – `parse_text_any_query` implementation
- lib/collection/src/problems/unindexed_field.rs – index inference for `TextAny`

No missing references or inconsistencies detected. Approving additive API change.
lib/segment/src/payload_storage/condition_checker.rs (1)
9-11: Import of MatchTextAny looks correct and consistent

The new enum variant and struct are imported alongside existing Match types. No issues spotted.
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
449-452: ParsedQuery dispatch extended correctly for AllTokens and AnyTokens

Filter and check_match routing are clear and maintain previous behavior for phrases and AND-subset queries.
Also applies to: 474-477
lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2)
231-235: Filter dispatch extended with AllTokens/AnyTokens: looks good

The OR path is boxed; the phrase path is unchanged. No issues spotted.
251-275: check_match AnyTokens branch: LGTM

Uses `doc.has_any(query)`, mirroring the has_subset/has_phrase patterns. Consistent and clear.
lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (1)
36-68: Intersection logic is sound and leverages PostingIterator semantics

This uses the smallest-posting optimization plus `advance_until_greater_or_equal`, which by design may return the same value multiple times; your equality check handles that correctly. Looks good.
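As a sanity check, the smallest-posting strategy can be modeled over plain sorted vectors (the real code advances compressed iterators with `advance_until_greater_or_equal` rather than binary searching):

```rust
// Intersect sorted postings by scanning the shortest list and probing the rest.
fn intersect(mut postings: Vec<Vec<u32>>) -> Vec<u32> {
    postings.sort_by_key(|p| p.len()); // smallest posting first
    let (first, rest) = postings.split_first().expect("at least one posting");
    first
        .iter()
        .copied()
        .filter(|id| rest.iter().all(|p| p.binary_search(id).is_ok()))
        .collect()
}

fn main() {
    let postings = vec![vec![1, 2, 3, 7], vec![2, 3, 9], vec![3, 7, 9]];
    // Only id 3 is present in all three postings.
    assert_eq!(intersect(postings), vec![3]);
    println!("ok");
}
```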
lib/segment/src/index/field_index/full_text_index/text_index.rs (2)
306-307: Rename to AllTokens is consistent

Switching `parse_text_query` to return `ParsedQuery::AllTokens` maintains the previous AND semantics. No issues.
376-387: Payload checker paths cover AllTokens and AnyTokens correctly

Local payload matching now mirrors set semantics via `has_subset` and `has_any`. This aligns with the new ParsedQuery variants.
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)
62-69: TokenSet::has_any is correct and efficient

Binary-search membership with sorted unique tokens is appropriate. Early return on an empty subset preserves existing API semantics.
157-161: ParsedQuery gains AnyTokens while preserving AllTokens semantics

The new AnyTokens variant is a clean extension and keeps intent obvious.
229-238: Cardinality dispatch includes AnyTokens

Routing AnyTokens to `estimate_has_any_cardinality` is the right split.
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6)
15-17: Importing merge_compressed_postings_iterator is necessary for OR semantics

Good decomposition; keeps iterator utilities centralized.
98-135: filter_has_any: union via k-way merge is correct; dedup still needed upstream

This correctly:
- filters out deleted points (point_to_tokens_count > 0),
- collects available postings for tokens,
- returns empty for empty query,
- merges via merge_compressed_postings_iterator.
Once dedup is added in merge_compressed_postings_iterator, this will produce unique ids for OR semantics.
167-193: check_has_any: fast per-point verification

Short-circuits on empty tokens and absent documents, and uses `postings.visitor().contains`. Looks correct.
285-288: Filter dispatch wired for AnyTokens

Correctly routes AnyTokens to `filter_has_any`.

303-308: check_match dispatch wired for AnyTokens

Correctly routes AnyTokens to `check_has_any`.
450-465: Positions compression refactor: clearer and equivalent

The reflowed map/collect preserves semantics and reads better. The assert remains, which is good.
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (4)
10-12: Types import updated to include MatchTextAny

Necessary for the new variant handling. Good.
21-29: Dispatch adds TextAny; aligns with new query type enum

This integrates TextAny into condition conversion without affecting existing cases.
258-263: Local TextQueryType is a clean replacement for the const-generic

Keeps runtime dispatch simple and avoids template bloat.
265-277: get_match_text_checker: TextAny path correctly uses parse_text_any_query

On parse failure, returning a checker that's always false avoids `None` propagation and simplifies callers. Good choice.
Resolved review threads:

- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
```rust
/// What this function does:
///
/// * For each condition, it calculates the probability that an item does not match it: ((total - x) / total).
/// * It multiplies these probabilities to get the probability that an item matches none of the conditions.
/// * Subtracts this from 1 to get the probability that an item matches at least one condition.
/// * Multiplies this probability by the total number of items and rounds to get the expected count.
pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
    let element_not_hit_prob: f64 = estimations
        .map(|x| (total - x) as f64 / (total as f64))
        .product();
    let element_hit_prob = 1.0 - element_not_hit_prob;
    (element_hit_prob * (total as f64)).round() as usize
}
```
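As a sanity check, the formula above can be evaluated by hand for total = 10 and per-condition estimates [2, 3]:

```rust
// Verbatim logic from the function quoted above, checked on small numbers.
fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
    let element_not_hit_prob: f64 = estimations
        .map(|x| (total - x) as f64 / (total as f64))
        .product();
    let element_hit_prob = 1.0 - element_not_hit_prob;
    (element_hit_prob * (total as f64)).round() as usize
}

fn main() {
    // not-hit = (8/10) * (7/10) = 0.56; hit = 0.44; 0.44 * 10 = 4.4 rounds to 4.
    assert_eq!(expected_should_estimation([2usize, 3].into_iter(), 10), 4);
    // Note: total == 0 would divide by zero here, which is the edge case
    // flagged in the review comment below.
    println!("ok");
}
```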
|
Fix division by zero and clamp probabilities in expected_should_estimation.
If total == 0, current code divides by zero. Also, floating rounding could push exp slightly beyond total. Guard these cases.
Apply this diff:
```diff
-pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
-    let element_not_hit_prob: f64 = estimations
-        .map(|x| (total - x) as f64 / (total as f64))
-        .product();
-    let element_hit_prob = 1.0 - element_not_hit_prob;
-    (element_hit_prob * (total as f64)).round() as usize
-}
+pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
+    if total == 0 {
+        return 0;
+    }
+    let element_not_hit_prob: f64 = estimations
+        .map(|x| {
+            let x = x.min(total);
+            (total - x) as f64 / (total as f64)
+        })
+        .product::<f64>()
+        .clamp(0.0, 1.0);
+    let element_hit_prob = 1.0 - element_not_hit_prob;
+    ((element_hit_prob * (total as f64)).round() as usize).min(total)
+}
```

Optional: also clamp `exp` to `max` in `combine_should_estimations` for additional safety.
🤖 Prompt for AI Agents
In lib/segment/src/index/query_estimator.rs around lines 66 to 79, fix
division-by-zero and out-of-range rounding: if total == 0 return 0 immediately;
when computing per-condition probability ensure you treat (total - x)/total
safely by using total as f64 only after checking total>0; after multiplying to
get element_not_hit_prob clamp it to [0.0, 1.0], compute element_hit_prob = 1.0
- element_not_hit_prob, clamp element_hit_prob to [0.0,1.0], then compute exp =
(element_hit_prob * total as f64).round() as isize and finally clamp the
returned usize between 0 and total (cast back to usize). Also optionally clamp
results in combine_should_estimations to not exceed total for extra safety.
Force-pushed from 5700c54 to ff3f67c.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/segment/src/index/query_estimator.rs (1)
94-100: Clamp exp to [min, max] to preserve CardinalityEstimation invariants

expected_should_estimation can exceed realistic bounds. Ensure `exp` lies between the computed min and max to avoid inconsistent estimations.
```diff
-    let expected_count = expected_should_estimation(estimations.iter().map(|x| x.exp), total);
-    CardinalityEstimation {
-        primary_clauses: clauses,
-        min: estimations.iter().map(|x| x.min).max().unwrap_or(0),
-        exp: expected_count,
-        max: min(estimations.iter().map(|x| x.max).sum(), total),
-    }
+    let min_count = estimations.iter().map(|x| x.min).max().unwrap_or(0);
+    let max_count = min(estimations.iter().map(|x| x.max).sum(), total);
+    let expected_count =
+        expected_should_estimation(estimations.iter().map(|x| x.exp), total)
+            .min(max_count)
+            .max(min_count);
+    CardinalityEstimation {
+        primary_clauses: clauses,
+        min: min_count,
+        exp: expected_count,
+        max: max_count,
+    }
```
♻️ Duplicate comments (3)
lib/segment/src/index/query_estimator.rs (1)
66-78: Harden expected_should_estimation against total == 0 and out-of-range probabilities
Division by zero and rounding drift can produce invalid estimates. Clamp probabilities and bound the result.
Apply this diff:
```diff
-pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
-    let element_not_hit_prob: f64 = estimations
-        .map(|x| (total - x) as f64 / (total as f64))
-        .product();
-    let element_hit_prob = 1.0 - element_not_hit_prob;
-    (element_hit_prob * (total as f64)).round() as usize
-}
+pub fn expected_should_estimation(estimations: impl Iterator<Item = usize>, total: usize) -> usize {
+    if total == 0 {
+        return 0;
+    }
+    let element_not_hit_prob: f64 = estimations
+        .map(|x| {
+            let x = x.min(total);
+            (total - x) as f64 / (total as f64)
+        })
+        .product::<f64>()
+        .clamp(0.0, 1.0);
+    let element_hit_prob = (1.0 - element_not_hit_prob).clamp(0.0, 1.0);
+    ((element_hit_prob * (total as f64)).round() as usize).min(total)
+}
```
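To make the intended semantics concrete, here is a standalone sketch of the hardened independence-based OR estimator (the name and shape are illustrative; the real function lives in query_estimator.rs):

```rust
/// Independence-based OR estimate: P(hit) = 1 - product of (total - x_i) / total.
/// Sketch of the hardened version suggested above; not the production code.
fn expected_or_estimation(posting_lens: &[usize], total: usize) -> usize {
    if total == 0 {
        return 0; // avoid division by zero
    }
    let not_hit: f64 = posting_lens
        .iter()
        .map(|&x| {
            let x = x.min(total); // guard against out-of-range inputs
            (total - x) as f64 / total as f64
        })
        .product::<f64>()
        .clamp(0.0, 1.0);
    let hit = (1.0 - not_hit).clamp(0.0, 1.0);
    ((hit * total as f64).round() as usize).min(total)
}
```

Two tokens each matching 50 of 100 points give 1 - 0.5 * 0.5 = 0.75, i.e. an expected 75 points; total == 0 and oversized posting lengths no longer misbehave.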
19-21: Correct import for OR-merge path
Bringing in merge_compressed_postings_iterator alongside intersection is right for AnyTokens. No external dedup is needed; merge handles it in this codebase as discussed.
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
98-135: Immutable AnyTokens filtering looks correct
- Drops unseen tokens for OR semantics.
- Empty query short-circuits.
- Uses merge_compressed_postings_iterator with active-point filter to exclude deleted docs.
🧹 Nitpick comments (3)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (2)
258-263: Derive Copy/Clone (and Debug/Eq) for TextQueryType.
Saves moves and eases debugging without cost. Minor ergonomics win.
```diff
-enum TextQueryType {
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+enum TextQueryType {
```
264-269: Consider taking &str instead of String for flexibility (optional).
You're consuming String from Match payloads, so this isn't a hot path concern. However, if future callers want to reuse borrowed strings, accepting &str (or impl AsRef<str>) would avoid forcing ownership at the API boundary.
If you want to keep ownership for safety against parse lifetimes, ensure parse_* returns an owned query that does not borrow from the input text. If that’s guaranteed, you can switch to &str here safely.
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (1)
427-428: Tests only cover AllTokens; add AnyTokens coverage
Given the new AnyTokens semantics, add congruence tests for OR queries across Mutable/Immutable/Mmap and assertion tests for estimate_has_any_cardinality to lock in min/exp/max invariants.
I can draft rstest cases mirroring the existing congruence tests but constructing ParsedQuery::AnyTokens and random token sets. Want me to add them to this PR?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (17)
- docs/grpc/docs.md (1 hunks)
- docs/redoc/master/openapi.json (2 hunks)
- lib/api/src/grpc/conversions.rs (2 hunks)
- lib/api/src/grpc/proto/points.proto (1 hunks)
- lib/api/src/grpc/qdrant.rs (2 hunks)
- lib/collection/src/problems/unindexed_field.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (5 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (1 hunks)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (2 hunks)
- lib/segment/src/index/query_estimator.rs (2 hunks)
- lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3 hunks)
- lib/segment/src/payload_storage/condition_checker.rs (2 hunks)
- lib/segment/src/types.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
- lib/api/src/grpc/proto/points.proto
- lib/collection/src/problems/unindexed_field.rs
- docs/grpc/docs.md
- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs
- lib/api/src/grpc/conversions.rs
- lib/api/src/grpc/qdrant.rs
- docs/redoc/master/openapi.json
- lib/segment/src/payload_storage/condition_checker.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
- lib/segment/src/index/field_index/full_text_index/text_index.rs
- lib/segment/src/types.rs
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-08-20T15:03:18.501Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:82-100
Timestamp: 2025-08-20T15:03:18.501Z
Learning: The merge_postings_iterator function in lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs handles deduplication internally, so there's no need to add .dedup() when using it for OR-merge operations in inverted indexes.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-08-20T15:03:25.618Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs:214-247
Timestamp: 2025-08-20T15:03:25.618Z
Learning: The merge_compressed_postings_iterator function in the full text inverted index already handles deduplication internally, so external deduplication calls like .dedup() are not needed when using this function.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code Graph Analysis (4)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (3)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4): intersect_compressed_postings_iterator (36-68), merge_compressed_postings_iterator (70-85), postings (49-52), postings (118-121)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5): filter_has_any (99-135), filter (279-289), merge (110-127), check_has_any (167-193), check_any (177-187)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3): filter_has_any (82-99), filter (225-235), new (27-35)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (2)
- lib/segment/src/index/field_index/full_text_index/tests/test_congruence.rs (2): index (273-278), index (444-446)
- lib/segment/src/index/field_index/field_index_base.rs (2): full_text_index (187-187), full_text_index (190-190)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
- lib/segment/src/index/query_estimator.rs (1): expected_should_estimation (72-78)
- lib/segment/src/index/field_index/full_text_index/text_index.rs (5): tokens (305-305), tokens (320-320), points_count (139-145), new (457-478), new (638-644)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1): points_count (324-326)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1): points_count (515-517)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (3)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4): intersect_compressed_postings_iterator (36-68), merge_compressed_postings_iterator (70-85), postings (49-52), postings (118-121)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (5): filter_has_any (215-247), filter (443-453), merge (222-239), check_has_any (284-310), check_any (294-304)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2): filter_has_any (82-99), filter (225-235)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: e2e-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (16)
lib/segment/src/index/query_optimization/condition_converter/match_converter.rs (3)
10-12: Importing MatchTextAny is correct and consistent with new API surface.
The import updates align with the new Match::TextAny variant and payload type. No issues here.

21-29: Routing Match::Text/TextAny/Phrase via TextQueryType is clean and extensible.
Nice refactor from const generics to a runtime enum; it simplifies branching and keeps call sites tidy. The new Match::TextAny arm is correctly wired to the shared text checker path.
273-281: Confirm parsing fallback semantics
The current implementation in get_match_text_checker returns an always-false checker on any parse_* failure, rather than propagating None. Since parse_text_query/parse_phrase_query (and likely parse_text_any_query) return None not only for empty input but also when there are unseen tokens, this change will suppress any higher-level fallback (e.g., payload scanning) and may drop legitimate matches.

- File: lib/segment/src/index/query_optimization/condition_converter/match_converter.rs
- Lines 273-281: revert to returning None on parse failure to preserve downstream fallbacks.

Suggested diff:

```diff
-        let Some(parsed_query) = query_opt else {
-            return Some(Box::new(|_| false));
-        };
+        let Some(parsed_query) = query_opt else {
+            return None;
+        };
```

lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (4)
214-247: AnyTokens iterator wiring looks good
- Using filter_map to drop unknown tokens is correct for OR semantics.
- Empty postings short-circuits to empty iterator.
- Using self.is_active for filtering aligns with mmap’s “postings include deleted” model.
284-310: Per-point AnyTokens check is sound
Early exits on empty query or empty values and per-token visitor checks are consistent with the rest of the implementation. Unwraps are safe given token IDs come from the index's vocab.

449-452: Filter enum wiring: AllTokens/AnyTokens routing is correct
AllTokens -> subset, AnyTokens -> OR-merge, Phrase -> phrase iterator. Matches the intended semantics.

472-477: check_match wiring for AnyTokens is correct
The new branch routes to check_has_any as expected.
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)
62-69: TokenSet::has_any API is appropriate
Symmetric with has_subset: returns false on empty subset and uses binary-search membership. Good.
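A minimal sketch of the semantics being reviewed, assuming a sorted-vec token set (the real TokenSet lives in inverted_index/mod.rs; this stand-in is illustrative only):

```rust
/// Illustrative token set over sorted, deduplicated token ids.
struct TokenSet(Vec<u32>);

impl TokenSet {
    fn new(mut tokens: Vec<u32>) -> Self {
        tokens.sort_unstable();
        tokens.dedup();
        TokenSet(tokens)
    }

    fn contains(&self, token: u32) -> bool {
        // Sorted storage makes membership a binary search.
        self.0.binary_search(&token).is_ok()
    }

    /// AND semantics: every token must be present (an empty set is trivially contained).
    fn has_subset(&self, subset: &TokenSet) -> bool {
        subset.0.iter().all(|t| self.contains(*t))
    }

    /// OR semantics: at least one token must be present; empty input matches nothing.
    fn has_any(&self, subset: &TokenSet) -> bool {
        if subset.0.is_empty() {
            return false;
        }
        subset.0.iter().any(|t| self.contains(*t))
    }
}
```

The asymmetry on empty input (has_subset true, has_any false) mirrors the documented behavior above.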
152-161: ParsedQuery variants: AllTokens and AnyTokens split is clear
The semantics are now explicit (AND via AllTokens, OR via AnyTokens). This should simplify downstream dispatch.

229-237: estimate_cardinality dispatch updated correctly
AllTokens -> subset estimator; AnyTokens -> new has-any estimator. Good separation of concerns.
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (6)
15-17: Correctly importing merge_compressed_postings_iterator
The additional import is needed for OR semantics; good.

167-193: check_has_any implementation is consistent and efficient
Early exits and per-token visitor checks are fine. Indexing into postings is acceptable here given vocab consistency.

279-288: Filter routing: AllTokens/AnyTokens
Wiring is correct and mirrors the mmap/mutable variants.

305-308: check_match routing updated for AnyTokens
Matches the new predicate methods. Good.

449-465: Formatting-only change in create_compressed_postings_with_positions
No semantic changes detected.
467-503: From<&MmapInvertedIndex> conversion is solid
- Postings copied via PostingListView::to_owned.
- Vocab reconstructed correctly.
- point_to_tokens_count and points_count preserved.
- Debug assertion ensures postings/vocab sizes match.
Actionable comments posted: 0
🧹 Nitpick comments (6)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (4)
181-186: Nit: Use PointOffsetType in the filter signature for consistency.
The local intersection helper uses impl Fn(u32) -> bool; elsewhere in this file and codebase we use the alias PointOffsetType. Minor readability/consistency fix:

```diff
-        filter: impl Fn(u32) -> bool + 'a,
+        filter: impl Fn(PointOffsetType) -> bool + 'a,
```
214-247: Optional: Align return type with filter_has_all to simplify call sites.
filter_has_any returns impl Iterator<Item = PointOffsetType> (materialized as Either<_, _>), while filter_has_all returns Box<dyn Iterator>. Unifying both to Box<dyn Iterator> keeps the public surface consistent and avoids exposing the Either shape through impl Iterator. Suggested change:

```diff
-    fn filter_has_any<'a>(
-        &'a self,
-        tokens: TokenSet,
-    ) -> impl Iterator<Item = PointOffsetType> + 'a {
+    fn filter_has_any<'a>(
+        &'a self,
+        tokens: TokenSet,
+    ) -> Box<dyn Iterator<Item = PointOffsetType> + 'a> {
@@
-        match &self.storage.postings {
-            MmapPostingsEnum::Ids(postings) => Either::Left(merge(postings, tokens, is_active)),
-            MmapPostingsEnum::WithPositions(postings) => {
-                Either::Right(merge(postings, tokens, is_active))
-            }
-        }
+        match &self.storage.postings {
+            MmapPostingsEnum::Ids(postings) => Box::new(merge(postings, tokens, is_active)),
+            MmapPostingsEnum::WithPositions(postings) => {
+                Box::new(merge(postings, tokens, is_active))
+            }
+        }
```

If you apply this, also adjust the filter() call site (see comment on Lines 443-453).
284-310: Defensive check: avoid unwrap on postings.get(token_id).
While TokenSet should contain only known tokens, an upstream invariant break would panic here. Add a debug assertion and a safe fallback:

```diff
-        tokens.tokens().iter().any(|token_id| {
-            let posting_list = postings.get(*token_id).unwrap();
-            posting_list.visitor().contains(point_id)
-        })
+        tokens.tokens().iter().any(|token_id| {
+            debug_assert!(
+                postings.get(*token_id).is_some(),
+                "TokenId {} not found in postings",
+                token_id
+            );
+            postings
+                .get(*token_id)
+                .map(|pl| pl.visitor().contains(point_id))
+                .unwrap_or(false)
+        })
```

Would you like a quick repo scan to confirm that all TokenSet constructors indeed filter out unknown tokens before reaching Mmap index checks?
443-453: Dispatch wiring looks correct; update call if you box filter_has_any.
- AllTokens → filter_has_all and AnyTokens → filter_has_any is correct.
- If you adopt the optional change to return Box<dyn Iterator> from filter_has_any, adjust this arm to avoid double boxing:

```diff
-            ParsedQuery::AnyTokens(tokens) => Box::new(self.filter_has_any(tokens)),
+            ParsedQuery::AnyTokens(tokens) => self.filter_has_any(tokens),
```

lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (2)
167-193: Defensive check: direct indexing can panic; add a debug assert (or use get).
postings[*token_id as usize] assumes all token IDs are valid. To guard invariants without cost in release builds:

```diff
-        tokens.tokens().iter().any(|token_id| {
-            let posting_list = &postings[*token_id as usize];
-            posting_list.visitor().contains(point_id)
-        })
+        tokens.tokens().iter().any(|token_id| {
+            debug_assert!(
+                (*token_id as usize) < postings.len(),
+                "TokenId {} out of bounds",
+                token_id
+            );
+            let posting_list = &postings[*token_id as usize];
+            posting_list.visitor().contains(point_id)
+        })
```

Alternatively, switch to postings.get(..).map(...).unwrap_or(false) if you prefer a non-panicking path.
Confirm that all TokenSet producers filter out unknown tokens for the immutable index as well.
285-288: Nit: Rename the pattern variable to reflect its type.
In ParsedQuery::Phrase(tokens), the binding is a phrase/document, not tokens. Improves readability:

```diff
-            ParsedQuery::Phrase(tokens) => Box::new(self.filter_has_phrase(tokens)),
+            ParsedQuery::Phrase(phrase) => Box::new(self.filter_has_phrase(phrase)),
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (6 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (7 hunks)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-08-20T15:03:18.501Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:82-100
Timestamp: 2025-08-20T15:03:18.501Z
Learning: The merge_postings_iterator function in lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs handles deduplication internally, so there's no need to add .dedup() when using it for OR-merge operations in inverted indexes.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-08-20T15:03:25.618Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs:214-247
Timestamp: 2025-08-20T15:03:25.618Z
Learning: The merge_compressed_postings_iterator function in the full text inverted index already handles deduplication internally, so external deduplication calls like .dedup() are not needed when using this function.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-06-28T19:25:53.448Z
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Applied to files:
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
📚 Learning: 2025-05-26T14:50:13.183Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs
🧬 Code Graph Analysis (2)
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (4)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4): intersect_compressed_postings_iterator (37-71), merge_compressed_postings_iterator (73-85), postings (50-53), postings (119-122)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (3): filter_has_all (58-80), filter_has_any (82-99), new (27-35)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (6): filter_has_all (174-212), filter_has_any (215-247), is_active (164-171), merge (222-239), check_has_any (284-310), check_any (294-304)
- lib/posting_list/src/posting_list.rs (2): view (88-105), iter (112-114)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (3)
- lib/segment/src/index/field_index/full_text_index/inverted_index/postings_iterator.rs (4): intersect_compressed_postings_iterator (37-71), merge_compressed_postings_iterator (73-85), postings (50-53), postings (119-122)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (2): filter_has_all (58-80), filter_has_any (82-99)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5): filter_has_all (53-96), filter_has_any (99-135), merge (110-127), check_has_any (167-193), check_any (177-187)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: lint
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: e2e-tests
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: storage-compat-test
🔇 Additional comments (7)
lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (2)
19-21: LGTM: Correctly imports merge_compressed_postings_iterator; no extra dedup needed.
The OR-merge path is wired to merge_compressed_postings_iterator, which already performs k-merge plus in-stream dedup. No external .dedup() is necessary (and would be redundant).
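For intuition, an OR-merge with in-stream dedup over sorted posting lists can be sketched as below. This materializes the result for clarity; the real merge_compressed_postings_iterator is lazy and operates on compressed postings, so treat this only as a model of the semantics:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of sorted posting lists with in-stream deduplication.
fn merge_postings(postings: &[Vec<u32>]) -> Vec<u32> {
    // Min-heap of (next id, source list index, position within that list).
    let mut heap = BinaryHeap::new();
    for (list_idx, list) in postings.iter().enumerate() {
        if let Some(&first) = list.first() {
            heap.push(Reverse((first, list_idx, 0usize)));
        }
    }
    let mut out: Vec<u32> = Vec::new();
    while let Some(Reverse((id, list_idx, pos))) = heap.pop() {
        // Output is produced in sorted order, so comparing against the last
        // emitted id is enough to deduplicate without a separate .dedup() pass.
        if out.last() != Some(&id) {
            out.push(id);
        }
        if let Some(&next) = postings[list_idx].get(pos + 1) {
            heap.push(Reverse((next, list_idx, pos + 1)));
        }
    }
    out
}
```

This is why the external .dedup() flagged in earlier review rounds would be redundant.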
472-478: LGTM: check_match dispatch covers AnyTokens.
The new AnyTokens path correctly leverages check_has_any. Phrase and AllTokens routes remain intact.
lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (5)
15-17: LGTM: Imported merge_compressed_postings_iterator for OR semantics.
Matches the new AnyTokens path and relies on built-in dedup in the merge iterator.
53-96: LGTM: All-tokens intersection is correct and efficiently prunes on missing postings.
Use of Option<Vec<_>> to enforce "all tokens must have postings" aligns with AND semantics.
98-135: LGTM: Any-tokens merge uses compressed merge with dedup and active-point filter.
The OR path filters out inactive points and returns empty for an empty postings set. Implementation mirrors mmap and mutable variants.

305-308: LGTM: check_match dispatch includes AnyTokens via check_has_any.
Matches the filter-side wiring and preserves existing Phrase/AllTokens behavior.
450-465: LGTM: Positions materialization is consistent with remapped token IDs.
Enumerating new-token IDs over optimized postings and pulling positions via point_to_tokens_positions[id].remove(&token) ensures correctness; the expect is appropriate for invariant violations.
Actionable comments posted: 0
🧹 Nitpick comments (3)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (3)
62-69: Minor naming/doc tweak for clarity in has_any.
The behavior is correct. Consider renaming subset to candidates (or similar) to better reflect "any-of" semantics, and expand the doc comment to note that empty input intentionally returns false (aligns with how empty queries are treated upstream).

```diff
-    /// Checks if the current set contains any of the given tokens.
-    /// Returns false if the subset is empty
-    pub fn has_any(&self, subset: &TokenSet) -> bool {
-        if subset.is_empty() {
+    /// Checks if the current set contains any of the given tokens.
+    /// Returns `false` if `candidates` is empty (empty query is treated as no-op/unsatisfied).
+    pub fn has_any(&self, candidates: &TokenSet) -> bool {
+        if candidates.is_empty() {
             return false;
         }
-        subset.0.iter().any(|token| self.contains(token))
+        candidates.0.iter().any(|token| self.contains(token))
     }
```
286-324: Clamp exp into [min, max] to preserve invariants and guard rounding noise.
The independence-based estimator can be susceptible to floating-point rounding. While largest_posting <= points_count holds by design (per prior team guidance), clamping the expected value to [min, max] keeps the min <= exp <= max invariant intact even under numerical noise.
Using the earlier learning that posting lengths are naturally bounded by the total number of points, we intentionally do NOT clamp min against points_count.
Apply:

```diff
-        let exp = expected_should_estimation(posting_lengths.into_iter(), points_count);
+        let exp_raw = expected_should_estimation(posting_lengths.into_iter(), points_count);
+        let min_val = largest_posting;
+        let max_val = min(sum, points_count);
+        let exp = exp_raw.clamp(min_val, max_val);
         CardinalityEstimation {
             primary_clauses: vec![PrimaryCondition::Condition(Box::new(condition.clone()))],
-            min: largest_posting,
-            exp,
-            max: min(sum, points_count),
+            min: min_val,
+            exp,
+            max: max_val,
         }
```

Optional follow-up: add a tiny invariant test that checks min <= exp <= max for both AllTokens and AnyTokens on synthetic data.
430-439: Make to_parsed_query_any resilient to unknown tokens (keep known ones).
For OR-semantics, unknown tokens should simply be ignored, not invalidate the query. This improves realism and widens test coverage, matching the estimator's filter_map behavior.

```diff
 fn to_parsed_query_any(
     query: Vec<String>,
     token_to_id: impl Fn(String) -> Option<TokenId>,
 ) -> Option<ParsedQuery> {
-    let tokens = query
-        .into_iter()
-        .map(token_to_id)
-        .collect::<Option<TokenSet>>()?;
-    Some(ParsedQuery::AnyTokens(tokens))
+    let tokens: TokenSet = query
+        .into_iter()
+        .filter_map(token_to_id)
+        .collect();
+    if tokens.is_empty() {
+        return None; // all unknown -> no-op
+    }
+    Some(ParsedQuery::AnyTokens(tokens))
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (8 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-20T20:26:48.745Z
Learnt from: coszio
PR: qdrant/qdrant#7100
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs:286-324
Timestamp: 2025-08-20T20:26:48.745Z
Learning: In inverted index implementations, posting list lengths are naturally bounded by the total number of points/documents, so `largest_posting <= points_count` is guaranteed by the data structure's design and doesn't need explicit clamping.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-05-26T14:47:23.505Z
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
📚 Learning: 2025-07-11T11:35:21.549Z
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.549Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Applied to files:
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs
🧬 Code graph analysis (1)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (4)
- lib/segment/src/index/query_estimator.rs (1)
  - expected_should_estimation (75-87)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mmap_inverted_index/mod.rs (1)
  - points_count (515-517)
- lib/segment/src/index/field_index/full_text_index/inverted_index/immutable_inverted_index.rs (1)
  - points_count (324-326)
- lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs (1)
  - points_count (287-289)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
🔇 Additional comments (8)
lib/segment/src/index/field_index/full_text_index/inverted_index/mod.rs (8)
10-10: LGTM: `min` import is appropriate for capping the upper bound.

Used to cap `max` with `min(sum, points_count)` in OR-cardinality. No concerns.

20-20: Good reuse of `expected_should_estimation` for OR semantics.

Pulling the shared estimator keeps logic centralized and consistent across estimations.

229-237: Dispatch in `estimate_cardinality` looks correct.

`AllTokens` reuses the subset estimator; `AnyTokens` uses the new OR estimator; `Phrase` is unchanged. Good separation.

427-428: Tests updated to use the `AllTokens` variant — good.

Keeps tests aligned with the enum rename.

607-613: Nice: exercising both `AllTokens` and `AnyTokens` in mutable index queries.

Once `to_parsed_query_any` ignores unknown tokens, this will cover more realistic OR cases.

617-625: Mmap path dual-query coverage looks good.

Mirrors the mutable path; congruence checks will benefit from the proposed OR-parser tweak.

630-639: Immutable-mmap path also covered — congruent triplet maintained.

No issues.

157-161: All `AnyTokens` arms accounted for across indexes

I've verified that every implementation of `InvertedIndex::filter` and the related estimation helpers now includes the `AnyTokens` branch:

- Parsing in `text_index.rs` produces `ParsedQuery::AnyTokens` and uses `has_any` on the token set.
- `mutable_inverted_index.rs` – both `filter` (line 233) and the point-level check (line 267) match on `ParsedQuery::AnyTokens`.
- `immutable_inverted_index.rs` – the `filter` method (line 287) and the `check_has_any` call (line 307) cover `AnyTokens`.
- `mmap_inverted_index/mod.rs` – the `AnyTokens` arm is handled in both filter (line 451) and check (line 476).
- In `inverted_index/mod.rs`, the enum, the parser (`Some(ParsedQuery::AnyTokens)` at line 438), and `estimate_has_any_cardinality` (line 236) are all present.

No missing cases detected.
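The cardinality bounds under review can be illustrated with a toy model. This is a sketch under an independence assumption, with a hypothetical function name — not Qdrant's exact `expected_should_estimation` — but it shows why the minimum is the largest posting and the maximum is `min(sum, points_count)`:

```python
def estimate_or_cardinality(posting_lengths, points_count):
    """Rough OR-cardinality estimate for a set of token postings."""
    # Lower bound: the OR matches at least the largest single posting.
    minimum = max(posting_lengths, default=0)
    # Upper bound: at most the sum of posting lengths, capped by total points.
    maximum = min(sum(posting_lengths), points_count)
    # Expected value assuming tokens occur independently: a point misses
    # the OR only if it misses every token.
    p_none = 1.0
    for length in posting_lengths:
        p_none *= 1.0 - length / points_count
    expected = round(points_count * (1.0 - p_none))
    return minimum, expected, maximum
```

With postings of 10 and 20 tokens over 100 points, the estimate is bounded to [20, 30], which matches the learning above that a posting list can never exceed `points_count`.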
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/openapi/test_fts.py (2)
253-254: Don't request payload you don't use.

You only read `ids`; skip payload to reduce response size and speed up the test.

```diff
-        "with_payload": True,
+        "with_payload": False,
```

Also applies to: 278-279

268-303: Strengthen the assertion with a non-empty check to make equality meaningful.

If both queries return empty sets, the equality assertion would trivially pass without exercising `text_any`. Ensure the result is non-empty.

```diff
-    response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    assert len(response_any_ids) > 0
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/openapi/test_fts.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/openapi/test_fts.py (3)
- tests/openapi/test_filter.py (1)
  - test_match_any (14-42)
- tests/openapi/conftest.py (1)
  - collection_name (50-51)
- tests/openapi/helpers/helpers.py (1)
  - request_with_validation (41-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: lint
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
🔇 Additional comments (1)
tests/openapi/test_fts.py (1)
227-243: Nice: explicit `word` tokenizer with lowercase + English stopwords.

Good call to include stopword removal so "robot of the star" reduces to the meaningful tokens for the equivalence check.
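For intuition, a toy sketch of what this tokenizer configuration does to a query string. The stopword set here is a tiny illustrative subset, not the full English list the server ships:

```python
ENGLISH_STOPWORDS = {"the", "of", "a", "an", "and", "or", "is", "in"}  # tiny subset

def word_tokenize(text, lowercase=True, stopwords=ENGLISH_STOPWORDS):
    # Split on whitespace, optionally lowercase, then drop stopwords.
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]
```

Under this configuration, "robot of the star" reduces to `["robot", "star"]`, which is exactly what makes the `text_any` vs. manually-split equivalence check below meaningful.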
```python
def test_match_any(collection_name):

    # Create index
    response = request_with_validation(
        api='/collections/{collection_name}/index',
        method="PUT",
        path_params={'collection_name': collection_name},
        query_params={'wait': 'true'},
        body={
            "field_name": "title",
            "field_schema": {
                "type": "text",
                "tokenizer": "word",
                "lowercase": True,
                "stopwords": "english"
            }
        }
    )
    assert response.ok
```
🛠️ Refactor suggestion
Avoid cross-test side effects: this test reconfigures the shared title text index. Restore it to prefix for subsequent tests.
Module autouse fixture sets up title with tokenizer: "prefix". This test changes it to word (lowercase + stopwords) and leaves it that way, which can break test_scroll_with_prefix that relies on prefix. Restore the original schema at the end of the test.
Minimal addition at the end of the test:
```diff
     assert response_any_ids == response_split_ids
+
+    # Restore original text index expected by other tests in this module (prefix tokenizer)
+    response = request_with_validation(
+        api='/collections/{collection_name}/index',
+        method="PUT",
+        path_params={'collection_name': collection_name},
+        query_params={'wait': 'true'},
+        body={
+            "field_name": "title",
+            "field_schema": {
+                "type": "text",
+                "tokenizer": "prefix",
+            }
+        }
+    )
+    assert response.ok
```

Alternative (cleaner isolation): create a temporary collection inside this test via `basic_collection_setup(collection_name=collection_name + "_text_any", ...)`, use it for the assertions, and drop it at the end. Happy to draft that if you prefer stricter isolation.
Also applies to: 302-303
@generall Thank you for confirming! The suggested fix will help prevent the test interference issue where test_match_any modifies the shared index configuration and could affect test_scroll_with_prefix.
```python
        "filter": {
            "must": {
                "key": "title",
                "match": {
                    "text_any": "robot of the star",
                }
            }
        }
    }
```
Fix schema: `must` must be an array of conditions, not a single object.

The OpenAPI schema expects `filter.must` to be a list. As written, this request will fail local request-body validation in `request_with_validation` before it even hits the server.
Apply this diff:
```diff
-        "filter": {
-            "must": {
-                "key": "title",
-                "match": {
-                    "text_any": "robot of the star",
-                }
-            }
-        }
+        "filter": {
+            "must": [
+                {
+                    "key": "title",
+                    "match": {
+                        "text_any": "robot of the star",
+                    }
+                }
+            ]
+        }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        "filter": {
            "must": [
                {
                    "key": "title",
                    "match": {
                        "text_any": "robot of the star",
                    }
                }
            ]
        }
    }
```
🤖 Prompt for AI Agents
In tests/openapi/test_fts.py around lines 255 to 263, the test builds a filter
where `must` is a single object but the OpenAPI schema requires `filter.must` to
be an array; update the request payload so `must` is a list containing the
existing condition object (i.e., replace the object with an array containing
that object) so it passes request-body validation.
nope, it can be single object
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Let's examine the OpenAPI schema to understand how 'must' filters are defined
jq '.components.schemas.Filter.properties.must' docs/redoc/master/openapi.json

# Also check if there are any schema definitions that show must can be a single object
rg -A5 -B5 "must.*oneOf\|must.*anyOf" docs/redoc/master/openapi.json

# Look for filter validation logic in the codebase
rg -n "must.*\[\]\|must.*array" --type rust
```
Length of output: 486
@generall Thank you for the correction! You're absolutely right. Looking at the OpenAPI schema, `filter.must` is defined with `anyOf`, allowing either:

- A single `Condition` object
- An array of `Condition` objects
- `null`
So the test code is perfectly valid as written. I apologize for the incorrect suggestion - the API is more flexible than I initially assessed.
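A client-side normalizer mirroring that `anyOf` is straightforward. The function name here is hypothetical, not part of any Qdrant client — it just demonstrates that both request shapes denote the same condition list:

```python
def normalize_must(filter_body):
    # filter.must may be a single condition object, a list of conditions,
    # or absent/None (per the anyOf in the schema).
    must = filter_body.get("must")
    if must is None:
        return []
    if isinstance(must, dict):
        return [must]
    return list(must)
```

Both `{"must": cond}` and `{"must": [cond]}` normalize to the same list, which is why the single-object form in the test is valid.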
🐰
oops!
(・_・;)
🧬 Code Graph Analysis Results
- File: tests/openapi/helpers/collection_setup.py, lines 8-14
```python
def drop_collection(collection_name='test_collection'):
    response = request_with_validation(
        api='/collections/{collection_name}',
        method="DELETE",
        path_params={'collection_name': collection_name},
    )
    assert response.ok
```
- File: tests/openapi/helpers/collection_setup.py, lines 122-226
"""
def basic_collection_setup(collection_name='test_collection',
on_disk_payload=False,
on_disk_vectors=False,
wal_capacity=None) -> None:
Summary:
- Drops any existing collection with the given name
- Creates a new collection with:
  - vectors: size 4, distance "Dot", on_disk based on on_disk_vectors
  - sparse_vectors: supports sparse-text
  - on_disk_payload: based on parameter
  - wal_config: includes wal_capacity_mb if provided
- GETs the collection to verify creation
- PUTs 10 points into /collections/{collection_name}/points with IDs 1..10, various vectors and payloads
- Waits for the operation to complete
"""
- File: tests/openapi/conftest.py, lines 50-51
def collection_name(request):
return request.node.name.removesuffix(".py")
- File: tests/openapi/conftest.py, lines 11-12
def on_disk_vectors(request):
return request.param
- File: tests/openapi/conftest.py, lines 16-17
def on_disk_payload(request):
return request.param
- File: tests/openapi/test_filter.py, lines 14-42
```python
def test_match_any(collection_name):
    response = request_with_validation(
        api='/collections/{collection_name}/points/search',
        method="POST",
        path_params={'collection_name': collection_name},
        body={
            "vector": [0.2, 0.1, 0.9, 0.7],
            "limit": 3,
            "filter": {
                "must": [
                    {
                        "key": "city",
                        "match": {
                            "any": ["London", "Moscow"]
                        }
                    }
                ]
            }
        }
    )
    assert response.ok
    json = response.json()
    assert len(json['result']) == 3
    ids = [x['id'] for x in json['result']]
    assert 2 in ids
    assert 3 in ids
    assert 4 in ids
```
✏️ Learnings added

Learnt from: generall
PR: qdrant/qdrant#7100
File: tests/openapi/test_fts.py:255-263
Timestamp: 2025-08-23T22:24:44.249Z
Learning: In Qdrant's OpenAPI schema, the `filter.must` field accepts either a single Condition object or an array of Condition objects, not just arrays. The schema uses `anyOf` to allow both formats.

> Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure [review instructions](https://docs.coderabbit.ai/guides/review-instructions). You can manage existing learnings in the [app](https://app.coderabbit.ai/learnings).
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/openapi/test_fts.py (4)
231-249: Prevent flakiness: fetch all matches for both queries instead of comparing two top-10 slices.

Comparing sets after limiting each scroll to 10 can diverge if there are >10 matches or ordering differs between `text_any` and the manual `should` variant. Make the test deterministic by fetching all potentially matching points.

Minimal change (use the known dataset size):

```diff
-        "limit": 10,
+        "limit": len(texts),
```

Apply in both requests (first at Lines 237-238, second at Lines 262-263).

Alternatively, make it future-proof regardless of dataset size by scrolling until exhaustion (helper outside the changed hunk for clarity):

```python
def scroll_all_ids(collection_name, filter_body):
    all_ids, offset = set(), None
    while True:
        resp = request_with_validation(
            api='/collections/{collection_name}/points/scroll',
            method="POST",
            path_params={'collection_name': collection_name},
            body={
                "offset": offset,
                "limit": 256,
                "with_payload": False,
                "with_vector": False,
                "filter": filter_body,
            }
        )
        assert resp.ok
        page = resp.json()['result']
        for p in page['points']:
            all_ids.add(p['id'])
        offset = page.get('next_page_offset')
        if not offset:
            break
    return all_ids
```

Use it for both variants before asserting equality.

Also applies to: 255-283

253-287: Strengthen the assertion to avoid a vacuous pass when both sides are empty.

If both queries return zero results, the equality assertion would pass. Add a quick non-empty check to ensure the test actually exercises the feature.

```diff
     assert response_any.ok
     response_any_ids = {point['id'] for point in response_any.json()['result']['points']}
+    assert response_any_ids, "text_any query unexpectedly returned no points"

     # Same request, but manually split into words
```

212-227: Optional: factor index creation into a tiny helper to reduce duplication.

Both tests create an index for the same field with different tokenizers. A helper reduces repetition and makes future schema tweaks (e.g., adding stemming) one-line changes.

Example helper (outside this hunk):

```python
def create_title_index(collection_name, tokenizer, **kwargs):
    body = {
        "field_name": "title",
        "field_schema": {"type": "text", "tokenizer": tokenizer, **kwargs}
    }
    resp = request_with_validation(
        api='/collections/{collection_name}/index',
        method="PUT",
        path_params={'collection_name': collection_name},
        query_params={'wait': 'true'},
        body=body,
    )
    assert resp.ok
```

Then call:

```python
create_title_index(collection_name, "word", lowercase=True, stopwords="english")
```

332-355: Consider mirroring the `text_any` vs. split-word check on `/points/search` as well.

You already verify the behavior for `/points/scroll`. Duplicating the equivalence check for `/points/search` (same filter, no vector relevance impact) would catch endpoint-specific regressions in matching logic.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tests/openapi/test_fts.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-23T22:24:44.249Z
Learnt from: generall
PR: qdrant/qdrant#7100
File: tests/openapi/test_fts.py:255-263
Timestamp: 2025-08-23T22:24:44.249Z
Learning: In Qdrant's OpenAPI schema, the `filter.must` field accepts either a single Condition object or an array of Condition objects, not just arrays. The schema uses `anyOf` to allow both formats.
Applied to files:
tests/openapi/test_fts.py
🧬 Code graph analysis (1)
tests/openapi/test_fts.py (3)
- tests/openapi/test_filter.py (1)
  - test_match_any (14-42)
- tests/openapi/conftest.py (1)
  - collection_name (50-51)
- tests/openapi/helpers/helpers.py (1)
  - request_with_validation (41-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test-consistency
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests-consensus
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: integration-tests
- GitHub Check: lint
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (2)
tests/openapi/test_fts.py (2)
241-247: `filter.must` as a single object is valid per OpenAPI `anyOf`.

Using a single Condition object (not an array) for `filter.must` is allowed and matches the schema. No change needed.

Note: This aligns with our prior learning that `must` accepts either a single object or an array.

291-305: Good isolation: the test creates its own prefix index.

Creating the `prefix` tokenizer index inside the test removes cross-test coupling. This addresses earlier concerns about shared mutable index configuration within the module.
* internal implementation for filter any * implement api for match `text_any` * allow naive implementation of text match any * dedup after kmerge + renames * improve `expected_should_estimation` * congruence test * openapi test * setup index in each test --------- Co-authored-by: Luis Cossío <luis.cossio@outlook.com>
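In set terms, the difference between the existing `text` match (AllTokens) and the new `text_any` match (AnyTokens) introduced by this PR boils down to AND vs. OR over the query's tokens. A minimal sketch — the real implementation runs over posting lists per token, not Python sets:

```python
def matches_all_tokens(doc_tokens, query_tokens):
    # AllTokens ("text"): every query token must occur in the document.
    return all(tok in doc_tokens for tok in query_tokens)

def matches_any_token(doc_tokens, query_tokens):
    # AnyTokens ("text_any"): at least one query token must occur.
    return any(tok in doc_tokens for tok in query_tokens)
```

For a document tokenized to `{"robot", "star"}`, the query tokens `{"robot", "wars"}` fail the AllTokens check but pass the AnyTokens check — which is exactly the union-merge (kmerge + dedup) behavior the commits above add to the three inverted-index variants.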
text_any

ToDo: