fix check for pure primary condition filter by generall · Pull Request #6840 · qdrant/qdrant

generall · 2025-07-09T23:22:15Z

Fix total_conditions_count, so it doesn't count nested must conditions as multiple conditions
Change PrimaryCondition type to store external ids for: a) easier comparison with original condition b) avoiding mapping of the conditions, which might not be even used
Change criteria for checking if filter consists only of primary conditions. Previous problem: min_should can generate more primary conditions then there are conditions defined (by duplicating conditions)

coderabbitai · 2025-07-09T23:24:45Z

📝 Walkthrough

Walkthrough

The changes remove a Clippy lint attribute from the remove_point method of MutableFullTextIndex without altering its logic. A new method is_primary is added to the CardinalityEstimation struct to check if a given Condition matches any primary clause, with PrimaryCondition::Ids updated to use PointIdType wrapped in MaybeArc. The handling of IDs in query filtering and cardinality estimation methods is adjusted to map external IDs properly and refine primary condition checks. The counting logic for sub-conditions in the Condition enum and total conditions in the Filter struct is revised, particularly for the HasVector variant. Additionally, the Clone trait implementation for MaybeArc is changed from derived to manual, and a clippy lint attribute on StorageType is modified. A new test verifies filtering behavior with min_should conditions.

Possibly related PRs

[phrase match] Introduce ordered document #6486: Modifies full-text inverted index methods including renaming and restructuring removal methods, related by touching the remove/remove_point method in full-text index code.
Add mutable full text payload index on top of Gridstore #6639: Refactors the MutableFullTextIndex struct and its remove_point method to support Gridstore alongside RocksDB, overlapping with changes to the same method and struct in this PR.

Suggested reviewers

generall

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d4e610 and 00c8c7d.

📒 Files selected for processing (1)

tests/openapi/test_filter_min_should.py (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: rust-tests (macos-latest)
GitHub Check: rust-tests (windows-latest)
GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
GitHub Check: rust-tests (ubuntu-latest)
GitHub Check: test-consensus-compose
GitHub Check: test-consistency
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: integration-tests-consensus
GitHub Check: integration-tests
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: test-low-resources
GitHub Check: storage-compat-test
GitHub Check: lint

🔇 Additional comments (1)

tests/openapi/test_filter_min_should.py (1)

14-42: Excellent test case for validating min_should filter fixes.

This test effectively validates the bug fixes mentioned in the PR objectives. The test uses mutually exclusive has_id conditions with min_count=2, ensuring that no single point can satisfy multiple conditions simultaneously. This directly tests the corrected condition counting logic and validates that the min_should filter now properly handles primary conditions without duplication issues.

The test design is solid:

Uses primary conditions (has_id) which align with the PR's focus on primary condition filters

Tests the edge case where conditions are mutually exclusive

Correctly expects 0 results, validating the fixed counting logic

Follows the established testing patterns in the file

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

lib/segment/src/index/struct_payload_index.rs (1)
45-45: Consider keeping #[expect] instead of #[allow] for better lint management.

The expect attribute is preferred over allow when you know the lint will fire, as it documents that the suppression is intentional and will catch if the lint stops firing due to code changes.
-#[allow(clippy::enum_variant_names)]
+#[expect(clippy::enum_variant_names)]

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a7972d and c8a567f.

📒 Files selected for processing (5)

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (0 hunks)
lib/segment/src/index/field_index/mod.rs (3 hunks)
lib/segment/src/index/query_estimator.rs (2 hunks)
lib/segment/src/index/struct_payload_index.rs (5 hunks)
lib/segment/src/types.rs (2 hunks)

💤 Files with no reviewable changes (1)

lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs

🧰 Additional context used

🧠 Learnings (5)

📓 Common learnings

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.

lib/segment/src/index/query_estimator.rs (4)

Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.

Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.

Learnt from: xzfc
PR: qdrant/qdrant#6245
File: lib/segment/src/index/hnsw_index/point_scorer.rs:116-121
Timestamp: 2025-04-22T23:17:41.734Z
Learning: The method `is_none_or` exists in the Qdrant codebase and compiles correctly, despite not being part of standard Rust's Option type. Code reviews should verify compilation issues before reporting them.

Learnt from: IvanPleshkov
PR: qdrant/qdrant#6663
File: lib/quantization/src/encoded_vectors_binary.rs:270-276
Timestamp: 2025-06-18T09:23:14.977Z
Learning: Pattern matching on `&Option<T>` with `Some(x)` is valid and idiomatic in Rust. The bound variable `x` will have type `&T` and can be used to access fields through automatic dereferencing. No need to use `.as_ref()` in such cases.

lib/segment/src/index/struct_payload_index.rs (7)

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Learnt from: JojiiOfficial
PR: qdrant/qdrant#5951
File: lib/api/src/grpc/qdrant.rs:6493-6496
Timestamp: 2025-03-12T09:25:16.340Z
Learning: The `payload_index_io_read` field in `HardwareUsage` was introduced in PR #5951 and is safe to modify as it was not used in production before this PR.

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.

Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.

Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.

Learnt from: generall
PR: qdrant/qdrant#6274
File: lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs:233-233
Timestamp: 2025-03-29T17:43:28.174Z
Learning: In the qdrant codebase, `debug_assert!` is preferred over runtime checks that would panic in release builds to ensure production stability, even when validating conditions like vector sorting.

lib/segment/src/types.rs (1)

undefined

<retrieved_learning>
Learnt from: timvisee
PR: #6569
File: lib/segment/src/vector_storage/sparse/volatile_sparse_vector_storage.rs:56-76
Timestamp: 2025-05-21T12:52:15.229Z
Learning: In volatile sparse vector storage implementations, the total_sparse_size field should be updated when vectors are inserted, updated, or deleted to maintain an accurate count of the total number of non-zero elements across all vectors.
</retrieved_learning>

lib/segment/src/index/field_index/mod.rs (4)

Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.

Learnt from: timvisee
PR: qdrant/qdrant#6503
File: lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs:60-74
Timestamp: 2025-05-12T12:54:27.872Z
Learning: In the Qdrant codebase, using `pub(super)` visibility is preferred when fields need to be accessed by sibling modules (particularly for index type conversions), as it provides the necessary access without bloating the interface with numerous getters.

Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.

Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-low-resources
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-consensus-compose
GitHub Check: integration-tests-consensus
GitHub Check: integration-tests
GitHub Check: test-consistency
GitHub Check: storage-compat-test
GitHub Check: rust-tests (macos-latest)
GitHub Check: lint
GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
GitHub Check: rust-tests (windows-latest)
GitHub Check: rust-tests (ubuntu-latest)

🔇 Additional comments (10)

lib/segment/src/index/query_estimator.rs (1)

272-272: LGTM! Import and ID handling changes are correct.

The addition of std::ops::Deref import and the simplified construction of PrimaryCondition::Ids using deref().clone() correctly aligns with the PR's objective of storing external IDs directly.

Also applies to: 321-321

lib/segment/src/index/struct_payload_index.rs (3)

108-115: Correct implementation of external to internal ID mapping.

The changes properly handle the transition to storing external IDs in PrimaryCondition::Ids by mapping them to internal offsets through the ID tracker. Filtering out None values ensures only valid internal IDs are returned.

362-369: Cardinality estimation correctly updated for external ID usage.

The changes properly adapt the cardinality estimation to work with external IDs stored in PrimaryCondition::Ids, maintaining the correct count and ID set.

461-464: Clean implementation using the new is_primary method.

The refactored logic using query_cardinality.is_primary(condition) provides a more maintainable approach to checking if all filter conditions are primary clauses.

lib/segment/src/types.rs (2)

2919-2919: Approve the change from 0 to 1 for HasVector condition count.

This change correctly updates the sub_conditions_count method for HasVector conditions to return 1 instead of 0, which aligns with the PR's objective to fix condition counting logic. The HasVector condition represents a single condition that should be counted as such.

2297-2301: Approve the simplified condition counting logic.

The updated count_all_conditions function now correctly counts only the recursive sub-condition counts by summing j.sub_conditions_count() for each condition, rather than additionally counting the direct length of the condition vector. This change addresses the PR objective to prevent counting nested "must" conditions multiple times and provides more accurate condition counting.

The logic now relies solely on the recursive sub-condition counts, which is the correct approach for hierarchical condition structures.

lib/segment/src/index/field_index/mod.rs (4)

1-1: Import addition looks correct.

The Deref trait import is needed for the deref() call in the new is_primary method.

5-5: Import consolidation improves organization.

Good consolidation of imports from the types module, making the dependencies clearer.

30-30: Type change aligns with PR objectives.

The change from PointOffsetType to PointIdType for the Ids variant correctly implements the PR objective of storing external IDs to facilitate easier comparison with original conditions.

89-111: Well-implemented method for primary condition checking.

The is_primary method correctly implements the logic to check if a given condition matches any primary clause. The pattern matching is exhaustive and handles all three variants of PrimaryCondition appropriately:

Condition variant: Compares boxed field conditions

Ids variant: Uses deref() to compare ID collections

HasVector variant: Compares vector names

The use of deref() on line 101 is correct for comparing the underlying collection types.

generall · 2025-07-10T08:05:34Z

lib/segment/src/index/struct_payload_index.rs


 #[derive(Debug)]
-#[expect(clippy::enum_variant_names)]
+#[allow(clippy::enum_variant_names)]


Clippy with --no-default-features was complaining, I couldn't find a better way to fix it. Not directly related to PR

* fix check for pure primary condition filter * avoid cloning has_id for primary condition * fix test * add test --------- Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>

fix check for pure primary condition filter

c8a567f

generall requested a review from agourlay July 9, 2025 23:22

coderabbitai bot reviewed Jul 9, 2025

View reviewed changes

generall commented Jul 10, 2025

View reviewed changes

generall and others added 3 commits July 10, 2025 10:14

avoid cloning has_id for primary condition

2124205

fix test

6d4e610

add test

00c8c7d

agourlay approved these changes Jul 10, 2025

View reviewed changes

agourlay merged commit f47e499 into dev Jul 10, 2025
18 checks passed

agourlay deleted the fix-primary-condition-check branch July 10, 2025 09:40

generall mentioned this pull request Jul 11, 2025

fix cardinality estimation for has id #6854

Merged

coderabbitai bot mentioned this pull request Sep 5, 2025

Fix missing API validation for MinShould #7217

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix check for pure primary condition filter#6840

fix check for pure primary condition filter#6840
agourlay merged 4 commits intodevfrom
fix-primary-condition-check

generall commented Jul 9, 2025

Uh oh!

coderabbitai bot commented Jul 9, 2025 •

edited

Loading

Walkthrough

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

generall Jul 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

generall commented Jul 9, 2025

Uh oh!

coderabbitai bot commented Jul 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Possibly related PRs

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

generall Jul 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Jul 9, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)