Skip to content

fix check for pure primary condition filter#6840

Merged
agourlay merged 4 commits intodevfrom
fix-primary-condition-check
Jul 10, 2025
Merged

fix check for pure primary condition filter#6840
agourlay merged 4 commits intodevfrom
fix-primary-condition-check

Conversation

@generall
Copy link
Member

@generall generall commented Jul 9, 2025

  • Fix total_conditions_count, so it doesn't count nested must conditions as multiple conditions
  • Change PrimaryCondition type to store external ids for: a) easier comparison with original condition b) avoiding mapping of the conditions, which might not be even used
  • Change criteria for checking if filter consists only of primary conditions. Previous problem: min_should can generate more primary conditions then there are conditions defined (by duplicating conditions)

@generall generall requested a review from agourlay July 9, 2025 23:22
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 9, 2025

📝 Walkthrough

Walkthrough

The changes remove a Clippy lint attribute from the remove_point method of MutableFullTextIndex without altering its logic. A new method is_primary is added to the CardinalityEstimation struct to check if a given Condition matches any primary clause, with PrimaryCondition::Ids updated to use PointIdType wrapped in MaybeArc. The handling of IDs in query filtering and cardinality estimation methods is adjusted to map external IDs properly and refine primary condition checks. The counting logic for sub-conditions in the Condition enum and total conditions in the Filter struct is revised, particularly for the HasVector variant. Additionally, the Clone trait implementation for MaybeArc is changed from derived to manual, and a clippy lint attribute on StorageType is modified. A new test verifies filtering behavior with min_should conditions.

Possibly related PRs

Suggested reviewers

  • generall

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d4e610 and 00c8c7d.

📒 Files selected for processing (1)
  • tests/openapi/test_filter_min_should.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-consistency
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (1)
tests/openapi/test_filter_min_should.py (1)

14-42: Excellent test case for validating min_should filter fixes.

This test effectively validates the bug fixes mentioned in the PR objectives. The test uses mutually exclusive has_id conditions with min_count=2, ensuring that no single point can satisfy multiple conditions simultaneously. This directly tests the corrected condition counting logic and validates that the min_should filter now properly handles primary conditions without duplication issues.

The test design is solid:

  • Uses primary conditions (has_id) which align with the PR's focus on primary condition filters
  • Tests the edge case where conditions are mutually exclusive
  • Correctly expects 0 results, validating the fixed counting logic
  • Follows the established testing patterns in the file
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/segment/src/index/struct_payload_index.rs (1)

45-45: Consider keeping #[expect] instead of #[allow] for better lint management.

The expect attribute is preferred over allow when you know the lint will fire, as it documents that the suppression is intentional and will catch if the lint stops firing due to code changes.

-#[allow(clippy::enum_variant_names)]
+#[expect(clippy::enum_variant_names)]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a7972d and c8a567f.

📒 Files selected for processing (5)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs (0 hunks)
  • lib/segment/src/index/field_index/mod.rs (3 hunks)
  • lib/segment/src/index/query_estimator.rs (2 hunks)
  • lib/segment/src/index/struct_payload_index.rs (5 hunks)
  • lib/segment/src/types.rs (2 hunks)
💤 Files with no reviewable changes (1)
  • lib/segment/src/index/field_index/full_text_index/mutable_text_index.rs
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/segment/src/index/field_index/full_text_index/inverted_index.rs:330-344
Timestamp: 2025-05-26T14:45:01.364Z
Learning: For future phrase capabilities in the full-text index, equality checks need to compare both the ID and value of PostingElement items, not just the IDs. This supports use cases like position-aware or frequency-weighted searches where the value component contains important metadata.
lib/segment/src/index/query_estimator.rs (4)
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.
Learnt from: xzfc
PR: qdrant/qdrant#6245
File: lib/segment/src/index/hnsw_index/point_scorer.rs:116-121
Timestamp: 2025-04-22T23:17:41.734Z
Learning: The method `is_none_or` exists in the Qdrant codebase and compiles correctly, despite not being part of standard Rust's Option type. Code reviews should verify compilation issues before reporting them.
Learnt from: IvanPleshkov
PR: qdrant/qdrant#6663
File: lib/quantization/src/encoded_vectors_binary.rs:270-276
Timestamp: 2025-06-18T09:23:14.977Z
Learning: Pattern matching on `&Option<T>` with `Some(x)` is valid and idiomatic in Rust. The bound variable `x` will have type `&T` and can be used to access fields through automatic dereferencing. No need to use `.as_ref()` in such cases.
lib/segment/src/index/struct_payload_index.rs (7)
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5951
File: lib/api/src/grpc/qdrant.rs:6493-6496
Timestamp: 2025-03-12T09:25:16.340Z
Learning: The `payload_index_io_read` field in `HardwareUsage` was introduced in PR #5951 and is safe to modify as it was not used in production before this PR.
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.
Learnt from: generall
PR: qdrant/qdrant#6274
File: lib/segment/src/vector_storage/sparse/mmap_sparse_vector_storage.rs:233-233
Timestamp: 2025-03-29T17:43:28.174Z
Learning: In the qdrant codebase, `debug_assert!` is preferred over runtime checks that would panic in release builds to ensure production stability, even when validating conditions like vector sorting.
lib/segment/src/types.rs (1)

undefined

<retrieved_learning>
Learnt from: timvisee
PR: #6569
File: lib/segment/src/vector_storage/sparse/volatile_sparse_vector_storage.rs:56-76
Timestamp: 2025-05-21T12:52:15.229Z
Learning: In volatile sparse vector storage implementations, the total_sparse_size field should be updated when vectors are inserted, updated, or deleted to maintain an accurate count of the total number of non-zero elements across all vectors.
</retrieved_learning>

lib/segment/src/index/field_index/mod.rs (4)
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
Learnt from: timvisee
PR: qdrant/qdrant#6503
File: lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs:60-74
Timestamp: 2025-05-12T12:54:27.872Z
Learning: In the Qdrant codebase, using `pub(super)` visibility is preferred when fields need to be accessed by sibling modules (particularly for index type conversions), as it provides the necessary access without bloating the interface with numerous getters.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
Learnt from: coszio
PR: qdrant/qdrant#6609
File: lib/gridstore/src/blob.rs:46-59
Timestamp: 2025-06-02T18:10:47.203Z
Learning: In the Qdrant codebase, zerocopy crate is extensively used for safe byte-level operations across GPU operations, HNSW indices, memory-mapped structures, and serialization. When implementing Blob trait for generic Vec<T>, using zerocopy's FromBytes and IntoBytes traits is preferred over size_of::<T>() because it guarantees memory layout equals byte representation, making serialization safe and correct.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-low-resources
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests-consensus
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: storage-compat-test
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: lint
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (10)
lib/segment/src/index/query_estimator.rs (1)

272-272: LGTM! Import and ID handling changes are correct.

The addition of std::ops::Deref import and the simplified construction of PrimaryCondition::Ids using deref().clone() correctly aligns with the PR's objective of storing external IDs directly.

Also applies to: 321-321

lib/segment/src/index/struct_payload_index.rs (3)

108-115: Correct implementation of external to internal ID mapping.

The changes properly handle the transition to storing external IDs in PrimaryCondition::Ids by mapping them to internal offsets through the ID tracker. Filtering out None values ensures only valid internal IDs are returned.


362-369: Cardinality estimation correctly updated for external ID usage.

The changes properly adapt the cardinality estimation to work with external IDs stored in PrimaryCondition::Ids, maintaining the correct count and ID set.


461-464: Clean implementation using the new is_primary method.

The refactored logic using query_cardinality.is_primary(condition) provides a more maintainable approach to checking if all filter conditions are primary clauses.

lib/segment/src/types.rs (2)

2919-2919: Approve the change from 0 to 1 for HasVector condition count.

This change correctly updates the sub_conditions_count method for HasVector conditions to return 1 instead of 0, which aligns with the PR's objective to fix condition counting logic. The HasVector condition represents a single condition that should be counted as such.


2297-2301: Approve the simplified condition counting logic.

The updated count_all_conditions function now correctly counts only the recursive sub-condition counts by summing j.sub_conditions_count() for each condition, rather than additionally counting the direct length of the condition vector. This change addresses the PR objective to prevent counting nested "must" conditions multiple times and provides more accurate condition counting.

The logic now relies solely on the recursive sub-condition counts, which is the correct approach for hierarchical condition structures.

lib/segment/src/index/field_index/mod.rs (4)

1-1: Import addition looks correct.

The Deref trait import is needed for the deref() call in the new is_primary method.


5-5: Import consolidation improves organization.

Good consolidation of imports from the types module, making the dependencies clearer.


30-30: Type change aligns with PR objectives.

The change from PointOffsetType to PointIdType for the Ids variant correctly implements the PR objective of storing external IDs to facilitate easier comparison with original conditions.


89-111: Well-implemented method for primary condition checking.

The is_primary method correctly implements the logic to check if a given condition matches any primary clause. The pattern matching is exhaustive and handles all three variants of PrimaryCondition appropriately:

  • Condition variant: Compares boxed field conditions
  • Ids variant: Uses deref() to compare ID collections
  • HasVector variant: Compares vector names

The use of deref() on line 101 is correct for comparing the underlying collection types.


#[derive(Debug)]
#[expect(clippy::enum_variant_names)]
#[allow(clippy::enum_variant_names)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy with --no-default-features was complaining, I couldn't find a better way to fix it. Not directly related to PR

@agourlay agourlay merged commit f47e499 into dev Jul 10, 2025
18 checks passed
@agourlay agourlay deleted the fix-primary-condition-check branch July 10, 2025 09:40
generall added a commit that referenced this pull request Jul 17, 2025
* fix check for pure primary condition filter

* avoid cloning has_id for primary condition

* fix test

* add test

---------

Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants