Skip to content

fix cardinality estimation for has id#6854

Merged
agourlay merged 2 commits intodevfrom
fix-cardinality-estimation-for-has-id
Jul 11, 2025
Merged

fix cardinality estimation for has id#6854
agourlay merged 2 commits intodevfrom
fix-cardinality-estimation-for-has-id

Conversation

@generall
Copy link
Member

Last fix in #6840 broke the cardinality estimation for HasId condition.

Instead of counting points which actually exist in segment, it used total number of points.

This PR returns previous behavior.

@generall generall requested a review from agourlay July 11, 2025 11:26
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jul 11, 2025

📝 Walkthrough

Walkthrough

A new struct, ResolvedHasId, was introduced to encapsulate both the original point IDs and their resolved internal point offsets. The PrimaryCondition::Ids variant was updated to use this struct instead of directly holding the point ID set. Corresponding changes were made in methods and tests to construct and utilize ResolvedHasId, ensuring that both external and internal representations of point IDs are consistently available. The code in methods handling ID-based conditions and queries was refactored to use ResolvedHasId, and related imports were updated for consistency.

Suggested reviewers

  • timvisee
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bac2816 and c8b785a.

📒 Files selected for processing (3)
  • lib/segment/src/index/field_index/mod.rs (3 hunks)
  • lib/segment/src/index/query_estimator.rs (2 hunks)
  • lib/segment/src/index/struct_payload_index.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
lib/segment/src/index/query_estimator.rs (2)
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.
lib/segment/src/index/field_index/mod.rs (3)
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
lib/segment/src/index/struct_payload_index.rs (6)
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5951
File: lib/api/src/grpc/qdrant.rs:6493-6496
Timestamp: 2025-03-12T09:25:16.340Z
Learning: The `payload_index_io_read` field in `HardwareUsage` was introduced in PR #5951 and is safe to modify as it was not used in production before this PR.
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: generall
PR: qdrant/qdrant#6403
File: lib/segment/src/index/struct_payload_index.rs:485-497
Timestamp: 2025-04-20T18:32:43.471Z
Learning: In StructPayloadIndex, checking if a field exists in `self.config.indexed_fields` is sufficient to determine if an index should exist, as the initialization process in `load_all_fields()` ensures that `field_indexes` is synchronized with the configuration, automatically rebuilding any missing indexes.
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
🧬 Code Graph Analysis (1)
lib/segment/src/index/struct_payload_index.rs (1)
lib/segment/src/id_tracker/id_tracker_base.rs (2)
  • external_id (47-47)
  • external_id (236-244)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: lint
  • GitHub Check: storage-compat-test
  • GitHub Check: test-snapshot-operations-s3-minio
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: Basic TLS/HTTPS tests
  • GitHub Check: test-low-resources
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (3)
lib/segment/src/index/field_index/mod.rs (1)

30-37: Well-structured implementation of ResolvedHasId.

The new struct properly encapsulates both the original point IDs and their resolved offsets, with clear documentation explaining the purpose of each field.

lib/segment/src/index/struct_payload_index.rs (2)

356-367: Correct fix for cardinality estimation.

The implementation now properly resolves external IDs to internal offsets and only counts points that actually exist in the segment, fixing the regression from PR #6840.


108-108: Efficient use of pre-resolved offsets.

Good optimization - directly iterating over pre-resolved offsets avoids redundant ID resolution during query execution.

Comment on lines +320 to +327
primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
point_ids: has_id.has_id.clone(),
resolved_point_offsets: has_id
.has_id
.iter()
.map(|id| id.to_string().parse().unwrap())
.collect(),
})],
Copy link
Contributor

@coderabbitai coderabbitai bot Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect ID resolution logic in test.

The test is converting IDs to strings and parsing them back as numbers, which doesn't accurately reflect how production code resolves IDs. Production code uses id_tracker.internal_id() to map external IDs to internal offsets.

Consider updating the test to better reflect real-world behavior:

-                primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
-                    point_ids: has_id.has_id.clone(),
-                    resolved_point_offsets: has_id
-                        .has_id
-                        .iter()
-                        .map(|id| id.to_string().parse().unwrap())
-                        .collect(),
-                })],
+                // In tests, assume all IDs exist and map directly to offsets
+                primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
+                    point_ids: has_id.has_id.clone(),
+                    resolved_point_offsets: has_id
+                        .has_id
+                        .iter()
+                        .map(|id| id.point_id_number as PointOffsetType)
+                        .collect(),
+                })],

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In lib/segment/src/index/query_estimator.rs around lines 320 to 327, the test
incorrectly converts IDs to strings and parses them back as numbers to resolve
internal offsets. Instead, update the test to use the id_tracker.internal_id()
method to map external IDs to internal offsets, matching the production code's
ID resolution logic for accuracy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is part of test, and it restores previous logic. Also point_id_number doesn't exist

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@generall Thank you for the clarification! You're absolutely right:

  1. This is test code, so it doesn't need to match the exact production ID resolution logic
  2. The suggestion was incorrect - point_id_number doesn't exist
  3. Since this restores the previous logic as part of fixing the cardinality estimation regression, the test approach is appropriate

I apologize for the confusion in the original comment. The test's simplified ID resolution using id.to_string().parse().unwrap() is perfectly fine for testing purposes, and the focus should be on the cardinality estimation fix rather than the ID resolution mechanism.


✏️ Learnings added
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.517Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants