fix cardinality estimation for has id by generall · Pull Request #6854 · qdrant/qdrant

generall · 2025-07-11T11:26:36Z

Last fix in #6840 broke the cardinality estimation for HasId condition.

Instead of counting points which actually exist in segment, it used total number of points.

This PR returns previous behavior.

coderabbitai · 2025-07-11T11:29:14Z

📝 Walkthrough

Walkthrough

A new struct, ResolvedHasId, was introduced to encapsulate both the original point IDs and their resolved internal point offsets. The PrimaryCondition::Ids variant was updated to use this struct instead of directly holding the point ID set. Corresponding changes were made in methods and tests to construct and utilize ResolvedHasId, ensuring that both external and internal representations of point IDs are consistently available. The code in methods handling ID-based conditions and queries was refactored to use ResolvedHasId, and related imports were updated for consistency.

Suggested reviewers

timvisee

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bac2816 and c8b785a.

📒 Files selected for processing (3)

lib/segment/src/index/field_index/mod.rs (3 hunks)
lib/segment/src/index/query_estimator.rs (2 hunks)
lib/segment/src/index/struct_payload_index.rs (3 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

lib/segment/src/index/query_estimator.rs (2)

Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.

Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.

lib/segment/src/index/field_index/mod.rs (3)

Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.

Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.

lib/segment/src/index/struct_payload_index.rs (6)

Learnt from: JojiiOfficial
PR: qdrant/qdrant#5951
File: lib/api/src/grpc/qdrant.rs:6493-6496
Timestamp: 2025-03-12T09:25:16.340Z
Learning: The `payload_index_io_read` field in `HardwareUsage` was introduced in PR #5951 and is safe to modify as it was not used in production before this PR.

Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Learnt from: generall
PR: qdrant/qdrant#6403
File: lib/segment/src/index/struct_payload_index.rs:485-497
Timestamp: 2025-04-20T18:32:43.471Z
Learning: In StructPayloadIndex, checking if a field exists in `self.config.indexed_fields` is sufficient to determine if an index should exist, as the initialization process in `load_all_fields()` ensures that `field_indexes` is synchronized with the configuration, automatically rebuilding any missing indexes.

Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.

Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.

Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.

🧬 Code Graph Analysis (1)

lib/segment/src/index/struct_payload_index.rs (1)

lib/segment/src/id_tracker/id_tracker_base.rs (2)

external_id (47-47)

external_id (236-244)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: lint
GitHub Check: storage-compat-test
GitHub Check: test-snapshot-operations-s3-minio
GitHub Check: test-consensus-compose
GitHub Check: test-shard-snapshot-api-s3-minio
GitHub Check: Basic TLS/HTTPS tests
GitHub Check: test-low-resources
GitHub Check: test-consistency
GitHub Check: integration-tests
GitHub Check: integration-tests-consensus
GitHub Check: rust-tests (ubuntu-latest)
GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
GitHub Check: rust-tests (macos-latest)
GitHub Check: rust-tests (windows-latest)

🔇 Additional comments (3)

lib/segment/src/index/field_index/mod.rs (1)

30-37: Well-structured implementation of ResolvedHasId.

The new struct properly encapsulates both the original point IDs and their resolved offsets, with clear documentation explaining the purpose of each field.

lib/segment/src/index/struct_payload_index.rs (2)

356-367: Correct fix for cardinality estimation.

The implementation now properly resolves external IDs to internal offsets and only counts points that actually exist in the segment, fixing the regression from PR #6840.

108-108: Efficient use of pre-resolved offsets.

Good optimization - directly iterating over pre-resolved offsets avoids redundant ID resolution during query execution.

coderabbitai · 2025-07-11T11:29:17Z

lib/segment/src/index/query_estimator.rs

+                primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
+                    point_ids: has_id.has_id.clone(),
+                    resolved_point_offsets: has_id
+                        .has_id
+                        .iter()
+                        .map(|id| id.to_string().parse().unwrap())
+                        .collect(),
+                })],


⚠️ Potential issue

Fix incorrect ID resolution logic in test.

The test is converting IDs to strings and parsing them back as numbers, which doesn't accurately reflect how production code resolves IDs. Production code uses id_tracker.internal_id() to map external IDs to internal offsets.

Consider updating the test to better reflect real-world behavior:

- primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId { - point_ids: has_id.has_id.clone(), - resolved_point_offsets: has_id - .has_id - .iter() - .map(|id| id.to_string().parse().unwrap()) - .collect(), - })], + // In tests, assume all IDs exist and map directly to offsets + primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId { + point_ids: has_id.has_id.clone(), + resolved_point_offsets: has_id + .has_id + .iter() + .map(|id| id.point_id_number as PointOffsetType) + .collect(), + })],

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In lib/segment/src/index/query_estimator.rs around lines 320 to 327, the test incorrectly converts IDs to strings and parses them back as numbers to resolve internal offsets. Instead, update the test to use the id_tracker.internal_id() method to map external IDs to internal offsets, matching the production code's ID resolution logic for accuracy.

this is part of test, and it restores previous logic. Also point_id_number doesn't exist

@generall Thank you for the clarification! You're absolutely right:

This is test code, so it doesn't need to match the exact production ID resolution logic

The suggestion was incorrect - point_id_number doesn't exist

Since this restores the previous logic as part of fixing the cardinality estimation regression, the test approach is appropriate

I apologize for the confusion in the original comment. The test's simplified ID resolution using id.to_string().parse().unwrap() is perfectly fine for testing purposes, and the focus should be on the cardinality estimation fix rather than the ID resolution mechanism.

✏️ Learnings added

Learnt from: generall PR: qdrant/qdrant#6854 File: lib/segment/src/index/query_estimator.rs:320-327 Timestamp: 2025-07-11T11:35:21.517Z Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: generall PR: qdrant/qdrant#6766 File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282 Timestamp: 2025-06-28T19:25:53.448Z Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.

Learnt from: coszio PR: qdrant/qdrant#6565 File: lib/posting_list/src/iterator.rs:0-0 Timestamp: 2025-05-26T14:50:13.183Z Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.

* fix segment-specific cardinality estimation * fmt

generall added 2 commits July 11, 2025 13:23

fix segment-specific cardinality estimation

8f89a76

fmt

c8b785a

generall requested a review from agourlay July 11, 2025 11:26

coderabbitai bot reviewed Jul 11, 2025

View reviewed changes

agourlay approved these changes Jul 11, 2025

View reviewed changes

agourlay merged commit bb9c28d into dev Jul 11, 2025
18 checks passed

agourlay deleted the fix-cardinality-estimation-for-has-id branch July 11, 2025 11:52

generall added a commit that referenced this pull request Jul 17, 2025

fix cardinality estimation for has id (#6854)

585034c

* fix segment-specific cardinality estimation * fmt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix cardinality estimation for has id#6854

fix cardinality estimation for has id#6854
agourlay merged 2 commits intodevfrom
fix-cardinality-estimation-for-has-id

generall commented Jul 11, 2025

Uh oh!

coderabbitai bot commented Jul 11, 2025

Walkthrough

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jul 11, 2025 •

edited

Loading

Uh oh!

generall Jul 11, 2025

Uh oh!

coderabbitai bot Jul 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

generall commented Jul 11, 2025

Uh oh!

coderabbitai bot commented Jul 11, 2025

Walkthrough

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jul 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

generall Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CodeRabbit Configuration File (`.coderabbit.yaml`)

coderabbitai bot Jul 11, 2025 •

edited

Loading