Conversation
📝 WalkthroughWalkthroughA new struct, Suggested reviewers
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
lib/segment/src/index/field_index/mod.rs(3 hunks)lib/segment/src/index/query_estimator.rs(2 hunks)lib/segment/src/index/struct_payload_index.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
lib/segment/src/index/query_estimator.rs (2)
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Learnt from: timvisee
PR: qdrant/qdrant#6546
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:497-518
Timestamp: 2025-05-19T09:25:54.285Z
Learning: In the Qdrant project, using `.expect()` and `.unwrap()` is considered acceptable in test code since test failures should be loud and obvious, and proper error propagation adds unnecessary complexity to test code.
lib/segment/src/index/field_index/mod.rs (3)
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
lib/segment/src/index/struct_payload_index.rs (6)
Learnt from: JojiiOfficial
PR: qdrant/qdrant#5951
File: lib/api/src/grpc/qdrant.rs:6493-6496
Timestamp: 2025-03-12T09:25:16.340Z
Learning: The `payload_index_io_read` field in `HardwareUsage` was introduced in PR #5951 and is safe to modify as it was not used in production before this PR.
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: generall
PR: qdrant/qdrant#6403
File: lib/segment/src/index/struct_payload_index.rs:485-497
Timestamp: 2025-04-20T18:32:43.471Z
Learning: In StructPayloadIndex, checking if a field exists in `self.config.indexed_fields` is sufficient to determine if an index should exist, as the initialization process in `load_all_fields()` ensures that `field_indexes` is synchronized with the configuration, automatically rebuilding any missing indexes.
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/builder.rs:63-67
Timestamp: 2025-05-26T14:47:23.505Z
Learning: In the posting_list crate's PostingChunk struct, avoid adding extra fields like storing computed chunk_bits to prevent struct bloat and maintain mmap compatibility. The bit calculations from offsets are inexpensive compared to the memory and compatibility benefits of keeping the struct minimal.
Learnt from: coszio
PR: qdrant/qdrant#6528
File: lib/posting_list/src/view.rs:118-118
Timestamp: 2025-05-19T14:40:20.068Z
Learning: In the bitpacking crate, the BitPacker::decompress_strictly_sorted function takes an Option<PointOffsetType> as its first parameter, which means using checked_sub(1) without unwrapping is intentional and correct.
Learnt from: timvisee
PR: qdrant/qdrant#6644
File: lib/gridstore/src/blob.rs:41-57
Timestamp: 2025-06-06T07:52:38.478Z
Learning: In the Blob trait implementations for gridstore (lib/gridstore/src/blob.rs), the maintainer prefers using assert! for input validation over Result-based error handling, as they don't expect invalid inputs in real scenarios and changing to Result would require updating the entire trait interface.
🧬 Code Graph Analysis (1)
lib/segment/src/index/struct_payload_index.rs (1)
lib/segment/src/id_tracker/id_tracker_base.rs (2)
external_id(47-47)external_id(236-244)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (3)
lib/segment/src/index/field_index/mod.rs (1)
30-37: Well-structured implementation of ResolvedHasId.The new struct properly encapsulates both the original point IDs and their resolved offsets, with clear documentation explaining the purpose of each field.
lib/segment/src/index/struct_payload_index.rs (2)
356-367: Correct fix for cardinality estimation.The implementation now properly resolves external IDs to internal offsets and only counts points that actually exist in the segment, fixing the regression from PR #6840.
108-108: Efficient use of pre-resolved offsets.Good optimization - directly iterating over pre-resolved offsets avoids redundant ID resolution during query execution.
| primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId { | ||
| point_ids: has_id.has_id.clone(), | ||
| resolved_point_offsets: has_id | ||
| .has_id | ||
| .iter() | ||
| .map(|id| id.to_string().parse().unwrap()) | ||
| .collect(), | ||
| })], |
There was a problem hiding this comment.
Fix incorrect ID resolution logic in test.
The test is converting IDs to strings and parsing them back as numbers, which doesn't accurately reflect how production code resolves IDs. Production code uses id_tracker.internal_id() to map external IDs to internal offsets.
Consider updating the test to better reflect real-world behavior:
- primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
- point_ids: has_id.has_id.clone(),
- resolved_point_offsets: has_id
- .has_id
- .iter()
- .map(|id| id.to_string().parse().unwrap())
- .collect(),
- })],
+ // In tests, assume all IDs exist and map directly to offsets
+ primary_clauses: vec![PrimaryCondition::Ids(ResolvedHasId {
+ point_ids: has_id.has_id.clone(),
+ resolved_point_offsets: has_id
+ .has_id
+ .iter()
+ .map(|id| id.point_id_number as PointOffsetType)
+ .collect(),
+ })],Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In lib/segment/src/index/query_estimator.rs around lines 320 to 327, the test
incorrectly converts IDs to strings and parses them back as numbers to resolve
internal offsets. Instead, update the test to use the id_tracker.internal_id()
method to map external IDs to internal offsets, matching the production code's
ID resolution logic for accuracy.
There was a problem hiding this comment.
this is part of test, and it restores previous logic. Also point_id_number doesn't exist
There was a problem hiding this comment.
@generall Thank you for the clarification! You're absolutely right:
- This is test code, so it doesn't need to match the exact production ID resolution logic
- The suggestion was incorrect -
point_id_numberdoesn't exist - Since this restores the previous logic as part of fixing the cardinality estimation regression, the test approach is appropriate
I apologize for the confusion in the original comment. The test's simplified ID resolution using id.to_string().parse().unwrap() is perfectly fine for testing purposes, and the focus should be on the cardinality estimation fix rather than the ID resolution mechanism.
✏️ Learnings added
Learnt from: generall
PR: qdrant/qdrant#6854
File: lib/segment/src/index/query_estimator.rs:320-327
Timestamp: 2025-07-11T11:35:21.517Z
Learning: In test code for Qdrant's query estimator (lib/segment/src/index/query_estimator.rs), simplified ID resolution logic using `id.to_string().parse().unwrap()` is acceptable for testing purposes and doesn't need to match production code's `id_tracker.internal_id()` approach. Test code can use mock implementations that serve the testing goals.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: generall
PR: qdrant/qdrant#6766
File: lib/segment/src/index/field_index/full_text_index/inverted_index/mutable_inverted_index.rs:274-282
Timestamp: 2025-06-28T19:25:53.448Z
Learning: In the full text inverted index implementations (MutableInvertedIndex, ImmutableInvertedIndex, MmapInvertedIndex), there's an important distinction between TokenId and PointOffsetType: TokenId represents a token/word identifier used to index into postings lists, while PointOffsetType represents a point/document identifier. The iter_point_ids method should take a TokenId and return an iterator over PointOffsetType values (points that contain that token), not the other way around.
Learnt from: coszio
PR: qdrant/qdrant#6565
File: lib/posting_list/src/iterator.rs:0-0
Timestamp: 2025-05-26T14:50:13.183Z
Learning: The `advance_until_greater_or_equal` method in `PostingIterator` in `lib/posting_list/src/iterator.rs` is intentionally designed with different semantics than `next()` and can output the same value multiple times. This is intended behavior for a search method that finds the first element with ID >= target_id.
* fix segment-specific cardinality estimation * fmt
Last fix in #6840 broke the cardinality estimation for HasId condition.
Instead of counting points which actually exist in segment, it used total number of points.
This PR returns previous behavior.