Make points upsert ordering deterministic #7130
Conversation
KShivendu
left a comment
Good idea. Thanks for the insights!
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)
lib/collection/src/collection_manager/segments_updater.rs (2)
575-597: Avoid duplicate inserts and skewed "updated" count when input contains duplicate IDs

When the input has duplicate point IDs that don't yet exist, `new_point_ids` currently yields duplicates. That leads to inserting the same point multiple times and incorrectly increasing the "updated" count (the second insert becomes a replace). Deduplicate new IDs stably before the insertion loop.

I recommend adding a test that upserts duplicates like [1, 1, 2, 1] and asserts: (a) only one insert for 1 happens, (b) the final state matches the last occurrence, (c) the updated count isn't inflated.

Apply this diff:

```diff
-    // Insert new points, which was not updated or existed
-    let new_point_ids = ids.iter().copied().filter(|x| !updated_points.contains(x));
+    // Insert new points, which were not updated or existed (stable de-dup to avoid double inserts)
+    let mut seen_new = AHashSet::with_capacity(ids.len());
+    let new_point_ids: Vec<PointIdType> = ids
+        .iter()
+        .copied()
+        .filter(|id| !updated_points.contains(id))
+        .filter(|id| seen_new.insert(*id))
+        .collect();
@@
     for point_id in new_point_ids {
         let point = points_map[&point_id];
         res += usize::from(upsert_with_payload(
             &mut write_segment,
             op_num,
             point_id,
             point.get_vectors(),
             point.payload.as_ref(),
             hw_counter,
         )?);
     }
```
581-582: Nit: grammar in error message

Plural agreement: "segments exist".

```diff
-    CollectionError::service_error("No appendable segments exists, expected at least one")
+    CollectionError::service_error("No appendable segments exist, expected at least one")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
lib/collection/src/collection_manager/segments_updater.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-10T18:30:02.986Z
Learnt from: generall
PR: qdrant/qdrant#7006
File: lib/collection/src/operations/verification/update.rs:158-174
Timestamp: 2025-08-10T18:30:02.986Z
Learning: In Qdrant's strict mode verification code (lib/collection/src/operations/verification/update.rs), exhaustive pattern matching without `..` is intentionally used for structs like PointsBatch and PointsList. This design pattern ensures compilation fails when new fields are added, forcing developers to explicitly consider how new fields should be handled in the indexed_filter_write method. This provides visibility and compile-time safety for struct evolution.
Applied to files:
lib/collection/src/collection_manager/segments_updater.rs
🧬 Code graph analysis (1)
lib/collection/src/collection_manager/segments_updater.rs (2)
lib/collection/src/operations/conversions.rs (1)
  ids (894-897)
lib/storage/src/rbac/ops_checks.rs (1)
  ids (574-574)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: e2e-tests
- GitHub Check: integration-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consensus-compose
- GitHub Check: test-consistency
- GitHub Check: storage-compat-test
- GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: lint
```rust
// Conserve initial order of points
let (ids, points_map) = points
    .into_iter()
    .map(|p| (p.id, (p.id, p)))
    .collect::<(Vec<_>, AHashMap<_, _>)>();
```
Fix: `collect::<(Vec<_>, AHashMap<_, _>)>()` will not compile; use `unzip()` with explicit types

`Iterator::collect` produces a single collection. There is no `FromIterator` for tuples of collections, so this won't compile. Use `Iterator::unzip` with an explicit type to build `(Vec<PointIdType>, AHashMap<PointIdType, &PointStructPersisted>)` in one pass while preserving input order for `ids` and last-write-wins for the map.
Apply this diff:
- // Conserve initial order of points
- let (ids, points_map) = points
- .into_iter()
- .map(|p| (p.id, (p.id, p)))
- .collect::<(Vec<_>, AHashMap<_, _>)>();
+ // Conserve initial order of points (ids); points_map is last-write-wins on duplicates
+ let (ids, points_map): (Vec<PointIdType>, AHashMap<PointIdType, &'a PointStructPersisted>) =
+ points
+ .into_iter()
+ .map(|p| (p.id, (p.id, p)))
+ .unzip();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
// Conserve initial order of points (ids); points_map is last-write-wins on duplicates
let (ids, points_map): (Vec<PointIdType>, AHashMap<PointIdType, &'a PointStructPersisted>) =
    points
        .into_iter()
        .map(|p| (p.id, (p.id, p)))
        .unzip();
```
🤖 Prompt for AI Agents
In lib/collection/src/collection_manager/segments_updater.rs around lines
541-546, replace the failing collect::<(Vec<_>, AHashMap<_, _>)>() with
Iterator::unzip and an explicit type annotation: map each point to (id, (id,
point)) so unzip returns (Vec<IdType>, AHashMap<IdType, PointType>), e.g. let
(ids, points_map): (Vec<PointIdType>, AHashMap<PointIdType,
PointStructPersisted>) = points.into_iter().map(|p| { let id = p.id; (id, (id,
p)) }).unzip(); This preserves input order in ids and builds the hashmap with
last-write-wins semantics.
* Make points upsert ordering deterministic
* use Iterator magic (thx Tim)
This PR makes the point ordering on upsert deterministic by following the order provided by the user.
This improves the debugging and testing experience while taking us one more step towards a more deterministic code base.