Use former scroll_lock to prevent parallel updates during snapshot #7306

Merged
generall merged 6 commits into dev from refactor-proxy_all_segments_and_apply on Sep 25, 2025
Conversation


@generall generall commented Sep 25, 2025

This PR refactors segment-holder-related functions to unlock the use of an async lock during snapshotting.

This PR does the following:

  • Moves snapshot-related functions into a dedicated file in both the shard and collection crates
  • Moves some snapshot-related functions from shard into collection to unblock the use of a tokio RwLock
  • Gates the test fixtures behind the cfg testing feature so some tests can also be moved to collection
  • Removes update_lock in the segment holder
  • Renames the scroll lock and propagates it into the snapshot_all_segments function
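The lock semantics behind this change can be pictured with a small sketch. This is illustrative only: it uses std::sync::RwLock in place of the tokio::sync::RwLock the shard actually holds, and update_can_start is a hypothetical helper, not part of the codebase:

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for LocalShard's update_operation_lock (the real
/// field is a tokio::sync::RwLock; std's RwLock shows the same semantics).
/// Returns true if an update (write guard) can start while `n_readers`
/// snapshot or scroll operations hold read guards.
fn update_can_start(n_readers: usize) -> bool {
    let lock = RwLock::new(());
    // Snapshot/scroll paths each take a shared read guard.
    let guards: Vec<_> = (0..n_readers).map(|_| lock.read().unwrap()).collect();
    // Update paths take an exclusive write guard.
    let ok = lock.try_write().is_ok();
    drop(guards);
    ok
}

fn main() {
    assert!(update_can_start(0)); // no snapshot in flight: update proceeds
    assert!(!update_can_start(1)); // one snapshot read guard blocks the update
    assert!(!update_can_start(3)); // scroll + snapshot readers coexist, still block writes
    println!("ok");
}
```

Readers never block each other, so scrolls and snapshots can run concurrently; only the exclusive update path is held back, which is the point of the rename from a scroll-specific lock to a general update_operation_lock.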

@generall generall force-pushed the refactor-proxy_all_segments_and_apply branch from 5826bcd to c7b84d4 on September 25, 2025 09:42
@generall generall requested a review from agourlay September 25, 2025 13:24
@generall generall marked this pull request as ready for review September 25, 2025 13:29
@coderabbitai
Contributor

coderabbitai bot commented Sep 25, 2025

📝 Walkthrough

Refactors snapshot and proxy functionality across crates and renames a lock used for update operations. Cargo manifests updated: the collection workspace disables shard default features and adds tar as a dev-dependency; shard crate adds a testing feature and promotes serde_json to a regular dependency. In lib/collection, snapshot-related public APIs are moved into a new snapshot module, LocalShard’s scroll_read_lock is renamed to update_operation_lock, and scroll paths updated. New snapshot helpers and tests added. In lib/shard, SegmentHolder snapshot/proxy logic is split into a new snapshot module and related tests removed; fixtures are exposed behind the testing feature.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • agourlay
  • timvisee
  • ffuugoo

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The title accurately refers to the refactored use of the former scroll_lock to prevent concurrent updates during snapshots, which is a concrete aspect of the changeset even though additional refactoring steps are included.
  • Description Check: ✅ Passed. The description clearly outlines the refactoring of snapshot-related functions, relocation between crates, removal and renaming of locks, and test fixture adjustments, all of which align with the changes in the pull request.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (10)
lib/collection/src/shards/local_shard/scroll.rs (1)

156-156: Hold update_operation_lock for less time to avoid blocking snapshots/updates

Currently the read guard is kept through payload/vector retrieval, which can be slow and will block snapshot attempts (write) and updates longer than necessary. Release it once the candidate IDs are computed.

Apply diffs below in each method.

scroll_by_id:

@@
-        let with_payload = WithPayload::from(with_payload_interface);
-        // update timeout
-        let timeout = timeout.saturating_sub(start.elapsed());
-        let mut records_map = tokio::time::timeout(
+        let with_payload = WithPayload::from(with_payload_interface);
+        // update timeout
+        let timeout = timeout.saturating_sub(start.elapsed());
+        // Release update lock before fetching payload/vector data to reduce contention.
+        drop(scroll_lock);
+        let mut records_map = tokio::time::timeout(
@@
-        .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
-
-        drop(scroll_lock);
+        .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;

scroll_by_field:

@@
-        let with_payload = WithPayload::from(with_payload_interface);
-
-        // update timeout
-        let timeout = timeout.saturating_sub(start.elapsed());
-
-        // Fetch with the requested vector and payload
-        let records_map = tokio::time::timeout(
+        let with_payload = WithPayload::from(with_payload_interface);
+        // update timeout
+        let timeout = timeout.saturating_sub(start.elapsed());
+        // Release update lock before fetching payload/vector data to reduce contention.
+        drop(scroll_lock);
+        // Fetch with the requested vector and payload
+        let records_map = tokio::time::timeout(
@@
-        .map_err(|_| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
-
-        drop(scroll_lock);
+        .map_err(|_| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;

scroll_randomly:

@@
-        let with_payload = WithPayload::from(with_payload_interface);
-        // update timeout
-        let timeout = timeout.saturating_sub(start.elapsed());
-        let records_map = tokio::time::timeout(
+        let with_payload = WithPayload::from(with_payload_interface);
+        // update timeout
+        let timeout = timeout.saturating_sub(start.elapsed());
+        // Release update lock before fetching payload/vector data to reduce contention.
+        drop(scroll_lock);
+        let records_map = tokio::time::timeout(
@@
-        .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;
-
-        drop(scroll_lock);
+        .map_err(|_: Elapsed| CollectionError::timeout(timeout.as_secs() as usize, "retrieve"))??;

Also applies to: 241-241, 340-340

lib/collection/src/shards/local_shard/snapshot_tests.rs (1)

71-75: Explicitly drop tar builder and assert count dynamically

Make intent explicit by dropping the builder before reading; assert the expected archive count from the holder to reduce fixture coupling.

Apply:

-    let mut tar = tar::Archive::new(File::open(&snapshot_file).unwrap());
-    let archive_count = tar.entries_with_seek().unwrap().count();
-    // one archive produced per concrete segment in the SegmentHolder
-    assert_eq!(archive_count, 2);
+    // Ensure writer is fully flushed before reading
+    drop(tar);
+    let mut archive = tar::Archive::new(File::open(&snapshot_file).unwrap());
+    let archive_count = archive.entries_with_seek().unwrap().count();
+    // one archive per concrete segment in the SegmentHolder
+    assert_eq!(archive_count, before_ids.len());
lib/shard/Cargo.toml (1)

12-14: Move serde_json to dev-dependencies
All serde_json usages are inside #[cfg(test)] modules; relocate it to [dev-dependencies] or gate it under the testing feature to avoid adding it to production builds.
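A sketch of what the relocation might look like in lib/shard/Cargo.toml; the version number and section contents are illustrative, not the crate's actual manifest:

```toml
# Hypothetical lib/shard/Cargo.toml fragment: serde_json moved out of
# [dependencies] so it is only compiled for test builds.
[features]
testing = []

[dev-dependencies]
serde_json = "1"
```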

lib/shard/src/segment_holder/snapshot.rs (2)

63-64: Avoid unwrap on get(); make the invariant explicit (or return an error)

Using unwrap can panic if the mapping changed unexpectedly. Prefer expect with a message, or return an OperationError.

-            let segment = segments_lock.get(segment_id).unwrap();
+            let segment = segments_lock
+                .get(segment_id)
+                .expect("segment_id obtained from segment_ids() must exist while holding upgradable read lock");

49-56: Clean up tmp segment on early failure

If index replication fails before swap, tmp_segment is left behind. Add a cleanup path to drop tmp_segment data on error before returning.

Example approach: wrap creation in a small guard that calls tmp_segment.drop_data() unless “committed”, and mark committed right before returning Ok((proxies, tmp_segment, ...)).

Also applies to: 57-78, 80-89
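The guard approach hinted at above might look like the following sketch. TmpSegment, CleanupGuard, replicate_indexes, and proxy_with_guard are all hypothetical names standing in for the real APIs:

```rust
// Hypothetical stand-in for the temporary segment created before proxying.
struct TmpSegment {
    data_dropped: bool,
}

impl TmpSegment {
    // Stand-in for the real cleanup call that removes on-disk data.
    fn drop_data(&mut self) {
        self.data_dropped = true;
    }
}

// RAII guard: unless explicitly committed, drops the tmp segment's data
// on any early return (e.g. when index replication fails before the swap).
struct CleanupGuard<'a> {
    segment: &'a mut TmpSegment,
    committed: bool,
}

impl<'a> Drop for CleanupGuard<'a> {
    fn drop(&mut self) {
        if !self.committed {
            self.segment.drop_data();
        }
    }
}

// Stand-in for the fallible index replication step.
fn replicate_indexes(fail: bool) -> Result<(), ()> {
    if fail { Err(()) } else { Ok(()) }
}

fn proxy_with_guard(segment: &mut TmpSegment, fail: bool) -> Result<(), ()> {
    let mut guard = CleanupGuard { segment, committed: false };
    replicate_indexes(fail)?; // early return here triggers cleanup in Drop
    guard.committed = true; // mark committed right before returning Ok
    Ok(())
}

fn main() {
    let mut tmp = TmpSegment { data_dropped: false };
    assert!(proxy_with_guard(&mut tmp, true).is_err());
    assert!(tmp.data_dropped); // failure path cleaned up the tmp segment

    let mut tmp = TmpSegment { data_dropped: false };
    assert!(proxy_with_guard(&mut tmp, false).is_ok());
    assert!(!tmp.data_dropped); // committed path keeps the data
    println!("ok");
}
```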

lib/collection/src/shards/local_shard/mod.rs (4)

208-209: Rename local var to match field and intent

Use update_operation_lock consistently to avoid confusion.

-        let scroll_read_lock = Arc::new(tokio::sync::RwLock::new(()));
+        let update_operation_lock = Arc::new(tokio::sync::RwLock::new(()));
@@
-            scroll_read_lock.clone(),
+            update_operation_lock.clone(),
@@
-            update_operation_lock: scroll_read_lock,
+            update_operation_lock,

Also applies to: 226-228, 261-262


122-124: Use imported alias for consistency

Type already aliased as TokioRwLock above.

-    pub(super) update_operation_lock: Arc<tokio::sync::RwLock<()>>,
+    pub(super) update_operation_lock: Arc<TokioRwLock<()>>,

289-293: Prefer explicit From over Into per guidelines

-        let wal: SerdeWal<OperationWithClockTag> = SerdeWal::new(
-            wal_path.to_str().unwrap(),
-            (&collection_config_read.wal_config).into(),
-        )
+        let wal: SerdeWal<OperationWithClockTag> = SerdeWal::new(
+            wal_path.to_str().unwrap(),
+            WalOptions::from(&collection_config_read.wal_config),
+        )

575-577: Prefer explicit From over Into per guidelines

-        let wal: SerdeWal<OperationWithClockTag> =
-            SerdeWal::new(wal_path.to_str().unwrap(), (&config.wal_config).into())?;
+        let wal: SerdeWal<OperationWithClockTag> =
+            SerdeWal::new(wal_path.to_str().unwrap(), WalOptions::from(&config.wal_config))?;
lib/collection/src/shards/local_shard/snapshot.rs (1)

327-341: Use HashSet for unproxied IDs to avoid O(n²) retain

Switch from Vec to HashSet for membership tests.

-    let mut unproxied_segment_ids = Vec::with_capacity(proxies.len());
+    let mut unproxied_segment_ids = HashSet::with_capacity(proxies.len());
@@
-                Ok(lock) => {
-                    segments_lock = lock;
-                    unproxied_segment_ids.push(*segment_id);
-                }
+                Ok(lock) => {
+                    segments_lock = lock;
+                    unproxied_segment_ids.insert(*segment_id);
+                }
@@
-    proxies.retain(|(id, _)| !unproxied_segment_ids.contains(id));
+    proxies.retain(|(id, _)| !unproxied_segment_ids.contains(id));

Also applies to: 343-344

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between beaea63 and f2ee497.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • Cargo.toml (1 hunks)
  • lib/collection/Cargo.toml (2 hunks)
  • lib/collection/src/shards/local_shard/mod.rs (6 hunks)
  • lib/collection/src/shards/local_shard/scroll.rs (3 hunks)
  • lib/collection/src/shards/local_shard/snapshot.rs (1 hunks)
  • lib/collection/src/shards/local_shard/snapshot_tests.rs (1 hunks)
  • lib/shard/Cargo.toml (2 hunks)
  • lib/shard/src/lib.rs (1 hunks)
  • lib/shard/src/segment_holder/mod.rs (2 hunks)
  • lib/shard/src/segment_holder/snapshot.rs (1 hunks)
  • lib/shard/src/segment_holder/tests.rs (0 hunks)
💤 Files with no reviewable changes (1)
  • lib/shard/src/segment_holder/tests.rs
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • lib/collection/src/shards/local_shard/scroll.rs
  • lib/collection/src/shards/local_shard/snapshot_tests.rs
  • lib/shard/src/segment_holder/snapshot.rs
  • lib/collection/src/shards/local_shard/snapshot.rs
  • lib/shard/src/lib.rs
  • lib/shard/src/segment_holder/mod.rs
  • lib/collection/src/shards/local_shard/mod.rs
**/src/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/src/**/*.rs: Prefer exhaustive match arms over a catch-all _ arm to avoid missing new enum variants (except in tests/benchmarks or when provably safe)
Prefer explicit field ignoring with : _ over .. in struct patterns (except in tests/benchmarks or when provably safe)

Files:

  • lib/collection/src/shards/local_shard/scroll.rs
  • lib/collection/src/shards/local_shard/snapshot_tests.rs
  • lib/shard/src/segment_holder/snapshot.rs
  • lib/collection/src/shards/local_shard/snapshot.rs
  • lib/shard/src/lib.rs
  • lib/shard/src/segment_holder/mod.rs
  • lib/collection/src/shards/local_shard/mod.rs
🧠 Learnings (2)
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
PR: qdrant/qdrant#6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.

Applied to files:

  • lib/shard/src/lib.rs
📚 Learning: 2025-08-15T11:42:00.297Z
Learnt from: IvanPleshkov
PR: qdrant/qdrant#7043
File: lib/segment/src/vector_storage/quantized/quantized_mmap_storage.rs:86-90
Timestamp: 2025-08-15T11:42:00.297Z
Learning: In lib/segment/src/vector_storage/quantized/quantized_mmap_storage.rs, overflow protection for encoded_storage_size computation (quantized_vector_size * vectors_count) is implemented in PR #7048, using checked_mul with u64 casting to prevent silent overflow.

Applied to files:

  • lib/shard/src/segment_holder/mod.rs
🧬 Code graph analysis (3)
lib/collection/src/shards/local_shard/snapshot_tests.rs (3)
lib/collection/src/shards/local_shard/snapshot.rs (1)
  • snapshot_all_segments (207-243)
lib/common/common/src/tar_ext.rs (2)
  • tar (49-51)
  • new_seekable_owned (84-86)
lib/common/common/src/save_on_disk.rs (1)
  • load_or_init_default (38-43)
lib/shard/src/segment_holder/snapshot.rs (2)
lib/collection/src/shards/replica_set/mod.rs (1)
  • payload_index_schema (1183-1185)
lib/shard/src/segment_holder/mod.rs (2)
  • segment_ids (293-298)
  • get (249-253)
lib/collection/src/shards/local_shard/snapshot.rs (4)
lib/collection/src/shards/local_shard/mod.rs (5)
  • segments_path (466-468)
  • segments (265-267)
  • new (178-263)
  • new (1096-1101)
  • archive_data (1124-1139)
lib/shard/src/segment_holder/snapshot.rs (4)
  • snapshot_manifest (20-31)
  • proxy_all_segments (35-114)
  • try_unproxy_segment (122-174)
  • unproxy_all_segments (177-241)
lib/shard/src/segment_holder/mod.rs (1)
  • segments (1031-1034)
lib/shard/src/wal.rs (1)
  • segment_capacity (210-212)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: lint
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
🔇 Additional comments (12)
lib/collection/Cargo.toml (1)

24-24: Test-only tar and shard(dev) with testing feature — LGTM

Dev-deps look scoped correctly; shard default-features disabled in both deps and dev-deps matches the refactor intent.

Also applies to: 29-30, 76-77

lib/shard/src/lib.rs (1)

10-11: Ensure fixture imports are test-only
Found numerous use crate::fixtures (and one use shard::fixtures)—confirm each import lives inside a #[cfg(test)] module or a test-only file so production builds without the testing feature won’t break.

lib/shard/src/segment_holder/mod.rs (1)

1-1: Snapshot implementation now internal — public API unchanged
Public SegmentHolder methods (snapshot_manifest, proxy_all_segments, unproxy_all_segments) remain accessible; no re-exports needed.

lib/shard/src/segment_holder/snapshot.rs (1)

191-201: Good: two-stage propagation minimizes write-lock duration

Batching propagate_to_wrapped under read then write lock is sound and reduces contention.

lib/collection/src/shards/local_shard/mod.rs (2)

108-124: Docs and intent for update_operation_lock look good

Clear scope: block updates during scroll and snapshot phases with a Tokio RwLock.


668-676: Passing update_operation_lock into CollectionUpdater::update is correct

This ensures WAL replays respect the update/write critical section.

Please confirm CollectionUpdater always acquires update_operation_lock before taking a write on SegmentHolder to maintain a consistent lock acquisition order with snapshot paths (read update lock -> write segment lock).

lib/collection/src/shards/local_shard/snapshot.rs (6)

95-116: Good: snapshot runs in spawn_blocking with read barriers

Delegating to spawn_blocking and using update_operation_lock during proxy unwrapping is the right tradeoff to keep writes flowing.


168-197: WAL snapshot: correct to skip sentinel file

Skipping ".wal" avoids Windows handle errors.


25-32: Delegating to SegmentHolder::snapshot_manifest is tidy

Simple and correct surface.


33-59: Restore: filter hidden entries before in-place restore

Looks correct; logs make recovery easier to audit.


132-161: Verify segment_queue_len=0 support
Unable to locate WalOptions definition or generate_empty_wal_starting_at_index implementation in the shard crate—ensure segment_queue_len=0 is valid and won’t panic or error at runtime.


293-344: Incorrect lock‐order concern: no write‐lock on update_operation_lock
No code ever acquires update_operation_lock.write()—only read locks are held in snapshot.rs—so there’s no lock‐order inversion or deadlock to address.

Likely an incorrect or invalid review comment.

Comment on lines +120 to 121
shard = { path = "lib/shard", default-features = false }
collection = { path = "lib/collection" }

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Disable shard default features at the workspace root — LGTM; confirm desired feature surface

This aligns with the feature-gating cleanup. Ensure downstream crates explicitly enable any needed shard features (only "testing" exists currently).

List all shard dependency declarations to spot any missing feature flags:


🏁 Script executed:

#!/bin/bash
fd -a Cargo.toml | xargs -I{} sh -c "echo '--- {}'; rg -nP '^\s*shard\s*=\s*\{[^}]*\}' {} || true"

Length of output: 1054


Enable missing shard testing feature
lib/collection/Cargo.toml:76 – shard = { path = "../shard", default-features = false } should include features = ["testing"].

🤖 Prompt for AI Agents
In Cargo.toml around lines 120-121 the shard dependency is declared as `shard =
{ path = "lib/shard", default-features = false }` and the review notes
lib/collection/Cargo.toml:76 also declares the shard dependency without the
testing feature; update the shard dependency declarations to include `features =
["testing"]` (i.e., change to include features = ["testing"] alongside path and
default-features) so the testing feature is enabled where required, and apply
the same change in lib/collection/Cargo.toml line 76.

let segments = self.segments.clone();

let scroll_lock = self.scroll_read_lock.read().await;
let scroll_lock = self.update_operation_lock.read().await;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Confirm write-side usage of update_operation_lock

Ensure all update/segment-mutation paths take a write guard on update_operation_lock, otherwise the new read-guards here won’t actually block them.


🏁 Script executed:

#!/bin/bash
# Show read/write usages of update_operation_lock across collection
rg -nP --type=rust -C2 '\bupdate_operation_lock\s*\.\s*(read|write)\s*\(\s*\)\.await' lib/collection

Length of output: 1361


🏁 Script executed:

#!/bin/bash
# Show write usages of update_operation_lock across collection
rg -nP --type=rust -C3 '\bupdate_operation_lock\s*\.write\s*\(\s*\)\.await' lib/collection

Length of output: 0


Add write guards for update_operation_lock in mutation paths
No update_operation_lock.write().await found; current read guards in scroll.rs won’t block mutations. Acquire update_operation_lock.write().await in all update/segment-mutation paths.

🤖 Prompt for AI Agents
In lib/collection/src/shards/local_shard/scroll.rs around line 156, the code
acquires a read guard on update_operation_lock (let scroll_lock =
self.update_operation_lock.read().await) but this path performs mutations and
therefore must block concurrent updates; replace the read guard with a write
guard by calling self.update_operation_lock.write().await in all
mutation/segment-mutation paths in this file, rename the guard variable to
reflect a write lock (e.g., update_lock) and ensure the guard is held for the
full duration of the mutation and dropped when the operation completes to
maintain exclusive access.

Comment on lines +170 to +174
write_segments.replace(segment_id, wrapped_segment).unwrap();

// Downgrade write lock to read and give it back
Ok(RwLockWriteGuard::downgrade_to_upgradable(write_segments))
}

⚠️ Potential issue | 🟠 Major

Avoid panic: handle replace() error instead of unwrap()

Unwrap on replace may panic under rare races. Return the lock on error to let the caller recover.

-        write_segments.replace(segment_id, wrapped_segment).unwrap();
-
-        // Downgrade write lock to read and give it back
-        Ok(RwLockWriteGuard::downgrade_to_upgradable(write_segments))
+        match write_segments.replace(segment_id, wrapped_segment) {
+            Ok(()) => {
+                // Downgrade write lock to read and give it back
+                Ok(RwLockWriteGuard::downgrade_to_upgradable(write_segments))
+            }
+            Err(err) => {
+                log::error!(
+                    "Failed to replace proxied segment {segment_id} with wrapped segment: {err}"
+                );
+                Err(RwLockWriteGuard::downgrade_to_upgradable(write_segments))
+            }
+        }

Committable suggestion skipped: line range outside the PR's diff.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
lib/collection/src/shards/local_shard/snapshot.rs (1)

123-161: Minor: error message wording.

Nit: “Error while create empty WAL” → “Error while creating empty WAL”.

-            CollectionError::service_error(format!("Error while create empty WAL: {err}"))
+            CollectionError::service_error(format!("Error while creating empty WAL: {err}"))
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2ee497 and 7ec3c38.

📒 Files selected for processing (4)
  • lib/collection/src/collection_manager/collection_updater.rs (2 hunks)
  • lib/collection/src/shards/local_shard/scroll.rs (6 hunks)
  • lib/collection/src/shards/local_shard/snapshot.rs (1 hunks)
  • lib/collection/src/update_handler.rs (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/collection/src/shards/local_shard/scroll.rs
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • lib/collection/src/collection_manager/collection_updater.rs
  • lib/collection/src/update_handler.rs
  • lib/collection/src/shards/local_shard/snapshot.rs
**/src/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/src/**/*.rs: Prefer exhaustive match arms over a catch-all _ arm to avoid missing new enum variants (except in tests/benchmarks or when provably safe)
Prefer explicit field ignoring with : _ over .. in struct patterns (except in tests/benchmarks or when provably safe)

Files:

  • lib/collection/src/collection_manager/collection_updater.rs
  • lib/collection/src/update_handler.rs
  • lib/collection/src/shards/local_shard/snapshot.rs
🧠 Learnings (1)
📚 Learning: 2025-09-01T11:42:06.964Z
Learnt from: timvisee
PR: qdrant/qdrant#7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.

Applied to files:

  • lib/collection/src/update_handler.rs
🧬 Code graph analysis (1)
lib/collection/src/shards/local_shard/snapshot.rs (4)
lib/collection/src/shards/local_shard/mod.rs (5)
  • segments_path (466-468)
  • segments (265-267)
  • new (178-263)
  • new (1096-1101)
  • archive_data (1124-1139)
lib/shard/src/segment_holder/snapshot.rs (4)
  • snapshot_manifest (20-31)
  • proxy_all_segments (35-114)
  • try_unproxy_segment (122-174)
  • unproxy_all_segments (177-241)
lib/shard/src/segment_holder/mod.rs (1)
  • segments (1031-1034)
lib/shard/src/wal.rs (1)
  • segment_capacity (210-212)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: e2e-tests
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests
  • GitHub Check: storage-compat-test
  • GitHub Check: lint
🔇 Additional comments (5)
lib/collection/src/update_handler.rs (3)

295-315: Recovery path now respects the update-operation lock

Forwarding update_operation_lock through try_recover ensures replayed WAL operations contend on the same gate as live updates—exactly what we need for snapshot safety. Looks good.


620-760: Optimizer worker wiring matches the new lock semantics

Passing update_operation_lock into the optimizer worker keeps post-failure recovery and optimizer triggers consistent with the renamed lock. No issues spotted.


764-809: Update worker correctly propagates the renamed lock

The update worker now hands the same update_operation_lock down to CollectionUpdater::update, so write-side exclusivity remains intact. Implementation looks correct.

lib/collection/src/collection_manager/collection_updater.rs (1)

46-56: Lock rename and usage remain sound

Renaming the parameter and guard to update_operation_lock keeps intent clear, and the blocking_write guard usage is unchanged. All good here.

lib/collection/src/shards/local_shard/snapshot.rs (1)

61-121: LGTM: snapshot orchestration and WAL handling flow.

  • Using a plunger when not snapshotting WAL is correct.
  • Offloading heavy work to spawn_blocking and using tar_ext::BuilderExt::descend looks good.

Please confirm tar_ext::BuilderExt is Clone-safe and that descend succeeds for these inputs in your environment (it returns Result, which you already handle with ?).

Comment on lines +182 to +189
if entry.file_name() == ".wal" {
// This sentinel file is used for WAL locking. Trying to archive
// or open it will cause the following error on Windows:
// > The process cannot access the file because another process
// > has locked a portion of the file. (os error 33)
// https://github.com/qdrant/wal/blob/7c9202d0874/src/lib.rs#L125-L145
continue;
}

⚠️ Potential issue | 🔴 Critical

Fix: invalid comparison of OsString with string literal.

entry.file_name() returns OsString. Comparing it to ".wal" won’t compile on all platforms.

Apply this diff:

-            if entry.file_name() == ".wal" {
+            if entry.file_name() == std::ffi::OsStr::new(".wal") {
🤖 Prompt for AI Agents
In lib/collection/src/shards/local_shard/snapshot.rs around lines 182 to 189,
the code compares entry.file_name() (an OsString) directly to the string literal
".wal", which is invalid on some platforms; replace the comparison with an OsStr
comparison: add use std::ffi::OsStr; and change the condition to if
entry.file_name() == OsStr::new(".wal") { continue; } so the OsString is
compared against an OsStr rather than a &str.
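To make the suggested comparison concrete, here is a minimal, self-contained sketch; the `is_wal_sentinel` helper is illustrative and not part of the PR:

```rust
use std::ffi::OsStr;

// Illustrative helper: detect the WAL sentinel file by name.
// Comparing OsStr to OsStr is a byte-wise, platform-agnostic
// comparison with no intermediate UTF-8 conversion.
fn is_wal_sentinel(name: &OsStr) -> bool {
    name == OsStr::new(".wal")
}

fn main() {
    assert!(is_wal_sentinel(OsStr::new(".wal")));
    assert!(!is_wal_sentinel(OsStr::new("segment.dat")));
    println!("ok");
}
```

In the directory-walk loop from the excerpt above, this is `if is_wal_sentinel(&entry.file_name()) { continue; }`.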

Comment on lines +279 to +289
let segments_lock = segments.upgradable_read();

// Proxy all segments
log::trace!("Proxying all shard segments to apply function");
let (mut proxies, tmp_segment, mut segments_lock) = SegmentHolder::proxy_all_segments(
segments_lock,
segments_path,
segment_config,
payload_index_schema,
)?;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Deadlock risk: lock-order inversion between segments_lock and update_lock.

You acquire update_lock.blocking_write() while holding segments_lock (upgradable read). If an update path holds update_lock.read() and then tries to acquire a write on segments_lock, this creates a cycle.

Mitigations:

  • Do not block on update_lock while holding segments_lock. For early unproxy, prefer a non-blocking try_write() and skip if not available.
  • For the final unproxy, acquire update_lock.write() before reacquiring segments_lock (drop and reacquire), to enforce a consistent global order: update_lock → segments_lock.

Suggested changes:

  1. Early unproxy (avoid blocking while holding segments_lock):
-            let _update_guard = update_lock.blocking_write();
-            match SegmentHolder::try_unproxy_segment(
+            if let Some(_update_guard) = update_lock.try_write() {
+                match SegmentHolder::try_unproxy_segment(
                     segments_lock,
                     *segment_id,
                     proxy_segment.clone(),
-            ) {
-                Ok(lock) => {
-                    segments_lock = lock;
-                    unproxied_segment_ids.push(*segment_id);
-                }
-                Err(lock) => segments_lock = lock,
-            }
+                ) {
+                    Ok(lock) => {
+                        segments_lock = lock;
+                        unproxied_segment_ids.push(*segment_id);
+                    }
+                    Err(lock) => segments_lock = lock,
+                }
+            } else {
+                // Skip early unproxy now; will unproxy in the final step.
+            }
  2. Final unproxy (enforce lock order update_lock → segments_lock):
-    let _update_guard = update_lock.blocking_write();
-    SegmentHolder::unproxy_all_segments(segments_lock, proxies, tmp_segment)?;
+    drop(segments_lock);
+    let _update_guard = update_lock.blocking_write();
+    let segments_lock = segments.upgradable_read();
+    SegmentHolder::unproxy_all_segments(segments_lock, proxies, tmp_segment)?;

If the project guarantees that update paths never take segments_lock write while holding update_lock read during snapshotting/proxying, please document it explicitly; otherwise the above changes are necessary to avoid deadlocks.

Run this script to scan for potentially dangerous lock ordering (update lock read then segment holder write) elsewhere:

Also applies to: 329-341, 345-351


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Occurrences where update_operation_lock read is taken before segments holder write/upgradable:"
rg -nP --type=rust -C3 '(?s)update_operation_lock\s*\.\s*(?:read|blocking_read|try_read)\s*\([^)]*\).*?(?:self\.)?segments(?:\(\))?\s*\.\s*(?:write|upgradable|upgrade)\s*\('

echo
echo "Occurrences where segments holder write/upgradable is taken before update_operation_lock write:"
rg -nP --type=rust -C3 '(?s)(?:self\.)?segments(?:\(\))?\s*\.\s*(?:write|upgradable|upgrade)\s*\(.*?\).*?update_operation_lock\s*\.\s*(?:write|blocking_write|try_write)\s*\('

Length of output: 94


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching for update_lock.read() then segments.* lock acquisitions:"
rg -nP --type=rust '(?s)\bupdate_lock\s*\.\s*(?:read|blocking_read|try_read)\s*\([^)]*\).*?\bsegments\s*\.\s*(?:write|upgradable(?:_read)?|upgrade)\s*\(' -C3

echo
echo "Searching for segments.* lock acquisitions then update_lock.write():"
rg -nP --type=rust '(?s)\bsegments\s*\.\s*(?:write|upgradable(?:_read)?|upgrade)\s*\([^)]*\).*?\bupdate_lock\s*\.\s*(?:write|blocking_write|try_write)\s*\(' -C3

Length of output: 67


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Displaying lines 250–370 of snapshot.rs for context:"
sed -n '250,370p' lib/collection/src/shards/local_shard/snapshot.rs

Length of output: 4565


Prevent lock-order inversion between segments_lock and update_lock. Holding segments_lock (upgradable read) while calling update_lock.blocking_write() can deadlock if another task holds update_lock.read() and then awaits segments_lock.write(). Two fixes are required:

  • In the early-unproxy loop, replace blocking_write() with try_write() and skip if it fails.
  • Before the final unproxy_all_segments, drop segments_lock, acquire update_lock.blocking_write(), then reacquire segments.upgradable_read().
    Alternatively, document a global invariant that no update path ever takes segments write while holding update_lock.read().
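
The inversion described here is the classic two-lock cycle. A std-only sketch of the two mitigations — a consistent global order for the final step, and a non-blocking try for the opportunistic step — using plain Mutex stand-ins rather than the actual tokio/parking_lot types from the PR:

```rust
use std::sync::{Mutex, TryLockError};

// Stand-ins: update_lock for the tokio RwLock, segments for the
// SegmentHolder lock. Global order is update_lock -> segments.
struct Shard {
    update_lock: Mutex<()>,
    segments: Mutex<Vec<u32>>,
}

impl Shard {
    // Final unproxy: respect the global order, so blocking is safe.
    fn final_unproxy(&self) {
        let _update = self.update_lock.lock().unwrap();
        let mut segs = self.segments.lock().unwrap();
        segs.clear();
    }

    // Early unproxy: segments is already held, so never block on
    // update_lock here; skip and retry later if it is contended.
    fn early_unproxy(&self) -> bool {
        let mut segs = self.segments.lock().unwrap();
        match self.update_lock.try_lock() {
            Ok(_update) => {
                segs.pop();
                true
            }
            Err(TryLockError::WouldBlock) => false, // contended: skip
            Err(TryLockError::Poisoned(_)) => false,
        }
    }
}

fn main() {
    let shard = Shard {
        update_lock: Mutex::new(()),
        segments: Mutex::new(vec![1, 2, 3]),
    };
    assert!(shard.early_unproxy());
    shard.final_unproxy();
    assert!(shard.segments.lock().unwrap().is_empty());
    println!("no deadlock");
}
```

The key property: no path ever blocks on update_lock while holding segments, so no cycle can form.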

@generall generall merged commit 7d7f8a3 into dev Sep 25, 2025
16 checks passed
// Unproxy all segments
// Always do this to prevent leaving proxy segments behind
log::trace!("Unproxying all shard segments after function is applied");
let _update_guard = update_lock.blocking_write();

💡 The important change (lock) is here, and above before try_unproxy_segment

@generall generall deleted the refactor-proxy_all_segments_and_apply branch September 25, 2025 15:10
@timvisee
Member

✔️ Great idea to share the lock

timvisee pushed a commit that referenced this pull request Sep 29, 2025
…7306)

* separate snapshot-related functions into a new file

* move out snapshot-related function in local shard

* move snapshot test to local shard level

* move snapshot_all_segments and proxy_all_segments_and_apply to local shard

* use former scroll update lock to prevent snapshot operations overlapping with updates

* review fixes: rename all scroll_lock, use write lock for unproxy instead of read
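
The idea in the last commit — reusing the former scroll lock as a shared/exclusive gate between updates and snapshots — can be sketched with a std RwLock; the real code uses tokio's async RwLock under the name update_operation_lock, and the wrapper type here is illustrative:

```rust
use std::sync::RwLock;

// Illustrative stand-in for the renamed update_operation_lock:
// updates (and scrolls) take shared read access and may overlap
// with each other, while a snapshot takes exclusive write access,
// so no update can run while segments are proxied/unproxied.
struct UpdateOperationLock(RwLock<()>);

impl UpdateOperationLock {
    fn with_update<R>(&self, f: impl FnOnce() -> R) -> R {
        let _shared = self.0.read().unwrap(); // many updates at once
        f()
    }

    fn with_snapshot<R>(&self, f: impl FnOnce() -> R) -> R {
        let _exclusive = self.0.write().unwrap(); // excludes all updates
        f()
    }
}

fn main() {
    let lock = UpdateOperationLock(RwLock::new(()));
    let a = lock.with_update(|| "update ran");
    let b = lock.with_snapshot(|| "snapshot ran");
    println!("{a}; {b}");
}
```

This is why no new lock was needed: the scroll path already established the shared side, and the snapshot path only had to take the exclusive side.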
@timvisee timvisee mentioned this pull request Sep 29, 2025
@coderabbitai coderabbitai bot mentioned this pull request Oct 6, 2025

Labels

None yet

Projects

None yet


3 participants