
Test continuous snapshot data loss#7308

Merged
agourlay merged 2 commits into dev from continuous-snapshot-data-loss-bis on Oct 8, 2025

Conversation

@agourlay (Member) commented Sep 25, 2025

This test used to be broken with:

RUST_LOG=debug cargo nextest run -p collection test_continuous_snapshot --nocapture

...
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment af3cee4d-a4a5-4474-a344-749a12aca0ba
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 5387da80-1800-4d41-ae7e-d81437068c1e
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Segment c496b4a2-e09e-42c9-97ba-f15e5f362061 is already snapshotted!
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 7497eab1-38e7-42bf-86ab-db21a3fe4e12
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Segment c496b4a2-e09e-42c9-97ba-f15e5f362061 is already snapshotted!
[2025-09-25T09:37:08Z DEBUG collection::shards::replica_set::execute_read_operation] Read operation failed: Service internal error: No version for point 0
[2025-09-25T09:37:08Z DEBUG collection::update_handler] Optimizer 'merge' running on segments: [1692, 1693, 1694]
[2025-09-25T09:37:08Z DEBUG collection::collection_manager::optimizers::segment_optimizer] Available space: 27.23 GiB, needed for optimization: 1.33 MiB
[2025-09-25T09:37:08Z DEBUG collection::common::sha_256] Computing checksum for file: "/tmp/temp_dirEg6kVn/test-0-2025-09-25-09-37-08.snapshot-arc-J3NO0Y"
[2025-09-25T09:37:08Z INFO  collection::collection::snapshots] Creating collection snapshot test-0-2025-09-25-09-37-08.snapshot into "/tmp/test_snapshotsi2ngdD/test-0-2025-09-25-09-37-08.snapshot"
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 9a4bc3fd-87a9-41f1-8137-fd43d5a4b84a

thread 'continuous_snapshot_test::test_continuous_snapshot' panicked at lib/collection/tests/integration/continuous_snapshot_test.rs:244:31:
points_task error: Service internal error: 1 of 1 read operations failed:
  Service internal error: No version for point 0

It is now fixed after reworking how proxy segments work in #7345.

This PR can now be merged to act as a non-regression test.

@agourlay agourlay force-pushed the continuous-snapshot-data-loss-bis branch from a36ea32 to c77273d on September 25, 2025 09:49
@agourlay (Member, Author) commented Sep 25, 2025

The root cause is again that the proxy's write segments are shared and modified without respecting the lock hierarchy:

let segments = self.non_appendable_then_appendable_segments();
let mut read_points = 0;
for segment in segments {
    let segment_arc = segment.get();
    let read_segment = segment_arc.read();
    let points = ids
        .iter()
        .cloned()
        .check_stop(|| is_stopped.load(Ordering::Relaxed))
        .filter(|id| read_segment.has_point(*id));
    for point in points {
        let is_ok = f(point, &read_segment)?;
        read_points += usize::from(is_ok);
    }
}

The delete operation lands exactly between the filter(|id| read_segment.has_point(*id)) call and the application of the read operation.

The point is supposed to be present because it passed the filter a nanosecond earlier.

Grabbing the update_lock for each segment iteration fixes the issue.
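The check-then-read hazard can be modeled in isolation. The sketch below is a simplified stand-in, not the actual segment API: a plain HashMap behind a Mutex plays the role of a segment, and holding a single guard across both the presence check and the read leaves no window for a concurrent delete to land in between.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in for a segment: point id -> point version.
type Segment = HashMap<u64, u64>;

// Count readable points while holding ONE guard across both the presence
// check and the read itself, so no delete can land in between.
fn read_points(segment: &Arc<Mutex<Segment>>, ids: &[u64]) -> usize {
    let guard = segment.lock().unwrap();
    ids.iter()
        .filter(|id| guard.contains_key(*id)) // presence check...
        .map(|id| guard[id]) // ...and read under the SAME guard
        .count()
}

fn main() {
    let segment = Arc::new(Mutex::new(HashMap::from([(0u64, 1u64), (1, 1)])));
    let writer = {
        let segment = Arc::clone(&segment);
        thread::spawn(move || {
            // A concurrent delete: with two separate lock acquisitions it
            // could land between the filter and the read, reproducing the
            // "No version for point 0" failure mode described above.
            segment.lock().unwrap().remove(&0);
        })
    };
    writer.join().unwrap();
    // A consistent read after the delete sees exactly one point.
    assert_eq!(read_points(&segment, &[0, 1]), 1);
    println!("ok");
}
```

The real fix operates on the proxy's write segments rather than a HashMap, but the shape is the same: the check and the use must sit under one lock acquisition.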

@agourlay agourlay force-pushed the continuous-snapshot-data-loss-bis branch from c77273d to 936484d on October 8, 2025 08:21
@agourlay agourlay marked this pull request as ready for review October 8, 2025 08:26
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ca182 and 936484d.

📒 Files selected for processing (1)
  • lib/collection/tests/integration/continuous_snapshot_test.rs (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • lib/collection/tests/integration/continuous_snapshot_test.rs
**/{tests,benches}/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

Using .unwrap() and panic!() in tests and benchmarks is acceptable and should not be flagged

Files:

  • lib/collection/tests/integration/continuous_snapshot_test.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: e2e-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: integration-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
🔇 Additional comments (1)
lib/collection/tests/integration/continuous_snapshot_test.rs (1)

128-128: LGTM!

The comment clarifies the behavior of the following code block.

Comment on lines +195 to +214

// Retrieve one point at a time again with payload
for i in 0..points_count {
    let retrieve_point = PointRequestInternal {
        ids: vec![i.into()],
        with_payload: Some(true.into()),
        with_vector: WithVector::Bool(false),
    };
    let hw_counter = HwMeasurementAcc::disposable();
    let retrieve_result = collection
        .retrieve(
            retrieve_point,
            None,
            &ShardSelectorInternal::All,
            None,
            hw_counter,
        )
        .await?;
    assert_eq!(retrieve_result.len(), 1);
}

🛠️ Refactor suggestion | 🟠 Major

Use explicit type conversion instead of .into().

Line 200 uses true.into() which violates the coding guideline: "Prefer explicit SomeType::from(x) over implicit x.into() in Rust code". Use an explicit conversion to improve code clarity.

As per coding guidelines.

Apply this diff to use explicit conversion:

                 let retrieve_point = PointRequestInternal {
                     ids: vec![i.into()],
-                    with_payload: Some(true.into()),
+                    with_payload: Some(WithPayloadInterface::from(true)),
                     with_vector: WithVector::Bool(false),
                 };

Note: The new retrieval loop appropriately exercises the code path with payload requests, which is essential for this regression test. The lack of explicit payload verification is acceptable since the race condition manifests as retrieval errors rather than incorrect data.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Retrieve one point at a time again with payload
for i in 0..points_count {
    let retrieve_point = PointRequestInternal {
        ids: vec![i.into()],
        with_payload: Some(WithPayloadInterface::from(true)),
        with_vector: WithVector::Bool(false),
    };
    let hw_counter = HwMeasurementAcc::disposable();
    let retrieve_result = collection
        .retrieve(
            retrieve_point,
            None,
            &ShardSelectorInternal::All,
            None,
            hw_counter,
        )
        .await?;
    assert_eq!(retrieve_result.len(), 1);
}
🤖 Prompt for AI Agents
In lib/collection/tests/integration/continuous_snapshot_test.rs around lines 195
to 214, replace the implicit conversion Some(true.into()) with an explicit
conversion using the concrete payload type's from method (e.g.,
Some(WithPayload::from(true))) so the code follows the guideline "Prefer
explicit SomeType::from(x) over implicit x.into()"; update the loop to call the
concrete_type::from(true) for with_payload and keep the rest of the retrieval
logic unchanged.
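The guideline behind this suggestion can be illustrated in isolation: both forms perform the same conversion, but From names the target type at the call site, while into leaves it to inference. The snippet below uses plain std types, not Qdrant's.

```rust
fn main() {
    // Implicit: the target type must be inferred from the annotation.
    let a: String = "hello".into();
    // Explicit: the target type is visible at the call site.
    let b = String::from("hello");
    assert_eq!(a, b);
    println!("ok");
}
```

Because the standard library provides a blanket `Into` impl for every `From` impl, switching between the two forms never changes behavior, only readability.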

@agourlay agourlay changed the title from "Continuous snapshot data loss" to "Test continuous snapshot data loss" on Oct 8, 2025
@agourlay agourlay requested a review from timvisee October 8, 2025 08:53
@qdrant qdrant deleted a comment from coderabbitai bot Oct 8, 2025
let retrieve_point = PointRequestInternal {
    ids: vec![i.into()],
    with_payload: Some(true.into()),
    with_vector: WithVector::Bool(false),
Member

Nit: can we retrieve it including vectors and also assert we got both a vector and payload?
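The requested assertion can be sketched with a simplified stand-in record type (this is not the actual qdrant Record API, just the shape of the check): request vectors as well, then assert that both fields came back populated.

```rust
// Simplified stand-in for a retrieved record (not the actual Record type).
struct Record {
    vector: Option<Vec<f32>>,
    payload: Option<String>,
}

// Exactly one result, carrying both a vector and a payload.
fn assert_complete(records: &[Record]) {
    assert_eq!(records.len(), 1);
    assert!(records[0].vector.is_some());
    assert!(records[0].payload.is_some());
}

fn main() {
    let records = vec![Record {
        vector: Some(vec![0.1, 0.2]),
        payload: Some(r#"{"k":1}"#.to_string()),
    }];
    assert_complete(&records);
    println!("ok");
}
```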

@coderabbitai coderabbitai bot (Contributor) commented Oct 8, 2025

📝 Walkthrough
The test lib/collection/tests/integration/continuous_snapshot_test.rs now performs a retrieval after each point insertion. The retrieval requests both payload and vector data (with_payload=true, with_vector enabled) and asserts that exactly one result is returned and that both the vector and payload are present. No public or exported declarations were changed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • timvisee
  • generall

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check (Passed): The title "Test continuous snapshot data loss" succinctly captures the primary change of this pull request, which is the addition of a non-regression test to verify that continuous snapshot operations do not lose data. It directly reflects the main intent of the changeset without extraneous detail or ambiguity.
  • Description Check (Passed): The pull request description clearly explains the historical test failure, includes relevant logs demonstrating the issue, references the upstream fix in PR #7345, and states that this PR adds a non-regression test for the resolved race condition. The content is directly related to the changeset and provides sufficient context for reviewers.
  • Docstring Coverage (Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ca182 and 2d47867.

📒 Files selected for processing (1)
  • lib/collection/tests/integration/continuous_snapshot_test.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: lint


@agourlay agourlay merged commit 1fcb7a2 into dev Oct 8, 2025
15 checks passed
@agourlay agourlay deleted the continuous-snapshot-data-loss-bis branch October 8, 2025 10:20
timvisee pushed a commit that referenced this pull request Nov 14, 2025
* Continuous snapshot data loss

* assert vector & payload are present