
Test continuous snapshot data loss#7308

Merged
agourlay merged 2 commits into dev from continuous-snapshot-data-loss-bis on Oct 8, 2025

Conversation

@agourlay (Member) commented Sep 25, 2025

This test used to be broken with:

RUST_LOG=debug cargo nextest run -p collection test_continuous_snapshot --nocapture

...
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment af3cee4d-a4a5-4474-a344-749a12aca0ba
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 5387da80-1800-4d41-ae7e-d81437068c1e
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Segment c496b4a2-e09e-42c9-97ba-f15e5f362061 is already snapshotted!
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 7497eab1-38e7-42bf-86ab-db21a3fe4e12
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment c496b4a2-e09e-42c9-97ba-f15e5f362061
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Segment c496b4a2-e09e-42c9-97ba-f15e5f362061 is already snapshotted!
[2025-09-25T09:37:08Z DEBUG collection::shards::replica_set::execute_read_operation] Read operation failed: Service internal error: No version for point 0
[2025-09-25T09:37:08Z DEBUG collection::update_handler] Optimizer 'merge' running on segments: [1692, 1693, 1694]
[2025-09-25T09:37:08Z DEBUG collection::collection_manager::optimizers::segment_optimizer] Available space: 27.23 GiB, needed for optimization: 1.33 MiB
[2025-09-25T09:37:08Z DEBUG collection::common::sha_256] Computing checksum for file: "/tmp/temp_dirEg6kVn/test-0-2025-09-25-09-37-08.snapshot-arc-J3NO0Y"
[2025-09-25T09:37:08Z INFO  collection::collection::snapshots] Creating collection snapshot test-0-2025-09-25-09-37-08.snapshot into "/tmp/test_snapshotsi2ngdD/test-0-2025-09-25-09-37-08.snapshot"
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z DEBUG shard::proxy_segment] Double proxy segment creation
[2025-09-25T09:37:08Z INFO  shard::proxy_segment::snapshot_entry] Taking a snapshot of a proxy segment
[2025-09-25T09:37:08Z DEBUG segment::segment::snapshot] Taking snapshot of segment 9a4bc3fd-87a9-41f1-8137-fd43d5a4b84a

thread 'continuous_snapshot_test::test_continuous_snapshot' panicked at lib/collection/tests/integration/continuous_snapshot_test.rs:244:31:
points_task error: Service internal error: 1 of 1 read operations failed:
  Service internal error: No version for point 0

It is now fixed after reworking how proxy segments work in #7345.

This PR can now be merged to act as a non-regression test.

@agourlay agourlay force-pushed the continuous-snapshot-data-loss-bis branch from a36ea32 to c77273d on September 25, 2025 09:49
@agourlay (Member, Author) commented Sep 25, 2025

The root cause is again that the proxy's write segments are shared and modified without respecting the lock hierarchy:

let segments = self.non_appendable_then_appendable_segments();
let mut read_points = 0;
for segment in segments {
    let segment_arc = segment.get();
    let read_segment = segment_arc.read();
    let points = ids
        .iter()
        .cloned()
        .check_stop(|| is_stopped.load(Ordering::Relaxed))
        .filter(|id| read_segment.has_point(*id));
    for point in points {
        let is_ok = f(point, &read_segment)?;
        read_points += usize::from(is_ok);
    }
}

The delete operation lands exactly between the filter(|id| read_segment.has_point(*id)) call and the application of the read operation.

The point is supposed to be present because it passed the filter a nanosecond earlier.

Grabbing the update_lock for each segment iteration fixes the issue.
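The check-then-read hazard can be modeled in isolation. The sketch below is a simplified stand-in, not the actual segment API: a plain HashMap behind a Mutex plays the role of a segment, and holding a single guard across both the presence check and the read leaves no window for a concurrent delete to land in between.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Simplified stand-in for a segment: point id -> point version.
type Segment = HashMap<u64, u64>;

// Count readable points while holding ONE guard across both the presence
// check and the read itself, so no delete can land in between.
fn read_points(segment: &Arc<Mutex<Segment>>, ids: &[u64]) -> usize {
    let guard = segment.lock().unwrap();
    ids.iter()
        .filter(|id| guard.contains_key(*id)) // presence check...
        .map(|id| guard[id]) // ...and read under the SAME guard
        .count()
}

fn main() {
    let segment = Arc::new(Mutex::new(HashMap::from([(0u64, 1u64), (1, 1)])));
    let writer = {
        let segment = Arc::clone(&segment);
        thread::spawn(move || {
            // A concurrent delete: with two separate lock acquisitions it
            // could land between the filter and the read, reproducing the
            // "No version for point 0" failure mode described above.
            segment.lock().unwrap().remove(&0);
        })
    };
    writer.join().unwrap();
    // A consistent read after the delete sees exactly one point.
    assert_eq!(read_points(&segment, &[0, 1]), 1);
    println!("ok");
}
```

The real fix operates on the proxy's write segments rather than a HashMap, but the shape is the same: the check and the use must sit under one lock acquisition.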

@agourlay agourlay force-pushed the continuous-snapshot-data-loss-bis branch from c77273d to 936484d on October 8, 2025 08:21
@agourlay agourlay marked this pull request as ready for review October 8, 2025 08:26
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ca182 and 936484d.

📒 Files selected for processing (1)
  • lib/collection/tests/integration/continuous_snapshot_test.rs (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust code
Do not use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice in new code; use bytemuck or zerocopy instead

Files:

  • lib/collection/tests/integration/continuous_snapshot_test.rs
**/{tests,benches}/**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

Using .unwrap() and panic!() in tests and benchmarks is acceptable and should not be flagged

Files:

  • lib/collection/tests/integration/continuous_snapshot_test.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: e2e-tests
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: test-consensus-compose
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: integration-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
🔇 Additional comments (1)
lib/collection/tests/integration/continuous_snapshot_test.rs (1)

128-128: LGTM!

The comment clarifies the behavior of the following code block.

Comment on lines +195 to +214

// Retrieve one point at a time again with payload
for i in 0..points_count {
    let retrieve_point = PointRequestInternal {
        ids: vec![i.into()],
        with_payload: Some(true.into()),
        with_vector: WithVector::Bool(false),
    };
    let hw_counter = HwMeasurementAcc::disposable();
    let retrieve_result = collection
        .retrieve(
            retrieve_point,
            None,
            &ShardSelectorInternal::All,
            None,
            hw_counter,
        )
        .await?;
    assert_eq!(retrieve_result.len(), 1);
}

🛠️ Refactor suggestion | 🟠 Major

Use explicit type conversion instead of .into().

Line 200 uses true.into() which violates the coding guideline: "Prefer explicit SomeType::from(x) over implicit x.into() in Rust code". Use an explicit conversion to improve code clarity.

As per coding guidelines.

Apply this diff to use explicit conversion:

                 let retrieve_point = PointRequestInternal {
                     ids: vec![i.into()],
-                    with_payload: Some(true.into()),
+                    with_payload: Some(WithPayloadInterface::from(true)),
                     with_vector: WithVector::Bool(false),
                 };

Note: The new retrieval loop appropriately exercises the code path with payload requests, which is essential for this regression test. The lack of explicit payload verification is acceptable since the race condition manifests as retrieval errors rather than incorrect data.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Retrieve one point at a time again with payload
for i in 0..points_count {
    let retrieve_point = PointRequestInternal {
        ids: vec![i.into()],
        with_payload: Some(WithPayloadInterface::from(true)),
        with_vector: WithVector::Bool(false),
    };
    let hw_counter = HwMeasurementAcc::disposable();
    let retrieve_result = collection
        .retrieve(
            retrieve_point,
            None,
            &ShardSelectorInternal::All,
            None,
            hw_counter,
        )
        .await?;
    assert_eq!(retrieve_result.len(), 1);
}
🤖 Prompt for AI Agents
In lib/collection/tests/integration/continuous_snapshot_test.rs around lines 195
to 214, replace the implicit conversion Some(true.into()) with an explicit
conversion using the concrete payload type's from method (e.g.,
Some(WithPayload::from(true))) so the code follows the guideline "Prefer
explicit SomeType::from(x) over implicit x.into()"; update the loop to call the
concrete_type::from(true) for with_payload and keep the rest of the retrieval
logic unchanged.
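The guideline behind this suggestion can be illustrated in isolation: both forms perform the same conversion, but From names the target type at the call site, while into leaves it to inference. The snippet below uses plain std types, not Qdrant's.

```rust
fn main() {
    // Implicit: the target type must be inferred from the annotation.
    let a: String = "hello".into();
    // Explicit: the target type is visible at the call site.
    let b = String::from("hello");
    assert_eq!(a, b);
    println!("ok");
}
```

Because the standard library provides a blanket `Into` impl for every `From` impl, switching between the two forms never changes behavior, only readability.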

@agourlay agourlay changed the title from "Continuous snapshot data loss" to "Test continuous snapshot data loss" on Oct 8, 2025
@agourlay agourlay requested a review from timvisee October 8, 2025 08:53
@qdrant qdrant deleted a comment from coderabbitai bot Oct 8, 2025
let retrieve_point = PointRequestInternal {
    ids: vec![i.into()],
    with_payload: Some(true.into()),
    with_vector: WithVector::Bool(false),
Member

Nit: can we retrieve it including vectors and also assert we got both a vector and payload?
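The requested assertion can be sketched with a simplified stand-in record type (this is not the actual qdrant Record API, just the shape of the check): request vectors as well, then assert that both fields came back populated.

```rust
// Simplified stand-in for a retrieved record (not the actual Record type).
struct Record {
    vector: Option<Vec<f32>>,
    payload: Option<String>,
}

// Exactly one result, carrying both a vector and a payload.
fn assert_complete(records: &[Record]) {
    assert_eq!(records.len(), 1);
    assert!(records[0].vector.is_some());
    assert!(records[0].payload.is_some());
}

fn main() {
    let records = vec![Record {
        vector: Some(vec![0.1, 0.2]),
        payload: Some(r#"{"k":1}"#.to_string()),
    }];
    assert_complete(&records);
    println!("ok");
}
```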

@coderabbitai coderabbitai bot (Contributor) commented Oct 8, 2025

📝 Walkthrough
The test lib/collection/tests/integration/continuous_snapshot_test.rs now performs a retrieval after each point insertion. The retrieval requests both payload and vector data (with_payload=true, with_vector enabled) and asserts that exactly one result is returned and that both the vector and payload are present. No public or exported declarations were changed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • timvisee
  • generall

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check (Passed): The title "Test continuous snapshot data loss" succinctly captures the primary change of this pull request, which is the addition of a non-regression test to verify that continuous snapshot operations do not lose data. It directly reflects the main intent of the changeset without extraneous detail or ambiguity.
  • Description Check (Passed): The pull request description clearly explains the historical test failure, includes relevant logs demonstrating the issue, references the upstream fix in PR #7345, and states that this PR adds a non-regression test for the resolved race condition. The content is directly related to the changeset and provides sufficient context for reviewers.
  • Docstring Coverage (Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3ca182 and 2d47867.

📒 Files selected for processing (1)
  • lib/collection/tests/integration/continuous_snapshot_test.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: e2e-tests
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests-no-rocksdb (ubuntu-latest)
  • GitHub Check: lint


@agourlay agourlay merged commit 1fcb7a2 into dev Oct 8, 2025
15 checks passed
@agourlay agourlay deleted the continuous-snapshot-data-loss-bis branch October 8, 2025 10:20
timvisee pushed a commit that referenced this pull request Nov 14, 2025
* Continuous snapshot data loss

* assert vector & payload are present