Add test for broken WAL delta after stream records abort by timvisee · Pull Request #7787 · qdrant/qdrant

timvisee · 2025-12-16T10:32:14Z

Fixed by #7791

Aborting a stream records (or other) transfer may break future WAL delta transfers.

This PR adds a test to reproduce the problematic behavior.

Specifically, any new update coming in during a stream records (or other) transfer may bump the last seen clocks. This also happens on the node that is receiving the transfer. If we then abort the stream records transfer, the last seen clocks may have jumped over a huge gap. A follow-up WAL delta transfer will only transfer changes since the last seen clocks, missing the huge set of changes inside that jump.

In practice, this sequence is problematic:

1. start stream records transfer to recover replica
2. send new update through all peers
3. abort stream records transfer
4. start WAL delta transfer
5. ⚠️ WAL delta resolves very short or empty diff

Imagine the initial stream records transfer only covered 10% of the points before it got aborted. The WAL delta transfer after it will then miss the remaining 90% of point changes, corrupting the target replica.
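The sequence above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Replica` class, the `wal_delta` helper, and the single integer clock are hypothetical and are not Qdrant's actual data structures, which track clocks per peer. It only shows why a bumped last seen clock makes the delta come out empty.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica (hypothetical): a point store plus the highest clock tick seen."""
    points: dict = field(default_factory=dict)
    last_seen: int = 0

    def apply(self, tick, point_id, value):
        # Applying an update stores the point and bumps the last seen clock.
        self.points[point_id] = value
        self.last_seen = max(self.last_seen, tick)


def wal_delta(wal, since):
    """Return the WAL entries with a clock tick newer than `since`."""
    return [entry for entry in wal if entry[0] > since]


# Source replica: ticks 1..10 each write one point.
wal = [(tick, f"p{tick}", tick) for tick in range(1, 11)]
source = Replica()
for entry in wal:
    source.apply(*entry)

target = Replica()  # replica to recover, starts empty

# 1. Stream records transfer starts and copies only the first point (10%).
target.points["p1"] = source.points["p1"]

# 2. A new update goes through all peers, including the target.
new_entry = (11, "p11", 11)
wal.append(new_entry)
source.apply(*new_entry)
target.apply(*new_entry)  # target.last_seen jumps from 0 to 11

# 3. The transfer is aborted.
# 4. WAL delta transfer starts from the target's last seen clock.
delta = wal_delta(wal, target.last_seen)
for entry in delta:
    target.apply(*entry)

# 5. The delta is empty, so the points in the gap never arrive.
missing = sorted(set(source.points) - set(target.points))
print(len(delta), len(missing))
```

In this model the target ends up with only `p1` and `p11`, while the points written at ticks 2 to 10 are never transferred.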

Test:

cargo build --features staging
pytest tests/consensus_tests/test_shard_wal_delta_transfer.py -k test_abort_stream_records_breaks_wal_delta

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?

timvisee · 2025-12-17T10:30:00Z

Note to reviewers: @agourlay @ffuugoo @coszio

This PR contains the test, and #7791 which you've already reviewed.

* Make set_replica_state async
* Add function called when active state of local replica changes
* Add snapshot for newest clocks
* Bump newest clocks snapshot on replica deactivation
* Use newest clocks snapshot during recovery
* Add enum for specifying whether to take or clear clocks snapshot
* Store clock snapshot inside clock map, removing extra file
  This greatly simplifies state handling. It also prevents any kind of desynchronization because all newest clocks are always persisted atomically.
* Immediately persist clocks after taking snapshot
* Always update snapshot, only take if missing
* Take clock snapshots through each shard flavor, including proxies
* Propagate dedicated functions for taking and clearing clocks snapshot
* Only persist clocks immediately if changed on snapshot/clear
* Simplify recovery point logic, always take clocks snapshot if exists
* Remove unwrap
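The commit list suggests the fix keeps a snapshot of the newest clocks that is taken when a replica is deactivated and used as the recovery point instead of the live clocks. Below is a minimal sketch of that idea. The `ToyClockMap` class and its method names are hypothetical, the single integer clock is a simplification, and the point at which the snapshot is cleared is an assumption. The actual implementation lives in #7791.

```python
from typing import Optional


class ToyClockMap:
    """Hypothetical stand-in for a clock map; not Qdrant's actual type."""

    def __init__(self) -> None:
        self.newest: int = 0
        self.snapshot: Optional[int] = None

    def advance(self, tick: int) -> None:
        # Live clock keeps moving, even while the replica is being recovered.
        self.newest = max(self.newest, tick)

    def take_snapshot(self) -> None:
        # Only take a snapshot if none exists, so the earliest safe point is kept.
        if self.snapshot is None:
            self.snapshot = self.newest

    def clear_snapshot(self) -> None:
        self.snapshot = None

    def recovery_point(self) -> int:
        # Prefer the snapshot if it exists, otherwise fall back to the live clock.
        return self.newest if self.snapshot is None else self.snapshot


clocks = ToyClockMap()
clocks.advance(3)        # replica is consistent up to tick 3
clocks.take_snapshot()   # replica deactivated: freeze tick 3
clocks.advance(11)       # update arrives during a transfer that is later aborted
during_recovery = clocks.recovery_point()

clocks.clear_snapshot()  # assumed: cleared once the replica is active again
after_recovery = clocks.recovery_point()
print(during_recovery, after_recovery)
```

With the snapshot in place, a WAL delta transfer in this model would start from tick 3 rather than tick 11, so the changes in between are no longer skipped.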

agourlay

Impressive integration test 👏

* Add test to reproduce broken WAL delta after aborting stream records
* Add staging env var to slow down stream records transfers for test
* Tweak test formatting and utilities a bit
* Add comment to test, link to PR describing bug
* Update test so it still succeeds with patched behavior
* Fix broken WAL delta after stream records abort (#7791)
* Make set_replica_state async
* Add function called when active state of local replica changes
* Add snapshot for newest clocks
* Bump newest clocks snapshot on replica deactivation
* Use newest clocks snapshot during recovery
* Add enum for specifying whether to take or clear clocks snapshot
* Store clock snapshot inside clock map, removing extra file
  This greatly simplifies state handling. It also prevents any kind of desynchronization because all newest clocks are always persisted atomically.
* Immediately persist clocks after taking snapshot
* Always update snapshot, only take if missing
* Take clock snapshots through each shard flavor, including proxies
* Propagate dedicated functions for taking and clearing clocks snapshot
* Only persist clocks immediately if changed on snapshot/clear
* Simplify recovery point logic, always take clocks snapshot if exists
* Remove unwrap
* Fix typo
* Fix doc comment
* Transfer driver is async, use Tokio sleep
* Reduce visibility

timvisee force-pushed the abort-stream-records-breaks-wal-delta branch from 275d10b to e4d6444 December 16, 2025 10:34

timvisee added the release:1.16.3 label Dec 16, 2025

timvisee mentioned this pull request Dec 16, 2025

Fix broken WAL delta after stream records abort #7791

Merged

9 tasks

generall approved these changes Dec 17, 2025


timvisee marked this pull request as ready for review December 17, 2025 10:29

timvisee requested review from agourlay, coszio and ffuugoo December 17, 2025 10:29

timvisee and others added 7 commits December 17, 2025 11:31

Add test to reproduce broken WAL delta after aborting stream records

4438897

Add staging env var to slow down stream records transfers for test

332a2a2

Tweak test formatting and utilities a bit

37e53b1

Add comment to test, link to PR describing bug

ce26728

Update test so it still succeeds with patched behavior

6587232

Fix typo

3790ef3

timvisee force-pushed the abort-stream-records-breaks-wal-delta branch from f225e02 to 3790ef3 December 17, 2025 10:31


timvisee added 3 commits December 17, 2025 11:40

Fix doc comment

d721656

Transfer driver is async, use Tokio sleep

18b6e70

Reduce visibility

3938bed

qdrant deleted a comment from coderabbitai bot Dec 17, 2025

agourlay approved these changes Dec 17, 2025


timvisee merged commit 0196323 into dev Dec 17, 2025
15 checks passed

timvisee deleted the abort-stream-records-breaks-wal-delta branch December 17, 2025 11:33

timvisee mentioned this pull request Dec 19, 2025

Bump version to 1.16.3 #7806

Merged

claude bot mentioned this pull request Jan 9, 2026

chore(deps): update docker.io/qdrant/qdrant docker tag to v1.16.3 cbcoutinho/nextcloud-mcp-server#464

Merged

1 task

timvisee merged 10 commits into dev from abort-stream-records-breaks-wal-delta


3 participants
