
Fix broken WAL delta after stream records abort#7791

Merged
timvisee merged 14 commits into abort-stream-records-breaks-wal-delta from
fix-abort-stream-records-breaks-wal-delta
Dec 17, 2025

Conversation

@timvisee
Member

@timvisee timvisee commented Dec 16, 2025

Fixes #7787

I suggest reviewing and merging this into #7787, so that we can merge the fix and test as a whole into dev. See the mentioned PR for more details on the actual bug.

This PR fixes the problem by taking a 'snapshot' of the last seen clocks whenever the replica enters any non-active state. When doing WAL delta recovery on the replica, we derive the recovery point from that snapshot rather than from the actual latest clocks.

When the replica becomes active again we're sure it's in a good state, and the clocks snapshot is then cleared.
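The mechanism above can be sketched as a minimal, self-contained model. Type and method names here are illustrative, not Qdrant's actual API:

```rust
use std::collections::HashMap;

type Key = u32;
type Clock = u64;

/// Simplified model of a clock map with an optional snapshot of earlier clocks.
#[derive(Default)]
struct ClockMap {
    clocks: HashMap<Key, Clock>,
    snapshot: Option<HashMap<Key, Clock>>,
}

impl ClockMap {
    /// Advance a clock to at least the given tick.
    fn advance(&mut self, key: Key, tick: Clock) {
        let entry = self.clocks.entry(key).or_insert(0);
        *entry = (*entry).max(tick);
    }

    /// Take a snapshot when the replica leaves the active state.
    /// Only taken if missing, so an existing snapshot is not overwritten.
    fn snapshot_if_missing(&mut self) {
        if self.snapshot.is_none() {
            self.snapshot = Some(self.clocks.clone());
        }
    }

    /// Clear the snapshot once the replica is in a good (active) state again.
    fn clear_snapshot(&mut self) {
        self.snapshot = None;
    }

    /// Derive the recovery point: prefer the snapshot over the latest clocks.
    fn recovery_point(&self) -> &HashMap<Key, Clock> {
        self.snapshot.as_ref().unwrap_or(&self.clocks)
    }
}

fn main() {
    let mut map = ClockMap::default();
    map.advance(1, 10);
    map.snapshot_if_missing(); // replica deactivates at clock 10
    map.advance(1, 15); // more operations arrive before recovery completes
    assert_eq!(map.recovery_point()[&1], 10); // recover from the snapshot
    map.clear_snapshot(); // replica is active again
    assert_eq!(map.recovery_point()[&1], 15); // back to the live clocks
}
```

The key point is that the recovery point falls back to the older snapshot while one exists, so operations applied during an interrupted transfer cannot push the recovery point past data the replica never durably received.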

Relevant test:

pytest tests/consensus_tests/test_shard_wal_delta_transfer.py -k test_abort_stream_records_breaks_wal_delta

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using the cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Comment on lines 15 to 22
#[serde(from = "ClockMapHelper", into = "ClockMapHelper")]
pub struct ClockMap {
    clocks: HashMap<Key, Clock>,
    /// Optional snapshot with an earlier version of the clocks
    snapshot: Option<HashMap<Key, Clock>>,
    /// Whether this clock map has changed since the last time it was persisted
    changed: bool,
}
Member Author


Here I persist the optional snapshot together with the regular clocks.

It took me a few iterations to finally land here. In my opinion this is the best approach because:

  • we avoid adding new files
  • we avoid conditional flushers
  • we now flush the actual and snapshot clocks atomically
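The `#[serde(from = "...", into = "...")]` attribute on the struct above works through plain `From` conversions with a helper type that mirrors the persisted shape. A sketch of that pattern, with serde replaced by manual conversions so it stays self-contained (the helper's exact fields are an assumption):

```rust
use std::collections::HashMap;

type Key = u32;
type Clock = u64;

struct ClockMap {
    clocks: HashMap<Key, Clock>,
    snapshot: Option<HashMap<Key, Clock>>,
    changed: bool,
}

/// Helper mirroring the on-disk shape. Both the live clocks and the optional
/// snapshot live in one struct, so a single write persists them atomically.
/// `changed` is runtime-only state and is deliberately absent.
struct ClockMapHelper {
    clocks: HashMap<Key, Clock>,
    snapshot: Option<HashMap<Key, Clock>>,
}

impl From<ClockMap> for ClockMapHelper {
    fn from(map: ClockMap) -> Self {
        ClockMapHelper { clocks: map.clocks, snapshot: map.snapshot }
    }
}

impl From<ClockMapHelper> for ClockMap {
    fn from(helper: ClockMapHelper) -> Self {
        ClockMap {
            clocks: helper.clocks,
            snapshot: helper.snapshot,
            // Freshly loaded state has not diverged from disk yet.
            changed: false,
        }
    }
}

fn main() {
    let mut clocks = HashMap::new();
    clocks.insert(1u32, 5u64);
    let map = ClockMap { clocks: clocks.clone(), snapshot: Some(clocks), changed: true };
    // Round-trip through the persisted representation.
    let restored: ClockMap = ClockMapHelper::from(map).into();
    assert_eq!(restored.clocks[&1], 5);
    assert!(restored.snapshot.is_some());
    assert!(!restored.changed); // `changed` is reset on load, never persisted
}
```

Because the snapshot rides along in the same serialized value as the clocks, there is no window where one is flushed without the other.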

@timvisee timvisee changed the title Fix abort stream records breaks wal delta Fix broken WAL delta after stream records abort Dec 16, 2025
This greatly simplifies state handling. It also prevents any kind of
desynchronization because all newest clocks are always persisted
atomically.
@timvisee timvisee force-pushed the fix-abort-stream-records-breaks-wal-delta branch from 97ac0e1 to b67d7dc on December 16, 2025 15:52
@timvisee timvisee marked this pull request as ready for review December 16, 2025 15:54
@timvisee timvisee marked this pull request as draft December 16, 2025 15:54
@timvisee timvisee marked this pull request as ready for review December 16, 2025 15:54
@timvisee timvisee merged commit f225e02 into abort-stream-records-breaks-wal-delta Dec 17, 2025
14 checks passed
@timvisee timvisee deleted the fix-abort-stream-records-breaks-wal-delta branch December 17, 2025 10:29
timvisee added a commit that referenced this pull request Dec 17, 2025
* Make set_replica_state async

* Add function called when active state of local replica changes

* Add snapshot for newest clocks

* Bump newest clocks snapshot on replica deactivation

* Use newest clocks snapshot during recovery

* Add enum for specifying whether to take or clear clocks snapshot

* Store clock snapshot inside clock map, removing extra file

This greatly simplifies state handling. It also prevents any kind of
desynchronization because all newest clocks are always persisted
atomically.

* Immediately persist clocks after taking snapshot

* Always update snapshot, only take if missing

* Take clock snapshots through each shard flavor, including proxies

* Propagate dedicated functions for taking and clearing clocks snapshot

* Only persist clocks immediately if changed on snapshot/clear

* Simplify recovery point logic, always take clocks snapshot if exists

* Remove unwrap
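One commit above adds an enum for specifying whether to take or clear the clocks snapshot on a replica-state change. A hedged sketch of what such a dispatch might look like (the enum and function names are hypothetical, not the PR's actual identifiers):

```rust
/// Hypothetical action passed down through the shard flavors when the
/// local replica's active state changes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SnapshotAction {
    /// Replica left the active state: capture the current clocks.
    Take,
    /// Replica is active (in a good state) again: drop the snapshot.
    Clear,
}

/// Map a replica's new active state to the snapshot action to perform.
fn action_for_state(active: bool) -> SnapshotAction {
    if active {
        SnapshotAction::Clear
    } else {
        SnapshotAction::Take
    }
}

fn main() {
    assert_eq!(action_for_state(false), SnapshotAction::Take);
    assert_eq!(action_for_state(true), SnapshotAction::Clear);
}
```

An enum like this keeps the take/clear decision in one place, so every shard flavor (including proxies) applies the same rule.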
timvisee added a commit that referenced this pull request Dec 17, 2025
* Add test to reproduce broken WAL delta after aborting stream records

* Add staging env var to slow down stream records transfers for test

* Tweak test formatting and utilities a bit

* Add comment to test, link to PR describing bug

* Update test so it still succeeds with patched behavior

* Fix broken WAL delta after stream records abort (#7791)

* Make set_replica_state async

* Add function called when active state of local replica changes

* Add snapshot for newest clocks

* Bump newest clocks snapshot on replica deactivation

* Use newest clocks snapshot during recovery

* Add enum for specifying whether to take or clear clocks snapshot

* Store clock snapshot inside clock map, removing extra file

This greatly simplifies state handling. It also prevents any kind of
desynchronization because all newest clocks are always persisted
atomically.

* Immediately persist clocks after taking snapshot

* Always update snapshot, only take if missing

* Take clock snapshots through each shard flavor, including proxies

* Propagate dedicated functions for taking and clearing clocks snapshot

* Only persist clocks immediately if changed on snapshot/clear

* Simplify recovery point logic, always take clocks snapshot if exists

* Remove unwrap

* Fix typo

* Fix doc comment

* Transfer driver is async, use Tokio sleep

* Reduce visibility
timvisee added a commit that referenced this pull request Dec 18, 2025
@timvisee timvisee mentioned this pull request Dec 19, 2025
