WAL replay honors applied_seq #8008

Merged
IvanPleshkov merged 3 commits into dev from wal-replay-applied-seq on Feb 3, 2026

Conversation

@agourlay (Member) commented Jan 28, 2026

INFO storage::content_manager::toc: Loading collection: benchmark
DEBUG collection::optimizers_builder: Removing temp_segments directory: "./storage/collections/benchmark/0/temp_segments"
DEBUG collection::shards::local_shard: Recovering shard ./storage/collections/benchmark/0 starting reading WAL from 267948 up to 274626 (last_applied_seq:Some(274561))
Recovering collection benchmark [00:00:00] █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 6678/6678 (eta:0s)
INFO collection::shards::local_shard: Loading remaining 774673 WAL entries from:274626 into update queue

How to test

Run Qdrant with QDRANT__STORAGE__UPDATE_QUEUE_SIZE=1000000

PUT collections/benchmark
{
  "vectors": {
    "size": 256,
    "distance": "Dot"
  },
  "optimizers_config": {
    "prevent_unoptimized": true
  }
}
./bfb -n 20M -d 256 -b 1 --skip-create

Wait a bit until there are:

  • a large amount of pending updates in the WAL
  • a large amount of pending updates in the update channel

No timeouts or slowdowns should be observed by bfb.

Then just restart to trigger the WAL replay, which should be very fast because it leverages the last applied_seq.
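The replay-window arithmetic behind the log lines above can be sketched with plain integers. This is an illustrative sketch, not the actual Qdrant code; the function name and signature are hypothetical, and the numbers are taken from the recovery log excerpt:

```rust
// Sketch of the replay-window computation: replay only WAL entries up to
// the persisted applied_seq bound, and enqueue the rest into the update
// queue. All names here are illustrative, not the real Qdrant API.
fn replay_window(first_index: u64, wal_len: u64, applied_upper_bound: Option<u64>) -> (u64, u64, u64) {
    let wal_end = first_index + wal_len; // exclusive end of the WAL
    // Without a persisted bound, fall back to replaying the whole WAL.
    let to = applied_upper_bound.unwrap_or(wal_end).min(wal_end);
    let entries_to_replay = to - first_index;
    let entries_to_enqueue = wal_end - to;
    (to, entries_to_replay, entries_to_enqueue)
}

fn main() {
    // Numbers from the log: WAL starts at 267948, replay bound is 274626,
    // 6678 entries are replayed and 774673 are enqueued.
    let (to, replay, enqueue) = replay_window(267948, 6678 + 774673, Some(274626));
    assert_eq!((to, replay, enqueue), (274626, 6678, 774673));
    println!("replay {replay} entries up to {to}, enqueue {enqueue}");
}
```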

@agourlay agourlay added this to the Update queue milestone Jan 28, 2026
@agourlay agourlay force-pushed the wal-replay-applied-seq branch from f56d1ce to e4883ac Compare January 30, 2026 16:25
@IvanPleshkov (Contributor)

Added a test for this feature.
The test performs uploads into the shard, drops it, reloads it, and checks that the update queue is not empty.
There is a trick inside: the test changes the persisted seq num manually if it is too big (which reproduces every time in the test).

@IvanPleshkov IvanPleshkov marked this pull request as ready for review February 3, 2026 12:16
@IvanPleshkov IvanPleshkov changed the title (WIP) WAL replay honors applied_seq WAL replay honors applied_seq Feb 3, 2026
@coderabbitai coderabbitai bot (Contributor) left a comment
Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@lib/collection/src/shards/local_shard/mod.rs`:
- Around line 684-693: The code treats last_wal_index as an exclusive upper
bound (computed via from + wal.len(false)) but later uses an inclusive range,
producing an off-by-one; update the logic so the upper bound is handled
consistently: either make last_wal_index the inclusive last index (compute as
from + wal.len(false) - 1) or keep it exclusive and change any inclusive ranges
to exclusive (e.g., use to..last_wal_index instead of to..=last_wal_index);
adjust/remove any matching off-by-one arithmetic around wal_entries_to_replay,
from, last_wal_index, and the enqueue loop (also in the similar block around the
808-827 region) so the logged count matches the actual enqueued entries.

In `@lib/collection/src/update_workers/applied_seq.rs`:
- Around line 54-63: The op_num_upper_bound method can overflow when adding
APPLIED_SEQ_SAVE_INTERVAL + 1 to self.op_num (u64); update the calculation in
op_num_upper_bound to use saturating_add on the loaded op_num (e.g., call
self.op_num.load(...).saturating_add(...)) so the adjusted value never wraps,
while keeping the existing file.is_some() check and returning Some(adjusted) or
None as before.
🧹 Nitpick comments (1)
lib/collection/src/tests/wal_recovery_test.rs (1)

384-395: Consider a longer queue-drain timeout to reduce flakes.

A 2s limit can be tight on slower CI; consider increasing it (e.g., 10s) or wrapping the loop in tokio::time::timeout for clearer failure behavior.

Proposed tweak
-    let timeout = std::time::Duration::from_secs(2);
+    let timeout = std::time::Duration::from_secs(10);

Comment on lines +684 to +693

let from = wal.first_index();
let last_wal_index = from + wal.len(false);
let to = self
    .applied_seq_handler
    .op_num_upper_bound()
    .unwrap_or(last_wal_index);
let to = std::cmp::min(to, last_wal_index);
let wal_entries_to_replay = to - from;


⚠️ Potential issue | 🟠 Major

Fix off-by-one when enqueueing remaining WAL entries.

last_wal_index = from + wal.len(false) is the exclusive upper bound. Using to..=last_wal_index enqueues one extra op_num and makes the logged count inconsistent with the actual number of entries.

Proposed fix (use exclusive end)
-        let last_wal_index = from + wal.len(false);
+        let wal_end_exclusive = from + wal.len(false);
         let to = self
             .applied_seq_handler
             .op_num_upper_bound()
-            .unwrap_or(last_wal_index);
-        let to = std::cmp::min(to, last_wal_index);
+            .unwrap_or(wal_end_exclusive);
+        let to = std::cmp::min(to, wal_end_exclusive);
         let wal_entries_to_replay = to - from;
@@
-        if to < last_wal_index {
+        if to < wal_end_exclusive {
             log::info!(
                 "Loading remaining {} WAL entries from:{to} into update queue",
-                last_wal_index - to
+                wal_end_exclusive - to
             );
             let update_sender = self.update_sender.load();
             // TODO use proper collection's hardware measurement
             let hw_measurements = HwMeasurementAcc::disposable();
-            for op_num in to..=last_wal_index {
+            for op_num in to..wal_end_exclusive {
                 update_sender
                     .send(UpdateSignal::Operation(OperationData {
                         op_num,
                         operation: None,
                         sender: None,
                         hw_measurements: hw_measurements.clone(),
                     }))
                     .await?;
             }
         }

Also applies to: 808-827
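The off-by-one flagged above comes down to mixing an exclusive bound with an inclusive range. A minimal standalone demonstration (values are illustrative, not from the PR):

```rust
// `end` computed as start + len is an exclusive bound; iterating with
// ..= (inclusive) visits one op_num too many, while .. (exclusive)
// visits exactly `len` entries.
fn main() {
    let start: u64 = 10;
    let len: u64 = 3; // entries 10, 11, 12
    let end = start + len; // 13, exclusive upper bound

    let inclusive: Vec<u64> = (start..=end).collect(); // buggy: 4 entries
    let exclusive: Vec<u64> = (start..end).collect(); // correct: 3 entries

    assert_eq!(inclusive.len() as u64, len + 1); // one extra op_num
    assert_eq!(exclusive, vec![10, 11, 12]);
    println!("inclusive={} exclusive={}", inclusive.len(), exclusive.len());
}
```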


Comment on lines +54 to +63
/// Get the op_num upper bound for the last_applied_seq adjusted to the persistence interval
///
/// Returns None if the handler is not active.
pub fn op_num_upper_bound(&self) -> Option<u64> {
    if self.file.is_some() {
        let adjusted = self.op_num.load(Ordering::Relaxed) + APPLIED_SEQ_SAVE_INTERVAL + 1;
        Some(adjusted)
    } else {
        None
    }
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the file
cat -n lib/collection/src/update_workers/applied_seq.rs | head -100

🏁 Script executed:

# Check how op_num_upper_bound is used and if there are similar patterns
rg -A 3 -B 3 "op_num_upper_bound" lib/collection/src/

🏁 Script executed:

# Check if there are other arithmetic operations on atomic u64s that might overflow
rg "\.load\(.*Ordering" lib/collection/src/update_workers/ | head -20

🏁 Script executed:

# Check the context around APPLIED_SEQ_SAVE_INTERVAL usage
rg -B 5 -A 5 "APPLIED_SEQ_SAVE_INTERVAL" lib/collection/src/

🏁 Script executed:

# Verify the method is new and check for any tests that might exercise edge cases
git diff HEAD~1 lib/collection/src/update_workers/applied_seq.rs | head -80


Use saturating_add to prevent u64 overflow in op_num_upper_bound.

Adding 65 to an arbitrary op_num value can wrap, which would violate the upper bound contract (result should be ≥ current value). Use saturating addition to preserve correctness.

Proposed fix
-        let adjusted = self.op_num.load(Ordering::Relaxed) + APPLIED_SEQ_SAVE_INTERVAL + 1;
+        let adjusted = self
+            .op_num
+            .load(Ordering::Relaxed)
+            .saturating_add(APPLIED_SEQ_SAVE_INTERVAL + 1);
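The wrap the review describes is easy to show in isolation. In the sketch below, the APPLIED_SEQ_SAVE_INTERVAL value of 64 is an assumption inferred from the "+65" in the comment, not read from the source:

```rust
// Shows why saturating_add matters near u64::MAX: plain addition wraps
// and the "upper bound" ends up smaller than the current value.
const APPLIED_SEQ_SAVE_INTERVAL: u64 = 64; // assumed value, see lead-in

fn main() {
    let op_num = u64::MAX - 10;
    // Wrapping addition (what `+` does in release builds): wraps past zero.
    let wrapped = op_num.wrapping_add(APPLIED_SEQ_SAVE_INTERVAL + 1);
    // Saturating addition: clamps at u64::MAX, preserving the contract
    // that the bound is >= the current value.
    let saturated = op_num.saturating_add(APPLIED_SEQ_SAVE_INTERVAL + 1);

    assert!(wrapped < op_num); // the bug: the bound went backwards
    assert_eq!(saturated, u64::MAX);
    println!("wrapped={wrapped} saturated={saturated}");
}
```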

@timvisee timvisee mentioned this pull request Feb 3, 2026
@qdrant qdrant deleted a comment from coderabbitai bot Feb 3, 2026
@IvanPleshkov IvanPleshkov force-pushed the wal-replay-applied-seq branch from 79b7d04 to b4be85b Compare February 3, 2026 17:42
@coderabbitai coderabbitai bot (Contributor) commented Feb 3, 2026

📝 Walkthrough

The pull request refactors WAL recovery in the local shard to use dynamic range-based replay instead of reading the whole WAL. It computes a replay window from the first WAL index to a bound derived from the applied sequence state, reads only that range via a new read_range() method on SerdeWal, and queues any remaining WAL entries beyond the window as UpdateSignal operations. Supporting changes extend AppliedSeqHandler with an upper-bound query method and a test-only persistence helper, and add a test validating that pending WAL operations are enqueued on recovery.
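The range-based replay the walkthrough describes can be sketched as follows. This is a hypothetical stand-in for SerdeWal::read_range(), not the real implementation; the struct and its fields are illustrative:

```rust
// Illustrative sketch: read only [from, to) from the WAL, then enqueue
// the remaining op_nums without loading their payloads.
struct Wal {
    first: u64,
    entries: Vec<&'static str>,
}

impl Wal {
    // Stand-in for a read_range()-style method: returns (op_num, entry)
    // pairs for op_nums inside the given half-open range.
    fn read_range(&self, range: std::ops::Range<u64>) -> Vec<(u64, &'static str)> {
        self.entries
            .iter()
            .enumerate()
            .map(|(i, &e)| (self.first + i as u64, e))
            .filter(|(n, _)| range.contains(n))
            .collect()
    }
}

fn main() {
    let wal = Wal { first: 100, entries: vec!["a", "b", "c", "d", "e"] };
    let to = 103; // bound derived from applied_seq state

    // Replay only the window below the bound.
    let replayed: Vec<u64> = wal.read_range(wal.first..to).into_iter().map(|(n, _)| n).collect();
    // Enqueue the rest as bare op_nums (operation payloads loaded later).
    let enqueued: Vec<u64> = (to..wal.first + wal.entries.len() as u64).collect();

    assert_eq!(replayed, vec![100, 101, 102]);
    assert_eq!(enqueued, vec![103, 104]);
    println!("replayed={replayed:?} enqueued={enqueued:?}");
}
```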

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested reviewers

  • timvisee
  • generall
🚥 Pre-merge checks: ✅ 3 passed

  • Title check — ✅ Passed: The title 'WAL replay honors applied_seq' directly and precisely describes the main change: modifying WAL replay behavior to respect the applied_seq value.
  • Description check — ✅ Passed: The description provides concrete logs, testing instructions, and expected behavior directly related to the changeset's core functionality of optimizing WAL replay using applied_seq.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


@IvanPleshkov IvanPleshkov merged commit 52e995d into dev Feb 3, 2026
15 checks passed
@IvanPleshkov IvanPleshkov deleted the wal-replay-applied-seq branch February 3, 2026 19:48
generall pushed a commit that referenced this pull request Feb 9, 2026
* WAL replay honors applied_seq

fixes

* add integration test

* restore test after rebase

---------

Co-authored-by: Ivan Pleshkov <pleshkov.ivan@gmail.com>