Conversation

Force-pushed f56d1ce to e4883ac (Compare)

Added a test for this feature.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lib/collection/src/shards/local_shard/mod.rs`:
- Around line 684-693: The code treats last_wal_index as an exclusive upper
bound (computed via from + wal.len(false)) but later uses an inclusive range,
producing an off-by-one; update the logic so the upper bound is handled
consistently: either make last_wal_index the inclusive last index (compute as
from + wal.len(false) - 1) or keep it exclusive and change any inclusive ranges
to exclusive (e.g., use to..last_wal_index instead of to..=last_wal_index);
adjust/remove any matching off-by-one arithmetic around wal_entries_to_replay,
from, last_wal_index, and the enqueue loop (also in the similar block around the
808-827 region) so the logged count matches the actual enqueued entries.
In `@lib/collection/src/update_workers/applied_seq.rs`:
- Around line 54-63: The op_num_upper_bound method can overflow when adding
APPLIED_SEQ_SAVE_INTERVAL + 1 to self.op_num (u64); update the calculation in
op_num_upper_bound to use saturating_add on the loaded op_num (e.g., call
self.op_num.load(...).saturating_add(...)) so the adjusted value never wraps,
while keeping the existing file.is_some() check and returning Some(adjusted) or
None as before.
🧹 Nitpick comments (1)
lib/collection/src/tests/wal_recovery_test.rs (1)
384-395: Consider a longer queue-drain timeout to reduce flakes. A 2s limit can be tight on slower CI; consider increasing it (e.g., 10s) or wrapping the loop in `tokio::time::timeout` for clearer failure behavior.

Proposed tweak:

```diff
- let timeout = std::time::Duration::from_secs(2);
+ let timeout = std::time::Duration::from_secs(10);
```
```rust
let from = wal.first_index();
let last_wal_index = from + wal.len(false);
let to = self
    .applied_seq_handler
    .op_num_upper_bound()
    .unwrap_or(last_wal_index);
let to = std::cmp::min(to, last_wal_index);
let wal_entries_to_replay = to - from;
```
Fix off-by-one when enqueueing remaining WAL entries.
`last_wal_index = from + wal.len(false)` is the exclusive upper bound. Using `to..=last_wal_index` enqueues one extra op_num and makes the logged count inconsistent with the actual number of entries.
Proposed fix (use exclusive end):

```diff
-let last_wal_index = from + wal.len(false);
+let wal_end_exclusive = from + wal.len(false);
 let to = self
     .applied_seq_handler
     .op_num_upper_bound()
-    .unwrap_or(last_wal_index);
-let to = std::cmp::min(to, last_wal_index);
+    .unwrap_or(wal_end_exclusive);
+let to = std::cmp::min(to, wal_end_exclusive);
 let wal_entries_to_replay = to - from;
@@
-if to < last_wal_index {
+if to < wal_end_exclusive {
     log::info!(
         "Loading remaining {} WAL entries from:{to} into update queue",
-        last_wal_index - to
+        wal_end_exclusive - to
     );
     let update_sender = self.update_sender.load();
     // TODO use proper collection's hardware measurement
     let hw_measurements = HwMeasurementAcc::disposable();
-    for op_num in to..=last_wal_index {
+    for op_num in to..wal_end_exclusive {
         update_sender
             .send(UpdateSignal::Operation(OperationData {
                 op_num,
                 operation: None,
                 sender: None,
                 hw_measurements: hw_measurements.clone(),
             }))
             .await?;
     }
 }
```

Also applies to: 808-827
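A minimal standalone check of the range semantics behind this fix (`from` and `wal_len` are illustrative values, not taken from the shard code):

```rust
fn main() {
    let from: u64 = 100;
    let wal_len: u64 = 5; // stand-in for wal.len(false)
    let wal_end_exclusive = from + wal_len;

    // The exclusive range visits exactly `wal_len` op_nums: 100, 101, 102, 103, 104.
    assert_eq!((from..wal_end_exclusive).count() as u64, wal_len);

    // The inclusive range visits one extra op_num (105), one past the WAL end,
    // which is the off-by-one the review flags.
    assert_eq!((from..=wal_end_exclusive).count() as u64, wal_len + 1);
}
```

With the exclusive range, `wal_entries_to_replay = to - from` and the loop's iteration count stay in agreement by construction.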
```rust
/// Get the op_num upper bound for the last_applied_seq adjusted to the persistence interval
///
/// Returns None if the handler is not active.
pub fn op_num_upper_bound(&self) -> Option<u64> {
    if self.file.is_some() {
        let adjusted = self.op_num.load(Ordering::Relaxed) + APPLIED_SEQ_SAVE_INTERVAL + 1;
        Some(adjusted)
    } else {
        None
    }
}
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate and read the file
cat -n lib/collection/src/update_workers/applied_seq.rs | head -100
```

Repository: qdrant/qdrant
Length of output: 4049

🏁 Script executed:

```shell
# Check how op_num_upper_bound is used and if there are similar patterns
rg -A 3 -B 3 "op_num_upper_bound" lib/collection/src/
```

Repository: qdrant/qdrant
Length of output: 1363

🏁 Script executed:

```shell
# Check if there are other arithmetic operations on atomic u64s that might overflow
rg "\.load\(.*Ordering" lib/collection/src/update_workers/ | head -20
```

Repository: qdrant/qdrant
Length of output: 680

🏁 Script executed:

```shell
# Check the context around APPLIED_SEQ_SAVE_INTERVAL usage
rg -B 5 -A 5 "APPLIED_SEQ_SAVE_INTERVAL" lib/collection/src/
```

Repository: qdrant/qdrant
Length of output: 9083

🏁 Script executed:

```shell
# Verify the method is new and check for any tests that might exercise edge cases
git diff HEAD~1 lib/collection/src/update_workers/applied_seq.rs | head -80
```

Repository: qdrant/qdrant
Length of output: 243
Use `saturating_add` to prevent u64 overflow in `op_num_upper_bound`.

Adding 65 to an arbitrary op_num value can wrap, which would violate the upper bound contract (the result should be ≥ the current value). Use saturating addition to preserve correctness.

Proposed fix:

```diff
-        let adjusted = self.op_num.load(Ordering::Relaxed) + APPLIED_SEQ_SAVE_INTERVAL + 1;
+        let adjusted = self
+            .op_num
+            .load(Ordering::Relaxed)
+            .saturating_add(APPLIED_SEQ_SAVE_INTERVAL + 1);
```
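A small demonstration of the overflow behavior in question (the constant's value of 64 is an assumption inferred from the review's "adding 65"):

```rust
fn main() {
    // Assumed value: the review's "adding 65" implies interval + 1 = 65.
    const APPLIED_SEQ_SAVE_INTERVAL: u64 = 64;

    // Near u64::MAX, plain `+` would panic in debug builds or wrap in release;
    // saturating_add clamps at u64::MAX, keeping the bound >= the current value.
    let op_num = u64::MAX - 10;
    assert_eq!(op_num.saturating_add(APPLIED_SEQ_SAVE_INTERVAL + 1), u64::MAX);

    // For ordinary values the result matches plain addition.
    assert_eq!(1000u64.saturating_add(APPLIED_SEQ_SAVE_INTERVAL + 1), 1065);
}
```

Clamping is acceptable here because the value is only used as an upper bound on a replay window, never as an exact count.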
Force-pushed 79b7d04 to b4be85b (Compare)
📝 Walkthrough

The pull request refactors WAL recovery in local shard to use dynamic range-based replay instead of fixed whole-WAL reading. It computes a replay window from the first WAL index to a bound derived from applied sequence state, and reads only that range via a new …

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks: ✅ Passed checks (3 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
* WAL replay honors applied_seq fixes
* add integration test
* restore test after rebase

---------

Co-authored-by: Ivan Pleshkov <pleshkov.ivan@gmail.com>
How to test
Run Qdrant with:

`QDRANT__STORAGE__UPDATE_QUEUE_SIZE=1000000`

Wait a bit so it results in: …

Then just restart to trigger the WAL replay, which should be very fast by leveraging the last applied_seq.