extend WAL retention if some other replicas are dead #7834
Conversation
📝 Walkthrough
Dependency revision for the
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
Cargo.toml (1)
286-286: Consider re-pinning `wal` to a specific commit or release once the feature stabilizes

Switching the workspace `wal` dependency to a git branch (`allow-set-retention`) makes builds depend on the moving branch head. That's fine while iterating on wal PR #105, but before shipping a release it would be safer to pin back to a fixed `rev` or published version for reproducible builds and less surprising upgrades.

lib/collection/src/shards/shard.rs (1)
371-390: Shard-level WAL retention dispatch is correct; consider documenting the extended variant

The new `set_extended_wal_retention` / `set_normal_wal_retention` methods correctly cover all non-dummy variants and no-op on `Dummy`, matching how other Shard helpers are dispatched. This keeps callers oblivious to the concrete shard type while still controlling WAL retention.

You might optionally add a brief doc comment on `set_extended_wal_retention` mirroring the LocalShard docs (similar to the comment on `set_normal_wal_retention`) for consistency in the public API surface.
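The dispatch pattern described above can be sketched roughly as follows. Variant and field names here are illustrative, and the sketch is synchronous; the real methods are async and delegate to the wrapped shard implementations:

```rust
// Simplified sketch of Shard-level WAL retention dispatch (illustrative names).
enum Shard {
    Local(LocalShard),
    Proxy(LocalShard),
    ForwardProxy(LocalShard),
    QueueProxy(LocalShard),
    Dummy,
}

struct LocalShard {
    extended_retention: bool,
}

impl Shard {
    fn set_extended_wal_retention(&mut self) {
        // Explicit exhaustive match: adding a new variant later forces a
        // compile error here instead of silently falling into a catch-all.
        match self {
            Shard::Local(s)
            | Shard::Proxy(s)
            | Shard::ForwardProxy(s)
            | Shard::QueueProxy(s) => s.extended_retention = true,
            // No-op on Dummy, matching how other Shard helpers are dispatched.
            Shard::Dummy => {}
        }
    }
}
```

Callers stay oblivious to the concrete shard type: they invoke the method on `Shard` and the match routes it.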
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- Cargo.toml
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/wal_delta.rs
- lib/shard/src/wal.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over a catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/wal_delta.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/shard/src/wal.rs
🧠 Learnings (6)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7388
File: lib/shard/src/segment_holder/flush.rs:167-191
Timestamp: 2025-10-12T22:22:11.435Z
Learning: In Qdrant's SegmentHolder flush logic, when calculating the minimum unsaved version for WAL acknowledgment, the code intentionally returns `segment_persisted_version` (not `segment_persisted_version + 1`) when there are unsaved changes. This conservative approach assumes the last persisted version might not have been fully applied and allows the system to safely retry operations when versions match, ensuring WAL consistency.
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
Repo: qdrant/qdrant PR: 6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Applied to files:
lib/collection/src/shards/local_shard/mod.rs
📚 Learning: 2025-11-14T10:01:31.926Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7526
File: src/wal_pop.rs:7-8
Timestamp: 2025-11-14T10:01:31.926Z
Learning: In the Qdrant codebase, development binaries (like src/wal_pop.rs) don't require explicit bounds checking for command-line arguments. Rust's built-in out-of-bounds panic is acceptable for these development tools.
Applied to files:
lib/collection/src/shards/local_shard/mod.rs
Cargo.toml
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
Applied to files:
lib/collection/src/shards/replica_set/replica_set_state.rs
📚 Learning: 2025-09-01T11:42:06.964Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
Applied to files:
lib/shard/src/wal.rs
📚 Learning: 2025-10-13T22:58:03.121Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7400
File: lib/segment/src/id_tracker/simple_id_tracker.rs:234-241
Timestamp: 2025-10-13T22:58:03.121Z
Learning: SimpleIdTracker in lib/segment/src/id_tracker/simple_id_tracker.rs is being deprecated and should not receive fixes related to version tracking or recovery logic, as it has a different version storage structure that is incompatible with newer trackers.
Applied to files:
Cargo.toml
🧬 Code graph analysis (6)
lib/collection/src/shards/queue_proxy_shard.rs (3)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/shards/local_shard/wal_ops.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/wal_delta.rs (1)
- lib/shard/src/wal.rs (2): set_extended_retention (218-222), set_normal_retention (224-227)
lib/collection/src/shards/shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)

lib/collection/src/shards/replica_set/mod.rs (5)
- lib/storage/src/content_manager/consensus/persistent.rs (2): state (60-62), this_peer_id (321-323)
- lib/collection/src/shards/transfer/mod.rs (1): this_peer_id (188-188)
- lib/storage/src/content_manager/toc/mod.rs (1): this_peer_id (621-623)
- lib/storage/src/content_manager/toc/transfer.rs (1): this_peer_id (17-19)
- lib/storage/src/content_manager/consensus_manager.rs (1): this_peer_id (205-207)

lib/shard/src/wal.rs (1)
- lib/collection/src/wal_delta.rs (2): set_extended_retention (174-176), set_normal_retention (178-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Test Python bindings
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: e2e-tests
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
🔇 Additional comments (11)
lib/collection/src/shards/local_shard/mod.rs (1)
20-20: `wal_ops` module wiring for `LocalShard` looks correct

Adding `mod wal_ops;` keeps WAL-specific helpers in a focused internal module and integrates cleanly with the existing local shard layout. No issues from a visibility or organization standpoint.

lib/collection/src/shards/replica_set/replica_set_state.rs (1)

210-224: `requires_recovery` only flags `Dead`; confirm this matches recovery/retention policy

`requires_recovery()` currently returns `true` only for `ReplicaState::Dead` and `false` for all other variants, including `Recovery`, `PartialSnapshot`, and `ActiveRead`. That matches the PR intent ("extend WAL retention if some other replicas are dead"), but it also means flapping or partially recovered replicas won't trigger the extended WAL retention path via this helper.

If any of the partial/transfer states should also cause automatic recovery scheduling or prolonged WAL retention, this is the place to include them; otherwise this implementation is consistent, and the explicit match over all variants is in line with our no-catch-all `_` guideline. Based on learnings, treating `ActiveRead` as unhealthy but not necessarily "requires recovery" is also consistent with existing `is_healthy()` semantics.

lib/collection/src/shards/local_shard/wal_ops.rs (1)
1-14: Local shard WAL retention helpers delegate cleanly to `RecoverableWal`

Both `set_extended_wal_retention` and `set_normal_wal_retention` are thin async wrappers over the underlying `RecoverableWal` retention controls, without touching clock maps or adding new failure modes. The docs succinctly describe the intended behavior; no issues spotted.

lib/collection/src/shards/forward_proxy_shard.rs (1)
399-405: ForwardProxyShard correctly forwards WAL retention control to its wrapped LocalShard

Both `set_extended_wal_retention` and `set_normal_wal_retention` simply delegate to `wrapped_shard`, consistent with how other operations are forwarded through `ForwardProxyShard` and with how `Shard::set_*_wal_retention` routes calls. No issues found.

lib/collection/src/shards/queue_proxy_shard.rs (1)
273-283: QueueProxyShard WAL retention methods are lifecycle-safe

`set_extended_wal_retention` and `set_normal_wal_retention` both guard on `self.inner` before delegating to the wrapped `LocalShard`, so calling them after `forget_updates_and_finalize()` becomes a no-op instead of panicking. That's consistent with the existing Drop invariant around finalized queue proxies and avoids introducing new edge-case crashes.

lib/collection/src/wal_delta.rs (1)
174-180: RecoverableWal retention setters are minimal and maintain safe lock ordering

`set_extended_retention` / `set_normal_retention` just acquire the inner `SerdeWal` mutex and call its corresponding setters (see lib/shard/src/wal.rs), without interacting with `newest_clocks` or `oldest_clocks`. This gives LocalShard and shard wrappers a clean, infallible hook for adjusting WAL retention without introducing new lock-ordering risks.

lib/collection/src/shards/proxy_shard.rs (1)
185-191: LGTM! Clean delegation pattern.

The WAL retention control methods follow the established delegation pattern in this proxy wrapper, correctly forwarding calls to the wrapped shard.
lib/shard/src/wal.rs (2)
29-31: Constant aligns with PR objectives.

The 10× retention factor matches the PR description's intent to extend WAL retention when dead replicas exist.
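As a rough illustration of what a 10× retention factor means in practice, the sketch below applies it to a retained-segment count. The constant name, struct, and numbers are illustrative, not the actual lib/shard/src/wal.rs code, which calls `set_retention` on the underlying WAL:

```rust
// Illustrative sketch: extended retention keeps 10x more closed WAL segments.
const EXTENDED_RETENTION_FACTOR: u64 = 10;

struct WalRetention {
    normal_retained_segments: u64,
    retained_segments: u64,
}

impl WalRetention {
    fn set_extended_retention(&mut self) {
        // While some replica is dead, keep 10x the normal segment count so a
        // recovering replica can catch up via WAL delta instead of streaming.
        self.retained_segments =
            self.normal_retained_segments * EXTENDED_RETENTION_FACTOR;
    }

    fn set_normal_retention(&mut self) {
        // Once all replicas are healthy again, fall back to normal retention.
        self.retained_segments = self.normal_retained_segments;
    }
}
```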
218-227: Dependency configuration verified: The WAL dependency on branch "allow-set-retention" is correctly configured in the workspace root Cargo.toml and is actively in use (confirmed in Cargo.lock with commit 416f31a7). The `set_retention` API is available and properly called in both `set_extended_retention` and `set_normal_retention` methods.
697-699: Verify consistency: remote state changes now trigger WAL retention updates.

The addition of `on_remote_state_updated` calls when remote peer states change ensures WAL retention is adjusted appropriately. This correctly implements the feature when remote states transition.

However, verify that local state transitions also properly evaluate remote peer states. See the comment on `on_local_state_updated` for potential coordination concerns.

Also applies to: 769-771
797-824: Remote state evaluation logic is correct.

The `on_remote_state_updated` method correctly evaluates the cluster state: it activates extended WAL retention when the local peer is active and any remote peer requires recovery, and uses normal retention otherwise.
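A minimal sketch of that evaluation rule, with an illustrative subset of state names; the real method is async and reads the replica set state under a lock:

```rust
// Illustrative sketch: extended WAL retention iff the local replica is active
// and at least one remote peer requires recovery.
#[derive(Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
    Recovery,
}

impl ReplicaState {
    fn requires_recovery(self) -> bool {
        // Only Dead triggers extended retention in this sketch,
        // mirroring the helper discussed in the review.
        matches!(self, ReplicaState::Dead)
    }
}

fn wants_extended_retention(local: ReplicaState, remotes: &[ReplicaState]) -> bool {
    local == ReplicaState::Active && remotes.iter().any(|s| s.requires_recovery())
}
```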
WAL PR should be merged first!
KShivendu left a comment
Happy to see qdrant/wal#93 finally being used :)
Will rebase, fix CI, and merge
f1c73a6 to 9126cb1
Closing #7032 in favour of this PR.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @lib/collection/src/shards/replica_set/mod.rs:
- Around line 790-791: The call to local_shard.set_normal_wal_retention() in
on_local_state_updated unconditionally resets WAL retention without considering
remote peer states; update on_local_state_updated to re-evaluate cluster
retention policy instead of blindly resetting — either remove the call so only
on_remote_state_updated controls retention, or call a shared evaluator (e.g., a
new or existing function like evaluate_and_set_wal_retention or reuse
on_remote_state_updated’s logic) that inspects both local state and remote peer
states and sets normal or extended retention accordingly; ensure to reference
local_shard.set_normal_wal_retention() and whatever evaluator you introduce so
retention reflects the full cluster state after any local state change.
🧹 Nitpick comments (2)
lib/collection/src/shards/shard.rs (1)
371-391: LGTM! Consider adding a doc comment to `set_extended_wal_retention` for symmetry.

The implementation correctly delegates to each shard variant with appropriate no-ops for `Dummy`. The explicit exhaustive matching aligns with the coding guidelines for this codebase.

Minor: `set_normal_wal_retention` has a doc comment (line 381) but `set_extended_wal_retention` does not. Consider adding one for consistency, e.g., "WAL is keeping extended amount of data after truncation for recovery purposes."

lib/collection/src/shards/replica_set/mod.rs (1)

797-824: Simplify function signature by removing unused parameters.

The function parameters `_peer_id` and `_new_state` are not used; the implementation re-reads the entire peer state from `replica_state.read().peers()` instead. Since this is a private method, update the signature to `async fn on_remote_state_updated(&self)` and remove the parameters from both call sites.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- Cargo.toml
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/wal_delta.rs
- lib/shard/src/wal.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- lib/collection/src/shards/local_shard/mod.rs
- lib/shard/src/wal.rs
- lib/collection/src/shards/queue_proxy_shard.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over a catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/wal_delta.rs
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
🧠 Learnings (3)
📚 Learning: 2025-11-14T10:01:31.926Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7526
File: src/wal_pop.rs:7-8
Timestamp: 2025-11-14T10:01:31.926Z
Learning: In the Qdrant codebase, development binaries (like src/wal_pop.rs) don't require explicit bounds checking for command-line arguments. Rust's built-in out-of-bounds panic is acceptable for these development tools.
Applied to files:
Cargo.toml
📚 Learning: 2025-10-13T22:58:03.121Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7400
File: lib/segment/src/id_tracker/simple_id_tracker.rs:234-241
Timestamp: 2025-10-13T22:58:03.121Z
Learning: SimpleIdTracker in lib/segment/src/id_tracker/simple_id_tracker.rs is being deprecated and should not receive fixes related to version tracking or recovery logic, as it has a different version storage structure that is incompatible with newer trackers.
Applied to files:
Cargo.toml
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
Applied to files:
lib/collection/src/shards/replica_set/replica_set_state.rs
🧬 Code graph analysis (5)
lib/collection/src/shards/local_shard/wal_ops.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/wal_delta.rs (1)
- lib/shard/src/wal.rs (2): set_extended_retention (218-222), set_normal_retention (224-227)
lib/collection/src/shards/forward_proxy_shard.rs (4)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/shards/shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)

lib/collection/src/shards/proxy_shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Test Python bindings
- GitHub Check: lint
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (8)
lib/collection/src/wal_delta.rs (1)
174-180: LGTM!

The WAL retention control methods are correctly implemented as simple pass-throughs to the underlying `SerdeWal` implementation, with appropriate async locking.

lib/collection/src/shards/replica_set/replica_set_state.rs (1)
227-242: LGTM!

The `requires_recovery` method correctly identifies only the `Dead` state as requiring automatic recovery scheduling. The exhaustive matching ensures all current and future replica states are explicitly handled.
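A compact sketch of such an exhaustive match follows. The variant list is an illustrative subset (the real ReplicaState has more variants, including ManualRecovery); the point is that there is no catch-all `_` arm, so adding a variant forces this match to be revisited:

```rust
// Illustrative subset of replica states; only Dead requires recovery.
enum ReplicaState {
    Active,
    ActiveRead,
    Dead,
    Partial,
    PartialSnapshot,
    Recovery,
}

impl ReplicaState {
    fn requires_recovery(&self) -> bool {
        // Exhaustive match with no `_` arm: a new variant added later will
        // fail to compile here until it is explicitly classified.
        match self {
            ReplicaState::Dead => true,
            ReplicaState::Active
            | ReplicaState::ActiveRead
            | ReplicaState::Partial
            | ReplicaState::PartialSnapshot
            | ReplicaState::Recovery => false,
        }
    }
}
```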
1-14: LGTM!

The WAL retention control methods are correctly implemented with clear documentation. The delegation to the underlying `RecoverableWal` follows the established pattern throughout the codebase.

lib/collection/src/shards/forward_proxy_shard.rs (1)
399-405: LGTM!

The WAL retention methods correctly delegate to the wrapped shard, following the established proxy pattern in this file. The implementation is consistent with other proxy layers in the codebase.
Cargo.toml (1)
288-288: WAL dependency revision verified and merged.

The commit hash `2209eb2c4a359d9c32f63d868c580df360efa4a9` from WAL PR #105 exists in the repository and is already merged. The dependency update is safe to proceed.

lib/collection/src/shards/replica_set/mod.rs (2)
697-699: LGTM!

Correctly routes remote peer state changes to `on_remote_state_updated` for WAL retention evaluation.

769-771: LGTM!

Consistent with the `add_remote` change: remote state updates now properly trigger WAL retention evaluation.
184-191: LGTM!

Clean delegation to `wrapped_shard`, following the established proxy pattern. Consistent with the parallel implementations in `ForwardProxyShard` and `QueueProxyShard`.
// Reset WAL retention to normal whenever local shard changes state
local_shard.set_normal_wal_retention().await;
Logic gap: local state change doesn't re-evaluate remote peer states for retention.
When the local peer transitions to Active, this code resets WAL retention to normal. However, if remote peers are Dead or require recovery, extended retention should be active instead. The current implementation doesn't check remote states after a local state change.
Scenario:
- Local is `Dead`, remote is `Dead` → normal retention
- Local transitions to `Active` → `on_local_state_updated` resets to normal
- Remote is still `Dead` → extended retention should be active, but it isn't
Consider either removing the retention reset here (let on_remote_state_updated be the sole arbiter), or re-evaluating retention based on the full cluster state after any local state change.
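A minimal sketch of the shared-evaluator option, assuming a hypothetical `evaluate_and_set_wal_retention` helper; all names are illustrative, the real methods are async, and only two states are modeled for brevity:

```rust
// Illustrative sketch: local state changes re-evaluate the full cluster state
// instead of blindly resetting WAL retention to normal.
#[derive(Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
}

struct ReplicaSet {
    local: ReplicaState,
    remotes: Vec<ReplicaState>,
    extended_retention: bool,
}

impl ReplicaSet {
    // Hypothetical shared evaluator: single source of truth for retention.
    fn evaluate_and_set_wal_retention(&mut self) {
        let remote_needs_recovery =
            self.remotes.iter().any(|s| *s == ReplicaState::Dead);
        // Extended retention whenever the local shard is active and any
        // remote is dead, even right after a local state change. This closes
        // the gap in the scenario above.
        self.extended_retention =
            self.local == ReplicaState::Active && remote_needs_recovery;
    }

    fn on_local_state_updated(&mut self, new_state: ReplicaState) {
        self.local = new_state;
        // Re-evaluate instead of unconditionally calling set_normal_wal_retention.
        self.evaluate_and_set_wal_retention();
    }
}
```

With this shape, `on_remote_state_updated` would call the same evaluator, so both hooks agree on the cluster-wide retention decision.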
🤖 Prompt for AI Agents
In @lib/collection/src/shards/replica_set/mod.rs around lines 790 - 791, The
call to local_shard.set_normal_wal_retention() in on_local_state_updated
unconditionally resets WAL retention without considering remote peer states;
update on_local_state_updated to re-evaluate cluster retention policy instead of
blindly resetting — either remove the call so only on_remote_state_updated
controls retention, or call a shared evaluator (e.g., a new or existing function
like evaluate_and_set_wal_retention or reuse on_remote_state_updated’s logic)
that inspects both local state and remote peer states and sets normal or
extended retention accordingly; ensure to reference
local_shard.set_normal_wal_retention() and whatever evaluator you introduce so
retention reflects the full cluster state after any local state change.
* extend WAL retention if some other replicas are dead
* use merged WAL commit
* Handle ManualRecovery state
---------
Co-authored-by: KShivendu <kshivendu1@gmail.com>
Make the WAL retain 10× more segments if there are dead replicas in the cluster.
In theory, this should prevent shard recovery from falling back to streaming under high load.
Depends on qdrant/wal#105