extend WAL retention if some other replicas are dead #7834
Conversation
📝 Walkthrough
Dependency revision for the
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
Cargo.toml (1)
286-286: Consider re-pinning `wal` to a specific commit or release once the feature stabilizes

Switching the workspace `wal` dependency to a git branch (`allow-set-retention`) makes builds depend on the moving branch head. That's fine while iterating on wal PR #105, but before shipping a release it would be safer to pin back to a fixed `rev` or published version for reproducible builds and less surprising upgrades.

lib/collection/src/shards/shard.rs (1)
371-390: Shard-level WAL retention dispatch is correct; consider documenting the extended variant

The new `set_extended_wal_retention` / `set_normal_wal_retention` methods correctly cover all non-dummy variants and no-op on `Dummy`, matching how other Shard helpers are dispatched. This keeps callers oblivious to the concrete shard type while still controlling WAL retention.

You might optionally add a brief doc comment on `set_extended_wal_retention` mirroring the LocalShard docs (similar to the comment on `set_normal_wal_retention`) for consistency in the public API surface.
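The dispatch pattern described above can be sketched roughly as follows. Variant and field names here are illustrative, and the sketch is synchronous; the real methods are async and delegate to the wrapped shard implementations:

```rust
// Simplified sketch of Shard-level WAL retention dispatch (illustrative names).
enum Shard {
    Local(LocalShard),
    Proxy(LocalShard),
    ForwardProxy(LocalShard),
    QueueProxy(LocalShard),
    Dummy,
}

struct LocalShard {
    extended_retention: bool,
}

impl Shard {
    fn set_extended_wal_retention(&mut self) {
        // Explicit exhaustive match: adding a new variant later forces a
        // compile error here instead of silently falling into a catch-all.
        match self {
            Shard::Local(s)
            | Shard::Proxy(s)
            | Shard::ForwardProxy(s)
            | Shard::QueueProxy(s) => s.extended_retention = true,
            // No-op on Dummy, matching how other Shard helpers are dispatched.
            Shard::Dummy => {}
        }
    }
}
```

Callers stay oblivious to the concrete shard type: they invoke the method on `Shard` and the match routes it.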
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- Cargo.toml
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/wal_delta.rs
- lib/shard/src/wal.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over a catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/wal_delta.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/shard/src/wal.rs
🧠 Learnings (6)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7388
File: lib/shard/src/segment_holder/flush.rs:167-191
Timestamp: 2025-10-12T22:22:11.435Z
Learning: In Qdrant's SegmentHolder flush logic, when calculating the minimum unsaved version for WAL acknowledgment, the code intentionally returns `segment_persisted_version` (not `segment_persisted_version + 1`) when there are unsaved changes. This conservative approach assumes the last persisted version might not have been fully applied and allows the system to safely retry operations when versions match, ensuring WAL consistency.
📚 Learning: 2025-06-14T20:35:10.603Z
Learnt from: generall
Repo: qdrant/qdrant PR: 6635
File: lib/segment/src/index/field_index/full_text_index/text_index.rs:784-832
Timestamp: 2025-06-14T20:35:10.603Z
Learning: Functions gated with `#[cfg(feature = "testing")]` within `#[cfg(test)]` modules are safe from compilation issues since both the function and its call sites are restricted to test builds. The additional feature gate is often used for optional heavy test utilities.
Applied to files:
lib/collection/src/shards/local_shard/mod.rs
📚 Learning: 2025-11-14T10:01:31.926Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7526
File: src/wal_pop.rs:7-8
Timestamp: 2025-11-14T10:01:31.926Z
Learning: In the Qdrant codebase, development binaries (like src/wal_pop.rs) don't require explicit bounds checking for command-line arguments. Rust's built-in out-of-bounds panic is acceptable for these development tools.
Applied to files:
lib/collection/src/shards/local_shard/mod.rs
Cargo.toml
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
Applied to files:
lib/collection/src/shards/replica_set/replica_set_state.rs
📚 Learning: 2025-09-01T11:42:06.964Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7157
File: lib/shard/src/segment_holder/mod.rs:808-814
Timestamp: 2025-09-01T11:42:06.964Z
Learning: In Qdrant's segment holder, panicking when no segments exist during flush_all is intentional and preferred over graceful error handling, as having zero segments could permanently corrupt the WAL by acknowledging u64::MAX. The maintainers consider this condition impossible and use the panic as a fail-fast safety mechanism to prevent data corruption.
Applied to files:
lib/shard/src/wal.rs
📚 Learning: 2025-10-13T22:58:03.121Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7400
File: lib/segment/src/id_tracker/simple_id_tracker.rs:234-241
Timestamp: 2025-10-13T22:58:03.121Z
Learning: SimpleIdTracker in lib/segment/src/id_tracker/simple_id_tracker.rs is being deprecated and should not receive fixes related to version tracking or recovery logic, as it has a different version storage structure that is incompatible with newer trackers.
Applied to files:
Cargo.toml
🧬 Code graph analysis (6)
lib/collection/src/shards/queue_proxy_shard.rs (3)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/shards/local_shard/wal_ops.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/wal_delta.rs (1)
- lib/shard/src/wal.rs (2): set_extended_retention (218-222), set_normal_retention (224-227)
lib/collection/src/shards/shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)

lib/collection/src/shards/replica_set/mod.rs (5)
- lib/storage/src/content_manager/consensus/persistent.rs (2): state (60-62), this_peer_id (321-323)
- lib/collection/src/shards/transfer/mod.rs (1): this_peer_id (188-188)
- lib/storage/src/content_manager/toc/mod.rs (1): this_peer_id (621-623)
- lib/storage/src/content_manager/toc/transfer.rs (1): this_peer_id (17-19)
- lib/storage/src/content_manager/consensus_manager.rs (1): this_peer_id (205-207)

lib/shard/src/wal.rs (1)
- lib/collection/src/wal_delta.rs (2): set_extended_retention (174-176), set_normal_retention (178-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Test Python bindings
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: e2e-tests
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: lint
🔇 Additional comments (11)
lib/collection/src/shards/local_shard/mod.rs (1)
20-20: `wal_ops` module wiring for `LocalShard` looks correct

Adding `mod wal_ops;` keeps WAL-specific helpers in a focused internal module and integrates cleanly with the existing local shard layout. No issues from a visibility or organization standpoint.

lib/collection/src/shards/replica_set/replica_set_state.rs (1)

210-224: `requires_recovery` only flags `Dead`; confirm this matches recovery/retention policy

`requires_recovery()` currently returns `true` only for `ReplicaState::Dead` and `false` for all other variants, including `Recovery`, `PartialSnapshot`, and `ActiveRead`. That matches the PR intent ("extend WAL retention if some other replicas are dead"), but it also means flapping or partially recovered replicas won't trigger the extended WAL retention path via this helper.

If any of the partial/transfer states should also cause automatic recovery scheduling or prolonged WAL retention, this is the place to include them; otherwise this implementation is consistent, and the explicit match over all variants is in line with our no-catch-all `_` guideline. Based on learnings, treating `ActiveRead` as unhealthy but not necessarily "requires recovery" is also consistent with existing `is_healthy()` semantics.

lib/collection/src/shards/local_shard/wal_ops.rs (1)
1-14: Local shard WAL retention helpers delegate cleanly to `RecoverableWal`

Both `set_extended_wal_retention` and `set_normal_wal_retention` are thin async wrappers over the underlying `RecoverableWal` retention controls, without touching clock maps or adding new failure modes. The docs succinctly describe the intended behavior; no issues spotted.

lib/collection/src/shards/forward_proxy_shard.rs (1)
399-405: ForwardProxyShard correctly forwards WAL retention control to its wrapped LocalShard

Both `set_extended_wal_retention` and `set_normal_wal_retention` simply delegate to `wrapped_shard`, consistent with how other operations are forwarded through `ForwardProxyShard` and with how `Shard::set_*_wal_retention` routes calls. No issues found.

lib/collection/src/shards/queue_proxy_shard.rs (1)
273-283: QueueProxyShard WAL retention methods are lifecycle-safe

`set_extended_wal_retention` and `set_normal_wal_retention` both guard on `self.inner` before delegating to the wrapped `LocalShard`, so calling them after `forget_updates_and_finalize()` becomes a no-op instead of panicking. That's consistent with the existing Drop invariant around finalized queue proxies and avoids introducing new edge-case crashes.

lib/collection/src/wal_delta.rs (1)
174-180: RecoverableWal retention setters are minimal and maintain safe lock ordering

`set_extended_retention` / `set_normal_retention` just acquire the inner `SerdeWal` mutex and call its corresponding setters (see lib/shard/src/wal.rs), without interacting with `newest_clocks` or `oldest_clocks`. This gives LocalShard and shard wrappers a clean, infallible hook for adjusting WAL retention without introducing new lock-ordering risks.

lib/collection/src/shards/proxy_shard.rs (1)
185-191: LGTM! Clean delegation pattern.

The WAL retention control methods follow the established delegation pattern in this proxy wrapper, correctly forwarding calls to the wrapped shard.
lib/shard/src/wal.rs (2)
29-31: Constant aligns with PR objectives.

The 10× retention factor matches the PR description's intent to extend WAL retention when dead replicas exist.
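As a rough illustration of what a 10× retention factor means in practice, the sketch below applies it to a retained-segment count. The constant name, struct, and numbers are illustrative, not the actual lib/shard/src/wal.rs code, which calls `set_retention` on the underlying WAL:

```rust
// Illustrative sketch: extended retention keeps 10x more closed WAL segments.
const EXTENDED_RETENTION_FACTOR: u64 = 10;

struct WalRetention {
    normal_retained_segments: u64,
    retained_segments: u64,
}

impl WalRetention {
    fn set_extended_retention(&mut self) {
        // While some replica is dead, keep 10x the normal segment count so a
        // recovering replica can catch up via WAL delta instead of streaming.
        self.retained_segments =
            self.normal_retained_segments * EXTENDED_RETENTION_FACTOR;
    }

    fn set_normal_retention(&mut self) {
        // Once all replicas are healthy again, fall back to normal retention.
        self.retained_segments = self.normal_retained_segments;
    }
}
```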
218-227: Dependency configuration verified: The WAL dependency on branch "allow-set-retention" is correctly configured in the workspace root Cargo.toml and is actively in use (confirmed in Cargo.lock with commit 416f31a7). The `set_retention` API is available and properly called in both `set_extended_retention` and `set_normal_retention` methods.
697-699: Verify consistency: remote state changes now trigger WAL retention updates.

The addition of `on_remote_state_updated` calls when remote peer states change ensures WAL retention is adjusted appropriately. This correctly implements the feature when remote states transition.

However, verify that local state transitions also properly evaluate remote peer states. See the comment on `on_local_state_updated` for potential coordination concerns.

Also applies to: 769-771
797-824: Remote state evaluation logic is correct.

The `on_remote_state_updated` method correctly evaluates the cluster state: it activates extended WAL retention when the local peer is active and any remote peer requires recovery, and uses normal retention otherwise.
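A minimal sketch of that evaluation rule, with an illustrative subset of state names; the real method is async and reads the replica set state under a lock:

```rust
// Illustrative sketch: extended WAL retention iff the local replica is active
// and at least one remote peer requires recovery.
#[derive(Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
    Recovery,
}

impl ReplicaState {
    fn requires_recovery(self) -> bool {
        // Only Dead triggers extended retention in this sketch,
        // mirroring the helper discussed in the review.
        matches!(self, ReplicaState::Dead)
    }
}

fn wants_extended_retention(local: ReplicaState, remotes: &[ReplicaState]) -> bool {
    local == ReplicaState::Active && remotes.iter().any(|s| s.requires_recovery())
}
```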
WAL PR should be merged first!
KShivendu left a comment
Happy to see qdrant/wal#93 finally being used :)
Will rebase, fix CI, and merge
f1c73a6 to 9126cb1
Closing #7032 in favour of this PR.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @lib/collection/src/shards/replica_set/mod.rs:
- Around line 790-791: The call to local_shard.set_normal_wal_retention() in
on_local_state_updated unconditionally resets WAL retention without considering
remote peer states; update on_local_state_updated to re-evaluate cluster
retention policy instead of blindly resetting — either remove the call so only
on_remote_state_updated controls retention, or call a shared evaluator (e.g., a
new or existing function like evaluate_and_set_wal_retention or reuse
on_remote_state_updated’s logic) that inspects both local state and remote peer
states and sets normal or extended retention accordingly; ensure to reference
local_shard.set_normal_wal_retention() and whatever evaluator you introduce so
retention reflects the full cluster state after any local state change.
🧹 Nitpick comments (2)
lib/collection/src/shards/shard.rs (1)
371-391: LGTM! Consider adding a doc comment to `set_extended_wal_retention` for symmetry.

The implementation correctly delegates to each shard variant with appropriate no-ops for `Dummy`. The explicit exhaustive matching aligns with the coding guidelines for this codebase.

Minor: `set_normal_wal_retention` has a doc comment (line 381) but `set_extended_wal_retention` does not. Consider adding one for consistency, e.g., "WAL is keeping extended amount of data after truncation for recovery purposes."

lib/collection/src/shards/replica_set/mod.rs (1)

797-824: Simplify function signature by removing unused parameters.

The function parameters `_peer_id` and `_new_state` are not used; the implementation re-reads the entire peer state from `replica_state.read().peers()` instead. Since this is a private method, update the signature to `async fn on_remote_state_updated(&self)` and remove the parameters from both call sites.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (11)
- Cargo.toml
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/local_shard/mod.rs
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/queue_proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/wal_delta.rs
- lib/shard/src/wal.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- lib/collection/src/shards/local_shard/mod.rs
- lib/shard/src/wal.rs
- lib/collection/src/shards/queue_proxy_shard.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over a catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
- lib/collection/src/shards/local_shard/wal_ops.rs
- lib/collection/src/wal_delta.rs
- lib/collection/src/shards/forward_proxy_shard.rs
- lib/collection/src/shards/shard.rs
- lib/collection/src/shards/replica_set/replica_set_state.rs
- lib/collection/src/shards/proxy_shard.rs
- lib/collection/src/shards/replica_set/mod.rs
🧠 Learnings (3)
📚 Learning: 2025-11-14T10:01:31.926Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 7526
File: src/wal_pop.rs:7-8
Timestamp: 2025-11-14T10:01:31.926Z
Learning: In the Qdrant codebase, development binaries (like src/wal_pop.rs) don't require explicit bounds checking for command-line arguments. Rust's built-in out-of-bounds panic is acceptable for these development tools.
Applied to files:
Cargo.toml
📚 Learning: 2025-10-13T22:58:03.121Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7400
File: lib/segment/src/id_tracker/simple_id_tracker.rs:234-241
Timestamp: 2025-10-13T22:58:03.121Z
Learning: SimpleIdTracker in lib/segment/src/id_tracker/simple_id_tracker.rs is being deprecated and should not receive fixes related to version tracking or recovery logic, as it has a different version storage structure that is incompatible with newer trackers.
Applied to files:
Cargo.toml
📚 Learning: 2025-11-10T14:50:02.722Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7474
File: src/common/health.rs:328-330
Timestamp: 2025-11-10T14:50:02.722Z
Learning: In Qdrant's `ReplicaState`, the `ActiveRead` variant is intentionally considered unhealthy (returns `false` from `is_healthy()`) because it requires a shard transfer operation to finish. This is the correct behavior for health checks.
Applied to files:
lib/collection/src/shards/replica_set/replica_set_state.rs
🧬 Code graph analysis (5)
lib/collection/src/shards/local_shard/wal_ops.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/wal_delta.rs (1)
- lib/shard/src/wal.rs (2): set_extended_retention (218-222), set_normal_retention (224-227)
lib/collection/src/shards/forward_proxy_shard.rs (4)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)

lib/collection/src/shards/shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/proxy_shard.rs (2): set_extended_wal_retention (185-187), set_normal_wal_retention (189-191)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)

lib/collection/src/shards/proxy_shard.rs (4)
- lib/collection/src/shards/forward_proxy_shard.rs (2): set_extended_wal_retention (399-401), set_normal_wal_retention (403-405)
- lib/collection/src/shards/local_shard/wal_ops.rs (2): set_extended_wal_retention (6-8), set_normal_wal_retention (11-13)
- lib/collection/src/shards/queue_proxy_shard.rs (2): set_extended_wal_retention (273-277), set_normal_wal_retention (279-283)
- lib/collection/src/shards/shard.rs (2): set_extended_wal_retention (371-379), set_normal_wal_retention (382-390)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Test Python bindings
- GitHub Check: lint
- GitHub Check: e2e-tests
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: integration-tests-consensus
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
🔇 Additional comments (8)
lib/collection/src/wal_delta.rs (1)
174-180: LGTM!

The WAL retention control methods are correctly implemented as simple pass-throughs to the underlying `SerdeWal` implementation, with appropriate async locking.

lib/collection/src/shards/replica_set/replica_set_state.rs (1)
227-242: LGTM!

The `requires_recovery` method correctly identifies only the `Dead` state as requiring automatic recovery scheduling. The exhaustive matching ensures all current and future replica states are explicitly handled.
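A compact sketch of such an exhaustive match follows. The variant list is an illustrative subset (the real ReplicaState has more variants, including ManualRecovery); the point is that there is no catch-all `_` arm, so adding a variant forces this match to be revisited:

```rust
// Illustrative subset of replica states; only Dead requires recovery.
enum ReplicaState {
    Active,
    ActiveRead,
    Dead,
    Partial,
    PartialSnapshot,
    Recovery,
}

impl ReplicaState {
    fn requires_recovery(&self) -> bool {
        // Exhaustive match with no `_` arm: a new variant added later will
        // fail to compile here until it is explicitly classified.
        match self {
            ReplicaState::Dead => true,
            ReplicaState::Active
            | ReplicaState::ActiveRead
            | ReplicaState::Partial
            | ReplicaState::PartialSnapshot
            | ReplicaState::Recovery => false,
        }
    }
}
```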
1-14: LGTM!

The WAL retention control methods are correctly implemented with clear documentation. The delegation to the underlying `RecoverableWal` follows the established pattern throughout the codebase.

lib/collection/src/shards/forward_proxy_shard.rs (1)
399-405: LGTM!

The WAL retention methods correctly delegate to the wrapped shard, following the established proxy pattern in this file. The implementation is consistent with other proxy layers in the codebase.
Cargo.toml (1)
288-288: WAL dependency revision verified and merged.

The commit hash `2209eb2c4a359d9c32f63d868c580df360efa4a9` from WAL PR #105 exists in the repository and is already merged. The dependency update is safe to proceed.

lib/collection/src/shards/replica_set/mod.rs (2)
697-699: LGTM!

Correctly routes remote peer state changes to `on_remote_state_updated` for WAL retention evaluation.

769-771: LGTM!

Consistent with the `add_remote` change: remote state updates now properly trigger WAL retention evaluation.
184-191: LGTM!

Clean delegation to `wrapped_shard`, following the established proxy pattern. Consistent with the parallel implementations in `ForwardProxyShard` and `QueueProxyShard`.
// Reset WAL retention to normal whenever local shard changes state
local_shard.set_normal_wal_retention().await;
Logic gap: local state change doesn't re-evaluate remote peer states for retention.
When the local peer transitions to Active, this code resets WAL retention to normal. However, if remote peers are Dead or require recovery, extended retention should be active instead. The current implementation doesn't check remote states after a local state change.
Scenario:
- Local is `Dead`, remote is `Dead` → normal retention
- Local transitions to `Active` → `on_local_state_updated` resets to normal
- Remote is still `Dead` → extended retention should be active, but it isn't
Consider either removing the retention reset here (let on_remote_state_updated be the sole arbiter), or re-evaluating retention based on the full cluster state after any local state change.
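A minimal sketch of the shared-evaluator option, assuming a hypothetical `evaluate_and_set_wal_retention` helper; all names are illustrative, the real methods are async, and only two states are modeled for brevity:

```rust
// Illustrative sketch: local state changes re-evaluate the full cluster state
// instead of blindly resetting WAL retention to normal.
#[derive(Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Dead,
}

struct ReplicaSet {
    local: ReplicaState,
    remotes: Vec<ReplicaState>,
    extended_retention: bool,
}

impl ReplicaSet {
    // Hypothetical shared evaluator: single source of truth for retention.
    fn evaluate_and_set_wal_retention(&mut self) {
        let remote_needs_recovery =
            self.remotes.iter().any(|s| *s == ReplicaState::Dead);
        // Extended retention whenever the local shard is active and any
        // remote is dead, even right after a local state change. This closes
        // the gap in the scenario above.
        self.extended_retention =
            self.local == ReplicaState::Active && remote_needs_recovery;
    }

    fn on_local_state_updated(&mut self, new_state: ReplicaState) {
        self.local = new_state;
        // Re-evaluate instead of unconditionally calling set_normal_wal_retention.
        self.evaluate_and_set_wal_retention();
    }
}
```

With this shape, `on_remote_state_updated` would call the same evaluator, so both hooks agree on the cluster-wide retention decision.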
🤖 Prompt for AI Agents
In @lib/collection/src/shards/replica_set/mod.rs around lines 790 - 791, The
call to local_shard.set_normal_wal_retention() in on_local_state_updated
unconditionally resets WAL retention without considering remote peer states;
update on_local_state_updated to re-evaluate cluster retention policy instead of
blindly resetting — either remove the call so only on_remote_state_updated
controls retention, or call a shared evaluator (e.g., a new or existing function
like evaluate_and_set_wal_retention or reuse on_remote_state_updated’s logic)
that inspects both local state and remote peer states and sets normal or
extended retention accordingly; ensure to reference
local_shard.set_normal_wal_retention() and whatever evaluator you introduce so
retention reflects the full cluster state after any local state change.
* extend WAL retention if some other replicas are dead
* use merged WAL commit
* Handle ManualRecovery state
---------
Co-authored-by: KShivendu <kshivendu1@gmail.com>
Make the WAL retain 10× more segments if there are dead replicas in the cluster.
In theory, this should prevent shard recovery from falling back to streaming under high load.
Depends on qdrant/wal#105