Improve crash recovery resilience by kjnilsson · Pull Request #589 · rabbitmq/ra

kjnilsson · 2026-03-10T10:37:06Z

Summary

Fixes several issues that can cause Ra nodes to crash during boot after an unclean shutdown (e.g. power loss).

Changes

Recover corrupt snapshot indexes from machine state. The indexes file is not fsynced, so corruption after a crash is expected. Instead of crashing with a badmatch in find_snapshots, ra_snapshot now recovers live indexes by reading the snapshot and calling ra_machine:live_indexes/2.
Fix WAL recovery crash from concurrent segment writer deletes. During multi-file WAL recovery the segment writer can delete mem table ETS entries that the next WAL file's recovery depends on, causing an unhandled gap_detected error. Fixed by carrying the Tables map across WAL files so recovery reuses its own ra_mt state instead of re-scanning a mutated ETS table.
Fix segment deletion after dual WAL flush. After a crash between segment flush and WAL deletion, compact_segrefs truncates overlapping segment ranges, but the deletion used the -- operator, which compares full tuples, so truncated-but-still-referenced segments were incorrectly deleted. Deletion now compares by filename only.
Send RPCs to snapshot_backoff peers when leader enforces leadership. make_all_rpcs now includes snapshot_backoff peers so a lagging follower that triggered a pre_vote_rpc gets its snapshot immediately rather than waiting for the backoff timer.
Register new servers after log init. Ensures the config file is fully written before registration, as it is required for recovery.
Sync parent directory after creating config file. Ensures the directory entry is durable after a crash.
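The last item can be illustrated with a minimal sketch, written in Python rather than Ra's Erlang. It is not the code from this PR: the function name write_file_durably is hypothetical, and the only point shown is the general technique of syncing the file and then the directory that contains it, skipping the directory step on Windows as the PR does.

```python
import os
import tempfile


def write_file_durably(path: str, data: bytes) -> None:
    """Write data to path, fsync the file, then fsync its parent directory.

    Hypothetical helper for illustration only; not part of Ra.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # make the file contents durable

    if os.name == "nt":
        # Directories cannot be opened and synced this way on Windows,
        # so the directory sync is skipped there.
        return

    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)      # make the new directory entry durable
    finally:
        os.close(dir_fd)


def demo() -> bytes:
    # Create a config file in a scratch directory and read it back.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "config")
        write_file_durably(path, b"example config")
        with open(path, "rb") as f:
            return f.read()
```

Without the directory sync, the file's contents may be on disk while the directory entry that names it is not, so the file can be missing after a power loss.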

The indexes file is intentionally not fsynced, so corruption after a crash is expected. Previously this caused a badmatch crash in find_snapshots during init. Now ra_snapshot carries the machine config and can recover live indexes by reading the snapshot and calling ra_machine:live_indexes/2, the same approach used in complete_accept.
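The recover-instead-of-crash pattern described above can be sketched as follows. This is a Python model under stated assumptions, not Ra's implementation: the real indexes file is not JSON, and load_live_indexes and the recompute callback are hypothetical stand-ins for reading the indexes file and for reading the snapshot and calling ra_machine:live_indexes/2.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_live_indexes(path: str, recompute: Callable[[], List[int]]) -> List[int]:
    """Return the indexes stored at path, or recompute them if the file is unusable.

    Assumption for illustration: the file holds a JSON list of integers.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt file: fall back to the snapshot.
        return recompute()
    if isinstance(data, list) and all(isinstance(i, int) for i in data):
        return data
    # Parsed, but not the expected shape: treat it as corrupt too.
    return recompute()


def demo() -> List[int]:
    # Simulate an indexes file left corrupt by a crash.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "indexes")
        with open(path, "wb") as f:
            f.write(b"\x00\x01 not valid")
        # The lambda stands in for recomputing live indexes from the snapshot.
        return load_live_indexes(path, lambda: [3, 7, 9])
```

Because the file is only a cache of information that can be derived from the snapshot, treating any read failure as a cache miss is safe.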

When a crash occurs after the segment writer flushes a WAL file to segments but before the WAL file is deleted, recovery replays the same WAL, creating segments that overlap with those from the first flush. compact_segrefs correctly handles this by truncating the range of partially overlapping segment refs.

However, the deletion logic used the -- operator, which compares full {Filename, Range} tuples. A segment whose range was truncated (but not removed) no longer matched its original ref, so it appeared in the diff and was deleted even though the reader still referenced it. The subsequent fold during state machine recovery then crashed with ra_log_failed_to_open_segment enoent.

Compare by filename only when deciding which segments to delete, so that segments still referenced by the reader (even with a truncated range) are preserved.
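The difference between the two comparisons can be shown with a small Python model. It is only an illustration: the function names and segment filenames are made up, and the list comprehension in obsolete_by_tuple mimics Erlang's -- operator only for lists without duplicate elements.

```python
from typing import List, Tuple

# (filename, (first_index, last_index)); a model of a {Filename, Range} segment ref.
SegRef = Tuple[str, Tuple[int, int]]


def obsolete_by_tuple(before: List[SegRef], after: List[SegRef]) -> List[SegRef]:
    """Old behaviour: diff on whole tuples, like Before -- After in Erlang."""
    return [ref for ref in before if ref not in after]


def obsolete_by_filename(before: List[SegRef], after: List[SegRef]) -> List[str]:
    """New behaviour: a file is obsolete only if no remaining ref names it."""
    live = {filename for filename, _ in after}
    return [filename for filename, _ in before if filename not in live]


# Hypothetical refs before compaction.
before = [
    ("s0.segment", (1, 40)),
    ("s1.segment", (1, 100)),
    ("s2.segment", (90, 200)),
]
# After compaction: s0 is dropped entirely, s1 is kept with a truncated range.
after = [
    ("s1.segment", (1, 89)),
    ("s2.segment", (90, 200)),
]

wrongly_deleted = obsolete_by_tuple(before, after)
correctly_deleted = obsolete_by_filename(before, after)
```

Here wrongly_deleted contains the s1.segment ref, because its tuple changed when the range was truncated, while correctly_deleted contains only s0.segment.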

When a leader receives a pre_vote_rpc from a follower with a stale term, make_all_rpcs now includes peers in snapshot_backoff status alongside normal peers. This ensures the lagging follower that triggered the pre-vote gets its snapshot promptly rather than waiting for the backoff timer to fire. The pending backoff timer is cancelled via a new cancel_snapshot_retry_timer effect before the RPC is sent.

…concurrently

During multi-file WAL recovery after a power-off, the segment writer processes mem tables from earlier WAL files asynchronously. When servers have no Pid (normal during recovery), the segment writer deletes entries directly from the mem table ETS.

If this deletion races with recovery of the next WAL file, recover_entry calls mem_table_please, which re-scans the (now partially depleted) ETS table. The resulting ra_mt state has a LastSeq that no longer matches the PrevIdx tracked in the writers map, causing ra_mt:insert_sparse to return {error, gap_detected}, an unhandled case_clause in recover_entry that crashes the node at boot.

Fix by carrying the Tables map across WAL files in the recovery fold, alongside the already-carried Writers map. This way recover_entry reuses the ra_mt state it built during earlier file recovery rather than re-scanning a potentially mutated ETS table.

Made-with: Cursor
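The race and the fix can be modelled with a toy Python sketch. Everything here is a simplification for illustration and not Ra's code: MemTable stands in for ra_mt state, the shared dict stands in for the mem table ETS, and the between_files callback simulates the segment writer deleting entries between WAL files.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


class GapDetected(Exception):
    """Model of the {error, gap_detected} result."""


class MemTable:
    """Toy stand-in for ra_mt state: an ordered list of written indexes."""

    def __init__(self, indexes: Iterable[int] = ()) -> None:
        self.indexes: List[int] = list(indexes)

    @property
    def last(self) -> Optional[int]:
        return self.indexes[-1] if self.indexes else None

    def insert_sparse(self, prev_idx: Optional[int], idx: int) -> None:
        # The table's last index must match what the writer saw previously.
        if self.last != prev_idx:
            raise GapDetected(f"last={self.last} prev_idx={prev_idx}")
        self.indexes.append(idx)


WalFile = List[Tuple[str, int]]  # (server uid, index) entries


def recover(
    wal_files: List[WalFile],
    shared: Dict[str, List[int]],
    carry_tables: bool,
    between_files: Optional[Callable[[Dict[str, List[int]]], None]] = None,
) -> Dict[str, MemTable]:
    writers: Dict[str, int] = {}       # carried across files in both variants
    tables: Dict[str, MemTable] = {}   # carried only when carry_tables is True
    for entries in wal_files:
        if not carry_tables:
            tables = {}                # old behaviour: rebuild per WAL file
        for uid, idx in entries:
            if uid not in tables:
                # Re-scan of the shared store, which others may have mutated.
                tables[uid] = MemTable(shared.get(uid, []))
            tables[uid].insert_sparse(writers.get(uid), idx)
            shared.setdefault(uid, []).append(idx)
            writers[uid] = idx
        if between_files is not None:
            between_files(shared)      # simulated concurrent segment writer
    return tables


def segment_writer_deletes(shared: Dict[str, List[int]]) -> None:
    # Simulates flushed entries being deleted from the shared store.
    shared["u1"].clear()


wal = [[("u1", 1), ("u1", 2)], [("u1", 3), ("u1", 4)]]
recovered = recover(wal, {}, carry_tables=True, between_files=segment_writer_deletes)
```

With carry_tables=False and the same simulated deletion, the second WAL file rebuilds its table from the emptied shared store and insert_sparse raises GapDetected, which mirrors the boot crash described above.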


kjnilsson added 5 commits March 10, 2026 10:36

test fix

500435b

michaelklishin approved these changes Mar 11, 2026

kjnilsson marked this pull request as draft March 11, 2026 09:30

kjnilsson added 2 commits March 11, 2026 10:22

Sync parent directory after creating config file.

86e6e2b

Otherwise the node may fail to boot. Ignored on Windows.

Change registration vs log init order for new servers.

3ec78e4

New servers should register _after_ log initialisation to ensure the config file is fully written, as it is required for successful recovery.

kjnilsson force-pushed the recover-indexes branch from d84c687 to 3ec78e4 Compare March 11, 2026 10:22

kjnilsson changed the title ~~Recover corrupt snapshot indexes file from machine state~~ Improve crash recovery resilience Mar 11, 2026

kjnilsson marked this pull request as ready for review March 11, 2026 12:34

mkuratczyk self-requested a review March 11, 2026 12:53

mkuratczyk approved these changes Mar 11, 2026

kjnilsson merged commit d674387 into main Mar 11, 2026
7 checks passed

michaelklishin added this to the 3.0.1 milestone Mar 13, 2026

dumbbell-rabbitmq deleted the recover-indexes branch March 23, 2026 14:54

kjnilsson merged 7 commits into main from recover-indexes
