Improve crash recovery resilience #589

Merged: kjnilsson merged 7 commits into main from recover-indexes on Mar 11, 2026
@kjnilsson (Contributor) commented Mar 10, 2026

Summary

Fixes several issues that can cause Ra nodes to crash during boot after an unclean shutdown (e.g. power loss).

Changes

  • Recover corrupt snapshot indexes from machine state. The indexes file is not fsynced, so corruption after a crash is expected. Instead of crashing with a badmatch in find_snapshots, ra_snapshot now recovers live indexes by reading the snapshot and calling ra_machine:live_indexes/2.

  • Fix WAL recovery crash from concurrent segment writer deletes. During multi-file WAL recovery the segment writer can delete mem table ETS entries that the next WAL file's recovery depends on, causing an unhandled gap_detected error. Fixed by carrying the Tables map across WAL files so recovery reuses its own ra_mt state instead of re-scanning a mutated ETS table.

  • Fix segment deletion after dual WAL flush. After a crash between segment flush and WAL deletion, compact_segrefs truncates overlapping segment ranges — but the -- based deletion compared full tuples, so truncated-but-still-referenced segments were incorrectly deleted. Now compares by filename only.

  • Send RPCs to snapshot_backoff peers when leader enforces leadership. make_all_rpcs now includes snapshot_backoff peers so a lagging follower that triggered a pre_vote_rpc gets its snapshot immediately rather than waiting for the backoff timer.

  • Register new servers after log init. Ensures the config file is fully written before registration, as it is required for recovery.

  • Sync parent directory after creating config file. Ensures the directory entry is durable after a crash.

The indexes file is intentionally not fsynced, so corruption after a
crash is expected. Previously, this caused a badmatch crash in
find_snapshots during init. Now ra_snapshot carries the machine config
and can recover live indexes by reading the snapshot and calling
ra_machine:live_indexes/2, the same approach used in complete_accept.

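
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Ra's implementation: the module name, the `recover_indexes/2` function, the `{ok, _}` / `{recovered, _}` return shapes, and the assumption that the file holds a `term_to_binary/1` encoded list are all hypothetical. `RecomputeFun` stands in for "read the snapshot and call ra_machine:live_indexes/2".

```erlang
-module(indexes_recovery_sketch).
-export([recover_indexes/2]).

%% IndexesFile:  path to the (possibly corrupt) indexes file.
%% RecomputeFun: zero-arity fun standing in for reading the snapshot and
%%               deriving the live indexes from the machine state.
%%               Hypothetical glue, not part of Ra's API.
recover_indexes(IndexesFile, RecomputeFun) when is_function(RecomputeFun, 0) ->
    case file:read_file(IndexesFile) of
        {ok, Bin} ->
            %% ASSUMPTION: the file contains a term_to_binary/1 encoded list.
            try binary_to_term(Bin) of
                Indexes when is_list(Indexes) ->
                    {ok, Indexes};
                _Unexpected ->
                    {recovered, RecomputeFun()}
            catch
                error:badarg ->
                    %% Truncated or garbage bytes: fall back instead of crashing.
                    {recovered, RecomputeFun()}
            end;
        {error, _Reason} ->
            %% Sketch-only choice: treat an unreadable file like a corrupt one.
            {recovered, RecomputeFun()}
    end.
```

The point of the shape is that a bad read becomes an ordinary branch with a recomputed result rather than a failed pattern match during init.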
When a crash occurs after the segment writer flushes a WAL file to
segments but before the WAL file is deleted, recovery replays the
same WAL, creating segments that overlap with those from the first
flush. compact_segrefs correctly handles this by truncating the
range of partially overlapping segment refs. However, the deletion
logic used the -- operator, which compares full {Filename, Range}
tuples. A segment whose range was truncated (but not removed) no
longer matched its original ref, so it appeared in the diff and
was deleted even though the reader still referenced it. The
subsequent fold during state machine recovery then crashed with
ra_log_failed_to_open_segment enoent.

Compare by filename only when deciding which segments to delete,
so that segments still referenced by the reader (even with a
truncated range) are preserved.
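
The difference between the two comparisons can be shown with a small standalone sketch. The module and function names are hypothetical, and the segment refs are simplified `{Filename, {From, To}}` tuples with string filenames chosen for illustration; this is not the code from ra_log.

```erlang
-module(segref_delete_sketch).
-export([to_delete_by_tuple/2, to_delete_by_name/2]).

%% Segment refs are modelled as {Filename, {From, To}} tuples.

%% Old behaviour: `--` only removes elements that are equal as whole
%% tuples, so a ref whose range was truncated still shows up in the diff.
to_delete_by_tuple(Before, After) ->
    [File || {File, _Range} <- Before -- After].

%% Fixed behaviour: a file is a deletion candidate only if no remaining
%% ref names it, regardless of the range attached to that ref.
to_delete_by_name(Before, After) ->
    Kept = [File || {File, _Range} <- After],
    [File || {File, _Range} <- Before, not lists:member(File, Kept)].
```

With `Before = [{"1.segment", {1, 100}}, {"2.segment", {51, 150}}]` and `After = [{"1.segment", {1, 50}}, {"2.segment", {51, 150}}]`, the tuple-based version selects `"1.segment"` for deletion even though it is still referenced, while the filename-based version selects nothing.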
When a leader receives a pre_vote_rpc from a follower with a stale
term, make_all_rpcs now includes peers in snapshot_backoff status
alongside normal peers. This ensures the lagging follower that
triggered the pre-vote gets its snapshot promptly rather than
waiting for the backoff timer to fire. The pending backoff timer is
cancelled via a new cancel_snapshot_retry_timer effect before the
RPC is sent.
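
As a rough illustration of the peer selection only, the sketch below filters a simplified peer map by status. The module name, `rpc_targets/2`, the boolean flag, and the map layout are assumptions made for the example; the real make_all_rpcs works on Ra's own peer records, and the timer cancellation effect is not modelled here.

```erlang
-module(peer_select_sketch).
-export([rpc_targets/2]).

%% Peers :: #{PeerId => #{status := term()}}
%% ASSUMPTION: a simplified peer map; only the status field matters here.
%% IncludeBackoff is true when the leader is enforcing leadership.
rpc_targets(Peers, IncludeBackoff) ->
    Wanted = case IncludeBackoff of
                 true -> [normal, snapshot_backoff];
                 false -> [normal]
             end,
    %% Sorted only to make the result deterministic for the example.
    lists:sort([Id || {Id, #{status := Status}} <- maps:to_list(Peers),
                      lists:member(Status, Wanted)]).
```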
…concurrently

During multi-file WAL recovery after a power-off, the segment writer
processes mem tables from earlier WAL files asynchronously. When servers
have no Pid (normal during recovery), the segment writer deletes entries
directly from the mem table ETS. If this deletion races with recovery of
the next WAL file, recover_entry calls mem_table_please which re-scans
the (now partially depleted) ETS table. The resulting ra_mt state has a
LastSeq that no longer matches the PrevIdx tracked in the writers map,
causing ra_mt:insert_sparse to return {error, gap_detected} — an
unhandled case_clause in recover_entry that crashes the node at boot.

Fix by carrying the Tables map across WAL files in the recovery fold,
alongside the already-carried Writers map. This way recover_entry reuses
the ra_mt state it built during earlier file recovery rather than
re-scanning a potentially mutated ETS table.
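
The shape of the fix, threading both maps through the fold over WAL files, might look like the sketch below. `recover_files/2` and the `RecoverFile` callback are hypothetical stand-ins for the per-file recovery logic and are not taken from ra_log_wal.

```erlang
-module(wal_fold_sketch).
-export([recover_files/2]).

%% RecoverFile :: fun((File, Writers, Tables) -> {Writers1, Tables1})
%% is a hypothetical stand-in for recovering a single WAL file.
recover_files(Files, RecoverFile) when is_function(RecoverFile, 3) ->
    lists:foldl(
      fun(File, {Writers0, Tables0}) ->
              %% Both maps are carried in the accumulator, so recovery of
              %% the next file starts from the table state built so far
              %% instead of rebuilding it from shared, mutable storage.
              RecoverFile(File, Writers0, Tables0)
      end, {#{}, #{}}, Files).
```

Because the accumulator is the only source of table state between files, a concurrent delete elsewhere cannot change what the next file's recovery observes.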

Made-with: Cursor
@kjnilsson marked this pull request as draft March 11, 2026 09:30
Otherwise it may fail to boot. Ignored on Windows.
New servers should register _after_ log initialisation to ensure the
config file is fully written, as it is required for successful recovery.
@kjnilsson changed the title from "Recover corrupt snapshot indexes file from machine state" to "Improve crash recovery resilience" Mar 11, 2026
@kjnilsson marked this pull request as ready for review March 11, 2026 12:34
@mkuratczyk self-requested a review March 11, 2026 12:53
@kjnilsson merged commit d674387 into main Mar 11, 2026
7 checks passed
@michaelklishin added this to the 3.0.1 milestone Mar 13, 2026
@dumbbell-rabbitmq deleted the recover-indexes branch March 23, 2026 14:54