Wal gap detection fixes#581

Merged
kjnilsson merged 1 commit into main from wal-gap-detection-fixes
Feb 23, 2026

Conversation

Contributor

@kjnilsson kjnilsson commented Feb 19, 2026

The WAL writers map tracks the last index written by each ra server
and is used for gap detection. Previously this state was only recovered
by replaying .wal files on disk, so if the segment writer deleted a
WAL file and the WAL then crashed before the new file had any entries,
all tracking was lost. Subsequent writes with gaps would be silently
accepted, leading to gaps in segments.
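The check the writers map enables can be sketched as follows. This is an illustrative model, not Ra's actual code; the module and function names are assumptions:

```erlang
%% Illustrative sketch of WAL gap detection against a writers map
%% that tracks the last index written per writer UId.
-module(gap_check).
-export([check/3]).

check(UId, Idx, Writers) ->
    case maps:find(UId, Writers) of
        {ok, Last} when Idx =:= Last + 1 ->
            %% contiguous write: accept and advance tracking
            {ok, Writers#{UId => Idx}};
        {ok, Last} when Idx =< Last ->
            %% overwrite of an earlier index: accept and reset tracking
            {ok, Writers#{UId => Idx}};
        {ok, Last} ->
            %% index skips ahead of the last known write: a gap
            {gap_detected, Last, Idx};
        error ->
            %% unknown writer: no state to compare against, so a gap
            %% cannot be detected -- this is exactly the failure mode
            %% described above when tracking state is lost
            {ok, Writers#{UId => Idx}}
    end.
```

The `error` clause is the crux of the bug: once the map is empty, a post-gap write from any writer looks like a first write and is accepted.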

Write a writers.snapshot file in the WAL directory on every roll-over.
During recovery, seed the writers map from this file before replaying
any WAL files. The snapshot is overwritten in place on each roll-over
so there is no cleanup needed.
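A minimal sketch of the persistence scheme, assuming a plain `term_to_binary`-encoded map (the file name matches the description; the module and encoding are assumptions):

```erlang
%% Illustrative sketch of persisting the writers map on roll-over
%% and seeding it again at recovery time.
-module(writers_snapshot).
-export([write/2, read/1]).

write(Dir, Writers) when is_map(Writers) ->
    File = filename:join(Dir, "writers.snapshot"),
    %% overwritten in place on each roll-over, so no cleanup is needed
    ok = file:write_file(File, term_to_binary(Writers)).

read(Dir) ->
    File = filename:join(Dir, "writers.snapshot"),
    case file:read_file(File) of
        {ok, Bin} ->
            binary_to_term(Bin);
        {error, enoent} ->
            %% no snapshot yet: start with an empty writers map
            #{}
    end.
```

During recovery the map read here would be the starting point, with any `.wal` files on disk replayed on top of it.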

Micro-benchmark shows < 200us overhead per roll-over for typical
deployments (up to a few hundred writers).

GC stale entries from the WAL writers map

The writers map grows monotonically -- when a ra server is deleted its
entry persists in memory and in the writers.snapshot file forever.

Add two complementary cleanup mechanisms:

1. At recovery time, trim the writers map against ra_directory to
   remove entries for UIDs that are no longer registered.

2. At runtime, accept a forget_writer cast so that servers being
   deleted can evict their entry immediately. Called from
   ra_server_proc:terminate/3 on {shutdown, delete}.

Phase 1 is the reliable catch-all; phase 2 handles the common case
without waiting for a WAL restart.
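Both mechanisms reduce to simple map operations. A sketch, with illustrative names (the registered-UId set would come from ra_directory):

```erlang
%% Illustrative sketch of the two writers-map cleanup mechanisms.
-module(writers_gc).
-export([trim/2, forget/2]).

%% Phase 1: at recovery, keep only entries whose UId is still
%% registered; everything else is stale and dropped in one pass.
trim(Writers, RegisteredUIds) when is_list(RegisteredUIds) ->
    maps:with(RegisteredUIds, Writers).

%% Phase 2: at runtime, evict a single writer's entry immediately
%% (the body of a forget_writer cast handler).
forget(UId, Writers) ->
    maps:remove(UId, Writers).
```

Phase 2 keeps the map tight between restarts; phase 1 guarantees that entries missed at runtime (e.g. a server deleted while the WAL was down) are still reclaimed.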

Fix log state divergence after mem table overwrites segment range

During init, when the mem table contains entries that overwrite part of
the segment range, the segment range upper bound was not being clamped.
This caused MaxConfirmedWrittenIdx and LastSegRefIdx to be derived from
stale segment indexes, leading to incorrect last_written_index_term and
pending state after restart.

Fix ra_range:combine/2 to clamp the upper bound to AddRange's end,
since AddRange represents newer data that supersedes the tail of the
existing range. Compute a TruncSegmentRange in ra_log:init/1 that
excludes segment indexes overwritten by the mem table, and use it
when recovering MaxConfirmedWrittenIdx and LastSegRefIdx.
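The clamping rule can be modelled on ranges represented as `{Start, End}` tuples. This is a sketch of the described behaviour, not ra_range's actual implementation:

```erlang
%% Illustrative sketch of the combine/2 clamping fix: AddRange is
%% newer data, so the combined range must never extend past
%% AddRange's end, even when the existing range does.
-module(range_combine).
-export([combine/2]).

combine(undefined, Range) ->
    Range;
combine(Range, undefined) ->
    Range;
combine({AddStart, AddEnd}, {CurStart, _CurEnd}) ->
    %% lower bound: earliest start of either range
    %% upper bound: clamped to the newer range's end
    {min(AddStart, CurStart), AddEnd}.
```

With the pre-fix behaviour (taking `max` of the ends), a mem table covering `{5, 10}` combined with a stale segment range `{1, 20}` would yield `{1, 20}` and resurrect overwritten indexes 11..20; clamping yields `{1, 10}`.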

Harden take_after_overwrite_and_init to assert that the log overview
is consistent across close and re-init.

Collaborator

@the-mikedavis the-mikedavis left a comment

I could reproduce the missing_key error by emulating the new test case ra_log_2_SUITE:writes_not_lost_after_multiple_wal_crashes/1 with a QQ in RabbitMQ.

rabbit_repro:run/0...
-module(rabbit_repro).

-export([run/0]).

run() ->
    WalPid = whereis(ra_log_wal),
    erlang:suspend_process(WalPid),

    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),

    exit(WalPid, kill),
    wait_for_wal(WalPid),

    WalPid2 = whereis(ra_log_wal),
    exit(WalPid2, kill),
    wait_for_wal(WalPid2),

    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),
    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),

    timer:sleep(100),
    rabbit_quorum_queue:wal_force_roll_over(node()),

    ok.

wait_for_wal(OldPid) ->
    ok = ra_lib:retry(fun () ->
                              P = whereis(ra_log_wal),
                              is_pid(P) andalso P =/= OldPid
                      end, 100, 100).

I also edited the local copy of Ra to drop the ra_log:resend_from/2 call here. (Return {State0, []} instead.) To reproduce this we need the server's gap detection to fail to re-send commands. That might happen in the wild if the WAL crashes with the resend message in its mailbox, but the timing is very hard to reproduce.

  1. Start a broker on main with make run-broker
  2. Publish a few messages into a QQ named 'qq' perf-test -x 1 -y 0 --pmessages 5 -u qq -qq
  3. Run rabbit_repro:run(). in the shell
  4. Dump the segment's index (ra_log_segment:dump_index/1 on the segment file path), notice the gap
  5. Reboot or rabbitmqctl stop_app + rabbitmqctl start_app
Index dump...
> ra_log_segment:dump_index("/tmp/rabbitmq-test-instances/rabbit@mango2/mnesia/rabbit@mango2/quorum/rabbit@mango2/2F_QQ8K3N38RGI2N8/0000000000000001.segment").
[{0,0,131080},
 {1,1,131092},
 {2,1,131121},
 {3,1,131655},
 {4,1,131838},
 {5,1,132095},
 {6,1,132352},
 {7,1,132609},
 {8,1,132866},
 {9,1,133123},
 %% 10 is missing here!!
 {11,1,133203},
 {12,1,133261},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,...},
 {0,...},
 {...}|...]
Error logs on reboot...
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>   crasher:
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     initial call: ra_server_proc:init/1
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     pid: <0.1193.0>
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     registered_name: '%2F_qq'
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     exception exit: {missing_key,10,
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>                                  <<"/tmp/rabbitmq-test-instances/rabbit@mango2/mnesia/rabbit@mango2/quorum/rabbit@mango2/2F_QQ8K3
N38RGI2N8/0000000000000001.segment">>}
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in function  ra_log_segment:fold0/8 (src/ra_log_segment.erl, line 426)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:'-segment_fold/5-fun-1-'/4 (src/ra_log_reader.erl, line 344)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:'-segment_fold/5-lists^foldl/2-1-'/3 (src/ra_log_reader.erl, line 341)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:segment_fold/5 (src/ra_log_reader.erl, line 341)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log:fold/5 (src/ra_log.erl, line 462)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_server:recover/1 (src/ra_server.erl, line 456)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_server_proc:recover/3 (src/ra_server_proc.erl, line 400)

@the-mikedavis
Collaborator

the-mikedavis commented Feb 19, 2026

I tried out this branch and rabbitmq/rabbitmq-server#13885 in the server together and the reproduction no longer triggers the error. We can even add ?assert(false) to the branch that calls resend_from/2 because it isn't hit anymore. (The WAL explicitly re-requests it when the new commands are sent.)

@kjnilsson kjnilsson force-pushed the wal-gap-detection-fixes branch 3 times, most recently from eaad846 to bd15eae on February 23, 2026 15:09
@kjnilsson kjnilsson force-pushed the wal-gap-detection-fixes branch from bd15eae to f68c255 on February 23, 2026 15:10
@kjnilsson kjnilsson marked this pull request as ready for review February 23, 2026 15:10
@kjnilsson kjnilsson merged commit da3f497 into main Feb 23, 2026
6 checks passed
@michaelklishin michaelklishin added this to the 2.17.3 milestone Feb 23, 2026
@lukebakken lukebakken deleted the wal-gap-detection-fixes branch February 26, 2026 17:51
@the-mikedavis the-mikedavis modified the milestones: 2.17.3, 3.0.0 Mar 5, 2026