Wal gap detection fixes#581

Merged
kjnilsson merged 1 commit into main from wal-gap-detection-fixes
Feb 23, 2026

Conversation

Contributor

@kjnilsson kjnilsson commented Feb 19, 2026

The WAL writers map tracks the last index written by each ra server
and is used for gap detection. Previously this state was only recovered
by replaying .wal files on disk, so if the segment writer deleted a
WAL file and the WAL then crashed before the new file had any entries,
all tracking was lost. Subsequent writes with gaps would be silently
accepted, leading to gaps in segments.
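The check the writers map enables can be sketched as follows. This is an illustrative model, not Ra's actual code; the module and function names are assumptions:

```erlang
%% Illustrative sketch of WAL gap detection against a writers map
%% that tracks the last index written per writer UId.
-module(gap_check).
-export([check/3]).

check(UId, Idx, Writers) ->
    case maps:find(UId, Writers) of
        {ok, Last} when Idx =:= Last + 1 ->
            %% contiguous write: accept and advance tracking
            {ok, Writers#{UId => Idx}};
        {ok, Last} when Idx =< Last ->
            %% overwrite of an earlier index: accept and reset tracking
            {ok, Writers#{UId => Idx}};
        {ok, Last} ->
            %% index skips ahead of the last known write: a gap
            {gap_detected, Last, Idx};
        error ->
            %% unknown writer: no state to compare against, so a gap
            %% cannot be detected -- this is exactly the failure mode
            %% described above when tracking state is lost
            {ok, Writers#{UId => Idx}}
    end.
```

The `error` clause is the crux of the bug: once the map is empty, a post-gap write from any writer looks like a first write and is accepted.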

Write a writers.snapshot file in the WAL directory on every roll-over.
During recovery, seed the writers map from this file before replaying
any WAL files. The snapshot is overwritten in place on each roll-over
so there is no cleanup needed.
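A minimal sketch of the persistence scheme, assuming a plain `term_to_binary`-encoded map (the file name matches the description; the module and encoding are assumptions):

```erlang
%% Illustrative sketch of persisting the writers map on roll-over
%% and seeding it again at recovery time.
-module(writers_snapshot).
-export([write/2, read/1]).

write(Dir, Writers) when is_map(Writers) ->
    File = filename:join(Dir, "writers.snapshot"),
    %% overwritten in place on each roll-over, so no cleanup is needed
    ok = file:write_file(File, term_to_binary(Writers)).

read(Dir) ->
    File = filename:join(Dir, "writers.snapshot"),
    case file:read_file(File) of
        {ok, Bin} ->
            binary_to_term(Bin);
        {error, enoent} ->
            %% no snapshot yet: start with an empty writers map
            #{}
    end.
```

During recovery the map read here would be the starting point, with any `.wal` files on disk replayed on top of it.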

Micro-benchmark shows < 200us overhead per roll-over for typical
deployments (up to a few hundred writers).

GC stale entries from the WAL writers map

The writers map grows monotonically -- when a ra server is deleted its
entry persists in memory and in the writers.snapshot file forever.

Add two complementary cleanup mechanisms:

1. At recovery time, trim the writers map against ra_directory to
   remove entries for UIDs that are no longer registered.

2. At runtime, accept a forget_writer cast so that servers being
   deleted can evict their entry immediately. Called from
   ra_server_proc:terminate/3 on {shutdown, delete}.

Phase 1 is the reliable catch-all; phase 2 handles the common case
without waiting for a WAL restart.
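Both mechanisms reduce to simple map operations. A sketch, with illustrative names (the registered-UId set would come from ra_directory):

```erlang
%% Illustrative sketch of the two writers-map cleanup mechanisms.
-module(writers_gc).
-export([trim/2, forget/2]).

%% Phase 1: at recovery, keep only entries whose UId is still
%% registered; everything else is stale and dropped in one pass.
trim(Writers, RegisteredUIds) when is_list(RegisteredUIds) ->
    maps:with(RegisteredUIds, Writers).

%% Phase 2: at runtime, evict a single writer's entry immediately
%% (the body of a forget_writer cast handler).
forget(UId, Writers) ->
    maps:remove(UId, Writers).
```

Phase 2 keeps the map tight between restarts; phase 1 guarantees that entries missed at runtime (e.g. a server deleted while the WAL was down) are still reclaimed.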

Fix log state divergence after mem table overwrites segment range

During init, when the mem table contains entries that overwrite part of
the segment range, the segment range upper bound was not being clamped.
This caused MaxConfirmedWrittenIdx and LastSegRefIdx to be derived from
stale segment indexes, leading to incorrect last_written_index_term and
pending state after restart.

Fix ra_range:combine/2 to clamp the upper bound to AddRange's end,
since AddRange represents newer data that supersedes the tail of the
existing range. Compute a TruncSegmentRange in ra_log:init/1 that
excludes segment indexes overwritten by the mem table, and use it
when recovering MaxConfirmedWrittenIdx and LastSegRefIdx.
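The clamping rule can be modelled on ranges represented as `{Start, End}` tuples. This is a sketch of the described behaviour, not ra_range's actual implementation:

```erlang
%% Illustrative sketch of the combine/2 clamping fix: AddRange is
%% newer data, so the combined range must never extend past
%% AddRange's end, even when the existing range does.
-module(range_combine).
-export([combine/2]).

combine(undefined, Range) ->
    Range;
combine(Range, undefined) ->
    Range;
combine({AddStart, AddEnd}, {CurStart, _CurEnd}) ->
    %% lower bound: earliest start of either range
    %% upper bound: clamped to the newer range's end
    {min(AddStart, CurStart), AddEnd}.
```

With the pre-fix behaviour (taking `max` of the ends), a mem table covering `{5, 10}` combined with a stale segment range `{1, 20}` would yield `{1, 20}` and resurrect overwritten indexes 11..20; clamping yields `{1, 10}`.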

Harden take_after_overwrite_and_init to assert that the log overview
is consistent across close and re-init.

Collaborator

@the-mikedavis the-mikedavis left a comment

I could reproduce the missing_key error by emulating the new test case ra_log_2_SUITE:writes_not_lost_after_multiple_wal_crashes/1 with a QQ in RabbitMQ.

rabbit_repro:run/0...
-module(rabbit_repro).

-export([run/0]).

run() ->
    WalPid = whereis(ra_log_wal),
    erlang:suspend_process(WalPid),

    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),

    exit(WalPid, kill),
    wait_for_wal(WalPid),

    WalPid2 = whereis(ra_log_wal),
    exit(WalPid2, kill),
    wait_for_wal(WalPid2),

    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),
    ok = ra:pipeline_command({'%2F_qq', node()}, rabbit_fifo:make_garbage_collection()),

    timer:sleep(100),
    rabbit_quorum_queue:wal_force_roll_over(node()),

    ok.

wait_for_wal(OldPid) ->
    ok = ra_lib:retry(fun () ->
                              P = whereis(ra_log_wal),
                              is_pid(P) andalso P =/= OldPid
                      end, 100, 100).

I also edited the local copy of Ra to drop the ra_log:resend_from/2 call here. (Return {State0, []} instead.) To reproduce this we need the server's gap detection to fail to re-send commands. That might happen in the wild if the WAL crashes with the resend message in its mailbox, but the timing is very hard to reproduce.

  1. Start a broker on main with make run-broker
  2. Publish a few messages into a QQ named 'qq' perf-test -x 1 -y 0 --pmessages 5 -u qq -qq
  3. Run rabbit_repro:run(). in the shell
  4. Dump the segment's index (ra_log_segment:dump_index/1 on the segment file path), notice the gap
  5. Reboot or rabbitmqctl stop_app + rabbitmqctl start_app
Index dump...
> ra_log_segment:dump_index("/tmp/rabbitmq-test-instances/rabbit@mango2/mnesia/rabbit@mango2/quorum/rabbit@mango2/2F_QQ8K3N38RGI2N8/0000000000000001.segment").
[{0,0,131080},
 {1,1,131092},
 {2,1,131121},
 {3,1,131655},
 {4,1,131838},
 {5,1,132095},
 {6,1,132352},
 {7,1,132609},
 {8,1,132866},
 {9,1,133123},
 %% 10 is missing here!!
 {11,1,133203},
 {12,1,133261},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,0},
 {0,0,...},
 {0,...},
 {...}|...]
Error logs on reboot...
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>   crasher:
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     initial call: ra_server_proc:init/1
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     pid: <0.1193.0>
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     registered_name: '%2F_qq'
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>     exception exit: {missing_key,10,
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>                                  <<"/tmp/rabbitmq-test-instances/rabbit@mango2/mnesia/rabbit@mango2/quorum/rabbit@mango2/2F_QQ8K3
N38RGI2N8/0000000000000001.segment">>}
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in function  ra_log_segment:fold0/8 (src/ra_log_segment.erl, line 426)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:'-segment_fold/5-fun-1-'/4 (src/ra_log_reader.erl, line 344)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:'-segment_fold/5-lists^foldl/2-1-'/3 (src/ra_log_reader.erl, line 341)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log_reader:segment_fold/5 (src/ra_log_reader.erl, line 341)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_log:fold/5 (src/ra_log.erl, line 462)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_server:recover/1 (src/ra_server.erl, line 456)
2026-02-19 13:01:44.889972-05:00 [error] <0.1193.0>       in call from ra_server_proc:recover/3 (src/ra_server_proc.erl, line 400)

@the-mikedavis
Collaborator

the-mikedavis commented Feb 19, 2026

I tried out this branch and rabbitmq/rabbitmq-server#13885 in the server together and the reproduction no longer triggers the error. We can even add ?assert(false) to the branch that calls resend_from/2 because it isn't hit anymore. (The WAL explicitly re-requests it when the new commands are sent.)

@kjnilsson kjnilsson force-pushed the wal-gap-detection-fixes branch 3 times, most recently from eaad846 to bd15eae on February 23, 2026 15:09
@kjnilsson kjnilsson force-pushed the wal-gap-detection-fixes branch from bd15eae to f68c255 on February 23, 2026 15:10
@kjnilsson kjnilsson marked this pull request as ready for review February 23, 2026 15:10
@kjnilsson kjnilsson merged commit da3f497 into main Feb 23, 2026
6 checks passed
@michaelklishin michaelklishin added this to the 2.17.3 milestone Feb 23, 2026
@lukebakken lukebakken deleted the wal-gap-detection-fixes branch February 26, 2026 17:51
@the-mikedavis the-mikedavis modified the milestones: 2.17.3, 3.0.0 Mar 5, 2026