CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id#15595
CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id#15595michaelklishin merged 4 commits intomainfrom
read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id#15595Conversation
read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`
read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`
|
Checking. |
|
Could you provide me with the crash log for this? Both to confirm the fix and to write a regression test case. |
|
@lhoguin I will dig in, they were logs from last Friday. |
|
Below a standard CT The failure can be seen in |
read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id
|
Thank you. I can reproduce. |
a2a3d6e to
e04f6cb
Compare
During recovery when all transient messages are dropped
the code could crash because the #q_tail{} resulting
from reading messages into memory had a start seqid
higher than the end seqid, due to tune_read rounding
up the read range past segment boundaries and the code
not noticing it went above the end seqid.
e04f6cb to
ccb1e48
Compare
|
I have amended and pushed the change. I added a regression test based on the CT logs and tweaked the code, notably removing the check in the second clause because it is already covered by |
(cherry picked from commit f43a079)
(cherry picked from commit 0934fa5)
Note that on v4.2.x, the general problem is still present but under different conditions, namely DeltaSeqId1 =:= DeltaSeqIdEnd. Unlike in `main`, DeltaSeqIdEnd cannot read past DeltaSeqId.
CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id` (backport #15595)
This commit completes the backport of two upstream fixes for a startup hang that caused a 1h46m maintenance window outage on a single-node broker (b-fd0b2082, 2026-03-08). Root cause: `maybe_deltas_to_betas/4` in `rabbit_variable_queue` enters an effectively infinite recursive loop when recovering an empty classic queue after a clean shutdown if `next_seq_id` is large. The loop iterates once per historical seq_id entirely in memory (no disk I/O), producing 100% CPU with near-zero EBS reads. The loop becomes truly infinite once `DeltaSeqId` reaches `DeltaSeqIdEnd` because `d()` accepts `start_seq_id = end_seq_id` with `count = 0`, and `read/3` returns `[]` immediately, causing unconditional recursion. PR rabbitmq#13856 (primary fix, `rabbit_classic_queue_index_v2`): When `Segments = #{}` (no segment files) and `NextSeqIdHint` is available from clean shutdown Terms, `bounds/2` now returns `{NextSeqIdHint, NextSeqIdHint}` instead of `{0, 0}`, producing a blank delta immediately so the loop never starts. PR rabbitmq#15595 (safety net, `rabbit_variable_queue`): Adds a termination guard `0 when DeltaSeqId1 >= DeltaSeqIdEnd` to `maybe_deltas_to_betas` that returns `?BLANK_DELTA` instead of recursing when the end of the range is reached with an empty result. Conflict resolution for PR rabbitmq#13856: The upstream diff hardcodes `rabbit_classic_queue_index_v2:bounds(IndexState, NextSeqIdHint)` because v1 index support was removed in main before this fix was written. In 3.13.7, the call site uses `IndexMod:bounds(IndexState)` to dispatch between `rabbit_queue_index` (v1) and `rabbit_classic_queue_index_v2` (v2). The resolution preserves the `IndexMod` dispatch but routes to `bounds/2` only when `IndexMod =:= rabbit_classic_queue_index_v2`, since `rabbit_queue_index` has no `bounds/2` variant and the bug only affects the v2 index path. PR rabbitmq#15595 was not cherry-picked (it targets main where `maybe_deltas_to_betas` has been renamed to `read_from_q_tail` and `BLANK_DELTA` to `BLANK_Q_TAIL`). The fix was applied manually as a single guard clause addition in 3.13.7 terminology. The `variable_queue_restart_large_seq_id` test is commented out from the `backing_queue_v1` suite. The fix is correct and complete for v2 queues (the only version used in production, enforced by the AWS managed operator policy). The v1 test fails because `rabbit_queue_index:bounds/1` still returns `{0, 0}` for empty queues -- there is no upstream precedent for patching it since v1 was removed from main before PR rabbitmq#13856 was written.
This commit completes the backport of two upstream fixes for a startup hang that caused a 1h46m maintenance window outage on a single-node broker (b-fd0b2082, 2026-03-08). Root cause: `maybe_deltas_to_betas/4` in `rabbit_variable_queue` enters an effectively infinite recursive loop when recovering an empty classic queue after a clean shutdown if `next_seq_id` is large. The loop iterates once per historical seq_id entirely in memory (no disk I/O), producing 100% CPU with near-zero EBS reads. The loop becomes truly infinite once `DeltaSeqId` reaches `DeltaSeqIdEnd` because `d()` accepts `start_seq_id = end_seq_id` with `count = 0`, and `read/3` returns `[]` immediately, causing unconditional recursion. PR rabbitmq#13856 (primary fix, `rabbit_classic_queue_index_v2`): When `Segments = #{}` (no segment files) and `NextSeqIdHint` is available from clean shutdown Terms, `bounds/2` now returns `{NextSeqIdHint, NextSeqIdHint}` instead of `{0, 0}`, producing a blank delta immediately so the loop never starts. PR rabbitmq#15595 (safety net, `rabbit_variable_queue`): Adds a termination guard `0 when DeltaSeqId1 >= DeltaSeqIdEnd` to `maybe_deltas_to_betas` that returns `?BLANK_DELTA` instead of recursing when the end of the range is reached with an empty result. Conflict resolution for PR rabbitmq#13856: The upstream diff hardcodes `rabbit_classic_queue_index_v2:bounds(IndexState, NextSeqIdHint)` because v1 index support was removed in main before this fix was written. In 3.13.7, the call site uses `IndexMod:bounds(IndexState)` to dispatch between `rabbit_queue_index` (v1) and `rabbit_classic_queue_index_v2` (v2). The resolution preserves the `IndexMod` dispatch but routes to `bounds/2` only when `IndexMod =:= rabbit_classic_queue_index_v2`, since `rabbit_queue_index` has no `bounds/2` variant and the bug only affects the v2 index path. PR rabbitmq#15595 was not cherry-picked (it targets main where `maybe_deltas_to_betas` has been renamed to `read_from_q_tail` and `BLANK_DELTA` to `BLANK_Q_TAIL`). The fix was applied manually as a single guard clause addition in 3.13.7 terminology. The `variable_queue_restart_large_seq_id` test is commented out from the `backing_queue_v1` suite. The fix is correct and complete for v2 queues (the only version used in production, enforced by the AWS managed operator policy). The v1 test fails because `rabbit_queue_index:bounds/1` still returns `{0, 0}` for empty queues -- there is no upstream precedent for patching it since v1 was removed from main before PR rabbitmq#13856 was written.
tune_read/2may round up the read range to a segment boundary, advancingstart_seq_idpastend_seq_id. Guard against this in both branches ofread_from_q_tail;4.This scenario was detected by a sporadic failure in
classic_queue_prop_SUITEwhich now passes 200 times in a row.