Skip to content

CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id#15595

Merged
michaelklishin merged 4 commits intomainfrom
mk-cq-qt-1-crash
Mar 3, 2026
Merged

CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id#15595
michaelklishin merged 4 commits intomainfrom
mk-cq-qt-1-crash

Conversation

@michaelklishin
Copy link
Copy Markdown
Collaborator

tune_read/2 may round up the read range to a segment boundary, advancing start_seq_id past end_seq_id. Guard against this in both branches of read_from_q_tail;4.

This scenario was detected by a sporadic failure in classic_queue_prop_SUITE which now passes 200 times in a row.

@michaelklishin michaelklishin added this to the 4.3.0 milestone Feb 28, 2026
@michaelklishin michaelklishin changed the title Fix q_tail crash when tune_read rounds past end_seq_id Fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` Feb 28, 2026
@michaelklishin michaelklishin changed the title Fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` Feb 28, 2026
@lhoguin
Copy link
Copy Markdown
Contributor

lhoguin commented Mar 2, 2026

Checking.

@lhoguin
Copy link
Copy Markdown
Contributor

lhoguin commented Mar 2, 2026

Could you provide me with the crash log for this? Both to confirm the fix and to write a regression test case.

@michaelklishin
Copy link
Copy Markdown
Collaborator Author

@lhoguin I will dig in, they were logs from last Friday.

@michaelklishin
Copy link
Copy Markdown
Collaborator Author

Below a standard CT .zip file straight from Actions. It should include node data directories.

The failure can be seen in ct_run.rabbit_shard4@localhost.2026-02-28_08.34.49/deps.rabbit.classic_queue_prop_SUITE.logs/run.2026-02-28_08.36.08/classic_queue_prop_suite.classic_queue_v2.html.

CT logs (rabbit parallel-ct-set-1 OTP-28 ).zip

@michaelklishin michaelklishin changed the title CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id Mar 2, 2026
@lhoguin
Copy link
Copy Markdown
Contributor

lhoguin commented Mar 3, 2026

Thank you. I can reproduce.

@lhoguin lhoguin force-pushed the mk-cq-qt-1-crash branch from a2a3d6e to e04f6cb Compare March 3, 2026 11:43
During recovery when all transient messages are dropped
the code could crash because the #q_tail{} resulting
from reading messages into memory had a start seqid
higher than the end seqid, due to tune_read rounding
up the read range past segment boundaries and the code
not noticing it went above the end seqid.
@lhoguin lhoguin force-pushed the mk-cq-qt-1-crash branch from e04f6cb to ccb1e48 Compare March 3, 2026 11:44
@lhoguin
Copy link
Copy Markdown
Contributor

lhoguin commented Mar 3, 2026

I have amended and pushed the change.

I added a regression test based on the CT logs and tweaked the code, notably removing the check in the second clause because it is already covered by case QTailCount - QHead1Len where we know nothing remains in q_tail and we don't use QTailSeqId1 in that case.

@michaelklishin michaelklishin merged commit 9e2e12c into main Mar 3, 2026
184 checks passed
@michaelklishin michaelklishin deleted the mk-cq-qt-1-crash branch March 3, 2026 16:37
mergify bot pushed a commit that referenced this pull request Mar 3, 2026
(cherry picked from commit f43a079)
mergify bot pushed a commit that referenced this pull request Mar 3, 2026
michaelklishin added a commit that referenced this pull request Mar 3, 2026
Note that on v4.2.x, the general problem is still
present but under different conditions, namely
DeltaSeqId1 =:= DeltaSeqIdEnd.

Unlike in `main`,
DeltaSeqIdEnd cannot read past DeltaSeqId.
michaelklishin added a commit that referenced this pull request Mar 3, 2026
CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id` (backport #15595)
lukebakken added a commit to lukebakken/rmq-rabbitmq-server that referenced this pull request Mar 17, 2026
This commit completes the backport of two upstream fixes for a startup
hang that caused a 1h46m maintenance window outage on a single-node
broker (b-fd0b2082, 2026-03-08).

Root cause: `maybe_deltas_to_betas/4` in `rabbit_variable_queue` enters
an effectively infinite recursive loop when recovering an empty classic
queue after a clean shutdown if `next_seq_id` is large. The loop
iterates once per historical seq_id entirely in memory (no disk I/O),
producing 100% CPU with near-zero EBS reads. The loop becomes truly
infinite once `DeltaSeqId` reaches `DeltaSeqIdEnd` because `d()` accepts
`start_seq_id = end_seq_id` with `count = 0`, and `read/3` returns `[]`
immediately, causing unconditional recursion.

PR rabbitmq#13856 (primary fix, `rabbit_classic_queue_index_v2`): When
`Segments = #{}` (no segment files) and `NextSeqIdHint` is available
from clean shutdown Terms, `bounds/2` now returns
`{NextSeqIdHint, NextSeqIdHint}` instead of `{0, 0}`, producing a blank
delta immediately so the loop never starts.

PR rabbitmq#15595 (safety net, `rabbit_variable_queue`): Adds a termination
guard `0 when DeltaSeqId1 >= DeltaSeqIdEnd` to `maybe_deltas_to_betas`
that returns `?BLANK_DELTA` instead of recursing when the end of the
range is reached with an empty result.

Conflict resolution for PR rabbitmq#13856: The upstream diff hardcodes
`rabbit_classic_queue_index_v2:bounds(IndexState, NextSeqIdHint)` because
v1 index support was removed in main before this fix was written. In
3.13.7, the call site uses `IndexMod:bounds(IndexState)` to dispatch
between `rabbit_queue_index` (v1) and `rabbit_classic_queue_index_v2`
(v2). The resolution preserves the `IndexMod` dispatch but routes to
`bounds/2` only when `IndexMod =:= rabbit_classic_queue_index_v2`, since
`rabbit_queue_index` has no `bounds/2` variant and the bug only affects
the v2 index path.

PR rabbitmq#15595 was not cherry-picked (it targets main where
`maybe_deltas_to_betas` has been renamed to `read_from_q_tail` and
`BLANK_DELTA` to `BLANK_Q_TAIL`). The fix was applied manually as a
single guard clause addition in 3.13.7 terminology.

The `variable_queue_restart_large_seq_id` test is commented out from the
`backing_queue_v1` suite. The fix is correct and complete for v2 queues
(the only version used in production, enforced by the AWS managed
operator policy). The v1 test fails because `rabbit_queue_index:bounds/1`
still returns `{0, 0}` for empty queues -- there is no upstream precedent
for patching it since v1 was removed from main before PR rabbitmq#13856 was
written.
lukebakken added a commit to lukebakken/rmq-rabbitmq-server that referenced this pull request Mar 17, 2026
This commit completes the backport of two upstream fixes for a startup
hang that caused a 1h46m maintenance window outage on a single-node
broker (b-fd0b2082, 2026-03-08).

Root cause: `maybe_deltas_to_betas/4` in `rabbit_variable_queue` enters
an effectively infinite recursive loop when recovering an empty classic
queue after a clean shutdown if `next_seq_id` is large. The loop
iterates once per historical seq_id entirely in memory (no disk I/O),
producing 100% CPU with near-zero EBS reads. The loop becomes truly
infinite once `DeltaSeqId` reaches `DeltaSeqIdEnd` because `d()` accepts
`start_seq_id = end_seq_id` with `count = 0`, and `read/3` returns `[]`
immediately, causing unconditional recursion.

PR rabbitmq#13856 (primary fix, `rabbit_classic_queue_index_v2`): When
`Segments = #{}` (no segment files) and `NextSeqIdHint` is available
from clean shutdown Terms, `bounds/2` now returns
`{NextSeqIdHint, NextSeqIdHint}` instead of `{0, 0}`, producing a blank
delta immediately so the loop never starts.

PR rabbitmq#15595 (safety net, `rabbit_variable_queue`): Adds a termination
guard `0 when DeltaSeqId1 >= DeltaSeqIdEnd` to `maybe_deltas_to_betas`
that returns `?BLANK_DELTA` instead of recursing when the end of the
range is reached with an empty result.

Conflict resolution for PR rabbitmq#13856: The upstream diff hardcodes
`rabbit_classic_queue_index_v2:bounds(IndexState, NextSeqIdHint)` because
v1 index support was removed in main before this fix was written. In
3.13.7, the call site uses `IndexMod:bounds(IndexState)` to dispatch
between `rabbit_queue_index` (v1) and `rabbit_classic_queue_index_v2`
(v2). The resolution preserves the `IndexMod` dispatch but routes to
`bounds/2` only when `IndexMod =:= rabbit_classic_queue_index_v2`, since
`rabbit_queue_index` has no `bounds/2` variant and the bug only affects
the v2 index path.

PR rabbitmq#15595 was not cherry-picked (it targets main where
`maybe_deltas_to_betas` has been renamed to `read_from_q_tail` and
`BLANK_DELTA` to `BLANK_Q_TAIL`). The fix was applied manually as a
single guard clause addition in 3.13.7 terminology.

The `variable_queue_restart_large_seq_id` test is commented out from the
`backing_queue_v1` suite. The fix is correct and complete for v2 queues
(the only version used in production, enforced by the AWS managed
operator policy). The v1 test fails because `rabbit_queue_index:bounds/1`
still returns `{0, 0}` for empty queues -- there is no upstream precedent
for patching it since v1 was removed from main before PR rabbitmq#13856 was
written.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants