CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id` by michaelklishin · Pull Request #15595 · rabbitmq/rabbitmq-server

michaelklishin · 2026-02-28T16:39:34Z

tune_read/2 may round up the read range to a segment boundary, advancing start_seq_id past end_seq_id. Guard against this in both branches of read_from_q_tail;4.

This scenario was detected by a sporadic failure in classic_queue_prop_SUITE which now passes 200 times in a row.

lhoguin · 2026-03-02T08:32:51Z

Checking.

lhoguin · 2026-03-02T09:35:48Z

Could you provide me with the crash log for this? Both to confirm the fix and to write a regression test case.

michaelklishin · 2026-03-02T15:03:19Z

@lhoguin I will dig in, they were logs from last Friday.

michaelklishin · 2026-03-02T16:07:02Z

Below a standard CT .zip file straight from Actions. It should include node data directories.

The failure can be seen in ct_run.rabbit_shard4@localhost.2026-02-28_08.34.49/deps.rabbit.classic_queue_prop_SUITE.logs/run.2026-02-28_08.36.08/classic_queue_prop_suite.classic_queue_v2.html.

CT logs (rabbit parallel-ct-set-1 OTP-28 ).zip

lhoguin · 2026-03-03T10:05:13Z

Thank you. I can reproduce.

During recovery when all transient messages are dropped the code could crash because the #q_tail{} resulting from reading messages into memory had a start seqid higher than the end seqid, due to tune_read rounding up the read range past segment boundaries and the code not noticing it went above the end seqid.

lhoguin · 2026-03-03T11:47:03Z

I have amended and pushed the change.

I added a regression test based on the CT logs and tweaked the code, notably removing the check in the second clause because it is already covered by case QTailCount - QHead1Len where we know nothing remains in q_tail and we don't use QTailSeqId1 in that case.

(cherry picked from commit f43a079)

(cherry picked from commit 0934fa5)

Note that on v4.2.x, the general problem is still present but under different conditions, namely DeltaSeqId1 =:= DeltaSeqIdEnd. Unlike in `main`, DeltaSeqIdEnd cannot read past DeltaSeqId.

CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id` (backport #15595)

This commit completes the backport of two upstream fixes for a startup hang that caused a 1h46m maintenance window outage on a single-node broker (b-fd0b2082, 2026-03-08). Root cause: `maybe_deltas_to_betas/4` in `rabbit_variable_queue` enters an effectively infinite recursive loop when recovering an empty classic queue after a clean shutdown if `next_seq_id` is large. The loop iterates once per historical seq_id entirely in memory (no disk I/O), producing 100% CPU with near-zero EBS reads. The loop becomes truly infinite once `DeltaSeqId` reaches `DeltaSeqIdEnd` because `d()` accepts `start_seq_id = end_seq_id` with `count = 0`, and `read/3` returns `[]` immediately, causing unconditional recursion. PR rabbitmq#13856 (primary fix, `rabbit_classic_queue_index_v2`): When `Segments = #{}` (no segment files) and `NextSeqIdHint` is available from clean shutdown Terms, `bounds/2` now returns `{NextSeqIdHint, NextSeqIdHint}` instead of `{0, 0}`, producing a blank delta immediately so the loop never starts. PR rabbitmq#15595 (safety net, `rabbit_variable_queue`): Adds a termination guard `0 when DeltaSeqId1 >= DeltaSeqIdEnd` to `maybe_deltas_to_betas` that returns `?BLANK_DELTA` instead of recursing when the end of the range is reached with an empty result. Conflict resolution for PR rabbitmq#13856: The upstream diff hardcodes `rabbit_classic_queue_index_v2:bounds(IndexState, NextSeqIdHint)` because v1 index support was removed in main before this fix was written. In 3.13.7, the call site uses `IndexMod:bounds(IndexState)` to dispatch between `rabbit_queue_index` (v1) and `rabbit_classic_queue_index_v2` (v2). The resolution preserves the `IndexMod` dispatch but routes to `bounds/2` only when `IndexMod =:= rabbit_classic_queue_index_v2`, since `rabbit_queue_index` has no `bounds/2` variant and the bug only affects the v2 index path. PR rabbitmq#15595 was not cherry-picked (it targets main where `maybe_deltas_to_betas` has been renamed to `read_from_q_tail` and `BLANK_DELTA` to `BLANK_Q_TAIL`). The fix was applied manually as a single guard clause addition in 3.13.7 terminology. The `variable_queue_restart_large_seq_id` test is commented out from the `backing_queue_v1` suite. The fix is correct and complete for v2 queues (the only version used in production, enforced by the AWS managed operator policy). The v1 test fails because `rabbit_queue_index:bounds/1` still returns `{0, 0}` for empty queues -- there is no upstream precedent for patching it since v1 was removed from main before PR rabbitmq#13856 was written.

michaelklishin added this to the 4.3.0 milestone Feb 28, 2026

michaelklishin requested a review from lhoguin February 28, 2026 16:39

michaelklishin added the backport-v4.2.x label Feb 28, 2026

michaelklishin changed the title ~~Fix q_tail crash when tune_read rounds past end_seq_id~~ Fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` Feb 28, 2026

michaelklishin changed the title ~~Fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`~~ CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id` Feb 28, 2026

michaelklishin changed the title ~~CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id`~~ CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id Mar 2, 2026

lhoguin force-pushed the mk-cq-qt-1-crash branch from a2a3d6e to e04f6cb Compare March 3, 2026 11:43

lhoguin force-pushed the mk-cq-qt-1-crash branch from e04f6cb to ccb1e48 Compare March 3, 2026 11:44

lhoguin approved these changes Mar 3, 2026

View reviewed changes

michaelklishin added 3 commits March 3, 2026 07:43

Comment wording #15595 [ci skip]

f43a079

Typos [ci skip]

63c965e

Pass the same channel the call op happens on #15595

0934fa5

michaelklishin merged commit 9e2e12c into main Mar 3, 2026
184 checks passed

michaelklishin deleted the mk-cq-qt-1-crash branch March 3, 2026 16:37

mergify bot pushed a commit that referenced this pull request Mar 3, 2026

Comment wording #15595 [ci skip]

5917b4c

(cherry picked from commit f43a079)

mergify bot pushed a commit that referenced this pull request Mar 3, 2026

Pass the same channel the call op happens on #15595

567be54

(cherry picked from commit 0934fa5)

mergify bot mentioned this pull request Mar 3, 2026

CQs: fix a read_rom_q_tail/4 crash when tune_read/2 rounds past end_seq_id (backport #15595) #15620

Merged

michaelklishin added a commit that referenced this pull request Mar 3, 2026

Merge pull request #15620 from rabbitmq/mergify/bp/v4.2.x/pr-15595

1194f75

CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id` (backport #15595)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id`#15595

CQs: fix a `read_rom_q_tail/4` crash when `tune_read/2` rounds past `end_seq_id`#15595
michaelklishin merged 4 commits intomainfrom
mk-cq-qt-1-crash

michaelklishin commented Feb 28, 2026

Uh oh!

lhoguin commented Mar 2, 2026

Uh oh!

lhoguin commented Mar 2, 2026

Uh oh!

michaelklishin commented Mar 2, 2026

Uh oh!

michaelklishin commented Mar 2, 2026

Uh oh!

lhoguin commented Mar 3, 2026 •

edited

Loading

Uh oh!

lhoguin commented Mar 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

michaelklishin commented Feb 28, 2026

Uh oh!

lhoguin commented Mar 2, 2026

Uh oh!

lhoguin commented Mar 2, 2026

Uh oh!

michaelklishin commented Mar 2, 2026

Uh oh!

michaelklishin commented Mar 2, 2026

Uh oh!

lhoguin commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lhoguin commented Mar 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lhoguin commented Mar 3, 2026 •

edited

Loading