fix(blockassembly/subtreeprocessor): stop losing the boundary batch in Reset drain#851
Conversation
…n Reset drain
The Reset drain loop at SubtreeProcessor.go:1355-1361 used a
dequeue-then-time-check pattern that silently drops one boundary
batch per Reset call:
validUntilMillis := time.Now().UnixMilli()
for {
batch, found := stp.queue.dequeueBatch(0)
if !found || batch.time > validUntilMillis {
// we are done
break
}
}
dequeueBatch(0) bypasses the queue filter and removes the head batch
unconditionally. Only THEN does the loop check batch.time. If the
batch is too new, the loop breaks - but the batch has already been
removed from the queue and nothing puts it back, so it is dropped on
the floor.
Under heavy concurrent enqueue (Reset happens during reorgs / FSM
transitions / recovery), this drops one batch worth of in-flight txs
each call. The lost txs never get mined.
Fix introduces dequeueBatchUntil on LockFreeQueue with inclusive-until
semantics (admit iff batch.time <= maxTimeMillis). Unlike dequeueBatch
it peeks at batch.time BEFORE removing the batch, so the boundary
batch is never lost. The Reset drain loop becomes:
validUntilMillis := stp.clock.Now().UnixMilli()
for {
if _, found := stp.queue.dequeueBatchUntil(validUntilMillis); !found {
break
}
}
Also switches the time anchor from time.Now() to stp.clock.Now() so
the Reset path runs through the existing clock seam, consistent with
the other validFromMillis call sites.
Test changes:
* New Test_dequeueBatchUntilPreservesPostBoundaryBatch: regression
guard. Enqueues a pre-snapshot and a post-snapshot batch via the
clock seam, drains up to the snapshot, asserts the post-snapshot
batch survives. Also covers the inclusive boundary
(batch.time == maxTimeMillis admits) and empty-queue case.
|
🤖 Claude Code Review Status: Complete Summary: No issues found. The PR correctly fixes a boundary-batch loss bug in the Reset drain loop through a well-designed peek-before-dequeue method. Analysis:
The change is minimal (31 lines added across queue.go and tests), well-scoped to the bug, and builds correctly on the existing clock abstraction from #841. |
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-12 14:23 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve. Bug is real — verified that dequeueBatch(0) advances q.head before the time check, so the boundary batch is gone by the time the caller inspects batch.time. The peek-before-dequeue fix is correct, the memory ordering is sound (the new method reads next.time after the same atomic load the existing dequeue relies on), and the new test pins the inclusive-until contract.
A few minor suggestions, none blocking:
-
Add a Reset-level regression test. The new test guards
dequeueBatchUntil's contract at the queue level, but doesn't exercise the actualreset()drain loop. A test that installs afixedClockonstp/stp.queue, enqueues pre- and post-snapshot batches, drives the Reset path, and asserts the post-snapshot batch survives instp.queuewould guard against someone "simplifying" the drain back todequeueBatch(0). The queue-level guard is the more fundamental contract, so this is belt-and-braces. -
Millisecond-boundary precision in the PR description. With inclusive-until semantics, a batch enqueued in the same millisecond as the snapshot (but after
validUntilMillisis captured) will still be drained, sincebatch.time == validUntilMillisadmits. So "batches arriving concurrently after this anchor are intentionally left in the queue" is accurate at the next millisecond, not at the snapshot millisecond itself. 1ms window vs. the original full-drain window — orders of magnitude smaller, not worth changing the code, but worth being precise. -
API asymmetry around
0.dequeueBatch(0)means "no filter, drain all";dequeueBatchUntil(0)means "reject everything withbatch.time > 0" (i.e. drain nothing in practice). The docs handle it, but a brief cross-reference between the two methods would help the next caller. -
Verification. Recommend running
golangci-lint runandstaticcheck ./...onservices/blockassembly/subtreeprocessorbefore merge per AGENTS.md — not called out in the test plan.



Summary
The Reset drain loop at
SubtreeProcessor.go:1355-1361silently drops one boundary batch perResetcall. Fix: introduceLockFreeQueue.dequeueBatchUntil(peek-before-dequeue) and rewrite the drain loop to use it.The bug
dequeueBatch(0)bypasses the queue filter and removes the head batch unconditionally. Thenbatch.timeis checked. If the batch is too new, the loop breaks - but the batch is already gone from the queue and nothing puts it back. Net effect: one batch worth of fresh txs lost perResetcall.Resetruns during reorgs, FSM transitions, and recovery operations. Under heavy concurrent enqueue (the documented use case for this codebase), eachResetstatistically catches a batch in the window betweentime.Now()snapshot and the drain reaching the boundary. Those txs never get mined.Fix
New
dequeueBatchUntilonLockFreeQueuewith inclusive-until semantics (admit iff batch.time <= maxTimeMillis). Peeks atbatch.timebefore removing the batch, so the boundary batch is never lost.The Reset drain becomes:
Also switches the time anchor from
time.Now()tostp.clock.Now()so the Reset path runs through the existing clock seam - consistent with the other validFromMillis call sites and necessary for deterministic future testing of Reset under reorg pressure.Test changes
Test_dequeueBatchUntilPreservesPostBoundaryBatch- regression guard. Enqueues a pre-snapshot and a post-snapshot batch via the clock seam, drains up to the snapshot, asserts the post-snapshot batch survives in the queue. Also covers the inclusive boundary (batch.time == maxTimeMillisadmits) and the empty-queue case.Test plan
go vet ./services/blockassembly/subtreeprocessor/- cleango test -race -count=1 ./services/blockassembly/subtreeprocessor/- pass (155s).TestResetMarksAssemblyTxsAsNotOnLongestChainBeforeClearingstill passes - Reset behavior is preserved for the documented invariants.go test -race -count=1 ./services/blockassembly/- pass (101s).Series context
Fourth in a series from the same investigation:
clockinterface enabling deterministic time-dependent tests.validFromMillisindequeueDuringBlockMovement.Independent of #848 - depends only on #841's clock seam (already on main).