Skip to content

Incomplete rollback in moveForwardBlock can lose in-flight tx batches #852

@liam

Description

@liam

Summary

Found while auditing for "mutate-then-check" patterns after #851 landed. The rollback path in moveForwardBlock (and the analogous path in reorgBlocks) restores four in-memory fields on error but does not restore the queue (batches drained by processRemainderTransactionsAndDequeue) or external UTXO-store writes. Same bug class as #851 at a larger scale.

Status: based on code review of services/blockassembly/subtreeprocessor/SubtreeProcessor.go against current main. Not yet reproduced by a failing test.

Locations

Two rollback sites with identical shape:

  • SubtreeProcessor.go:704-722 - main event loop, moveForwardBlock chan handler.
  • SubtreeProcessor.go:2624-2649 - reorgBlocks, moveForward-only fast path.

Both snapshot:

originalChainedSubtrees := stp.chainedSubtrees
originalCurrentSubtree := stp.currentSubtree.Load()
originalCurrentTxMap := stp.currentTxMap
currentBlockHeader := stp.currentBlockHeader.Load()

And on error from moveForwardBlock(...):

stp.chainedSubtrees = originalChainedSubtrees
stp.currentSubtree.Store(originalCurrentSubtree)
stp.currentTxMap = originalCurrentTxMap
stp.currentBlockHeader.Store(currentBlockHeader)
stp.setTxCountFromSubtrees()

What moveForwardBlock mutates beyond those four fields

Walking moveForwardBlock (SubtreeProcessor.go:3564-3667):

  1. Read-only setup.
  2. processConflictingTransactions (line 3614) - may modify UTXO store via ProcessConflicting.
  3. resetSubtreeState (line 3623) - modifies the snapshotted fields.
  4. processRemainderTransactionsAndDequeue (line 3628) - drains the queue via dequeueDuringBlockMovement. Drained batches are added to the new subtree state.
  5. processCoinbaseUtxos (line 3644) - writes coinbase UTXOs to the UTXO store.

If steps 2, 4, or 5 error after partial work, the rollback misses those changes.

Failure modes

  • Step 2 fails mid-conflict-marking: UTXO store left with partial Conflicting marks. Probably idempotent so not catastrophic.
  • Step 4 fails mid-drain: queue partially drained. Batches dequeued into the new subtree state are discarded by rollback. Real tx loss.
  • Step 5 fails after coinbase insert: orphan coinbase in UTXO store, plus all of step 4's drained txs lost.

Reproduction sketch

Deterministic via the clock seam (#841) and a utxoStore wrapper:

// 1. Build SubtreeProcessor with small InitialMerkleItemsPerSubtree.
// 2. Pin queue and SubtreeProcessor clocks via fixedClock.
// 3. Enqueue N batches so the queue has work to drain.
// 4. Construct a block whose CoinbaseTx triggers an error inside a test
//    utxoStore wrapper (returns sentinel error on Create).
// 5. queueLenBefore := stp.queue.length()
// 6. err := stp.MoveForwardBlock(block)
// 7. Assert err != nil
// 8. Assert stp.queue.length() < queueLenBefore  (drains happened)
// 9. Assert chainedSubtrees / currentSubtree / currentTxMap match pre-call snapshot
// 10. The drained batches are nowhere to be found.

Suggested remediation options (in increasing scope)

  1. Reorder side effects: defer the drain until after processCoinbaseUtxos. Smallest change but needs review of downstream ordering assumptions.
  2. Queue snapshot/restore primitive: capture head/tail/length before the drain, restore on error. Complicated by concurrent enqueues.
  3. Two-phase dequeue: read batches without removing them, commit on success. Largest restructure.

Option 1 is the most tractable as a single PR.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions