Lock-free parallel execution: derive per-block changesets post-hoc, remove changeset accumulator from the exec path #21106

## Background

PR #21088 added a `changesetMu` mutex on `SharedDomains` (plus `FlushPendingUpdatesLocked` / `ComputeCommitmentLocked` variants) as a band-aid that serializes the parallel commitment calculator's swap-and-record window against the apply goroutine's `DomainPut`/`DomainDel`. That fixed the off-by-one wrong-trie-root cluster, `TestRecreateAndRewind`, and all ~227 race-detector hits across the `EXEC3_PARALLEL=true` race-test matrix groups, but at the cost of serializing apply-side writes while commitment is being computed.

## The problem

The "current changeset accumulator" is unwind-side machinery: a sidecar that records per-block prev-value diffs so that a later unwind can reconstruct the pre-block state. Execution should be forward-only and should not need to know about it. Today the parallel calculator swaps a *global* accumulator pointer to route block N's branch writes into block N's saved changeset (CS), while the apply loop writes through that same pointer; hence the need for the band-aid lock.

## Proposed direction

Derive per-block changesets **post-hoc from sd entries** (now tx-granular) at `sd.Flush` time, instead of maintaining the accumulator during execution. Then:

- delete `changesetMu` and the `Lock/UnlockChangesetAccumulator` + `*Locked` API surface
- delete the swap dance in `committer.go computeWithBlockAccumulator`
- delete the `SetChangesetAccumulator` / `GetChangesetAccumulator` / `SavePastChangesetAccumulator` API
- delete the `domain == kv.CommitmentDomain` exemptions in `SharedDomains.DomainPut` / `DomainDel`
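A minimal Go sketch of the post-hoc derivation, under stated assumptions: it assumes each tx-granular entry can be viewed as `(key, txNum, value)`, that the flush knows each block's `[FirstTx, LastTx]` range, and that a `base` lookup returns a key's value from before the in-memory batch (standing in for a DB read). The names `entry`, `blockRange`, `deriveChangesets`, and `base` are hypothetical, and `""` is used here to mean "key absent". Entries are stable-sorted by `(key, txNum)` so that multiple writes within one tx keep their write order; walking each key's history, the first write that falls into a block records the value that was current just before it.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical tx-granular write record: Key was set to Value at TxNum.
type entry struct {
	Key   string
	TxNum uint64
	Value string
}

// blockRange is a hypothetical [FirstTx, LastTx] span for one block. The slice
// passed to deriveChangesets is assumed sorted by TxNum and non-overlapping.
type blockRange struct {
	Block           uint64
	FirstTx, LastTx uint64
}

// changeset maps key -> value before the block's first write ("" = absent).
type changeset map[string]string

// deriveChangesets rebuilds per-block prev-value changesets after execution.
// entries are assumed to be in write order; base returns the pre-batch value.
func deriveChangesets(entries []entry, blocks []blockRange, base func(key string) string) map[uint64]changeset {
	sorted := append([]entry(nil), entries...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Key != sorted[j].Key {
			return sorted[i].Key < sorted[j].Key
		}
		return sorted[i].TxNum < sorted[j].TxNum
	})

	out := make(map[uint64]changeset, len(blocks))
	var prev string
	for i, e := range sorted {
		if i == 0 || sorted[i-1].Key != e.Key {
			prev = base(e.Key) // first write of this key in the batch
		}
		bi := sort.Search(len(blocks), func(j int) bool { return blocks[j].LastTx >= e.TxNum })
		if bi == len(blocks) || e.TxNum < blocks[bi].FirstTx {
			panic(fmt.Sprintf("txNum %d is outside the known blocks", e.TxNum))
		}
		b := blocks[bi].Block
		cs, ok := out[b]
		if !ok {
			cs = changeset{}
			out[b] = cs
		}
		if _, seen := cs[e.Key]; !seen {
			cs[e.Key] = prev // only the block's first write records the prev value
		}
		prev = e.Value
	}
	return out
}

func main() {
	blocks := []blockRange{{Block: 10, FirstTx: 100, LastTx: 104}, {Block: 11, FirstTx: 105, LastTx: 109}}
	entries := []entry{{"a", 101, "1"}, {"a", 103, "2"}, {"a", 106, "3"}, {"b", 107, "x"}}
	base := func(k string) string {
		if k == "a" {
			return "0"
		}
		return ""
	}
	cs := deriveChangesets(entries, blocks, base)
	fmt.Println(cs[10], cs[11]) // map[a:0] map[a:2 b:]
}
```

Because the changesets are computed from data the execution already keeps, nothing on the exec path needs a shared accumulator pointer, and therefore nothing needs `changesetMu`.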

### Smallest first step (Option A0)

`SharedDomainsCommitmentContext.deferCommitmentUpdates` already exists and is enabled when blocks are applied in parallel (`exec3.go:217`): branches accumulate in `pendingUpdate`, and `FlushPendingUpdates` replays them. The remaining inline-write paths to chase are the `[state]` marker write in `encodeAndStoreCommitmentState` and the `concurrentTrieContextFactory` ETL drain. Once those are also routed through the deferred mechanism, the lock window collapses to one single-threaded flush.
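The defer-and-replay idea can be sketched in Go as follows. This is a simplified illustration, not the real `deferCommitmentUpdates` / `FlushPendingUpdates` code: `deferredCommitment`, `putBranch`, `store`, and `flushPendingUpdates` are hypothetical names. During compute the calculator only appends to a private buffer (branch writes and the `[state]` marker alike), so it never touches shared state; the one flush then replays the buffer in order under a single short critical section and captures the prev values it overwrote.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingWrite is a hypothetical buffered commitment write.
type pendingWrite struct{ key, val string }

// deferredCommitment is a hypothetical stand-in for the deferred-updates mode:
// branch and "[state]" writes are appended to a private buffer during compute.
type deferredCommitment struct {
	pending []pendingWrite
}

func (c *deferredCommitment) putBranch(k, v string) {
	c.pending = append(c.pending, pendingWrite{k, v})
}

// store is a hypothetical shared domain store.
type store struct {
	mu     sync.Mutex
	latest map[string]string
}

// put is the apply-side write path; it never waits on compute.
func (s *store) put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.latest[k] = v
}

// flushPendingUpdates replays buffered writes in order while holding the lock
// once, and returns the prev value of each key (first write per key wins).
func (s *store) flushPendingUpdates(c *deferredCommitment) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	prev := make(map[string]string, len(c.pending))
	for _, w := range c.pending {
		if _, seen := prev[w.key]; !seen {
			prev[w.key] = s.latest[w.key]
		}
		s.latest[w.key] = w.val
	}
	c.pending = c.pending[:0]
	return prev
}

func main() {
	s := &store{latest: map[string]string{"branch/a": "old"}}
	c := &deferredCommitment{}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // apply side keeps writing while compute runs
		defer wg.Done()
		for i := 0; i < 100; i++ {
			s.put(fmt.Sprint("acct/", i), "v")
		}
	}()
	// compute side: buffered only, no shared state touched
	c.putBranch("branch/a", "x")
	c.putBranch("branch/a", "y")
	c.putBranch("[state]", "root@N")
	wg.Wait()

	prev := s.flushPendingUpdates(c)
	fmt.Println(prev["branch/a"], s.latest["branch/a"], s.latest["[state]"]) // old y root@N
}
```

Any write that bypasses the buffer (as the `[state]` marker and the ETL drain do today) reintroduces shared mutation during compute, which is why those two paths must be routed through the deferred mechanism first.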

### Review note

@AskAlexSharov suggested folding `changesetMu` into the existing `latestStateLock` during the #21088 review. They were kept separate to avoid widening the high-traffic `latestStateLock` (held on every `Put`/`Del`/`GetLatest`/`GetAsOf`) to cover the calculator's entire `ComputeCommitment` window. The lock-free refactor proposed here makes that question moot.

## Acceptance

- `changesetMu` and the associated swap API removed
- all four parallel race-test matrix groups still report 0 race-detector hits
- all Bucket C tests (`TestBlockchainHeaderchainReorgConsistency`, `TestLongerForkHeaders/Blocks`, `TestCallTraceUnwind`, `TestTxLookupUnwind`, `TestLowDiffLongChain`, `TestRecreateAndRewind`) still pass under `EXEC3_PARALLEL=true`
- benchmark: apply-side throughput during compute no longer serialized

Related: #21088, #21017
