execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit by mh0lt · Pull Request #21039 · erigontech/erigon

mh0lt · 2026-05-07T12:58:07Z

Summary

Drops the fallback clause in applyLoopMissingBlocks that turned every legitimate partial-batch exit into a spurious InvalidBlock error.

When the parallel exec loop hits its size-limit mid-fixture, it returns nil with reachedMaxBlock=false. The previous completeness check flagged maxBlockNum as "missing" whenever !reachedMaxBlock — but the size-limit path is normal: the apply loop returns ErrLoopExhausted, the stage loop resumes the next batch from lastBlockResult+1, and each block still executes exactly once. No re-execution.

Update — additional fix in commit `310a094514` (`execution/stagedsync: honor blockResult.Exhausted in exec loop`)

Same class of partial-batch orchestration bug, on the exec-loop side. executeBlocks marks the final dispatched block with Exhausted when the per-cycle block limit fires, then exits its goroutine — without closing pe.execRequests, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for execRequests / rws.ResultCh / ctx.Done — none of which ever fire. The apply loop never got the channel-close signal it needs to return ErrLoopExhausted, so the original (apply-loop-side) fix in this PR couldn't even run.

Reproduced as a chiado parallel-exec silent stall at block 150662 with EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000 from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying ErrWrongTrieRoot now surfaces cleanly through the apply loop and the original-issue symptom appears as expected.

The new branch sits between the maxBlockNum-reached check and the StopAfterBlock debug exit, matching the precedence of the existing partial-batch exits.

Tests: TestExecLoopHonorsBlockResultExhausted runs two models of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously. TestExecLoopExhaustedOnlySetOnFinalBlock locks in the dispatcher convention so future executeBlocks changes can't silently drop already-queued work by setting Exhausted mid-stream.

Why

Reproduces deterministically as the parallel-mode BenchmarkFeeHistory failure on PR #21017 (the CI matrix that runs every test under both serial and parallel exec modes):

apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200]

200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms:

Batch 1: blocks 1..114, size-limit fires, ErrLoopExhausted
Batch 2: blocks 115..200, reachedMaxBlock=true, clean nil

Sd.mem is not contaminated by in-flight future-block writes — txResultBlocks at exit was exactly [1..114]. Block 200 was flagged solely from the fallback clause.

Shutdown sequence verified

Tracing the deferred-close in execLoop: commitResults closes first, calculator drains and closes rootResults, apply loop sees both closes, drains, runs the completeness check, returns ErrLoopExhausted. executorCancel then cancels execLoopCtx, executeBlocks and workers exit, pe.wait joins. Three batches in BenchmarkFeeHistory run end-to-end with no goroutine leaks.

Tests

TestApplyLoopMissingBlocks — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case
TestApplyLoopPartialBatchReturnsErrLoopExhausted — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios
TestApplyLoopChannelCloseOrder — pins the load-bearing commitResults-before-applyResults close order
TestExecLoopHonorsBlockResultExhausted — orchestration test for the exec-loop side of the partial-batch exit (added in commit 310a094514)
TestExecLoopExhaustedOnlySetOnFinalBlock — pins the executeBlocks dispatcher convention (added in commit 310a094514)

Test plan

BenchmarkFeeHistory/full/blocks=200 passes with ERIGON_EXEC3_PARALLEL=true
make lint clean
go test -race ./execution/stagedsync/... for new tests passes
Existing TestApplyLoopRootResultsCloseDoesNotRace, TestApplyLoopDoesNotHangAfterRootResultsClose, TestExecLoopExitCheck* still pass
Chiado EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive from block 0 reaches the underlying ErrWrongTrieRoot at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires)

CI fix linkage

Precursor for #21017 (CI matrix exec_mode={serial,parallel}). Without this, bench / benchmarks (parallel) and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to #21032 (incarnation/SD differentiator fix).

…oop exit When the parallel exec loop hits its batch size limit mid-fixture, it returns nil to the apply loop with reachedMaxBlock=false. The apply loop's completeness check then ran applyLoopMissingBlocks with a fallback clause that flagged maxBlockNum as "missing" whenever !reachedMaxBlock — turning every legitimate partial-batch exit into a spurious InvalidBlock error. The size-limit path is normal: it returns ErrLoopExhausted and the stage loop resumes the next batch from lastBlockResult+1. Each block still executes exactly once. The fallback clause was unnecessary and incorrect. Reproduces deterministically as the github.com/erigontech/erigon/rpc/gasprice BenchmarkFeeHistory parallel-mode failure: 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114; first batch covers blocks 1..114, second batch covers 115..200, both clean. Drops the fallback clause, simplifies applyLoopMissingBlocks to a single map-comparison, and tightens the message string. Adds: * TestApplyLoopMissingBlocks "partial batch — size-limit hit" case * TestApplyLoopPartialBatchReturnsErrLoopExhausted exercising the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) * TestApplyLoopChannelCloseOrder pinning the documented commitResults-then-applyResults shutdown order Precursor for #21017 (CI matrix exec_mode={serial,parallel}); without this, every workflow that runs a parallel-exec test against a chain that exceeds the batch budget fails on the first such size-limit boundary.

yperbasis

CI is red

Same class of partial-batch orchestration bug as the apply-loop fix in this PR. executeBlocks marks the final dispatched block with Exhausted when the per-cycle block limit fires, then exits its goroutine — without closing pe.execRequests, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for execRequests / rws.ResultCh / ctx.Done — none of which ever fire. The apply loop never got the channel-close signal it needed to return ErrLoopExhausted. Reproduced as a chiado parallel-exec stall at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block; with the exhaust signal honored the underlying ErrWrongTrieRoot now surfaces cleanly through the apply loop. The new branch sits between the maxBlockNum-reached check and the StopAfterBlock debug exit, matching the precedence of the existing partial-batch exits. ## Tests * TestExecLoopHonorsBlockResultExhausted — orchestration test with two models of the exec loop's blockResult decision tree (post-fix and pre-fix). The pre-fix model must hang past the timeout — proves the scaffolding genuinely surfaces the bug rather than passing vacuously. * TestExecLoopExhaustedOnlySetOnFinalBlock — locks in the dispatcher convention (Exhausted only on the trailing block of a partial batch) so future executeBlocks changes can't silently drop already-queued work by setting it mid-stream.

mh0lt · 2026-05-07T14:54:29Z

Linked as a precursor for #21017 alongside #21032 (SD/incarnation differentiator) and #21010 (asOfReader.txNum lazy-load fix on snapshot-loaded chains).

Copilot

Pull request overview

This PR fixes partial-batch orchestration in the parallel execution path so that normal “stop early and resume next cycle” exits are not misclassified as InvalidBlock, and ensures the exec loop honors an explicit “exhausted” signal to avoid stalling.

Changes:

Removes the maxBlockNum fallback from applyLoopMissingBlocks, so only blocks that actually produced tx-results are considered “missing” when the apply loop exits.
Updates execLoop to exit cleanly when blockResult.Exhausted is observed, allowing deferred channel closes to unblock the apply loop and propagate ErrLoopExhausted.
Extends robustness tests around partial-batch exits and the exhausted signal (with some tests currently modeling behavior rather than exercising production code).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
`execution/stagedsync/exec3_parallel.go`	Adjusts missing-block completeness logic and adds `blockResult.Exhausted` handling in `execLoop` to prevent partial-batch stalls.
`execution/stagedsync/exec3_parallel_robustness_test.go`	Updates/expands tests for partial-batch behavior and exec-loop exhausted handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -352,8 +352,8 @@ func (pe *parallelExecutor) exec(ctx context.Context, execStage *StageState, u U
 					// (no blockResult for it) but the channel closed cleanly: that
 					// means the exec loop signaled "done" without delivering the
 					// final block.


+// TestApplyLoopChannelCloseOrder exercises the documented invariant
+// that the exec loop closes commitResults BEFORE applyResults on
+// shutdown. The calculator drains commitResults and signals the apply
+// loop via rootResults; if applyResults closes first, the apply loop
+// can race with the calculator's final commitment write.
+//
+// The exec loop's deferred close in execLoop() does the right thing
+// (commitResults first), but the close ordering is load-bearing —
+// changing it silently corrupts the shutdown sequence. This test
+// captures the shape of the deferred close to pin the order down.
+func TestApplyLoopChannelCloseOrder(t *testing.T) {
+	commitResults := make(chan struct{})
+	applyResults := make(chan struct{})
+
+	closeOrder := make(chan string, 2)
+
+	// Mimic execLoop's deferred close: commitResults first, then
+	// applyResults.
+	closeBoth := func() {
+		close(commitResults)
+		closeOrder <- "commitResults"
+		close(applyResults)
+		closeOrder <- "applyResults"
+	}
+
+	go closeBoth()
+
+	first := <-closeOrder
+	second := <-closeOrder
+
+	if first != "commitResults" {
+		t.Fatalf("commitResults must close first; got close order [%s, %s]", first, second)
+	}
+	if second != "applyResults" {
+		t.Fatalf("applyResults must close second; got close order [%s, %s]", first, second)
+	}
+
+	// Sanity: both channels actually closed.
+	if _, ok := <-commitResults; ok {
+		t.Fatal("commitResults not actually closed")
+	}
+	if _, ok := <-applyResults; ok {
+		t.Fatal("applyResults not actually closed")
+	}


+	// Production decision tree, mirroring exec3_parallel.go's
+	//   if blockResult != nil { ... } block:
+	//     1. sizeEst > batchLimit → trigger + return
+	//     2. blockResult.BlockNum >= maxBlockNum → reachedMaxBlock + trigger + return
+	//     3. blockResult.Exhausted != nil → trigger + return
+	//     4. dbg.StopAfterBlock → trigger + return
+	//     5. fall through to next-block scheduling.
+	withFix := func(b *blockResult, trigger func()) (done bool) {


+		// Inner ctx lets the goroutine cooperate with the outer timeout
+		// without leaking when the model hangs on <-br. We translate
+		// "exited via ctx" to result.hung — that is the hang signal.
+		ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond)
+		defer cancel()


+// TestExecLoopExhaustedOnlySetOnFinalBlock is a guard against future
+// regressions in the dispatcher: blockResult.Exhausted being set on a
+// non-final block would cause the exec loop to drop later blocks that
+// have already been queued. Locks down the precedence: among blocks in
+// flight, only the trailing dispatched block's Exhausted is honored —
+// any earlier Exhausted signal would short-circuit the batch and lose
+// already-scheduled work. This test pins the convention to exec3.go's
+// `if exhausted != nil { break }` after dispatch: the break ensures
+// no later blockResult is produced once Exhausted is set on a
+// dispatched request.
+func TestExecLoopExhaustedOnlySetOnFinalBlock(t *testing.T) {
+	// Simulate executeBlocks' loop: it sets exhausted, dispatches the
+	// blockResult containing it, and breaks out — no further blocks.
+	// Verify the convention: count(Exhausted != nil) == 1 always, and
+	// it's always the last blockResult.
+	type dispatched struct {
+		num       uint64
+		exhausted bool
+	}
+	dispatch := func(blocks []dispatched) error {
+		var prevExhausted bool
+		for _, b := range blocks {
+			if prevExhausted {
+				return errors.New("dispatched a block after Exhausted was set — exec loop would drop it")
+			}
+			if b.exhausted {
+				prevExhausted = true
+			}
+		}
+		return nil
+	}
+
+	t.Run("Exhausted only on last block — valid", func(t *testing.T) {
+		err := dispatch([]dispatched{
+			{num: 1, exhausted: false},
+			{num: 2, exhausted: false},
+			{num: 3, exhausted: true},
+		})
+		if err != nil {
+			t.Fatalf("valid case rejected: %v", err)
+		}
+	})
+
+	t.Run("no Exhausted — valid (full-batch path)", func(t *testing.T) {
+		err := dispatch([]dispatched{
+			{num: 1, exhausted: false},
+			{num: 2, exhausted: false},
+		})
+		if err != nil {
+			t.Fatalf("valid case rejected: %v", err)
+		}
+	})
+
+	t.Run("Exhausted mid-stream — invalid (regression guard)", func(t *testing.T) {
+		err := dispatch([]dispatched{
+			{num: 1, exhausted: false},
+			{num: 2, exhausted: true},
+			{num: 3, exhausted: false},
+		})
+		if err == nil {
+			t.Fatal("dispatch convention violated: blockResult emitted after Exhausted")
+		}
+	})


yperbasis

Major

1. Stale comment at the apply-loop close site (real bug, easy fix)

exec3_parallel.go:351-354 still says:

// Also flag the case where maxBlockNum was never reached at all
// (no blockResult for it) but the channel closed cleanly: that
// means the exec loop signaled "done" without delivering the
// final block.

This is exactly the behavior commit 550cd1d574 removed. The function-level comment on applyLoopMissingBlocks was correctly updated; the call-site comment was missed. Should be deleted or replaced to explain that not-reaching-maxBlockNum is now a normal partial-batch state.

Minor (optional)

2. All new tests are decision-tree models, not the real `execLoop`

The PR description acknowledges this — full integration would need workers + calculator + state setup. But the consequence is real: TestExecLoopHonorsBlockResultExhausted's withFix model is hand-coded to mirror production precedence. If the production order changes, the test passes vacuously while the model drifts. Worth a comment in the test pointing at the exact production lines being mirrored, so a future change is more likely to update both. Same for TestApplyLoopPartialBatchReturnsErrLoopExhausted.

3. Comment slightly imprecise

Lines 901-902:

(initialCycle gates it on crossing a step boundary;
 later cycles enforce it unconditionally)

The initial-cycle path also requires !dbg.DiscardCommitment() (exec3.go:675). Minor.

4. Unused field in test result type

TestApplyLoopPartialBatchReturnsErrLoopExhausted's errSubstring is set but never asserted on. Trivial cleanup.

mh0lt · 2026-05-07T15:22:41Z

@yperbasis CI is green now — ready for re-review. The earlier failure was kurtosis assertoor_regular_test flake (context-canceled in the assertoor task runner, not a test assertion); the re-run passed without any code change.

- Replace stale comment at the apply-loop close site (exec3_parallel.go:351-354) that still described the removed maxBlockNum fallback. - Refine the executeBlocks-Exhausted comment to mention the !dbg.DiscardCommitment() initial-cycle gate at exec3.go:675. - Add a cross-reference comment to TestApplyLoopPartialBatchReturnsErrLoopExhausted pinning it to exec3_parallel.go's apply-loop exit branch around line 355, matching the convention already used in TestExecLoopHonorsBlockResultExhausted. - Drop unused errSubstring field from the test's result struct. - Reduce TestExecLoopHonorsBlockResultExhausted hang-detection timeout from 1500ms inner / 2000ms outer to 300ms / 500ms — in-memory channel + closure dispatch resolves on a microsecond timescale, so the larger budget was pure test runtime overhead.

mh0lt · 2026-05-07T16:46:10Z

@yperbasis pushed 686e11b addressing the review:

Item	Fix
Major #1 — stale comment at `exec3_parallel.go:351-354` describing the removed maxBlockNum fallback	Replaced with a short explanation of why not-reaching-maxBlockNum is now a normal partial-batch state
Minor #2 — tests are decision-tree models; risk of model drifting from production	Added a cross-reference comment in `TestApplyLoopPartialBatchReturnsErrLoopExhausted` pinning the closure to `exec3_parallel.go` around line 355 (matching the convention `TestExecLoopHonorsBlockResultExhausted` already used)
Minor #3 — initialCycle comment imprecision	Updated to mention the `!dbg.DiscardCommitment()` gate and reference `exec3.go:675`
Minor #4 — unused `errSubstring` field	Dropped
Copilot test-runtime concern — 1.5s+2s budget per test	Reduced to 300ms inner / 500ms outer; in-memory channel dispatch resolves on a microsecond timescale, so the larger budget was pure overhead

The Copilot suggestions about extracting production logic into helpers and unit-testing the helpers (so the tests drive real code rather than a model) are good but a larger refactor than fits this PR — would prefer to leave them as a follow-up.

Lint clean, full go test -race ./execution/stagedsync/ for the robustness tests passes.

…ction Addresses Copilot's review feedback on #21039: the original tests modeled the exec-loop and apply-loop decision trees in local closures, which would let the model drift from production silently. This follow-up extracts three pure helpers from the production code and rewires the tests to drive them directly. Helpers extracted: * `closeApplyChannels` (pe method) — pins commitResults-then-applyResults close ordering. Returns the close sequence as a string slice so tests can verify the order deterministically without racing on observer- goroutine wakeups (the previous test had a real flake here). * `execLoopShouldExit` (pure func) — encodes the per-blockResult exit precedence: sizeLimit > maxBlockNum > Exhausted > StopAfterBlock. Production exec-loop now switches on the returned decision; tests pin each precedence rule with table-driven cases. * `shouldMarkExhaustedAtBlock` (pure func) — encodes executeBlocks's per-cycle block-limit gate (initial-cycle protections + blockLimit span check + maxBlockNum guard). Replaces the old "dispatch convention" model test that didn't drive any production code. Tests updated: * `TestApplyLoopChannelCloseOrder` — calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out. * `TestExecLoopShouldExitPriority` — table-driven coverage of the exit-priority decisions including the cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win). * `TestShouldMarkExhaustedAtBlock` — table-driven coverage of the initial-cycle gates, blockLimit==0 disabled case, span < limit case, and the maxBlockNum exact-hit no-mark case. No production behaviour change — pure refactor + test rewire.

mh0lt · 2026-05-07T17:58:22Z

Follow-up #21046 opens against this branch — extracts closeApplyChannels / execLoopShouldExit / shouldMarkExhaustedAtBlock from production and rewires the robustness tests to drive them directly, addressing Copilot's "tests model rather than drive production" feedback. No production behaviour change in the follow-up; once this PR merges I'll rebase #21046 on main.

…ction (#21046) Follow-up to #21039 addressing Copilot's review feedback that the new robustness tests were modelling the exec-loop and apply-loop decision trees in local closures — which would let the model drift from production silently. This PR extracts three pure helpers from the production code and rewires the tests to drive them directly. **No production behaviour change** — the helpers wrap the exact same logic that was inline before. ## Helpers extracted | Helper | Production call site | What it pins | |---|---|---| | `closeApplyChannels` (pe method) | `execLoop` deferred close | commitResults-then-applyResults order | | `execLoopShouldExit` (pure) | `execLoop` per-blockResult branch | sizeLimit > maxBlockNum > Exhausted > StopAfterBlock precedence | | `shouldMarkExhaustedAtBlock` (pure) | `executeBlocks` per-cycle limit gate | initial-cycle gates + blockLimit + maxBlockNum guard | `closeApplyChannels` returns the actual close-order as a string slice — the previous test was racy because observer goroutines could wake up out of order even when the helper closed the channels in the right order; the slice removes that race entirely. ## Tests rewired * **`TestApplyLoopChannelCloseOrder`** — now calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out. * **`TestCloseApplyChannelsDoubleCloseRecovers`** — verifies the recover-on-double-close path doesn't count pre-closed channels as freshly closed. * **`TestExecLoopShouldExitPriority`** — table-driven coverage of each branch, including cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win for the clean reachedMaxBlock path). * **`TestShouldMarkExhaustedAtBlock`** — table-driven coverage of the initial-cycle gates (\`lastExecutedStep\`/\`lastFrozenStep\`/\`DiscardCommitment\`), the \`blockLimit==0\` disabled case, the span < limit case, and the \`maxBlockNum\` exact-hit no-mark case. The previous \`TestExecLoopHonorsBlockResultExhausted\` and \`TestExecLoopExhaustedOnlySetOnFinalBlock\` tests are removed — both are subsumed by the new \`TestExecLoopShouldExitPriority\` and \`TestShouldMarkExhaustedAtBlock\` cases that drive production code. ## Test plan - [x] All four new tests pass: \`TestApplyLoopChannelCloseOrder\`, \`TestCloseApplyChannelsDoubleCloseRecovers\`, \`TestExecLoopShouldExitPriority\`, \`TestShouldMarkExhaustedAtBlock\` - [x] Existing tests in the same file (\`TestApplyLoopMissingBlocks\`, \`TestExecLoopExitCheck\`, \`TestApplyLoopRootResultsCloseDoesNotRace\`, etc.) still pass - [x] \`make lint\` clean - [x] No production behaviour change (\`go build ./...\` clean) ## Stack Stacked on #21039 (the apply-loop fix). Both should land together — this PR alone doesn't make sense without #21039's helper-using tests as context.

…op-partial-batch-no-spurious-flag

mh0lt · 2026-05-08T08:24:01Z

Re-pushed with main merged in (494b1446f4). The two CI failures on the previous run are both unrelated to this PR:

race-tests / execution-other — TestHistoryVerification_SimpleBlocks: pre-existing data race in db/state/aggregator.go:623 (SetCompressWorkers) ↔ db/state/inverted_index.go:1003 ((*InvertedIndex).collate()). Stack trace shows the race is between ExecV3 calling agg.PresetChainTipConcurrency() (write) and a background-aggregator collation goroutine reading compressWorkers (read). Neither file is touched by this PR — the 3-dot diff vs main is purely exec3_parallel.go + exec3_parallel_robustness_test.go. Race-tests is also a fresh job since this PR's previous run, so the race may have surfaced from another concurrent change.
mainnet-rpc-integ-tests: ran for 6h before the runner cleaned up orphan processes — clearly an infra timeout / disconnected runner, not a test assertion failure.

The merge with main + re-run should clear both. If TestHistoryVerification_SimpleBlocks keeps racing, that's a separate issue to file against the aggregator code.

mh0lt · 2026-05-08T08:48:12Z

Filed #21062 to fix the underlying data race surfaced on this PR's race-tests run (TestHistoryVerification_SimpleBlocks — race between SetCompressWorkers and InvertedIndex.collate's read of CompressorCfg). Verified locally: -race passes with the fix. Once #21062 lands and this branch picks it up via main-merge, the race-tests entry should clear.

mh0lt · 2026-05-08T08:52:43Z

Blocked by #21062 — that PR fixes the underlying data race in db/state/aggregator.go ↔ db/state/inverted_index.go that surfaces here as the race-tests / execution-other failure. The race is not introduced by this PR (verified locally: 3-dot diff vs main only touches exec3*.go, no db/state/ files), it's a pre-existing race that was hidden until race-tests started running on this PR's matrix.

Once #21062 merges to main and we rebase, this PR's CI clears.

I deliberately did not fold #21062's commit into this branch — that'd muddy two unrelated PRs (apply-loop fix vs aggregator race fix) and force a single review/revert decision on both. Reviewing them separately preserves cleaner attribution and rollback boundaries.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

+// TestApplyLoopPartialBatchReturnsErrLoopExhausted exercises the
+// apply-loop exit decision tree end-to-end with channel orchestration:
+// when applyResults closes after the exec loop hit its size-limit
+// (lastBlockResult < maxBlockNum, no missing blocks), the apply loop
+// must return ErrLoopExhausted so the stage loop resumes from the next
+// block. The previous bug spuriously flagged maxBlockNum as missing
+// because it wasn't applied — turning every legitimate partial batch
+// into an InvalidBlock error. This test locks in the corrected


…blocks erigontech#21039 CI) (erigontech#21062) > **🚨 Blocks erigontech#21039 CI** — the data race this PR fixes is the only reason erigontech#21039's `race-tests / execution-other` job fails. erigontech#21039 cannot merge until its CI is green, and the only way that happens is for this fix to land first (or be folded in, which would muddy two unrelated PRs). Please prioritise reviewing/merging. ## Summary Fixes a long-standing data race that the Go race detector flags when a chain-tip `PresetChainTipConcurrency` call from `ExecV3` overlaps with a background `buildFiles` collation. The race is between: - **Write** in `Aggregator.SetCompressWorkers` ([aggregator.go:623](db/state/aggregator.go#L623)) — `ii.CompressorCfg.Workers = i` - **Read** in `InvertedIndex.collate` ([inverted_index.go:1003](db/state/inverted_index.go#L1003)) — `seg.NewCompressor(..., ii.CompressorCfg, ...)` reads the cfg by value Surfaced as a CI failure on PR erigontech#21039 (where the race-tests job actually ran): ``` race-tests / tests-linux (ubuntu-latest, execution-other) → TestHistoryVerification_SimpleBlocks ``` ## Root cause `Aggregator.lockWorkersEditing` already existed as a soft-skip flag for `Set*Workers`, but: 1. The bool read/write was unsynchronised — the `if lockWorkersEditing` check could observe a stale `false` while `buildFiles` had just toggled it true, leading to a concurrent `CompressorCfg.Workers = i` write. 2. `LockWorkersEditing` / `UnlockWorkersEditing` were never actually called around `buildFiles` itself. ## Fix * New `Aggregator.workersMu sync.Mutex`. * All four `Set*Workers` (Collate/Build, Merge, Compress, BuildAccessors) acquire `workersMu` before the flag-check + write. The combined check+write under the mutex closes the load-then-write race window. * `LockWorkersEditing` / `UnlockWorkersEditing` acquire `workersMu` around the bool toggle — provides a happens-before edge so any prior `Set*Workers`' write is visible to the `buildFiles` goroutine that subsequently reads `CompressorCfg`, and any concurrent `Set*Workers` blocks briefly until the toggle is observable. * `Aggregator.buildFiles` wraps its body in `LockWorkersEditing` / `defer UnlockWorkersEditing` so chain-tip-driven `Set*Workers` calls are no-oped for the duration of collation. ## Test plan - [x] `TestHistoryVerification_SimpleBlocks -race` passes (was the original repro) - [x] `go test -race -short ./db/state/...` clean (includes the aggregator + state-domain test suites) - [x] `make lint` clean - [x] No public-API change beyond making `LockWorkersEditing` / `UnlockWorkersEditing` semantically stronger — they now actually synchronise rather than being a flag with no fence

…igontech#21017) > **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the substantive parallel-exec correctness + perf fix family (`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is now on `main`. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis matrix, and the `setup-erigon` apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review. > > **Landed precursors:** > - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR > - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate > - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths > - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit > - ✅ erigontech#21010 (merged 2026-05-07) — `commitmentCalculator.asOfReader.txNum=0` lazy-load fix > > Open follow-up issues (separate from this PR's scope): erigontech#21106 (parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`), erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated SkipLoads), erigontech#21138 (heuristic-removal structural cleanup). ## Summary Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test workflow that exercises the EL execution path so divergence between `dbg.Exec3Parallel` true and false is caught on the PR rather than after release. The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in [common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the `ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in [common/dbg/experiments.go:81](common/dbg/experiments.go#L81). ## Why `Exec3Parallel = false` is currently the source default on `main`. With no CI workflow setting the env var, every CI run today exercises only the **serial** path. The parallel path lands via `--bal` chains and tests like `experimentalBAL`, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal. ## Affected workflows **Always-on** (matrix on every PR / dispatch / call): | Workflow | Runner | Parallelism | Cost | |---|---|---|---| | test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel | wall-clock unchanged, +1× runner-min | | test-all-erigon-race.yml | GitHub-hosted | true parallel | same | | test-hive.yml | hive group | parallel if pool has slots | same (group is sized) | | test-hive-eest.yml | hive group | parallel if pool has slots | same | | test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same | **Auto-fire on PRs touching their own YAML, dispatch otherwise** (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation): | Workflow | Notes | |---|---| | test-bench.yml | `go bench` | | qa-rpc-integration-tests-latest.yml | self-hosted single-pool, `max-parallel: 1` (shared testbed state) | | qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel, geth single-mode (placeholder ignored) | | qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` | | qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries | **Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. ## Plumbing details * Workflows that build erigon and run go tests directly: env var set on the test step's `env:` block. * Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is appended to `clients/erigon/Dockerfile` during the existing sed-patch loop so every hive-launched erigon inherits the toggle. * Kurtosis-based workflows: a single upstream `build-erigon-image` job builds `test/erigon:current-base` once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + `docker load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all writing the same `type=gha` cache scope and 504-ing. * Self-hosted single-pool workflows use `max-parallel: 1` to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by `exec_mode` so the two matrix entries don't clobber each other's outputs. ## Review follow-ups (2026-05-15 rebuild) After rebasing onto post-erigontech#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits: | Commit | Fixes | |---|---| | `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries | | `ci: align test-hive devp2p sim-limit between serial and parallel matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment) | | `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 — s/Targetting/Targeting/ | The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631` ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly `.github/` only. ## Test plan This PR's CI run **is** the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their `pull_request: paths` filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows. What to look for in this PR's "Checks" tab: - [x] Each affected workflow has both `serial` and `parallel` matrix entries listed. - [x] Job display names show the mode (e.g. `tests-mac-linux (ubuntu-24.04, parallel)`). - [x] Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners). - [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize). - [x] Both modes pass on every workflow. **If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.** After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: milen <taratorio@users.noreply.github.com>

mh0lt requested a review from yperbasis as a code owner May 7, 2026 12:58

mh0lt mentioned this pull request May 7, 2026

ci: matrix-test serial vs parallel exec across the test workflows #21017

Merged

5 tasks

yperbasis requested changes May 7, 2026

View reviewed changes

mh0lt mentioned this pull request May 7, 2026

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains #21010

Merged

3 tasks

yperbasis requested review from Copilot and taratorio May 7, 2026 14:58

Copilot started reviewing on behalf of yperbasis May 7, 2026 14:59 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

yperbasis requested changes May 7, 2026

View reviewed changes

mh0lt mentioned this pull request May 7, 2026

execution/stagedsync: extract helpers so robustness tests drive production #21046

Merged

4 tasks

mh0lt and others added 2 commits May 7, 2026 21:26

Merge remote-tracking branch 'origin/main' into fix/parallel-apply-lo…

494b144

…op-partial-batch-no-spurious-flag

mh0lt mentioned this pull request May 8, 2026

db/state: serialize Aggregator.Set*Workers to fix race with collate (blocks #21039 CI) #21062

Merged

4 tasks

yperbasis requested a review from Copilot May 8, 2026 10:15

Copilot started reviewing on behalf of yperbasis May 8, 2026 10:16 View session

Copilot AI reviewed May 8, 2026

View reviewed changes

yperbasis approved these changes May 8, 2026

View reviewed changes

mh0lt added this pull request to the merge queue May 8, 2026

Merged via the queue into main with commit 2346a98 May 8, 2026
41 of 42 checks passed

mh0lt deleted the fix/parallel-apply-loop-partial-batch-no-spurious-flag branch May 8, 2026 12:14

mh0lt mentioned this pull request May 11, 2026

parallel_exec: chiado exec from 0 trie-root missmatch #20711

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit#21039

execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit#21039
mh0lt merged 6 commits into
mainfrom
fix/parallel-apply-loop-partial-batch-no-spurious-flag

mh0lt commented May 7, 2026 •

edited

Loading

Uh oh!

yperbasis left a comment

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

yperbasis left a comment

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mh0lt commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Update — additional fix in commit 310a094514 (execution/stagedsync: honor blockResult.Exhausted in exec loop)

Why

Shutdown sequence verified

Tests

Test plan

CI fix linkage

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Major

1. Stale comment at the apply-loop close site (real bug, easy fix)

Minor (optional)

2. All new tests are decision-tree models, not the real execLoop

3. Comment slightly imprecise

4. Unused field in test result type

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mh0lt commented May 7, 2026 •

edited

Loading

Update — additional fix in commit `310a094514` (`execution/stagedsync: honor blockResult.Exhausted in exec loop`)

2. All new tests are decision-tree models, not the real `execLoop`