execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit#21039
Conversation
…oop exit When the parallel exec loop hits its batch size limit mid-fixture, it returns nil to the apply loop with reachedMaxBlock=false. The apply loop's completeness check then ran applyLoopMissingBlocks with a fallback clause that flagged maxBlockNum as "missing" whenever !reachedMaxBlock — turning every legitimate partial-batch exit into a spurious InvalidBlock error. The size-limit path is normal: it returns ErrLoopExhausted and the stage loop resumes the next batch from lastBlockResult+1. Each block still executes exactly once. The fallback clause was unnecessary and incorrect. Reproduces deterministically as the github.com/erigontech/erigon/rpc/gasprice BenchmarkFeeHistory parallel-mode failure: 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114; first batch covers blocks 1..114, second batch covers 115..200, both clean. Drops the fallback clause, simplifies applyLoopMissingBlocks to a single map-comparison, and tightens the message string. Adds: * TestApplyLoopMissingBlocks "partial batch — size-limit hit" case * TestApplyLoopPartialBatchReturnsErrLoopExhausted exercising the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) * TestApplyLoopChannelCloseOrder pinning the documented commitResults-then-applyResults shutdown order Precursor for #21017 (CI matrix exec_mode={serial,parallel}); without this, every workflow that runs a parallel-exec test against a chain that exceeds the batch budget fails on the first such size-limit boundary.
Same class of partial-batch orchestration bug as the apply-loop fix in this PR. executeBlocks marks the final dispatched block with Exhausted when the per-cycle block limit fires, then exits its goroutine — without closing pe.execRequests, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for execRequests / rws.ResultCh / ctx.Done — none of which ever fire. The apply loop never got the channel-close signal it needed to return ErrLoopExhausted. Reproduced as a chiado parallel-exec stall at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block; with the exhaust signal honored the underlying ErrWrongTrieRoot now surfaces cleanly through the apply loop. The new branch sits between the maxBlockNum-reached check and the StopAfterBlock debug exit, matching the precedence of the existing partial-batch exits. ## Tests * TestExecLoopHonorsBlockResultExhausted — orchestration test with two models of the exec loop's blockResult decision tree (post-fix and pre-fix). The pre-fix model must hang past the timeout — proves the scaffolding genuinely surfaces the bug rather than passing vacuously. * TestExecLoopExhaustedOnlySetOnFinalBlock — locks in the dispatcher convention (Exhausted only on the trailing block of a partial batch) so future executeBlocks changes can't silently drop already-queued work by setting it mid-stream.
There was a problem hiding this comment.
Pull request overview
This PR fixes partial-batch orchestration in the parallel execution path so that normal “stop early and resume next cycle” exits are not misclassified as InvalidBlock, and ensures the exec loop honors an explicit “exhausted” signal to avoid stalling.
Changes:
- Removes the
maxBlockNumfallback fromapplyLoopMissingBlocks, so only blocks that actually produced tx-results are considered “missing” when the apply loop exits. - Updates
execLoopto exit cleanly whenblockResult.Exhaustedis observed, allowing deferred channel closes to unblock the apply loop and propagateErrLoopExhausted. - Extends robustness tests around partial-batch exits and the exhausted signal (with some tests currently modeling behavior rather than exercising production code).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
execution/stagedsync/exec3_parallel.go |
Adjusts missing-block completeness logic and adds blockResult.Exhausted handling in execLoop to prevent partial-batch stalls. |
execution/stagedsync/exec3_parallel_robustness_test.go |
Updates/expands tests for partial-batch behavior and exec-loop exhausted handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -352,8 +352,8 @@ func (pe *parallelExecutor) exec(ctx context.Context, execStage *StageState, u U | |||
| // (no blockResult for it) but the channel closed cleanly: that | |||
| // means the exec loop signaled "done" without delivering the | |||
| // final block. | |||
| // TestApplyLoopChannelCloseOrder exercises the documented invariant | ||
| // that the exec loop closes commitResults BEFORE applyResults on | ||
| // shutdown. The calculator drains commitResults and signals the apply | ||
| // loop via rootResults; if applyResults closes first, the apply loop | ||
| // can race with the calculator's final commitment write. | ||
| // | ||
| // The exec loop's deferred close in execLoop() does the right thing | ||
| // (commitResults first), but the close ordering is load-bearing — | ||
| // changing it silently corrupts the shutdown sequence. This test | ||
| // captures the shape of the deferred close to pin the order down. | ||
| func TestApplyLoopChannelCloseOrder(t *testing.T) { | ||
| commitResults := make(chan struct{}) | ||
| applyResults := make(chan struct{}) | ||
|
|
||
| closeOrder := make(chan string, 2) | ||
|
|
||
| // Mimic execLoop's deferred close: commitResults first, then | ||
| // applyResults. | ||
| closeBoth := func() { | ||
| close(commitResults) | ||
| closeOrder <- "commitResults" | ||
| close(applyResults) | ||
| closeOrder <- "applyResults" | ||
| } | ||
|
|
||
| go closeBoth() | ||
|
|
||
| first := <-closeOrder | ||
| second := <-closeOrder | ||
|
|
||
| if first != "commitResults" { | ||
| t.Fatalf("commitResults must close first; got close order [%s, %s]", first, second) | ||
| } | ||
| if second != "applyResults" { | ||
| t.Fatalf("applyResults must close second; got close order [%s, %s]", first, second) | ||
| } | ||
|
|
||
| // Sanity: both channels actually closed. | ||
| if _, ok := <-commitResults; ok { | ||
| t.Fatal("commitResults not actually closed") | ||
| } | ||
| if _, ok := <-applyResults; ok { | ||
| t.Fatal("applyResults not actually closed") | ||
| } |
| // Production decision tree, mirroring exec3_parallel.go's | ||
| // if blockResult != nil { ... } block: | ||
| // 1. sizeEst > batchLimit → trigger + return | ||
| // 2. blockResult.BlockNum >= maxBlockNum → reachedMaxBlock + trigger + return | ||
| // 3. blockResult.Exhausted != nil → trigger + return | ||
| // 4. dbg.StopAfterBlock → trigger + return | ||
| // 5. fall through to next-block scheduling. | ||
| withFix := func(b *blockResult, trigger func()) (done bool) { |
| // Inner ctx lets the goroutine cooperate with the outer timeout | ||
| // without leaking when the model hangs on <-br. We translate | ||
| // "exited via ctx" to result.hung — that is the hang signal. | ||
| ctx, cancel := context.WithTimeout(context.Background(), 1500*time.Millisecond) | ||
| defer cancel() |
| // TestExecLoopExhaustedOnlySetOnFinalBlock is a guard against future | ||
| // regressions in the dispatcher: blockResult.Exhausted being set on a | ||
| // non-final block would cause the exec loop to drop later blocks that | ||
| // have already been queued. Locks down the precedence: among blocks in | ||
| // flight, only the trailing dispatched block's Exhausted is honored — | ||
| // any earlier Exhausted signal would short-circuit the batch and lose | ||
| // already-scheduled work. This test pins the convention to exec3.go's | ||
| // `if exhausted != nil { break }` after dispatch: the break ensures | ||
| // no later blockResult is produced once Exhausted is set on a | ||
| // dispatched request. | ||
| func TestExecLoopExhaustedOnlySetOnFinalBlock(t *testing.T) { | ||
| // Simulate executeBlocks' loop: it sets exhausted, dispatches the | ||
| // blockResult containing it, and breaks out — no further blocks. | ||
| // Verify the convention: count(Exhausted != nil) == 1 always, and | ||
| // it's always the last blockResult. | ||
| type dispatched struct { | ||
| num uint64 | ||
| exhausted bool | ||
| } | ||
| dispatch := func(blocks []dispatched) error { | ||
| var prevExhausted bool | ||
| for _, b := range blocks { | ||
| if prevExhausted { | ||
| return errors.New("dispatched a block after Exhausted was set — exec loop would drop it") | ||
| } | ||
| if b.exhausted { | ||
| prevExhausted = true | ||
| } | ||
| } | ||
| return nil | ||
| } | ||
|
|
||
| t.Run("Exhausted only on last block — valid", func(t *testing.T) { | ||
| err := dispatch([]dispatched{ | ||
| {num: 1, exhausted: false}, | ||
| {num: 2, exhausted: false}, | ||
| {num: 3, exhausted: true}, | ||
| }) | ||
| if err != nil { | ||
| t.Fatalf("valid case rejected: %v", err) | ||
| } | ||
| }) | ||
|
|
||
| t.Run("no Exhausted — valid (full-batch path)", func(t *testing.T) { | ||
| err := dispatch([]dispatched{ | ||
| {num: 1, exhausted: false}, | ||
| {num: 2, exhausted: false}, | ||
| }) | ||
| if err != nil { | ||
| t.Fatalf("valid case rejected: %v", err) | ||
| } | ||
| }) | ||
|
|
||
| t.Run("Exhausted mid-stream — invalid (regression guard)", func(t *testing.T) { | ||
| err := dispatch([]dispatched{ | ||
| {num: 1, exhausted: false}, | ||
| {num: 2, exhausted: true}, | ||
| {num: 3, exhausted: false}, | ||
| }) | ||
| if err == nil { | ||
| t.Fatal("dispatch convention violated: blockResult emitted after Exhausted") | ||
| } | ||
| }) |
yperbasis
left a comment
There was a problem hiding this comment.
Major
1. Stale comment at the apply-loop close site (real bug, easy fix)
exec3_parallel.go:351-354 still says:
// Also flag the case where maxBlockNum was never reached at all
// (no blockResult for it) but the channel closed cleanly: that
// means the exec loop signaled "done" without delivering the
// final block.This is exactly the behavior commit 550cd1d574 removed. The function-level comment on applyLoopMissingBlocks was correctly updated; the call-site comment was missed. Should be deleted or replaced to explain that not-reaching-maxBlockNum is now a normal partial-batch state.
Minor (optional)
2. All new tests are decision-tree models, not the real execLoop
The PR description acknowledges this — full integration would need workers + calculator + state setup. But the consequence is real: TestExecLoopHonorsBlockResultExhausted's withFix model is hand-coded to mirror production precedence. If the production order changes, the test passes vacuously while the model drifts. Worth a comment in the test pointing at the exact production lines being mirrored, so a future change is more likely to update both. Same for TestApplyLoopPartialBatchReturnsErrLoopExhausted.
3. Comment slightly imprecise
Lines 901-902:
(initialCycle gates it on crossing a step boundary;
later cycles enforce it unconditionally)
The initial-cycle path also requires !dbg.DiscardCommitment() (exec3.go:675). Minor.
4. Unused field in test result type
TestApplyLoopPartialBatchReturnsErrLoopExhausted's errSubstring is set but never asserted on. Trivial cleanup.
|
@yperbasis CI is green now — ready for re-review. The earlier failure was |
- Replace stale comment at the apply-loop close site (exec3_parallel.go:351-354) that still described the removed maxBlockNum fallback. - Refine the executeBlocks-Exhausted comment to mention the !dbg.DiscardCommitment() initial-cycle gate at exec3.go:675. - Add a cross-reference comment to TestApplyLoopPartialBatchReturnsErrLoopExhausted pinning it to exec3_parallel.go's apply-loop exit branch around line 355, matching the convention already used in TestExecLoopHonorsBlockResultExhausted. - Drop unused errSubstring field from the test's result struct. - Reduce TestExecLoopHonorsBlockResultExhausted hang-detection timeout from 1500ms inner / 2000ms outer to 300ms / 500ms — in-memory channel + closure dispatch resolves on a microsecond timescale, so the larger budget was pure test runtime overhead.
|
@yperbasis pushed 686e11b addressing the review:
The Copilot suggestions about extracting production logic into helpers and unit-testing the helpers (so the tests drive real code rather than a model) are good but a larger refactor than fits this PR — would prefer to leave them as a follow-up. Lint clean, full |
…ction Addresses Copilot's review feedback on #21039: the original tests modeled the exec-loop and apply-loop decision trees in local closures, which would let the model drift from production silently. This follow-up extracts three pure helpers from the production code and rewires the tests to drive them directly. Helpers extracted: * `closeApplyChannels` (pe method) — pins commitResults-then-applyResults close ordering. Returns the close sequence as a string slice so tests can verify the order deterministically without racing on observer- goroutine wakeups (the previous test had a real flake here). * `execLoopShouldExit` (pure func) — encodes the per-blockResult exit precedence: sizeLimit > maxBlockNum > Exhausted > StopAfterBlock. Production exec-loop now switches on the returned decision; tests pin each precedence rule with table-driven cases. * `shouldMarkExhaustedAtBlock` (pure func) — encodes executeBlocks's per-cycle block-limit gate (initial-cycle protections + blockLimit span check + maxBlockNum guard). Replaces the old "dispatch convention" model test that didn't drive any production code. Tests updated: * `TestApplyLoopChannelCloseOrder` — calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out. * `TestExecLoopShouldExitPriority` — table-driven coverage of the exit-priority decisions including the cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win). * `TestShouldMarkExhaustedAtBlock` — table-driven coverage of the initial-cycle gates, blockLimit==0 disabled case, span < limit case, and the maxBlockNum exact-hit no-mark case. No production behaviour change — pure refactor + test rewire.
|
Follow-up #21046 opens against this branch — extracts |
…ction (#21046) Follow-up to #21039 addressing Copilot's review feedback that the new robustness tests were modelling the exec-loop and apply-loop decision trees in local closures — which would let the model drift from production silently. This PR extracts three pure helpers from the production code and rewires the tests to drive them directly. **No production behaviour change** — the helpers wrap the exact same logic that was inline before. ## Helpers extracted | Helper | Production call site | What it pins | |---|---|---| | `closeApplyChannels` (pe method) | `execLoop` deferred close | commitResults-then-applyResults order | | `execLoopShouldExit` (pure) | `execLoop` per-blockResult branch | sizeLimit > maxBlockNum > Exhausted > StopAfterBlock precedence | | `shouldMarkExhaustedAtBlock` (pure) | `executeBlocks` per-cycle limit gate | initial-cycle gates + blockLimit + maxBlockNum guard | `closeApplyChannels` returns the actual close-order as a string slice — the previous test was racy because observer goroutines could wake up out of order even when the helper closed the channels in the right order; the slice removes that race entirely. ## Tests rewired * **`TestApplyLoopChannelCloseOrder`** — now calls `pe.closeApplyChannels()` on a real `parallelExecutor` with mock channels and asserts on the returned close-order slice. Also covers double-close recovery and the post-close nil-out. * **`TestCloseApplyChannelsDoubleCloseRecovers`** — verifies the recover-on-double-close path doesn't count pre-closed channels as freshly closed. * **`TestExecLoopShouldExitPriority`** — table-driven coverage of each branch, including cross-condition cases (e.g. maxBlockNum AND Exhausted both set on the final block — maxBlockNum must win for the clean reachedMaxBlock path). * **`TestShouldMarkExhaustedAtBlock`** — table-driven coverage of the initial-cycle gates (\`lastExecutedStep\`/\`lastFrozenStep\`/\`DiscardCommitment\`), the \`blockLimit==0\` disabled case, the span < limit case, and the \`maxBlockNum\` exact-hit no-mark case. The previous \`TestExecLoopHonorsBlockResultExhausted\` and \`TestExecLoopExhaustedOnlySetOnFinalBlock\` tests are removed — both are subsumed by the new \`TestExecLoopShouldExitPriority\` and \`TestShouldMarkExhaustedAtBlock\` cases that drive production code. ## Test plan - [x] All four new tests pass: \`TestApplyLoopChannelCloseOrder\`, \`TestCloseApplyChannelsDoubleCloseRecovers\`, \`TestExecLoopShouldExitPriority\`, \`TestShouldMarkExhaustedAtBlock\` - [x] Existing tests in the same file (\`TestApplyLoopMissingBlocks\`, \`TestExecLoopExitCheck\`, \`TestApplyLoopRootResultsCloseDoesNotRace\`, etc.) still pass - [x] \`make lint\` clean - [x] No production behaviour change (\`go build ./...\` clean) ## Stack Stacked on #21039 (the apply-loop fix). Both should land together — this PR alone doesn't make sense without #21039's helper-using tests as context.
…op-partial-batch-no-spurious-flag
|
Re-pushed with main merged in (
The merge with main + re-run should clear both. If |
|
Filed #21062 to fix the underlying data race surfaced on this PR's race-tests run ( |
|
Blocked by #21062 — that PR fixes the underlying data race in Once #21062 merges to main and we rebase, this PR's CI clears. I deliberately did not fold #21062's commit into this branch — that'd muddy two unrelated PRs (apply-loop fix vs aggregator race fix) and force a single review/revert decision on both. Reviewing them separately preserves cleaner attribution and rollback boundaries. |
| // TestApplyLoopPartialBatchReturnsErrLoopExhausted exercises the | ||
| // apply-loop exit decision tree end-to-end with channel orchestration: | ||
| // when applyResults closes after the exec loop hit its size-limit | ||
| // (lastBlockResult < maxBlockNum, no missing blocks), the apply loop | ||
| // must return ErrLoopExhausted so the stage loop resumes from the next | ||
| // block. The previous bug spuriously flagged maxBlockNum as missing | ||
| // because it wasn't applied — turning every legitimate partial batch | ||
| // into an InvalidBlock error. This test locks in the corrected |
…blocks erigontech#21039 CI) (erigontech#21062) > **🚨 Blocks erigontech#21039 CI** — the data race this PR fixes is the only reason erigontech#21039's `race-tests / execution-other` job fails. erigontech#21039 cannot merge until its CI is green, and the only way that happens is for this fix to land first (or be folded in, which would muddy two unrelated PRs). Please prioritise reviewing/merging. ## Summary Fixes a long-standing data race that the Go race detector flags when a chain-tip `PresetChainTipConcurrency` call from `ExecV3` overlaps with a background `buildFiles` collation. The race is between: - **Write** in `Aggregator.SetCompressWorkers` ([aggregator.go:623](db/state/aggregator.go#L623)) — `ii.CompressorCfg.Workers = i` - **Read** in `InvertedIndex.collate` ([inverted_index.go:1003](db/state/inverted_index.go#L1003)) — `seg.NewCompressor(..., ii.CompressorCfg, ...)` reads the cfg by value Surfaced as a CI failure on PR erigontech#21039 (where the race-tests job actually ran): ``` race-tests / tests-linux (ubuntu-latest, execution-other) → TestHistoryVerification_SimpleBlocks ``` ## Root cause `Aggregator.lockWorkersEditing` already existed as a soft-skip flag for `Set*Workers`, but: 1. The bool read/write was unsynchronised — the `if lockWorkersEditing` check could observe a stale `false` while `buildFiles` had just toggled it true, leading to a concurrent `CompressorCfg.Workers = i` write. 2. `LockWorkersEditing` / `UnlockWorkersEditing` were never actually called around `buildFiles` itself. ## Fix * New `Aggregator.workersMu sync.Mutex`. * All four `Set*Workers` (Collate/Build, Merge, Compress, BuildAccessors) acquire `workersMu` before the flag-check + write. The combined check+write under the mutex closes the load-then-write race window. * `LockWorkersEditing` / `UnlockWorkersEditing` acquire `workersMu` around the bool toggle — provides a happens-before edge so any prior `Set*Workers`' write is visible to the `buildFiles` goroutine that subsequently reads `CompressorCfg`, and any concurrent `Set*Workers` blocks briefly until the toggle is observable. * `Aggregator.buildFiles` wraps its body in `LockWorkersEditing` / `defer UnlockWorkersEditing` so chain-tip-driven `Set*Workers` calls are no-oped for the duration of collation. ## Test plan - [x] `TestHistoryVerification_SimpleBlocks -race` passes (was the original repro) - [x] `go test -race -short ./db/state/...` clean (includes the aggregator + state-domain test suites) - [x] `make lint` clean - [x] No public-API change beyond making `LockWorkersEditing` / `UnlockWorkersEditing` semantically stronger — they now actually synchronise rather than being a flag with no fence
…igontech#21017) > **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the substantive parallel-exec correctness + perf fix family (`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is now on `main`. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis matrix, and the `setup-erigon` apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review. > > **Landed precursors:** > - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR > - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate > - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths > - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit > - ✅ erigontech#21010 (merged 2026-05-07) — `commitmentCalculator.asOfReader.txNum=0` lazy-load fix > > Open follow-up issues (separate from this PR's scope): erigontech#21106 (parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`), erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated SkipLoads), erigontech#21138 (heuristic-removal structural cleanup). ## Summary Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test workflow that exercises the EL execution path so divergence between `dbg.Exec3Parallel` true and false is caught on the PR rather than after release. The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in [common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the `ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in [common/dbg/experiments.go:81](common/dbg/experiments.go#L81). ## Why `Exec3Parallel = false` is currently the source default on `main`. With no CI workflow setting the env var, every CI run today exercises only the **serial** path. The parallel path lands via `--bal` chains and tests like `experimentalBAL`, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal. ## Affected workflows **Always-on** (matrix on every PR / dispatch / call): | Workflow | Runner | Parallelism | Cost | |---|---|---|---| | test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel | wall-clock unchanged, +1× runner-min | | test-all-erigon-race.yml | GitHub-hosted | true parallel | same | | test-hive.yml | hive group | parallel if pool has slots | same (group is sized) | | test-hive-eest.yml | hive group | parallel if pool has slots | same | | test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same | **Auto-fire on PRs touching their own YAML, dispatch otherwise** (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation): | Workflow | Notes | |---|---| | test-bench.yml | `go bench` | | qa-rpc-integration-tests-latest.yml | self-hosted single-pool, `max-parallel: 1` (shared testbed state) | | qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel, geth single-mode (placeholder ignored) | | qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` | | qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries | **Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. ## Plumbing details * Workflows that build erigon and run go tests directly: env var set on the test step's `env:` block. * Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is appended to `clients/erigon/Dockerfile` during the existing sed-patch loop so every hive-launched erigon inherits the toggle. * Kurtosis-based workflows: a single upstream `build-erigon-image` job builds `test/erigon:current-base` once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + `docker load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all writing the same `type=gha` cache scope and 504-ing. * Self-hosted single-pool workflows use `max-parallel: 1` to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by `exec_mode` so the two matrix entries don't clobber each other's outputs. ## Review follow-ups (2026-05-15 rebuild) After rebasing onto post-erigontech#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits: | Commit | Fixes | |---|---| | `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries | | `ci: align test-hive devp2p sim-limit between serial and parallel matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment) | | `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 — s/Targetting/Targeting/ | The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631` ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly `.github/` only. ## Test plan This PR's CI run **is** the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their `pull_request: paths` filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows. What to look for in this PR's "Checks" tab: - [x] Each affected workflow has both `serial` and `parallel` matrix entries listed. - [x] Job display names show the mode (e.g. `tests-mac-linux (ubuntu-24.04, parallel)`). - [x] Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners). - [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize). - [x] Both modes pass on every workflow. **If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.** After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: milen <taratorio@users.noreply.github.com>
Summary
Drops the fallback clause in
applyLoopMissingBlocksthat turned every legitimate partial-batch exit into a spuriousInvalidBlockerror.When the parallel exec loop hits its size-limit mid-fixture, it returns nil with
reachedMaxBlock=false. The previous completeness check flaggedmaxBlockNumas "missing" whenever!reachedMaxBlock— but the size-limit path is normal: the apply loop returnsErrLoopExhausted, the stage loop resumes the next batch fromlastBlockResult+1, and each block still executes exactly once. No re-execution.Update — additional fix in commit
310a094514(execution/stagedsync: honor blockResult.Exhausted in exec loop)Same class of partial-batch orchestration bug, on the exec-loop side.
executeBlocksmarks the final dispatched block withExhaustedwhen the per-cycle block limit fires, then exits its goroutine — without closingpe.execRequests, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting forexecRequests/rws.ResultCh/ctx.Done— none of which ever fire. The apply loop never got the channel-close signal it needs to returnErrLoopExhausted, so the original (apply-loop-side) fix in this PR couldn't even run.Reproduced as a chiado parallel-exec silent stall at block 150662 with
EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000from block 0 (issue #20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlyingErrWrongTrieRootnow surfaces cleanly through the apply loop and the original-issue symptom appears as expected.The new branch sits between the
maxBlockNum-reached check and theStopAfterBlockdebug exit, matching the precedence of the existing partial-batch exits.Tests:
TestExecLoopHonorsBlockResultExhaustedruns two models of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously.TestExecLoopExhaustedOnlySetOnFinalBlocklocks in the dispatcher convention so futureexecuteBlockschanges can't silently drop already-queued work by settingExhaustedmid-stream.Why
Reproduces deterministically as the parallel-mode
BenchmarkFeeHistoryfailure on PR #21017 (the CI matrix that runs every test under bothserialandparallelexec modes):200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms:
ErrLoopExhaustedreachedMaxBlock=true, clean nilSd.mem is not contaminated by in-flight future-block writes —
txResultBlocksat exit was exactly[1..114]. Block 200 was flagged solely from the fallback clause.Shutdown sequence verified
Tracing the deferred-close in
execLoop:commitResultscloses first, calculator drains and closesrootResults, apply loop sees both closes, drains, runs the completeness check, returnsErrLoopExhausted.executorCancelthen cancelsexecLoopCtx,executeBlocksand workers exit,pe.waitjoins. Three batches inBenchmarkFeeHistoryrun end-to-end with no goroutine leaks.Tests
TestApplyLoopMissingBlocks— adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" caseTestApplyLoopPartialBatchReturnsErrLoopExhausted— exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenariosTestApplyLoopChannelCloseOrder— pins the load-bearingcommitResults-before-applyResultsclose orderTestExecLoopHonorsBlockResultExhausted— orchestration test for the exec-loop side of the partial-batch exit (added in commit310a094514)TestExecLoopExhaustedOnlySetOnFinalBlock— pins the executeBlocks dispatcher convention (added in commit310a094514)Test plan
BenchmarkFeeHistory/full/blocks=200passes withERIGON_EXEC3_PARALLEL=truemake lintcleango test -race ./execution/stagedsync/...for new tests passesTestApplyLoopRootResultsCloseDoesNotRace,TestApplyLoopDoesNotHangAfterRootResultsClose,TestExecLoopExitCheck*still passEXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archivefrom block 0 reaches the underlyingErrWrongTrieRootat block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires)CI fix linkage
Precursor for #21017 (CI matrix
exec_mode={serial,parallel}). Without this,bench / benchmarks (parallel)and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to #21032 (incarnation/SD differentiator fix).