common/dbg: default EXEC3_PARALLEL=true#21591
Merged
d1fe513 to 9246cc2
Flip the EXEC3_PARALLEL env var default from false to true, making parallel execution the default for plain `./build/bin/erigon` runs. Setting EXEC3_PARALLEL=false (or ERIGON_EXEC3_PARALLEL=false) still forces the serial path. Depends on #21590 (the parallel-exec SD-revival and metamorphic-CREATE2 fixes) — that PR must merge first so the new default lands with the correctness fixes already in place. The 5-chain soak validation that backs the flip is documented in #21590's test plan.
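The default-flip semantics described above can be illustrated with a minimal Go sketch. This is not the code in `common/dbg`: `envBool` and its `lookup` parameter are hypothetical names introduced here, and the order in which the plain and `ERIGON_`-prefixed names are consulted is an assumption, since the description only says that either one forces the serial path.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envBool is a hypothetical helper (not Erigon's actual implementation).
// It returns the boolean value of the env var `name`, also accepting the
// "ERIGON_"-prefixed spelling, and falls back to def when neither is set
// or the value does not parse as a boolean.
// Assumption: the unprefixed name is checked first.
func envBool(lookup func(string) (string, bool), name string, def bool) bool {
	for _, key := range []string{name, "ERIGON_" + name} {
		if raw, ok := lookup(key); ok {
			if v, err := strconv.ParseBool(raw); err == nil {
				return v
			}
		}
	}
	return def
}

func main() {
	// With the default flipped to true, an unset variable selects the
	// parallel path; EXEC3_PARALLEL=false selects the serial path.
	parallel := envBool(os.LookupEnv, "EXEC3_PARALLEL", true)
	fmt.Println("parallel execution enabled:", parallel)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly inside the helper, keeps the sketch easy to exercise without mutating the process environment.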
9246cc2 to 917ae36
taratorio approved these changes Jun 3, 2026
3 tasks
domiwei pushed a commit to domiwei/erigon that referenced this pull request Jun 8, 2026
…mate (erigontech#21667)

## Summary

Fixes a parallel-exec OCC correctness bug: the apply loop was flushing writes from txs that **failed validation** (`VersionInvalid`) with the `complete=Done` flag, leaking phantom committed state into the versionMap where downstream readers picked it up.

## Symptoms before this fix

- **Intermittent wrong-trie-root failures** during from-genesis parallel sync (PR erigontech#21591 made `EXEC3_PARALLEL=true` the default).
- **Pattern-2 family**: gas-mismatch failures on DEX strategy contracts, especially in the gnosis 14.8M-18.5M block range.
- **Cascading mis-execution**: a phantom write at tx[N] is picked up by tx[N+k] via `MapRead`. Version-only validation passes because `readVersion == writtenVersion`. The downstream tx commits state derived from the phantom, and the divergence surfaces as a gas-mismatch tens of thousands of blocks later.

Captured deterministically at gnosis block **18,483,405**: tx[3] inc=0 was flagged Invalid by `ValidateVersionBlock`, but its 28 writes were flushed with `complete=true` because `cntInvalid == 0` (the counter only tracks *prior* `VersionTooEarly` txs). Slot 0x08 on contract `0x18b2b7673c6d661923e9460d592699617828b293` was stored with value `aabS...5981`. tx[16] subsequently read that phantom via `MapRead` and committed phantom-derived state. The cascade surfaced as a gas-mismatch roughly 80K blocks downstream.

## Fix

New helper `applyLoopFlushAsComplete(valid, cntInvalid) = valid && cntInvalid == 0` gates the `complete` flag. An invalidated tx now flushes as **Estimate**, which causes downstream reads of those cells to return `MVReadResultDependency`. The validator treats that as `VersionInvalid`, forcing the dependent tx to re-execute after the retry settles - restoring OCC's `Done = committed` invariant.

Companion clarity changes in the same files:

- `exec3_parallel.go`: read `txVersion` from `txResult.Task.Version()` (the `*taskVersion` wrapper that carries the current `Incarnation`) instead of bare `TxTask` which always returns `Incarnation=0`. Logs and traces now show the correct incarnation on retries.
- `txtask.go`: drop the dead `TxTask.Incarnation` field. Live source of truth is the `txIncarnations` counter plus the `taskVersion` wrapper.

## Verification

- **Unit tests** (`execution/stagedsync/exec3_parallel_robustness_test.go`):
  - `TestApplyLoopFlushAsComplete` - 4-case truth-table on the helper, including the regression guard case "INVALID current tx -> must NOT be Done".
  - `TestApplyLoopFlush_InvalidTxWritesAreEstimate` - drives the production helper through a real `VersionMap` using the gnosis 18,483,405 contract+slot, asserts downstream read returns `MVReadResultDependency` (not `MVReadResultDone`).
  - Both confirmed to FAIL with the helper temporarily reverted to `cntInvalid == 0`, PASS with the fix.
- **Gnosis oracle CHECK** (live): swept blocks **18.46M-18.70M+ end-to-end** with zero `STATE-ORACLE-DIVERGENCE` events, zero gas-mismatch, zero wrong-trie-root.
- **From-genesis resync** (live): currently past historical fail blocks **14,837,280 / 14,845,967 / 16,420,113** without recurrence; still in progress through the rest of the 14.8M-18.5M range.

## Milestone

**Needs to be merged before 3.5 is cut.** With PR erigontech#21591 making `EXEC3_PARALLEL=true` the default on main, this bug class affects all 3.5+ users running parallel execution. Marked for the 3.5.0 milestone.

## Test plan

- [ ] CI green
- [ ] Reviewer checks the helper's truth-table covers all four `(valid, cntInvalid)` cases
- [ ] Reviewer confirms the `MVReadResultDependency` -> validator -> re-exec path closes the OCC invariant

---------

Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
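The gating rule quoted in the Fix section of the commit message above can be sketched in a few lines of Go. Only the formula `valid && cntInvalid == 0` and the Done/Estimate distinction come from the commit message; the `int` type for `cntInvalid` and the `flushLabel` helper are assumptions made for illustration, not the signatures used in `exec3_parallel.go`.

```go
package main

import "fmt"

// applyLoopFlushAsComplete follows the formula quoted in the commit message:
// a tx's writes may be flushed as complete only if the tx itself validated
// AND no earlier tx in the block has been counted as invalid.
// Assumption: cntInvalid is modeled as a plain int here.
func applyLoopFlushAsComplete(valid bool, cntInvalid int) bool {
	return valid && cntInvalid == 0
}

// flushLabel is a hypothetical helper that names the resulting flush mode.
func flushLabel(valid bool, cntInvalid int) string {
	if applyLoopFlushAsComplete(valid, cntInvalid) {
		return "Done"
	}
	return "Estimate"
}

func main() {
	// The four (valid, cntInvalid) combinations from the truth table.
	cases := []struct {
		valid      bool
		cntInvalid int
	}{
		{true, 0},  // valid tx, no prior invalid txs
		{true, 1},  // valid tx, but an earlier tx was invalid
		{false, 0}, // the regression case: invalid tx, counter still zero
		{false, 1}, // invalid tx after an earlier invalid tx
	}
	for _, c := range cases {
		fmt.Printf("valid=%t cntInvalid=%d -> %s\n",
			c.valid, c.cntInvalid, flushLabel(c.valid, c.cntInvalid))
	}
}
```

The third case is the one the commit message describes as previously slipping through: with a gate of only `cntInvalid == 0`, an invalid tx with no prior invalid txs would have been flushed as Done.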
bloxster added a commit that referenced this pull request Jun 12, 2026
…21687)

Addresses code→docs gaps found in weekly maintenance run w24, plus some Fundamentals housekeeping. **Scope: release/3.5 only.**

### Flags & env vars (`configuring-erigon`)

- Add `--snap.chaintoml-url` flag (PR #21584), including `ERIGON_REMOTE_PREVERIFIED` override precedence
- Add `--snap.p2p-manifest` flag (PR #20526), tagged *(New in v3.5)*
- Update `EXEC3_PARALLEL` default to `true` (PR #21591)
- Add `--exec.serial` and `--exec.no-prune` flags (new in v3.5); also document `--exec.batched-io` and `--exec.state-cache`
- Document `--exec.workers`. Its effective default is the **full CPU-core count** (inherited from the `EXEC3_WORKERS` fallback when the flag is unset); the flag's own `--help` text says "half the CPU cores", which is a known inconsistency in the binary

### Prune modes / snapshots

- Update `pruning-modes.md` with the EIP-8252 retention-window breaking change (v3.5): full mode now prunes block bodies/receipts to the last 262,144 blocks (previously kept all post-merge blocks) and will stop serving older block/receipt data; state-history window grows 100k→262k
- Add new "Snapshots Management" page under Fundamentals (`seg du`, snapshot categories, node-type estimates, EIP-8252 retention window). The `seg du` example uses an archive datadir, since the estimator only sums on-disk files and the archive row therefore equals the current total
- Fix the `erigon snapshots …` ver-format upgrade/downgrade commands in `get-started/installation/upgrading.md`

### Fundamentals section housekeeping

- Fix a Mermaid parse error on the **Architecture** page: the Caplin→Execution edge label had unquoted parentheses (`|new blocks<br/>(Engine API)|`), which the flowchart parser rejects; now quoted
- Reorder the Fundamentals sidebar into a clean integer reading order. It had grown to ~23 entries with colliding `sidebar_position` values and scattered related pages. New order: concepts first (Architecture, Database, Pruning Modes, Snapshots Management, Caplin) → configuration → operations/tuning → security → integrations
- **NAT** moved out of the CLI Reference subfolder to a top-level Fundamentals page (next to Default Ports): `/fundamentals/configuring-erigon/nat` → `/fundamentals/nat`. The old URL is preserved via a client-side redirect (`@docusaurus/plugin-client-redirects`, pinned to 3.10.0); the site is on GitHub Pages, which can't do host-level 301s
- The **CLI Reference** page was flattened (`configuring-erigon/index.mdx` → `configuring-erigon.mdx`); its `/fundamentals/configuring-erigon` URL is unchanged

### Mobile UI

- Adds a theme-color meta tag to the docs site so mobile browsers tint the address/status bar with the Erigon brand orange, the same behavior the main website, Cocoon and Zilkworm docs already have.

Regenerated `llms.txt` / `llms-full.txt` artifacts.

---

_Updated after Copilot + @yperbasis review: replaced the `seg du` example with **real mainnet-archive output** (correct ByteCount renderings, `extensions:` line, byte-exact estimates table); documented `other_extensions`; made `--datadir` optional and chain-agnostic; clarified full-mode block-pruning impact; corrected the `--exec.workers` default and tagged all `--exec.*` flags as new in v3.5; added `--exec.batched-io` / `--exec.state-cache`; fixed the RPC cert filename typos in **TLS Authentication** (`RPC key.pem` → `RPC-key.pem`, `RPC.crtv` → `RPC.crt`); gave NAT an integer sidebar position and moved the **Modules** overview first; removed a stray editorial note in the CLI reference; typed the `seg du` code fences as `text`; de-duplicated pruning-mode concepts on the Snapshots page (now deferring to `pruning-modes.md`); and fixed `sidebar_position` collisions._

---------

Co-authored-by: Andy (NanoClaw) <andy@nanoclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Bloxster <gianni.morselli@erigon.tech>
Closes #17630
## Summary

Flip the `EXEC3_PARALLEL` env var default from `false` to `true`, so a plain `./build/bin/erigon` run uses parallel execution by default. `EXEC3_PARALLEL=false` (or `ERIGON_EXEC3_PARALLEL=false`) still forces the serial path.

One-line change in `common/dbg/experiments.go`.

## Test plan

- `make erigon` clean.
- `make lint` clean.

## Note on dependency

Backed by the soak validation done in #21590 (chiado from-0 to tip, sepolia from-0 past 4,913,058, hoodi/mainnet clean; see that PR's test plan). #21590 should land first so the new default ships with the correctness fixes already in place. If #21591 merges first by accident, users hitting the SD-revival / metamorphic-CREATE2 patterns in parallel exec would regress until #21590 lands.

A separate gnosis race-class bug (4 occurrences in ~300K blocks around 14.59M-14.89M, recurring + non-deterministic) surfaced during the same soak. It is not addressed by either PR and is tracked in docs/plans/20260602-gnosis-parallel-exec-race-14594499.md. Flipping the default doesn't make it worse: users who hit it can always set `EXEC3_PARALLEL=false`.