
common/dbg: default EXEC3_PARALLEL=true #21591

Merged
mh0lt merged 2 commits into main from mh/exec3-parallel-default-true on Jun 5, 2026
@mh0lt mh0lt commented Jun 2, 2026

Contributor

Closes #17630

Summary

Flip the EXEC3_PARALLEL env var default from false to true, so a plain ./build/bin/erigon run uses parallel execution by default. EXEC3_PARALLEL=false (or ERIGON_EXEC3_PARALLEL=false) still forces the serial path.

One-line change in common/dbg/experiments.go.
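
To make the effect of the flipped default concrete, here is a minimal sketch of how a boolean env var with a default might be resolved. The helper `envBool`, its injectable `lookup` parameter, and the order in which the plain and `ERIGON_`-prefixed names are checked are assumptions for illustration; the actual code in common/dbg may differ.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envBool is a hypothetical helper, not Erigon's actual code. It resolves a
// boolean setting from NAME, then ERIGON_NAME, and falls back to def when
// neither is set or the value does not parse as a bool.
func envBool(lookup func(string) (string, bool), name string, def bool) bool {
	for _, key := range []string{name, "ERIGON_" + name} {
		if raw, ok := lookup(key); ok {
			if v, err := strconv.ParseBool(raw); err == nil {
				return v
			}
		}
	}
	return def
}

func main() {
	// With a default of true, an unset variable selects the parallel path;
	// EXEC3_PARALLEL=false still selects the serial path.
	parallel := envBool(os.LookupEnv, "EXEC3_PARALLEL", true)
	fmt.Println("parallel execution:", parallel)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the sketch testable without touching the process environment.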

Test plan

  • make erigon clean.
  • make lint clean.
  • CI on this branch.

Note on dependency

Backed by the soak validation done in #21590 (chiado from-0 to tip, sepolia from-0 past 4,913,058, hoodi/mainnet clean — see that PR's test plan). #21590 should land first so the new default ships with the correctness fixes already in place. If #21591 merges first by accident, users hitting the SD-revival / metamorphic-CREATE2 patterns in parallel exec would regress until #21590 lands.

A separate gnosis race-class bug (4 occurrences in ~300K blocks around 14.59M-14.89M, recurring + non-deterministic) surfaced during the same soak — it is not addressed by either PR and is tracked in docs/plans/20260602-gnosis-parallel-exec-race-14594499.md. Flipping the default doesn't make it worse: users who hit it can always set EXEC3_PARALLEL=false.

@mh0lt mh0lt changed the base branch from main to mh/parallel-exec-sd-revival-metamorphic-create2 June 2, 2026 18:06
@mh0lt mh0lt force-pushed the mh/exec3-parallel-default-true branch from d1fe513 to 9246cc2 June 2, 2026 18:07
Flip the EXEC3_PARALLEL env var default from false to true, making
parallel execution the default for plain `./build/bin/erigon` runs.
Setting EXEC3_PARALLEL=false (or ERIGON_EXEC3_PARALLEL=false) still
forces the serial path.

Depends on #21590 (the parallel-exec SD-revival and metamorphic-CREATE2
fixes) — that PR must merge first so the new default lands with the
correctness fixes already in place. The 5-chain soak validation that
backs the flip is documented in #21590's test plan.
@mh0lt mh0lt force-pushed the mh/exec3-parallel-default-true branch from 9246cc2 to 917ae36 June 2, 2026 18:08
@mh0lt mh0lt changed the title from "common/dbg, cmd/integration, docs: default EXEC3_PARALLEL=true" to "common/dbg: default EXEC3_PARALLEL=true" Jun 2, 2026
@mh0lt mh0lt changed the base branch from mh/parallel-exec-sd-revival-metamorphic-create2 to main June 2, 2026 18:08
@mh0lt mh0lt enabled auto-merge June 5, 2026 10:41
@mh0lt mh0lt added this pull request to the merge queue Jun 5, 2026
Merged via the queue into main with commit 011159c Jun 5, 2026
87 of 88 checks passed
@mh0lt mh0lt deleted the mh/exec3-parallel-default-true branch June 5, 2026 11:55
domiwei pushed a commit to domiwei/erigon that referenced this pull request Jun 8, 2026
…mate (erigontech#21667)

## Summary

Fixes a parallel-exec OCC correctness bug: the apply loop was flushing
writes from txs that **failed validation** (`VersionInvalid`) with the
`complete=Done` flag, leaking phantom committed state into the
versionMap, where downstream readers picked it up.

## Symptoms before this fix

- **Intermittent wrong-trie-root failures** during from-genesis parallel
  sync (PR erigontech#21591 made `EXEC3_PARALLEL=true` the default).
- **Pattern-2 family**: gas-mismatch failures on DEX strategy contracts,
  especially in the gnosis 14.8M-18.5M block range.
- **Cascading mis-execution**: a phantom write at tx[N] is picked up by
  tx[N+k] via `MapRead`. Version-only validation passes because
  `readVersion == writtenVersion`. The downstream tx commits state derived
  from the phantom, and the divergence surfaces as a gas-mismatch tens of
  thousands of blocks later.

Captured deterministically at gnosis block **18,483,405**: tx[3] inc=0
was flagged Invalid by `ValidateVersionBlock`, but its 28 writes were
flushed with `complete=true` because `cntInvalid == 0` (the counter only
tracks *prior* `VersionTooEarly` txs). Slot 0x08 on contract
`0x18b2b7673c6d661923e9460d592699617828b293` was stored with value
`aabS...5981`. tx[16] subsequently read that phantom via `MapRead` and
committed phantom-derived state. The cascade surfaced as a gas-mismatch
roughly 80K blocks downstream.
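
The cascade described above can be illustrated with a toy multi-version map. This is only a sketch: `toyVersionMap`, `flagDone`, `flagEstimate`, and the `read*` results are hypothetical stand-ins, not the types in Erigon's `VersionMap`, and the key and value are placeholders.

```go
package main

import "fmt"

// All names below are hypothetical stand-ins for illustration only.
type flag int

const (
	flagDone     flag = iota // write is treated as committed
	flagEstimate             // write is a placeholder from an invalidated tx
)

type readResult int

const (
	readNone       readResult = iota // no earlier tx wrote this key
	readDone                         // value comes from a committed write
	readDependency                   // reader must wait for the writer's retry
)

type cell struct {
	value string
	flag  flag
}

// toyVersionMap stores, per key, the latest write of each tx index.
type toyVersionMap map[string]map[int]cell

func (m toyVersionMap) write(key string, txIdx int, value string, f flag) {
	if m[key] == nil {
		m[key] = map[int]cell{}
	}
	m[key][txIdx] = cell{value: value, flag: f}
}

// read returns the write from the highest tx index below txIdx.
func (m toyVersionMap) read(key string, txIdx int) (string, readResult) {
	best := -1
	for idx := range m[key] {
		if idx < txIdx && idx > best {
			best = idx
		}
	}
	if best < 0 {
		return "", readNone
	}
	c := m[key][best]
	if c.flag == flagEstimate {
		return "", readDependency
	}
	return c.value, readDone
}

func main() {
	// Buggy behaviour: an invalidated tx[3] is flushed as Done, so tx[16]
	// reads the phantom value as if it were committed.
	buggy := toyVersionMap{}
	buggy.write("slot", 3, "phantom", flagDone)
	v, res := buggy.read("slot", 16)
	fmt.Println("buggy:", v, res == readDone)

	// Flushing the same write as Estimate turns the read into a dependency.
	fixed := toyVersionMap{}
	fixed.write("slot", 3, "phantom", flagEstimate)
	_, res = fixed.read("slot", 16)
	fmt.Println("fixed:", res == readDependency)
}
```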

## Fix

New helper `applyLoopFlushAsComplete(valid, cntInvalid) = valid &&
cntInvalid == 0` gates the `complete` flag. An invalidated tx now
flushes as **Estimate**, which causes downstream reads of those cells to
return `MVReadResultDependency`. The validator treats that as
`VersionInvalid`, forcing the dependent tx to re-execute after the retry
settles, restoring OCC's `Done = committed` invariant.
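
As a sketch of the gate described above, the helper can be written directly from the quoted formula. The parameter types (`bool`, `int`) are assumptions; only the expression `valid && cntInvalid == 0` comes from the PR description.

```go
package main

import "fmt"

// applyLoopFlushAsComplete mirrors the formula quoted in the PR description.
// The parameter types are assumed for this sketch.
func applyLoopFlushAsComplete(valid bool, cntInvalid int) bool {
	return valid && cntInvalid == 0
}

func main() {
	// Print the four-case truth table over (valid, cntInvalid).
	for _, valid := range []bool{true, false} {
		for _, cntInvalid := range []int{0, 1} {
			fmt.Printf("valid=%t cntInvalid=%d -> complete=%t\n",
				valid, cntInvalid, applyLoopFlushAsComplete(valid, cntInvalid))
		}
	}
}
```

The regression case is `valid=false, cntInvalid=0`: under the old `cntInvalid == 0` check alone it would have been flushed as complete.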

Companion clarity changes in the same files:

- `exec3_parallel.go`: read `txVersion` from `txResult.Task.Version()`
  (the `*taskVersion` wrapper that carries the current `Incarnation`)
  instead of bare `TxTask`, which always returns `Incarnation=0`. Logs
  and traces now show the correct incarnation on retries.
- `txtask.go`: drop the dead `TxTask.Incarnation` field. The live source
  of truth is the `txIncarnations` counter plus the `taskVersion`
  wrapper.

## Verification

- **Unit tests**
  (`execution/stagedsync/exec3_parallel_robustness_test.go`):
  - `TestApplyLoopFlushAsComplete` - 4-case truth-table on the helper,
    including the regression guard case "INVALID current tx -> must NOT
    be Done".
  - `TestApplyLoopFlush_InvalidTxWritesAreEstimate` - drives the
    production helper through a real `VersionMap` using the gnosis
    18,483,405 contract+slot, asserts downstream read returns
    `MVReadResultDependency` (not `MVReadResultDone`).
  - Both confirmed to FAIL with the helper temporarily reverted to
    `cntInvalid == 0`, PASS with the fix.

- **Gnosis oracle CHECK** (live): swept blocks **18.46M-18.70M+
  end-to-end** with zero `STATE-ORACLE-DIVERGENCE` events, zero
  gas-mismatch, zero wrong-trie-root.

- **From-genesis resync** (live): currently past historical fail blocks
  **14,837,280 / 14,845,967 / 16,420,113** without recurrence; still in
  progress through the rest of the 14.8M-18.5M range.

## Milestone

**Needs to be merged before 3.5 is cut.** With PR erigontech#21591 making
`EXEC3_PARALLEL=true` the default on main, this bug class affects all
3.5+ users running parallel execution. Marked for the 3.5.0 milestone.

## Test plan

- [ ] CI green
- [ ] Reviewer checks the helper's truth-table covers all four
      `(valid, cntInvalid)` cases
- [ ] Reviewer confirms the `MVReadResultDependency` -> validator
      -> re-exec path closes the OCC invariant

---------

Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
bloxster added a commit that referenced this pull request Jun 12, 2026
…21687)

Addresses code→docs gaps found in weekly maintenance run w24, plus some
Fundamentals housekeeping. **Scope: release/3.5 only.**

### Flags & env vars (`configuring-erigon`)
- Add `--snap.chaintoml-url` flag (PR #21584), including
`ERIGON_REMOTE_PREVERIFIED` override precedence
- Add `--snap.p2p-manifest` flag (PR #20526), tagged *(New in v3.5)*
- Update `EXEC3_PARALLEL` default to `true` (PR #21591)
- Add `--exec.serial` and `--exec.no-prune` flags (new in v3.5); also
document `--exec.batched-io` and `--exec.state-cache`
- Document `--exec.workers`. Its effective default is the **full
CPU-core count** (inherited from the `EXEC3_WORKERS` fallback when the
flag is unset); the flag's own `--help` text says "half the CPU cores",
which is a known inconsistency in the binary

### Prune modes / snapshots
- Update `pruning-modes.md` with the EIP-8252 retention-window breaking
change (v3.5): full mode now prunes block bodies/receipts to the last
262,144 blocks (previously kept all post-merge blocks) and will stop
serving older block/receipt data; state-history window grows 100k→262k
- Add new "Snapshots Management" page under Fundamentals (`seg du`,
snapshot categories, node-type estimates, EIP-8252 retention window).
The `seg du` example uses an archive datadir, since the estimator only
sums on-disk files and the archive row therefore equals the current
total
- Fix the `erigon snapshots …` ver-format upgrade/downgrade commands in
`get-started/installation/upgrading.md`

### Fundamentals section housekeeping
- Fix a Mermaid parse error on the **Architecture** page — the
Caplin→Execution edge label had unquoted parentheses (`|new
blocks<br/>(Engine API)|`), which the flowchart parser rejects; now
quoted
- Reorder the Fundamentals sidebar into a clean integer reading order.
It had grown to ~23 entries with colliding `sidebar_position` values and
scattered related pages. New order: concepts first (Architecture,
Database, Pruning Modes, Snapshots Management, Caplin) → configuration →
operations/tuning → security → integrations
- **NAT** moved out of the CLI Reference subfolder to a top-level
Fundamentals page (next to Default Ports):
`/fundamentals/configuring-erigon/nat` → `/fundamentals/nat`. The old
URL is preserved via a client-side redirect
(`@docusaurus/plugin-client-redirects`, pinned to 3.10.0); the site is
on GitHub Pages, which can't do host-level 301s
- The **CLI Reference** page was flattened
(`configuring-erigon/index.mdx` → `configuring-erigon.mdx`); its
`/fundamentals/configuring-erigon` URL is unchanged

### Mobile UI
- Adds a theme-color meta tag to the docs site so mobile browsers tint
the address/status bar with the Erigon brand orange — the same behavior
the main website, Cocoon and Zilkworm docs already have.

Regenerated `llms.txt` / `llms-full.txt` artifacts.

---
_Updated after Copilot + @yperbasis review: replaced the `seg du`
example with **real mainnet-archive output** (correct ByteCount
renderings, `extensions:` line, byte-exact estimates table); documented
`other_extensions`; made `--datadir` optional and chain-agnostic;
clarified full-mode block-pruning impact; corrected the `--exec.workers`
default and tagged all `--exec.*` flags as new in v3.5; added
`--exec.batched-io` / `--exec.state-cache`; fixed the RPC cert filename
typos in **TLS Authentication** (`RPC key.pem` → `RPC-key.pem`,
`RPC.crtv` → `RPC.crt`); gave NAT an integer sidebar position and moved
the **Modules** overview first; removed a stray editorial note in the
CLI reference; typed the `seg du` code fences as `text`; de-duplicated
pruning-mode concepts on the Snapshots page (now deferring to
`pruning-modes.md`); and fixed `sidebar_position` collisions._

---------

Co-authored-by: Andy (NanoClaw) <andy@nanoclaw.ai>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Bloxster <gianni.morselli@erigon.tech>
Development

Successfully merging this pull request may close these issues.

Complete Testing of parallel exec and turn it in as default
