Skip to content

debug_executionWitness: legacy + canonical conformance#21629

Merged
awskii merged 70 commits into
mainfrom
awskii/execution-witness-legacy-canonical
Jun 5, 2026
Merged

debug_executionWitness: legacy + canonical conformance#21629
awskii merged 70 commits into
mainfrom
awskii/execution-witness-legacy-canonical

Conversation

@awskii

@awskii awskii commented Jun 4, 2026

Copy link
Copy Markdown
Member

Adds the zkEVM (canonical) conformance runner for debug_executionWitness, the legacy-mode field population, the optional mode request param, and a format spec.

Changes

#21487

Carries #21487's suite plus the fixes that make it pass; #21487 is redundant once CI is green here. The open question from #21487 — whether a fixture-gated suite belongs in the required race-gate — is resolved: it's now a zkevmtest runner shard in the eest-spec-tests workflow (the #21092 pattern), not a go test package.

Mainnet legacy soak vs reth-legacy

Rolling soak near tip (block ~25.25M, ~3.7k blocks checked):

  • 0 health failures — no 500s, no root mismatches; every witness passes verifyWitnessStateless.
  • 90.6% strict set-equal on {keys, codes, state}.
  • Divergences are almost all over-inclusion (204 blocks keys, 86 codes, 62 state) — a few extra nodes per block, still re-executing to the correct root. 3 blocks show minor keys/codes under-inclusion, none affecting verifyWitnessStateless.

awskii added 30 commits May 27, 2026 16:58
… queries (Task 8)

Full corpus run: 15425 PASS / 7107 FAIL subtests across 356/2871 fixture
files. Suite is red by design; failures recorded for human triage, not muted.

Runner fix found during the run: invalid-block fixtures (expectException)
carry an executionWitness but no canonical blocknumber, so debug_executionWitness
can't be queried for them. Capture expectException per block and skip the
witness query for rejected blocks (RunWithTester already asserts rejection).
This removed 972 false "no parseable block number" failures.

Dominant remaining signal: debug_executionWitness returns a strictly smaller
witness (state/codes/headers) than EEST's canonical stateless witness across
nearly every feature - one-directional shortfall, likely a single root cause
in the witness builder. Plus EIP-7702 stateless re-exec rejecting delegated
EOA senders. Recorded under Task 8 results for triage.
…ing, add witness parser unit test

- tools/test-groups: document the greedy-partition ordering invariant
  (subset groups must precede broader ones or resolve empty)
- execution/tests/testutil: add fixture-independent unit test for the
  WitnessBlockTest parser accessors (NumBlocks, ExpectedWitnessForBlock,
  BlockNumberForBlock, BlockExpectsException), covering them when the
  downloaded corpus is absent
- test-all-erigon-race.yml: trim over-verbose cache comment per comment policy
…ness

# Conflicts:
#	execution/tests/testutil/block_test_util.go
main (#21532) changed ExecutionWitnessResult.Headers to []hexutil.Bytes
(RLP-encoded). Update rpcHeaderHashes to RLP-decode like expectedHeaderHashes
instead of reading a map["hash"] field.
zkevm@v0.4.0 charges EIP-8037 AUTH_BASE on 7702 clears of an undelegated
authority; current spec and tests-bal@v7.2.0 refund it. SkipLoad the 21
affected for_amsterdam subtests until the corpus is bumped. Tracking: #21563.
…fixtures

The witness_validation_* fixtures store deliberately mutated executionWitnesses
(stateless-verifier negative tests), so producer comparison always diverges -
erigon's produced witness is the canonical one. SkipLoad the 22 affected files
until a stateless-verify consumer mode exists. Tracking: #21566.
…rom execution witness

debug_executionWitness computes the post-state commitment by replaying a
block's updates in key-sorted order. Deleting a key that reduces a branch to a
single child collapses it, and the collapse tracer touches the surviving
sibling into the witness. When a later insert in the same block refills that
branch, its final form has >=2 children and the sibling is not part of the
canonical witness, so the witness carried one extra node.

Carry the collapsing branch's prefix through CollapseTracer. After the
post-state commitment, read each candidate branch from the in-memory commitment
domain and drop the sibling when the branch ended with >=2 children (transient
collapse); keep siblings whose branch genuinely collapsed.
… filter

- defer SetCollapseTracer(nil) so the tracer is cleared on the error and
  root-mismatch paths, not only on success
- clear collapseTracer in resetForReuse() so a pooled trie cannot retain a
  stale closure
- preallocate siblingPaths
- reword the filter comment: it restores canonical witness membership, it does
  not reproduce the spec's insert-before-delete replay order
Add ERIGON_WITNESS_MODE (default legacy) selecting the debug_executionWitness
output format; canonical preserves the minimized format. Legacy codes emit every
bytecode loaded during execution (accessed plus in-block-created) and a single
empty bytecode when an empty-code account is loaded.
rpc/jsonrpc: update TestExecutionWitness to expect populated Keys
(20B address / 32B slot preimages), replacing the stale Geth-compat
nil assertion now that legacy mode emits keys in both modes.

execution/tests: pin the zkevm witness corpus to canonical mode via
t.Setenv so CI compares against the minimized canonical fixtures
regardless of the legacy deployment default.
Comment thread execution/tests/eest_zkevm_witness/witness_test.go Outdated
Comment thread tools/test-groups Outdated
Per PR #21629 review (the #21092 pattern): the ~22k-test zkevm
execution-witness conformance suite must not run under `go test` /
`make test-all`. Replace it with a `zkevmtest` cmd/evm subcommand executed as
parallel `eest-spec-tests` shards (incl. a -race variant).

- cmd/evm/zkevmrunner.go: new `zkevmtest` subcommand — imports each EEST zkevm
  blockchain_tests fixture, queries debug_executionWitness (canonical) per
  block, multiset-compares state/codes/headers. Expected-fail patterns for the
  known #21563/#21566 divergences are ported via TestMatcher so the shard
  budget stays 0.
- testutil: add BlockTest.RunWithTesterCLI (nil tb, caller-owned tester) and
  TestMatcher.CheckFailureForName for CLI runners; RunCLI now shares the path.
- shards: add zkevm-witness + zkevm-witness-race-all to eest-spec-shards.yml,
  route them in run-eest-spec-test.sh, cache the eest_zkevm tarball in
  test-eest-spec.yml, and add Makefile targets provisioning test-fixtures-zkevm.
- remove the go-test suite (execution/tests/eest_zkevm_witness), its
  execution-eest-zkevm test-group, the zkevm steps in test-all-erigon-race.yml,
  and the GOTEST_PACKAGES filter (back to ./...).
@awskii awskii requested a review from taratorio June 5, 2026 11:16
Comment thread tools/eest-spec-shards.yml Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.

Comment thread rpc/jsonrpc/debug_execution_witness_test.go Outdated
Comment thread docs/plans/witness-legacy-mode-spec.md Outdated
Comment thread rpc/jsonrpc/debug_execution_witness.go Outdated
…-race)

Address PR #21629 review: rename zkevm-witness-race-all -> zkevm-witness-race
(the -all suffix was redundant) and drop the separate non-race zkevm-witness
shard. The race shard already runs the full witness comparison over the whole
corpus, so the non-race shard added no coverage — only faster feedback — and
the original suite was race-only. Race-shard detection now keys on "-race"
(matching the bare suffix) instead of "-race-".
@taratorio

Copy link
Copy Markdown
Member

from claude review

Review: PR #21629debug_executionWitness: legacy + canonical conformance

Overview

Four logical pieces, all coherent with each other:

  1. zkevmtest runner (cmd/evm/zkevmrunner.go) — imports each EEST zkevm@v0.4.0 fixture chain into an in-memory node (BAL executor + historical commitment), queries debug_executionWitness in canonical mode per block, multiset-compares state/codes/headers against the fixture.
  2. Legacy-mode field populationkeys (20B address / 32B slot preimages) and codes (accessed ∪ modified ∪ pre-state, incl. EIP-7702 designators, plus a single empty entry).
  3. Optional mode request paramlegacy (default) / canonical, strict validation.
  4. CI plumbing — two new shards (zkevm-witness, zkevm-witness-race-all) wired through the single-source tools/eest-spec-shards.yml, Makefile, run script, and workflow cache.

Verification performed

  • Built cmd/evm; ran all new unit tests (TestWitnessBlockTestParser, TestBranchData_ChildCount, the four new rpc/jsonrpc witness tests, updated TestExecutionWitness) — all pass locally.
  • CI fully green including both new shards (zkevm-witness 9m29s, zkevm-witness-race-all 33m21s).
  • Traced the shard plumbing end-to-end: *zkevm* base resolution correctly precedes *-stable* in tools/run-eest-spec-test.sh (line 68), the zkevm-witness* route arm catches the -race-all variant, Makefile jq filters partition shards cleanly, and the workflow's load-matrix picks new shards up automatically.
  • Checked the delete-then-recreate edge in accountExists (rpc/jsonrpc/debug_execution_witness.go line 416): safe — UpdateAccountData removes the address from DeletedAccounts, so checking DeletedAccounts first is correct.
  • Confirmed callCodeAccessHook is interface-gated (execution/state/intra_block_state.go line 635) — only RecordingState implements codeAccessTracker, so normal sync execution is unaffected by the GetDelegatedDesignation hook.
  • Confirmed the {0x80} append happens after verifyWitnessStateless and before the sort (debug_execution_witness.go lines 767-786), so the producer's own verification never sees the unreachable node; the RLPDecode skip (trie.go ~line 1553) covers external consumers feeding a legacy witness back in.
  • Trailing *string param is the established optional-arg pattern in this RPC layer; JSON-RPC-level backward compatible.

Findings

No correctness blockers. In order of importance:

1. PR description is stale (worth fixing before merge — it's reviewer/archaeology-facing)

  • It claims the suite is "Kept out of the default go test ./... (make test-all drops it via GOTEST_PACKAGES)" — but the final state has GOTEST_PACKAGES = ./... (Makefile line 93) with no exclusion. Commit 67160ee's exclusion was superseded when 0da251f moved the suite from a go-test package to the cmd/evm zkevmtest CLI runner, which is the cleaner outcome — but the body should say that instead.
  • "runs in its own execution-eest-zkevm race shard" — actual shard names are zkevm-witness / zkevm-witness-race-all.

2. WITNESS_STRICT_VERIFY env handling footgun (debug_execution_witness.go lines 1533, 1543)

  • Code checks os.Getenv(...) != "", so WITNESS_STRICT_VERIFY=false enables strict mode, while the spec doc says =true. Erigon's dbg.EnvBool convention would both parse correctly and hoist the per-read os.Getenv out of the stateless-verify hot loop (it currently fires on every account/storage read during verification).

3. Stale env-var residue in test (debug_execution_witness_test.go)

  • TestResolveWitnessMode failure message reads "param legacy should win over canonical env" — the env override was removed in 7352a9b; the message describes a dead behavior.

4. emptyCodeAccessed overlay asymmetry (question, not a bug)

  • ReadAccountData sets the flag only on the inner-reader path (debug_execution_witness.go line 146); overlay-served reads (account modified in-block, re-read later) and codeOverlay hits with empty code don't. If intentional ("legacy emits the empty entry for pre-state empty-code materialization only"), fine — the mainnet soak's residual ≤1-code diffs suggest this is the remaining heuristic gap vs reth-legacy. A one-line answer in the PR would help future readers.

5. bytes.Contains(node, trie.EmptyRoot[:]) heuristic (debug_execution_witness.go line 776)

  • A storage value or slot key coincidentally equal to EmptyRoot would false-positive the {0x80} append. Harmless (over-inclusion of an unreachable node, post-verification), and matching reth-legacy is the goal — just noting it's containment, not field-position, matching. Acceptable as-is.

6. Minor

  • runZkevmTestsParallel is a third copy of the worker-pool pattern in blockrunner.go/staterunner.go. Consistent with existing idiom, so fine here; a shared helper is a candidate follow-up, not a request for this PR.
  • compareWitness compares headers by recomputed hash rather than raw bytes — tolerates re-encoding differences, which is the right call for RLP fixtures.
  • .gitignore .claude/scheduled_tasks.lock is unrelated housekeeping riding along. Trivial.
  • zkevm-witness-race-all at 33m is now the slowest EEST shard (60m timeout). The PR body already flags the open question of whether a fixture-gated suite belongs in the required race gate — agreed that's worth settling, but not a blocker.

What's done well

Verdict

Approve with nits. Correctness is well-evidenced (22,532-fixture conformance green in CI including under race, plus a 1.2k-block mainnet soak vs reth-legacy). Before merge: update the stale PR description (finding 1), and ideally the WITNESS_STRICT_VERIFY parsing (finding 2) and the stale test message (finding 3) — all small.

awskii added 2 commits June 5, 2026 13:07
- WitnessKeys: dedup via map[common.Address]/map[common.Hash] + bytes.Clone
  instead of round-tripping every key through string(); 20B address and 32B
  slot keys can't collide, so the typed sets reproduce the same dedup with
  fewer allocations (behavior unchanged; TestCollectAccessedState_*).
- TestResolveWitnessMode: drop the stale "canonical env" wording — mode is the
  request param only (the env was removed in this PR).
- witness-legacy-mode-spec.md: note the EIP-7928 system address is also kept
  when accessed by a user transaction, not only on a real state change.
…ad loop

From the PR review: WITNESS_STRICT_VERIFY was checked via os.Getenv(...) != "" on
every account/storage read during stateless verification, so
WITNESS_STRICT_VERIFY=false would enable strict mode. Parse it once with
dbg.EnvBool into witnessStateless.strictVerify (production behavior with the env
unset is unchanged). Also sharpen the empty-code comment to note that only
pre-state empty-code loads emit the empty entry.
@awskii

awskii commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

Thanks — went through the claude-review findings:

  • 1 (stale PR description) — updated. It now describes the zkevmtest runner + the single zkevm-witness-race shard, with GOTEST_PACKAGES = ./... (no exclusion).
  • 2 (WITNESS_STRICT_VERIFY footgun) — fixed in ab2007f: parsed once via dbg.EnvBool into a witnessStateless.strictVerify field, out of the per-read verify loop. =false now disables instead of enabling, and the os.Getenv no longer fires on every account/storage read. Production behavior (env unset) is unchanged.
  • 3 (stale test message) — fixed in e47daa8 (dropped the "canonical env" wording).
  • 4 (emptyCodeAccessed overlay asymmetry) — intentional: legacy emits the single empty entry only for pre-state empty-code materialization; overlay/in-block reads are post-state, so they don't set it. That's the remaining ≤1-code heuristic gap vs reth-legacy in the soak. Sharpened the inline comment to say so.
  • 5 (EmptyRoot containment) — leaving as-is: harmless over-inclusion of an unreachable node post-verification, and matching reth-legacy is the goal.
  • 6 (minor)runZkevmTestsParallel follows the existing blockrunner/staterunner idiom (a shared helper is a reasonable follow-up, not this PR); the .gitignore line is a valid entry; headers-by-hash is intentional for RLP fixtures.

@awskii

awskii commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

Ran a continuous legacy-mode debug_executionWitness parity soak (erigon-legacy vs a reth-legacy oracle) on mainnet around block 25.25M after this merged: ~4.9k blocks, 0 health failures, ~480 key/state divergences vs reth. Posting the root cause since the divergences look alarming but aren't.

The divergences are benign minimization differences — erigon's witness self-verifies in every one. verifyWitnessStateless runs inline (rpc/jsonrpc/debug_execution_witness.go:766); divergent blocks reproduce the post-state root (logged as a parity diff, never a verify failure). The divergent slots are all zero-valued read-only SLOADs.

Breakdown:

  • keys-over (dominant, 265 blocks): erigon records a zero-read on an EIP-7702 delegated EOA (0xef0100||impl); reth-legacy drops it.
  • keys-under (rare, 4 blocks): erigon drops a zero-read on a born-in-block CREATE2 contract; reth-legacy keeps it. reth also emits duplicate keys, which inflates the raw under-count (e.g. 8 distinct slots showed as u=15).

Root cause is a created-flag asymmetry. The IBS sets created on CREATE/CREATE2, so post-creation zero-SLOADs short-circuit without calling RecordingState.ReadAccountStorage (:186) and never reach WitnessKeys (:934). A 7702 delegation doesn't set created, so its zero-reads go through the reader and get kept. reth has the inverse coverage. Neither omission breaks verification.

Caveat for anyone using WITNESS_STRICT_VERIFY to triage this: it's not a reliable arbiter. It errors on any node physically absent from the witness, but reborn-contract reads short-circuit in the stateless IBS before reaching its strict trie check, so it conflates "absent" with "needed". The lenient self-verify (root match) is the correct test, and it passes.

No correctness fix needed. If exact key-set parity with reth-legacy is wanted, the one safe change is to also minimize 7702-EOA empty zero-reads (mirror the CREATE2 path) — removes the keys-over noise, keeps self-verify valid. Can open a follow-up PR if useful.

pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 6, 2026
…h#21640)

## Problem

`tools/run-eest-spec-test.sh` invokes the cmd/evm runner with `|| true`
and decides shard pass/fail from the `--jsonout` report alone. That
design is correct for the non-race shards (the runners always exit 0 and
report via JSON; crashes surface as missing/truncated JSON, which the
script catches via the `total == 0` check and `set -euo pipefail` on
jq).

It is wrong for the `*-race-*` shards: the Go race runtime reports
detected races to **stderr** and overrides the process exit code to
**66** (even through `os.Exit(0)`), while the run completes normally and
the JSON stays fully valid with every test passing. The `|| true`
discards that exit code, so a race shard whose binary detected a data
race still went green — the `WARNING: DATA RACE` report only visible in
the job log. The race detector's whole value is catching races that
*don't* happen to corrupt results, so this silently weakened the
guarantee of every race shard:

```
blocktests-stable-race-{pre-cancun,cancun,prague,osaka}-{sequential,parallel}
blocktests-devnet-race-amsterdam
```

(and `zkevm-witness-race-all` once erigontech#21629 lands).

Pre-existing since the shards were introduced in erigontech#21092; surfaced while
reviewing erigontech#21629, which doubles the corpus running under the race
detector.

## Fix

Capture the runner's exit code and fail the shard explicitly on 66 with
a pointer to the race report above it in the log. All other exit-code
tolerance is unchanged.

## Verification (local repro)

Built a minimal `-race` Go binary containing a genuine data race that
emits valid passing JSON (`[{"name":"t1","pass":true}]`) and calls
`os.Exit(0)`, then ran it through the real script as `EVM_BIN` against
`blocktests-stable-race-cancun-sequential`:

- **Before** (unmodified script): binary prints `Found 1 data race(s)`
and exits 66 → script prints `ran 1 tests, 0 failed` and exits **0**
(green despite the race).
- **After**: same run exits **1** with `ERROR: data race detected (race
runtime exit code 66); see WARNING: DATA RACE in the log above`.

Regression matrix with stub binaries through the patched script,
confirming only the race case changes behavior:

| Scenario | Before | After |
|---|---|---|
| race detected (exit 66, valid passing JSON) | exit 0 | **exit 1** |
| clean pass (exit 0) | exit 0 | exit 0 |
| non-zero exit ≠ 66 with valid JSON | exit 0 (tolerated by design) |
exit 0 |
| test failures over budget | exit 1 | exit 1 |

`shellcheck --severity=warning` (the lint-workflow invocation) is clean
on the modified script; `make lint` clean.

The race shards running on this PR exercise the patched path end-to-end
in CI (race-free run → `rc=0` → behavior identical to before).

## Note on TDD

This is a shell-script CI guard with no test harness; the Red → Green
cycle above (reproduce the masked exit code through the unmodified
script, then flip it with the fix) stands in for a checked-in failing
test.

## End-to-end verification with a real race in erigon code

Repeated the verification with the full production pipeline instead of
stub binaries: planted a deliberate data race in
`IntraBlockState.CommitBlock` (two unsynchronized goroutines
incrementing a package var, executed once per imported block — never
committed), built the real `evm.race` via `make evm.race`, and ran a
single real EEST fixture
(`cancun/eip1153_tstore/test_basic_tload_after_store.json`, one
`fork_Cancun` test) through the
`blocktests-stable-race-cancun-sequential` shard:

- **old script** (origin/main): race report correctly attributes
`intra_block_state.go` in `CommitBlock.func1`, binary prints `Found 1
data race(s)` and exits 66 → script prints `ran 1 tests, 0 failed`,
**exit 0** — green despite the race.
- **this PR's script**: same binary, same fixture → **exit 1**, last
line `ERROR: data race detected (race runtime exit code 66); see
WARNING: DATA RACE in the log above`.
- **control**: race reverted, `evm.race` rebuilt clean, same fixture
through this PR's script → 0 race warnings, `ran 1 tests, 0 failed`,
**exit 0** — no false positive.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants