[glamsterdam-devnet-5] non-deterministic parallel-exec BAL computation rejects canonical blocks as invalid, wedging the CL #21651

@taratorio

Summary

On glamsterdam-devnet-5, the erigon EL on teku-erigon-1 repeatedly rejected canonical, network-finalized blocks as invalid with block access list mismatch while batch-executing during catch-up after a fresh resync. Critically, the computed BAL hash differed across re-execution attempts of the same block — the parallel executor's BAL computation is non-deterministic. Erigon's spurious INVALID verdicts propagated to teku over the engine API; teku invalidated the corresponding fork-choice branches, and once the network justified a checkpoint inside an invalidated branch, teku's fork choice wedged fatally (ProtoArray: Finalized node is unknown crash loop). The node has been dead at slot 7865 since ~16:31 UTC on 2026-06-05 (3,800+ slots behind by now), validators offline. The rest of the network was healthy (~85% participation, finalizing normally), and the sibling erigon nodes (lighthouse/lodestar/prysm/nimbus-erigon-1) executed the same blocks without BAL errors.

Environment

Timeline (UTC, 2026-06-05)

| Time | Event |
| --- | --- |
| 08:13, 09:59 | Node redeployed with the new build; erigon datadir wiped (eth_syncing startingBlock=0x0, OtterSync) → full resync. Same deployment wave as #21650 |
| 12:59:59 | First BAL validation failure during parallel batch execution near the tip: block 6292, plus exec loop error … parallel exec loop exited with 28 block(s) still pending in pe.blockExecutors (reason=ctx-done-drain: no more pending results) |
| 12:59–16:32 | Unwind → re-execute → fail loop, one failure every ~2 min: 104 BAL mismatch errors across 35 distinct canonical blocks (range ~5545–6320), failing block and computed hash varying per attempt |
| ~15:05 | teku: "Payload … marked as invalid by Execution Client", then "Will run fork choice because head block … was invalid"; also "Unable to import blocks: DATA_NOT_AVAILABLE" every slot |
| 15:13 | teku head freezes at slot 7865 (epoch 245) |
| 16:13:24 | teku FatalServiceFailureException: Invalid or unknown justified root: 0xec7020d2…; the network-justified checkpoint (epoch 244) was inside a branch teku had invalidated on the EL's verdict |
| 16:31 → ongoing | teku Quartz-timer crash loop "ProtoArray: Finalized node is unknown" (166k+ log lines); engine API traffic stops, erigon parked mid-sync (Execution stage 4041, Headers 6320) |

Key logs (erigon)

First failure:

[EROR] [06-05|12:59:59.865] BAL mismatch: computed   block=6292 hash=0xb5a34d031b37d9e1d3f7b494b623042585b5c93cae741fc2ad1f089f003c8797 headerHash=0x82a1e372cd1a0c1774cb3790eb5f6b3a171e653b38ea68661e8ad891c11c63e1 ...
[WARN] [06-05|12:59:59.870] [4/6 Execution] exec loop error          err="invalid block: parallel exec loop exited with 28 block(s) still pending in pe.blockExecutors [6310 6314 6293 …] (reason=ctx-done-drain: no more pending results)"
[WARN] [06-05|12:59:59.878] [4/6 Execution] Execution failed         err="invalid block, block=6292 (hash=0xd8a32374f68aa884c1b8c34d07c997a35746fa3986652c9815e1cbd6d86c5c8f): block access list mismatch: got 0xb5a34d03… expected 0x82a1e372…"

Non-determinism evidence

Same block, same expected header BAL hash, different computed hash on each re-execution attempt (after unwinds):

| Block | Expected (header) | Computed per attempt |
| --- | --- | --- |
| 6066 | 0xff6f4970… | 0x5aec9998… (13:11:19), 0x19932689… (13:13:07, 13:23:08, 13:36:30) |
| 6075 | 0x127c2929… | 0x99d0958b… (13:07:43), 0xdfffbe15… (13:28:31), 0xc5346d98… (13:56:56) |
| 6078 | 0xecae7d49… | 0xd239753d… (13:21:19), 0x7e1a11c8… (13:49:44), 0xd81162c8… (13:51:33), 0x72764c55… (14:02:21) |
| 6069 | 0xc7a34401… | 0xddd4c662… (14:00:31, 14:04:07), 0x23a4418f… (14:11:17) |

The computed BALs also consistently contain more accounts than the stored sidecar BAL (e.g. block 6079: computed accounts=221/220 across attempts vs stored accounts=219; block 6093: computed 30 vs stored 23; block 6120: computed 26 vs stored 19), with extra precompile-range addresses appearing in the computed set.

All of these blocks executed fine on the first pass through this range (the node had reached ~6292 before the first failure), and are canonical — the network finalized well past them.
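
For anyone who wants to re-derive the per-block comparison from their own erigon logs, here is a minimal sketch. It assumes the log lines have the same shape as the "BAL mismatch: computed" line quoted under "Key logs"; the function names are illustrative and not part of erigon.

```python
import re
from collections import defaultdict

# Matches lines shaped like the "BAL mismatch: computed" example under "Key logs".
# Assumption: block=, hash= and headerHash= always appear in this order.
MISMATCH_RE = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}\|[\d:.]+)\]\s+BAL mismatch: computed\s+"
    r"block=(?P<block>\d+)\s+hash=(?P<computed>0x[0-9a-fA-F]+)\s+"
    r"headerHash=(?P<expected>0x[0-9a-fA-F]+)"
)


def group_mismatches(lines):
    """Group mismatch lines by block number.

    Returns {block: {"expected": set of header hashes,
                     "computed": {computed hash: [timestamps]}}}.
    Lines that do not match the pattern are ignored.
    """
    blocks = defaultdict(lambda: {"expected": set(), "computed": defaultdict(list)})
    for line in lines:
        m = MISMATCH_RE.search(line)
        if m is None:
            continue
        entry = blocks[int(m["block"])]
        entry["expected"].add(m["expected"])
        entry["computed"][m["computed"]].append(m["ts"])
    return blocks


def nondeterministic_blocks(blocks):
    """Blocks whose computed BAL hash took more than one value across attempts."""
    return sorted(b for b, e in blocks.items() if len(e["computed"]) > 1)
```

Passing an open log file (any iterable of lines) to `group_mismatches` and then calling `nondeterministic_blocks` on the result lists the blocks where the same header hash was checked against differing computed hashes.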

Consequence on the CL (teku)

2026-06-05 15:05:53.774 WARN  - Payload for  node ForkChoiceNode[blockRoot=0x46f18929…, payloadStatus=PAYLOAD_STATUS_EMPTY] marked as invalid by Execution Client
2026-06-05 15:05:53.774 WARN  - Will run fork choice because head block 0x46f18929… was invalid
2026-06-05 16:13:24      …      FatalServiceFailureException: Invalid or unknown justified root: 0xec7020d2142c0e663d0941332c48b2cc16bedf54e39e7336667569dfd866d0c4
2026-06-05 16:31:09      ERROR - Job DEFAULT.Timer-N threw an unhandled Exception: … IllegalArgumentException: ProtoArray: Finalized node is unknown ForkChoiceNode[blockRoot=0xec7020d2…, payloadStatus=PAYLOAD_STATUS_PE…

A spurious INVALID is the worst engine API answer an EL can give: the CL prunes canonical branches and, as seen here, can wedge unrecoverably.

Analysis / pointers

  • Validation + logging path: execution/stagedsync/bal_create.go (ProcessBAL). Its comment asserts "the BalancePath cross-check in VersionMap.validateRead ensures deterministic parallel execution" — violated here.
  • This is the bug class TestEngineApiBALParallelConsistencyStress (execution/engineapi/engine_api_bal_test.go) was written to surface: parallel-executor BAL diverging from the assembler/serial BAL under concurrent write pressure — "If this test flakes, it's the same class of bug that makes the glamsterdam assertoor suite fail."
  • The field signature — first mismatch at the tip during a parallel batch, then re-exec attempts after unwind failing at random earlier blocks with varying computed hashes — points at per-block BAL accumulation state not being correctly reset/isolated across conflict re-runs, retries, or unwinds in the parallel executor.
  • Parallel exec was recently made the default (common/dbg: default EXEC3_PARALLEL=true #21591), so this also affects default-config nodes.
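
To make the suspected bug class concrete, here is a toy model of how an unisolated accumulator could produce the observed symptoms (extra accounts, hash varying with scheduling). It is only an illustration of the hypothesis above, not erigon's code: `execute_block`, the `isolate` flag and the SHA-256 digest are all stand-ins invented for this sketch, and the real BAL hash is computed differently.

```python
import hashlib


def addr(n):
    """Format an integer as a 20-byte hex address string."""
    return "0x%040x" % n


def bal_hash(accounts):
    # Stand-in digest: SHA-256 over the sorted addresses (NOT the real BAL hash).
    h = hashlib.sha256()
    for a in sorted(accounts):
        h.update(bytes.fromhex(a[2:]))
    return "0x" + h.hexdigest()


def execute_block(attempts_per_tx, isolate):
    """Collect the accounts touched by a block.

    attempts_per_tx: one list per tx; each item is the set of addresses
    touched by one execution attempt. Only the last attempt of a tx is the
    validated one; earlier attempts were aborted on conflict.
    """
    block_accounts = set()
    for attempts in attempts_per_tx:
        for i, touched in enumerate(attempts):
            is_final = i == len(attempts) - 1
            if isolate:
                # Correct: only the validated attempt contributes.
                if is_final:
                    block_accounts |= touched
            else:
                # Hypothesized bug: aborted attempts leak into the block set.
                block_accounts |= touched
    return block_accounts


# tx 0 ran twice: the aborted attempt also touched a precompile-range address.
schedule = [
    [{addr(0xAAA), addr(0x04)}, {addr(0xAAA)}],
    [{addr(0xBBB)}],
]
isolated = execute_block(schedule, isolate=True)
leaky = execute_block(schedule, isolate=False)
print(len(isolated), len(leaky))  # prints: 2 3
```

In this model the number and content of aborted attempts depend on scheduling, so the leaky variant would yield a different account set, and therefore a different hash, from run to run, while the isolated variant always matches the serial result.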

Expected behavior

BAL computation must be deterministic and match the serial/assembler result; canonical blocks must never be rejected as INVALID. When a BAL mismatch is detected, failing loudly is correct only if the computation is trustworthy — a non-deterministic checker converts an internal race into consensus-level self-destruction.
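
One possible shape for "fail loudly only when the computation is trustworthy" is sketched below. This is not a proposed patch: `classify_bal`, its callables and the three verdict strings are hypothetical names for illustration, and the verdicts are not engine API statuses.

```python
def classify_bal(expected_hash, compute_parallel, compute_serial, attempts=3):
    """Classify a block's BAL check without trusting a flaky checker.

    compute_parallel / compute_serial are zero-argument callables returning
    the computed BAL hash. Returns one of:
      "VALID"          - every parallel attempt matched the header hash
      "INTERNAL_ERROR" - parallel result disagreed, but serial matches the
                         header, so the block is fine and the executor is not
      "INVALID"        - the serial (reference) result also disagrees
    """
    hashes = {compute_parallel() for _ in range(attempts)}
    if hashes == {expected_hash}:
        return "VALID"
    # The parallel path disagreed with the header (or with itself):
    # consult the serial reference before condemning the block.
    if compute_serial() == expected_hash:
        return "INTERNAL_ERROR"
    return "INVALID"
```

The point of the sketch is the separation of outcomes: a checker that disagrees with its own serial reference reports an internal error rather than an INVALID verdict for a canonical block.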

Notes

  • Mitigation for devnet nodes until fixed: EXEC3_PARALLEL=false.
  • Related but distinct same-build incident: [glamsterdam-devnet-5] no change sets for unwinding after initial sync causes node to get stuck #21650 (nimbus-erigon-1, unwind/changesets -38006 Too deep reorg wedge).
  • Recovery of teku-erigon-1 likely needs a teku DB wipe/resync — its protoarray has invalidated canonical branches — plus letting erigon finish (or redo) its sync.
  • Full debug report (raw Dora/ClickHouse/RPC evidence with re-derivation commands) available on request.

Labels: Glamsterdam (https://eips.ethereum.org/EIPS/eip-7773)
