[glamsterdam-devnet-5] non-deterministic parallel-exec BAL computation rejects canonical blocks as invalid, wedging the CL #21651

## Summary

On `glamsterdam-devnet-5`, the erigon EL on `teku-erigon-1` repeatedly rejected **canonical, network-finalized blocks** as invalid with `block access list mismatch` while batch-executing during catch-up after a fresh resync. Critically, the **computed BAL hash differed across re-execution attempts of the same block**: the parallel executor's BAL computation is non-deterministic.

Erigon's spurious INVALID verdicts propagated to teku over the engine API; teku invalidated the corresponding fork-choice branches, and once the network justified a checkpoint inside an invalidated branch, teku's fork choice wedged fatally (`ProtoArray: Finalized node is unknown` crash loop). The teku head has been frozen at slot 7865 since 15:13 UTC on 2026-06-05 and the node has been crash-looping since ~16:31 UTC (3,800+ slots behind by now), with its validators offline. The rest of the network was healthy (~85% participation, finalizing normally), and the sibling erigon nodes (`lighthouse/lodestar/prysm/nimbus-erigon-1`) executed the same blocks without BAL errors.

## Environment

- erigon: branch `glamsterdam-devnet-5`, commit `1ca634d4b094f6b3932ab27227a1fa34895753b1` (`erigon/3.5.0/linux-amd64/go1.25.11`) — same build as #21650
- CL: `teku/v26.4.0+137-g766dcdaefe`
- Network: `glamsterdam-devnet-5` (12s slots, genesis 2026-06-04 13:00:00 UTC, `GLOAS_FORK_EPOCH=30`)
- Node: `teku-erigon-1`

## Timeline (UTC, 2026-06-05)

| Time | Event |
|---|---|
| 08:13, 09:59 | Node redeployed with the new build; erigon datadir wiped (`eth_syncing startingBlock=0x0`, OtterSync) → full resync — same deployment wave as #21650 |
| 12:59:59 | First BAL validation failure during parallel batch execution near the tip: block 6292, plus `exec loop error … parallel exec loop exited with 28 block(s) still pending in pe.blockExecutors (reason=ctx-done-drain: no more pending results)` |
| 12:59–16:32 | Unwind → re-execute → fail loop, one failure every ~2 min: **104 `BAL mismatch` errors across 35 distinct canonical blocks** (range ~5545–6320), failing block and computed hash varying per attempt |
| ~15:05 | teku: `Payload … marked as invalid by Execution Client` → `Will run fork choice because head block … was invalid`; also `Unable to import blocks: DATA_NOT_AVAILABLE` every slot |
| 15:13 | teku head freezes at slot 7865 (epoch 245) |
| 16:13:24 | teku `FatalServiceFailureException: Invalid or unknown justified root: 0xec7020d2…` — the network-justified checkpoint (epoch 244) was inside a branch teku had invalidated on the EL's verdict |
| 16:31 → ongoing | teku Quartz-timer crash loop `ProtoArray: Finalized node is unknown` (166k+ log lines); engine API traffic stops, erigon parked mid-sync (Execution stage 4041, Headers 6320) |

## Key logs (erigon)

First failure:

```
[EROR] [06-05|12:59:59.865] BAL mismatch: computed   block=6292 hash=0xb5a34d031b37d9e1d3f7b494b623042585b5c93cae741fc2ad1f089f003c8797 headerHash=0x82a1e372cd1a0c1774cb3790eb5f6b3a171e653b38ea68661e8ad891c11c63e1 ...
[WARN] [06-05|12:59:59.870] [4/6 Execution] exec loop error          err="invalid block: parallel exec loop exited with 28 block(s) still pending in pe.blockExecutors [6310 6314 6293 …] (reason=ctx-done-drain: no more pending results)"
[WARN] [06-05|12:59:59.878] [4/6 Execution] Execution failed         err="invalid block, block=6292 (hash=0xd8a32374f68aa884c1b8c34d07c997a35746fa3986652c9815e1cbd6d86c5c8f): block access list mismatch: got 0xb5a34d03… expected 0x82a1e372…"
```

## Non-determinism evidence

Same block, same expected header BAL hash, **different computed hash on each re-execution attempt** (after unwinds):

| Block | Expected (header) | Computed per attempt |
|---|---|---|
| 6066 | `0xff6f4970…` | `0x5aec9998…` (13:11:19), `0x19932689…` (13:13:07, 13:23:08, 13:36:30) |
| 6075 | `0x127c2929…` | `0x99d0958b…` (13:07:43), `0xdfffbe15…` (13:28:31), `0xc5346d98…` (13:56:56) |
| 6078 | `0xecae7d49…` | `0xd239753d…` (13:21:19), `0x7e1a11c8…` (13:49:44), `0xd81162c8…` (13:51:33), `0x72764c55…` (14:02:21) |
| 6069 | `0xc7a34401…` | `0xddd4c662…` (14:00:31, 14:04:07), `0x23a4418f…` (14:11:17) |
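
The table above can be re-derived from the erigon log by grouping computed hashes per block. The following is a minimal sketch, assuming the log lines follow the `BAL mismatch: computed` format quoted under "Key logs"; the function names and the sample input are hypothetical, and the hashes in `main` are placeholders rather than real data.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// balLine extracts block number, computed hash and header hash from a
// "BAL mismatch: computed" log line in the format quoted in this report.
var balLine = regexp.MustCompile(
	`BAL mismatch: computed\s+block=(\d+) hash=(0x[0-9a-fA-F]+) headerHash=(0x[0-9a-fA-F]+)`)

// computedByBlock returns, per block number, the set of distinct computed hashes.
func computedByBlock(log string) map[string]map[string]bool {
	out := map[string]map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := balLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a BAL mismatch line
		}
		if out[m[1]] == nil {
			out[m[1]] = map[string]bool{}
		}
		out[m[1]][m[2]] = true
	}
	return out
}

// nonDeterministic lists the blocks that produced more than one computed hash.
func nonDeterministic(by map[string]map[string]bool) []string {
	var blocks []string
	for block, hashes := range by {
		if len(hashes) > 1 {
			blocks = append(blocks, block)
		}
	}
	sort.Strings(blocks)
	return blocks
}

func main() {
	// Placeholder input: same line shape as the real log, made-up short hashes.
	log := `[EROR] BAL mismatch: computed   block=100 hash=0xaa01 headerHash=0xcc01
[EROR] BAL mismatch: computed   block=100 hash=0xbb02 headerHash=0xcc01
[EROR] BAL mismatch: computed   block=101 hash=0xdd03 headerHash=0xee04`
	fmt.Println(nonDeterministic(computedByBlock(log)))
}
```

Any block this reports has the same expected header hash but several computed hashes, which is the signature shown in the table.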

The computed BALs also consistently contain **more accounts** than the stored sidecar BAL (e.g. block 6079: computed `accounts=221`/`220` across attempts vs stored `accounts=219`; block 6093: computed 30 vs stored 23; block 6120: computed 26 vs stored 19), with extra precompile-range addresses appearing in the computed set.
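
To pin down which accounts are extra, the computed and stored account sets can be diffed directly. This is a small sketch, not erigon code: `Address`, `extraAccounts` and `inLowRange` are hypothetical names, and "precompile-range" is approximated here as "the first 18 bytes are zero", which is an assumption made only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Address is a 20-byte account address (illustrative type, not erigon's).
type Address [20]byte

// extraAccounts returns the addresses present in computed but not in stored,
// deduplicated and sorted bytewise.
func extraAccounts(computed, stored []Address) []Address {
	seen := make(map[Address]struct{}, len(stored))
	for _, a := range stored {
		seen[a] = struct{}{}
	}
	var extra []Address
	for _, a := range computed {
		if _, ok := seen[a]; !ok {
			extra = append(extra, a)
			seen[a] = struct{}{} // avoid reporting the same address twice
		}
	}
	sort.Slice(extra, func(i, j int) bool {
		return bytes.Compare(extra[i][:], extra[j][:]) < 0
	})
	return extra
}

// inLowRange reports whether only the last two bytes of the address are set.
// ASSUMPTION: a rough stand-in for "precompile-range", not an official definition.
func inLowRange(a Address) bool {
	for _, b := range a[:18] {
		if b != 0 {
			return false
		}
	}
	return true
}

func main() {
	var user, low Address
	user[0], user[19] = 0xde, 0xad // made-up ordinary account
	low[19] = 0x09                 // made-up low address

	stored := []Address{user}
	computed := []Address{user, low}
	for _, a := range extraAccounts(computed, stored) {
		fmt.Printf("extra %x lowRange=%v\n", a, inLowRange(a))
	}
}
```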

All of these blocks executed fine on the first pass through this range (the node had reached ~6292 before the first failure), and are canonical — the network finalized well past them.

## Consequence on the CL (teku)

```
2026-06-05 15:05:53.774 WARN  - Payload for  node ForkChoiceNode[blockRoot=0x46f18929…, payloadStatus=PAYLOAD_STATUS_EMPTY] marked as invalid by Execution Client
2026-06-05 15:05:53.774 WARN  - Will run fork choice because head block 0x46f18929… was invalid
2026-06-05 16:13:24      …      FatalServiceFailureException: Invalid or unknown justified root: 0xec7020d2142c0e663d0941332c48b2cc16bedf54e39e7336667569dfd866d0c4
2026-06-05 16:31:09      ERROR - Job DEFAULT.Timer-N threw an unhandled Exception: … IllegalArgumentException: ProtoArray: Finalized node is unknown ForkChoiceNode[blockRoot=0xec7020d2…, payloadStatus=PAYLOAD_STATUS_PE…
```

A spurious INVALID is the worst engine API answer an EL can give: the CL prunes canonical branches and, as seen here, can wedge unrecoverably.

## Analysis / pointers

- Validation + logging path: `execution/stagedsync/bal_create.go` (`ProcessBAL`). Its comment asserts "the BalancePath cross-check in `VersionMap.validateRead` ensures deterministic parallel execution" — violated here.
- This is the bug class `TestEngineApiBALParallelConsistencyStress` (`execution/engineapi/engine_api_bal_test.go`) was written to surface: parallel-executor BAL diverging from the assembler/serial BAL under concurrent write pressure — "If this test flakes, it's the same class of bug that makes the glamsterdam assertoor suite fail."
- The field signature — first mismatch at the tip during a parallel batch, then re-exec attempts after unwind failing at *random earlier blocks* with *varying computed hashes* — points at per-block BAL accumulation state not being correctly reset/isolated across conflict re-runs, retries, or unwinds in the parallel executor.
- Parallel exec was recently made the default (#21591), so this also affects default-config nodes.
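
The suspected failure mode in the third bullet can be illustrated with a toy model. This is not erigon's executor: `accumulator` and `runBlock` are hypothetical, addresses are plain strings, and the attempts are replayed sequentially so the effect is reproducible. It only shows how per-block state that survives an invalidated attempt yields a larger account set than the isolated result, consistent with the extra accounts observed.

```go
package main

import (
	"fmt"
	"sort"
)

// accumulator collects the accounts touched while executing one block.
type accumulator struct {
	touched map[string]struct{}
}

func newAccumulator() *accumulator {
	return &accumulator{touched: map[string]struct{}{}}
}

func (a *accumulator) record(addr string) { a.touched[addr] = struct{}{} }

// reset drops everything recorded so far, as a retry should.
func (a *accumulator) reset() { a.touched = map[string]struct{}{} }

// accounts returns the recorded addresses in sorted order.
func (a *accumulator) accounts() []string {
	out := make([]string, 0, len(a.touched))
	for addr := range a.touched {
		out = append(out, addr)
	}
	sort.Strings(out)
	return out
}

// runBlock replays the attempts in order; only the last one is the valid
// execution. With resetBetween=false, accesses from invalidated attempts leak
// into the final account set.
func runBlock(attempts [][]string, resetBetween bool) []string {
	acc := newAccumulator()
	for i, accesses := range attempts {
		if i > 0 && resetBetween {
			acc.reset()
		}
		for _, addr := range accesses {
			acc.record(addr)
		}
	}
	return acc.accounts()
}

func main() {
	attempts := [][]string{
		{"0xA", "0x09"}, // speculative attempt, later invalidated by a conflict
		{"0xA", "0xB"},  // valid re-run
	}
	fmt.Println("leaky:   ", runBlock(attempts, false))
	fmt.Println("isolated:", runBlock(attempts, true))
}
```

In a real parallel executor the number and content of invalidated attempts depends on scheduling, so a leak of this kind would also explain why the computed hash varies between runs.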

## Expected behavior

BAL computation must be deterministic and match the serial/assembler result; canonical blocks must never be rejected as INVALID. When a BAL mismatch is detected, failing loudly is correct *only if* the computation is trustworthy — a non-deterministic checker converts an internal race into consensus-level self-destruction.
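
The determinism requirement can be phrased as a repeat-and-compare check. The sketch below is a generic harness under stated assumptions: `checkStable` and `hashAccounts` are hypothetical helpers, and `hashAccounts` (sort, join, SHA-256) is only a stand-in for the real BAL encoding and hash.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// Hash is a 32-byte digest.
type Hash [32]byte

// checkStable runs compute n times and fails on the first result that differs
// from want (for example, the serial/assembler result).
func checkStable(n int, want Hash, compute func() Hash) error {
	for i := 0; i < n; i++ {
		if got := compute(); got != want {
			return fmt.Errorf("attempt %d: got %x, want %x", i, got[:4], want[:4])
		}
	}
	return nil
}

// hashAccounts hashes an account list independently of input order.
// ASSUMPTION: placeholder for the real BAL hash, used only to drive the harness.
func hashAccounts(accounts []string) Hash {
	sorted := append([]string(nil), accounts...)
	sort.Strings(sorted)
	return sha256.Sum256([]byte(strings.Join(sorted, ",")))
}

func main() {
	want := hashAccounts([]string{"0xA", "0xB"})

	// Order-insensitive computation: passes on every attempt.
	stable := func() Hash { return hashAccounts([]string{"0xB", "0xA"}) }
	fmt.Println("stable:", checkStable(5, want, stable))

	// Computation that picks up a stale account from the second call onward.
	calls := 0
	flaky := func() Hash {
		calls++
		if calls > 1 {
			return hashAccounts([]string{"0xA", "0xB", "0x09"})
		}
		return hashAccounts([]string{"0xA", "0xB"})
	}
	fmt.Println("flaky: ", checkStable(5, want, flaky))
}
```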

## Notes

- Mitigation for devnet nodes until fixed: `EXEC3_PARALLEL=false`.
- Related but distinct same-build incident: #21650 (`nimbus-erigon-1`, unwind/changesets `-38006 Too deep reorg` wedge).
- Recovery of `teku-erigon-1` likely needs a teku DB wipe/resync — its protoarray has invalidated canonical branches — plus letting erigon finish (or redo) its sync.
- Full debug report (raw Dora/ClickHouse/RPC evidence with re-derivation commands) available on request.

