Erigon trails geth/besu/nethermind by ~1.5–2× on cold-SSTORE-bloated workloads; planned fix blocked by --experimental.concurrent-commitment wrong-trie-root bug #20920

# Erigon trails geth/besu/nethermind by ~1.5–2× on cold-SSTORE-bloated workloads; planned fix blocked by concurrent-commitment bug

## Summary

On the EthPandaOps `osaka-repricings-stateful-jochem` benchmarkoor SSTORE-bloated workload (4,200 cold SSTOREs against a single 10 GB EOA storage trie in one 30M-gas block), erigon (bal-devnet-3 base, sequential commitment) sits at **6.1 Mgas/s cold / 7.9 Mgas/s warm**, against geth / besu / nethermind at **~10–14 Mgas/s**. Those clients also read 3–4× less from disk per run, so the gap is real I/O work, not noise. Erigon's bal-devnet-3 baseline is already a ~4× lift over the published canonical erigon number for the same test (1.7 Mgas/s), so the recent commitment + parallel-exec work has paid off, but we still trail the leaders.

The natural next optimization is `--experimental.concurrent-commitment`, which moves the per-block hashing from one goroutine to 16. We expected this to close some of the gap. **It produces a deterministic wrong trie root on this benchmark (block 24358305)** — first concurrent-commitment batch in the run, before the test block even fires — so we can't measure it. Fixing or working around that bug is the gating step before we can establish whether concurrent-commitment alone closes the gap, or whether deeper work (storage-trie sub-fanout) is needed on top.

---

## Performance comparison

Test: `test_sstore_bloated[10GB-fork_Osaka-NO_CACHE-existing_slots_True-write_new_value_True-30M]` — 4,200 cold SSTOREs against a 10 GB EOA's storage trie, 30M gas, no cache.

**Hardware envelope (matches canonical EIP-7870 fullnode):** 6 vCPUs / 32 GB RAM container, cpu_freq pinned 3.6 GHz, no turbo, performance governor, swap disabled, `drop_memory_caches: "steps"`.

| Client | Source | Test time (s, avg / min) | Mgas/s | Disk read (GB) | Read IOPS | CPU (s) | Mem (GB) |
|---|---|---:|---:|---:|---:|---:|---:|
| **erigon (bal-devnet-3 base, sequential, cold)** | local | 4.89 / 4.86 | **6.13** | 2.69 | 657k | 3.63 | 26 |
| **erigon (bal-devnet-3 base, sequential, warm)** | local | 3.78 / 3.73 | **7.95** | 1.94 | 472k | 2.80 | 19 |
| erigon | published canonical | 17.53 / 17.35 | 1.70 | 5.31 | 169k | 21.98 | 25 |
| besu | published canonical | 2.72 / 2.32 | 10.99 | 0.87 | 49k | 9.67 | 2.5 |
| geth | published canonical | 2.83 / 2.36 | 10.56 | 0.65 | 42k | 4.62 | 1.18 |
| nethermind | published canonical | 2.15 / 1.40 | 13.91 | 0.73 | 47k | 6.25 | 12.2 |

(reth's 1.1 Mgas/s line excluded — that's a missed test on their end, not the comparison we should anchor on.)

**Headlines:**
- bal-devnet-3 sequential commitment is already **~4× faster** than the published canonical erigon (1.70 → 6.13 Mgas/s cold, 7.95 warm). The recent BAL/parallel-exec/commitment work has paid off.
- We still trail geth / besu / nethermind by **~1.5–2×** on this workload, and read **3–4× more** from disk per run. The disk-read gap is the part most likely to yield to commitment-side parallelism.
- Memory footprint is the other outlier: 19–26 GB vs 1–12 GB for everyone else. That's the snapshot mmap working set on a 246 GB MDBX + 2 TB segments dataset and is largely orthogonal to throughput — but worth keeping in mind for total-cost-of-ownership comparisons.
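
The headline ratios follow directly from the table. A minimal sketch that recomputes them; the dictionaries only restate the table rows, and the variable names are illustrative:

```python
# Mgas/s and disk-read (GB) figures copied from the comparison table above.
mgas = {
    "erigon_cold": 6.13,
    "erigon_warm": 7.95,
    "erigon_canonical": 1.70,
    "besu": 10.99,
    "geth": 10.56,
    "nethermind": 13.91,
}
disk_read_gb = {"erigon_cold": 2.69, "besu": 0.87, "geth": 0.65, "nethermind": 0.73}
leaders = ("besu", "geth", "nethermind")

# Lift of the local bal-devnet-3 build over the published canonical erigon run.
lift_cold = mgas["erigon_cold"] / mgas["erigon_canonical"]
lift_warm = mgas["erigon_warm"] / mgas["erigon_canonical"]

# Throughput gap and read amplification versus each leader (cold erigon run).
gap = {c: mgas[c] / mgas["erigon_cold"] for c in leaders}
read_amp = {c: disk_read_gb["erigon_cold"] / disk_read_gb[c] for c in leaders}

print(f"lift over canonical: {lift_cold:.1f}x cold, {lift_warm:.1f}x warm")
for c in leaders:
    print(f"{c}: {gap[c]:.2f}x faster, erigon reads {read_amp[c]:.1f}x more from disk")
```

The cold lift lands a little under 4× and the warm lift a little over, which is why the text rounds to "~4×".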

## Why the gap (read-amplification hypothesis)

Geth / besu / nethermind read 0.65–0.87 GB to execute this block; bal-devnet-3 reads 2.69 GB cold. With one shared 10 GB storage trie and 4,200 cold slots, the difference is *how the 4,200 trie traversals are batched and which intermediate pages we re-read*. Erigon's `HexPatriciaHashed` concurrent-commitment mode fans out 16 ways on the first nibble of `keccak256(plainKey)`, but on this workload all 4,200 slots share the same `keccak256(addr)[0]` (one EOA → one nibble), so 1 of 16 subtries would do 100% of the work while the other 15 sit idle. The leaders presumably do better because they batch and dedupe storage-trie reads inside the single account.

Two paths to close the gap:
1. **Make `--experimental.concurrent-commitment` actually run.** Fixing the wrong-root bug below would let us measure whether concurrent-commitment alone is enough, even with the 1-of-16 imbalance (it might still help on cross-account workloads).
2. **Storage-trie sub-fanout**: within the dominant subtrie, fan out 16 ways on `keccak256(slot)[0]`. We have a design plus Phase 1 (detection) and Phase 2a (warmup-only fanout) commits for this on `feat/storage-parallel-trie`. Phase 2b (the real mount-at-depth-64 fold) is unwritten and depends on concurrent-commitment producing correct roots first.
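
The imbalance and the proposed sub-fanout can be illustrated with a toy bucket count. This is only a sketch under stated assumptions: the address and slot keys are made up, and `hashlib.sha3_256` stands in for keccak256 (the two differ in padding, so the digests are not Ethereum's, but the nibble distribution behaves the same way). It does not touch erigon's actual trie code.

```python
import hashlib
from collections import Counter

def h(data: bytes) -> bytes:
    # Stand-in hash: NIST SHA3-256 from the stdlib, NOT Ethereum's keccak256.
    return hashlib.sha3_256(data).digest()

def first_nibble(digest: bytes) -> int:
    # High 4 bits of the first byte = first hex nibble of the hashed key.
    return digest[0] >> 4

addr = bytes.fromhex("11" * 20)                       # hypothetical account address
slots = [i.to_bytes(32, "big") for i in range(4200)]  # hypothetical slot keys

# Top-level fanout: every storage key of this account falls under hash(addr),
# so all 4,200 updates land in the same first-nibble bucket.
top_level = Counter(first_nibble(h(addr)) for _ in slots)

# Proposed sub-fanout: bucket the same slots by the first nibble of hash(slot).
sub_fanout = Counter(first_nibble(h(s)) for s in slots)

print("top-level buckets used:", len(top_level))
print("sub-fanout buckets used:", len(sub_fanout))
print("largest sub-fanout bucket:", max(sub_fanout.values()))
```

With a uniform hash each sub-fanout bucket should hold roughly 4,200 / 16 ≈ 260 slots, which is the balance the top-level split cannot provide for a single account.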

---

## Reproducer (end-to-end, for an external machine)

### Access prerequisites (do this first)

Several of the URLs and APIs below sit behind EthPandaOps access controls. Confirm you have what you need before starting — discovering it after a 1.7 TB download is no fun.

- **Snapshot bucket** (`https://snapshots.ethpandaops.io/...`): **NOT publicly accessible.** Our working download was authorised against `mh0lt`'s GitHub account — the bucket is gated by GitHub identity / EthPandaOps allowlist. An external agent on a fresh machine will likely 404/403. To unblock: coordinate with `mh0lt` (or whoever owns the bal-devnet-3 work) to either (a) request EthPandaOps allowlists the new identity, (b) be issued a presigned URL, or (c) receive a forwarded copy of the artefact via another channel. Confirm with `curl -sI <url>` before starting the download — anything other than `HTTP/2 200` means access is not yet granted.
- **Test fixture** (`https://data.ethpandaops.io/benchmarkoor/osaka-repricings-stateful-jochem.tar.gz`) and **opcode trace** (`https://data.ethpandaops.io/benchmarkoor/opcode_trace_results.json`): currently public, downloaded by benchmarkoor itself at run time. If those 404 from your IP, the same fix applies.
- **Canonical published numbers** (`https://benchmarkoor-api.core.ethpandaops.io/api/v1/index/...`): **requires** a bearer token from EthPandaOps. Get one before writing code that depends on it. Pattern: `curl -H 'Authorization: Bearer bmk_...' '<url>'`.
- **Genesis gist** (`https://gist.githubusercontent.com/skylenet/...`): publicly hosted on GitHub Gist — typically fine, but if the gist is deleted you'll need a copy. Save the JSON locally as a fallback.
- **Docker Hub** for `golang:1.24-bookworm` (benchmarkoor build): standard public pull, but rate-limited if unauthenticated. Consider `docker login` if you'll be rebuilding.

### Hardware / OS requirements

- Linux x86_64 (we used Ubuntu 24.04, kernel 6.8).
- 6+ cores you can pin via cpuset (we used AMD EPYC 4244P; the canonical EIP-7870 envelope is 6 logical / 3 physical, plus 1–2 cores of headroom for the host).
- 32 GB RAM allocated to the EL container; host with ≥48 GB recommended.
- NVMe SSD storage strongly recommended — benchmark is read-IOPS heavy (470k–660k IOPS during the test step).
- Docker (we used 27.x).
- **Root access on host** (for `vm.drop_caches`, cpu_freq pinning, cgroup memory caps).
- `zstd`, `aria2c`, `jq`, `python3` installed.

### Disk space requirements

| Item | Size | When |
|---|---:|---|
| Compressed snapshot (`snapshot.tar.zst`) | **1.69 TB** | downloaded once |
| Extracted snapshot (datadir) | **2.3 TB** | persistent |
| Tarball + extract simultaneously (peak) | **~4 TB** | extraction window only |
| MDBX runtime growth during a run | ~340 GB | persistent after first run |
| Docker image (with embedded erigon binary) | ~530 MB | persistent |
| Overlayfs upper/work dirs per run | ~5 GB | ephemeral, cleaned on container stop |

**Recommendation: 4 TB free on the volume that holds the snapshot.** A 3 TB volume is enough only if `snapshot.tar.zst` lives on a different volume, because tarball plus extract peak at ~4 TB when they share one. We hit "no space left on device" mid-extract on a 3 TB volume, which corrupts the snapshot (see "Snapshot integrity check" below).
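
The arithmetic behind that recommendation, as a small pre-flight sketch. The sizes are the table values; the function names and the idea of checking free space with `shutil.disk_usage` are illustrative additions, not part of the original workflow.

```python
import shutil

# Sizes in decimal TB, copied from the disk-space table above.
TARBALL_TB = 1.69
EXTRACTED_TB = 2.3
MDBX_GROWTH_TB = 0.34

def required_tb(tarball_on_same_volume: bool) -> float:
    """Peak space needed on the extract volume while extraction is running."""
    peak = EXTRACTED_TB
    if tarball_on_same_volume:
        peak += TARBALL_TB
    return peak

def steady_state_tb() -> float:
    """Space held once the tarball is gone and MDBX has grown after a run."""
    return EXTRACTED_TB + MDBX_GROWTH_TB

def free_tb(path: str) -> float:
    """Free space on the volume containing `path`, in decimal TB."""
    return shutil.disk_usage(path).free / 1e12

print(f"peak, tarball on same volume:  {required_tb(True):.2f} TB")
print(f"peak, tarball on other volume: {required_tb(False):.2f} TB")
print(f"steady state after first run:  {steady_state_tb():.2f} TB")
```

Comparing `free_tb("/erigon-data")` against `required_tb(...)` before starting the extract is cheaper than discovering the shortfall halfway through.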

### Step 1: Download the snapshot

URL: `https://snapshots.ethpandaops.io/perf-devnet-3/erigon/24358000/snapshot.tar.zst` (1690501673719 bytes ≈ 1.69 TB, zstd-compressed, includes both EL chaindata and Caplin CL data).

**Use aria2c, not curl.** `curl --retry` truncates the file on retry without `-C -`, and we lost progress repeatedly. aria2c with 16 parallel segments achieved ~635 MiB/s on a 10 Gbit link.

```bash
mkdir -p /erigon-data/snapshots
aria2c -c -x 16 -s 16 \
  -d /erigon-data/snapshots \
  -o snapshot.tar.zst \
  'https://snapshots.ethpandaops.io/perf-devnet-3/erigon/24358000/snapshot.tar.zst'
```

`-c` resumes a partial download after an interrupt; `-x 16` allows up to 16 connections per server and `-s 16` splits the file into 16 segments.
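
Before extracting, it is worth confirming the download really finished. A minimal sketch, assuming the byte count quoted above is authoritative and relying on aria2c's habit of keeping a `<name>.aria2` control file next to an unfinished download (the size alone can mislead, since aria2c may preallocate the file). The function name is illustrative.

```python
import os

EXPECTED_BYTES = 1690501673719  # size quoted above for snapshot.tar.zst

def download_complete(path: str, expected_bytes: int = EXPECTED_BYTES) -> bool:
    """True if the file exists at the expected size with no aria2 control file left."""
    if not os.path.isfile(path):
        return False
    # aria2c keeps "<name>.aria2" beside the download while it is unfinished.
    if os.path.exists(path + ".aria2"):
        return False
    return os.path.getsize(path) == expected_bytes
```

For example, `download_complete("/erigon-data/snapshots/snapshot.tar.zst")` should return `True` before Step 2 is started.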

### Step 2: Extract

```bash
mkdir -p /erigon-data/snapshots/erigon/perf-devnet-3/24358000
cd /erigon-data/snapshots/erigon/perf-devnet-3/24358000
tar -I zstd -xf /erigon-data/snapshots/snapshot.tar.zst
```

(`-I zstd` tells tar to decompress through zstd. Recent GNU tar can auto-detect zstd when reading from a file, but spelling it out also covers tar builds that do not.)

After extract, the directory should be ~2.3 TB and contain `chaindata/`, `snapshots/`, `caplin/`, plus `salt-blocks.txt` / `salt-state.txt`. Roughly 1606 segment files plus an MDBX `mdbx.dat`.

**Critical: if extraction fails midway, do not re-run it with `--keep-old-files`.** That flag preserves the zero-byte stub files left by the failed run, so the snapshot stays corrupt (we hit this: `salt-blocks.txt` was 0 bytes instead of the expected 4). On failure, either delete the partial extract and re-extract from scratch, or re-run plain `tar -I zstd -xf` (which overwrites the stubs).

### Snapshot integrity check

Any `.seg`, `.kv`, `.v`, or `salt-*.txt` file that is zero bytes is corrupt. If the check below prints anything, re-extract:

```bash
find /erigon-data/snapshots/erigon/perf-devnet-3/24358000 -size 0 -type f \
  \( -name '*.seg' -o -name '*.kv' -o -name '*.v' -o -name 'salt-*.txt' \)
# expected: empty output
```

(Zero-byte `*.lck` files are benign — those are MDBX lock files. Sparse `.bt`/`.kvei` index sidecars *can* be zero — verify by content type if unsure.)
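
The same check can be scripted, which is convenient when it has to gate a later step. A minimal sketch that mirrors the `find` patterns above; the function name is illustrative, and `.bt` / `.kvei` are deliberately left out because they can legitimately be empty.

```python
import fnmatch
import os

# Same name patterns as the find command above.
PATTERNS = ("*.seg", "*.kv", "*.v", "salt-*.txt")

def zero_byte_files(root: str, patterns=PATTERNS) -> list[str]:
    """Return all zero-byte regular files under `root` whose name matches a pattern."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and os.path.getsize(full) == 0:
                hits.append(full)
    return sorted(hits)
```

An empty list from `zero_byte_files("/erigon-data/snapshots/erigon/perf-devnet-3/24358000")` corresponds to the "expected: empty output" case.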

### Step 3: Build benchmarkoor (with the timeout patch)

benchmarkoor's stock `DefaultReadyTimeout` is 120s. Erigon takes 2+ minutes to come up RPC-ready on a 246 GB MDBX + 2 TB segments dataset, so the harness gives up before erigon is ready. Patch it to 900s.

```bash
git clone https://github.com/ethpandaops/benchmarkoor /tmp/benchmarkoor-src
cd /tmp/benchmarkoor-src
sed -i 's/DefaultReadyTimeout = 120 \* time.Second/DefaultReadyTimeout = 900 * time.Second/' \
  pkg/runner/runner.go
```
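
One caveat with the `sed` one-liner is that it exits successfully even when nothing matched, for example if upstream renames or retunes the constant. A hedged sketch of the same substitution that fails loudly instead; the exact declaration in `runner.go` is assumed to look like the `sed` pattern, and `patch_timeout` is an illustrative name.

```python
import re

# Mirrors the sed expression above: 120s -> 900s on DefaultReadyTimeout.
PATTERN = re.compile(r"DefaultReadyTimeout = 120 \* time\.Second")
REPLACEMENT = "DefaultReadyTimeout = 900 * time.Second"

def patch_timeout(source: str) -> str:
    """Return `source` with the timeout bumped; raise if the line is not found exactly once."""
    patched, count = PATTERN.subn(REPLACEMENT, source)
    if count != 1:
        raise ValueError(f"expected exactly 1 match, found {count}")
    return patched
```

Applied to the contents of `pkg/runner/runner.go`, a `ValueError` means the harness would still be running with its stock timeout.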

The host is missing C deps benchmarkoor needs (libbtrfs / libgpgme / libdevmapper), so build inside Docker:

```bash
cat > /tmp/Dockerfile.benchmarkoor-build <<'EOF'
FROM golang:1.24-bookworm
RUN apt-get update && apt-get install -y --no-install-recommends \
    libbtrfs-dev libgpgme-dev libdevmapper-dev pkg-config \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /src
COPY . .
RUN go build -o /benchmarkoor ./cmd/benchmarkoor
EOF

mkdir -p $HOME/benchmarkoor
docker build -t benchmarkoor-build -f /tmp/Dockerfile.benchmarkoor-build /tmp/benchmarkoor-src
docker run --rm -v $HOME/benchmarkoor:/out benchmarkoor-build sh -c 'cp /benchmarkoor /out/benchmarkoor'
sudo chown root:root $HOME/benchmarkoor/benchmarkoor   # benchmarkoor's cgroup setup expects root-owned binary when run via sudo
```

Result: ~65 MB binary at `~/benchmarkoor/benchmarkoor`.

### Step 4: Build erigon (the binary you want to test)

Standard `make erigon` from the branch under test. For fast iteration we used a Docker fast-swap pattern (rebuild image in <5s vs 5+ min for the full Dockerfile):

```bash
# First time only: build the canonical image once
cd $ERIGON_REPO
make docker DOCKER_TAG=local/erigon:bal-devnet-3   # or use any base image with erigon at /usr/local/bin/erigon

# After every code change: just swap the freshly-built binary into the existing image
make erigon
cat > /tmp/Dockerfile.erigon-swap <<'EOF'
FROM local/erigon:bal-devnet-3
USER root
COPY erigon /usr/local/bin/erigon
RUN chmod +x /usr/local/bin/erigon
USER erigon
EOF
cp build/bin/erigon /tmp/erigon
cd /tmp && docker build -t local/erigon:bal-devnet-3 -f Dockerfile.erigon-swap .
```

### Step 5: benchmarkoor config

```bash
cat > $HOME/benchmarkoor/run.erigon-osaka-sstore.yaml <<'EOF'
global:
  log_level: info

runner:
  client_logs_to_stdout: true
  docker_network: benchmarkoor
  cleanup_on_start: true

  live_reporting:
    enabled: false

  benchmark:
    generate_results_index: true
    generate_suite_stats: true

    tests:
      metadata:
        labels:
          name: perf-devnet-3-24358000-osaka-stateful-erigon-local
          chain: perf-devnet-3
          block: "24358000"
          test-type: stateful
          context: repricing
          fork: osaka
          erigon-build: local-bal-devnet-3

      filter: "sstore_bloated[10GB-fork_Osaka-benchmark_test-cache_strategy_CacheStrategy.NO_CACHE-existing_slots_True-write_new_value_True-benchmark_30M"

      source:
        archive:
          file: https://data.ethpandaops.io/benchmarkoor/osaka-repricings-stateful-jochem.tar.gz
          pre_run_steps:
            - "merged/gas-bump.txt"
            - "merged/funding.txt"
          steps:
            setup:
              - "merged/setup/*.txt"
            test:
              - "merged/testing/*.txt"

      opcode_source:
        file: https://data.ethpandaops.io/benchmarkoor/opcode_trace_results.json

  client:
    config:
      drop_memory_caches: "steps"
      rollback_strategy: container-recreate

      resource_limits:
        cpuset_count: 6
        cpu_freq: "3600MHz"
        cpu_turboboost: false
        cpu_freq_governor: performance
        memory: "32g"
        swap_disabled: true

      genesis:
        erigon: https://gist.githubusercontent.com/skylenet/85704e26f3e833a02a760f623aeaaf9b/raw/1b1dcf664b6cb6db997ba77cd869a51176b6ee06/genesis-perf-devnet-3-24358000-osaka-genesis.json

    datadirs:
      erigon:
        source_dir: /erigon-data/snapshots/erigon/perf-devnet-3/24358000/
        method: overlayfs

  instances:
    - id: erigon-bal-full
      client: erigon
      metadata:
        labels:
          bal-mode: full
      image: local/erigon:bal-devnet-3
      pull_policy: never
      environment:
        ERIGON_MAX_REORG_DEPTH: "512"
        EXEC_TERSE_LOGGER_LEVEL: "3"
      extra_args:
        - --networkid=12159
        - --fcu.background.commit=false
        # add --experimental.concurrent-commitment here to repro the wrong-root bug
      bootstrap_fcu:
        enabled: true
        max_retries: 60
        backoff: 30s
EOF
```

`source_dir` must point at the extracted snapshot dir from Step 2.

### Datadir method (`method:`) — pick `overlayfs`

Three options; the trade-offs matter on a constrained box:

| Method | Speed | Disk overhead | Notes |
|---|---|---|---|
| `overlayfs` (kernel) | fastest | only the per-run diff (~5 GB) | **what we used.** needs root + `overlay` kernel module. cleanly umounts on test end. |
| `fuse-overlayfs` | ~2× slower | only the per-run diff | unprivileged, pure userspace. Use if kernel overlayfs is unavailable. |
| `copy` | fastest **after** the copy completes | full duplicate (+2.3 TB) | requires `2 × snapshot_size` free disk per run. We aborted this on a 4 TB box because 2.3 TB extracted + 2.3 TB copy + benchmarkoor work left no headroom. |

Stick with `overlayfs` unless you have specific reasons not to.
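
Whether kernel overlayfs is usable can be checked up front by looking for `overlay` in `/proc/filesystems`. A minimal sketch; the function name is illustrative, and note that the entry only appears once the module is loaded, so a `False` may just mean `modprobe overlay` has not been run yet.

```python
def overlay_supported(proc_filesystems_text: str) -> bool:
    """True if the text of /proc/filesystems lists the overlay filesystem."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Lines look like "nodev<TAB>overlay" or "<TAB>ext4"; the name is the last field.
        if fields and fields[-1] == "overlay":
            return True
    return False
```

Typical use is `overlay_supported(open("/proc/filesystems").read())`; if it stays `False` after loading the module, `fuse-overlayfs` is the fallback from the table.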

### Step 6: cold-cache wrapper (recommended for cold-baseline numbers)

`drop_memory_caches: "steps"` calls `vm.drop_caches=3` between steps but doesn't reliably evict snapshot mmap pages held by overlayfs lower-dirs. We confirmed empirically: drop fired but warm-run pages persisted (cold first run reads 2.69 GB, warm subsequent runs read 1.94 GB).

For repeatable cold numbers, use this wrapper:

```bash
cat > $HOME/benchmarkoor/run-cold.sh <<'EOF'
#!/usr/bin/env bash
# Run benchmarkoor with a forced cold host page cache.
# Usage: sudo ./run-cold.sh [extra benchmarkoor args]
set -euo pipefail

if [ "$(id -u)" -ne 0 ]; then
  echo "ERROR: must run as root (drop_caches + cpu_freq + cgroup limits)" >&2
  exit 1
fi

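# Resolve defaults relative to this script so they do not depend on $HOME,
# which sudo may reset to root's home directory.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CFG="${CFG:-$SCRIPT_DIR/run.erigon-osaka-sstore.yaml}"
BIN="${BIN:-$SCRIPT_DIR/benchmarkoor}"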

echo "[cold] tearing down stale erigon-bal-full container if any"
docker rm -f erigon-bal-full 2>/dev/null || true
docker ps -a --format '{{.Names}}' \
  | grep -E '^benchmarkoor-.*-erigon-bal-full$' \
  | xargs -r docker rm -f

echo "[cold] unmounting any leftover overlayfs mounts"
mount | awk '/benchmarkoor-overlay/ {print $3}' | while read -r m; do
  umount "$m" 2>/dev/null || umount -l "$m" 2>/dev/null || true
done

echo "[cold] sync + drop_caches"
sync
echo 3 > /proc/sys/vm/drop_caches
echo "[cold] page cache after drop:"
grep -E '^Cached|^Buffers' /proc/meminfo

echo "[cold] launching benchmarkoor"
exec "$BIN" run --config "$CFG" --log-level=info "$@"
EOF
chmod +x $HOME/benchmarkoor/run-cold.sh
```
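
The wrapper prints the `Cached` and `Buffers` lines after the drop; turning them into a single number makes it easier to decide whether the host is actually cold. A minimal sketch, assuming the usual `/proc/meminfo` layout of `Key:   value kB`; the function name and any threshold applied to its result are illustrative.

```python
def page_cache_gib(meminfo_text: str) -> float:
    """Sum the Cached and Buffers entries of /proc/meminfo and return GiB."""
    total_kib = 0
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        # Exact key match, so SwapCached is not counted.
        if key in ("Cached", "Buffers"):
            total_kib += int(rest.split()[0])
    return total_kib / (1024 * 1024)
```

Calling `page_cache_gib(open("/proc/meminfo").read())` right after the drop and logging it next to each run gives a record of how cold each "cold" run really was.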

### Step 7: run

For the sequential-commitment baseline (current best erigon perf):

```bash
sudo $HOME/benchmarkoor/run-cold.sh 2>&1 | tee /tmp/bench-baseline-cold.log
```

Subsequent warm runs (without dropping cache):

```bash
sudo $HOME/benchmarkoor/benchmarkoor run --config $HOME/benchmarkoor/run.erigon-osaka-sstore.yaml --log-level=info \
  2>&1 | tee /tmp/bench-baseline-warm.log
```

To repro the wrong-trie-root bug, add `--experimental.concurrent-commitment` to `extra_args` in the yaml and re-run. The run will fail at block 24358305 during the setup phase (i.e. before the actual test block fires), so you'll see no `result.json` for the test step, only the failure logged in the console output.

### Step 8: read results

```bash
LATEST=$(ls $HOME/benchmarkoor/results/runs/ | grep -v index.json | sort | tail -1)
cat $HOME/benchmarkoor/results/runs/$LATEST/result.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
for n, t in d.get('tests', {}).items():
    s = t.get('steps', {}).get('test', {}).get('aggregated', {})
    if not s or not s.get('time_total'):
        continue
    rt = s.get('resource_totals', {})
    print(f'{n}')
    print(f'  test_time_s={s[\"time_total\"]/1e9:.3f}')
    print(f'  gas_used={s[\"gas_used_total\"]}')
    print(f'  mgas_per_s={(s[\"gas_used_total\"]/(s[\"time_total\"]/1e9))/1e6:.2f}')
    print(f'  disk_read_GB={rt.get(\"disk_read_bytes\",0)/1e9:.2f}')
    print(f'  disk_read_iops={rt.get(\"disk_read_iops\",0)}')
    print(f'  cpu_s={rt.get(\"cpu_usec\",0)/1e6:.2f}')
"
```
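
The Mgas/s figure in that script is plain arithmetic, and it can be cross-checked against the table. A small sketch under two assumptions: `time_total` is in nanoseconds (which the `/1e9` above implies), and the gas used is taken as the full 30M block purely for illustration.

```python
def mgas_per_s(gas_used: int, time_total_ns: int) -> float:
    """Throughput in Mgas/s from gas used and wall time in nanoseconds."""
    seconds = time_total_ns / 1e9
    return gas_used / seconds / 1e6

# Cold erigon row: ~30M gas in 4.89 s.
print(f"{mgas_per_s(30_000_000, 4_890_000_000):.2f} Mgas/s")
```

The result sits within rounding distance of the 6.13 Mgas/s in the comparison table; the small differences on other rows suggest the block uses slightly less than 30M gas.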

### Common failure modes (so you don't repeat ours)

1. **"no space left on device" mid-extract** → either keep `snapshot.tar.zst` on a different volume from the extract target, or get a 4 TB+ volume. If the extract failed, do a clean re-extract (delete the partial extract first, then `tar -I zstd -xf` *without* `--keep-old-files`).
2. **benchmarkoor times out before erigon is ready** → confirm you used the patched 900s `DefaultReadyTimeout`.
3. **`docker: image not found`** → benchmarkoor uses `pull_policy: never`, so the image must be local. Build with the fast-swap step first.
4. **First cold run is much slower than subsequent runs (4.9s vs 3.7s)** → expected. Page cache warms after the first iteration. Use the cold wrapper for repeatable cold numbers.
5. **`Permission denied` on `/proc/sys/vm/drop_caches`** → benchmarkoor must run as root. The cold wrapper enforces this.
6. **curl download repeatedly stalls / restarts from zero** → `curl --retry` truncates without `-C -`. Use aria2c.
7. **Wrong-trie-root on block 24358305 with `--experimental.concurrent-commitment`** → not your fault. That's the blocker bug below.

---

## Blocker: `--experimental.concurrent-commitment` produces wrong trie root deterministically

### Reproducer

Follow the end-to-end reproducer above through Step 5, but add `--experimental.concurrent-commitment` to `extra_args` (in place of the placeholder comment):

```yaml
      extra_args:
        - --networkid=12159
        - --fcu.background.commit=false
        - --experimental.concurrent-commitment
```

Then run Step 7. Branch under test: `bal-devnet-3` (HEAD `671ece6747`). The bug also reproduces on `feat/storage-parallel-trie` (= bal-devnet-3 + Phase 1 detect + Phase 2a buffering). Reverting Phase 2a's buffer-and-replay back to inline `followAndUpdate` (the original Phase 1 shape) reproduces the *exact same* wrong root, so the storage-parallel-trie commits are NOT the cause.

**Failing block:** `24358305` (the LAST setup block, before the SSTORE-bloated test block 24358306).

```
[5/5 Execution] Wrong trie root of block 24358305:
  computed d5b10024a44c952b458ef9fe5957d35c4f8bd3aa673b2b369cd489ab75cc3437
  expected dbb289601651fbd44fbfe8fac02d4e1ab5c2f2a47aff7a0b519a8423b6bf338f
Block hash: 05a6d80ceb1828354ff3768ea2730e0412591bb5fd8627681e83d781152355af
[5/5 Execution] rw exit err="invalid block: wrong trie root, block=24358305"
  stack="[exec3_parallel.go:192 exec3_parallel.go:468 exec3_parallel.go:468
         exec3.go:259 stage_execute.go:391 default_stages.go:328
         sync.go:500 sync.go:331 stageloop.go:598 executor.go:313
         fork_validator.go:297 fork_validator.go:259 exec_module.go:483 ...]"
```

Same hashes, same block, same code path on every run.
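
Determinism across runs can be checked mechanically by extracting the block number and both roots from each console log. A minimal sketch, assuming the log text looks like the excerpt above (the regex tolerates the fields being on one line or several); the function names are illustrative.

```python
import re

# Matches the "Wrong trie root of block N: computed <hash> expected <hash>" entry.
WRONG_ROOT = re.compile(
    r"Wrong trie root of block (?P<block>\d+):\s+"
    r"computed (?P<computed>[0-9a-f]{64})\s+"
    r"expected (?P<expected>[0-9a-f]{64})"
)

def wrong_roots(log_text: str) -> list[tuple[int, str, str]]:
    """Return (block, computed_root, expected_root) for every wrong-root entry."""
    return [
        (int(m["block"]), m["computed"], m["expected"])
        for m in WRONG_ROOT.finditer(log_text)
    ]

def same_failure(logs: list[str]) -> bool:
    """True if every log contains a wrong-root entry and all logs report the same ones."""
    parsed = [wrong_roots(text) for text in logs]
    return bool(parsed) and all(parsed) and all(p == parsed[0] for p in parsed)
```

Feeding the saved console output of several runs to `same_failure` confirms (or refutes) that the computed root is identical every time, which is what rules out a race with run-to-run variation in its result.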

### Why block 305 specifically

The setup phase plays 6 small blocks (24358300–24358305). The first commitment batch is always sequential per `// first run always sequential` (`db/state/execctx/domain_shared.go`, `commitment.go:158`). After each batch, `ConcurrentPatriciaHashed.CanDoConcurrentNext()` decides whether the *next* batch can run concurrent.

- Blocks 300–304 → sequential commitment → succeed.
- After 304, `CanDoConcurrentNext()` returns true (root has no extension; zero-prefix branch is large enough).
- Block 305 → first concurrent batch → wrong root.

So this is the *first* concurrent-commitment batch in the run. That points at `ParallelHashSort` (`execution/commitment/hex_concurrent_patricia_hashed.go`) or its supporting unfold/fold mechanics, rather than at cumulative state divergence many batches later.

### What we ruled out

| Hypothesis | Test | Outcome |
|---|---|---|
| Phase 2a's buffering broke `ParallelHashSort` | Reverted Phase 2a's buffer-and-replay back to inline `followAndUpdate` (the original Phase 1 shape) | Same wrong root — Phase 2a innocent |
| Some bal-devnet-3 BAL/parallel-exec interaction | All earlier benchmark runs on bal-devnet-3 *without* the flag → sequential commitment → run clean | bal-devnet-3 fine without the flag |
| `--exec.no-prune` interaction | Both `671ece6747` (base + no-prune fix) and pre-no-prune commits show the same failure | Unrelated |

### What we did NOT yet test (handoff items)

1. **Build `origin/main` with `--experimental.concurrent-commitment` and run the same benchmark.** Tells us whether this is a pre-existing upstream bug or a bal-devnet-3 regression.
2. **Bisect bal-devnet-3** if main is fine. Likely candidates: BAL system-address filter (`gas_table.go`), parallel-exec asynctx pattern fixes, the warmuper changes (#20877/#20884), the BAL-balance seeding fix (#20864), and any commitment-side changes since the last known-good main concurrent-commitment baseline.
3. **`ParallelHashSort` invariants on this block.** With `dbg.SetTrace(true)` on the concurrent trie and serial trie, capture the unfold/fold sequence for block 305 and diff. That should localise where the divergence happens.
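
For item 3, the diff itself can be as simple as finding the first position where the two traces disagree. A minimal sketch, assuming each trace has been captured as a list of text lines; the trace line contents are hypothetical, since the real `dbg.SetTrace` output format is not described here.

```python
from itertools import zip_longest

def first_divergence(serial: list[str], concurrent: list[str]):
    """Return (index, serial_line, concurrent_line) at the first mismatch, or None."""
    # zip_longest pads the shorter trace with None, so a truncated trace also counts.
    for i, (a, b) in enumerate(zip_longest(serial, concurrent)):
        if a != b:
            return i, a, b
    return None
```

If the concurrent trace interleaves output from several goroutines, the lines would likely need to be grouped per subtrie before comparing, otherwise ordering noise will show up as the first "divergence".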

---

## Methodology notes

- Canonical published numbers fetched from `https://benchmarkoor-api.core.ethpandaops.io/api/v1/index/suites/2477940593a59252/stats?max_runs_per_client=25`.

## Storage-parallel-trie experiment (paused, branch preserved)

Sub-fanout idea: within the 1-of-16 dominant subtrie on this workload, split 16 ways on `keccak256(slot)[0]`. Two phases committed on `feat/storage-parallel-trie`:

- `2bc8977800` — Phase 1: detect single-account-dominated subtries, log only.
- `3a3bcf3c04` — Phase 2a: warmup-only fanout (16 inner goroutines that `followAndUpdate` clone subtries to populate the OS page cache for the canonical pass).

Phase 2a measurement (with `--experimental.concurrent-commitment` *not* enabled — i.e. dead code): ~0% delta, expected. Once we turned the flag on, the wrong-root bug above blocked everything. Branch preserved on GitHub, not merged. Both phases inert without `--experimental.concurrent-commitment`.

Phase 2b (real mount-at-depth-64 fanout with parallel CPU work) deferred until concurrent commitment is correct.

---

## Acceptance

The performance gap to geth/besu/nethermind is the headline. The path to closing it routes through `--experimental.concurrent-commitment`, so step 1 of the handoff is a working concurrent-commitment baseline on this benchmark — either by fixing the divergence on block 24358305, or by documenting it as a pre-existing main-branch bug and filing the fix there.

Once that exists, the measurement that's actually interesting is with-flag vs without-flag on the same SSTORE-bloated block, both warm and cold. If concurrent-commitment closes most of the gap, we're done. If not, Phase 2b of the storage-parallel-trie work is the follow-up.

