Erigon trails geth/besu/nethermind by ~1.5–2× on cold-SSTORE-bloated workloads; planned fix blocked by concurrent-commitment bug
Summary
On the EthPandaOps osaka-repricings-stateful-jochem benchmarkoor SSTORE-bloated workload (4,200 cold SSTOREs against a single 10 GB EOA storage trie in one 30M-gas block), erigon (bal-devnet-3 base, sequential commitment) sits at 6.1 Mgas/s cold / 7.9 Mgas/s warm, against geth / besu / nethermind at ~10–14 Mgas/s. They also read 3–4× less from disk per run, so the gap is real I/O work, not noise. Erigon's bal-devnet-3 baseline is already a ~4× lift over the published canonical erigon number for the same test (1.7 Mgas/s), so the recent commitment + parallel-exec work has done useful work — but we still trail the leaders.
The natural next optimization is --experimental.concurrent-commitment, which moves the per-block hashing from one goroutine to 16. We expected this to close some of the gap. It produces a deterministic wrong trie root on this benchmark (block 24358305) — first concurrent-commitment batch in the run, before the test block even fires — so we can't measure it. Fixing or working around that bug is the gating step before we can establish whether concurrent-commitment alone closes the gap, or whether deeper work (storage-trie sub-fanout) is needed on top.
Performance comparison
Test: test_sstore_bloated[10GB-fork_Osaka-NO_CACHE-existing_slots_True-write_new_value_True-30M] — 4,200 cold SSTOREs against a 10 GB EOA's storage trie, 30M gas, no cache.
Hardware envelope (matches canonical EIP-7870 fullnode): 6 vCPUs / 32 GB RAM container, cpu_freq pinned 3.6 GHz, no turbo, performance governor, swap disabled, drop_memory_caches: "steps".
(reth's 1.1 Mgas/s line excluded — that's a missed test on their end, not the comparison we should anchor on.)
Headlines:
bal-devnet-3 sequential commitment is already ~4× faster than the published canonical erigon (1.70 → 6.13 Mgas/s cold, 7.95 warm). The recent BAL/parallel-exec/commitment work has paid off.
We still trail geth / besu / nethermind by ~1.5–2× on this workload, and read 3–4× more from disk per run. The disk-read gap is the part most likely to yield to commitment-side parallelism.
Memory footprint is the other outlier: 19–26 GB vs 1–12 GB for everyone else. That's the snapshot mmap working set on a 246 GB MDBX + 2 TB segments dataset and is largely orthogonal to throughput — but worth keeping in mind for total-cost-of-ownership comparisons.
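As a quick sanity check, the ratios behind these headlines can be recomputed from the quoted figures. This is only arithmetic on numbers already stated in this issue; the variable names are illustrative, and the 10–14 Mgas/s leader range is the approximate one given in the summary.

```python
# Figures quoted in the summary and headlines (Mgas/s, and GB read per run).
erigon_canonical = 1.70
erigon_cold, erigon_warm = 6.13, 7.95
leaders_mgas = (10.0, 14.0)       # approximate range for geth / besu / nethermind
leaders_read_gb = (0.65, 0.87)
erigon_read_gb = 2.69             # cold run

# Lift of bal-devnet-3 over the published canonical erigon number.
lift_cold = erigon_cold / erigon_canonical
lift_warm = erigon_warm / erigon_canonical

# Throughput gap to the leaders (low end, high end).
gap_cold = tuple(x / erigon_cold for x in leaders_mgas)
gap_warm = tuple(x / erigon_warm for x in leaders_mgas)

# Read amplification relative to the leaders (low end, high end).
read_amp = tuple(erigon_read_gb / x for x in sorted(leaders_read_gb, reverse=True))

print(f"lift over canonical: cold {lift_cold:.2f}x, warm {lift_warm:.2f}x")
print(f"gap to leaders cold: {gap_cold[0]:.2f}x to {gap_cold[1]:.2f}x")
print(f"gap to leaders warm: {gap_warm[0]:.2f}x to {gap_warm[1]:.2f}x")
print(f"disk read ratio:     {read_amp[0]:.2f}x to {read_amp[1]:.2f}x")
```

The cold lift lands a little under 4× and the warm lift a little over, and the disk-read ratio falls in the 3–4× band.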
Why the gap (read-amplification hypothesis)
Geth / besu / nethermind read 0.65–0.87 GB to do this block; bal-devnet-3 reads 2.69 GB cold. With one shared 10 GB storage trie and 4,200 cold slots, the difference is how the 4,200 trie traversals are batched and which intermediate pages we re-read. Erigon's HexPatriciaHashed does 16-way concurrent-commitment fanout on the first nibble of keccak256(plainKey), but on this workload all 4,200 slots share the same keccak256(addr)[0] (one EOA → one nibble), so 1 of 16 subtries does 100% of the work. The other 15 are idle. The leaders presumably do better because they batch and dedupe storage-trie reads inside the single account.
Two paths to close the gap:
Make --experimental.concurrent-commitment actually run. Removing the wrong-root bug below would let us measure whether concurrent-commitment alone is enough, even with the 1-of-16 imbalance (it might still help on cross-account workloads).
Storage-trie sub-fanout — within the dominant subtrie, fanout 16 ways on keccak256(slot)[0]. We have a design and a Phase 1 detection commit for this on feat/storage-parallel-trie. Phase 2 (the real mount-at-depth-64 fold) is unwritten and depends on concurrent-commitment producing correct roots first.
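The 1-of-16 imbalance, and what a sub-fanout on the slot hash would change, can be illustrated with a toy model. This is a sketch under loud assumptions, not erigon code: `hashlib.sha3_256` stands in for keccak256 (the two differ in padding, so the digests are not real trie keys), the address and slot values are made up, and the hashed storage key is assumed to be laid out as H(addr) || H(slot).

```python
import hashlib
from collections import Counter

def h(data: bytes) -> bytes:
    # Stand-in for keccak256; NOT the real hash used by the trie.
    return hashlib.sha3_256(data).digest()

def first_nibble(digest: bytes) -> int:
    # High 4 bits of the first byte.
    return digest[0] >> 4

addr = bytes.fromhex("11" * 20)                        # hypothetical EOA address
slots = [i.to_bytes(32, "big") for i in range(4200)]   # hypothetical slot keys

# Assumed layout of a hashed storage key: H(addr) || H(slot).
hashed_keys = [h(addr) + h(slot) for slot in slots]

# Top-level fanout: bucket on the first nibble of the whole key.
top_level = Counter(first_nibble(k) for k in hashed_keys)

# Sub-fanout: bucket on the first nibble of the slot half of the key.
sub_level = Counter(first_nibble(k[32:]) for k in hashed_keys)

print("top-level buckets in use:", len(top_level))
print("sub-fanout buckets in use:", len(sub_level))
print("largest sub-fanout bucket:", max(sub_level.values()))
```

Because every key starts with the same H(addr), the top-level split puts all 4,200 keys in one bucket, while the split on the slot half spreads them across the 16 nibbles roughly evenly.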
Reproducer (end-to-end, for an external machine)
Access prerequisites (do this first)
Several of the URLs and APIs below sit behind EthPandaOps access controls. Confirm you have what you need before starting — discovering it after a 1.7 TB download is no fun.
Snapshot bucket (https://snapshots.ethpandaops.io/...): NOT publicly accessible. Our working download was authorised against mh0lt's GitHub account — the bucket is gated by GitHub identity / EthPandaOps allowlist. An external agent on a fresh machine will likely 404/403. To unblock: coordinate with mh0lt (or whoever owns the bal-devnet-3 work) to either (a) request EthPandaOps allowlists the new identity, (b) be issued a presigned URL, or (c) receive a forwarded copy of the artefact via another channel. Confirm with curl -sI <url> before starting the download — anything other than HTTP/2 200 means access is not yet granted.
Test fixture (https://data.ethpandaops.io/benchmarkoor/osaka-repricings-stateful-jochem.tar.gz) and opcode trace (https://data.ethpandaops.io/benchmarkoor/opcode_trace_results.json): currently public, downloaded by benchmarkoor itself at run time. If those 404 from your IP, the same fix applies.
Canonical published numbers (https://benchmarkoor-api.core.ethpandaops.io/api/v1/index/...): requires a bearer token from EthPandaOps. Get one before writing code that depends on it. Pattern: curl -H 'Authorization: Bearer bmk_...' '<url>'.
Genesis gist (https://gist.githubusercontent.com/skylenet/...): publicly hosted on GitHub Gist — typically fine, but if the gist is deleted you'll need a copy. Save the JSON locally as a fallback.
Docker Hub for golang:1.24-bookworm (benchmarkoor build): standard public pull, but rate-limited if unauthenticated. Consider docker login if you'll be rebuilding.
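The access check described above (a HEAD request, where anything other than 200 means access is not yet granted) could be scripted roughly as follows. This is a minimal sketch using only the Python standard library; `head_status` and `access_granted` are hypothetical helper names, and no bearer-token or GitHub-identity handling is attempted here.

```python
import urllib.error
import urllib.request

def head_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the final HTTP status code."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx / 5xx responses surface as HTTPError; the code is what we want.
        return err.code

def access_granted(status: int) -> bool:
    # Per the prerequisites above, only a plain 200 counts as "access granted".
    return status == 200
```

Calling `access_granted(head_status(url))` for each URL in the list before starting the download gives a single yes/no answer per resource.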
Hardware / OS requirements
Linux x86_64 (we used Ubuntu 24.04, kernel 6.8).
6+ cores you can pin via cpuset (we used AMD EPYC 4244P; the canonical EIP-7870 envelope is 6 logical / 3 physical, plus 1–2 cores of headroom for the host).
32 GB RAM allocated to the EL container; host with ≥48 GB recommended.
NVMe SSD storage strongly recommended — benchmark is read-IOPS heavy (470k–660k IOPS during the test step).
Root access (for vm.drop_caches, cpu_freq pinning, cgroup memory caps).
zstd, aria2c, jq, python3 installed.
Disk space requirements
Recommendation: 4 TB free on the volume that holds the snapshot. Minimum 3 TB if you delete snapshot.tar.zst immediately after extract. We hit "no space left on device" mid-extract on a 3 TB volume, which corrupts the snapshot — see "Snapshot integrity" below.
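The disk recommendation follows from simple arithmetic on the archive size and the extracted size. The sketch below spells that out; the archive size is the byte count quoted for snapshot.tar.zst in Step 1, the extracted size is the approximate ~2.3 TB figure, and the function names are illustrative.

```python
import shutil

ARCHIVE_BYTES = 1_690_501_673_719       # snapshot.tar.zst, as quoted in Step 1
EXTRACTED_BYTES = 2_300_000_000_000     # ~2.3 TB after extract (approximate)

def peak_bytes(archive_on_same_volume: bool) -> int:
    """Peak space used on the extraction volume while extracting."""
    return EXTRACTED_BYTES + (ARCHIVE_BYTES if archive_on_same_volume else 0)

def fits(free_bytes_before_download: int, archive_on_same_volume: bool) -> bool:
    """True if the volume can hold everything at the peak of extraction."""
    return free_bytes_before_download >= peak_bytes(archive_on_same_volume)

def free_bytes(path: str) -> int:
    """Free space on the volume containing `path`."""
    return shutil.disk_usage(path).free

print(f"peak with archive on same volume: {peak_bytes(True) / 1e12:.2f} TB")
print(f"peak with archive elsewhere:      {peak_bytes(False) / 1e12:.2f} TB")
```

With the archive and the extract on the same volume the peak is just under 4 TB, which is why a 3 TB volume runs out of space mid-extract unless the archive lives somewhere else.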
Step 1: Download the snapshot
URL: https://snapshots.ethpandaops.io/perf-devnet-3/erigon/24358000/snapshot.tar.zst (1690501673719 bytes ≈ 1.69 TB, zstd-compressed, includes both EL chaindata and Caplin CL data).
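Use aria2c, not curl. curl --retry truncates the file on retry without -C -, and we lost progress repeatedly. aria2c with 16 parallel segments achieved ~635 MiB/s on a 10 Gbit link.
mkdir -p /erigon-data/snapshots
aria2c -c -x 16 -s 16 \
  -d /erigon-data/snapshots \
  -o snapshot.tar.zst \
  'https://snapshots.ethpandaops.io/perf-devnet-3/erigon/24358000/snapshot.tar.zst'
-c = continue on interrupt; -x 16 -s 16 = 16 parallel connections, 16 segments.
Step 2: Extract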
mkdir -p /erigon-data/snapshots/erigon/perf-devnet-3/24358000
cd /erigon-data/snapshots/erigon/perf-devnet-3/24358000
tar -I zstd -xf /erigon-data/snapshots/snapshot.tar.zst
(-I zstd tells tar to pipe through zstd. Plain tar -xf won't work — it's not gzipped.)
After extract, the directory should be ~2.3 TB and contain chaindata/, snapshots/, caplin/, plus salt-blocks.txt / salt-state.txt. Roughly 1606 segment files plus an MDBX mdbx.dat.
Critical: do not pass --keep-old-files if extraction fails midway. That flag preserves zero-byte stub files from the failed run, leaving the snapshot corrupt (we hit this; salt-blocks.txt was 0 bytes, expected 4). On failure: delete the partial extract and re-extract from scratch, OR use plain tar -I zstd -x (which overwrites stubs).
Snapshot integrity check
Any zero-byte file ending in .seg, .kv, .v, .bt, or .kvei, or matching salt-*.txt, should be treated as a sign of a corrupt extract; re-extract if you find one.
(Zero-byte *.lck files are benign; those are MDBX lock files. Sparse .bt/.kvei index sidecars can legitimately be zero bytes, so verify those by content type if unsure.)
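One way to run this check is a small directory walk that lists zero-byte files with the suspect suffixes. This is a sketch, not the command originally used; `zero_byte_suspects` is a hypothetical helper name, and the suffix list is taken from the text above.

```python
import os

SUSPECT_SUFFIXES = (".seg", ".kv", ".v", ".bt", ".kvei")

def zero_byte_suspects(root: str) -> list:
    """Return sorted paths of zero-byte files that look like snapshot data."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            is_salt = name.startswith("salt-") and name.endswith(".txt")
            if not (name.endswith(SUSPECT_SUFFIXES) or is_salt):
                continue  # e.g. *.lck lock files are ignored
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                hits.append(path)
    return sorted(hits)
```

Running `zero_byte_suspects` on the extracted snapshot directory should return an empty list on a healthy extract, apart from any sparse .bt/.kvei sidecars, which need a manual look.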
Step 3: Build benchmarkoor (with the timeout patch)
benchmarkoor's stock DefaultReadyTimeout is 120s. Erigon takes 2+ minutes to come up RPC-ready on a 246 GB MDBX + 2 TB segments dataset, so the harness gives up before erigon is ready. Patch it to 900s.
git clone https://github.com/ethpandaops/benchmarkoor /tmp/benchmarkoor-src
cd /tmp/benchmarkoor-src
sed -i 's/DefaultReadyTimeout = 120 \* time.Second/DefaultReadyTimeout = 900 * time.Second/' \
pkg/runner/runner.go
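Note that sed -i exits successfully even when the pattern does not match, so an upstream change to that line would silently leave the 120s timeout in place. A hedged alternative is to apply the same substitution with a check on the match count. The sketch below assumes the constant appears exactly as in the sed pattern above; `patch_ready_timeout` is a hypothetical helper name.

```python
import re

# Same pattern and replacement as the sed command above.
PATTERN = re.compile(r"DefaultReadyTimeout = 120 \* time\.Second")
REPLACEMENT = "DefaultReadyTimeout = 900 * time.Second"

def patch_ready_timeout(source: str) -> str:
    """Return `source` with the timeout patched; fail loudly if not found."""
    patched, count = PATTERN.subn(REPLACEMENT, source)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return patched
```

To use it, read pkg/runner/runner.go, pass its contents through `patch_ready_timeout`, and write the result back; a `ValueError` means the file no longer matches what this issue describes.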
The host is missing C deps benchmarkoor needs (libbtrfs / libgpgme / libdevmapper), so build inside Docker:
Result: ~65 MB binary at ~/benchmarkoor/benchmarkoor.
Step 4: Build erigon (the binary you want to test)
Standard make erigon from the branch under test. For fast iteration we used a Docker fast-swap pattern (rebuild image in <5s vs 5+ min for the full Dockerfile):
# First time only: build the canonical image once
cd $ERIGON_REPO
make docker DOCKER_TAG=local/erigon:bal-devnet-3   # or use any base image with erigon at /usr/local/bin/erigon

# After every code change: just swap the freshly-built binary into the existing image
make erigon
cat > /tmp/Dockerfile.erigon-swap <<'EOF'
FROM local/erigon:bal-devnet-3
USER root
COPY erigon /usr/local/bin/erigon
RUN chmod +x /usr/local/bin/erigon
USER erigon
EOF
cp build/bin/erigon /tmp/erigon
cd /tmp && docker build -t local/erigon:bal-devnet-3 -f Dockerfile.erigon-swap .
Step 5: benchmarkoor config
source_dir must point at the extracted snapshot dir from Step 2.
Datadir method (method:) — pick overlayfs
Three options; the trade-offs matter on a constrained box:
| Method | Speed | Disk overhead | Notes |
| --- | --- | --- | --- |
| overlayfs (kernel) | fastest | only the per-run diff (~5 GB) | What we used. Needs root + overlay kernel module. Cleanly umounts on test end. |
| fuse-overlayfs | ~2× slower | only the per-run diff | Unprivileged, pure userspace. Use if kernel overlayfs is unavailable. |
| copy | fastest after the copy completes | full duplicate (+2.3 TB) | Requires 2 × snapshot_size free disk per run. We aborted this on a 4 TB box because 2.3 TB extracted + 2.3 TB copy + benchmarkoor work left no headroom. |
Stick with overlayfs unless you have specific reasons not to.
Step 6: cold-cache wrapper (recommended for cold-baseline numbers)
drop_memory_caches: "steps" calls vm.drop_caches=3 between steps but doesn't reliably evict snapshot mmap pages held by overlayfs lower-dirs. We confirmed empirically: drop fired but warm-run pages persisted (cold first run reads 2.69 GB, warm subsequent runs read 1.94 GB).
For repeatable cold numbers, use this wrapper:
cat > $HOME/benchmarkoor/run-cold.sh <<'EOF'
#!/usr/bin/env bash
# Run benchmarkoor with a forced cold host page cache.
# Usage: sudo ./run-cold.sh [extra benchmarkoor args]
set -euo pipefail

if [ "$(id -u)" -ne 0 ]; then
  echo "ERROR: must run as root (drop_caches + cpu_freq + cgroup limits)" >&2
  exit 1
fi

CFG="${CFG:-$HOME/benchmarkoor/run.erigon-osaka-sstore.yaml}"
BIN="${BIN:-$HOME/benchmarkoor/benchmarkoor}"

echo "[cold] tearing down stale erigon-bal-full container if any"
docker rm -f erigon-bal-full 2>/dev/null || true
# "|| true": with pipefail, grep exits 1 when nothing matches, which would
# otherwise abort the script on a clean host.
docker ps -a --format '{{.Names}}' \
  | grep -E '^benchmarkoor-.*-erigon-bal-full$' \
  | xargs -r docker rm -f || true

echo "[cold] unmounting any leftover overlayfs mounts"
mount | awk '/benchmarkoor-overlay/ {print $3}' | while read -r m; do
  umount "$m" 2>/dev/null || umount -l "$m" 2>/dev/null || true
done

echo "[cold] sync + drop_caches"
sync
echo 3 > /proc/sys/vm/drop_caches

echo "[cold] page cache after drop:"
grep -E '^Cached|^Buffers' /proc/meminfo

echo "[cold] launching benchmarkoor"
exec "$BIN" run --config "$CFG" --log-level=info "$@"
EOF
chmod +x $HOME/benchmarkoor/run-cold.sh
Step 7: run
For the sequential-commitment baseline (current best erigon perf):
sudo $HOME/benchmarkoor/run-cold.sh 2>&1 | tee /tmp/bench-baseline-cold.log
Subsequent warm runs (without dropping cache):
sudo $HOME/benchmarkoor/benchmarkoor run --config $HOME/benchmarkoor/run.erigon-osaka-sstore.yaml --log-level=info \
  2>&1 | tee /tmp/bench-baseline-warm.log
To repro the wrong-trie-root bug, add --experimental.concurrent-commitment to extra_args in the yaml and re-run. The run will fail at block 24358305 during the setup phase (i.e. before the actual test block fires), so you'll see no result.json for the test step, only the failure logged in the console output.
Step 8: read results
LATEST=$(ls $HOME/benchmarkoor/results/runs/ | grep -v index.json | sort | tail -1)
cat $HOME/benchmarkoor/results/runs/$LATEST/result.json | python3 -c "
import json, sys
d = json.load(sys.stdin)
for n, t in d.get('tests', {}).items():
    s = t.get('steps', {}).get('test', {}).get('aggregated', {})
    if not s or not s.get('time_total'):
        continue
    rt = s.get('resource_totals', {})
    print(f'{n}')
    print(f'  test_time_s={s[\"time_total\"]/1e9:.3f}')
    print(f'  gas_used={s[\"gas_used_total\"]}')
    print(f'  mgas_per_s={(s[\"gas_used_total\"]/(s[\"time_total\"]/1e9))/1e6:.2f}')
    print(f'  disk_read_GB={rt.get(\"disk_read_bytes\",0)/1e9:.2f}')
    print(f'  disk_read_iops={rt.get(\"disk_read_iops\",0)}')
    print(f'  cpu_s={rt.get(\"cpu_usec\",0)/1e6:.2f}')
"
Common failure modes (so you don't repeat ours)
"no space left on device" mid-extract → either delete snapshot.tar.zst first then extract elsewhere, or get a 4 TB+ volume. If extract failed, do a clean re-extract (delete partial first, then tar -I zstd -x without --keep-old-files).
benchmarkoor times out before erigon is ready → confirm you used the patched 900s DefaultReadyTimeout.
docker: image not found → benchmarkoor uses pull_policy: never, so the image must be local. Build with the fast-swap step first.
First cold run is much slower than subsequent runs (4.9s vs 3.7s) → expected. Page cache warms after the first iteration. Use the cold wrapper for repeatable cold numbers.
Permission denied on /proc/sys/vm/drop_caches → benchmarkoor must run as root. The cold wrapper enforces this.
curl download repeatedly stalls / restarts from zero → curl --retry truncates without -C -. Use aria2c.
Wrong-trie-root on block 24358305 with --experimental.concurrent-commitment → not your fault. That's the blocker bug below.
Blocker: --experimental.concurrent-commitment produces wrong trie root deterministically
Reproducer
Follow the end-to-end reproducer above through Step 5, but uncomment --experimental.concurrent-commitment in extra_args, then run Step 7.
Branch under test: bal-devnet-3 (HEAD 671ece6747). The bug also reproduces on feat/storage-parallel-trie (= bal-devnet-3 + Phase 1 detect + Phase 2a buffering); reverting Phase 2a's buffer-and-replay back to inline followAndUpdate (the original Phase 1 shape) reproduces the exact same wrong root, so the storage-parallel-trie commits are NOT the cause.
Failing block: 24358305 (the LAST setup block, before the SSTORE-bloated test block 24358306).
Same hashes, same block, same code path on every run.
Why block 305 specifically
The setup phase plays 6 small blocks (24358300–24358305). The first commitment batch is always sequential per // first run always sequential (db/state/execctx/domain_shared.go, commitment.go:158). After each batch, ConcurrentPatriciaHashed.CanDoConcurrentNext() decides whether the next batch can run concurrent.
Blocks 300–304 → sequential commitment → succeed.
After 304, CanDoConcurrentNext() returns true (root has no extension; zero-prefix branch is large enough).
Block 305 → first concurrent batch → wrong root.
So this is the first concurrent-commitment batch in the run. The defect is in ParallelHashSort (execution/commitment/hex_concurrent_patricia_hashed.go) or its supporting unfold/fold mechanics, not in cumulative state divergence many batches later.
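The scheduling argument above can be written down as a toy model. This is a sketch of the reasoning only, not erigon code: it assumes one commitment batch per block, and the predicate passed in is a hypothetical stand-in for ConcurrentPatriciaHashed.CanDoConcurrentNext() that first returns true after block 24358304, as observed.

```python
from typing import Callable, Dict, Iterable

def commitment_modes(blocks: Iterable[int],
                     can_do_concurrent_next: Callable[[int], bool]) -> Dict[int, str]:
    """Toy model: map each block to the commitment mode its batch runs in."""
    modes = {}
    concurrent = False  # the first batch always runs sequentially
    for block in blocks:
        modes[block] = "concurrent" if concurrent else "sequential"
        # After each batch, decide the mode of the *next* batch.
        concurrent = can_do_concurrent_next(block)
    return modes

setup_blocks = range(24358300, 24358306)
# Hypothetical predicate: the check first passes after block 24358304.
modes = commitment_modes(setup_blocks, lambda block: block >= 24358304)
first_concurrent = min(b for b, m in modes.items() if m == "concurrent")
print(first_concurrent)  # 24358305
```

Under these assumptions the failing block is the first one whose batch runs concurrently, which is what points at the concurrent path itself rather than at accumulated divergence.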
What we ruled out
| Hypothesis | Test | Outcome |
| --- | --- | --- |
| Phase 2a's buffering broke ParallelHashSort | Reverted Phase 2a's buffer-and-replay back to inline followAndUpdate (the original Phase 1 shape) | Same wrong root: Phase 2a innocent |
| Some bal-devnet-3 BAL/parallel-exec interaction | All earlier benchmark runs on bal-devnet-3 without the flag → sequential commitment → run clean | bal-devnet-3 fine without the flag |
| --exec.no-prune interaction | Both 671ece6747 (base + no-prune fix) and pre-no-prune commits show the same failure | Unrelated |
What we did NOT yet test (handoff items)
Build origin/main with --experimental.concurrent-commitment and run the same benchmark. Tells us whether this is a pre-existing upstream bug or a bal-devnet-3 regression.
… gas_table.go), parallel-exec asynctx pattern fixes, the warmuper changes (#20877, #20884), the BAL-balance seeding fix (#20864), and any commitment-side changes since the last known-good main concurrent-commitment baseline.
ParallelHashSort invariants on this block. With dbg.SetTrace(true) on the concurrent trie and serial trie, capture the unfold/fold sequence for block 305 and diff. That should localise where the divergence happens.
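Once the two traces are captured, locating the first point where they disagree is a simple line-by-line comparison. The sketch below assumes each trace has been saved as a list of text lines in execution order; the trace line format shown in the usage example is invented for illustration, and `first_divergence` is a hypothetical helper name.

```python
from itertools import zip_longest
from typing import List, Optional, Tuple

def first_divergence(serial: List[str],
                     concurrent: List[str]) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, serial_line, concurrent_line) at the first mismatch.

    Returns None if both traces are identical. A trace that ends early
    shows up as a mismatch against None.
    """
    for i, (a, b) in enumerate(zip_longest(serial, concurrent)):
        if a != b:
            return i, a, b
    return None

# Invented trace lines, for illustration only.
serial_trace = ["unfold 0", "fold 0", "unfold 1", "fold 1"]
concurrent_trace = ["unfold 0", "fold 0", "fold 1", "unfold 1"]
print(first_divergence(serial_trace, concurrent_trace))
```

If the concurrent trie legitimately interleaves work from different subtries, the traces may need to be grouped per subtrie before comparing; that grouping is left out of this sketch.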
Methodology notes
Canonical published numbers fetched from https://benchmarkoor-api.core.ethpandaops.io/api/v1/index/suites/2477940593a59252/stats?max_runs_per_client=25.
Storage-parallel-trie experiment (paused, branch preserved)
Sub-fanout idea: within the 1-of-16 dominant subtrie on this workload, split 16 ways on keccak256(slot)[0]. Two phases committed on feat/storage-parallel-trie:
2bc8977800 (Phase 1): detect single-account-dominated subtries, log only.
3a3bcf3c04 (Phase 2a): warmup-only fanout (16 inner goroutines that followAndUpdate clone subtries to populate the OS page cache for the canonical pass).
Phase 2a measurement (with --experimental.concurrent-commitment not enabled, i.e. dead code): ~0% delta, as expected. Once we turned the flag on, the wrong-root bug above blocked everything. Branch preserved on GitHub, not merged. Both phases are inert without --experimental.concurrent-commitment.
Phase 2b (real mount-at-depth-64 fanout with parallel CPU work) deferred until concurrent commitment is correct.
Acceptance
The performance gap to geth/besu/nethermind is the headline. The path to closing it routes through --experimental.concurrent-commitment, so step 1 of the handoff is a working concurrent-commitment baseline on this benchmark — either by fixing the divergence on block 24358305, or by documenting it as a pre-existing main-branch bug and filing the fix there.
Once that exists, the measurement that's actually interesting is: with-flag vs without-flag on the same SSTORE-bloated block, both warm and cold. If concurrent-commitment closes most of the gap, we're done. If not, Phase 2 of the storage-parallel-trie work is the follow-up.