performance: cherry-pick 5 improvements to main by AskAlexSharov · Pull Request #21374 · erigontech/erigon

AskAlexSharov · 2026-05-23T06:26:25Z

Cherry-pick of 5 performance improvements from the performance branch to main:

- warmuper: cancelable worker (#20941)
- db/state, cmd/integration: 4x larger commitment rebuild shard, squeeze flag transparent (#21147, opened as an [r3.4] PR)
- db/rawdb: increase ChangeSets3 prune loop stride 100→1000, move log inside stride check (#21204, "increase ChangeSets3 prune limit at chain tip")
- db/seg: increase bufio pool size from 256KB to 512KB
- db/kv/prune: remove dead limit parameter from TableScanningPrune (accepted but never forwarded internally)

… squeeze flag transparent (#21147)

## Summary

Two tuning/transparency changes for commitment rebuild on `release/3.4`. Subset of `awskii/r34-inc-shard-def-size`; the `minStepsForReferencing` and `AggregatorSqueezeCommitmentValues` constant changes from that branch are **intentionally excluded** here.

- **`db/state/squeeze.go`**: raise `shardStepsSize` cap from 16 → 64 steps during `RebuildCommitmentFiles`. Larger shards cut per-shard overhead on long rebuilds.
- **`db/state/squeeze.go`**: stop forcing `ReplaceKeysInValues=true` inside the rebuild's squeeze path. The post-rebuild squeeze now actually honours the caller's `squeeze` flag: `if !squeeze { return }` (was `if !squeeze && !statecfg.Schema.CommitmentDomain.ReplaceKeysInValues`), and `ForTestReplaceKeysInValues(..., squeeze)` (was hardcoded `true`).
- **`cmd/integration/commands/flags.go`**: flip the `--squeeze` flag default from `true` → `false` so the integration `commitment_rebuild` command no longer squeezes by default.

Net effect: rebuild is faster (bigger shards) and squeeze is opt-in via flag, not silently forced.

## Summary

On heavy-state chains (bloatnet), `ChangeSets3` was the dominant chaindata growth source post-catch-up: the file grew unboundedly because prune couldn't keep up with the per-block changeset write rate.

**Root cause:** the `pruneDiffsLimitOnChainTip = 1000` cap in `PruneExecutionStage` (active when `initialCycle=false`). On bloatnet:

- per-block changeset entries: ~1000–1500 (each ~5 KB serialized diff chunks)
- per commit-cycle: ~40 blocks executed → ~40k–60k entries written
- per commit-cycle: ChangeSets3 prune drains at most 1000 (or until 2s timeout) → drain rate is **roughly 1–2% of write rate**
- net: ChangeSets3 grows ~1–2 GB per minute under heavy load, pushing chaindata file size up by tens of GB per hour

Observed on a 12-hour bloatnet run: ChangeSets3 stayed at 0 B during catch-up (`initialCycle=true` overrides the cap to `math.MaxInt`), then ballooned from 0 → 40 GB in the ~3 hours after the chain caught up. File size grew 38 GB → 181 GB over the same window, with ~80% of the new space attributable to ChangeSets3 + write amplification from a too-small reclaim pool.

## Changes

1. **execution/stagedsync: bump ChangeSets3 chain-tip prune limit 1000 → 200000.** The 2s timeout still bounds wall time; the cap raise removes the artificial floor on how many entries one call drains. With 200k cap × 2s timeout, a single PruneExecutionStage invocation can drain up to ~1 GB of changesets, well above the per-cycle write rate.
2. **db/rawdb: PruneTable: fold logEvery + ctx + timeout into one mod-1000 check.** Per-iteration `select`-on-`logEvery.C` was a syscall on every row. Moved into the same mod-stride as ctx-done + timeout, and bumped stride 100 → 1000. For 200k-row prunes this shaves the per-iter overhead noticeably without affecting timeout responsiveness (1000 iters at ~microseconds each = under 10 ms granularity).

## Notes

- Catch-up path (`initialCycle=true`) is unaffected: the override there already uses `math.MaxInt` / 1h.
- Mainnet's per-block changeset rate is much lower than bloatnet's, so the old 1000 cap was rarely binding. The new 200k cap is just as benign there (the 2s timeout caps actual work).
- The bump pairs with the prune-in-CommitCycle change (#21192); that gave us a second prune call per FCU iteration, but both paths shared the 1000 cap. Doubling calls doesn't help if each is throttled.

## Test plan

- [ ] CI on `performance`
- [ ] Mainnet sync still healthy (cap raise + stride change are non-functional w.r.t. correctness; only affect drain throughput)


AskAlexSharov · 2026-05-27T00:34:52Z


 func withSqueeze(cmd *cobra.Command) {
-	cmd.Flags().BoolVar(&squeeze, "squeeze", true, "use offset-pointers from commitment.kv to account.kv")
+	cmd.Flags().BoolVar(&squeeze, "squeeze", false, "use offset-pointers from commitment.kv to account.kv")


@awskii is it OK for this to go to main?

…ech#21498)

## Background

erigontech#21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to `test-hive-eest.yml` only, with the explicit note that other matrix-bearing reusable workflows could get the same treatment "if another workflow becomes the bottleneck." It has, and there is also a second problem erigontech#21483 didn't address: **GitHub does not auto-remove the failed PR from the merge queue**.

CI Gate run [26573584442](https://github.com/erigontech/erigon/actions/runs/26573584442) for PR erigontech#21374 demonstrated both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker Hub blip (`alpine:latest` manifest HEAD returned `unknown:` while building `hive/hiveproxy`).
- hive-eest's `fail-fast` cancelled siblings in **1 second**; `hive / test-hive` (still on `fail-fast: false`) kept dispatching matrix legs: `engine, api, serial` and `engine, cancun, parallel` started at 14:36:01 / 14:36:17 (~7 min *after* `gh run cancel`) and ran to `success`. ci-gate couldn't reach a terminal state until those finished at 14:43:54, delaying eviction by **~14 minutes**.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub did **not** remove PR erigontech#21374 from the queue: the entry stayed at position 2 with state `UNMERGEABLE`. The queue only advanced because PR erigontech#21483 was manually `jump`ed over it.

## Changes

### 1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as erigontech#21483 (`${{ github.event_name == 'merge_group' }}`), applied to:

- `test-hive.yml`
- `test-all-erigon.yml`
- `test-all-erigon-race.yml`
- `test-eest-spec.yml`
- `test-bench.yml`
- `test-kurtosis-assertoor.yml`

Behaviour matches erigontech#21483: in `merge_group`, first failed shard cancels its siblings at the GitHub API layer (no waiting for runner drain); in `pull_request` / `schedule` / `workflow_dispatch`, all shards continue so authors keep the full per-shard breakdown.

### 2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of `ci-gate.yml`'s ci-gate job:

```yaml
- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...
```

The step:

1. **Inspects `needs.*.result` and skips when all are `cancelled` with no `failure`.** That pattern is a queue reshuffle (a PR ahead of us merged, our merge-group SHA is stale), where GitHub re-creates a new merge_group event for us; dequeuing here would be wrong. Confirmed by run [26573568764](https://github.com/erigontech/erigon/actions/runs/26573568764), where ci-gate's job conclusion was `failure` (needs cancelled → `Check all required jobs` exits 1) but the *run* was cancelled by GitHub during a reshuffle.
2. Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>` (handles multi-segment bases like `release/3.4`).
3. Resolves the PR number to a GraphQL node ID and calls the `dequeuePullRequest` mutation. Soft-fails on errors (warning, not non-zero exit) so a dequeue glitch never masks ci-gate's own failure signal.

Permissions bumped from `pull-requests: read` to `pull-requests: write` for the mutation.

## Why both in one PR

Both target the same incident class (broken PR sits at the head of the queue blocking everything else). The fail-fast change shrinks time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually evicts the failed PR. Either alone is a partial fix; having both means a broken PR's run goes red fast *and* the queue advances without anyone needing to manually jump over it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
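The first two parts of the dequeue step (deciding whether to dequeue, and parsing the PR number from the merge-queue ref) can be illustrated with a small Go sketch. The real logic lives in a workflow step in `ci-gate.yml`; everything here is an assumption made for illustration: the names `parseQueueRef` and `shouldDequeue`, the regular expression, the optional `refs/heads/` prefix, and the sample SHA. `shouldDequeue` is also simplified: it dequeues only when at least one result is `failure`, which covers the "all cancelled, no failure" reshuffle case described above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// queueRef matches gh-readonly-queue/<base>/pr-<N>-<sha>. The greedy (.+)
// lets <base> span several path segments, e.g. "release/3.4".
var queueRef = regexp.MustCompile(`^(?:refs/heads/)?gh-readonly-queue/(.+)/pr-(\d+)-([0-9a-f]+)$`)

// parseQueueRef (hypothetical helper) extracts the base branch and PR number.
func parseQueueRef(ref string) (base string, pr int, ok bool) {
	m := queueRef.FindStringSubmatch(ref)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

// shouldDequeue (hypothetical helper) reports whether any needed job really
// failed. Results that are only "cancelled" look like a queue reshuffle, so
// the PR is left in the queue.
func shouldDequeue(results []string) bool {
	for _, r := range results {
		if r == "failure" {
			return true
		}
	}
	return false
}

func main() {
	base, pr, ok := parseQueueRef("gh-readonly-queue/release/3.4/pr-21374-0123abcd")
	fmt.Println(base, pr, ok)
	fmt.Println(shouldDequeue([]string{"cancelled", "cancelled"}))
	fmt.Println(shouldDequeue([]string{"success", "failure", "cancelled"}))
}
```

The third part, resolving the PR to a GraphQL node ID and calling `dequeuePullRequest`, is left out because it needs an authenticated API call.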

…igontech#21504)

## Why

Both merge-queue failures of erigontech#21374 were transient CI infrastructure blips on network-dependent steps that had **no retry**, not problems with the PR's code:

- **Docker Hub registry error** building hive's `hiveproxy` image: `Head https://registry-1.docker.io/v2/library/alpine/manifests/latest: unknown:` → `Tests: , Failed:` (zero tests ran).
- **github.com 403** during the in-builder erigon clone (`fatal: unable to access 'https://github.com/erigontech/erigon/': 403`). That specific path is already largely addressed by erigontech#21447 (build erigon locally instead of cloning inside hive's builder), but the Docker Hub base-image pulls remain in both `docker build` and `./hive`.

The merge-queue contract (per `CI-GUIDELINES.md`) is "a failure means the code is wrong: zero false positives." These infra blips are exactly the false positives that contract forbids, and they re-queue whole batches.

## What

Add a small retry to the network-dependent steps in both `test-hive.yml` and `test-hive-eest.yml`:

- **Build steps** (`docker build` of the local erigon image, `go get`, `go build`): wrapped in a `retry()` helper with 3 attempts and linear backoff.
- **`./hive` run**: retried **only when too few tests were parsed** (`tests < 4`), the signature of a setup/image-build failure. A completed run (`tests >= 4`) is judged on its first result and never retried.

## Why this is safe for the merge queue

- A **genuine test failure is never retried**; only the *fast* infra-setup failure path is, so the retry cannot mask a real regression.
- Because retries only trigger on the fast-fail (image build dying in seconds), added latency is seconds + backoff, not multiplied test runtime.
- This is step-level resilience, not reliance on merge-queue re-runs (which `CI-GUIDELINES.md` explicitly discourages as a flake mask).

## Testing

- Verified the retry logic locally under `bash -e -o pipefail` (the shell GitHub uses for `run:` steps): infra-fail-then-recover → passes after retry; genuine failure → not retried, fails immediately; persistent infra failure → retries to max then fails.
- `actionlint` clean: no new shellcheck findings (and removes one pre-existing SC2181).
- `make lint` → 0 issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AskAlexSharov and others added 5 commits May 23, 2026 12:44

warmuper: cancelable worker (#20941)

f0ce3e5

db/seg: increase bufio pool size from 256KB to 512KB

e303dbc

db/kv/prune: remove dead limit param from TableScanningPrune

97d9f60

AskAlexSharov requested review from awskii, mh0lt, sudeepdino008, taratorio and yperbasis as code owners May 23, 2026 06:26

taratorio approved these changes May 23, 2026


Merge remote-tracking branch 'origin/main' into alex/cp_performance_5…

5411cfa

…prs_35

yperbasis added the performance label May 26, 2026

yperbasis added this pull request to the merge queue May 26, 2026

yperbasis removed this pull request from the merge queue due to a manual request May 26, 2026

yperbasis added this pull request to the merge queue May 26, 2026

yperbasis removed this pull request from the merge queue due to a manual request May 26, 2026

yperbasis added this pull request to the merge queue May 26, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026

AskAlexSharov commented May 27, 2026


Merge branch 'main' into alex/cp_performance_5prs_35

b555b86

yperbasis added this pull request to the merge queue May 28, 2026

yperbasis removed this pull request from the merge queue due to a manual request May 28, 2026

yperbasis mentioned this pull request May 28, 2026

ci: extend merge_group fail-fast and auto-dequeue failed PRs #21498

Merged

yperbasis added this pull request to the merge queue May 28, 2026

yperbasis removed this pull request from the merge queue due to a manual request May 28, 2026

yperbasis added this pull request to the merge queue May 28, 2026

Merged via the queue into main with commit 774daa0 May 29, 2026
90 checks passed

yperbasis deleted the alex/cp_performance_5prs_35 branch May 29, 2026 00:08

yperbasis mentioned this pull request May 29, 2026

ci: retry transient image-build/registry failures in hive CI gate #21504

Merged

yperbasis merged 7 commits into main from alex/cp_performance_5prs_35


5 participants
