
performance: cherry-pick 5 improvements to main#21374

Merged
yperbasis merged 7 commits into main from alex/cp_performance_5prs_35
May 29, 2026

@AskAlexSharov

Collaborator

Cherry-pick of 5 performance improvements from the performance branch to main:

  1. warmuper: cancelable worker (#20941)
  2. db/state, cmd/integration: 4x larger commitment rebuild shard, squeeze flag transparent ([r3.4] #21147)
  3. db/rawdb: increase ChangeSets3 prune loop stride 100→1000, move log inside stride check (increase ChangeSets3 prune limit at chain tip, #21204)
  4. db/seg: increase bufio pool size from 256KB to 512KB
  5. db/kv/prune: remove dead limit parameter from TableScanningPrune (accepted but never forwarded internally)

AskAlexSharov and others added 5 commits May 23, 2026 12:44
… squeeze flag transparent (#21147)

## Summary

Two tuning/transparency changes for commitment rebuild on `release/3.4`.
Subset of `awskii/r34-inc-shard-def-size` — the `minStepsForReferencing`
and `AggregatorSqueezeCommitmentValues` constant changes from that
branch are **intentionally excluded** here.

- **`db/state/squeeze.go`**: raise `shardStepsSize` cap from 16 → 64
steps during `RebuildCommitmentFiles`. Larger shards cut per-shard
overhead on long rebuilds.
- **`db/state/squeeze.go`**: stop forcing `ReplaceKeysInValues=true`
inside the rebuild's squeeze path. The post-rebuild squeeze now actually
honours the caller's `squeeze` flag — `if !squeeze { return }` (was `if
!squeeze && !statecfg.Schema.CommitmentDomain.ReplaceKeysInValues`), and
`ForTestReplaceKeysInValues(..., squeeze)` (was hardcoded `true`).
- **`cmd/integration/commands/flags.go`**: flip the `--squeeze` flag
default from `true` → `false` so the integration `commitment_rebuild`
command no longer squeezes by default.

Net effect: rebuild is faster (bigger shards) and squeeze is opt-in via
flag, not silently forced.
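
To make the shard-size change concrete, here is a rough sketch of how a rebuild's step range splits into shards under the old and new caps. The helper name `shardCount` and the 1024-step rebuild length are illustrative assumptions, not code from `db/state/squeeze.go`; only the caps of 16 and 64 come from the summary above.

```go
package main

import "fmt"

// shardCount returns how many shards are needed to cover totalSteps when
// each shard holds at most capSteps steps (ceiling division).
// Hypothetical helper, for illustration only.
func shardCount(totalSteps, capSteps uint64) uint64 {
	if capSteps == 0 {
		return 0
	}
	return (totalSteps + capSteps - 1) / capSteps
}

func main() {
	const totalSteps = 1024 // assumed rebuild length, not taken from the PR

	fmt.Println("shards at cap 16:", shardCount(totalSteps, 16)) // 64
	fmt.Println("shards at cap 64:", shardCount(totalSteps, 64)) // 16
}
```

With a 4x larger cap the same range needs a quarter of the shards, so any fixed per-shard cost is paid a quarter as often.
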
## Summary

On heavy-state chains (bloatnet), `ChangeSets3` was the dominant
chaindata growth source post-catch-up — file grew unboundedly because
prune couldn't keep up with the per-block changeset write rate.

**Root cause:** the `pruneDiffsLimitOnChainTip = 1000` cap in
`PruneExecutionStage` (active when `initialCycle=false`). On bloatnet:
- per-block changeset entries: ~1000–1500 (each ~5 KB serialized diff
chunks)
- per commit-cycle: ~40 blocks executed → ~40k–60k entries written
- per commit-cycle: ChangeSets3 prune drains at most 1000 (or until 2s
timeout) → drain rate is **roughly 1–2% of write rate**
- net: ChangeSets3 grows ~1–2 GB per minute under heavy load, pushing
chaindata file size up by tens of GB per hour
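
The drain-versus-write imbalance in the bullets above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: `drainRatio` is a hypothetical helper, it ignores the 2s timeout, and the 200000 figure is the new limit described under Changes.

```go
package main

import "fmt"

// drainRatio returns the fraction of one commit cycle's newly written
// changeset entries that a single prune call can remove when it is
// capped at pruneLimit entries. Hypothetical helper for illustration.
func drainRatio(pruneLimit, entriesPerBlock, blocksPerCycle int) float64 {
	written := entriesPerBlock * blocksPerCycle
	if written == 0 || pruneLimit >= written {
		return 1
	}
	return float64(pruneLimit) / float64(written)
}

func main() {
	// Figures from the description: ~1000-1500 entries per block,
	// ~40 blocks per commit cycle.
	fmt.Printf("old cap, 1000 entries/block: %.1f%%\n", 100*drainRatio(1000, 1000, 40))
	fmt.Printf("old cap, 1500 entries/block: %.1f%%\n", 100*drainRatio(1000, 1500, 40))
	fmt.Printf("new cap, 1500 entries/block: %.0f%%\n", 100*drainRatio(200000, 1500, 40))
}
```

Under these assumptions the old cap removes only a few percent of what each cycle writes, while the new cap is large enough that the timeout, not the entry count, becomes the limiting factor.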

Observed on a 12-hour bloatnet run: ChangeSets3 stayed at 0 B during
catch-up (`initialCycle=true` overrides the cap to `math.MaxInt`), then
ballooned from 0 → 40 GB in the ~3 hours after the chain caught up. File
size grew 38 GB → 181 GB over the same window, with ~80% of the new
space attributable to ChangeSets3 + write amplification from a too-small
reclaim pool.

## Changes

1. **execution/stagedsync: bump ChangeSets3 chain-tip prune limit 1000 →
200000.**
The 2s timeout still bounds wall time; raising the cap removes the
artificial ceiling on how many entries one call drains. With a 200k cap and a 2s
timeout, a single `PruneExecutionStage` invocation can drain up to ~1 GB
of changesets (200k entries × ~5 KB each), well above the per-cycle write rate.

2. **db/rawdb: PruneTable: fold logEvery + ctx + timeout into one
mod-1000 check.**
The per-iteration `select` on `logEvery.C` added a channel poll on every row. It now
sits behind the same mod-stride check as ctx-done and timeout, and the stride is bumped 100 →
1000. For 200k-row prunes this noticeably reduces per-iteration overhead
without affecting timeout responsiveness (1000 iterations at ~microseconds
each = under 10 ms granularity).
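
A minimal sketch of the folded stride check follows. It is not the actual `PruneTable` code: `pruneRows`, `runDemo`, and the in-memory row counter are hypothetical stand-ins, and returning `nil` on timeout is an assumption about how the caller resumes later.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const stride = 1000 // rows between ctx / timeout / log checks

// pruneRows removes up to limit rows by calling next() until it reports
// that no rows are left. Cancellation, the deadline, and the log ticker
// are polled only once every stride rows, not on every row.
func pruneRows(ctx context.Context, next func() bool, limit int,
	timeout time.Duration, logEvery *time.Ticker) (int, error) {
	deadline := time.Now().Add(timeout)
	pruned := 0
	for pruned < limit && next() {
		pruned++
		if pruned%stride != 0 {
			continue // hot path: no channel operations, no clock reads
		}
		if err := ctx.Err(); err != nil {
			return pruned, err
		}
		if time.Now().After(deadline) {
			return pruned, nil // assumed: stop quietly, resume on the next call
		}
		select {
		case <-logEvery.C:
			fmt.Printf("pruned %d rows so far\n", pruned)
		default:
		}
	}
	return pruned, nil
}

// runDemo prunes a fake table of the given size and reports how many rows
// were removed and whether the loop stopped with an error.
func runDemo(rows int, cancelled bool) (int, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	logEvery := time.NewTicker(30 * time.Second)
	defer logEvery.Stop()

	remaining := rows
	next := func() bool {
		if remaining == 0 {
			return false
		}
		remaining--
		return true
	}
	n, err := pruneRows(ctx, next, 200_000, 2*time.Second, logEvery)
	return n, err != nil
}

func main() {
	fmt.Println(runDemo(2500, false)) // 2500 false
	fmt.Println(runDemo(2500, true))  // 1000 true
}
```

The second call shows the trade-off: a cancelled context is noticed only at the first stride boundary, after 1000 rows.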

## Notes

- Catch-up path (`initialCycle=true`) is unaffected — the override there
already uses `math.MaxInt` / 1h.
- Mainnet's per-block changeset rate is much lower than bloatnet's, so
the old 1000 cap was rarely binding. The new 200k cap is just as benign
there (the 2s timeout caps actual work).
- The bump pairs with the prune-in-CommitCycle change (#21192) — that
gave us a second prune call per FCU iteration, but both paths shared the
1000 cap. Doubling calls doesn't help if each is throttled.

## Test plan

- [ ] CI on `performance`
- [ ] Mainnet sync still healthy (cap raise + stride change are
non-functional w.r.t. correctness; only affect drain throughput)
@yperbasis yperbasis added this pull request to the merge queue May 26, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 26, 2026
@yperbasis yperbasis added this pull request to the merge queue May 26, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 26, 2026
@yperbasis yperbasis added this pull request to the merge queue May 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026

 func withSqueeze(cmd *cobra.Command) {
-	cmd.Flags().BoolVar(&squeeze, "squeeze", true, "use offset-pointers from commitment.kv to account.kv")
+	cmd.Flags().BoolVar(&squeeze, "squeeze", false, "use offset-pointers from commitment.kv to account.kv")

Collaborator Author

@awskii is it OK for this to go into main?

@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 28, 2026
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 28, 2026
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 774daa0 May 29, 2026
90 checks passed
@yperbasis yperbasis deleted the alex/cp_performance_5prs_35 branch May 29, 2026 00:08
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 29, 2026
…ech#21498)

## Background

erigontech#21483 added `fail-fast: ${{ github.event_name == 'merge_group' }}` to
`test-hive-eest.yml` only, with the explicit note that other
matrix-bearing reusable workflows could get the same treatment "if
another workflow becomes the bottleneck." It has — and there is also a
second problem erigontech#21483 didn't address: **GitHub does not auto-remove the
failed PR from the merge queue**.

CI Gate run
[26573584442](https://github.com/erigontech/erigon/actions/runs/26573584442)
for PR erigontech#21374 demonstrated both gaps:

- `hive-eest / rlp, serial` failed at 14:29:49 from a transient Docker
Hub blip (`alpine:latest` manifest HEAD returned `unknown:` while
building `hive/hiveproxy`).
- hive-eest's `fail-fast` cancelled siblings in **1 second**; `hive /
test-hive` (still on `fail-fast: false`) kept dispatching matrix legs —
`engine, api, serial` and `engine, cancun, parallel` started at 14:36:01
/ 14:36:17 (~7 min *after* `gh run cancel`) and ran to `success`.
ci-gate couldn't reach a terminal state until those finished at
14:43:54, delaying eviction by **~14 minutes**.
- Even once ci-gate reported `conclusion: failure` at 14:44:05, GitHub
did **not** remove PR erigontech#21374 from the queue: the entry stayed at
position 2 with state `UNMERGEABLE`. The queue only advanced because PR
erigontech#21483 was manually `jump`ed over it.

## Changes

### 1. Roll out merge_group fail-fast to the remaining matrix workflows

Same gating as erigontech#21483 (`${{ github.event_name == 'merge_group' }}`),
applied to:

- `test-hive.yml`
- `test-all-erigon.yml`
- `test-all-erigon-race.yml`
- `test-eest-spec.yml`
- `test-bench.yml`
- `test-kurtosis-assertoor.yml`

Behaviour matches erigontech#21483: in `merge_group`, first failed shard cancels
its siblings at the GitHub API layer (no waiting for runner drain); in
`pull_request` / `schedule` / `workflow_dispatch`, all shards continue
so authors keep the full per-shard breakdown.

### 2. Auto-dequeue UNMERGEABLE PRs whose required check failed

New step at the end of `ci-gate.yml`'s ci-gate job:

```yaml
- name: Dequeue failed merge-queue PR
  if: failure() && github.event_name == 'merge_group'
  ...
```

The step:

1. **Inspects `needs.*.result` and skips when all are `cancelled` with
no `failure`.** That pattern is a queue reshuffle (a PR ahead of us
merged, our merge-group SHA is stale), where GitHub re-creates a new
merge_group event for us; dequeuing here would be wrong. Confirmed by
run
[26573568764](https://github.com/erigontech/erigon/actions/runs/26573568764),
where ci-gate's job conclusion was `failure` (needs cancelled → `Check
all required jobs` exits 1) but the *run* was cancelled by GitHub during
a reshuffle.
2. Parses the PR number from `gh-readonly-queue/<base>/pr-<N>-<sha>`
(handles multi-segment bases like `release/3.4`).
3. Resolves the PR number to a GraphQL node ID and calls the
`dequeuePullRequest` mutation. Soft-fails on errors (warning, not
non-zero exit) so a dequeue glitch never masks ci-gate's own failure
signal.
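
The decision and parsing logic in the steps above can be sketched in Go. This is only an illustration of the described behaviour, not the workflow's actual shell step: `shouldDequeue`, `parsePRNumber`, and the regular expression are assumptions, and "dequeue only when at least one needed job failed" is a simplification of the `needs.*.result` inspection.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// queueRef matches a merge-queue ref ending in pr-<N>-<sha>. Anchoring on
// the final path segment lets the base contain slashes (release/3.4).
var queueRef = regexp.MustCompile(`gh-readonly-queue/.+/pr-(\d+)-[0-9a-f]+$`)

// parsePRNumber extracts <N> from gh-readonly-queue/<base>/pr-<N>-<sha>.
func parsePRNumber(ref string) (int, bool) {
	m := queueRef.FindStringSubmatch(ref)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

// shouldDequeue reports whether any needed job actually failed. A set of
// results that is only cancelled (a queue reshuffle) does not qualify.
func shouldDequeue(results []string) bool {
	for _, r := range results {
		if r == "failure" {
			return true
		}
	}
	return false
}

func main() {
	// The sha below is a made-up placeholder.
	n, ok := parsePRNumber("refs/heads/gh-readonly-queue/release/3.4/pr-21374-0123abcd")
	fmt.Println(n, ok) // 21374 true

	fmt.Println(shouldDequeue([]string{"cancelled", "cancelled"}))          // false
	fmt.Println(shouldDequeue([]string{"success", "failure", "cancelled"})) // true
}
```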

Permissions bumped from `pull-requests: read` to `pull-requests: write`
for the mutation.

## Why both in one PR

Both target the same incident class (broken PR sits at the head of the
queue blocking everything else). The fail-fast change shrinks
time-to-fail for ci-gate from ~14 min to seconds; the dequeue actually
evicts the failed PR. Either alone is a partial fix — having both means
a broken PR's run goes red fast *and* the queue advances without anyone
needing to manually jump over it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request May 29, 2026
…igontech#21504)

## Why

Both merge-queue failures of erigontech#21374 were transient CI infrastructure
blips on network-dependent steps that had **no retry** — not problems
with the PR's code:

- **Docker Hub registry error** building hive's `hiveproxy` image: `Head
https://registry-1.docker.io/v2/library/alpine/manifests/latest:
unknown:` → `Tests: , Failed:` (zero tests ran).
- **github.com 403** during the in-builder erigon clone (`fatal: unable
to access 'https://github.com/erigontech/erigon/': 403`). That specific
path is already largely addressed by erigontech#21447 (build erigon locally
instead of cloning inside hive's builder), but the Docker Hub base-image
pulls remain in both `docker build` and `./hive`.

The merge-queue contract (per `CI-GUIDELINES.md`) is "a failure means
the code is wrong — zero false positives." These infra blips are exactly
the false positives that contract forbids, and they re-queue whole
batches.

## What

Add a small retry to the network-dependent steps in both `test-hive.yml`
and `test-hive-eest.yml`:

- **Build steps** (`docker build` of the local erigon image, `go get`,
`go build`): wrapped in a `retry()` helper — 3 attempts, linear backoff.
- **`./hive` run**: retried **only when too few tests were parsed**
(`tests < 4`) — the signature of a setup/image-build failure. A
completed run (`tests >= 4`) is judged on its first result and never
retried.
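
The two retry rules above are implemented in shell in the workflows; the following Go rendering is only a sketch of the same logic. The names `retry`, `runHive`, `errTransient`, and the backoff step are hypothetical, while the 3 attempts, linear backoff, and the `tests < 4` threshold come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a network-dependent build step failing.
var errTransient = errors.New("transient infra failure")

// minParsedTests is the threshold below which a hive run is treated as a
// setup failure and not as a real test result.
const minParsedTests = 4

// retry runs fn up to attempts times, sleeping step, 2*step, ... between
// tries (linear backoff), and returns the last error if all tries fail.
func retry(attempts int, step time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(time.Duration(i) * step)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

// runHive retries run only while too few tests were parsed. A run that
// parsed enough tests is judged on that result and never retried.
func runHive(attempts int, step time.Duration, run func() (int, bool)) (passed bool, used int) {
	for used = 1; ; used++ {
		tests, ok := run()
		if tests >= minParsedTests {
			return ok, used
		}
		if used >= attempts {
			return false, used
		}
		time.Sleep(time.Duration(used) * step)
	}
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>

	// A completed run with a genuine failure is not retried.
	ok, used := runHive(3, 10*time.Millisecond, func() (int, bool) { return 120, false })
	fmt.Println(ok, used) // false 1
}
```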

## Why this is safe for the merge queue

- A **genuine test failure is never retried** — only the *fast*
infra-setup failure path is, so the retry cannot mask a real regression.
- Because retries only trigger on the fast-fail (image build dying in
seconds), added latency is seconds + backoff, not multiplied test
runtime.
- This is step-level resilience, not reliance on merge-queue re-runs
(which `CI-GUIDELINES.md` explicitly discourages as a flake mask).

## Testing

- Verified the retry logic locally under `bash -e -o pipefail` (the
shell GitHub uses for `run:` steps): infra-fail-then-recover → passes
after retry; genuine failure → not retried, fails immediately;
persistent infra failure → retries to max then fails.
- `actionlint` clean — no new shellcheck findings (and removes one
pre-existing SC2181).
- `make lint` → 0 issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

5 participants