
execution/execmodule: apply runloop refactor with furious-prune fix #21245

Merged
sudeepdino008 merged 16 commits into main from runloop_refactor
May 20, 2026
@sudeepdino008 sudeepdino008 commented May 18, 2026

Delayed flip of initialCycle

Before: initialCycle := limitedBigJump, gated on LoopBlockLimit (default 5000) — small catchups and steady-state at-tip both produced initialCycle=false, so the system couldn't distinguish them. That's the bloat trigger on chains where block batch exec can still write more than a 2s-prune-per-block can clear.

Now: initialCycle := headNum > finishProgressBefore && headNum - finishProgressBefore > 16 (the headNum > guard avoids uint64 underflow when FCU goes back, e.g. ePBS on Glamsterdam). Block-count threshold of 16 trips on bloatnet's ~40-block bursts (with margin for heavier blocks) and stays out of the way for steady-state 1-block tip FCUs, small organic catchups, and 1-block backwards FCUs.
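
The predicate can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical helper name `isInitialCycle`; only the threshold of 16 and the shape of the comparison come from the description above.

```go
package main

import "fmt"

// smallBlockJumpThreshold is the block-count threshold described above.
const smallBlockJumpThreshold = 16

// isInitialCycle is a hypothetical helper wrapping the predicate.
// The headNum > finishProgressBefore guard runs first, so the unsigned
// subtraction is only evaluated when it cannot wrap around.
func isInitialCycle(headNum, finishProgressBefore uint64) bool {
	return headNum > finishProgressBefore &&
		headNum-finishProgressBefore > smallBlockJumpThreshold
}

func main() {
	fmt.Println(isInitialCycle(1040, 1000)) // 40-block burst: true
	fmt.Println(isInitialCycle(1001, 1000)) // steady-state 1-block FCU: false
	fmt.Println(isInitialCycle(1016, 1000)) // exactly 16 blocks: false
	fmt.Println(isInitialCycle(999, 1000))  // 1-block backwards FCU: false
}
```

Without the guard, the backwards case would compute `999 - 1000` on `uint64`, wrap to a very large value, and be misclassified as catchup.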

A TODO in the code marks the eventual rename of initialCycle to atTip (inverted polarity) across the stage/prune APIs.

Bloatnet: why this matters

bloatnet stabilises at ~38 GB on tip only when the flip to initialCycle=false is held back until execution actually reaches tip:

  • A 40-block batch (what --batchSize=100mb settles on) writes ~1.4 GB into MDBX.
  • At-tip prune budget (SecondsPerSlot/3 ≈ 2 s) removes less than one such batch.
  • Flipping too early — while bursts are still arriving — lets writes outrun prune and MDBX runs away.

The previous threshold (LoopBlockLimit=5000) never tripped on bloatnet's 40-block bursts. The new threshold of 16 engages the catchup prune budget for any burst the prune budget can't clear in one slot.

ProcessFrozenBlocks: callback-driven, with inline file building

  • Adopts RunLoopConfig (PruneFn + CommitCycle).
  • PruneFn: runs pe.sync.RunPrune on the loop tx.
  • CommitCycle: flush + ClearRam + commit, then kicks agg.BuildFilesInBackground. Behaviour change: PFB previously didn't trigger file building inline, so files only advanced after normal sync resumed. The cap follows CollateAndPrune (via the new Aggregator.MaxCollationTxNum() getter). On the last iter (!hasMore) it returns (nil, nil) to skip a wasted BeginTemporalRw.
  • Drops the post-RunLoop final flush — handled inline.
  • Outer defer tx.Rollback() converted to closure form to track tx reassignments across iterations (fixes a latent leak on ShouldBreak / mid-loop-error paths).
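
The difference between the two defer forms can be shown with a small sketch. `fakeTx`, `directDefer` and `closureDefer` are hypothetical stand-ins, not the executor's types; the point is only that `defer tx.Rollback()` binds its receiver when the defer statement runs, while a closure reads `tx` when the function exits.

```go
package main

import "fmt"

// fakeTx is a hypothetical stand-in for a database transaction.
type fakeTx struct {
	id         int
	rolledBack bool
}

func (t *fakeTx) Rollback() { t.rolledBack = true }

// directDefer evaluates the receiver at the defer statement, so the
// deferred call always targets tx #1, even after tx is reassigned.
func directDefer() (first, second *fakeTx) {
	tx := &fakeTx{id: 1}
	first = tx
	defer tx.Rollback()
	tx = &fakeTx{id: 2} // e.g. a fresh tx opened for the next iteration
	second = tx
	return first, second
}

// closureDefer reads tx when the closure runs, so it rolls back
// whichever transaction is current at function exit.
func closureDefer() (first, second *fakeTx) {
	tx := &fakeTx{id: 1}
	first = tx
	defer func() {
		if tx != nil {
			tx.Rollback()
		}
	}()
	tx = &fakeTx{id: 2}
	second = tx
	return first, second
}

func main() {
	a, b := directDefer()
	fmt.Println(a.id, a.rolledBack, b.id, b.rolledBack) // 1 true 2 false
	c, d := closureDefer()
	fmt.Println(c.id, c.rolledBack, d.id, d.rolledBack) // 1 false 2 true
}
```

In the direct form the second transaction is never rolled back, which is the kind of leak the closure form avoids on early-exit paths.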

updateForkChoice RunLoop: prune ↔ commit split

  • Splits the previous monolithic CommitCycle into PruneFn + CommitCycle. The split makes the prune tx and the flush tx independently owned, opening the door for a future "same RwTx for prune + flush" optimisation (one commit per cycle instead of two). Not taken in this PR — we keep CollateAndPruneIfNeeded (which owns its own RW tx); the structure is just ready for it.
  • PruneFn: closes roTx, then runs runForkchoicePrune with initialCycle=true hardcoded so the catchup prune budget runs against the in-flight bursts. Post-RunLoop prune still uses the real initialCycle.
  • CommitCycle: opens commitRwTx, flushes + ClearRam, commits, re-opens roTx + overlay.
  • Both early-return on !initialCycle (at tip → post-RunLoop path handles flush+commit+prune). In catchup, every iter — last one included — runs through both callbacks.

Plumbing

  • New RunLoopConfig.PruneFn replaces the in-loop pe.sync.RunPrune call.
  • Drops PruneTimeout and BeforeIteration (both unreferenced after the split).
  • db/state: adds Aggregator.MaxCollationTxNum() getter so callers can apply the same cap pattern as CollateAndPrune.
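
A rough sketch of how a callback-driven loop of this kind can be wired follows. Everything here is an assumption made for illustration: the `Tx` type, the field signatures, the `runLoop` and `demoConfig` helpers, and the prune-before-commit ordering are simplified guesses, not the actual `RunLoopConfig` API.

```go
package main

import "fmt"

// Tx is a hypothetical stand-in for a read-write transaction.
type Tx struct{ id int }

// RunLoopConfig is a simplified, hypothetical shape of the config.
type RunLoopConfig struct {
	// PruneFn runs once per iteration, in place of an in-loop prune call.
	PruneFn func() error
	// CommitCycle commits the iteration; returning (nil, nil) tells the
	// loop that no new transaction was opened.
	CommitCycle func(hasMore bool) (*Tx, error)
}

// runLoop executes a fixed number of iterations and records each step.
func runLoop(iterations int, cfg RunLoopConfig) ([]string, error) {
	var trace []string
	for i := 1; i <= iterations; i++ {
		hasMore := i < iterations
		trace = append(trace, fmt.Sprintf("exec %d", i))
		if err := cfg.PruneFn(); err != nil {
			return trace, err
		}
		trace = append(trace, "prune")
		tx, err := cfg.CommitCycle(hasMore)
		if err != nil {
			return trace, err
		}
		if tx == nil {
			trace = append(trace, "commit (no new tx)")
		} else {
			trace = append(trace, fmt.Sprintf("commit (new tx %d)", tx.id))
		}
	}
	return trace, nil
}

// demoConfig builds callbacks that skip opening a tx on the last iteration.
func demoConfig() RunLoopConfig {
	nextID := 0
	return RunLoopConfig{
		PruneFn: func() error { return nil },
		CommitCycle: func(hasMore bool) (*Tx, error) {
			if !hasMore {
				return nil, nil // last iteration: nothing left to execute
			}
			nextID++
			return &Tx{id: nextID}, nil
		},
	}
}

func main() {
	trace, err := runLoop(2, demoConfig())
	fmt.Println(trace, err)
}
```

Passing `hasMore` into the commit callback lets the caller decide whether opening another transaction is worthwhile, which is how the final iteration can avoid a wasted begin.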

@sudeepdino008 sudeepdino008 marked this pull request as draft May 18, 2026 06:06
sudeepdino008 added a commit that referenced this pull request May 18, 2026
Applies the intent of PR #21245 onto rev_up_check2 (off performance
branch, so a direct cherry-pick wasn't clean).

RunLoop refactor:
- New PruneFn callback alongside CommitCycleFn
- CommitCycleFn now takes hasMore so impl can skip BeginTemporalRw on
  the final iter (PFB no longer needs the post-loop flush+commit)
- Dropped BeforeIteration, PruneTimeout, FirstCycle from RunLoopConfig
- RunLoop always invokes CommitCycle; caller returns (nil,nil) to skip

ProcessFrozenBlocks: PruneFn wraps pe.sync.RunPrune; CommitCycle kicks
agg.BuildFilesInBackground after each commit so seg-build progresses
alongside PFB.

updateForkChoice:
- initialCycle = !isSynced (was limitedBigJump) — prune budget tracks
  actual sync state, not just LoopBlockLimit chunking
- Tip case (!initialCycle): PruneFn and CommitCycle both no-op; the
  post-RunLoop runForkchoiceFlushCommit + runForkchoicePrune handle
  the single block
- Catchup case: PruneFn forces initialCycle=false to runForkchoicePrune
  so it always uses furious budget regardless of in-loop initialCycle

Skipped from upstream PR: aggregator.go's MaxCollationTxNum getter +
its optional cap in BuildFilesInBackground — the underlying
SetMaxCollationTxNum / maxCollationTxNum atomic field aren't on
performance, so BuildFilesInBackground is called uncapped. Functionally
neutral; just slightly less collation throttle control.
sudeepdino008 added a commit that referenced this pull request May 19, 2026

Copilot AI left a comment

Pull request overview

This PR refactors PipelineExecutor.RunLoop to separate pruning from the commit/refresh cycle, and adjusts forkchoice execution to better distinguish “at tip” vs “catchup burst” behavior (to avoid prune-budget bloat during short bursts). It also updates ProcessFrozenBlocks to use the new callback-based RunLoop and to trigger state-file building inline, plus adds an Aggregator getter to support consistent collation caps.

Changes:

  • Split RunLoop’s “prune + commit” logic into PruneFn and CommitCycle, and update ProcessFrozenBlocks/forkchoice to use the new structure.
  • Change forkchoice’s initialCycle computation to trigger catchup behavior for moderate bursts (threshold = 16 blocks) instead of relying on large loop limits.
  • Add Aggregator.MaxCollationTxNum() so callers can cap background file building consistently with collation/prune logic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| execution/execmodule/forkchoice.go | Updates forkchoice RunLoop to use new PruneFn + CommitCycle split and changes initialCycle detection logic. |
| execution/execmodule/executor.go | Refactors RunLoop API to callback-based prune/commit, updates ProcessFrozenBlocks accordingly, and adjusts tx rollback handling. |
| db/state/aggregator.go | Adds MaxCollationTxNum() getter for collation cap visibility. |

@sudeepdino008 sudeepdino008 changed the title from "execution/execmodule: split RunLoop prune from commit-cycle" to "execution/execmodule: apply runloop refactor with furious-prune fix" May 19, 2026
@sudeepdino008 sudeepdino008 marked this pull request as ready for review May 19, 2026 10:14
@sudeepdino008 sudeepdino008 changed the title from "execution/execmodule: apply runloop refactor with furious-prune fix" to "[wip] execution/execmodule: apply runloop refactor with furious-prune fix" May 19, 2026
@sudeepdino008 sudeepdino008 changed the title from "[wip] execution/execmodule: apply runloop refactor with furious-prune fix" to "execution/execmodule: apply runloop refactor with furious-prune fix" May 20, 2026
@sudeepdino008 sudeepdino008 added this pull request to the merge queue May 20, 2026
Merged via the queue into main with commit 0533658 May 20, 2026
129 checks passed
@sudeepdino008 sudeepdino008 deleted the runloop_refactor branch May 20, 2026 02:37
AskAlexSharov pushed a commit that referenced this pull request May 20, 2026
…21270)

Manual reapply of #21245 (runloop refactor: split prune from
commit-cycle) onto the `performance` branch, plus follow-up fixes for
the post-frozen-blocks prune-budget regression observed on bloatnet.

## Changes

- **Split `CommitCycleFn` from `PruneFn`** in `RunLoopConfig`; drop
`BeforeIteration`/`PruneTimeout`/`FirstCycle`.
- **`ProcessFrozenBlocks`**: kicks `agg.BuildFilesInBackground` from the
new `CommitCycle` callback so snapshot files advance during PFB
(previously they only progressed after normal sync resumed). On the last
iter (`!hasMore`) returns `(nil, nil)` to skip a wasted
`BeginTemporalRw`. Outer `defer tx.Rollback()` is closure-form so it
follows the closure's `tx` reassignments across iterations.
- **`updateForkChoice`**: `PruneFn` is a no-op at tip (post-RunLoop path
handles flush+commit+prune); in catchup it drains pipeline prune via
`runForkchoicePrune(true)` — `initialCycle=true` so
`PruneExecutionStage` gets the catchup budget instead of the 2 s slot
budget that can't keep up with 40-block bursts.
- **`initialCycle` predicate** — block-count delta against
`finishProgressBefore`:
  ```go
  const smallBlockJumpThreshold = 16
  headNum := fcuHeader.Number.Uint64()
  initialCycle := headNum > finishProgressBefore &&
  	headNum-finishProgressBefore > smallBlockJumpThreshold
  ```
The `headNum >` guard avoids `uint64` underflow when an FCU goes back
(e.g. ePBS on Glamsterdam). Threshold 16 engages catchup mode for
bloatnet's ~40-block bursts and stays out of the way for steady-state
1-block tip FCUs and test fixtures.
- **`runForkchoicePrune`**: short-chain skip-gate removed (`maxTxNum <
(stepSize*5)/4`). The gate left `ChangeSets3` un-pruned on disk for
short chains, which broke `MaxReorgDepth` enforcement in tests
(`TestFcuReturnsReorgTooDeepCode38006` on the upstream `main` branch).
Skip removal is also consistent with the `exec3/storage-component`
direction (`c380b438e7`).
- **FCU `CommitCycle` safety**: `defer commitRwTx.Rollback()`
immediately after `BeginTemporalRw` so the Commit-error path doesn't
leak the RW tx. Rollback after a successful Commit is idempotent (per
Copilot review on #21245).
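
The rollback-after-begin pattern from the last bullet can be sketched as follows. `rwTx` and `commitCycle` are hypothetical; the sketch assumes, as the bullet states, that Rollback after a successful Commit is a no-op.

```go
package main

import (
	"errors"
	"fmt"
)

// rwTx is a hypothetical read-write transaction whose Rollback does
// nothing once the transaction has already been finished.
type rwTx struct {
	failCommit bool
	done       bool
	rollbacks  int // rollbacks that actually released the tx
}

func (t *rwTx) Commit() error {
	if t.failCommit {
		return errors.New("commit failed")
	}
	t.done = true
	return nil
}

func (t *rwTx) Rollback() {
	if t.done {
		return // already committed or rolled back
	}
	t.done = true
	t.rollbacks++
}

// commitCycle registers the rollback right after the (simulated) begin,
// so every exit path, including a failed Commit, releases the tx.
func commitCycle(tx *rwTx) error {
	defer tx.Rollback()
	// flush work would go here
	return tx.Commit()
}

func main() {
	ok := &rwTx{}
	fmt.Println(commitCycle(ok), ok.rollbacks) // <nil> 0

	bad := &rwTx{failCommit: true}
	fmt.Println(commitCycle(bad), bad.rollbacks) // commit failed 1
}
```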

## Why the predicate change matters — bloatnet

bloatnet stabilises at ~38 GB on tip only when the flip to `initialCycle=false`
is held back until execution actually reaches tip:

- A 40-block batch (what `--batchSize=100mb` settles on) writes ~1.4 GB
into MDBX.
- At-tip prune budget (`SecondsPerSlot/3` ≈ 2 s) removes less than one
such batch.
- Flipping too early — while bursts are still arriving — lets writes
outrun prune and MDBX runs away.

The earlier `!isSynced` and `wall-clock head-age` variants either
flipped too eagerly (Caplin keeps headers/finish aligned during bursts)
or never (test fixtures use ancient timestamps, regressing several
rpc/jsonrpc + engineapi tests). Block-count delta with threshold 16 is a
clean proxy and works for tests, mainnet tip, and bloatnet bursts.

| Cycle | Agg | Prune | initialCycle | db_size after |
|---|---|---|---|---|
| 1 | 2m12s | 39.3s | true | 36.29 GB |
| 2 | 2m23s | 34.6s | true | 36.57 GB |
| 3 | 2m24s | 38.6s | true | 36.76 GB |

Plateau holds across cycles vs. the prior failure mode where the file
extended +6 GB/cycle once the predicate flipped early.

## Remaining differences vs #21245 (intentional)

- `Aggregator.MaxCollationTxNum()` getter and the
`BuildFilesInBackground` cap pattern from #21245 are NOT ported here —
the underlying `maxCollationTxNum` field doesn't exist on the
`performance` branch lineage. Followup, requires also adding the field
on this branch.
- `runForkchoicePrune` body uses `e.db.UpdateTemporal(...)` directly
(matches this branch's storage design — `agg.CollateAndPruneIfNeeded`
ownership is being moved out of the FCU path on the
`exec3/storage-component` track), whereas #21245 still calls
`CollateAndPruneIfNeeded` via this function.

## Test plan

- [ ] CI green on rerun (race-tests / tests-mac-linux all OSes; sonar).
- [x] Bloatnet → chain-tip; db_size stays bounded.
- [ ] Once at tip, confirm block-count delta predicate flips
`initialCycle=false` for steady-state 1-block FCUs.
- [ ] Sanity-check mainnet behaviour (FCU bursts ≤16 blocks at tip).