Symptom
On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of 22-36 block unwinds every ~6 minutes. [4/6 Execution] Unwind Execution from=... to=... logs fire continuously.
Histogram of unwind sizes from one ~3h run (110 unwind events):
1 size=11
1 size=14
4 size=15
1 size=16
4 size=18
2 size=19
3 size=20
5 size=21
9 size=22
14 size=23 ← most common
3 size=24
5 size=25
4 size=26
4 size=27
8 size=28
1 size=30
2 size=31
4 size=32
3 size=33
4 size=34
...
many size=35-36
Root cause (diagnosed)
Caplin issues FCUs whose head block is 11+ minutes old (i.e., ~55 slots behind real chain tip). Execution races ahead of these stale FCUs while waiting for fresh ones, then when the FCU arrives, execution is forced to unwind to match.
Added a log inside updateForkChoice just before entering the unwind path:
[forkchoice] entering unwind path
fcuNum=24701224 fcuHash=0xb97c238...
canonHash=0xb97c238... ← matches FCU hash exactly
hashesDiffer=false
finishProgress=24701259 ← execution 35 blocks ahead
executionAhead=true
head updated hash=0xb97c238... number=24701224 age=11m30s execution=0s
Note age=11m30s and execution=0s: the FCU is for an 11.5-min-old block; the "head update" itself takes no execution time because everything's already been executed (in the overlay).
Reproducibility
Tested on:
Both show the same 22-36 block unwind cadence. Not introduced by #21270.
Impact
- CPU waste: each cycle, ~30 blocks of execution work get discarded (in-memory overlay) and re-done. Effective tip throughput is roughly 1/(N+1) of what it should be, where N is the average unwind size.
- db_size unaffected: the discarded state is in the MemoryMutation overlay only, never persisted to MDBX. Plateau holds.
- Logs noisy: every FCU at tip emits an unwind log; makes log forensics harder.
Open questions
- Why does Caplin's FCU lag chain tip by 11+ minutes on bloatnet specifically? Is it the
chain_tip_sync stage's gossip pacing, or a finality-delay artifact?
- Should the execution side cap
finishProgress to fcuHeader.Number to avoid racing ahead? Tradeoff: idle CPU vs. wasted re-execution.
- Are the FCUs Caplin sends actually for the current head it knows about, or is there a code path that intentionally regresses?
Repro
Run a bloatnet node on performance (or earlier), wait until execution reaches chain tip (~9-12 hours from genesis). The unwind pattern appears immediately at tip and persists indefinitely.
Symptom
On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of 22-36 block unwinds every ~6 minutes.
[4/6 Execution] Unwind Execution from=... to=...logs fire continuously.Histogram of unwind sizes from one ~3h run (110 unwind events):
Root cause (diagnosed)
Caplin issues FCUs whose head block is 11+ minutes old (i.e., ~55 slots behind real chain tip). Execution races ahead of these stale FCUs while waiting for fresh ones, then when the FCU arrives, execution is forced to unwind to match.
Added a log inside
updateForkChoicejust before entering the unwind path:Note
age=11m30sandexecution=0s: the FCU is for an 11.5-min-old block; the "head update" itself takes no execution time because everything's already been executed (in the overlay).Reproducibility
Tested on:
b1528ef41c(performance branch HEAD, contains execution/execmodule: apply runloop refactor with furious-prune fix #21270)f60776ef00(parent of execution/execmodule: apply runloop refactor with furious-prune fix #21270)Both show the same 22-36 block unwind cadence. Not introduced by #21270.
Impact
Open questions
chain_tip_syncstage's gossip pacing, or a finality-delay artifact?finishProgresstofcuHeader.Numberto avoid racing ahead? Tradeoff: idle CPU vs. wasted re-execution.Repro
Run a bloatnet node on
performance(or earlier), wait until execution reaches chain tip (~9-12 hours from genesis). The unwind pattern appears immediately at tip and persists indefinitely.