Skip to content

FCU stale head causes repeated execution unwinds at tip #21301

@sudeepdino008

Description

@sudeepdino008

Symptom

On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of 22-36 block unwinds every ~6 minutes. [4/6 Execution] Unwind Execution from=... to=... logs fire continuously.

Histogram of unwind sizes from one ~3h run (110 unwind events):

   1 size=11
   1 size=14
   4 size=15
   1 size=16
   4 size=18
   2 size=19
   3 size=20
   5 size=21
   9 size=22
  14 size=23     ← most common
   3 size=24
   5 size=25
   4 size=26
   4 size=27
   8 size=28
   1 size=30
   2 size=31
   4 size=32
   3 size=33
   4 size=34
  ...
  many size=35-36

Root cause (diagnosed)

Caplin issues FCUs whose head block is 11+ minutes old (i.e., ~55 slots behind real chain tip). Execution races ahead of these stale FCUs while waiting for fresh ones, then when the FCU arrives, execution is forced to unwind to match.

Added a log inside updateForkChoice just before entering the unwind path:

[forkchoice] entering unwind path
  fcuNum=24701224  fcuHash=0xb97c238...
  canonHash=0xb97c238...   ← matches FCU hash exactly
  hashesDiffer=false
  finishProgress=24701259  ← execution 35 blocks ahead
  executionAhead=true
head updated hash=0xb97c238... number=24701224 age=11m30s execution=0s

Note age=11m30s and execution=0s: the FCU is for an 11.5-min-old block; the "head update" itself takes no execution time because everything's already been executed (in the overlay).

Reproducibility

Tested on:

Both show the same 22-36 block unwind cadence. Not introduced by #21270.

Impact

  • CPU waste: each cycle, ~30 blocks of execution work get discarded (in-memory overlay) and re-done. Effective tip throughput is roughly 1/(N+1) of what it should be, where N is the average unwind size.
  • db_size unaffected: the discarded state is in the MemoryMutation overlay only, never persisted to MDBX. Plateau holds.
  • Logs noisy: every FCU at tip emits an unwind log; makes log forensics harder.

Open questions

  1. Why does Caplin's FCU lag chain tip by 11+ minutes on bloatnet specifically? Is it the chain_tip_sync stage's gossip pacing, or a finality-delay artifact?
  2. Should the execution side cap finishProgress to fcuHeader.Number to avoid racing ahead? Tradeoff: idle CPU vs. wasted re-execution.
  3. Are the FCUs Caplin sends actually for the current head it knows about, or is there a code path that intentionally regresses?

Repro

Run a bloatnet node on performance (or earlier), wait until execution reaches chain tip (~9-12 hours from genesis). The unwind pattern appears immediately at tip and persists indefinitely.

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions