
cl/clstages: recover from panics in stage ActionFunc#21558

Merged
awskii merged 1 commit into main from yperbasis/caplin-clstage-panic-recover
Jun 2, 2026

@yperbasis

What

Add a recover() to the stage goroutine in clstages.StartWithStage so a panic in any Caplin stage's ActionFunc is converted into a stage error (logged at Error with a stack trace) instead of propagating out of the unrecovered goroutine and terminating the whole erigon process.

Why

Caplin stages process peer-controlled req/resp responses. A malformed response that trips a nil-pointer dereference — or any other panic — inside a stage currently crashes the entire node, because the goroutine in StartWithStage has no defer recover():

go func() {
    select {
    case errch <- currentStage.ActionFunc(ctx, lg, cfg, args):
    case <-ctx.Done():
        errch <- ctx.Err()
    }
}()

An unrecovered panic in this goroutine is a remote, unauthenticated crash (DoS) vector. This change is defense-in-depth that covers every current and future stage, not just whichever call site happens to be reachable today. It matches the recover() patterns already used elsewhere in Caplin (cl/sentinel/handlers, cl/phase1/network/gossip) and the stage loop's own stated intent — "caplin is designed to always be able to recover regardless of db state".

Change

// Recover so a panic in a stage (e.g. from a malformed peer response) degrades
// to a stage error instead of terminating the whole node.
defer func() {
    if r := recover(); r != nil {
        lg.Error("[Caplin] panic in clstage", "err", r, "stack", dbg.Stack())
        errch <- fmt.Errorf("panic in clstage: %v", r)
    }
}()

The recovered panic is surfaced to the stage's TransitionFunc as an error, so the stage loop continues exactly as it would for any other stage error.

Test

TestStartWithStageRecoversFromActionFuncPanic (added TDD-style, red then green) runs a stage whose ActionFunc panics. Without the recover, the test binary crashes with panic: kaboom (raised from the goroutine at clstages.go:50); with it, the panic is surfaced as a stage error and StartWithStage returns normally.

Verification

  • go test ./cl/clstages/
  • make lint ✅ (0 issues)
  • make erigon integration

🤖 Generated with Claude Code

@yperbasis yperbasis requested a review from domiwei as a code owner June 1, 2026 12:07
@yperbasis yperbasis added the Caplin Caplin: Consensus Layer, Beacon API label Jun 1, 2026
@yperbasis yperbasis requested a review from Copilot June 1, 2026 12:52

Copilot AI left a comment

Pull request overview

Adds panic recovery around Caplin stage ActionFunc execution so a panic in any stage is converted into an error (with stack trace logging) instead of crashing the entire Erigon process.

Changes:

  • Wrap stage ActionFunc execution in a defer recover() inside the stage goroutine and log the panic with a stack trace.
  • Convert the recovered panic into a stage error via the stage error channel so it flows through the normal stage error path.
  • Add a unit test covering the “panic in ActionFunc is recovered and surfaced as error” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| cl/clstages/clstages.go | Adds panic recovery + error conversion in the stage execution goroutine. |
| cl/clstages/clstages_test.go | Adds a regression test ensuring StartWithStage does not crash on ActionFunc panic. |

Comment thread cl/clstages/clstages.go
A panic in a Caplin stage's ActionFunc currently propagates out of the
unrecovered stage goroutine and terminates the whole erigon process.
Stages process peer-controlled req/resp responses, so a malformed
response that trips a nil dereference (or any other panic) is a remote
crash vector.

Recover in the stage goroutine so a panic degrades to a stage error
(logged at Error with a stack trace) and the stage loop continues,
matching Caplin's "always able to recover" design. This is
defense-in-depth that covers every current and future stage, not just
the call site that happens to panic today.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis force-pushed the yperbasis/caplin-clstage-panic-recover branch from cc46c1d to 28ddbf1 Compare June 1, 2026 13:42
@awskii awskii added this pull request to the merge queue Jun 2, 2026
Merged via the queue into main with commit 74fbe0d Jun 2, 2026
90 checks passed
@awskii awskii deleted the yperbasis/caplin-clstage-panic-recover branch June 2, 2026 14:06
Labels

Caplin Caplin: Consensus Layer, Beacon API

3 participants