cl/clstages: recover from panics in stage ActionFunc#21558
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Adds panic recovery around Caplin stage ActionFunc execution so a panic in any stage is converted into an error (with stack trace logging) instead of crashing the entire Erigon process.
Changes:
- Wrap stage
ActionFuncexecution in adefer recover()inside the stage goroutine and log the panic with a stack trace. - Convert the recovered panic into a stage error via the stage error channel so it flows through the normal stage error path.
- Add a unit test covering the “panic in ActionFunc is recovered and surfaced as error” behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cl/clstages/clstages.go |
Adds panic recovery + error conversion in the stage execution goroutine. |
cl/clstages/clstages_test.go |
Adds a regression test ensuring StartWithStage does not crash on ActionFunc panic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
A panic in a Caplin stage's ActionFunc currently propagates out of the unrecovered stage goroutine and terminates the whole erigon process. Stages process peer-controlled req/resp responses, so a malformed response that trips a nil dereference (or any other panic) is a remote crash vector. Recover in the stage goroutine so a panic degrades to a stage error (logged at Error with a stack trace) and the stage loop continues, matching Caplin's "always able to recover" design. This is defense-in-depth that covers every current and future stage, not just the call site that happens to panic today. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cc46c1d to
28ddbf1
Compare
awskii
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Add a
recover()to the stage goroutine inclstages.StartWithStageso a panic in any Caplin stage'sActionFuncis converted into a stage error (logged atErrorwith a stack trace) instead of propagating out of the unrecovered goroutine and terminating the whole erigon process.Why
Caplin stages process peer-controlled req/resp responses. A malformed response that trips a nil-pointer dereference — or any other panic — inside a stage currently crashes the entire node, because the goroutine in
StartWithStagehas nodefer recover():An unrecovered panic in this goroutine is a remote, unauthenticated crash (DoS) vector. This change is defense-in-depth that covers every current and future stage, not just whichever call site happens to be reachable today. It matches the
recover()patterns already used elsewhere in Caplin (cl/sentinel/handlers,cl/phase1/network/gossip) and the stage loop's own stated intent — "caplin is designed to always be able to recover regardless of db state".Change
The recovered panic is surfaced to the stage's
TransitionFuncas an error, so the stage loop continues exactly as it would for any other stage error.Test
TestStartWithStageRecoversFromActionFuncPanic(added TDD red→green): a stage whoseActionFuncpanics. Without the recover the test binary crashes withpanic: kaboom(out of the goroutine atclstages.go:50); with it, the panic is surfaced as a stage error andStartWithStagereturns normally.Verification
go test ./cl/clstages/✅make lint✅ (0 issues)make erigon integration✅🤖 Generated with Claude Code