stagedsync: add BAD_BLOCK_HALT env var, fix parallel STOP_AFTER_BLOCK#19803
stagedsync: add BAD_BLOCK_HALT env var, fix parallel STOP_AFTER_BLOCK#19803mh0lt wants to merge 3 commits into
Conversation
3a262b7 to
3d99a76
Compare
Add BAD_BLOCK_HALT env variable to halt the process on invalid blocks instead of retrying forever via the CL fork choice loop. In production the retry is correct (the CL may provide a corrected block), but during testing we want to stop on the first bad block for investigation. Also fix STOP_AFTER_BLOCK in the parallel path which returned an error that got retried forever. Both serial and parallel paths now use os.Exit(0) for a clean stop. The serial path previously used panic(). Wire badBlockHalt into the parallel ErrInvalidBlock handler in ExecV3 so it reports the bad header to the CL and returns the error, causing the stage to fail permanently rather than loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3d99a76 to
1ddc754
Compare
yperbasis
left a comment
There was a problem hiding this comment.
Review from Claude:
Key concerns
-
os.Exit() bypasses all cleanup — The main issue. Both STOP_AFTER_BLOCK and BAD_BLOCK_HALT use os.Exit() mid-execution with open MDBX transactions and potentially unflushed state. Defers won't run,
transactions won't roll back. Consider a sentinel error (e.g., ErrHaltRequested) that propagates cleanly, or at minimum flush critical state before exiting. -
Inconsistent badBlockHalt handling between serial and parallel paths:
- Serial (exec3_serial.go): returns the error cleanly — defers run ✓
- Parallel (exec3.go): calls os.Exit(1) — no cleanup ✗
The serial approach is safer and should be the model.
- Missing wiring sites — node/eth/backend.go:873 still hardcodes false for badBlockHalt. If the env var should be universal, this should use dbg.BadBlockHalt too.
Recommendation
Revise so the parallel BAD_BLOCK_HALT path returns the error like the serial path does, rather than calling os.Exit(1). If the CL retry loop swallows the error in the parallel case, document why and consider
context cancellation or a halt channel instead of os.Exit. Wire dbg.BadBlockHalt into backend.go for consistency.
There was a problem hiding this comment.
Pull request overview
Adds debug controls to stop staged sync early during testing and improves handling of invalid blocks, especially in the exec3 parallel path.
Changes:
- Introduce
BAD_BLOCK_HALTenv var and plumb it into senders/execution stage configs. - Change
STOP_AFTER_BLOCKbehavior to exit cleanly (vs panic / endlessly-retried error), including the parallel executor path. - Add parallel-path handling in
ExecV3forrules.ErrInvalidBlockto report bad PoS headers and optionally halt.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| execution/stagedsync/stageloop/stageloop.go | Wires dbg.BadBlockHalt into stage configurations for senders and execution. |
| execution/stagedsync/exec3_serial.go | Replaces panic() with os.Exit(0) when STOP_AFTER_BLOCK is reached. |
| execution/stagedsync/exec3_parallel.go | Replaces endlessly-retried “stop” error with os.Exit(0) when STOP_AFTER_BLOCK is reached. |
| execution/stagedsync/exec3.go | Adds parallel-path ErrInvalidBlock handling, PoS bad-header reporting, and BAD_BLOCK_HALT exit behavior. |
| common/dbg/experiments.go | Adds BadBlockHalt debug env flag (BAD_BLOCK_HALT). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| if execErr != nil && errors.Is(execErr, rules.ErrInvalidBlock) { | ||
| if lastHeader != nil { | ||
| if cfg.hd != nil && cfg.hd.POSSync() { | ||
| cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash) | ||
| } | ||
| } | ||
| if cfg.badBlockHalt { | ||
| logger.Error(fmt.Sprintf("[%s] BAD_BLOCK_HALT: halting on invalid block", execStage.LogPrefix()), "err", execErr) | ||
| os.Exit(1) | ||
| } | ||
| } |
There was a problem hiding this comment.
This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.
I'll add comentry and warnings to manage expectations.
| if dbg.StopAfterBlock > 0 && applyResult.BlockNum == dbg.StopAfterBlock { | ||
| return fmt.Errorf("stopping: block %d complete", applyResult.BlockNum) | ||
| pe.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", pe.logPrefix), "block", applyResult.BlockNum) | ||
| os.Exit(0) | ||
| } |
There was a problem hiding this comment.
This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.
I'll add comentry and warnings to manage expectations.
| if dbg.StopAfterBlock > 0 && blockNum == dbg.StopAfterBlock { | ||
| panic(fmt.Sprintf("stopping: block %d complete", blockNum)) | ||
| //return fmt.Errorf("stopping: block %d complete", blockNum) | ||
| se.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", se.logPrefix), "block", blockNum) | ||
| os.Exit(0) |
There was a problem hiding this comment.
This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.
I'll add comentry and warnings to manage expectations.
| if dbg.StopAfterBlock > 0 && applyResult.BlockNum == dbg.StopAfterBlock { | ||
| return fmt.Errorf("stopping: block %d complete", applyResult.BlockNum) | ||
| pe.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", pe.logPrefix), "block", applyResult.BlockNum) | ||
| os.Exit(0) | ||
| } |
There was a problem hiding this comment.
This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.
I'll add comentry and warnings to manage expectations.
| if lastHeader != nil { | ||
| if cfg.hd != nil && cfg.hd.POSSync() { | ||
| cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash) | ||
| } | ||
| } |
There was a problem hiding this comment.
This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.
I'll add comentry and warnings to manage expectations.
Serial path: badBlockHalt was returning the error instead of calling os.Exit(1). The returned error propagates to the stage loop which retries indefinitely, defeating the purpose of the halt flag. Fix by calling os.Exit(1) to match the parallel path behavior. All paths: upgrade STOP_AFTER_BLOCK log from Info to Warn and add explanatory comments clarifying that os.Exit without commit is intentional — the DB is left at the pre-block state so the condition can be reproduced on the next run. This addresses reviewer concerns about bypassing the normal transaction lifecycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
badBlockHalt=true has two distinct uses: 1. BAD_BLOCK_HALT env var (debug): exit without committing to preserve DB state for replay. Only makes sense for developer debugging. 2. Fork validation (NewInMemoryExecution): return error to caller so it can handle bad block detection (e.g. ReportBadHeaderPoS, hive tests). The previous change made all badBlockHalt=true paths call os.Exit(1), which crashed the node during hive tests when fork validation encountered an invalid block - hive expects an INVALID response, not a crash. Fix: only exit when both badBlockHalt=true AND dbg.BadBlockHalt (env var) is set. When badBlockHalt=true without the env var (fork validation path), return the error so the caller can handle it normally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yperbasis
left a comment
There was a problem hiding this comment.
Bug: lastHeader is always nil in the parallel error path
In exec3_parallel.go:459-461, when execErr is anything other than context.Canceled or ErrLoopExhausted, the function returns nil for lastHeader:
// exec3_parallel.go:459-461
if execErr != nil {
if !(errors.Is(execErr, context.Canceled) || errors.Is(execErr, &ErrLoopExhausted{})) {
return nil, rwTx, execErr // ← nil, not lastHeader
}
}
Even though lastHeader is correctly set inside the closure at line 256 (before the post-validator check at line 269 that wraps rules.ErrInvalidBlock), the return statement discards it.
So in the PR's new code in exec3.go:258-268:
if execErr != nil && errors.Is(execErr, rules.ErrInvalidBlock) {
if lastHeader != nil { // ← always false in parallel path!
cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash)
}
if cfg.badBlockHalt && dbg.BadBlockHalt {
os.Exit(1) // ← fires, but without reporting the bad header
}
}
ReportBadHeaderPoS will never execute for parallel execution. The os.Exit(1) still fires, but without first telling the header downloader which header is bad, which undermines the purpose.
Suggested fix: Either return lastHeader (not nil) from pe.exec() on ErrInvalidBlock, or follow the pattern used by handleIncorrectRootHashError (line 390-392) which plumbs applyResult.BlockHash and
applyResult.ParentHash directly through the error, not through the return value.
…pick #19803) Add BAD_BLOCK_HALT env variable: os.Exit(1) on invalid blocks instead of returning errors that the CL retries indefinitely. Without this, the forkchoice retry loop re-executes on dirty state producing different (wrong) errors. Fix STOP_AFTER_BLOCK: os.Exit(0) for clean debug halt without commit, allowing re-run from the same state. Wire badBlockHalt from dbg.BadBlockHalt in default and pipeline stages. Based on PR #19803. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…design) Address yperbasis review item 2 on PR #20805. The intentional os.Exit behaviour is kept (see PR #19803 for the design rationale: bypass deferred commit/flush so the DB is left exactly as it was before the triggering block, and the next run reproduces the stop point), but the implementation is brought into alignment across both paths. - exec3.go: hoist BAD_BLOCK_HALT check out of the if-parallel branch and into the post-branch ErrInvalidBlock handler so the env var triggers os.Exit uniformly for serial AND parallel. Previously the serial path silently ignored BAD_BLOCK_HALT — execErr just propagated up and got retried. - exec3_serial.go: replace the residual panic("stopping: ...") for STOP_AFTER_BLOCK with the same os.Exit(0) pattern the parallel path already uses. panic() unwinds defers and fires the very commit/flush we're trying to avoid, defeating the harness. - comments on both halt sites: spell out *why* os.Exit is the right call (preserve pre-block DB state, no commit, debug-only) and reference PR #19803. Both halts respect the cfg.badBlockHalt && dbg.BadBlockHalt two-flag gate, so fork validation (NewInMemoryExecution sets cfg.badBlockHalt without the env var) still gets the safe error-return behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Superseded by #20805 (commit b72aa7b), which incorporated all of this PR's changes — the |
Summary
Intentional design: exit without commit
Both `BAD_BLOCK_HALT` and `STOP_AFTER_BLOCK` call `os.Exit` without committing the current transaction. This is deliberate:
This bypasses the normal staged-sync transaction lifecycle (commit, flush, progress update) intentionally. These are debugging tools only and should not be set in production. In production (env vars unset), the normal behaviour is preserved: bad blocks trigger an unwind and the CL can provide a corrected block via fork choice, and STOP_AFTER_BLOCK is inert.
This also has the consequence of short-cutting the normal behaviour where the CL replays the BadBlock back to the EL wich ends up in an EL-CL loop this has the following unfortunate sideffects:
The effect of these flags are one error ro look at and the ability to restart and see the error repeated if it is detemanistic.
Test plan
🤖 Generated with Claude Code