Skip to content

stagedsync: add BAD_BLOCK_HALT env var, fix parallel STOP_AFTER_BLOCK#19803

Closed
mh0lt wants to merge 3 commits into
mainfrom
fix/bad-block-halt-env
Closed

stagedsync: add BAD_BLOCK_HALT env var, fix parallel STOP_AFTER_BLOCK#19803
mh0lt wants to merge 3 commits into
mainfrom
fix/bad-block-halt-env

Conversation

@mh0lt

@mh0lt mh0lt commented Mar 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add `BAD_BLOCK_HALT` env variable to halt the process on invalid blocks instead of retrying via the CL fork choice loop
  • Fix `STOP_AFTER_BLOCK` in the parallel path — was returning an error that got retried forever, now uses `os.Exit(0)` for a clean stop
  • Fix `BAD_BLOCK_HALT` in the serial path — was returning an error that the stage loop would retry indefinitely, now uses `os.Exit(1)` to match the parallel path
  • Wire `badBlockHalt` into the parallel `ErrInvalidBlock` handler in `ExecV3` so it reports the bad header and fails permanently
  • Clean up the serial path's `STOP_AFTER_BLOCK` from `panic()` to `os.Exit(0)`

Intentional design: exit without commit

Both `BAD_BLOCK_HALT` and `STOP_AFTER_BLOCK` call `os.Exit` without committing the current transaction. This is deliberate:

  • The DB is left in the state it was in before the triggering block was applied
  • On the next run, execution resumes from the same point, allowing the bad block or stop condition to be reproduced and traced again
  • Any in-progress writes are discarded — there is no risk of persisting partial state

This bypasses the normal staged-sync transaction lifecycle (commit, flush, progress update) intentionally. These are debugging tools only and should not be set in production. In production (env vars unset), the normal behaviour is preserved: bad blocks trigger an unwind and the CL can provide a corrected block via fork choice, and STOP_AFTER_BLOCK is inert.

This also has the consequence of short-cutting the normal behaviour where the CL replays the BadBlock back to the EL wich ends up in an EL-CL loop this has the following unfortunate sideffects:

  1. Agentic process monitoring does not notice the process is broken unless its explicitly asked to look for loops - which cuases other unfortunate side effects
  2. As the log is polluted with many 1000's of repeated blocks debugging requires disambiguating them.

The effect of these flags are one error ro look at and the ability to restart and see the error repeated if it is detemanistic.

Test plan

  • Set `BAD_BLOCK_HALT=true` and run against a datadir with known corrupt state — verify process halts on first invalid block
  • Set `STOP_AFTER_BLOCK=N` with parallel execution — verify process exits cleanly at block N
  • Set `STOP_AFTER_BLOCK=N` with serial execution — verify process exits cleanly at block N
  • Set `BAD_BLOCK_HALT=true` with serial execution — verify process halts (previously returned error and retried)
  • Without env vars set, verify normal retry behaviour is unchanged

🤖 Generated with Claude Code

@mh0lt mh0lt requested a review from Giulio2002 as a code owner March 11, 2026 11:51
@mh0lt mh0lt force-pushed the fix/bad-block-halt-env branch 2 times, most recently from 3a262b7 to 3d99a76 Compare March 11, 2026 12:07
@mh0lt mh0lt marked this pull request as draft March 11, 2026 12:08
Add BAD_BLOCK_HALT env variable to halt the process on invalid blocks
instead of retrying forever via the CL fork choice loop. In production
the retry is correct (the CL may provide a corrected block), but during
testing we want to stop on the first bad block for investigation.

Also fix STOP_AFTER_BLOCK in the parallel path which returned an error
that got retried forever. Both serial and parallel paths now use os.Exit(0)
for a clean stop. The serial path previously used panic().

Wire badBlockHalt into the parallel ErrInvalidBlock handler in ExecV3
so it reports the bad header to the CL and returns the error, causing
the stage to fail permanently rather than loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mh0lt mh0lt force-pushed the fix/bad-block-halt-env branch from 3d99a76 to 1ddc754 Compare March 11, 2026 12:10
@mh0lt mh0lt marked this pull request as ready for review March 11, 2026 12:19
@yperbasis yperbasis requested a review from Copilot March 13, 2026 14:11

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review from Claude:

Key concerns

  1. os.Exit() bypasses all cleanup — The main issue. Both STOP_AFTER_BLOCK and BAD_BLOCK_HALT use os.Exit() mid-execution with open MDBX transactions and potentially unflushed state. Defers won't run,
    transactions won't roll back. Consider a sentinel error (e.g., ErrHaltRequested) that propagates cleanly, or at minimum flush critical state before exiting.

  2. Inconsistent badBlockHalt handling between serial and parallel paths:

  • Serial (exec3_serial.go): returns the error cleanly — defers run ✓
  • Parallel (exec3.go): calls os.Exit(1) — no cleanup ✗

The serial approach is safer and should be the model.

  1. Missing wiring sites — node/eth/backend.go:873 still hardcodes false for badBlockHalt. If the env var should be universal, this should use dbg.BadBlockHalt too.

Recommendation

Revise so the parallel BAD_BLOCK_HALT path returns the error like the serial path does, rather than calling os.Exit(1). If the CL retry loop swallows the error in the parallel case, document why and consider
context cancellation or a halt channel instead of os.Exit. Wire dbg.BadBlockHalt into backend.go for consistency.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds debug controls to stop staged sync early during testing and improves handling of invalid blocks, especially in the exec3 parallel path.

Changes:

  • Introduce BAD_BLOCK_HALT env var and plumb it into senders/execution stage configs.
  • Change STOP_AFTER_BLOCK behavior to exit cleanly (vs panic / endlessly-retried error), including the parallel executor path.
  • Add parallel-path handling in ExecV3 for rules.ErrInvalidBlock to report bad PoS headers and optionally halt.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
execution/stagedsync/stageloop/stageloop.go Wires dbg.BadBlockHalt into stage configurations for senders and execution.
execution/stagedsync/exec3_serial.go Replaces panic() with os.Exit(0) when STOP_AFTER_BLOCK is reached.
execution/stagedsync/exec3_parallel.go Replaces endlessly-retried “stop” error with os.Exit(0) when STOP_AFTER_BLOCK is reached.
execution/stagedsync/exec3.go Adds parallel-path ErrInvalidBlock handling, PoS bad-header reporting, and BAD_BLOCK_HALT exit behavior.
common/dbg/experiments.go Adds BadBlockHalt debug env flag (BAD_BLOCK_HALT).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +259 to +269
if execErr != nil && errors.Is(execErr, rules.ErrInvalidBlock) {
if lastHeader != nil {
if cfg.hd != nil && cfg.hd.POSSync() {
cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash)
}
}
if cfg.badBlockHalt {
logger.Error(fmt.Sprintf("[%s] BAD_BLOCK_HALT: halting on invalid block", execStage.LogPrefix()), "err", execErr)
os.Exit(1)
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.

I'll add comentry and warnings to manage expectations.

Comment on lines 395 to 398
if dbg.StopAfterBlock > 0 && applyResult.BlockNum == dbg.StopAfterBlock {
return fmt.Errorf("stopping: block %d complete", applyResult.BlockNum)
pe.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", pe.logPrefix), "block", applyResult.BlockNum)
os.Exit(0)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.

I'll add comentry and warnings to manage expectations.

Comment on lines 210 to +212
if dbg.StopAfterBlock > 0 && blockNum == dbg.StopAfterBlock {
panic(fmt.Sprintf("stopping: block %d complete", blockNum))
//return fmt.Errorf("stopping: block %d complete", blockNum)
se.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", se.logPrefix), "block", blockNum)
os.Exit(0)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.

I'll add comentry and warnings to manage expectations.

Comment on lines 395 to 398
if dbg.StopAfterBlock > 0 && applyResult.BlockNum == dbg.StopAfterBlock {
return fmt.Errorf("stopping: block %d complete", applyResult.BlockNum)
pe.logger.Info(fmt.Sprintf("[%s] STOP_AFTER_BLOCK reached, halting", pe.logPrefix), "block", applyResult.BlockNum)
os.Exit(0)
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.

I'll add comentry and warnings to manage expectations.

Comment on lines +260 to +264
if lastHeader != nil {
if cfg.hd != nil && cfg.hd.POSSync() {
cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash)
}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we don't want to commit changes. This is intended for debugging and we want the program to exit with no commit so that we can re-run and reproduce the error.

I'll add comentry and warnings to manage expectations.

mh0lt and others added 2 commits March 13, 2026 16:09
Serial path: badBlockHalt was returning the error instead of calling
os.Exit(1). The returned error propagates to the stage loop which
retries indefinitely, defeating the purpose of the halt flag. Fix by
calling os.Exit(1) to match the parallel path behavior.

All paths: upgrade STOP_AFTER_BLOCK log from Info to Warn and add
explanatory comments clarifying that os.Exit without commit is
intentional — the DB is left at the pre-block state so the condition
can be reproduced on the next run. This addresses reviewer concerns
about bypassing the normal transaction lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
badBlockHalt=true has two distinct uses:
1. BAD_BLOCK_HALT env var (debug): exit without committing to preserve DB
   state for replay. Only makes sense for developer debugging.
2. Fork validation (NewInMemoryExecution): return error to caller so it
   can handle bad block detection (e.g. ReportBadHeaderPoS, hive tests).

The previous change made all badBlockHalt=true paths call os.Exit(1),
which crashed the node during hive tests when fork validation encountered
an invalid block - hive expects an INVALID response, not a crash.

Fix: only exit when both badBlockHalt=true AND dbg.BadBlockHalt (env var)
is set. When badBlockHalt=true without the env var (fork validation path),
return the error so the caller can handle it normally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: lastHeader is always nil in the parallel error path

In exec3_parallel.go:459-461, when execErr is anything other than context.Canceled or ErrLoopExhausted, the function returns nil for lastHeader:

// exec3_parallel.go:459-461
if execErr != nil {
if !(errors.Is(execErr, context.Canceled) || errors.Is(execErr, &ErrLoopExhausted{})) {
return nil, rwTx, execErr // ← nil, not lastHeader
}
}

Even though lastHeader is correctly set inside the closure at line 256 (before the post-validator check at line 269 that wraps rules.ErrInvalidBlock), the return statement discards it.

So in the PR's new code in exec3.go:258-268:

if execErr != nil && errors.Is(execErr, rules.ErrInvalidBlock) {
if lastHeader != nil { // ← always false in parallel path!
cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash)
}
if cfg.badBlockHalt && dbg.BadBlockHalt {
os.Exit(1) // ← fires, but without reporting the bad header
}
}

ReportBadHeaderPoS will never execute for parallel execution. The os.Exit(1) still fires, but without first telling the header downloader which header is bad, which undermines the purpose.

Suggested fix: Either return lastHeader (not nil) from pe.exec() on ErrInvalidBlock, or follow the pattern used by handleIncorrectRootHashError (line 390-392) which plumbs applyResult.BlockHash and
applyResult.ParentHash directly through the error, not through the return value.

@yperbasis yperbasis added this to the 3.5.0 milestone Mar 17, 2026
mh0lt pushed a commit that referenced this pull request Mar 23, 2026
…pick #19803)

Add BAD_BLOCK_HALT env variable: os.Exit(1) on invalid blocks instead
of returning errors that the CL retries indefinitely. Without this,
the forkchoice retry loop re-executes on dirty state producing different
(wrong) errors.

Fix STOP_AFTER_BLOCK: os.Exit(0) for clean debug halt without commit,
allowing re-run from the same state.

Wire badBlockHalt from dbg.BadBlockHalt in default and pipeline stages.

Based on PR #19803.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mh0lt pushed a commit that referenced this pull request May 3, 2026
…design)

Address yperbasis review item 2 on PR #20805. The intentional os.Exit
behaviour is kept (see PR #19803 for the design rationale: bypass
deferred commit/flush so the DB is left exactly as it was before the
triggering block, and the next run reproduces the stop point), but the
implementation is brought into alignment across both paths.

- exec3.go: hoist BAD_BLOCK_HALT check out of the if-parallel branch
  and into the post-branch ErrInvalidBlock handler so the env var
  triggers os.Exit uniformly for serial AND parallel. Previously
  the serial path silently ignored BAD_BLOCK_HALT — execErr just
  propagated up and got retried.

- exec3_serial.go: replace the residual panic("stopping: ...") for
  STOP_AFTER_BLOCK with the same os.Exit(0) pattern the parallel
  path already uses. panic() unwinds defers and fires the very
  commit/flush we're trying to avoid, defeating the harness.

- comments on both halt sites: spell out *why* os.Exit is the right
  call (preserve pre-block DB state, no commit, debug-only) and
  reference PR #19803.

Both halts respect the cfg.badBlockHalt && dbg.BadBlockHalt two-flag
gate, so fork validation (NewInMemoryExecution sets cfg.badBlockHalt
without the env var) still gets the safe error-return behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mh0lt

mh0lt commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

Superseded by #20805 (commit b72aa7b), which incorporated all of this PR's changes — the BAD_BLOCK_HALT env var, the STOP_AFTER_BLOCK os.Exit fixes on both serial and parallel paths, and the stageloop wiring. #20805 consolidated the serial/parallel BAD_BLOCK_HALT handling into a single ExecV3-level handler (which explicitly references this PR's design in its comment). Closing as already merged via #20805.

@mh0lt mh0lt closed this May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants