stagedsync: add BAD_BLOCK_HALT env var, fix parallel STOP_AFTER_BLOCK by mh0lt · Pull Request #19803 · erigontech/erigon

mh0lt · 2026-03-11T11:51:24Z

Summary

Add `BAD_BLOCK_HALT` env variable to halt the process on invalid blocks instead of retrying via the CL fork choice loop
Fix `STOP_AFTER_BLOCK` in the parallel path — was returning an error that got retried forever, now uses `os.Exit(0)` for a clean stop
Fix `BAD_BLOCK_HALT` in the serial path — was returning an error that the stage loop would retry indefinitely, now uses `os.Exit(1)` to match the parallel path
Wire `badBlockHalt` into the parallel `ErrInvalidBlock` handler in `ExecV3` so it reports the bad header and fails permanently
Clean up the serial path's `STOP_AFTER_BLOCK` from `panic()` to `os.Exit(0)`

Intentional design: exit without commit

Both `BAD_BLOCK_HALT` and `STOP_AFTER_BLOCK` call `os.Exit` without committing the current transaction. This is deliberate:

The DB is left in the state it was in before the triggering block was applied
On the next run, execution resumes from the same point, allowing the bad block or stop condition to be reproduced and traced again
Any in-progress writes are discarded — there is no risk of persisting partial state

This bypasses the normal staged-sync transaction lifecycle (commit, flush, progress update) intentionally. These are debugging tools only and should not be set in production. In production (env vars unset), the normal behaviour is preserved: bad blocks trigger an unwind and the CL can provide a corrected block via fork choice, and STOP_AFTER_BLOCK is inert.
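
The exit-without-commit behaviour can be pictured with a toy model. This is only a sketch: `store` and `tx` are invented stand-ins for a transactional database, not MDBX or Erigon types, and returning `true` from `run` marks the point where the PR calls `os.Exit`.

```go
package main

import "fmt"

// store is a toy stand-in for the persisted DB state (hypothetical type).
type store struct{ progress uint64 }

// tx buffers a pending progress value until commit is called.
type tx struct {
	s       *store
	pending uint64
}

func (s *store) begin() *tx        { return &tx{s: s, pending: s.progress} }
func (t *tx) setProgress(n uint64) { t.pending = n }
func (t *tx) commit()              { t.s.progress = t.pending }

// run executes blocks progress+1..to, committing after each one.
// When it reaches stopAt it returns true WITHOUT committing, modelling
// the place where the process would exit mid-transaction.
func run(s *store, to, stopAt uint64) (halted bool) {
	for b := s.progress + 1; b <= to; b++ {
		t := s.begin()
		t.setProgress(b)
		if b == stopAt {
			return true // pending write for block b is discarded
		}
		t.commit()
	}
	return false
}

func main() {
	s := &store{}
	fmt.Println(run(s, 10, 5), s.progress) // first run halts at block 5
	fmt.Println(run(s, 10, 5), s.progress) // restart halts at the same block
}
```

In this model both runs halt at block 5 with persisted progress still at 4, which is the reproducibility property the design aims for.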

This also short-circuits the normal behaviour in which the CL replays the bad block back to the EL, which ends up in an EL-CL loop. That loop has the following unfortunate side effects:

Agentic process monitoring does not notice that the process is broken unless it is explicitly asked to look for loops, which causes other unfortunate side effects
The log is polluted with thousands of repeated blocks, so debugging requires disambiguating them

The effect of these flags is that there is a single error to look at, plus the ability to restart and see the error repeated if it is deterministic.

Test plan

Set `BAD_BLOCK_HALT=true` and run against a datadir with known corrupt state — verify process halts on first invalid block
Set `STOP_AFTER_BLOCK=N` with parallel execution — verify process exits cleanly at block N
Set `STOP_AFTER_BLOCK=N` with serial execution — verify process exits cleanly at block N
Set `BAD_BLOCK_HALT=true` with serial execution — verify process halts (previously returned error and retried)
Without env vars set, verify normal retry behaviour is unchanged

🤖 Generated with Claude Code

Add BAD_BLOCK_HALT env variable to halt the process on invalid blocks instead of retrying forever via the CL fork choice loop. In production the retry is correct (the CL may provide a corrected block), but during testing we want to stop on the first bad block for investigation.

Also fix STOP_AFTER_BLOCK in the parallel path which returned an error that got retried forever. Both serial and parallel paths now use os.Exit(0) for a clean stop. The serial path previously used panic().

Wire badBlockHalt into the parallel ErrInvalidBlock handler in ExecV3 so it reports the bad header to the CL and returns the error, causing the stage to fail permanently rather than loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yperbasis

Review from Claude:

Key concerns

os.Exit() bypasses all cleanup: the main issue. Both STOP_AFTER_BLOCK and BAD_BLOCK_HALT use os.Exit() mid-execution with open MDBX transactions and potentially unflushed state. Defers won't run, transactions won't roll back. Consider a sentinel error (e.g., ErrHaltRequested) that propagates cleanly, or at minimum flush critical state before exiting.
Inconsistent badBlockHalt handling between serial and parallel paths:

Serial (exec3_serial.go): returns the error cleanly — defers run ✓
Parallel (exec3.go): calls os.Exit(1) — no cleanup ✗

The serial approach is safer and should be the model.

Missing wiring sites — node/eth/backend.go:873 still hardcodes false for badBlockHalt. If the env var should be universal, this should use dbg.BadBlockHalt too.

Recommendation

Revise so the parallel BAD_BLOCK_HALT path returns the error like the serial path does, rather than calling os.Exit(1). If the CL retry loop swallows the error in the parallel case, document why and consider context cancellation or a halt channel instead of os.Exit. Wire dbg.BadBlockHalt into backend.go for consistency.

Copilot

Pull request overview

Adds debug controls to stop staged sync early during testing and improves handling of invalid blocks, especially in the exec3 parallel path.

Changes:

Introduce BAD_BLOCK_HALT env var and plumb it into senders/execution stage configs.
Change STOP_AFTER_BLOCK behavior to exit cleanly (vs panic / endlessly-retried error), including the parallel executor path.
Add parallel-path handling in ExecV3 for rules.ErrInvalidBlock to report bad PoS headers and optionally halt.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| execution/stagedsync/stageloop/stageloop.go | Wires `dbg.BadBlockHalt` into stage configurations for senders and execution. |
| execution/stagedsync/exec3_serial.go | Replaces `panic()` with `os.Exit(0)` when `STOP_AFTER_BLOCK` is reached. |
| execution/stagedsync/exec3_parallel.go | Replaces endlessly-retried “stop” error with `os.Exit(0)` when `STOP_AFTER_BLOCK` is reached. |
| execution/stagedsync/exec3.go | Adds parallel-path `ErrInvalidBlock` handling, PoS bad-header reporting, and `BAD_BLOCK_HALT` exit behavior. |
| common/dbg/experiments.go | Adds `BadBlockHalt` debug env flag (`BAD_BLOCK_HALT`). |


mh0lt · 2026-03-13T16:00:31Z

+		if execErr != nil && errors.Is(execErr, rules.ErrInvalidBlock) {
+			if lastHeader != nil {
+				if cfg.hd != nil && cfg.hd.POSSync() {
+					cfg.hd.ReportBadHeaderPoS(lastHeader.Hash(), lastHeader.ParentHash)
+				}
+			}
+			if cfg.badBlockHalt {
+				logger.Error(fmt.Sprintf("[%s] BAD_BLOCK_HALT: halting on invalid block", execStage.LogPrefix()), "err", execErr)
+				os.Exit(1)
+			}
+		}