fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs)#101
Merged
Conversation
…as false progress signal
## Bug
The HeartbeatTracker watchdog in monitor.ts compares the latest sidecar
heartbeat against a persisted "last seen" tracker to decide whether the
build run has stalled. The `moved` predicate was:
moved = sidecar.stateLastUpdatedAt !== tracker.lastSeenStateLastUpdatedAt
|| sidecar.drainProcessedCount !== tracker.lastSeenDrainProcessedCount
`stateLastUpdatedAt` is the write timestamp on the state file. Every
state rewrite touches it — including no-op rewrites that the auto-resume
path performs each time the orchestrator re-enters the loop. So during
any prolonged hang where the orchestrator keeps auto-resuming (e.g. a
subagent that returns without making progress and gets retried), the
stall clock resets on every cycle and the no-progress watchdog never
fires the USER_ACTION_REQUIRED escalation. Observed in the field as a
30+ minute hang with only heartbeat output, build_stall_threshold_ms
set to 30 min, and the user-action prompt never triggered.
## Fix
Drop `stateLastUpdatedAt` from the `moved` predicate. Add `phase` (the
sidecar's current phase index, which only advances on real forward
progress) as the new positive progress signal. The tracker retains
`drainProcessedCount` (also a real progress signal — only increments
when a drain item is actually processed). `lastSeenStateLastUpdatedAt`
is kept in the tracker file for backwards compatibility with pre-fix
on-disk tracker state, but the moved-check ignores it on read and the
write path preserves it for forward-compat readers.
Pre-fix tracker files without `lastSeenPhase` parse and load fine; the
moved-check is gated on `tracker.lastSeenPhase !== undefined` so the
old-tracker case falls through to "no positive signal observed", which
correctly arms the stall on the next tick.
## Regression tests
Three new tests in monitor.test.ts cover:
1. stateLastUpdatedAt-only advance → USER_ACTION_REQUIRED fires
(the original regression — auto-resume tracker-reset)
2. phase advance → MONITOR_REENTER + RUN_RUNNING (positive case)
3. pre-fix tracker without lastSeenPhase → still escalates correctly
(backwards-compat)
Full monitor.test.ts suite: 36 pass / 0 fail / 101 expect() calls.
## Investigation note: plan-review-loop STALEMATE
A second hypothesis ("plan-review-loop missing STALEMATE emission on
round-cap with synth ok:false") was investigated and ruled out. The
STALEMATE path already exists in plan-review-loop.ts at two sites
(synth_failure on the continue path and on the round-cap path), both
return finalResult(..., exitCode=1, ...), and cli.ts:11710-11722
properly throws ExitError(1) on the loopResult. End-to-end wiring is
correct. The observed 30+ minute hang was entirely attributable to
Bug #1 above (tracker reset on auto-resume).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (Bug A)
cli.ts's outer try/finally at the heartbeat boundary calls
\`process.exit(exitCode)\` in the finally block. exitCode is initialized to
1 at function entry and set to 0 on the success path BEFORE any failure
branches throw — so a \`throw new ExitError(N)\` after a successful prior
phase exits with whatever exitCode happens to be at that moment (typically
0, i.e. false success), NOT the thrown ExitError's code. The top-level
catch at \`main().catch(err => process.exit(err.code))\` never runs because
the process already died inside the finally.
This is the load-bearing structural bug in the plan-review-loop restart
storm: a synth_failure that throws ExitError(1) exits 0, the monitor sees
success-shaped exit + a "synth_failure" state status, auto-resumes the
crashed run, which hits the same failure, ad infinitum.
Fix: capture any ExitError thrown inside the outer try via a
\`pendingExitError\` variable. The finally block computes
\`process.exit(pendingExitError?.code ?? exitCode)\` so the thrown code
wins. Pattern is three small additions:
- \`let pendingExitError: ExitError | null = null;\` near \`let exitCode\`
- a new \`catch (err) { if (err instanceof ExitError) pendingExitError = err; throw err; }\` between the try body and finally
- \`process.exit(pendingExitError?.code ?? exitCode)\` at the bottom of finally
This is correct for every \`throw new ExitError(N)\` in cli.ts — Exit 3
STALEMATE, Exit 4 USER ABORT, Exit 130 SIGINT, Exit 1 synth_failure — not
just the one fault path that surfaced it. Worth landing on its own merits.
Test pins the structural fix via static-grep checks plus a JS semantics
sanity check that the pattern correctly preserves thrown codes through
finally. Behavioral coverage comes via plan-reviewer-loop.test.ts and the
new cli-synth-failure-resume.test.ts in subsequent commits.
See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug A.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…STALEMATE event (Bug D+E loop-side)
plan-review-loop.ts now distinguishes "synth fails at round cap" from
"synth fails mid-loop". The cap case becomes the new outcome
\`synth_failure_stalemate\` with exit code 3 (matching SKILL.md Step 5.5's
documented STALEMATE contract) and emits a stdout JSON event
\`{"event":"STALEMATE","reason":"synth_failure_at_round_cap",...}\` so
operators and monitors can detect convergence failure programmatically.
Mid-loop synth_failure is unchanged — keeps exit code 1, outcome
\`synth_failure\`, retry semantics intact for transient model/network
blips. cli.ts's retry-promotion logic (Bug B fix in the next commit) is
the load-bearing fix for repeated mid-loop synth_failure that creates
the restart storm.
NOTE: the cap-synth-failure path is largely defensive — in practice,
\`shouldBailAdaptive\` at line 380 routes \`round >= maxRounds\` through
the stalemate gate BEFORE the loop body's synth call fires. So the new
exit-3 branch is reached only when an alternative loop flow (e.g.,
\`--no-adaptive-cap\` plus continue-path bail-out) lands at the cap with
a failing synth. Keeping the defensive code anyway so future
\`shouldBailAdaptive\` changes don't regress the contract.
Tests added in plan-reviewer-loop.test.ts:
- Mid-loop synth_failure with maxRounds=5 still returns exit 1 + outcome
\`synth_failure\` (pins the no-false-positive invariant — the contract
change is scoped to the cap, not promoted to every synth failure).
See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug D+E.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…romote to STALEMATE (Bug B)
cli.ts's resume condition at line 11552 previously included
\`state.planReview.status === "synth_failure"\` as a retry trigger,
optimized for transient mid-loop crashes. Combined with Bug A (now fixed)
and the monitor's auto-resume on RUN_STALE, this created the restart
storm: the orchestrator re-ran the same failing loop on every resume,
ad infinitum.
Per the plan, do NOT auto-retry synth_failure on resume. Add a stalemate
guard BEFORE the existing retry condition:
if status === "synth_failure" OR "synth_failure_stalemate":
emit STALEMATE event to stdout
promote status to "synth_failure_stalemate" if needed
throw ExitError(3) — operator intervention required
The transient-retry case the old code optimized for is, from the operator's
perspective, the same shape as the restart-storm case — both result in a
crashed run that the orchestrator can't recover automatically. The safer
default is to halt with clear guidance:
1. Edit the plan to address objections, re-run.
2. Re-launch with --no-plan-review to bypass the gate.
3. Delete state.planReview to restart the loop fresh.
Also routes \`loopResult.outcome === "synth_failure_stalemate"\` (from the
cap path in plan-review-loop) to record \`status: "synth_failure_stalemate"\`
in the saved state, so the resume guard above recognizes it as definitely-
stalemate (not just first occurrence).
Tests pin the static wiring across five concerns:
- Stalemate guard appears BEFORE the loop-launch if
- Both "synth_failure" and "synth_failure_stalemate" statuses are guarded
- "synth_failure" removed from the resume-retry condition
- exit-3 handler routes synth_failure_stalemate outcome correctly
- STALEMATE event carries reason field for telemetry
See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug B.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er-run restart cap (Bug C component 2) Two new guards on the monitor's auto-resume site, complementing the heartbeat-tracker fix from commit bd10c04 (Bug C component 1): 1. **Stalemate refusal**: when \`state.planReview.status\` is \`synth_failure\` or \`synth_failure_stalemate\`, the monitor refuses auto-resume and emits a new \`RUN_HALTED_STALEMATE\` event instead. Prevents the storm cycle from starting even if cli.ts's stalemate guard (Bug B) were ever bypassed. 2. **Per-run restart cap**: counts auto-resume attempts in a sidecar file at \`<stateDir>/<stateSlug>.resume-count\`. Default cap 3, overridable via \`GSTACK_MONITOR_MAX_AUTO_RESUME\`. On reaching the cap, emits new \`RUN_HALTED_RESTART_CAP\` event. Defense in depth for any future failure class where auto-resume keeps spinning without converging. Both new events get exit code 11 (USER_ACTION_REQUIRED-tier) so wrapper scripts that key on operator-intervention events get the same signal without needing to learn new exit codes. Counter semantics: bumped only on actual spawn attempts (dry-run with \`spawnResume: false\` doesn't consume budget). The counter file naturally goes stale once a run completes; the next run on a different stateSlug gets a fresh counter. Failure to write the counter file is silent (best-effort) — the stalemate refusal guard above is the primary defense, and the cap is a secondary safety net. Tests cover: stalemate refusal for both synth_failure and synth_failure_stalemate statuses, cap hits after default 3 resumes, env var override raises the cap, RUN_RESUMED metadata surfaces resumeCount + maxResumes for observability, counter persists across monitor ticks. See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug C. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…case After the plan-review-loop restart-storm fix (commits f2be7bf → 6d0d9d3), Exit 3 STALEMATE has three sub-cases instead of two. Document the new synth_failure_stalemate path and the operator recovery options specific to it (override with --no-plan-review, edit plan + delete state.planReview to restart fresh, etc.). Also note the monitor's new restart-cap behavior (default 3, GSTACK_MONITOR_MAX_AUTO_RESUME) so operators reading Step 5.5 know what RUN_HALTED_STALEMATE and RUN_HALTED_RESTART_CAP events mean when they appear in the monitor log. Regenerated SKILL.md from the .tmpl change via gen:skill-docs. See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md SKILL.md task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the synth_failure restart storm — a 5-bug chain where a single plan-review-loop convergence failure produces infinite auto-resume cycles. Surfaced by fault investigation
MANUAL_INVESTIGATION:0:997a403b.md; full failure-chain analysis in the plan file.The investigation report identified 2 layers. Reading the actual run evidence revealed 5 interacting bugs. This PR fixes 4 of them; the 5th (heartbeat-tracker false-progress signal) was already fixed by `bd10c04c` and is an ancestor commit of this branch.
Failure chain:
Commits (5 new, all stacked on `bd10c04c`):
Test Coverage
Per-commit new tests:
Total: 13 new tests. All 71 targeted tests pass. Full `bun test` exit=0.
The cap-synth-failure path in `plan-review-loop.ts` is defensive — `shouldBailAdaptive` at line 380 routes `round >= maxRounds` through the stalemate gate before the loop body's synth call fires, so the new exit-3 branch is reached only via the continue-path after a bail-out (or if someone changes `shouldBailAdaptive`). Kept anyway as defense-in-depth; the no-false-positive test pins the contract scope.
Pre-Landing Review
No SQL, no LLM trust boundary, no user input handling. Atomic-write pattern (temp-file-rename via `process.pid`) mirrors the existing `heartbeatTrackerPath` helper, so concurrent monitor processes won't race. `pendingExitError` pattern is JavaScript-semantically guaranteed once the three pieces are wired (catch sets, finally reads). No issues found.
Plan Completion
All items from `~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md` DONE:
No items deferred. No items unverifiable.
Integration with `bd10c04c`
This branch stacks on `bd10c04c` (the heartbeat-tracker fix you'd already committed). The branch `fix/heartbeat-tracker-auto-resume-false-progress` becomes superseded once this lands — `bd10c04c` is an ancestor commit of this PR's HEAD, so the work lands as part of the merge with no duplicate commits or rebase needed.
Test plan
🤖 Generated with Claude Code