fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs) by anbangr · Pull Request #101 · anbangr/gstack

anbangr · 2026-05-26T02:27:30Z

Summary

Fixes the synth_failure restart storm — a 5-bug chain where a single plan-review-loop convergence failure produces infinite auto-resume cycles. Surfaced by fault investigation MANUAL_INVESTIGATION:0:997a403b.md; full failure-chain analysis in the plan file.

The investigation report identified 2 layers. Reading the actual run evidence revealed 5 interacting bugs. This PR fixes 4 of them; the 5th (heartbeat-tracker false-progress signal) was already fixed by `bd10c04c` and is an ancestor commit of this branch.

Failure chain:

Bug	Where	What
A	`cli.ts:13274`	`process.exit(exitCode)` in finally masks thrown ExitError codes. `exitCode` defaults to 1, gets set to 0 on success path before failure branches throw, so `ExitError(3)` after a successful prior phase exits 0 (false success). Top-level catch never runs.
B	`cli.ts:11552-11558`	`state.planReview.status === "synth_failure"` triggers retry on resume. Every resume re-runs the same failing loop.
C component 1	`monitor.ts` heartbeat tracker	Used `stateLastUpdatedAt` (a write timestamp that ticks on every state rewrite including `spawnResume`'s no-op rewrite) as a progress signal. Fixed in `bd10c04c` — ancestor commit, included here.
C component 2	`monitor.ts` auto-resume site	No stalemate guard, no per-run restart cap. Monitor sees `RUN_STALE` → `spawnResume` indefinitely.
D + E	`plan-review-loop.ts`	No stdout STALEMATE event; `synth_failure` uses Exit 1, contradicting SKILL.md Step 5.5's documented Exit 3 = STALEMATE contract.

Commits (5 new, all stacked on `bd10c04c`):

`f2be7bf3` Fix A — Honor thrown ExitError code in outer finally. Three-line `pendingExitError` pattern. Load-bearing — fixes every `ExitError` site in cli.ts, not just this fault.
`418b6efb` Fix D + E (loop-side) — New `synth_failure_stalemate` outcome + Exit 3 + stdout STALEMATE event when synth fails at the round cap. Mid-loop synth_failure unchanged (still exit 1, still retryable per Fix B's policy).
`47029c07` Fix B — Refuse auto-retry on resume; promote `synth_failure` → `synth_failure_stalemate` and throw `ExitError(3)` with operator-facing recovery guidance.
`6d0d9d3a` Fix C component 2 — Monitor refuses auto-resume when state shows stalemate; per-run restart cap (default 3, override via `GSTACK_MONITOR_MAX_AUTO_RESUME`). Two new monitor events: `RUN_HALTED_STALEMATE`, `RUN_HALTED_RESTART_CAP`.
`fb98ff2d` Docs — SKILL.md Step 5.5 documents the new sub-case + recovery options.

Test Coverage

Per-commit new tests:

Suite	Before	After	Δ	Coverage focus
`cli-finally-exit-code.test.ts` (NEW)	—	5	+5	Fix A wiring (pendingExitError declaration + catch + finally) + JS semantics sanity check
`plan-reviewer-loop.test.ts`	18	19	+1	Fix D+E no-false-positive (mid-loop synth_failure stays exit 1)
`cli-synth-failure-resume.test.ts` (NEW)	—	5	+5	Fix B wiring (guard placement, status removal from retry list, exit-3 routing, STALEMATE event reasons)
`monitor.test.ts`	36	42	+6	Fix C stalemate refusal (both statuses), restart cap, env override, counter persistence across ticks

Total: 13 new tests. All 71 targeted tests pass. Full `bun test` exit=0.

The cap-synth-failure path in `plan-review-loop.ts` is defensive — `shouldBailAdaptive` at line 380 routes `round >= maxRounds` through the stalemate gate before the loop body's synth call fires, so the new exit-3 branch is reached only via the continue-path after a bail-out (or if someone changes `shouldBailAdaptive`). Kept anyway as defense-in-depth; the no-false-positive test pins the contract scope.

Pre-Landing Review

No SQL, no LLM trust boundary, no user input handling. Atomic-write pattern (temp-file-rename via `process.pid`) mirrors the existing `heartbeatTrackerPath` helper, so concurrent monitor processes won't race. `pendingExitError` pattern is JavaScript-semantically guaranteed once the three pieces are wired (catch sets, finally reads). No issues found.

Plan Completion

All items from `~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md` DONE:

✅ Fix A (cli.ts finally) + test
✅ Fix B (cli.ts stalemate guard) + test
✅ Fix C component 2 (monitor restart cap) + tests
✅ Fix D + E (plan-review-loop) + test
✅ SKILL.md Step 5.5 docs

No items deferred. No items unverifiable.

Integration with `bd10c04c`

This branch stacks on `bd10c04c` (the heartbeat-tracker fix you'd already committed). The branch `fix/heartbeat-tracker-auto-resume-false-progress` becomes superseded once this lands — `bd10c04c` is an ancestor commit of this PR's HEAD, so the work lands as part of the merge with no duplicate commits or rebase needed.

Test plan

`bun test build/orchestrator/tests/{plan-reviewer-loop,monitor,cli-finally-exit-code,cli-synth-failure-resume}.test.ts` — 71/71 pass
Full `bun test` — exit=0, no regressions
Fork-local rule respected (no VERSION/CHANGELOG/package.json edits per `gstack/CLAUDE.md`)

🤖 Generated with Claude Code

…as false progress signal ## Bug The HeartbeatTracker watchdog in monitor.ts compares the latest sidecar heartbeat against a persisted "last seen" tracker to decide whether the build run has stalled. The `moved` predicate was: moved = sidecar.stateLastUpdatedAt !== tracker.lastSeenStateLastUpdatedAt || sidecar.drainProcessedCount !== tracker.lastSeenDrainProcessedCount `stateLastUpdatedAt` is the write timestamp on the state file. Every state rewrite touches it — including no-op rewrites that the auto-resume path performs each time the orchestrator re-enters the loop. So during any prolonged hang where the orchestrator keeps auto-resuming (e.g. a subagent that returns without making progress and gets retried), the stall clock resets on every cycle and the no-progress watchdog never fires the USER_ACTION_REQUIRED escalation. Observed in the field as a 30+ minute hang with only heartbeat output, build_stall_threshold_ms set to 30 min, and the user-action prompt never triggered. ## Fix Drop `stateLastUpdatedAt` from the `moved` predicate. Add `phase` (the sidecar's current phase index, which only advances on real forward progress) as the new positive progress signal. The tracker retains `drainProcessedCount` (also a real progress signal — only increments when a drain item is actually processed). `lastSeenStateLastUpdatedAt` is kept in the tracker file for backwards compatibility with pre-fix on-disk tracker state, but the moved-check ignores it on read and the write path preserves it for forward-compat readers. Pre-fix tracker files without `lastSeenPhase` parse and load fine; the moved-check is gated on `tracker.lastSeenPhase !== undefined` so the old-tracker case falls through to "no positive signal observed", which correctly arms the stall on the next tick. ## Regression tests Three new tests in monitor.test.ts cover: 1. stateLastUpdatedAt-only advance → USER_ACTION_REQUIRED fires (the original regression — auto-resume tracker-reset) 2. phase advance → MONITOR_REENTER + RUN_RUNNING (positive case) 3. pre-fix tracker without lastSeenPhase → still escalates correctly (backwards-compat) Full monitor.test.ts suite: 36 pass / 0 fail / 101 expect() calls. ## Investigation note: plan-review-loop STALEMATE A second hypothesis ("plan-review-loop missing STALEMATE emission on round-cap with synth ok:false") was investigated and ruled out. The STALEMATE path already exists in plan-review-loop.ts at two sites (synth_failure on the continue path and on the round-cap path), both return finalResult(..., exitCode=1, ...), and cli.ts:11710-11722 properly throws ExitError(1) on the loopResult. End-to-end wiring is correct. The observed 30+ minute hang was entirely attributable to Bug #1 above (tracker reset on auto-resume). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (Bug A) cli.ts's outer try/finally at the heartbeat boundary calls \`process.exit(exitCode)\` in the finally block. exitCode is initialized to 1 at function entry and set to 0 on the success path BEFORE any failure branches throw — so a \`throw new ExitError(N)\` after a successful prior phase exits with whatever exitCode happens to be at that moment (typically 0, i.e. false success), NOT the thrown ExitError's code. The top-level catch at \`main().catch(err => process.exit(err.code))\` never runs because the process already died inside the finally. This is the load-bearing structural bug in the plan-review-loop restart storm: a synth_failure that throws ExitError(1) exits 0, the monitor sees success-shaped exit + a "synth_failure" state status, auto-resumes the crashed run, which hits the same failure, ad infinitum. Fix: capture any ExitError thrown inside the outer try via a \`pendingExitError\` variable. The finally block computes \`process.exit(pendingExitError?.code ?? exitCode)\` so the thrown code wins. Pattern is three small additions: - \`let pendingExitError: ExitError | null = null;\` near \`let exitCode\` - a new \`catch (err) { if (err instanceof ExitError) pendingExitError = err; throw err; }\` between the try body and finally - \`process.exit(pendingExitError?.code ?? exitCode)\` at the bottom of finally This is correct for every \`throw new ExitError(N)\` in cli.ts — Exit 3 STALEMATE, Exit 4 USER ABORT, Exit 130 SIGINT, Exit 1 synth_failure — not just the one fault path that surfaced it. Worth landing on its own merits. Test pins the structural fix via static-grep checks plus a JS semantics sanity check that the pattern correctly preserves thrown codes through finally. Behavioral coverage comes via plan-reviewer-loop.test.ts and the new cli-synth-failure-resume.test.ts in subsequent commits. See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug A. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…STALEMATE event (Bug D+E loop-side) plan-review-loop.ts now distinguishes "synth fails at round cap" from "synth fails mid-loop". The cap case becomes the new outcome \`synth_failure_stalemate\` with exit code 3 (matching SKILL.md Step 5.5's documented STALEMATE contract) and emits a stdout JSON event \`{"event":"STALEMATE","reason":"synth_failure_at_round_cap",...}\` so operators and monitors can detect convergence failure programmatically. Mid-loop synth_failure is unchanged — keeps exit code 1, outcome \`synth_failure\`, retry semantics intact for transient model/network blips. cli.ts's retry-promotion logic (Bug B fix in the next commit) is the load-bearing fix for repeated mid-loop synth_failure that creates the restart storm. NOTE: the cap-synth-failure path is largely defensive — in practice, \`shouldBailAdaptive\` at line 380 routes \`round >= maxRounds\` through the stalemate gate BEFORE the loop body's synth call fires. So the new exit-3 branch is reached only when an alternative loop flow (e.g., \`--no-adaptive-cap\` plus continue-path bail-out) lands at the cap with a failing synth. Keeping the defensive code anyway so future \`shouldBailAdaptive\` changes don't regress the contract. Tests added in plan-reviewer-loop.test.ts: - Mid-loop synth_failure with maxRounds=5 still returns exit 1 + outcome \`synth_failure\` (pins the no-false-positive invariant — the contract change is scoped to the cap, not promoted to every synth failure). See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug D+E. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…romote to STALEMATE (Bug B) cli.ts's resume condition at line 11552 previously included \`state.planReview.status === "synth_failure"\` as a retry trigger, optimized for transient mid-loop crashes. Combined with Bug A (now fixed) and the monitor's auto-resume on RUN_STALE, this created the restart storm: the orchestrator re-ran the same failing loop on every resume, ad infinitum. Per the plan, do NOT auto-retry synth_failure on resume. Add a stalemate guard BEFORE the existing retry condition: if status === "synth_failure" OR "synth_failure_stalemate": emit STALEMATE event to stdout promote status to "synth_failure_stalemate" if needed throw ExitError(3) — operator intervention required The transient-retry case the old code optimized for is, from the operator's perspective, the same shape as the restart-storm case — both result in a crashed run that the orchestrator can't recover automatically. The safer default is to halt with clear guidance: 1. Edit the plan to address objections, re-run. 2. Re-launch with --no-plan-review to bypass the gate. 3. Delete state.planReview to restart the loop fresh. Also routes \`loopResult.outcome === "synth_failure_stalemate"\` (from the cap path in plan-review-loop) to record \`status: "synth_failure_stalemate"\` in the saved state, so the resume guard above recognizes it as definitely- stalemate (not just first occurrence). Tests pin the static wiring across five concerns: - Stalemate guard appears BEFORE the loop-launch if - Both "synth_failure" and "synth_failure_stalemate" statuses are guarded - "synth_failure" removed from the resume-retry condition - exit-3 handler routes synth_failure_stalemate outcome correctly - STALEMATE event carries reason field for telemetry See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug B. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er-run restart cap (Bug C component 2) Two new guards on the monitor's auto-resume site, complementing the heartbeat-tracker fix from commit bd10c04 (Bug C component 1): 1. **Stalemate refusal**: when \`state.planReview.status\` is \`synth_failure\` or \`synth_failure_stalemate\`, the monitor refuses auto-resume and emits a new \`RUN_HALTED_STALEMATE\` event instead. Prevents the storm cycle from starting even if cli.ts's stalemate guard (Bug B) were ever bypassed. 2. **Per-run restart cap**: counts auto-resume attempts in a sidecar file at \`<stateDir>/<stateSlug>.resume-count\`. Default cap 3, overridable via \`GSTACK_MONITOR_MAX_AUTO_RESUME\`. On reaching the cap, emits new \`RUN_HALTED_RESTART_CAP\` event. Defense in depth for any future failure class where auto-resume keeps spinning without converging. Both new events get exit code 11 (USER_ACTION_REQUIRED-tier) so wrapper scripts that key on operator-intervention events get the same signal without needing to learn new exit codes. Counter semantics: bumped only on actual spawn attempts (dry-run with \`spawnResume: false\` doesn't consume budget). The counter file naturally goes stale once a run completes; the next run on a different stateSlug gets a fresh counter. Failure to write the counter file is silent (best-effort) — the stalemate refusal guard above is the primary defense, and the cap is a secondary safety net. Tests cover: stalemate refusal for both synth_failure and synth_failure_stalemate statuses, cap hits after default 3 resumes, env var override raises the cap, RUN_RESUMED metadata surfaces resumeCount + maxResumes for observability, counter persists across monitor ticks. See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug C. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…case After the plan-review-loop restart-storm fix (commits f2be7bf → 6d0d9d3), Exit 3 STALEMATE has three sub-cases instead of two. Document the new synth_failure_stalemate path and the operator recovery options specific to it (override with --no-plan-review, edit plan + delete state.planReview to restart fresh, etc.). Also note the monitor's new restart-cap behavior (default 3, GSTACK_MONITOR_MAX_AUTO_RESUME) so operators reading Step 5.5 know what RUN_HALTED_STALEMATE and RUN_HALTED_RESTART_CAP events mean when they appear in the monitor log. Regenerated SKILL.md from the .tmpl change via gen:skill-docs. See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md SKILL.md task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anbangr and others added 6 commits May 26, 2026 09:43

anbangr merged commit dcd3d7b into main May 26, 2026

anbangr deleted the worktree-plan-review-loop-stalemate-restart-storm branch May 26, 2026 02:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs)#101

fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs)#101
anbangr merged 6 commits into
mainfrom
worktree-plan-review-loop-stalemate-restart-storm

anbangr commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

anbangr commented May 26, 2026

Summary

Test Coverage

Pre-Landing Review

Plan Completion

Integration with `bd10c04c`

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant