Skip to content

fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs)#101

Merged
anbangr merged 6 commits into
mainfrom
worktree-plan-review-loop-stalemate-restart-storm
May 26, 2026
Merged

fix(build/orchestrator): plan-review-loop synth_failure restart-storm (4 layered bugs + docs)#101
anbangr merged 6 commits into
mainfrom
worktree-plan-review-loop-stalemate-restart-storm

Conversation

@anbangr

@anbangr anbangr commented May 26, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes the synth_failure restart storm — a 5-bug chain where a single plan-review-loop convergence failure produces infinite auto-resume cycles. Surfaced by fault investigation MANUAL_INVESTIGATION:0:997a403b.md; full failure-chain analysis in the plan file.

The investigation report identified 2 layers. Reading the actual run evidence revealed 5 interacting bugs. This PR fixes 4 of them; the 5th (heartbeat-tracker false-progress signal) was already fixed by `bd10c04c` and is an ancestor commit of this branch.

Failure chain:

Bug Where What
A `cli.ts:13274` `process.exit(exitCode)` in finally masks thrown ExitError codes. `exitCode` defaults to 1, gets set to 0 on success path before failure branches throw, so `ExitError(3)` after a successful prior phase exits 0 (false success). Top-level catch never runs.
B `cli.ts:11552-11558` `state.planReview.status === "synth_failure"` triggers retry on resume. Every resume re-runs the same failing loop.
C component 1 `monitor.ts` heartbeat tracker Used `stateLastUpdatedAt` (a write timestamp that ticks on every state rewrite including `spawnResume`'s no-op rewrite) as a progress signal. Fixed in `bd10c04c` — ancestor commit, included here.
C component 2 `monitor.ts` auto-resume site No stalemate guard, no per-run restart cap. Monitor sees `RUN_STALE` → `spawnResume` indefinitely.
D + E `plan-review-loop.ts` No stdout STALEMATE event; `synth_failure` uses Exit 1, contradicting SKILL.md Step 5.5's documented Exit 3 = STALEMATE contract.

Commits (5 new, all stacked on `bd10c04c`):

  • `f2be7bf3` Fix A — Honor thrown ExitError code in outer finally. Three-line `pendingExitError` pattern. Load-bearing — fixes every `ExitError` site in cli.ts, not just this fault.
  • `418b6efb` Fix D + E (loop-side) — New `synth_failure_stalemate` outcome + Exit 3 + stdout STALEMATE event when synth fails at the round cap. Mid-loop synth_failure unchanged (still exit 1, still retryable per Fix B's policy).
  • `47029c07` Fix B — Refuse auto-retry on resume; promote `synth_failure` → `synth_failure_stalemate` and throw `ExitError(3)` with operator-facing recovery guidance.
  • `6d0d9d3a` Fix C component 2 — Monitor refuses auto-resume when state shows stalemate; per-run restart cap (default 3, override via `GSTACK_MONITOR_MAX_AUTO_RESUME`). Two new monitor events: `RUN_HALTED_STALEMATE`, `RUN_HALTED_RESTART_CAP`.
  • `fb98ff2d` Docs — SKILL.md Step 5.5 documents the new sub-case + recovery options.

Test Coverage

Per-commit new tests:

Suite Before After Δ Coverage focus
`cli-finally-exit-code.test.ts` (NEW) 5 +5 Fix A wiring (pendingExitError declaration + catch + finally) + JS semantics sanity check
`plan-reviewer-loop.test.ts` 18 19 +1 Fix D+E no-false-positive (mid-loop synth_failure stays exit 1)
`cli-synth-failure-resume.test.ts` (NEW) 5 +5 Fix B wiring (guard placement, status removal from retry list, exit-3 routing, STALEMATE event reasons)
`monitor.test.ts` 36 42 +6 Fix C stalemate refusal (both statuses), restart cap, env override, counter persistence across ticks

Total: 13 new tests. All 71 targeted tests pass. Full `bun test` exit=0.

The cap-synth-failure path in `plan-review-loop.ts` is defensive — `shouldBailAdaptive` at line 380 routes `round >= maxRounds` through the stalemate gate before the loop body's synth call fires, so the new exit-3 branch is reached only via the continue-path after a bail-out (or if someone changes `shouldBailAdaptive`). Kept anyway as defense-in-depth; the no-false-positive test pins the contract scope.

Pre-Landing Review

No SQL, no LLM trust boundary, no user input handling. Atomic-write pattern (temp-file-rename via `process.pid`) mirrors the existing `heartbeatTrackerPath` helper, so concurrent monitor processes won't race. `pendingExitError` pattern is JavaScript-semantically guaranteed once the three pieces are wired (catch sets, finally reads). No issues found.

Plan Completion

All items from `~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md` DONE:

  • ✅ Fix A (cli.ts finally) + test
  • ✅ Fix B (cli.ts stalemate guard) + test
  • ✅ Fix C component 2 (monitor restart cap) + tests
  • ✅ Fix D + E (plan-review-loop) + test
  • ✅ SKILL.md Step 5.5 docs

No items deferred. No items unverifiable.

Integration with `bd10c04c`

This branch stacks on `bd10c04c` (the heartbeat-tracker fix you'd already committed). The branch `fix/heartbeat-tracker-auto-resume-false-progress` becomes superseded once this lands — `bd10c04c` is an ancestor commit of this PR's HEAD, so the work lands as part of the merge with no duplicate commits or rebase needed.

Test plan

  • `bun test build/orchestrator/tests/{plan-reviewer-loop,monitor,cli-finally-exit-code,cli-synth-failure-resume}.test.ts` — 71/71 pass
  • Full `bun test` — exit=0, no regressions
  • Fork-local rule respected (no VERSION/CHANGELOG/package.json edits per `gstack/CLAUDE.md`)

🤖 Generated with Claude Code

anbangr and others added 6 commits May 26, 2026 09:43
…as false progress signal

## Bug

The HeartbeatTracker watchdog in monitor.ts compares the latest sidecar
heartbeat against a persisted "last seen" tracker to decide whether the
build run has stalled. The `moved` predicate was:

    moved = sidecar.stateLastUpdatedAt !== tracker.lastSeenStateLastUpdatedAt
         || sidecar.drainProcessedCount   !== tracker.lastSeenDrainProcessedCount

`stateLastUpdatedAt` is the write timestamp on the state file. Every
state rewrite touches it — including no-op rewrites that the auto-resume
path performs each time the orchestrator re-enters the loop. So during
any prolonged hang where the orchestrator keeps auto-resuming (e.g. a
subagent that returns without making progress and gets retried), the
stall clock resets on every cycle and the no-progress watchdog never
fires the USER_ACTION_REQUIRED escalation. Observed in the field as a
30+ minute hang with only heartbeat output, build_stall_threshold_ms
set to 30 min, and the user-action prompt never triggered.

## Fix

Drop `stateLastUpdatedAt` from the `moved` predicate. Add `phase` (the
sidecar's current phase index, which only advances on real forward
progress) as the new positive progress signal. The tracker retains
`drainProcessedCount` (also a real progress signal — only increments
when a drain item is actually processed). `lastSeenStateLastUpdatedAt`
is kept in the tracker file for backwards compatibility with pre-fix
on-disk tracker state, but the moved-check ignores it on read and the
write path preserves it for forward-compat readers.

Pre-fix tracker files without `lastSeenPhase` parse and load fine; the
moved-check is gated on `tracker.lastSeenPhase !== undefined` so the
old-tracker case falls through to "no positive signal observed", which
correctly arms the stall on the next tick.

## Regression tests

Three new tests in monitor.test.ts cover:

1. stateLastUpdatedAt-only advance → USER_ACTION_REQUIRED fires
   (the original regression — auto-resume tracker-reset)
2. phase advance → MONITOR_REENTER + RUN_RUNNING (positive case)
3. pre-fix tracker without lastSeenPhase → still escalates correctly
   (backwards-compat)

Full monitor.test.ts suite: 36 pass / 0 fail / 101 expect() calls.

## Investigation note: plan-review-loop STALEMATE

A second hypothesis ("plan-review-loop missing STALEMATE emission on
round-cap with synth ok:false") was investigated and ruled out. The
STALEMATE path already exists in plan-review-loop.ts at two sites
(synth_failure on the continue path and on the round-cap path), both
return finalResult(..., exitCode=1, ...), and cli.ts:11710-11722
properly throws ExitError(1) on the loopResult. End-to-end wiring is
correct. The observed 30+ minute hang was entirely attributable to
Bug #1 above (tracker reset on auto-resume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (Bug A)

cli.ts's outer try/finally at the heartbeat boundary calls
\`process.exit(exitCode)\` in the finally block. exitCode is initialized to
1 at function entry and set to 0 on the success path BEFORE any failure
branches throw — so a \`throw new ExitError(N)\` after a successful prior
phase exits with whatever exitCode happens to be at that moment (typically
0, i.e. false success), NOT the thrown ExitError's code. The top-level
catch at \`main().catch(err => process.exit(err.code))\` never runs because
the process already died inside the finally.

This is the load-bearing structural bug in the plan-review-loop restart
storm: a synth_failure that throws ExitError(1) exits 0, the monitor sees
success-shaped exit + a "synth_failure" state status, auto-resumes the
crashed run, which hits the same failure, ad infinitum.

Fix: capture any ExitError thrown inside the outer try via a
\`pendingExitError\` variable. The finally block computes
\`process.exit(pendingExitError?.code ?? exitCode)\` so the thrown code
wins. Pattern is three small additions:
- \`let pendingExitError: ExitError | null = null;\` near \`let exitCode\`
- a new \`catch (err) { if (err instanceof ExitError) pendingExitError = err; throw err; }\` between the try body and finally
- \`process.exit(pendingExitError?.code ?? exitCode)\` at the bottom of finally

This is correct for every \`throw new ExitError(N)\` in cli.ts — Exit 3
STALEMATE, Exit 4 USER ABORT, Exit 130 SIGINT, Exit 1 synth_failure — not
just the one fault path that surfaced it. Worth landing on its own merits.

Test pins the structural fix via static-grep checks plus a JS semantics
sanity check that the pattern correctly preserves thrown codes through
finally. Behavioral coverage comes via plan-reviewer-loop.test.ts and the
new cli-synth-failure-resume.test.ts in subsequent commits.

See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…STALEMATE event (Bug D+E loop-side)

plan-review-loop.ts now distinguishes "synth fails at round cap" from
"synth fails mid-loop". The cap case becomes the new outcome
\`synth_failure_stalemate\` with exit code 3 (matching SKILL.md Step 5.5's
documented STALEMATE contract) and emits a stdout JSON event
\`{"event":"STALEMATE","reason":"synth_failure_at_round_cap",...}\` so
operators and monitors can detect convergence failure programmatically.

Mid-loop synth_failure is unchanged — keeps exit code 1, outcome
\`synth_failure\`, retry semantics intact for transient model/network
blips. cli.ts's retry-promotion logic (Bug B fix in the next commit) is
the load-bearing fix for repeated mid-loop synth_failure that creates
the restart storm.

NOTE: the cap-synth-failure path is largely defensive — in practice,
\`shouldBailAdaptive\` at line 380 routes \`round >= maxRounds\` through
the stalemate gate BEFORE the loop body's synth call fires. So the new
exit-3 branch is reached only when an alternative loop flow (e.g.,
\`--no-adaptive-cap\` plus continue-path bail-out) lands at the cap with
a failing synth. Keeping the defensive code anyway so future
\`shouldBailAdaptive\` changes don't regress the contract.

Tests added in plan-reviewer-loop.test.ts:
- Mid-loop synth_failure with maxRounds=5 still returns exit 1 + outcome
  \`synth_failure\` (pins the no-false-positive invariant — the contract
  change is scoped to the cap, not promoted to every synth failure).

See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug D+E.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…romote to STALEMATE (Bug B)

cli.ts's resume condition at line 11552 previously included
\`state.planReview.status === "synth_failure"\` as a retry trigger,
optimized for transient mid-loop crashes. Combined with Bug A (now fixed)
and the monitor's auto-resume on RUN_STALE, this created the restart
storm: the orchestrator re-ran the same failing loop on every resume,
ad infinitum.

Per the plan, do NOT auto-retry synth_failure on resume. Add a stalemate
guard BEFORE the existing retry condition:

  if status === "synth_failure" OR "synth_failure_stalemate":
    emit STALEMATE event to stdout
    promote status to "synth_failure_stalemate" if needed
    throw ExitError(3) — operator intervention required

The transient-retry case the old code optimized for is, from the operator's
perspective, the same shape as the restart-storm case — both result in a
crashed run that the orchestrator can't recover automatically. The safer
default is to halt with clear guidance:
  1. Edit the plan to address objections, re-run.
  2. Re-launch with --no-plan-review to bypass the gate.
  3. Delete state.planReview to restart the loop fresh.

Also routes \`loopResult.outcome === "synth_failure_stalemate"\` (from the
cap path in plan-review-loop) to record \`status: "synth_failure_stalemate"\`
in the saved state, so the resume guard above recognizes it as definitely-
stalemate (not just first occurrence).

Tests pin the static wiring across five concerns:
- Stalemate guard appears BEFORE the loop-launch if
- Both "synth_failure" and "synth_failure_stalemate" statuses are guarded
- "synth_failure" removed from the resume-retry condition
- exit-3 handler routes synth_failure_stalemate outcome correctly
- STALEMATE event carries reason field for telemetry

See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er-run restart cap (Bug C component 2)

Two new guards on the monitor's auto-resume site, complementing the
heartbeat-tracker fix from commit bd10c04 (Bug C component 1):

1. **Stalemate refusal**: when \`state.planReview.status\` is
   \`synth_failure\` or \`synth_failure_stalemate\`, the monitor refuses
   auto-resume and emits a new \`RUN_HALTED_STALEMATE\` event instead.
   Prevents the storm cycle from starting even if cli.ts's stalemate
   guard (Bug B) were ever bypassed.

2. **Per-run restart cap**: counts auto-resume attempts in a sidecar
   file at \`<stateDir>/<stateSlug>.resume-count\`. Default cap 3,
   overridable via \`GSTACK_MONITOR_MAX_AUTO_RESUME\`. On reaching the
   cap, emits new \`RUN_HALTED_RESTART_CAP\` event. Defense in depth
   for any future failure class where auto-resume keeps spinning
   without converging.

Both new events get exit code 11 (USER_ACTION_REQUIRED-tier) so wrapper
scripts that key on operator-intervention events get the same signal
without needing to learn new exit codes.

Counter semantics: bumped only on actual spawn attempts (dry-run with
\`spawnResume: false\` doesn't consume budget). The counter file
naturally goes stale once a run completes; the next run on a different
stateSlug gets a fresh counter. Failure to write the counter file is
silent (best-effort) — the stalemate refusal guard above is the primary
defense, and the cap is a secondary safety net.

Tests cover: stalemate refusal for both synth_failure and
synth_failure_stalemate statuses, cap hits after default 3 resumes,
env var override raises the cap, RUN_RESUMED metadata surfaces
resumeCount + maxResumes for observability, counter persists across
monitor ticks.

See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md Bug C.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…case

After the plan-review-loop restart-storm fix (commits f2be7bf6d0d9d3),
Exit 3 STALEMATE has three sub-cases instead of two. Document the new
synth_failure_stalemate path and the operator recovery options specific
to it (override with --no-plan-review, edit plan + delete state.planReview
to restart fresh, etc.).

Also note the monitor's new restart-cap behavior (default 3,
GSTACK_MONITOR_MAX_AUTO_RESUME) so operators reading Step 5.5 know what
RUN_HALTED_STALEMATE and RUN_HALTED_RESTART_CAP events mean when they
appear in the monitor log.

Regenerated SKILL.md from the .tmpl change via gen:skill-docs.

See ~/.claude/plans/fix-plan-review-loop-stalemate-restart-storm.md SKILL.md task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant