Skip to content

fix(build/orchestrator): heartbeat tracker auto-resume false-progress (stall watchdog reset bug)#100

Merged
anbangr merged 1 commit into
mainfrom
fix/heartbeat-tracker-auto-resume-false-progress
May 26, 2026
Merged

fix(build/orchestrator): heartbeat tracker auto-resume false-progress (stall watchdog reset bug)#100
anbangr merged 1 commit into
mainfrom
fix/heartbeat-tracker-auto-resume-false-progress

Conversation

@anbangr

@anbangr anbangr commented May 26, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes a stall-watchdog blindspot where the orchestrator's heartbeat tracker treated stateLastUpdatedAt as a forward-progress signal. That timestamp gets touched on every state rewrite — including no-op rewrites that auto-resume performs each time the loop re-enters — so during any prolonged hang where the orchestrator keeps auto-resuming (e.g. a subagent that returns without progress and gets retried), the stall clock reset every tick and the no-progress watchdog never fired its USER_ACTION_REQUIRED escalation.

Observed in the field as a 30+ minute hang with only heartbeat output, build_stall_threshold_ms set to 30 min, and the user-action prompt never triggered.

Fix

build/orchestrator/monitor.ts — drop stateLastUpdatedAt from the moved predicate. Keep drainProcessedCount (real progress signal) and add phase (sidecar's phase index, only advances on real forward progress) as the new positive signal. Pre-fix tracker files without lastSeenPhase parse fine; the moved-check is gated on tracker.lastSeenPhase !== undefined so the old-tracker case falls through correctly to "no positive signal observed" on the next tick.

lastSeenStateLastUpdatedAt is retained in the tracker schema for backwards-compat with pre-fix on-disk state (read but ignored; written preserved for forward-compat readers).

Test Coverage

build/orchestrator/__tests__/monitor.test.ts — 3 new regression tests:

  1. STALL ARM: stateLastUpdatedAt advance is NOT progress (spawnResume tracker-reset regression) — pre-seeds tracker with stale state, advances only stateLastUpdatedAt on subsequent ticks, asserts USER_ACTION_REQUIRED fires (the bug repro).
  2. STALL ARM: phase advance IS progress (resets stall clock) — asserts MONITOR_REENTER + RUN_RUNNING when phase advances, and tracker upgrade to lastSeenPhase: 2.
  3. STALL ARM: pre-fix tracker without lastSeenPhase still works (backwards-compat) — old tracker schema (no lastSeenPhase field) still escalates correctly because the comparison gate defaults to "no positive signal."

Result: 36 pass / 0 fail / 101 expect() calls (~480ms).

Investigation note: plan-review-loop STALEMATE (false-alarm Bug #2)

A second hypothesis from the field investigation — "plan-review-loop missing STALEMATE emission on round-cap with synth returned ok:false" — was investigated and ruled out. The STALEMATE path already exists in plan-review-loop.ts at two sites:

  • plan-review-loop.ts:1424 — continue path on synth ok:false
  • plan-review-loop.ts:1468 — round-cap path on synth ok:false

Both return finalResult("synth_failure", round, 1, verdict). The caller in cli.ts:11710-11722 properly throws ExitError(1) on loopResult.exitCode === 1. End-to-end wiring is correct. The observed 30-min hang was entirely attributable to the tracker regression fixed in this PR.

Adversarial review summary

Reviewed for: backwards-compat with old trackers, phase wrap/backwards, brand-new-run signals at 0, test tautology, schema break for pre-fix readers, sidecar/tracker race. All clean. One minor theoretical concern (phase going backwards would count as progress) is a non-issue in normal flow — phase shouldn't go backwards in healthy operation, and if it does the run is already in a degraded state where a one-tick stall-clock reset is acceptable.

Test plan

  • bun test build/orchestrator/__tests__/monitor.test.ts — 36/36 pass
  • Smoke test: launch a real gstack-build run, simulate a synth hang, confirm watchdog fires within build_stall_threshold_ms (manual verification, recommended before merge)

…as false progress signal

## Bug

The HeartbeatTracker watchdog in monitor.ts compares the latest sidecar
heartbeat against a persisted "last seen" tracker to decide whether the
build run has stalled. The `moved` predicate was:

    moved = sidecar.stateLastUpdatedAt !== tracker.lastSeenStateLastUpdatedAt
         || sidecar.drainProcessedCount   !== tracker.lastSeenDrainProcessedCount

`stateLastUpdatedAt` is the write timestamp on the state file. Every
state rewrite touches it — including no-op rewrites that the auto-resume
path performs each time the orchestrator re-enters the loop. So during
any prolonged hang where the orchestrator keeps auto-resuming (e.g. a
subagent that returns without making progress and gets retried), the
stall clock resets on every cycle and the no-progress watchdog never
fires the USER_ACTION_REQUIRED escalation. Observed in the field as a
30+ minute hang with only heartbeat output, build_stall_threshold_ms
set to 30 min, and the user-action prompt never triggered.

## Fix

Drop `stateLastUpdatedAt` from the `moved` predicate. Add `phase` (the
sidecar's current phase index, which only advances on real forward
progress) as the new positive progress signal. The tracker retains
`drainProcessedCount` (also a real progress signal — only increments
when a drain item is actually processed). `lastSeenStateLastUpdatedAt`
is kept in the tracker file for backwards compatibility with pre-fix
on-disk tracker state, but the moved-check ignores it on read and the
write path preserves it for forward-compat readers.

Pre-fix tracker files without `lastSeenPhase` parse and load fine; the
moved-check is gated on `tracker.lastSeenPhase !== undefined` so the
old-tracker case falls through to "no positive signal observed", which
correctly arms the stall on the next tick.

## Regression tests

Three new tests in monitor.test.ts cover:

1. stateLastUpdatedAt-only advance → USER_ACTION_REQUIRED fires
   (the original regression — auto-resume tracker-reset)
2. phase advance → MONITOR_REENTER + RUN_RUNNING (positive case)
3. pre-fix tracker without lastSeenPhase → still escalates correctly
   (backwards-compat)

Full monitor.test.ts suite: 36 pass / 0 fail / 101 expect() calls.

## Investigation note: plan-review-loop STALEMATE

A second hypothesis ("plan-review-loop missing STALEMATE emission on
round-cap with synth ok:false") was investigated and ruled out. The
STALEMATE path already exists in plan-review-loop.ts at two sites
(synth_failure on the continue path and on the round-cap path), both
return finalResult(..., exitCode=1, ...), and cli.ts:11710-11722
properly throws ExitError(1) on the loopResult. End-to-end wiring is
correct. The observed 30+ minute hang was entirely attributable to
Bug #1 above (tracker reset on auto-resume).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anbangr anbangr merged commit 9ea7462 into main May 26, 2026
@anbangr anbangr deleted the fix/heartbeat-tracker-auto-resume-false-progress branch May 26, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant