fix(build/orchestrator): heartbeat tracker auto-resume false-progress (stall watchdog reset bug)#100
Merged
Conversation
…as false progress signal
## Bug
The HeartbeatTracker watchdog in monitor.ts compares the latest sidecar
heartbeat against a persisted "last seen" tracker to decide whether the
build run has stalled. The `moved` predicate was:
moved = sidecar.stateLastUpdatedAt !== tracker.lastSeenStateLastUpdatedAt
|| sidecar.drainProcessedCount !== tracker.lastSeenDrainProcessedCount
`stateLastUpdatedAt` is the write timestamp on the state file. Every
state rewrite touches it — including no-op rewrites that the auto-resume
path performs each time the orchestrator re-enters the loop. So during
any prolonged hang where the orchestrator keeps auto-resuming (e.g. a
subagent that returns without making progress and gets retried), the
stall clock resets on every cycle and the no-progress watchdog never
fires the USER_ACTION_REQUIRED escalation. Observed in the field as a
30+ minute hang with only heartbeat output, build_stall_threshold_ms
set to 30 min, and the user-action prompt never triggered.
## Fix
Drop `stateLastUpdatedAt` from the `moved` predicate. Add `phase` (the
sidecar's current phase index, which only advances on real forward
progress) as the new positive progress signal. The tracker retains
`drainProcessedCount` (also a real progress signal — only increments
when a drain item is actually processed). `lastSeenStateLastUpdatedAt`
is kept in the tracker file for backwards compatibility with pre-fix
on-disk tracker state, but the moved-check ignores it on read and the
write path preserves it for forward-compat readers.
Pre-fix tracker files without `lastSeenPhase` parse and load fine; the
moved-check is gated on `tracker.lastSeenPhase !== undefined` so the
old-tracker case falls through to "no positive signal observed", which
correctly arms the stall on the next tick.
## Regression tests
Three new tests in monitor.test.ts cover:
1. stateLastUpdatedAt-only advance → USER_ACTION_REQUIRED fires
(the original regression — auto-resume tracker-reset)
2. phase advance → MONITOR_REENTER + RUN_RUNNING (positive case)
3. pre-fix tracker without lastSeenPhase → still escalates correctly
(backwards-compat)
Full monitor.test.ts suite: 36 pass / 0 fail / 101 expect() calls.
## Investigation note: plan-review-loop STALEMATE
A second hypothesis ("plan-review-loop missing STALEMATE emission on
round-cap with synth ok:false") was investigated and ruled out. The
STALEMATE path already exists in plan-review-loop.ts at two sites
(synth_failure on the continue path and on the round-cap path), both
return finalResult(..., exitCode=1, ...), and cli.ts:11710-11722
properly throws ExitError(1) on the loopResult. End-to-end wiring is
correct. The observed 30+ minute hang was entirely attributable to
Bug #1 above (tracker reset on auto-resume).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a stall-watchdog blindspot where the orchestrator's heartbeat tracker treated
stateLastUpdatedAtas a forward-progress signal. That timestamp gets touched on every state rewrite — including no-op rewrites that auto-resume performs each time the loop re-enters — so during any prolonged hang where the orchestrator keeps auto-resuming (e.g. a subagent that returns without progress and gets retried), the stall clock reset every tick and the no-progress watchdog never fired itsUSER_ACTION_REQUIREDescalation.Observed in the field as a 30+ minute hang with only heartbeat output,
build_stall_threshold_msset to 30 min, and the user-action prompt never triggered.Fix
build/orchestrator/monitor.ts— dropstateLastUpdatedAtfrom themovedpredicate. KeepdrainProcessedCount(real progress signal) and addphase(sidecar's phase index, only advances on real forward progress) as the new positive signal. Pre-fix tracker files withoutlastSeenPhaseparse fine; the moved-check is gated ontracker.lastSeenPhase !== undefinedso the old-tracker case falls through correctly to "no positive signal observed" on the next tick.lastSeenStateLastUpdatedAtis retained in the tracker schema for backwards-compat with pre-fix on-disk state (read but ignored; written preserved for forward-compat readers).Test Coverage
build/orchestrator/__tests__/monitor.test.ts— 3 new regression tests:STALL ARM: stateLastUpdatedAt advance is NOT progress (spawnResume tracker-reset regression)— pre-seeds tracker with stale state, advances onlystateLastUpdatedAton subsequent ticks, assertsUSER_ACTION_REQUIREDfires (the bug repro).STALL ARM: phase advance IS progress (resets stall clock)— assertsMONITOR_REENTER + RUN_RUNNINGwhen phase advances, and tracker upgrade tolastSeenPhase: 2.STALL ARM: pre-fix tracker without lastSeenPhase still works (backwards-compat)— old tracker schema (nolastSeenPhasefield) still escalates correctly because the comparison gate defaults to "no positive signal."Result: 36 pass / 0 fail / 101 expect() calls (~480ms).
Investigation note: plan-review-loop STALEMATE (false-alarm Bug #2)
A second hypothesis from the field investigation — "plan-review-loop missing STALEMATE emission on round-cap with
synth returned ok:false" — was investigated and ruled out. The STALEMATE path already exists inplan-review-loop.tsat two sites:plan-review-loop.ts:1424— continue path onsynth ok:falseplan-review-loop.ts:1468— round-cap path onsynth ok:falseBoth return
finalResult("synth_failure", round, 1, verdict). The caller incli.ts:11710-11722properly throwsExitError(1)onloopResult.exitCode === 1. End-to-end wiring is correct. The observed 30-min hang was entirely attributable to the tracker regression fixed in this PR.Adversarial review summary
Reviewed for: backwards-compat with old trackers, phase wrap/backwards, brand-new-run signals at 0, test tautology, schema break for pre-fix readers, sidecar/tracker race. All clean. One minor theoretical concern (phase going backwards would count as progress) is a non-issue in normal flow — phase shouldn't go backwards in healthy operation, and if it does the run is already in a degraded state where a one-tick stall-clock reset is acceptable.
Test plan
bun test build/orchestrator/__tests__/monitor.test.ts— 36/36 passbuild_stall_threshold_ms(manual verification, recommended before merge)