Skip to content

fix: supervisor treats code=0 watchdog exits as crashes#1002

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/supervisor-clean-exit-crash-count
Closed

fix: supervisor treats code=0 watchdog exits as crashes#1002
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/supervisor-clean-exit-crash-count

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts, the supervisor gave up with max_crashes_exceeded.

On a 96K-page brain, process.memoryUsage().rss reports ~7GB during autopilot sync (likely shared page inflation from git mmap operations). The 4096MB threshold triggers on every autopilot cycle, causing the worker to exit cleanly after ~5 min. The supervisor then counts this as a crash.

Result: 73 "crashes" in 24h, all code=0. Supervisor keeps dying and being restarted by external crons.

Fix

  • code=0 exits reset crashCount to 0 and restart immediately (no backoff)
  • Only code≠0 exits count toward the crash limit
  • Clean exits skip the backoff delay entirely

Separate Issue (not addressed here)

process.memoryUsage().rss inflates to 7GB during sync on large repos. The 4096MB watchdog threshold triggers every cycle. This may need a separate fix (use /proc/self/statm for accurate RSS, or raise the threshold for large repos).

The RSS watchdog triggers gracefulShutdown() which exits with code 0.
The supervisor was counting ALL exits < 5min as crashes, including
clean code=0 exits. After 10 watchdog-triggered restarts (typical with
a 96K-page brain where autopilot inflates RSS), the supervisor gave up
with max_crashes_exceeded.

Fix: code=0 exits reset crashCount to 0 and restart immediately with
no backoff. Only code≠0 exits count toward the crash limit.

Root cause: process.memoryUsage().rss reports 7GB during autopilot
sync on large repos (possibly shared page inflation from git mmap).
The 4096MB threshold triggers on every cycle. This is a separate
issue (RSS measurement accuracy) but the supervisor should handle
clean exits regardless.
@garrytan

Copy link
Copy Markdown
Owner

Moving to base-repo branch for CI secret access (fork PRs from non-collaborator accounts don't receive base-repo secrets). Replacement PR coming up — same title, same body, same commits.

@garrytan garrytan closed this May 14, 2026
garrytan pushed a commit that referenced this pull request May 15, 2026
process.memoryUsage().rss returns VmRSS which includes file-backed
mmap'd pages. On repos with large git packfiles (96K+ pages), git
operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.
The kernel reclaims these pages under memory pressure — they're cache.

Replace with /proc/self/status RssAnon + RssShmem which measures only
anonymous pages (heap, stack, anonymous mmap). This is the memory that
actually matters for OOM risk.

Falls back to process.memoryUsage().rss on non-Linux.

Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After:  watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related: #1002 (supervisor crash-count fix for the same symptom)
garrytan added a commit that referenced this pull request May 15, 2026
)

* fix: supervisor treats code=0 watchdog exits as crashes

The RSS watchdog triggers gracefulShutdown() which exits with code 0.
The supervisor was counting ALL exits < 5min as crashes, including
clean code=0 exits. After 10 watchdog-triggered restarts (typical with
a 96K-page brain where autopilot inflates RSS), the supervisor gave up
with max_crashes_exceeded.

Fix: code=0 exits reset crashCount to 0 and restart immediately with
no backoff. Only code≠0 exits count toward the crash limit.

Root cause: process.memoryUsage().rss reports 7GB during autopilot
sync on large repos (possibly shared page inflation from git mmap).
The 4096MB threshold triggers on every cycle. This is a separate
issue (RSS measurement accuracy) but the supervisor should handle
clean exits regardless.

* fix: use RssAnon instead of VmRSS for watchdog threshold

process.memoryUsage().rss returns VmRSS which includes file-backed
mmap'd pages. On repos with large git packfiles (96K+ pages), git
operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.
The kernel reclaims these pages under memory pressure — they're cache.

Replace with /proc/self/status RssAnon + RssShmem which measures only
anonymous pages (heap, stack, anonymous mmap). This is the memory that
actually matters for OOM risk.

Falls back to process.memoryUsage().rss on non-Linux.

Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After:  watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related: #1002 (supervisor crash-count fix for the same symptom)

* refactor(minions): extract ChildWorkerSupervisor with D1/D2 amendments

MinionSupervisor and src/commands/autopilot.ts each owned a separate
spawn-and-respawn loop. PR #1003 fixed the supervisor's crash-counter
bug (counting code=0 watchdog drains as crashes) but the autopilot
loop has the same bug class. Worse, the as-shipped #1003 fix reset
crashCount=0 on every code=0 exit, which lost the "flapping worker"
signal in mixed-exit sequences.

Extract the shared spawn loop into ChildWorkerSupervisor so both
consumers compose one tested core. The new class bakes in two
amendments resolved during plan-eng-review:

D1 (lastExitCode track): code=0 exits no longer touch crashCount.
They emit ms:0 backoff and restart immediately, but the counter
survives across them. A worker alternating exit 1 / exit 0 / exit 1
correctly trips max_crashes; a worker drained 100 times by the
watchdog stays at crashCount=0 and runs forever (also correct).

D2 (clean-restart budget): on platforms where the watchdog measures
VmRSS instead of RssAnon (macOS, kernel <4.5, restricted containers),
a perpetually over-threshold worker could clean-exit in a tight loop
with no observability. New `cleanRestartBudget` option (default 10
clean restarts per 60s window) emits a `health_warn` and applies
backoff once exceeded.

The supervisor now delegates spawn/respawn/backoff to the inner
class and maps ChildSupervisorEvent → existing SupervisorEvent
emit() channel so JSONL audit consumers see byte-compatible output.
PID lock, signal handlers, health check, and process.exit on
max-crashes stay in MinionSupervisor (those are standalone-daemon
concerns the autopilot composer doesn't need).

Tests: 6 new ChildWorkerSupervisor cases (D1 classifier, interleaved
exits, stable-run + clean-exit interaction, D2 budget tripping, per-
instance config isolation, event shape regression). Existing supervisor
tests updated to use exit-1 workers where they previously relied on
clean-exit-as-crash semantics; their assertions (env plumbing, PID
lock, audit shape) are unaffected.

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(autopilot): compose ChildWorkerSupervisor instead of inline spawn loop

src/commands/autopilot.ts:165-197 used to have its own spawn-and-
respawn loop separate from MinionSupervisor's. It hardcoded
maxCrashes=5, fixed 10s backoff, and counted every exit (including
code=0) toward the crash limit. Codex flagged this during plan-eng
review: the parallel implementation had the same bug class fixed
in #1003, just on a different code path. Anyone running
`gbrain autopilot` as a long-running daemon (instead of
`gbrain jobs supervisor`) would hit it.

Replace the inline `startWorker` + `child.on('exit')` block with
a ChildWorkerSupervisor instance. Drops the parallel `crashCount`,
`lastWorkerStartTime`, and `STABLE_RUN_RESET_MS` state. The
ChildWorkerSupervisor's D1 lastExitCode track + D2 clean-restart
budget apply to autopilot for free.

Shutdown now drains via the supervisor's killChild + awaitChildExit
typed surface instead of reaching into `workerProc` directly. The
onMaxCrashesExceeded callback routes through autopilot's existing
shutdown('max_crashes') path so the lockfile gets cleaned up
(pre-refactor, the inline loop called process.exit(1) directly and
bypassed the cleanup).

Regression coverage in test/autopilot-supervisor-wiring.test.ts:
static-shape grep guards for `--max-rss 2048`, `maxCrashes: 5`,
the shutdown-via-callback wiring, and absence of the legacy inline
names (startWorker, workerProc, crashCount, lastWorkerStartTime,
STABLE_RUN_RESET_MS).

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(worker): parse RssAnon as field-presence + soften OOM docstring

Two follow-ups to the RssAnon watchdog fix (b81c598), both surfaced
during plan-eng-review by Codex.

M1: getAccurateRss() used `if (anonKb > 0) return ...` to decide
whether to use the /proc/self/status reading or fall back to
process.memoryUsage().rss. That conflated "RssAnon field missing"
(old kernel, non-Linux) with "RssAnon field present but zero" (a
near-empty worker process whose only memory is shmem). The legitimate
shmem-only worker case fell through to VmRSS even though /proc had a
valid reading.

Fix: split the pure parser (parseRssFromProcStatus) into a separate
exported function that checks field presence via regex match, not
value comparison. Returns null only when the field text doesn't
match `^RssAnon:\s+(\d+)` AND `^RssShmem:\s+(\d+)`. Both fields
present + both zero is now a valid reading of 0 bytes.

M2: the docstring claimed RssAnon + RssShmem was "the memory that
actually matters for OOM risk." Codex pushed back: this is correct
for per-process leak detection but NOT a full container-OOM metric,
because cgroup memory pressure includes page cache. Soften to
"non-file-backed resident memory used for per-process leak
detection" and call out the cgroup caveat explicitly.

getAccurateRss now takes an optional readStatus function for
testability. Production callers use the default; tests inject
canned status text to cover the M1 regression and the fallback paths
without mocking the filesystem.

Tests: 11 cases covering parseRssFromProcStatus (normal, M1 regression
with anon=0 + shmem>0, both-zero, missing fields, malformed values,
shmem-only) and getAccurateRss (injected reader, ENOENT fallback,
old-kernel fallback, malformed-value fallback).

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(minions): awaitChildExit short-circuits when child already exited

Pre-fix, awaitChildExit registered `child.once('exit', ...)` without
checking whether the child had already terminated. If the child drained
between killChild('SIGTERM') and awaitChildExit() — common on fast
SIGTERM responders — Node's 'exit' event had already fired, the late
listener never resolved, and the caller waited out the full timeout.
On the supervisor's clean shutdown path that's a 35-second hang on
every quick child.

Probe `child.exitCode` and `child.signalCode` first; resolve
immediately when either is non-null. Sub-second clean shutdown
restored.

Pre-existing in the legacy supervisor.ts shape (same bug pattern),
but since the refactor consolidates child-process management into one
class, fix the pattern at the new seam.

Regression test in test/child-worker-supervisor.test.ts: run one full
spawn cycle, then call awaitChildExit on the already-finished cycle
and assert it returns in under 200ms (well under any test timeout).

Surfaced during pre-landing /review on the fix wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.34.3.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md key-files entries for v0.34.3.0

Reflects the ChildWorkerSupervisor extraction shipped in this branch:

- Add new entry for src/core/minions/child-worker-supervisor.ts
  covering D1 lastExitCode classifier, D2 clean-restart budget, the
  awaitChildExit short-circuit, and test pinning at
  test/child-worker-supervisor.test.ts
- Update src/core/minions/supervisor.ts entry to note the spawn-loop
  extraction into the shared core + the byte-compatible event-shape
  mapping that preserves JSONL audit consumers
- Update src/commands/autopilot.ts entry to note the parallel-
  supervisor elimination + the shutdown-via-callback wiring
- Update src/core/minions/worker.ts entry with the new RssAnon /
  getAccurateRss exports + the M1 field-presence parser fix

Regenerated llms-full.txt to match (per project rule: every CLAUDE.md
edit must be followed by bun run build:llms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Wintermute <wintermute@garrytan.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants