fix: supervisor treats code=0 watchdog exits as crashes by garrytan-agents · Pull Request #1002 · garrytan/gbrain

garrytan-agents · 2026-05-14T23:13:57Z

Problem

The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts, the supervisor gave up with max_crashes_exceeded.

On a 96K-page brain, process.memoryUsage().rss reports ~7GB during autopilot sync (likely shared page inflation from git mmap operations). The 4096MB threshold triggers on every autopilot cycle, causing the worker to exit cleanly after ~5 min. The supervisor then counts this as a crash.

Result: 73 "crashes" in 24h, all code=0. Supervisor keeps dying and being restarted by external crons.

Fix

code=0 exits reset crashCount to 0 and restart immediately (no backoff)
Only code≠0 exits count toward the crash limit
Clean exits skip the backoff delay entirely

Separate Issue (not addressed here)

process.memoryUsage().rss inflates to 7GB during sync on large repos. The 4096MB watchdog threshold triggers every cycle. This may need a separate fix (use /proc/self/statm for accurate RSS, or raise the threshold for large repos).

The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts (typical with a 96K-page brain where autopilot inflates RSS), the supervisor gave up with max_crashes_exceeded. Fix: code=0 exits reset crashCount to 0 and restart immediately with no backoff. Only code≠0 exits count toward the crash limit. Root cause: process.memoryUsage().rss reports 7GB during autopilot sync on large repos (possibly shared page inflation from git mmap). The 4096MB threshold triggers on every cycle. This is a separate issue (RSS measurement accuracy) but the supervisor should handle clean exits regardless.

garrytan · 2026-05-14T23:38:03Z

Moving to base-repo branch for CI secret access (fork PRs from non-collaborator accounts don't receive base-repo secrets). Replacement PR coming up — same title, same body, same commits.

process.memoryUsage().rss returns VmRSS which includes file-backed mmap'd pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB. The kernel reclaims these pages under memory pressure — they're cache. Replace with /proc/self/status RssAnon + RssShmem which measures only anonymous pages (heap, stack, anonymous mmap). This is the memory that actually matters for OOM risk. Falls back to process.memoryUsage().rss on non-Linux. Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold) After: watchdog only triggers on real memory growth (~100MB << 4GB threshold) Related: #1002 (supervisor crash-count fix for the same symptom)

) * fix: supervisor treats code=0 watchdog exits as crashes The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts (typical with a 96K-page brain where autopilot inflates RSS), the supervisor gave up with max_crashes_exceeded. Fix: code=0 exits reset crashCount to 0 and restart immediately with no backoff. Only code≠0 exits count toward the crash limit. Root cause: process.memoryUsage().rss reports 7GB during autopilot sync on large repos (possibly shared page inflation from git mmap). The 4096MB threshold triggers on every cycle. This is a separate issue (RSS measurement accuracy) but the supervisor should handle clean exits regardless. * fix: use RssAnon instead of VmRSS for watchdog threshold process.memoryUsage().rss returns VmRSS which includes file-backed mmap'd pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB. The kernel reclaims these pages under memory pressure — they're cache. Replace with /proc/self/status RssAnon + RssShmem which measures only anonymous pages (heap, stack, anonymous mmap). This is the memory that actually matters for OOM risk. Falls back to process.memoryUsage().rss on non-Linux. Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold) After: watchdog only triggers on real memory growth (~100MB << 4GB threshold) Related: #1002 (supervisor crash-count fix for the same symptom) * refactor(minions): extract ChildWorkerSupervisor with D1/D2 amendments MinionSupervisor and src/commands/autopilot.ts each owned a separate spawn-and-respawn loop. PR #1003 fixed the supervisor's crash-counter bug (counting code=0 watchdog drains as crashes) but the autopilot loop has the same bug class. Worse, the as-shipped #1003 fix reset crashCount=0 on every code=0 exit, which lost the "flapping worker" signal in mixed-exit sequences. Extract the shared spawn loop into ChildWorkerSupervisor so both consumers compose one tested core. The new class bakes in two amendments resolved during plan-eng-review: D1 (lastExitCode track): code=0 exits no longer touch crashCount. They emit ms:0 backoff and restart immediately, but the counter survives across them. A worker alternating exit 1 / exit 0 / exit 1 correctly trips max_crashes; a worker drained 100 times by the watchdog stays at crashCount=0 and runs forever (also correct). D2 (clean-restart budget): on platforms where the watchdog measures VmRSS instead of RssAnon (macOS, kernel <4.5, restricted containers), a perpetually over-threshold worker could clean-exit in a tight loop with no observability. New `cleanRestartBudget` option (default 10 clean restarts per 60s window) emits a `health_warn` and applies backoff once exceeded. The supervisor now delegates spawn/respawn/backoff to the inner class and maps ChildSupervisorEvent → existing SupervisorEvent emit() channel so JSONL audit consumers see byte-compatible output. PID lock, signal handlers, health check, and process.exit on max-crashes stay in MinionSupervisor (those are standalone-daemon concerns the autopilot composer doesn't need). Tests: 6 new ChildWorkerSupervisor cases (D1 classifier, interleaved exits, stable-run + clean-exit interaction, D2 budget tripping, per- instance config isolation, event shape regression). Existing supervisor tests updated to use exit-1 workers where they previously relied on clean-exit-as-crash semantics; their assertions (env plumbing, PID lock, audit shape) are unaffected. Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor(autopilot): compose ChildWorkerSupervisor instead of inline spawn loop src/commands/autopilot.ts:165-197 used to have its own spawn-and- respawn loop separate from MinionSupervisor's. It hardcoded maxCrashes=5, fixed 10s backoff, and counted every exit (including code=0) toward the crash limit. Codex flagged this during plan-eng review: the parallel implementation had the same bug class fixed in #1003, just on a different code path. Anyone running `gbrain autopilot` as a long-running daemon (instead of `gbrain jobs supervisor`) would hit it. Replace the inline `startWorker` + `child.on('exit')` block with a ChildWorkerSupervisor instance. Drops the parallel `crashCount`, `lastWorkerStartTime`, and `STABLE_RUN_RESET_MS` state. The ChildWorkerSupervisor's D1 lastExitCode track + D2 clean-restart budget apply to autopilot for free. Shutdown now drains via the supervisor's killChild + awaitChildExit typed surface instead of reaching into `workerProc` directly. The onMaxCrashesExceeded callback routes through autopilot's existing shutdown('max_crashes') path so the lockfile gets cleaned up (pre-refactor, the inline loop called process.exit(1) directly and bypassed the cleanup). Regression coverage in test/autopilot-supervisor-wiring.test.ts: static-shape grep guards for `--max-rss 2048`, `maxCrashes: 5`, the shutdown-via-callback wiring, and absence of the legacy inline names (startWorker, workerProc, crashCount, lastWorkerStartTime, STABLE_RUN_RESET_MS). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(worker): parse RssAnon as field-presence + soften OOM docstring Two follow-ups to the RssAnon watchdog fix (b81c598), both surfaced during plan-eng-review by Codex. M1: getAccurateRss() used `if (anonKb > 0) return ...` to decide whether to use the /proc/self/status reading or fall back to process.memoryUsage().rss. That conflated "RssAnon field missing" (old kernel, non-Linux) with "RssAnon field present but zero" (a near-empty worker process whose only memory is shmem). The legitimate shmem-only worker case fell through to VmRSS even though /proc had a valid reading. Fix: split the pure parser (parseRssFromProcStatus) into a separate exported function that checks field presence via regex match, not value comparison. Returns null only when the field text doesn't match `^RssAnon:\s+(\d+)` AND `^RssShmem:\s+(\d+)`. Both fields present + both zero is now a valid reading of 0 bytes. M2: the docstring claimed RssAnon + RssShmem was "the memory that actually matters for OOM risk." Codex pushed back: this is correct for per-process leak detection but NOT a full container-OOM metric, because cgroup memory pressure includes page cache. Soften to "non-file-backed resident memory used for per-process leak detection" and call out the cgroup caveat explicitly. getAccurateRss now takes an optional readStatus function for testability. Production callers use the default; tests inject canned status text to cover the M1 regression and the fallback paths without mocking the filesystem. Tests: 11 cases covering parseRssFromProcStatus (normal, M1 regression with anon=0 + shmem>0, both-zero, missing fields, malformed values, shmem-only) and getAccurateRss (injected reader, ENOENT fallback, old-kernel fallback, malformed-value fallback). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(minions): awaitChildExit short-circuits when child already exited Pre-fix, awaitChildExit registered `child.once('exit', ...)` without checking whether the child had already terminated. If the child drained between killChild('SIGTERM') and awaitChildExit() — common on fast SIGTERM responders — Node's 'exit' event had already fired, the late listener never resolved, and the caller waited out the full timeout. On the supervisor's clean shutdown path that's a 35-second hang on every quick child. Probe `child.exitCode` and `child.signalCode` first; resolve immediately when either is non-null. Sub-second clean shutdown restored. Pre-existing in the legacy supervisor.ts shape (same bug pattern), but since the refactor consolidates child-process management into one class, fix the pattern at the new seam. Regression test in test/child-worker-supervisor.test.ts: run one full spawn cycle, then call awaitChildExit on the already-finished cycle and assert it returns in under 200ms (well under any test timeout). Surfaced during pre-landing /review on the fix wave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.34.3.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CLAUDE.md key-files entries for v0.34.3.0 Reflects the ChildWorkerSupervisor extraction shipped in this branch: - Add new entry for src/core/minions/child-worker-supervisor.ts covering D1 lastExitCode classifier, D2 clean-restart budget, the awaitChildExit short-circuit, and test pinning at test/child-worker-supervisor.test.ts - Update src/core/minions/supervisor.ts entry to note the spawn-loop extraction into the shared core + the byte-compatible event-shape mapping that preserves JSONL audit consumers - Update src/commands/autopilot.ts entry to note the parallel- supervisor elimination + the shutdown-via-callback wiring - Update src/core/minions/worker.ts entry with the new RssAnon / getAccurateRss exports + the M1 field-presence parser fix Regenerated llms-full.txt to match (per project rule: every CLAUDE.md edit must be followed by bun run build:llms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Wintermute <wintermute@garrytan.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan closed this May 14, 2026

garrytan-agents mentioned this pull request May 14, 2026

fix: use RssAnon instead of VmRSS for watchdog threshold #1004

Closed

garrytan-agents mentioned this pull request May 15, 2026

fix(doctor): only count non-zero exit codes as worker crashes #1050

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: supervisor treats code=0 watchdog exits as crashes#1002

fix: supervisor treats code=0 watchdog exits as crashes#1002
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/supervisor-clean-exit-crash-count

garrytan-agents commented May 14, 2026

Uh oh!

garrytan commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

garrytan-agents commented May 14, 2026

Problem

Fix

Separate Issue (not addressed here)

Uh oh!

garrytan commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants