v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes#1003
Merged
Conversation
The RSS watchdog triggers gracefulShutdown() which exits with code 0. The supervisor was counting ALL exits < 5min as crashes, including clean code=0 exits. After 10 watchdog-triggered restarts (typical with a 96K-page brain where autopilot inflates RSS), the supervisor gave up with max_crashes_exceeded. Fix: code=0 exits reset crashCount to 0 and restart immediately with no backoff. Only code≠0 exits count toward the crash limit. Root cause: process.memoryUsage().rss reports 7GB during autopilot sync on large repos (possibly shared page inflation from git mmap). The 4096MB threshold triggers on every cycle. This is a separate issue (RSS measurement accuracy) but the supervisor should handle clean exits regardless.
process.memoryUsage().rss returns VmRSS which includes file-backed mmap'd pages. On repos with large git packfiles (96K+ pages), git operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB. The kernel reclaims these pages under memory pressure — they're cache. Replace with /proc/self/status RssAnon + RssShmem which measures only anonymous pages (heap, stack, anonymous mmap). This is the memory that actually matters for OOM risk. Falls back to process.memoryUsage().rss on non-Linux. Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold) After: watchdog only triggers on real memory growth (~100MB << 4GB threshold) Related: #1002 (supervisor crash-count fix for the same symptom)
MinionSupervisor and src/commands/autopilot.ts each owned a separate spawn-and-respawn loop. PR #1003 fixed the supervisor's crash-counter bug (counting code=0 watchdog drains as crashes) but the autopilot loop has the same bug class. Worse, the as-shipped #1003 fix reset crashCount=0 on every code=0 exit, which lost the "flapping worker" signal in mixed-exit sequences. Extract the shared spawn loop into ChildWorkerSupervisor so both consumers compose one tested core. The new class bakes in two amendments resolved during plan-eng-review: D1 (lastExitCode track): code=0 exits no longer touch crashCount. They emit ms:0 backoff and restart immediately, but the counter survives across them. A worker alternating exit 1 / exit 0 / exit 1 correctly trips max_crashes; a worker drained 100 times by the watchdog stays at crashCount=0 and runs forever (also correct). D2 (clean-restart budget): on platforms where the watchdog measures VmRSS instead of RssAnon (macOS, kernel <4.5, restricted containers), a perpetually over-threshold worker could clean-exit in a tight loop with no observability. New `cleanRestartBudget` option (default 10 clean restarts per 60s window) emits a `health_warn` and applies backoff once exceeded. The supervisor now delegates spawn/respawn/backoff to the inner class and maps ChildSupervisorEvent → existing SupervisorEvent emit() channel so JSONL audit consumers see byte-compatible output. PID lock, signal handlers, health check, and process.exit on max-crashes stay in MinionSupervisor (those are standalone-daemon concerns the autopilot composer doesn't need). Tests: 6 new ChildWorkerSupervisor cases (D1 classifier, interleaved exits, stable-run + clean-exit interaction, D2 budget tripping, per- instance config isolation, event shape regression). Existing supervisor tests updated to use exit-1 workers where they previously relied on clean-exit-as-crash semantics; their assertions (env plumbing, PID lock, audit shape) are unaffected. Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…spawn loop src/commands/autopilot.ts:165-197 used to have its own spawn-and- respawn loop separate from MinionSupervisor's. It hardcoded maxCrashes=5, fixed 10s backoff, and counted every exit (including code=0) toward the crash limit. Codex flagged this during plan-eng review: the parallel implementation had the same bug class fixed in #1003, just on a different code path. Anyone running `gbrain autopilot` as a long-running daemon (instead of `gbrain jobs supervisor`) would hit it. Replace the inline `startWorker` + `child.on('exit')` block with a ChildWorkerSupervisor instance. Drops the parallel `crashCount`, `lastWorkerStartTime`, and `STABLE_RUN_RESET_MS` state. The ChildWorkerSupervisor's D1 lastExitCode track + D2 clean-restart budget apply to autopilot for free. Shutdown now drains via the supervisor's killChild + awaitChildExit typed surface instead of reaching into `workerProc` directly. The onMaxCrashesExceeded callback routes through autopilot's existing shutdown('max_crashes') path so the lockfile gets cleaned up (pre-refactor, the inline loop called process.exit(1) directly and bypassed the cleanup). Regression coverage in test/autopilot-supervisor-wiring.test.ts: static-shape grep guards for `--max-rss 2048`, `maxCrashes: 5`, the shutdown-via-callback wiring, and absence of the legacy inline names (startWorker, workerProc, crashCount, lastWorkerStartTime, STABLE_RUN_RESET_MS). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to the RssAnon watchdog fix (b81c598), both surfaced during plan-eng-review by Codex. M1: getAccurateRss() used `if (anonKb > 0) return ...` to decide whether to use the /proc/self/status reading or fall back to process.memoryUsage().rss. That conflated "RssAnon field missing" (old kernel, non-Linux) with "RssAnon field present but zero" (a near-empty worker process whose only memory is shmem). The legitimate shmem-only worker case fell through to VmRSS even though /proc had a valid reading. Fix: split the pure parser (parseRssFromProcStatus) into a separate exported function that checks field presence via regex match, not value comparison. Returns null only when the field text doesn't match `^RssAnon:\s+(\d+)` AND `^RssShmem:\s+(\d+)`. Both fields present + both zero is now a valid reading of 0 bytes. M2: the docstring claimed RssAnon + RssShmem was "the memory that actually matters for OOM risk." Codex pushed back: this is correct for per-process leak detection but NOT a full container-OOM metric, because cgroup memory pressure includes page cache. Soften to "non-file-backed resident memory used for per-process leak detection" and call out the cgroup caveat explicitly. getAccurateRss now takes an optional readStatus function for testability. Production callers use the default; tests inject canned status text to cover the M1 regression and the fallback paths without mocking the filesystem. Tests: 11 cases covering parseRssFromProcStatus (normal, M1 regression with anon=0 + shmem>0, both-zero, missing fields, malformed values, shmem-only) and getAccurateRss (injected reader, ENOENT fallback, old-kernel fallback, malformed-value fallback). Co-Authored-By: Wintermute <wintermute@garrytan.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, awaitChildExit registered `child.once('exit', ...)` without
checking whether the child had already terminated. If the child drained
between killChild('SIGTERM') and awaitChildExit() — common on fast
SIGTERM responders — Node's 'exit' event had already fired, the late
listener never resolved, and the caller waited out the full timeout.
On the supervisor's clean shutdown path that's a 35-second hang on
every quick child.
Probe `child.exitCode` and `child.signalCode` first; resolve
immediately when either is non-null. Sub-second clean shutdown
restored.
Pre-existing in the legacy supervisor.ts shape (same bug pattern),
but since the refactor consolidates child-process management into one
class, fix the pattern at the new seam.
Regression test in test/child-worker-supervisor.test.ts: run one full
spawn cycle, then call awaitChildExit on the already-finished cycle
and assert it returns in under 200ms (well under any test timeout).
Surfaced during pre-landing /review on the fix wave.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-exit-crash-count
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-exit-crash-count # Conflicts: # CHANGELOG.md # VERSION # package.json
Reflects the ChildWorkerSupervisor extraction shipped in this branch: - Add new entry for src/core/minions/child-worker-supervisor.ts covering D1 lastExitCode classifier, D2 clean-restart budget, the awaitChildExit short-circuit, and test pinning at test/child-worker-supervisor.test.ts - Update src/core/minions/supervisor.ts entry to note the spawn-loop extraction into the shared core + the byte-compatible event-shape mapping that preserves JSONL audit consumers - Update src/commands/autopilot.ts entry to note the parallel- supervisor elimination + the shutdown-via-callback wiring - Update src/core/minions/worker.ts entry with the new RssAnon / getAccurateRss exports + the M1 field-presence parser fix Regenerated llms-full.txt to match (per project rule: every CLAUDE.md edit must be followed by bun run build:llms). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-exit-crash-count # Conflicts: # CHANGELOG.md # VERSION # package.json
garrytan
added a commit
that referenced
this pull request
May 15, 2026
Master shipped v0.34.3.0 (PR #1003, supervisor clean-exit watchdog fix). No new migrations — only src/core/minions/* + src/commands/autopilot.ts touched, so our v66 partial-index migration stays in place. Conflict resolutions: - VERSION: take 0.34.4.0 - package.json: take 0.34.4.0 - CHANGELOG.md: clean non-interleaved conflict; strip markers and keep both entries (v0.34.4.0 on top, v0.34.3.0 below, then the existing v0.34.2.0 and v0.34.1.0 in descending order) Typecheck clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
brandonlipman
added a commit
to brandonlipman/gbrain
that referenced
this pull request
May 29, 2026
* upstream/master: v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055) v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008) v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991) v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003) v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988) v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996) v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994) v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934) v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992) fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982) v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897) v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix wave for a production supervisor incident plus the root-cause RSS measurement that caused it. Six commits split across:
Supervisor crash-counter fix.
e2d34004— original supervisor fix (code=0 no longer counts as a crash). Later amended by the refactor below to preserve flap detection across mixed exits.Root-cause RSS measurement.
dab48cdf— replacesprocess.memoryUsage().rsswith/proc/self/statusRssAnon + RssShmem on Linux. VmRSS counts file-backed mmap pages (git packfiles inflate it to 7GB on a 96K-page brain even when heap is ~100MB). Falls back to VmRSS on macOS / kernel <4.5 / missing /proc.ChildWorkerSupervisor refactor.
2b507f5e— new shared spawn-and-respawn core extracted fromMinionSupervisor. Bakes in two amendments from plan-eng-review: D1 (lastExitCodetrack, code=0 doesn't touch crashCount) preserves flap detection across mixed exits; D2 (clean-restart budget, default 10 restarts per 60s) closes the macOS-fallback tight-loop hole.Autopilot composition.
becc9db0—src/commands/autopilot.tsno longer maintains its own parallel spawn-and-respawn loop (Codex caught this during plan-eng-review). Composes the sharedChildWorkerSupervisor. DropscrashCount,lastWorkerStartTime,STABLE_RUN_RESET_MS, and the inlinechild.on('exit')block. RoutesonMaxCrashesExceededthrough autopilot's own shutdown so the lockfile gets cleaned up.Worker parser + docstring.
197a3142— M1 field-presence check fixes the shmem-only worker case (RssAnon:0 + RssShmem>0 used to fall through to VmRSS). M2 softens the docstring's container-OOM claim — RssAnon + RssShmem is the per-process leak metric, not the full cgroup-OOM metric.Pre-landing review fix.
a58435c0—awaitChildExitshort-circuits when the child has already exited (exitCode !== null). Pre-fix, a fast SIGTERM responder could cause a 35-second shutdown hang.Test Coverage
Tests: full suite —
6386 pass / 0 fail.Pre-Landing Review
Found 1 P2 — fixed in commit
a58435c0.awaitChildExitdidn't short-circuit when child was already exited — fast SIGTERM responders caused a 35s shutdown hang. Fix: probechild.exitCode !== null+child.signalCode !== nullbefore registering the late'exit'listener. Regression test intest/child-worker-supervisor.test.tsasserts the call returns in under 200ms when the child has already exited.Specialist passes considered: race conditions, enum/value completeness (all 5
ChildSupervisorEventkinds handled at both consumers), shell injection (spawn helpers pass argv arrays, no shell), SQL safety (N/A), LLM trust boundary (N/A).Scope Drift
Scope Check: CLEAN. Every file in the 1287-line diff maps to a plan-eng-review item. Plan grew during review from 13 lines to ~150 net via D1/D2/D3 amendments that Garry explicitly accepted at each gate.
Plan Completion
Plan file:
~/.claude/plans/this-is-a-real-sleepy-sketch.md. All 3 decisions resolved:ChildWorkerSupervisor).health_warn+ backoff on overrun).ChildWorkerSupervisor→ SHIPPED.5 test files / sections required by the plan all written and passing. M1 parser + M2 docstring + M3 PR-body threshold corrected.
TODOS
No items closed by this PR. The CHANGELOG calls out the cgroup-OOM caveat as future-work territory for operators on container deployments.
Documentation
/document-releaseran post-ship and made these factual updates toCLAUDE.md:src/core/minions/child-worker-supervisor.ts— covers the D1 lastExitCode classifier, D2 clean-restart budget (default 10 restarts per 60s withhealth_warn { reason: 'clean_restart_budget_exceeded' }), the awaitChildExit short-circuit fix, public read-only accessors, test hooks (_backoffFloorMs+_now), and pin totest/child-worker-supervisor.test.ts(7 cases).src/core/minions/supervisor.tsentry — notes the spawn-and-respawn extraction into the shared core, the byte-compatibleChildSupervisorEvent → emit()event-shape mapping (JSONL audit consumers unaffected), and theshutdown()drain switch tochildSupervisor.killChild + awaitChildExit.src/commands/autopilot.tsentry — notes the elimination of the parallel inline spawn loop,--max-rss 2048+maxCrashes: 5preservation, and theonMaxCrashesExceeded → shutdown('max_crashes')wiring fix.src/core/minions/worker.tsentry — covers the newparseRssFromProcStatus+getAccurateRssexports, the M1 field-presence parser fix (RssAnon: 0 + RssShmem: 512shmem-only case), and pin totest/worker-rss.test.ts(11 cases).llms-full.txtto match (per project rule).No README / AGENTS.md / ARCHITECTURE.md updates required — this wave is internal refactor only. The user-visible operator surface (
gbrain jobs supervisor,gbrain autopilot) is unchanged. The new D2clean_restart_budget_exceededevent is documented in CHANGELOG's "To take advantage of v0.34.3.0" operator-verification block.No TODOS.md items closed by this PR. TODOS does mention "supervisor" in other contexts (ngrok keep-alive, executeRaw retry opt-in, claw-test
supervisor-restartscenario) — those are unrelated and remain open.Test plan
bun run typecheckcleanbun run verify(privacy + jsonb + progress + wasm + admin + cli + system-of-record + eval-glossary + typecheck) cleanbun test test/child-worker-supervisor.test.ts test/supervisor.test.ts test/worker-rss.test.ts test/autopilot-supervisor-wiring.test.ts test/supervisor-tini.test.ts)bun run test)worker_exitedevents withlikely_cause: 'clean_exit',backoffevents withms: 0, and zeromax_crashes_exceededevents. Ifclean_restart_budget_exceededfires on Linux, the worker is actually leaking; investigate.🤖 Generated with Claude Code