Skip to content

v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes#1003

Merged
garrytan merged 11 commits into
masterfrom
fix/supervisor-clean-exit-crash-count
May 15, 2026
Merged

v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes#1003
garrytan merged 11 commits into
masterfrom
fix/supervisor-clean-exit-crash-count

Conversation

@garrytan

@garrytan garrytan commented May 14, 2026

Copy link
Copy Markdown
Owner

Summary

Fix wave for a production supervisor incident plus the root-cause RSS measurement that caused it. Six commits split across:

Supervisor crash-counter fix. e2d34004 — original supervisor fix (code=0 no longer counts as a crash). Later amended by the refactor below to preserve flap detection across mixed exits.

Root-cause RSS measurement. dab48cdf — replaces process.memoryUsage().rss with /proc/self/status RssAnon + RssShmem on Linux. VmRSS counts file-backed mmap pages (git packfiles inflate it to 7GB on a 96K-page brain even when heap is ~100MB). Falls back to VmRSS on macOS / kernel <4.5 / missing /proc.

ChildWorkerSupervisor refactor. 2b507f5e — new shared spawn-and-respawn core extracted from MinionSupervisor. Bakes in two amendments from plan-eng-review: D1 (lastExitCode track, code=0 doesn't touch crashCount) preserves flap detection across mixed exits; D2 (clean-restart budget, default 10 restarts per 60s) closes the macOS-fallback tight-loop hole.

Autopilot composition. becc9db0src/commands/autopilot.ts no longer maintains its own parallel spawn-and-respawn loop (Codex caught this during plan-eng-review). Composes the shared ChildWorkerSupervisor. Drops crashCount, lastWorkerStartTime, STABLE_RUN_RESET_MS, and the inline child.on('exit') block. Routes onMaxCrashesExceeded through autopilot's own shutdown so the lockfile gets cleaned up.

Worker parser + docstring. 197a3142 — M1 field-presence check fixes the shmem-only worker case (RssAnon:0 + RssShmem>0 used to fall through to VmRSS). M2 softens the docstring's container-OOM claim — RssAnon + RssShmem is the per-process leak metric, not the full cgroup-OOM metric.

Pre-landing review fix. a58435c0awaitChildExit short-circuits when the child has already exited (exitCode !== null). Pre-fix, a fast SIGTERM responder could cause a 35-second shutdown hang.

Test Coverage

NEW FILES                                                COVERAGE
test/child-worker-supervisor.test.ts (NEW, 7 cases)      ★★★ D1 + D2 + awaitChildExit regression
test/worker-rss.test.ts (NEW, 11 cases)                  ★★★ M1 field-presence + fallback paths
test/autopilot-supervisor-wiring.test.ts (NEW, 6 cases)  ★★ static-shape regression guards
test/supervisor.test.ts (updated, 16 cases)              ★★ existing tests preserved via exit-1 swap
test/supervisor-tini.test.ts (unchanged, 2 cases)        ★★ tini detection still works

COVERAGE: 42 targeted tests pass / 0 fail
QUALITY: ★★★ behavior + edge + error (regression + parity)

Tests: full suite — 6386 pass / 0 fail.

Pre-Landing Review

Found 1 P2 — fixed in commit a58435c0.

  • [P2] (confidence: 8/10) awaitChildExit didn't short-circuit when child was already exited — fast SIGTERM responders caused a 35s shutdown hang. Fix: probe child.exitCode !== null + child.signalCode !== null before registering the late 'exit' listener. Regression test in test/child-worker-supervisor.test.ts asserts the call returns in under 200ms when the child has already exited.

Specialist passes considered: race conditions, enum/value completeness (all 5 ChildSupervisorEvent kinds handled at both consumers), shell injection (spawn helpers pass argv arrays, no shell), SQL safety (N/A), LLM trust boundary (N/A).

Scope Drift

Scope Check: CLEAN. Every file in the 1287-line diff maps to a plan-eng-review item. Plan grew during review from 13 lines to ~150 net via D1/D2/D3 amendments that Garry explicitly accepted at each gate.

Plan Completion

Plan file: ~/.claude/plans/this-is-a-real-sleepy-sketch.md. All 3 decisions resolved:

  • D1 (option B): code=0 doesn't touch crashCount → SHIPPED (in ChildWorkerSupervisor).
  • D2 (option B): clean-restart rate limit → SHIPPED (default 10 per 60s; health_warn + backoff on overrun).
  • D3 (option C): autopilot.ts delegates to shared ChildWorkerSupervisor → SHIPPED.

5 test files / sections required by the plan all written and passing. M1 parser + M2 docstring + M3 PR-body threshold corrected.

TODOS

No items closed by this PR. The CHANGELOG calls out the cgroup-OOM caveat as future-work territory for operators on container deployments.

Documentation

/document-release ran post-ship and made these factual updates to CLAUDE.md:

  • NEW entry for src/core/minions/child-worker-supervisor.ts — covers the D1 lastExitCode classifier, D2 clean-restart budget (default 10 restarts per 60s with health_warn { reason: 'clean_restart_budget_exceeded' }), the awaitChildExit short-circuit fix, public read-only accessors, test hooks (_backoffFloorMs + _now), and pin to test/child-worker-supervisor.test.ts (7 cases).
  • Updated src/core/minions/supervisor.ts entry — notes the spawn-and-respawn extraction into the shared core, the byte-compatible ChildSupervisorEvent → emit() event-shape mapping (JSONL audit consumers unaffected), and the shutdown() drain switch to childSupervisor.killChild + awaitChildExit.
  • Updated src/commands/autopilot.ts entry — notes the elimination of the parallel inline spawn loop, --max-rss 2048 + maxCrashes: 5 preservation, and the onMaxCrashesExceeded → shutdown('max_crashes') wiring fix.
  • Updated src/core/minions/worker.ts entry — covers the new parseRssFromProcStatus + getAccurateRss exports, the M1 field-presence parser fix (RssAnon: 0 + RssShmem: 512 shmem-only case), and pin to test/worker-rss.test.ts (11 cases).
  • Regenerated llms-full.txt to match (per project rule).

No README / AGENTS.md / ARCHITECTURE.md updates required — this wave is internal refactor only. The user-visible operator surface (gbrain jobs supervisor, gbrain autopilot) is unchanged. The new D2 clean_restart_budget_exceeded event is documented in CHANGELOG's "To take advantage of v0.34.3.0" operator-verification block.

No TODOS.md items closed by this PR. TODOS does mention "supervisor" in other contexts (ngrok keep-alive, executeRaw retry opt-in, claw-test supervisor-restart scenario) — those are unrelated and remain open.

Test plan

  • bun run typecheck clean
  • bun run verify (privacy + jsonb + progress + wasm + admin + cli + system-of-record + eval-glossary + typecheck) clean
  • Targeted tests: 42 pass across 5 files (bun test test/child-worker-supervisor.test.ts test/supervisor.test.ts test/worker-rss.test.ts test/autopilot-supervisor-wiring.test.ts test/supervisor-tini.test.ts)
  • Full unit suite: 6386 pass / 0 fail (bun run test)
  • Post-merge production validation: watch supervisor log for 24h. Expect worker_exited events with likely_cause: 'clean_exit', backoff events with ms: 0, and zero max_crashes_exceeded events. If clean_restart_budget_exceeded fires on Linux, the worker is actually leaking; investigate.

🤖 Generated with Claude Code

Wintermute and others added 5 commits May 14, 2026 23:13
The RSS watchdog triggers gracefulShutdown() which exits with code 0.
The supervisor was counting ALL exits < 5min as crashes, including
clean code=0 exits. After 10 watchdog-triggered restarts (typical with
a 96K-page brain where autopilot inflates RSS), the supervisor gave up
with max_crashes_exceeded.

Fix: code=0 exits reset crashCount to 0 and restart immediately with
no backoff. Only code≠0 exits count toward the crash limit.

Root cause: process.memoryUsage().rss reports 7GB during autopilot
sync on large repos (possibly shared page inflation from git mmap).
The 4096MB threshold triggers on every cycle. This is a separate
issue (RSS measurement accuracy) but the supervisor should handle
clean exits regardless.
process.memoryUsage().rss returns VmRSS which includes file-backed
mmap'd pages. On repos with large git packfiles (96K+ pages), git
operations inflate VmRSS to 7GB+ while actual heap usage is ~100MB.
The kernel reclaims these pages under memory pressure — they're cache.

Replace with /proc/self/status RssAnon + RssShmem which measures only
anonymous pages (heap, stack, anonymous mmap). This is the memory that
actually matters for OOM risk.

Falls back to process.memoryUsage().rss on non-Linux.

Before: watchdog triggers every autopilot cycle (7GB VmRSS > 4GB threshold)
After:  watchdog only triggers on real memory growth (~100MB << 4GB threshold)

Related: #1002 (supervisor crash-count fix for the same symptom)
MinionSupervisor and src/commands/autopilot.ts each owned a separate
spawn-and-respawn loop. PR #1003 fixed the supervisor's crash-counter
bug (counting code=0 watchdog drains as crashes) but the autopilot
loop has the same bug class. Worse, the as-shipped #1003 fix reset
crashCount=0 on every code=0 exit, which lost the "flapping worker"
signal in mixed-exit sequences.

Extract the shared spawn loop into ChildWorkerSupervisor so both
consumers compose one tested core. The new class bakes in two
amendments resolved during plan-eng-review:

D1 (lastExitCode track): code=0 exits no longer touch crashCount.
They emit ms:0 backoff and restart immediately, but the counter
survives across them. A worker alternating exit 1 / exit 0 / exit 1
correctly trips max_crashes; a worker drained 100 times by the
watchdog stays at crashCount=0 and runs forever (also correct).

D2 (clean-restart budget): on platforms where the watchdog measures
VmRSS instead of RssAnon (macOS, kernel <4.5, restricted containers),
a perpetually over-threshold worker could clean-exit in a tight loop
with no observability. New `cleanRestartBudget` option (default 10
clean restarts per 60s window) emits a `health_warn` and applies
backoff once exceeded.

The supervisor now delegates spawn/respawn/backoff to the inner
class and maps ChildSupervisorEvent → existing SupervisorEvent
emit() channel so JSONL audit consumers see byte-compatible output.
PID lock, signal handlers, health check, and process.exit on
max-crashes stay in MinionSupervisor (those are standalone-daemon
concerns the autopilot composer doesn't need).

Tests: 6 new ChildWorkerSupervisor cases (D1 classifier, interleaved
exits, stable-run + clean-exit interaction, D2 budget tripping, per-
instance config isolation, event shape regression). Existing supervisor
tests updated to use exit-1 workers where they previously relied on
clean-exit-as-crash semantics; their assertions (env plumbing, PID
lock, audit shape) are unaffected.

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…spawn loop

src/commands/autopilot.ts:165-197 used to have its own spawn-and-
respawn loop separate from MinionSupervisor's. It hardcoded
maxCrashes=5, fixed 10s backoff, and counted every exit (including
code=0) toward the crash limit. Codex flagged this during plan-eng
review: the parallel implementation had the same bug class fixed
in #1003, just on a different code path. Anyone running
`gbrain autopilot` as a long-running daemon (instead of
`gbrain jobs supervisor`) would hit it.

Replace the inline `startWorker` + `child.on('exit')` block with
a ChildWorkerSupervisor instance. Drops the parallel `crashCount`,
`lastWorkerStartTime`, and `STABLE_RUN_RESET_MS` state. The
ChildWorkerSupervisor's D1 lastExitCode track + D2 clean-restart
budget apply to autopilot for free.

Shutdown now drains via the supervisor's killChild + awaitChildExit
typed surface instead of reaching into `workerProc` directly. The
onMaxCrashesExceeded callback routes through autopilot's existing
shutdown('max_crashes') path so the lockfile gets cleaned up
(pre-refactor, the inline loop called process.exit(1) directly and
bypassed the cleanup).

Regression coverage in test/autopilot-supervisor-wiring.test.ts:
static-shape grep guards for `--max-rss 2048`, `maxCrashes: 5`,
the shutdown-via-callback wiring, and absence of the legacy inline
names (startWorker, workerProc, crashCount, lastWorkerStartTime,
STABLE_RUN_RESET_MS).

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to the RssAnon watchdog fix (b81c598), both surfaced
during plan-eng-review by Codex.

M1: getAccurateRss() used `if (anonKb > 0) return ...` to decide
whether to use the /proc/self/status reading or fall back to
process.memoryUsage().rss. That conflated "RssAnon field missing"
(old kernel, non-Linux) with "RssAnon field present but zero" (a
near-empty worker process whose only memory is shmem). The legitimate
shmem-only worker case fell through to VmRSS even though /proc had a
valid reading.

Fix: split the pure parser (parseRssFromProcStatus) into a separate
exported function that checks field presence via regex match, not
value comparison. Returns null only when the field text doesn't
match `^RssAnon:\s+(\d+)` AND `^RssShmem:\s+(\d+)`. Both fields
present + both zero is now a valid reading of 0 bytes.

M2: the docstring claimed RssAnon + RssShmem was "the memory that
actually matters for OOM risk." Codex pushed back: this is correct
for per-process leak detection but NOT a full container-OOM metric,
because cgroup memory pressure includes page cache. Soften to
"non-file-backed resident memory used for per-process leak
detection" and call out the cgroup caveat explicitly.

getAccurateRss now takes an optional readStatus function for
testability. Production callers use the default; tests inject
canned status text to cover the M1 regression and the fallback paths
without mocking the filesystem.

Tests: 11 cases covering parseRssFromProcStatus (normal, M1 regression
with anon=0 + shmem>0, both-zero, missing fields, malformed values,
shmem-only) and getAccurateRss (injected reader, ENOENT fallback,
old-kernel fallback, malformed-value fallback).

Co-Authored-By: Wintermute <wintermute@garrytan.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan and others added 3 commits May 14, 2026 19:16
Pre-fix, awaitChildExit registered `child.once('exit', ...)` without
checking whether the child had already terminated. If the child drained
between killChild('SIGTERM') and awaitChildExit() — common on fast
SIGTERM responders — Node's 'exit' event had already fired, the late
listener never resolved, and the caller waited out the full timeout.
On the supervisor's clean shutdown path that's a 35-second hang on
every quick child.

Probe `child.exitCode` and `child.signalCode` first; resolve
immediately when either is non-null. Sub-second clean shutdown
restored.

Pre-existing in the legacy supervisor.ts shape (same bug pattern),
but since the refactor consolidates child-process management into one
class, fix the pattern at the new seam.

Regression test in test/child-worker-supervisor.test.ts: run one full
spawn cycle, then call awaitChildExit on the already-finished cycle
and assert it returns in under 200ms (well under any test timeout).

Surfaced during pre-landing /review on the fix wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title fix: supervisor treats code=0 watchdog exits as crashes v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes May 15, 2026
garrytan and others added 3 commits May 14, 2026 20:18
…n-exit-crash-count

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Reflects the ChildWorkerSupervisor extraction shipped in this branch:

- Add new entry for src/core/minions/child-worker-supervisor.ts
  covering D1 lastExitCode classifier, D2 clean-restart budget, the
  awaitChildExit short-circuit, and test pinning at
  test/child-worker-supervisor.test.ts
- Update src/core/minions/supervisor.ts entry to note the spawn-loop
  extraction into the shared core + the byte-compatible event-shape
  mapping that preserves JSONL audit consumers
- Update src/commands/autopilot.ts entry to note the parallel-
  supervisor elimination + the shutdown-via-callback wiring
- Update src/core/minions/worker.ts entry with the new RssAnon /
  getAccurateRss exports + the M1 field-presence parser fix

Regenerated llms-full.txt to match (per project rule: every CLAUDE.md
edit must be followed by bun run build:llms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n-exit-crash-count

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
@garrytan garrytan merged commit 668254b into master May 15, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 15, 2026
Master shipped v0.34.3.0 (PR #1003, supervisor clean-exit watchdog fix).
No new migrations — only src/core/minions/* + src/commands/autopilot.ts
touched, so our v66 partial-index migration stays in place.

Conflict resolutions:
- VERSION: take 0.34.4.0
- package.json: take 0.34.4.0
- CHANGELOG.md: clean non-interleaved conflict; strip markers and keep
  both entries (v0.34.4.0 on top, v0.34.3.0 below, then the existing
  v0.34.2.0 and v0.34.1.0 in descending order)

Typecheck clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request May 29, 2026
* upstream/master:
  v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055)
  v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008)
  v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991)
  v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003)
  v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988)
  v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996)
  v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994)
  v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934)
  v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992)
  fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982)
  v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897)
  v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant