v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes by garrytan · Pull Request #1108 · garrytan/gbrain

garrytan · 2026-05-17T15:36:47Z

Summary

The doctor's supervisor health check and gbrain jobs supervisor status both counted every worker_exited audit event as a crash regardless of cause. After v0.34.3.0's RSS-watchdog added more code=0 worker drains, the count inflated to 120+/day on a healthy brain — surfacing as the "Supervisor crashes: 120x/24h (was 62x — nearly doubled)" alarm.

The classifier upstream in child-worker-supervisor.ts:309-321 was already stamping a five-value likely_cause field on every exit (clean_exit, graceful_shutdown, runtime_error, oom_or_external_kill, unknown); neither read site looked at it. v0.35.5.0 ships a shared classifier so both surfaces agree.

Helper foundation (d9975af4)

New isCrashExit(event) + summarizeCrashes(events) + CrashSummary type + CLEAN_EXIT_CAUSES constant in src/core/minions/handlers/supervisor-audit.ts
Denylist semantics — only clean_exit and graceful_shutdown are non-crashes; future unrecognized causes surface by default
14-case unit test pinning every branch (9 isCrashExit cases + 5 summarizeCrashes aggregator cases)
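
A minimal sketch of the denylist classifier described above, not the shipped code. The `SupervisorEvent` shape is an assumption (only `event`, `likely_cause`, and `code` are named in this PR), and the real `isCrashExit` in supervisor-audit.ts may differ in detail.

```typescript
// Hypothetical event shape; the real audit record may carry more fields.
interface SupervisorEvent {
  event: string;
  likely_cause?: string;
  code?: number | null;
}

// Denylist: only these causes are treated as non-crashes.
const CLEAN_EXIT_CAUSES: ReadonlySet<string> = new Set([
  "clean_exit",
  "graceful_shutdown",
]);

function isCrashExit(e: SupervisorEvent): boolean {
  // Only worker_exited events can be crashes.
  if (e.event !== "worker_exited") return false;
  // Any stamped cause that is not on the denylist counts as a crash,
  // so a future unrecognized cause surfaces by default.
  if (e.likely_cause !== undefined) {
    return !CLEAN_EXIT_CAUSES.has(e.likely_cause);
  }
  // Legacy fallback for entries without likely_cause:
  // anything other than code 0 (including null) is a crash.
  return e.code !== 0;
}

console.log(isCrashExit({ event: "worker_exited", likely_cause: "clean_exit", code: 0 })); // false
console.log(isCrashExit({ event: "worker_exited", likely_cause: "brand_new_cause" })); // true
```

The alternative, an allowlist of known crash causes, would silently drop any cause added upstream later. The denylist fails loud instead.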

Consumer wiring (1d9a902d)

gbrain doctor: replaces ad-hoc filter with summarizeCrashes; warn threshold dropped from >3 to >=1 (calibrated counter); ok + warn messages include per-cause breakdown (runtime=N oom=M unknown=K legacy=L) and clean_exits_24h=N
gbrain jobs supervisor status: same wiring; JSON output adds crashes_by_cause + clean_exits_24h; human output adds per-cause line + Clean exits (24h) line
4 source-grep wiring assertions in test/doctor.test.ts guard both call sites against drift
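
As an illustration of the consumer side, the sketch below turns a summary into a doctor-style status and a JSON payload. The `Summary` shape, the function names, and the ok-message wording are assumptions made for this example. The `>=1` threshold, the `runtime=N oom=M unknown=K legacy=L` breakdown, and the `crashes_24h`, `crashes_by_cause`, and `clean_exits_24h` keys come from this PR.

```typescript
// Hypothetical summary shape; the real CrashSummary field names may differ.
interface CrashCounts {
  runtime_error: number;
  oom_or_external_kill: number;
  unknown: number;
  legacy: number;
}
interface Summary {
  by_cause: CrashCounts;
  clean_exits: number;
}

function crashTotal(c: CrashCounts): number {
  return c.runtime_error + c.oom_or_external_kill + c.unknown + c.legacy;
}

// Per-cause breakdown in the runtime=N oom=M unknown=K legacy=L form.
function breakdown(c: CrashCounts): string {
  return `runtime=${c.runtime_error} oom=${c.oom_or_external_kill} unknown=${c.unknown} legacy=${c.legacy}`;
}

// Doctor-style check: warn as soon as there is one real crash.
function doctorCheck(s: Summary): { status: "ok" | "warn"; message: string } {
  const total = crashTotal(s.by_cause);
  if (total >= 1) {
    return {
      status: "warn",
      message: `Worker crashed ${total}x in last 24h (${breakdown(s.by_cause)})`,
    };
  }
  // The ok wording here is invented for illustration.
  return {
    status: "ok",
    message: `No worker crashes in last 24h (${breakdown(s.by_cause)}) clean_exits_24h=${s.clean_exits}`,
  };
}

// JSON payload carrying the same numbers, so both surfaces agree.
function statusJson(s: Summary) {
  return {
    crashes_24h: crashTotal(s.by_cause),
    crashes_by_cause: { ...s.by_cause },
    clean_exits_24h: s.clean_exits,
  };
}

const healthy: Summary = {
  by_cause: { runtime_error: 0, oom_or_external_kill: 0, unknown: 0, legacy: 0 },
  clean_exits: 120,
};
console.log(doctorCheck(healthy).status); // ok
```

Because both functions derive their total from the same counts, a brain with 120 clean drains and no crashes reads as ok on both surfaces.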

Metadata (059b60df + cfc2a765)

VERSION + package.json bumped to v0.35.5.0
CHANGELOG entry with usage instructions
CLAUDE.md "Key files" entries extended for supervisor-audit.ts, doctor.ts, jobs.ts; llms-full.txt regenerated

Test Coverage

src/core/minions/handlers/supervisor-audit.ts (NEW)
├── isCrashExit(event)              9 branches, all covered
│   ├── event !== 'worker_exited' → false            ✓ Case 9
│   ├── likely_cause = 'clean_exit' → false          ✓ Case 1
│   ├── likely_cause = 'graceful_shutdown' → false   ✓ Case 2
│   ├── likely_cause = 'runtime_error' → true        ✓ Case 3
│   ├── likely_cause = 'oom_or_external_kill' → true ✓ Case 4
│   ├── likely_cause = 'unknown' → true              ✓ Case 5
│   ├── likely_cause = <future unrecognized> → true  ✓ Case 6 (denylist guard)
│   ├── no cause + code=0 → false                    ✓ Case 7
│   └── no cause + code!=0 → true                    ✓ Case 8 + null-code case
└── summarizeCrashes(events)        9 branches, all covered
    ├── non-exit event skipped                       ✓ mixed + only-non-exit
    ├── clean_exit / graceful_shutdown → clean_exits ✓ mixed (4)
    ├── runtime_error / oom / unknown buckets        ✓ mixed (2/1/1)
    ├── legacy (no cause, code!=0) → legacy++        ✓ mixed (1)
    ├── unrecognized cause → legacy++                ✓ future-cause case
    └── empty input                                  ✓ empty-summary case

src/commands/doctor.ts + src/commands/jobs.ts (MODIFIED)
└── summarizeCrashes wiring + threshold + message    ✓ 4 source-grep assertions

COVERAGE: 22/22 paths tested (100%)
QUALITY: 14 unit tests, 4 wiring assertions, 1 docstring-truth test

Tests: ~3650 → 3664 (+14 new in supervisor-audit.test.ts, +4 wiring in doctor.test.ts)
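
The aggregator branches listed above can be sketched roughly as follows. This is an illustration under assumptions: the `CrashSummary` field names (`total`, `by_cause`, `clean_exits`) are guesses, and counting a legacy `code=0` entry as a clean exit is an assumption the PR text does not confirm.

```typescript
// Hypothetical event shape; only these three fields are named in the PR.
interface SupervisorEvent {
  event: string;
  likely_cause?: string;
  code?: number | null;
}

// Hypothetical field names for the summary type.
interface CrashSummary {
  total: number;
  by_cause: {
    runtime_error: number;
    oom_or_external_kill: number;
    unknown: number;
    legacy: number;
  };
  clean_exits: number;
}

const CLEAN_CAUSES: ReadonlySet<string> = new Set(["clean_exit", "graceful_shutdown"]);

function summarizeCrashes(events: SupervisorEvent[]): CrashSummary {
  const summary: CrashSummary = {
    total: 0,
    by_cause: { runtime_error: 0, oom_or_external_kill: 0, unknown: 0, legacy: 0 },
    clean_exits: 0,
  };
  for (const e of events) {
    if (e.event !== "worker_exited") continue; // non-exit events are skipped
    const cause = e.likely_cause;
    // Assumption: a legacy entry with code 0 is tallied as a clean exit.
    const clean = cause === undefined ? e.code === 0 : CLEAN_CAUSES.has(cause);
    if (clean) {
      summary.clean_exits++;
      continue;
    }
    summary.total++;
    if (cause === "runtime_error" || cause === "oom_or_external_kill" || cause === "unknown") {
      summary.by_cause[cause]++;
    } else {
      // Missing cause with a non-zero code, or a cause this code does not recognize.
      summary.by_cause.legacy++;
    }
  }
  return summary;
}

const sample: SupervisorEvent[] = [
  { event: "started" },
  { event: "worker_exited", likely_cause: "clean_exit", code: 0 },
  { event: "worker_exited", likely_cause: "graceful_shutdown", code: 0 },
  { event: "worker_exited", likely_cause: "runtime_error", code: 1 },
  { event: "worker_exited", likely_cause: "oom_or_external_kill", code: null },
  { event: "worker_exited", code: 1 },
  { event: "worker_exited", likely_cause: "brand_new_cause" },
];
const result = summarizeCrashes(sample);
console.log(result.total, result.clean_exits, result.by_cause.legacy); // 4 2 2
```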

Pre-Landing Review

Pre-Landing Review: 3 issues — 1 auto-fixed (stale docstring on summarizeCrashes updated to reflect that legacy bucket catches both pre-v0.34 entries AND future unrecognized causes), 2 skipped as scope creep (DRY format-string helper, type-union refactor — captured for future cleanup).

Specialist review: 4 specialists dispatched (testing, maintainability, security, performance). 0 critical, 3 informational — all in maintainability and triaged. PR Quality Score: 9.0/10.

Adversarial Review

Claude adversarial subagent surfaced 13 informational observations. All triaged as pre-existing surfaces, accepted plan trade-offs from /plan-eng-review, or future-data-tuning concerns. None in-scope for this PR. Notable follow-ups identified:

Clean-exit rate threshold: the new doctor surface shows clean_exits_24h=N so operators see drain rate at a glance. A warn threshold on that count is the natural follow-up once we have post-fix data (D5 from plan-eng-review intentionally deferred this).
ISO-week rotation boundary in readSupervisorEvents: pre-existing 24h-window visibility hole at week boundaries. Separate fix.

Codex ran during /plan-eng-review and surfaced 4 substantive findings, all incorporated: the duplicate bug at jobs.ts:805, denylist semantics, shared helper extraction, and threshold rebaseline.

Scope Drift

Scope Check: CLEAN. Intent matches delivery (fix the doctor's miscounting). No drift, no missing requirements.

Plan Completion

[x] T1 — src/core/minions/handlers/supervisor-audit.ts exports isCrashExit, summarizeCrashes, CrashSummary with denylist semantics + legacy fallback + per-cause aggregation
[~] T2 — Tests landed but doctor-integration approach changed from plan: 9-case isCrashExit matrix + 5 summarizeCrashes aggregation cases in supervisor-audit.test.ts plus 4 source-grep wiring assertions in doctor.test.ts (vs. the plan's single runtime-fixture integration test). Same drift-prevention guarantee via different mechanism.
[x] T3 — doctor.ts wired to summarizeCrashes; warn threshold dropped >3 → >=1; messages widened with per-cause breakdown + clean_exits_24h
[x] T4 — jobs.ts supervisor status wired to summarizeCrashes; JSON adds crashes_by_cause + clean_exits_24h; human output expanded

4 plan items: 3 DONE, 1 CHANGED. No NOT DONE. No UNVERIFIABLE.

Verification Results

Plan verification skipped — fix has no UI/URL surface, only CLI output. Local end-to-end smoke ran against a synthesized supervisor audit JSONL fixture: 6 worker_exited events (3 clean_exit + 1 graceful_shutdown + 1 runtime_error + 1 oom_or_external_kill) → doctor reports Worker crashed 2x in last 24h (runtime=1 oom=1 unknown=0 legacy=0). Pre-fix would have reported "crashed 6x" with no qualitative signal.
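
For readers who want to reproduce the smoke result without a brain, here is a rough stand-in. It rebuilds the six-event fixture as in-memory JSONL and applies the same denylist rule. It does not invoke `gbrain doctor`; the `report` helper and the record shape are assumptions made for this example.

```typescript
// Synthesized fixture mirroring the smoke described above: six worker_exited events.
const fixture: string = [
  { event: "worker_exited", likely_cause: "clean_exit", code: 0 },
  { event: "worker_exited", likely_cause: "clean_exit", code: 0 },
  { event: "worker_exited", likely_cause: "clean_exit", code: 0 },
  { event: "worker_exited", likely_cause: "graceful_shutdown", code: 0 },
  { event: "worker_exited", likely_cause: "runtime_error", code: 1 },
  { event: "worker_exited", likely_cause: "oom_or_external_kill", code: null },
].map((e) => JSON.stringify(e)).join("\n");

type Bucket = "runtime" | "oom" | "unknown" | "legacy";

const CLEAN: ReadonlySet<string> = new Set(["clean_exit", "graceful_shutdown"]);
const BUCKET_FOR_CAUSE = new Map<string, Bucket>([
  ["runtime_error", "runtime"],
  ["oom_or_external_kill", "oom"],
  ["unknown", "unknown"],
]);

// Hypothetical helper: parse JSONL, count exits the old way and crashes the new way.
function report(jsonl: string): { exits: number; message: string } {
  const counts: Record<Bucket, number> = { runtime: 0, oom: 0, unknown: 0, legacy: 0 };
  let exits = 0;
  let crashes = 0;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const e = JSON.parse(line) as { event: string; likely_cause?: string; code?: number | null };
    if (e.event !== "worker_exited") continue;
    exits++; // the pre-fix counter stopped here
    const clean = e.likely_cause === undefined ? e.code === 0 : CLEAN.has(e.likely_cause);
    if (clean) continue;
    crashes++;
    const bucket: Bucket = BUCKET_FOR_CAUSE.get(e.likely_cause ?? "") ?? "legacy";
    counts[bucket]++;
  }
  const message =
    `Worker crashed ${crashes}x in last 24h ` +
    `(runtime=${counts.runtime} oom=${counts.oom} unknown=${counts.unknown} legacy=${counts.legacy})`;
  return { exits, message };
}

const smoke = report(fixture);
console.log(smoke.exits); // 6
console.log(smoke.message); // Worker crashed 2x in last 24h (runtime=1 oom=1 unknown=0 legacy=0)
```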

TODOS

No items completed by this fix. TODOS.md unchanged.

Documentation

CLAUDE.md: added new entry for src/core/minions/handlers/supervisor-audit.ts covering the v0.35.5.0 exports (isCrashExit, summarizeCrashes, CrashSummary, CLEAN_EXIT_CAUSES); extended src/commands/doctor.ts entry with the v0.35.5.0 wire-up (shared summarizeCrashes consumer, >=1 warn threshold, runtime=A oom=B unknown=C legacy=D per-cause breakdown, clean_exits_24h=N in ok message); extended src/commands/jobs.ts entry with the v0.35.5.0 crashes_by_cause + clean_exits_24h JSON fields and cross-surface parity contract.
llms-full.txt: regenerated via bun run build:llms to match the CLAUDE.md edit.
CHANGELOG.md: v0.35.5.0 entry shipped with the fix; "## To take advantage of v0.35.5.0" block included.

Test plan

bun run typecheck clean
bun run test — 6604 pass / 0 fail / 0 skip (full parallel suite, 231s)
bun test test/supervisor-audit.test.ts — 14/14 pass (21 expects, 35ms)
bun test test/doctor.test.ts — 43/43 pass (124 expects, ~3.3s in isolation)
End-to-end smoke against synthesized audit JSONL — message shape verified
On affected machine post-merge: gbrain doctor 2>&1 | grep -i supervisor shows real crash count (not 120x clean-exit count)
On affected machine post-merge: gbrain jobs supervisor status --json | jq '{crashes_24h, clean_exits_24h}' matches doctor's count

🤖 Generated with Claude Code

Adds the read-side foundation for reading `likely_cause` off `worker_exited` audit events. Denylist semantics — only `clean_exit` and `graceful_shutdown` are non-crashes. Future unrecognized causes surface by default. `isCrashExit(event)` classifies a single audit event with legacy `code !== 0` fallback for pre-v0.34 entries lacking `likely_cause`. `summarizeCrashes(events)` aggregates a 24h window into a `CrashSummary` with per-cause counts (runtime_error, oom_or_external_kill, unknown, legacy) and a `clean_exits` total. Both helpers live next to `readSupervisorEvents` so the producer (the JSONL writer) and the consumers (doctor + jobs CLI) share one regression point. Test matrix pins all 9 isCrashExit branches plus 5 summarizeCrashes aggregation cases including the future-cause denylist regression guard.

`gbrain doctor` and `gbrain jobs supervisor status` both counted every `worker_exited` audit event as a crash, regardless of `likely_cause`. After v0.34.3.0 added RSS-watchdog drains (code=0), the count inflated to 120+/day on a healthy brain — the alarm pattern users reported. Both surfaces now go through `summarizeCrashes(events)` (single regression point, can't drift). The warn threshold drops from `>3` to `>=1` now that the counter is calibrated; the per-cause breakdown (runtime=N oom=M unknown=K legacy=L) gives operators triage context in the message without grep'ing the JSONL audit. `gbrain jobs supervisor status --json` adds `crashes_by_cause` and `clean_exits_24h` fields so monitoring dashboards bind to the named buckets. 4 source-grep wiring assertions in doctor.test.ts pin both call sites against drift.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add CLAUDE.md entry for src/core/minions/handlers/supervisor-audit.ts covering the new isCrashExit/summarizeCrashes/CrashSummary/CLEAN_EXIT_CAUSES exports. Extend doctor.ts and jobs.ts entries with the v0.35.5.0 wire-up: shared helper, denylist semantics, >=1 warn threshold, per-cause breakdown in messages, crashes_by_cause + clean_exits_24h in JSON. Regenerate llms-full.txt to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md # VERSION # package.json # src/commands/doctor.ts # src/commands/jobs.ts # test/doctor.test.ts

# Conflicts: # CHANGELOG.md # VERSION

* upstream/master:
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
  v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) (garrytan#1192)
  v0.36.4.0 feat: brain-health-100 — autonomous remediation via doctor --remediate + Minions (garrytan#1193)
  fix(docs): comprehensive drift audit — contradictions, broken links, stale refs (garrytan#1201)
  v0.36.3.0 feat: dynamic embedding column selection for search (garrytan#1164)
  v0.36.2.0 feat: ZeroEntropy as default + zero-based README rewrite (garrytan#1136)
  v0.36.1.1 fix-wave: community PR triage + 28 atomic fixes (garrytan#1182)
  v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong (garrytan#1139)
  v0.36.0.0 feat(skillpack): scaffold + reference + harvest (retire managed-block install) (garrytan#1130)
  v0.35.8.0 feat(cycle): phantom-page redirect inside extract_facts (garrytan#1138)
  v0.35.7.0 feat: temporal trajectory + founder scorecard (Phases 2-4) (garrytan#1131)
  v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages (closes garrytan#1091) (garrytan#1129)
  v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes (garrytan#1108)
  v0.35.5.0 fix wave: bootstrap + orphans + think MCP + worktree + walker (garrytan#1111)
  v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability (garrytan#1085)
  v0.35.3.1 feat(eval): temporal-aware contradiction probe + verdict enum (garrytan#1052)
  v0.35.3.0 fix wave: extract_facts items + git --no-recurse-submodules placement (garrytan#1053)

# Conflicts:
#   src/core/postgres-engine.ts
#   test/schema-bootstrap-coverage.test.ts

garrytan and others added 6 commits May 17, 2026 08:26

chore: bump version and changelog (v0.35.5.0)

059b60d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/montreal

b410ba7

# Conflicts: # CHANGELOG.md # VERSION # package.json # src/commands/doctor.ts # src/commands/jobs.ts # test/doctor.test.ts

Merge remote-tracking branch 'origin/master' into garrytan/montreal

12b20c0

# Conflicts: # CHANGELOG.md # VERSION

garrytan changed the title ~~v0.35.5.0 fix(doctor): stop counting clean supervisor exits as crashes~~ v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes May 17, 2026

garrytan merged commit 0620094 into master May 17, 2026
7 checks passed

garrytan merged 6 commits into master from garrytan/montreal
