v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes #1108
Merged
Adds the read-side foundation for reading `likely_cause` off `worker_exited` audit events. Denylist semantics — only `clean_exit` and `graceful_shutdown` are non-crashes. Future unrecognized causes surface by default. `isCrashExit(event)` classifies a single audit event with legacy `code !== 0` fallback for pre-v0.34 entries lacking `likely_cause`. `summarizeCrashes(events)` aggregates a 24h window into a `CrashSummary` with per-cause counts (runtime_error, oom_or_external_kill, unknown, legacy) and a `clean_exits` total. Both helpers live next to `readSupervisorEvents` so the producer (the JSONL writer) and the consumers (doctor + jobs CLI) share one regression point. Test matrix pins all 9 isCrashExit branches plus 5 summarizeCrashes aggregation cases including the future-cause denylist regression guard.
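A minimal sketch of the classifier and aggregator this commit describes, written from the prose above rather than from the actual `supervisor-audit.ts`. The `SupervisorEvent` shape, the `total` and `by_cause` field names, the `exited` helper, the exit codes, and the `disk_full` cause are all illustrative assumptions. The sketch also assumes the caller has already narrowed the events to the 24h window.

```typescript
// Sketch only. The event shape and the CrashSummary field layout are assumptions.
type SupervisorEvent = {
  event: string;         // e.g. "worker_exited"
  code?: number | null;  // exit code, when recorded
  likely_cause?: string; // absent on pre-v0.34 entries
};

type CrashSummary = {
  total: number;
  by_cause: { runtime_error: number; oom_or_external_kill: number; unknown: number; legacy: number };
  clean_exits: number;
};

// Denylist: only these causes are treated as non-crashes.
const CLEAN_EXIT_CAUSES: ReadonlySet<string> = new Set(["clean_exit", "graceful_shutdown"]);

function isCrashExit(ev: SupervisorEvent): boolean {
  if (ev.event !== "worker_exited") return false;           // assumption: other event types never count
  if (ev.likely_cause === undefined) return ev.code !== 0;  // legacy fallback for entries without a cause
  return !CLEAN_EXIT_CAUSES.has(ev.likely_cause);           // unrecognized causes surface as crashes
}

function summarizeCrashes(events: SupervisorEvent[]): CrashSummary {
  const result: CrashSummary = {
    total: 0,
    by_cause: { runtime_error: 0, oom_or_external_kill: 0, unknown: 0, legacy: 0 },
    clean_exits: 0,
  };
  for (const ev of events) {
    if (ev.event !== "worker_exited") continue;
    if (!isCrashExit(ev)) {
      result.clean_exits++;
      continue;
    }
    result.total++;
    const cause = ev.likely_cause;
    if (cause === "runtime_error" || cause === "oom_or_external_kill" || cause === "unknown") {
      result.by_cause[cause]++;
    } else {
      result.by_cause.legacy++; // pre-v0.34 entries and future unrecognized causes
    }
  }
  return result;
}

// Hypothetical helper for building fixture events.
const exited = (code: number, likely_cause?: string): SupervisorEvent =>
  ({ event: "worker_exited", code, likely_cause });

// Same mix as the smoke fixture in the PR description: 4 clean exits, 2 crashes.
const smokeFixture: SupervisorEvent[] = [
  exited(0, "clean_exit"),
  exited(0, "clean_exit"),
  exited(0, "clean_exit"),
  exited(0, "graceful_shutdown"),
  exited(1, "runtime_error"),
  exited(137, "oom_or_external_kill"),
];
const smokeSummary = summarizeCrashes(smokeFixture);
console.log(smokeSummary.total, smokeSummary.clean_exits); // prints "2 4"
```

Because the check is a denylist, an event stamped with a cause the reader has never seen (here `disk_full`, a made-up value) is counted as a crash and lands in the `legacy` bucket instead of silently disappearing.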
`gbrain doctor` and `gbrain jobs supervisor status` both counted every `worker_exited` audit event as a crash, regardless of `likely_cause`. After v0.34.3.0 added RSS-watchdog drains (code=0), the count inflated to 120+/day on a healthy brain — the alarm pattern users reported. Both surfaces now go through `summarizeCrashes(events)` (single regression point, can't drift). The warn threshold drops from `>3` to `>=1` now that the counter is calibrated; the per-cause breakdown (runtime=N oom=M unknown=K legacy=L) gives operators triage context in the message without grep'ing the JSONL audit. `gbrain jobs supervisor status --json` adds `crashes_by_cause` and `clean_exits_24h` fields so monitoring dashboards bind to the named buckets. 4 source-grep wiring assertions in doctor.test.ts pin both call sites against drift.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
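A rough sketch of how a consumer might turn a `CrashSummary` into the doctor message and the `--json` fields named above. The `>=1` threshold, the `runtime=N oom=M unknown=K legacy=L` breakdown, and the JSON key names come from this PR; the `CrashSummary` field layout, the function names, and the wording of the ok message are assumptions.

```typescript
// Sketch only. CrashSummary's field layout is assumed, not taken from the source.
type CrashSummary = {
  total: number;
  by_cause: { runtime_error: number; oom_or_external_kill: number; unknown: number; legacy: number };
  clean_exits: number;
};

type DoctorCheck = { status: "ok" | "warn"; message: string };

function formatBreakdown(s: CrashSummary): string {
  const c = s.by_cause;
  return `runtime=${c.runtime_error} oom=${c.oom_or_external_kill} unknown=${c.unknown} legacy=${c.legacy}`;
}

// Hypothetical doctor-side check: warn as soon as one real crash is seen.
function supervisorCrashCheck(s: CrashSummary): DoctorCheck {
  if (s.total >= 1) {
    return { status: "warn", message: `Worker crashed ${s.total}x in last 24h (${formatBreakdown(s)})` };
  }
  // The ok wording below is illustrative; only the clean_exits_24h=N token is from the PR.
  return { status: "ok", message: `No worker crashes in last 24h (clean_exits_24h=${s.clean_exits})` };
}

// Hypothetical JSON projection for `gbrain jobs supervisor status --json`.
function toStatusJson(s: CrashSummary) {
  return {
    crashes_24h: s.total,
    crashes_by_cause: { ...s.by_cause },
    clean_exits_24h: s.clean_exits,
  };
}

const sampleSummary: CrashSummary = {
  total: 2,
  by_cause: { runtime_error: 1, oom_or_external_kill: 1, unknown: 0, legacy: 0 },
  clean_exits: 4,
};
console.log(supervisorCrashCheck(sampleSummary).message);
// prints "Worker crashed 2x in last 24h (runtime=1 oom=1 unknown=0 legacy=0)"
```

Deriving both the message and the JSON from the same summary object is what keeps the two surfaces from drifting apart.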
Add CLAUDE.md entry for src/core/minions/handlers/supervisor-audit.ts covering the new isCrashExit/summarizeCrashes/CrashSummary/CLEAN_EXIT_CAUSES exports. Extend doctor.ts and jobs.ts entries with the v0.35.5.0 wire-up: shared helper, denylist semantics, >=1 warn threshold, per-cause breakdown in messages, crashes_by_cause + clean_exits_24h in JSON. Regenerate llms-full.txt to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md # VERSION # package.json # src/commands/doctor.ts # src/commands/jobs.ts # test/doctor.test.ts
# Conflicts: # CHANGELOG.md # VERSION
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request on May 29, 2026
* upstream/master:
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
  v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) (garrytan#1192)
  v0.36.4.0 feat: brain-health-100 — autonomous remediation via doctor --remediate + Minions (garrytan#1193)
  fix(docs): comprehensive drift audit — contradictions, broken links, stale refs (garrytan#1201)
  v0.36.3.0 feat: dynamic embedding column selection for search (garrytan#1164)
  v0.36.2.0 feat: ZeroEntropy as default + zero-based README rewrite (garrytan#1136)
  v0.36.1.1 fix-wave: community PR triage + 28 atomic fixes (garrytan#1182)
  v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong (garrytan#1139)
  v0.36.0.0 feat(skillpack): scaffold + reference + harvest (retire managed-block install) (garrytan#1130)
  v0.35.8.0 feat(cycle): phantom-page redirect inside extract_facts (garrytan#1138)
  v0.35.7.0 feat: temporal trajectory + founder scorecard (Phases 2-4) (garrytan#1131)
  v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages (closes garrytan#1091) (garrytan#1129)
  v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes (garrytan#1108)
  v0.35.5.0 fix wave: bootstrap + orphans + think MCP + worktree + walker (garrytan#1111)
  v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability (garrytan#1085)
  v0.35.3.1 feat(eval): temporal-aware contradiction probe + verdict enum (garrytan#1052)
  v0.35.3.0 fix wave: extract_facts items + git --no-recurse-submodules placement (garrytan#1053)
# Conflicts:
#   src/core/postgres-engine.ts
#   test/schema-bootstrap-coverage.test.ts
Summary
The doctor's supervisor health check and `gbrain jobs supervisor status` both counted every `worker_exited` audit event as a crash regardless of cause. After v0.34.3.0's RSS-watchdog added more code=0 worker drains, the count inflated to 120+/day on a healthy brain, surfacing as the "Supervisor crashes: 120x/24h (was 62x — nearly doubled)" alarm.
The classifier upstream in `child-worker-supervisor.ts:309-321` was already stamping a five-value `likely_cause` field on every exit (`clean_exit`, `graceful_shutdown`, `runtime_error`, `oom_or_external_kill`, `unknown`); neither read site looked at it. v0.35.5.0 ships a shared classifier so both surfaces agree.
Helper foundation (`d9975af4`)
- `isCrashExit(event)` + `summarizeCrashes(events)` + `CrashSummary` type + `CLEAN_EXIT_CAUSES` constant in `src/core/minions/handlers/supervisor-audit.ts`
- Denylist semantics: only `clean_exit` and `graceful_shutdown` are non-crashes; future unrecognized causes surface by default
- 14 tests (9 `isCrashExit` cases + 5 `summarizeCrashes` aggregator cases)
Consumer wiring (`1d9a902d`)
- `gbrain doctor`: replaces ad-hoc filter with `summarizeCrashes`; warn threshold dropped from `>3` to `>=1` (calibrated counter); ok + warn messages include per-cause breakdown (`runtime=N oom=M unknown=K legacy=L`) and `clean_exits_24h=N`
- `gbrain jobs supervisor status`: same wiring; JSON output adds `crashes_by_cause` + `clean_exits_24h`; human output adds per-cause line + `Clean exits (24h)` line
- 4 source-grep wiring assertions in `test/doctor.test.ts` guard both call sites against drift
Metadata (`059b60df` + `cfc2a765`)
- CLAUDE.md entries for `supervisor-audit.ts`, `doctor.ts`, `jobs.ts`; `llms-full.txt` regenerated
Test Coverage
Tests: ~3650 → 3664 (+14 new in `supervisor-audit.test.ts`, +4 wiring in `doctor.test.ts`)
Pre-Landing Review
Pre-Landing Review: 3 issues. 1 auto-fixed (stale docstring on `summarizeCrashes` updated to reflect that the `legacy` bucket catches both pre-v0.34 entries AND future unrecognized causes), 2 skipped as scope creep (DRY format-string helper and type-union refactor, both captured for future cleanup).
Specialist review: 4 specialists dispatched (testing, maintainability, security, performance). 0 critical, 3 informational, all in maintainability and triaged. PR Quality Score: 9.0/10.
Adversarial Review
Claude adversarial subagent surfaced 13 informational observations. All triaged as pre-existing surfaces, accepted plan trade-offs from `/plan-eng-review`, or future-data-tuning concerns. None in-scope for this PR. Notable follow-ups identified:
- [...] `clean_exits_24h=N` so operators see drain rate at a glance. A warn threshold on that count is the natural follow-up once we have post-fix data (D5 from plan-eng-review intentionally deferred this).
- `readSupervisorEvents`: pre-existing 24h-window visibility hole at week boundaries. Separate fix.
Codex ran during `/plan-eng-review` and surfaced 4 substantive findings, all incorporated: the duplicate bug at `jobs.ts:805`, denylist semantics, shared helper extraction, and threshold rebaseline.
Scope Drift
Scope Check: CLEAN. Intent matches delivery (fix the doctor's miscounting). No drift, no missing requirements.
Plan Completion
- `src/core/minions/handlers/supervisor-audit.ts` exports `isCrashExit`, `summarizeCrashes`, `CrashSummary` with denylist semantics + legacy fallback + per-cause aggregation
- 9-case `isCrashExit` matrix + 5 `summarizeCrashes` aggregation cases in `supervisor-audit.test.ts` plus 4 source-grep wiring assertions in `doctor.test.ts` (vs. the plan's single runtime-fixture integration test). Same drift-prevention guarantee via different mechanism.
- `doctor.ts` wired to `summarizeCrashes`; warn threshold dropped `>3` → `>=1`; messages widened with per-cause breakdown + `clean_exits_24h`
- `jobs.ts` supervisor status wired to `summarizeCrashes`; JSON adds `crashes_by_cause` + `clean_exits_24h`; human output expanded
4 plan items: 3 DONE, 1 CHANGED. No NOT DONE. No UNVERIFIABLE.
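For readers unfamiliar with the source-grep approach mentioned above, here is one way such a wiring assertion could look. This is a guess at the technique, not the assertions in `doctor.test.ts`: `callsHelper` and `assertWired` are hypothetical names, and the sketch checks strings passed in directly, where a real test would presumably read `doctor.ts` and `jobs.ts` from disk.

```typescript
// Sketch only: a pure-string version of a source-grep wiring check.
// Returns true when `source` contains a call to `helper`, i.e. the name followed by "(".
function callsHelper(source: string, helper: string): boolean {
  const escaped = helper.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape regex metacharacters
  return new RegExp(`\\b${escaped}\\s*\\(`).test(source);
}

// Throws when a call site has drifted away from the shared helper.
function assertWired(file: string, source: string, helper: string): void {
  if (!callsHelper(source, helper)) {
    throw new Error(`${file} no longer calls ${helper}()`);
  }
}

// Illustrative snippets standing in for file contents.
const wiredSnippet = "const crashSummary = summarizeCrashes(events);";
const importOnlySnippet = 'import { summarizeCrashes } from "./supervisor-audit";';

assertWired("doctor.ts", wiredSnippet, "summarizeCrashes"); // passes silently
console.log(callsHelper(importOnlySnippet, "summarizeCrashes")); // prints "false"
```

Requiring the opening parenthesis means an import that is never called does not satisfy the check, which is the drift case this kind of assertion is meant to catch.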
Verification Results
Plan verification skipped: fix has no UI/URL surface, only CLI output. Local end-to-end smoke ran against a synthesized supervisor audit JSONL fixture: 6 `worker_exited` events (3 clean_exit + 1 graceful_shutdown + 1 runtime_error + 1 oom_or_external_kill) → doctor reports `Worker crashed 2x in last 24h (runtime=1 oom=1 unknown=0 legacy=0)`. Pre-fix would have reported "crashed 6x" with no qualitative signal.
TODOS
No items completed by this fix. TODOS.md unchanged.
Documentation
- CLAUDE.md: added entry for `src/core/minions/handlers/supervisor-audit.ts` covering the v0.35.5.0 exports (`isCrashExit`, `summarizeCrashes`, `CrashSummary`, `CLEAN_EXIT_CAUSES`); extended `src/commands/doctor.ts` entry with the v0.35.5.0 wire-up (shared `summarizeCrashes` consumer, `>=1` warn threshold, `runtime=A oom=B unknown=C legacy=D` per-cause breakdown, `clean_exits_24h=N` in ok message); extended `src/commands/jobs.ts` entry with the v0.35.5.0 `crashes_by_cause` + `clean_exits_24h` JSON fields and cross-surface parity contract.
- llms-full.txt: regenerated via `bun run build:llms` to match the CLAUDE.md edit.
Test plan
- `bun run typecheck` clean
- `bun run test`: 6604 pass / 0 fail / 0 skip (full parallel suite, 231s)
- `bun test test/supervisor-audit.test.ts`: 14/14 pass (21 expects, 35ms)
- `bun test test/doctor.test.ts`: 43/43 pass (124 expects, ~3.3s in isolation)
- `gbrain doctor 2>&1 | grep -i supervisor` shows real crash count (not 120x clean-exit count)
- `gbrain jobs supervisor status --json | jq '{crashes_24h, clean_exits_24h}'` matches doctor's count
🤖 Generated with Claude Code