
v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes#1108

Merged
garrytan merged 6 commits into master from garrytan/montreal on May 17, 2026

@garrytan

Summary

The doctor's supervisor health check and gbrain jobs supervisor status both counted every worker_exited audit event as a crash regardless of cause. After v0.34.3.0's RSS-watchdog added more code=0 worker drains, the count inflated to 120+/day on a healthy brain — surfacing as the "Supervisor crashes: 120x/24h (was 62x — nearly doubled)" alarm.

The classifier upstream in child-worker-supervisor.ts:309-321 was already stamping a five-value likely_cause field on every exit (clean_exit, graceful_shutdown, runtime_error, oom_or_external_kill, unknown); neither read site looked at it. v0.35.5.0 ships a shared classifier so both surfaces agree.

Helper foundation (d9975af4)

  • New isCrashExit(event) + summarizeCrashes(events) + CrashSummary type + CLEAN_EXIT_CAUSES constant in src/core/minions/handlers/supervisor-audit.ts
  • Denylist semantics — only clean_exit and graceful_shutdown are non-crashes; future unrecognized causes surface by default
  • 14-case unit test pinning every branch (9 isCrashExit cases + 5 summarizeCrashes aggregator cases)
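
The denylist rule above can be sketched as follows. This is a minimal illustration, not the code in supervisor-audit.ts: the `SupervisorAuditEvent` shape and its `event` / `code` field names are assumptions inferred from the branch list in this description, and the real audit schema may carry more fields.

```typescript
// Hypothetical event shape (assumption): only the fields this sketch reads.
interface SupervisorAuditEvent {
  event: string;
  likely_cause?: string;
  code?: number | null;
}

// Denylist: these two causes are the only exits treated as non-crashes.
const CLEAN_EXIT_CAUSES: ReadonlySet<string> = new Set([
  "clean_exit",
  "graceful_shutdown",
]);

function isCrashExit(e: SupervisorAuditEvent): boolean {
  // Anything that is not a worker exit is never a crash.
  if (e.event !== "worker_exited") return false;

  // Cause present: crash unless explicitly listed as clean, so a future
  // unrecognized cause surfaces by default.
  if (typeof e.likely_cause === "string") {
    return !CLEAN_EXIT_CAUSES.has(e.likely_cause);
  }

  // Legacy entry without likely_cause: fall back to the exit code.
  // A null or missing code is treated as a crash.
  return e.code !== 0;
}
```

The point of the denylist is the failure mode: if a new cause is added upstream and this file is not updated, the event is counted as a crash rather than silently hidden.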

Consumer wiring (1d9a902d)

  • gbrain doctor: replaces ad-hoc filter with summarizeCrashes; warn threshold dropped from >3 to >=1 (calibrated counter); ok + warn messages include per-cause breakdown (runtime=N oom=M unknown=K legacy=L) and clean_exits_24h=N
  • gbrain jobs supervisor status: same wiring; JSON output adds crashes_by_cause + clean_exits_24h; human output adds per-cause line + Clean exits (24h) line
  • 4 source-grep wiring assertions in test/doctor.test.ts guard both call sites against drift
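
As a rough sketch of the consumer side, the function below turns a summary into a doctor-style status and message. The `CrashSummary` field names, the `supervisorCheck` helper, and the wording of the ok message are all assumptions made for illustration; only the `>=1` warn threshold, the `runtime=N oom=M unknown=K legacy=L` breakdown, the `clean_exits_24h=N` token, and the warn wording from the smoke run come from this PR.

```typescript
// Hypothetical summary shape (assumption); the real CrashSummary may differ.
interface CrashSummary {
  total: number;
  by_cause: {
    runtime_error: number;
    oom_or_external_kill: number;
    unknown: number;
    legacy: number;
  };
  clean_exits: number;
}

interface CheckResult {
  status: "ok" | "warn";
  message: string;
}

// Hypothetical helper name; illustrates the >=1 threshold and message shape.
function supervisorCheck(s: CrashSummary): CheckResult {
  const c = s.by_cause;
  const breakdown =
    `runtime=${c.runtime_error} oom=${c.oom_or_external_kill} ` +
    `unknown=${c.unknown} legacy=${c.legacy}`;

  // The counter only includes real crashes now, so a single one warns.
  if (s.total >= 1) {
    return {
      status: "warn",
      message: `Worker crashed ${s.total}x in last 24h (${breakdown})`,
    };
  }

  // Assumed ok wording; the PR only states that clean_exits_24h=N appears.
  return {
    status: "ok",
    message: `No worker crashes in last 24h (${breakdown}, clean_exits_24h=${s.clean_exits})`,
  };
}
```

Because both CLI surfaces would format from the same summary object, their counts cannot diverge unless one of them stops calling the shared helper, which is what the source-grep assertions guard against.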

Metadata (059b60df + cfc2a765)

  • VERSION + package.json bumped to v0.35.5.0
  • CHANGELOG entry with usage instructions
  • CLAUDE.md "Key files" entries extended for supervisor-audit.ts, doctor.ts, jobs.ts; llms-full.txt regenerated

Test Coverage

src/core/minions/handlers/supervisor-audit.ts (NEW)
├── isCrashExit(event)              9 branches, all covered
│   ├── event !== 'worker_exited' → false            ✓ Case 9
│   ├── likely_cause = 'clean_exit' → false          ✓ Case 1
│   ├── likely_cause = 'graceful_shutdown' → false   ✓ Case 2
│   ├── likely_cause = 'runtime_error' → true        ✓ Case 3
│   ├── likely_cause = 'oom_or_external_kill' → true ✓ Case 4
│   ├── likely_cause = 'unknown' → true              ✓ Case 5
│   ├── likely_cause = <future unrecognized> → true  ✓ Case 6 (denylist guard)
│   ├── no cause + code=0 → false                    ✓ Case 7
│   └── no cause + code!=0 → true                    ✓ Case 8 + null-code case
└── summarizeCrashes(events)        9 branches, all covered
    ├── non-exit event skipped                       ✓ mixed + only-non-exit
    ├── clean_exit / graceful_shutdown → clean_exits ✓ mixed (4)
    ├── runtime_error / oom / unknown buckets        ✓ mixed (2/1/1)
    ├── legacy (no cause, code!=0) → legacy++        ✓ mixed (1)
    ├── unrecognized cause → legacy++                ✓ future-cause case
    └── empty input                                  ✓ empty-summary case

src/commands/doctor.ts + src/commands/jobs.ts (MODIFIED)
└── summarizeCrashes wiring + threshold + message    ✓ 4 source-grep assertions

COVERAGE: 22/22 paths tested (100%)
QUALITY: 14 unit tests, 4 wiring assertions, 1 docstring-truth test

Tests: ~3650 → 3664 (+14 new in supervisor-audit.test.ts, +4 wiring in doctor.test.ts)

Pre-Landing Review

Pre-Landing Review: 3 issues — 1 auto-fixed (stale docstring on summarizeCrashes updated to reflect that legacy bucket catches both pre-v0.34 entries AND future unrecognized causes), 2 skipped as scope creep (DRY format-string helper, type-union refactor — captured for future cleanup).

Specialist review: 4 specialists dispatched (testing, maintainability, security, performance). 0 critical, 3 informational — all in maintainability and triaged. PR Quality Score: 9.0/10.

Adversarial Review

Claude adversarial subagent surfaced 13 informational observations. All triaged as pre-existing surfaces, accepted plan trade-offs from /plan-eng-review, or future-data-tuning concerns. None in-scope for this PR. Notable follow-ups identified:

  • Clean-exit rate threshold: the new doctor surface shows clean_exits_24h=N so operators see drain rate at a glance. A warn threshold on that count is the natural follow-up once we have post-fix data (D5 from plan-eng-review intentionally deferred this).
  • ISO-week rotation boundary in readSupervisorEvents: pre-existing 24h-window visibility hole at week boundaries. Separate fix.

Codex ran during /plan-eng-review and surfaced 4 substantive findings, all incorporated: the duplicate bug at jobs.ts:805, denylist semantics, shared helper extraction, and threshold rebaseline.

Scope Drift

Scope Check: CLEAN. Intent matches delivery (fix the doctor's miscounting). No drift, no missing requirements.

Plan Completion

  • [x] T1: src/core/minions/handlers/supervisor-audit.ts exports isCrashExit, summarizeCrashes, CrashSummary with denylist semantics + legacy fallback + per-cause aggregation
  • [~] T2 — Tests landed but doctor-integration approach changed from plan: 9-case isCrashExit matrix + 5 summarizeCrashes aggregation cases in supervisor-audit.test.ts plus 4 source-grep wiring assertions in doctor.test.ts (vs. the plan's single runtime-fixture integration test). Same drift-prevention guarantee via different mechanism.
  • [x] T3: doctor.ts wired to summarizeCrashes; warn threshold dropped from >3 to >=1; messages widened with per-cause breakdown + clean_exits_24h
  • [x] T4: jobs.ts supervisor status wired to summarizeCrashes; JSON adds crashes_by_cause + clean_exits_24h; human output expanded

4 plan items: 3 DONE, 1 CHANGED. No NOT DONE. No UNVERIFIABLE.

Verification Results

Plan verification skipped — fix has no UI/URL surface, only CLI output. Local end-to-end smoke ran against a synthesized supervisor audit JSONL fixture: 6 worker_exited events (3 clean_exit + 1 graceful_shutdown + 1 runtime_error + 1 oom_or_external_kill) → doctor reports Worker crashed 2x in last 24h (runtime=1 oom=1 unknown=0 legacy=0). Pre-fix would have reported "crashed 6x" with no qualitative signal.
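
The smoke run above can be approximated with the sketch below. It is an illustration under stated assumptions, not the shipped summarizeCrashes: the event and summary field names are hypothetical, the `code` values in the fixture are invented for the example, and counting a legacy `code=0` entry toward clean exits is an assumption.

```typescript
// Hypothetical shapes (assumptions), mirroring the fields named in this PR.
interface AuditEvent {
  event: string;
  likely_cause?: string;
  code?: number | null;
}

interface CrashSummary {
  total: number;
  by_cause: {
    runtime_error: number;
    oom_or_external_kill: number;
    unknown: number;
    legacy: number;
  };
  clean_exits: number;
}

const CLEAN_EXIT_CAUSES: ReadonlySet<string> = new Set(["clean_exit", "graceful_shutdown"]);

function summarizeCrashes(events: AuditEvent[]): CrashSummary {
  const s: CrashSummary = {
    total: 0,
    by_cause: { runtime_error: 0, oom_or_external_kill: 0, unknown: 0, legacy: 0 },
    clean_exits: 0,
  };
  for (const e of events) {
    if (e.event !== "worker_exited") continue; // skip non-exit events
    const cause = e.likely_cause;
    // Clean if the cause is on the denylist, or (legacy entry) code is 0.
    const clean = cause !== undefined ? CLEAN_EXIT_CAUSES.has(cause) : e.code === 0;
    if (clean) {
      s.clean_exits++;
      continue;
    }
    s.total++;
    if (cause === "runtime_error" || cause === "oom_or_external_kill" || cause === "unknown") {
      s.by_cause[cause]++;
    } else {
      // Legacy entries and unrecognized future causes share one bucket.
      s.by_cause.legacy++;
    }
  }
  return s;
}

// Synthesized JSONL fixture: 3 clean_exit, 1 graceful_shutdown,
// 1 runtime_error, 1 oom_or_external_kill (code values are made up).
const fixture = [
  '{"event":"worker_exited","likely_cause":"clean_exit","code":0}',
  '{"event":"worker_exited","likely_cause":"clean_exit","code":0}',
  '{"event":"worker_exited","likely_cause":"clean_exit","code":0}',
  '{"event":"worker_exited","likely_cause":"graceful_shutdown","code":0}',
  '{"event":"worker_exited","likely_cause":"runtime_error","code":1}',
  '{"event":"worker_exited","likely_cause":"oom_or_external_kill","code":null}',
].join("\n");

const events = fixture.split("\n").map((line) => JSON.parse(line) as AuditEvent);
const summary = summarizeCrashes(events);
// summary.total is 2 and summary.clean_exits is 4 for this fixture.
```

Counting every `worker_exited` line in the same fixture gives 6, which is the pre-fix behavior described above.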

TODOS

No items completed by this fix. TODOS.md unchanged.

Documentation

  • CLAUDE.md: added new entry for src/core/minions/handlers/supervisor-audit.ts covering the v0.35.5.0 exports (isCrashExit, summarizeCrashes, CrashSummary, CLEAN_EXIT_CAUSES); extended src/commands/doctor.ts entry with the v0.35.5.0 wire-up (shared summarizeCrashes consumer, >=1 warn threshold, runtime=A oom=B unknown=C legacy=D per-cause breakdown, clean_exits_24h=N in ok message); extended src/commands/jobs.ts entry with the v0.35.5.0 crashes_by_cause + clean_exits_24h JSON fields and cross-surface parity contract.
  • llms-full.txt: regenerated via bun run build:llms to match the CLAUDE.md edit.
  • CHANGELOG.md: v0.35.5.0 entry shipped with the fix; "## To take advantage of v0.35.5.0" block included.

Test plan

  • bun run typecheck clean
  • bun run test — 6604 pass / 0 fail / 0 skip (full parallel suite, 231s)
  • bun test test/supervisor-audit.test.ts — 14/14 pass (21 expects, 35ms)
  • bun test test/doctor.test.ts — 43/43 pass (124 expects, ~3.3s in isolation)
  • End-to-end smoke against synthesized audit JSONL — message shape verified
  • On affected machine post-merge: gbrain doctor 2>&1 | grep -i supervisor shows real crash count (not 120x clean-exit count)
  • On affected machine post-merge: gbrain jobs supervisor status --json | jq '{crashes_24h, clean_exits_24h}' matches doctor's count

🤖 Generated with Claude Code

garrytan and others added 6 commits May 17, 2026 08:26
Adds the read-side foundation for reading `likely_cause` off `worker_exited`
audit events. Denylist semantics — only `clean_exit` and `graceful_shutdown`
are non-crashes. Future unrecognized causes surface by default.

`isCrashExit(event)` classifies a single audit event with legacy
`code !== 0` fallback for pre-v0.34 entries lacking `likely_cause`.

`summarizeCrashes(events)` aggregates a 24h window into a `CrashSummary`
with per-cause counts (runtime_error, oom_or_external_kill, unknown,
legacy) and a `clean_exits` total.

Both helpers live next to `readSupervisorEvents` so the producer (the
JSONL writer) and the consumers (doctor + jobs CLI) share one regression
point. Test matrix pins all 9 isCrashExit branches plus 5 summarizeCrashes
aggregation cases including the future-cause denylist regression guard.

`gbrain doctor` and `gbrain jobs supervisor status` both counted every
`worker_exited` audit event as a crash, regardless of `likely_cause`.
After v0.34.3.0 added RSS-watchdog drains (code=0), the count inflated
to 120+/day on a healthy brain — the alarm pattern users reported.

Both surfaces now go through `summarizeCrashes(events)` (single
regression point, can't drift). The warn threshold drops from `>3`
to `>=1` now that the counter is calibrated; the per-cause breakdown
(runtime=N oom=M unknown=K legacy=L) gives operators triage context
in the message without grep'ing the JSONL audit.

`gbrain jobs supervisor status --json` adds `crashes_by_cause` and
`clean_exits_24h` fields so monitoring dashboards bind to the named
buckets.

4 source-grep wiring assertions in doctor.test.ts pin both call sites
against drift.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add CLAUDE.md entry for src/core/minions/handlers/supervisor-audit.ts
covering the new isCrashExit/summarizeCrashes/CrashSummary/CLEAN_EXIT_CAUSES
exports. Extend doctor.ts and jobs.ts entries with the v0.35.5.0
wire-up: shared helper, denylist semantics, >=1 warn threshold, per-cause
breakdown in messages, crashes_by_cause + clean_exits_24h in JSON.
Regenerate llms-full.txt to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/commands/doctor.ts
#	src/commands/jobs.ts
#	test/doctor.test.ts
@garrytan changed the title from "v0.35.5.0 fix(doctor): stop counting clean supervisor exits as crashes" to "v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes" on May 17, 2026
@garrytan merged commit 0620094 into master on May 17, 2026
7 checks passed
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request May 29, 2026
* upstream/master:
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
  v0.36.5.0 feat: secure DATABASE_URL access for shell jobs (inherit: ["database_url"]) (garrytan#1192)
  v0.36.4.0 feat: brain-health-100 — autonomous remediation via doctor --remediate + Minions (garrytan#1193)
  fix(docs): comprehensive drift audit — contradictions, broken links, stale refs (garrytan#1201)
  v0.36.3.0 feat: dynamic embedding column selection for search (garrytan#1164)
  v0.36.2.0 feat: ZeroEntropy as default + zero-based README rewrite (garrytan#1136)
  v0.36.1.1 fix-wave: community PR triage + 28 atomic fixes (garrytan#1182)
  v0.36.1.0 Hindsight calibration wave: brain learns how you tend to be wrong (garrytan#1139)
  v0.36.0.0 feat(skillpack): scaffold + reference + harvest (retire managed-block install) (garrytan#1130)
  v0.35.8.0 feat(cycle): phantom-page redirect inside extract_facts (garrytan#1138)
  v0.35.7.0 feat: temporal trajectory + founder scorecard (Phases 2-4) (garrytan#1131)
  v0.35.6.0 feat(search): floor-ratio gate for metadata boost stages (closes garrytan#1091) (garrytan#1129)
  v0.35.5.1 fix(doctor): stop counting clean supervisor exits as crashes (garrytan#1108)
  v0.35.5.0 fix wave: bootstrap + orphans + think MCP + worktree + walker (garrytan#1111)
  v0.35.4.0 fix(doctor,entities): supervisor crash classification + bare-name resolver + 58x perf + stub guard observability (garrytan#1085)
  v0.35.3.1 feat(eval): temporal-aware contradiction probe + verdict enum (garrytan#1052)
  v0.35.3.0 fix wave: extract_facts items + git --no-recurse-submodules placement (garrytan#1053)

# Conflicts:
#	src/core/postgres-engine.ts
#	test/schema-bootstrap-coverage.test.ts