fix(doctor): only count non-zero exit codes as worker crashes #1050

Closed
garrytan-agents wants to merge 2 commits into garrytan:master from garrytan-agents:fix/doctor-crash-count-filter

@garrytan-agents

Contributor

Problem

The doctor's supervisor health check counts ALL worker_exited audit events as crashes, including code=0 clean exits (drain after queue empty, health-check restart, etc.). This inflates the crash count to 100+ per day when the actual crash count is 0.

PR #1002 already fixed the supervisor's own crash counting (crashCount = 0 for code=0). This patch applies the same logic to the doctor's independent audit-log probe.

Fix

  • Filter worker_exited events by exit code: only code !== 0 counts as a crash
  • Show clean exit count separately in status message for visibility
  • Crash threshold (>3) now only applies to real crashes
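
As a rough illustration of the three points above, here is a minimal sketch of the filter. The `AuditEvent` and `ExitSummary` types, the function names, and the status-line wording are assumptions made for this example, not the actual doctor.ts code; only the `code !== 0` rule, the separate clean-exit count, and the `>3` threshold come from this PR.

```typescript
// Hypothetical shape of an audit-log entry. Field names follow the sample
// worker_exited event quoted in this PR; the type itself is an assumption.
interface AuditEvent {
  event: string;
  code: number | null; // null when the worker was terminated by a signal
  signal: string | null;
}

// Hypothetical summary type for this sketch.
interface ExitSummary {
  crashes: number;
  cleanExits: number;
}

// Only worker_exited events with code !== 0 count as crashes.
// A null code (signal kill) is not 0, so it is counted as a crash too.
function countWorkerExits(events: AuditEvent[]): ExitSummary {
  const exits = events.filter((e) => e.event === "worker_exited");
  const crashes = exits.filter((e) => e.code !== 0).length;
  return { crashes, cleanExits: exits.length - crashes };
}

// Hypothetical status line: the >3 threshold applies to real crashes only,
// and clean exits are reported separately for visibility.
function supervisorStatus(summary: ExitSummary): string {
  const level = summary.crashes > 3 ? "warn" : "ok";
  return `${level}: ${summary.crashes} crashes, ${summary.cleanExits} clean exits`;
}
```

With 102 clean exits and no crashes, this sketch stays below the threshold and reports the clean exits in their own count instead of folding them into the crash total.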

Evidence

Production audit log shows 102 worker_exited events in 24h, ALL with code: 0. Doctor reports "102 crashes" when actual crashes = 0.

worker_exited { code: 0, signal: null, crash_count: 0, reason: "code 0" }

Companion to PR #1002 (supervisor side) and PR #1004 (RSS accuracy).

Wintermute added 2 commits May 15, 2026 21:46
The doctor's supervisor health check was counting ALL worker_exited audit
events as crashes, including code=0 clean exits (drain after queue empty,
health-check-triggered restart, etc.). This inflated the crash count to
100+ per day when the actual crash count was 0.

PR garrytan#1002 already fixed the supervisor's own crash counting (crashCount=0
for code=0 exits). This patch applies the same logic to the doctor's
independent audit-log probe.

Changes:
- Filter worker_exited events by exit code: only code !== 0 counts as crash
- Show clean exit count separately in status message for visibility
- Crash threshold (>3) now applies only to real crashes
@garrytan

Owner

Thanks, closing as already-shipped. v0.35.5.0 introduced summarizeCrashes() in src/core/minions/handlers/supervisor-audit.ts as the single source of truth for crash-event classification; both gbrain doctor (doctor.ts:1011-1043) and gbrain jobs supervisor status (jobs.ts:803-826) route through it. The denylist-over-allowlist design (clean_exit / graceful_shutdown are NON-crashes; everything else, including future unrecognized causes, counts as a crash) is the regression guard that the parallel worker_exited-filter pattern lacked. It is pinned by 14 cases in test/supervisor-audit.test.ts plus 4 source-grep wiring assertions in test/doctor.test.ts that keep the ad-hoc filter from drifting back in.
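
To make the denylist idea concrete, here is a minimal sketch. It is not the real summarizeCrashes() from supervisor-audit.ts, whose signature and event shape are not shown in this thread. The `cause` field, the names `isCrashCause` and `summarizeByDenylist`, and every cause string other than clean_exit and graceful_shutdown are assumptions for illustration.

```typescript
// Causes named as non-crashes in the comment above. Anything not listed here
// is treated as a crash, including causes this code has never seen.
const NON_CRASH_CAUSES: ReadonlySet<string> = new Set([
  "clean_exit",
  "graceful_shutdown",
]);

// Hypothetical event shape: the real audit record may look different.
interface ClassifiedExit {
  cause: string;
}

// Denylist check: a cause is a crash unless it is explicitly known to be benign.
function isCrashCause(cause: string): boolean {
  return !NON_CRASH_CAUSES.has(cause);
}

// Hypothetical summary helper built on the denylist check.
function summarizeByDenylist(exits: ClassifiedExit[]): {
  crashes: number;
  nonCrashes: number;
} {
  const crashes = exits.filter((e) => isCrashCause(e.cause)).length;
  return { crashes, nonCrashes: exits.length - crashes };
}
```

The design choice is about the failure mode: if a new exit cause is added later and nobody updates the classifier, a denylist over-reports crashes, which is visible, while an allowlist of known crash causes would silently under-report them.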
