fix(doctor): only count non-zero exit codes as worker crashes #1050
Closed
garrytan-agents wants to merge 2 commits into
added 2 commits
May 15, 2026 21:46
The doctor's supervisor health check was counting ALL worker_exited audit events as crashes, including code=0 clean exits (drain after queue empty, health-check-triggered restart, etc.). This inflated the crash count to 100+ per day when the actual crash count was 0.

PR garrytan#1002 already fixed the supervisor's own crash counting (crashCount=0 for code=0 exits). This patch applies the same logic to the doctor's independent audit-log probe.

Changes:
- Filter worker_exited events by exit code: only code !== 0 counts as a crash
- Show clean exit count separately in status message for visibility
- Crash threshold (>3) now applies only to real crashes
Owner
Thanks — closing as already-shipped. v0.35.5.0 introduced |
Problem

The doctor's supervisor health check counts ALL `worker_exited` audit events as crashes, including code=0 clean exits (drain after queue empty, health-check restart, etc.). This inflates the crash count to 100+ per day when the actual crash count is 0.

PR #1002 already fixed the supervisor's own crash counting (`crashCount = 0` for code=0). This patch applies the same logic to the doctor's independent audit-log probe.

Fix

- Filter `worker_exited` events by exit code: only `code !== 0` counts as a crash

Evidence

Production audit log shows 102 `worker_exited` events in 24h, ALL with `code: 0`. Doctor reports "102 crashes" when actual crashes = 0.

Companion to PR #1002 (supervisor side) and PR #1004 (RSS accuracy).
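For reviewers who want the filtering rule in concrete form, here is a minimal sketch of the logic described above. It is not the code from this PR: the `AuditEvent` shape, the `summarizeWorkerExits` name, the `"warn"`/`"ok"` status values, and the message wording are all hypothetical. Only the `worker_exited` event name, the `code !== 0` test, and the >3 crash threshold come from the description.

```typescript
// Hypothetical shape of one audit-log entry; the real schema may differ.
interface AuditEvent {
  event: string;
  // Assumption: null when no numeric exit code was recorded (e.g. signal kill).
  code: number | null;
}

interface ExitSummary {
  crashes: number;
  cleanExits: number;
  status: "ok" | "warn"; // hypothetical status values
  message: string;
}

// From the PR description: the threshold applies only to real crashes (>3).
const CRASH_THRESHOLD = 3;

function summarizeWorkerExits(events: AuditEvent[]): ExitSummary {
  const exits = events.filter((e) => e.event === "worker_exited");

  // Only non-zero exit codes count as crashes. With `code !== 0`, a null
  // code is also counted as a crash; that choice is an assumption here.
  const crashes = exits.filter((e) => e.code !== 0).length;
  const cleanExits = exits.length - crashes;

  const status = crashes > CRASH_THRESHOLD ? "warn" : "ok";

  // Clean exits are reported separately so they stay visible.
  const message = `${crashes} crashes, ${cleanExits} clean exits`;

  return { crashes, cleanExits, status, message };
}
```

Fed the situation from the Evidence section (102 `worker_exited` events, all with `code: 0`), this sketch reports 0 crashes and 102 clean exits, so the threshold is not tripped.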