doctor: probe cron scheduler health when jobs are configured #27
Merged
Auditing the gateway/cron/batch_runner surfaces for diagnostic gaps analogous to the devagentic-graph probe (#17/#18). `hermes doctor` made zero cron-related checks even when ~/.hermes/cron/jobs.json is populated.

Two signals are now surfaced (same shape as cron_status, but in the canonical "tell me what's wrong" surface):

1. Gateway PID: cron only fires when the gateway runs. When PIDs are absent and jobs are configured, doctor now fails with a pointer to `hermes gateway install`.
2. Recent failures: every job tracks last_status / last_error. Doctor warns on any job whose last_status is not in {ok, skipped, pending, ""}, and lists up to the first 5 with the failing job's name + last_run_at + truncated error.

Inert when jobs.json doesn't exist or contains no jobs, which keeps the byte-stable default for users who never wired cron.

Closes #26.
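The two signals above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in this PR: `probe_cron`, the `(level, message)` tuples, and the 80-character truncation are hypothetical, and the job dicts are assumed to carry `name`, `last_status`, `last_error`, and `last_run_at` keys as the description suggests.

```python
# Hypothetical sketch of the cron probe; not the merged implementation.

# Statuses the description treats as healthy; anything else is reported.
HEALTHY_STATUSES = {"ok", "skipped", "pending", ""}
MAX_LISTED = 5  # "lists up to the first 5" failing jobs


def probe_cron(jobs, gateway_pids, max_error_len=80):
    """Return (level, message) findings; an empty list when no jobs exist.

    `jobs` is a list of job dicts and `gateway_pids` a list of PIDs.
    Both shapes are assumptions made for this sketch.
    """
    if not jobs:
        return []  # inert: nothing configured, so nothing is reported

    findings = []

    # Signal 1: cron only fires while the gateway is running.
    if not gateway_pids:
        findings.append(
            ("fail", "Gateway not running: cron jobs will NOT fire "
                     "(see `hermes gateway install`)")
        )

    # Signal 2: jobs whose last run ended in a non-healthy status.
    failing = [
        job for job in jobs
        if str(job.get("last_status") or "") not in HEALTHY_STATUSES
    ]
    if failing:
        findings.append(("warn", f"{len(failing)} cron job(s) failed on last run"))
        for job in failing[:MAX_LISTED]:
            error = str(job.get("last_error") or "")
            if len(error) > max_error_len:
                error = error[:max_error_len] + "..."
            name = job.get("name", "?")
            last_run = job.get("last_run_at", "?")
            findings.append(("info", f"{name} ({last_run}): {error}"))

    return findings
```

Returning an empty list for an empty job set is what keeps the probe silent for users who never configured cron; a missing `last_status` is treated the same as `""`.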
PowerCreek added a commit that referenced this pull request on May 22, 2026:

Completing the gateway/cron/batch_runner audit sweep:

- cron probe shipped in #27 (gateway PID + recent failures)
- batch_runner failure stats shipped in #29 (per-prompt failures)
- this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was enabled. The gateway already maintains a rich runtime status file at gateway/status.py:read_runtime_status(), keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded}, with exit_reason, start_time, active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single check_info line; any fatal platform becomes check_fail with error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never started, the byte-stable default).

Closes #30.
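The state-to-severity mapping in that commit message can be sketched as below. This is a hedged illustration, not the body of `_check_gateway_runtime()`: the function name `classify_gateway`, the `(level, message)` tuples, the `platforms` dict layout, and its `state` key are assumptions, and uptime is left out because the format of `start_time` is not given here.

```python
# Hypothetical sketch of the gateway runtime mapping; not the shipped code.

def classify_gateway(status, pid=None):
    """Map a runtime status dict to (level, message) findings.

    `status` is None when the runtime status file is absent. The
    `platforms` layout ({name: {"state": ..., "error_message": ...}})
    is an assumption made for this sketch.
    """
    if status is None:
        return []  # inert: the gateway was never started

    findings = []
    state = status.get("gateway_state")

    if state == "running":
        agents = status.get("active_agents", 0)
        findings.append(("ok", f"Gateway running (PID {pid}, {agents} active agents)"))
    elif state == "degraded":
        findings.append(("warn", "Gateway degraded; see platform errors below"))
    elif state == "startup_failed":
        findings.append(
            ("fail", f"Gateway startup failed: {status.get('exit_reason')} "
                     f"(updated {status.get('updated_at')})")
        )
    elif state == "stopped":
        findings.append(("info", "Gateway stopped (intentional stop)"))
    elif state in ("starting", "draining"):
        findings.append(("info", f"Gateway {state} (transient)"))
    elif pid is not None:
        # A live PID without a recorded state: old build or stale status file.
        findings.append(("warn", "Gateway PID present but no runtime state"))

    platforms = status.get("platforms") or {}

    # Connected platforms collapse into a single informational line.
    connected = [n for n, p in platforms.items() if p.get("state") == "connected"]
    if connected:
        findings.append(("info", "Connected platforms: " + ", ".join(connected)))

    # Fatal platforms fail the check; paused or retrying ones only warn.
    for name, platform in platforms.items():
        platform_state = platform.get("state")
        if platform_state == "fatal":
            findings.append(("fail", f"{name}: {platform.get('error_message')}"))
        elif platform_state in ("paused", "retrying"):
            findings.append(("warn", f"{name}: {platform_state}"))

    return findings
```

Keeping `stopped`, `starting`, and `draining` at the informational level follows the commit message: an intentional stop or a transient phase should not read as an alert.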
Summary
Closes #26. Auditing the gateway/cron/batch_runner surfaces for diagnostic gaps analogous to the devagentic-graph probe (#17/#18).
`hermes doctor` made zero cron-related checks even when `~/.hermes/cron/jobs.json` is populated; operators had to know to run `hermes cron status` to see whether jobs were actually firing.

What's surfaced
Two signals, same shape as the existing cron_status output but in the canonical "tell me what's wrong" doctor surface:

1. Gateway PID: cron only fires when the gateway runs. When `find_gateway_pids()` returns empty and jobs are configured, doctor now reports `check_fail("Gateway not running", "cron jobs will NOT fire …")`.
2. Recent failures: every job tracks `last_status` / `last_error`. Doctor warns on any job whose `last_status` is not in `{ok, skipped, pending, ""}` and lists up to the first 5 as `name (last_run_at): truncated error`.

Inert when `jobs.json` doesn't exist or contains no jobs, which preserves the byte-stable default for users who never wired cron.

Example output
Test plan
- New tests in `tests/hermes_cli/test_doctor_cron_scheduler.py`
- `pytest tests/hermes_cli/test_doctor_cron_scheduler.py tests/hermes_cli/test_doctor*.py` → 75 passed (67 existing + 8 new)

Filed by hermes-maintainer (PowerCreek).