
doctor: probe cron scheduler health when jobs are configured #27

Merged
PowerCreek merged 1 commit into main from doctor-cron-scheduler-probe on May 22, 2026

@PowerCreek

Summary

Closes #26. Found while auditing the gateway/cron/batch_runner surfaces for diagnostic gaps analogous to the devagentic-graph probe (#17/#18): `hermes doctor` made zero cron-related checks even when ~/.hermes/cron/jobs.json is populated, so operators had to know to run `hermes cron status` to see whether jobs were actually firing.

What's surfaced

Two signals, same shape as the existing cron_status output but in the canonical "tell me what's wrong" doctor surface:

  1. Gateway PID — cron only fires when the gateway runs. When find_gateway_pids() returns empty and jobs are configured, doctor now reports check_fail("Gateway not running", "cron jobs will NOT fire …").
  2. Recent failures — every job tracks last_status / last_error. Doctor warns on any job whose last_status is not in {ok, skipped, pending, ""} and lists up to the first 5 with name (last_run_at): truncated error.

Inert when jobs.json doesn't exist or contains no jobs, which preserves the byte-stable default output for users who never wired cron.
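
The two signals and the inert default can be sketched roughly as follows. This is a minimal illustration, not the merged code: `probe_cron`, the `(level, message)` row tuples, and the 80-character error truncation are assumptions made for the example. Only the healthy-status set and the limit of 5 listed failures come from the description above. The "ok" rows for a running gateway and active jobs are left out for brevity.

```python
from typing import Dict, List, Tuple

# Statuses treated as healthy, as listed in the PR description.
HEALTHY_STATUSES = {"ok", "skipped", "pending", ""}
MAX_LISTED_FAILURES = 5   # "up to the first 5"
MAX_ERROR_CHARS = 80      # assumption: the real truncation length is not stated


def probe_cron(jobs: List[Dict], gateway_pids: List[int]) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (level, message) rows for the cron section."""
    if not jobs:
        return []  # inert: nothing is printed when no jobs are configured

    rows: List[Tuple[str, str]] = []

    # Signal 1: cron only fires while the gateway is running.
    if not gateway_pids:
        rows.append(("fail", "Gateway not running"))

    # Signal 2: jobs whose last_status falls outside the healthy set.
    # Assumption: a missing or None last_status is treated like "".
    failing = [j for j in jobs if (j.get("last_status") or "") not in HEALTHY_STATUSES]
    if failing:
        rows.append(("warn", f"{len(failing)} of {len(jobs)} job(s) have a failing last_status"))
        for job in failing[:MAX_LISTED_FAILURES]:
            error = (job.get("last_error") or "")[:MAX_ERROR_CHARS]
            rows.append(("info", f"  - {job.get('name')} ({job.get('last_run_at')}): {error}"))
    return rows
```

Returning plain rows instead of printing keeps the sketch easy to assert against; the actual doctor code reports through `check_fail` and related helpers.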

Example output

─── Cron Scheduler ───
✗ Gateway not running       cron jobs will NOT fire — run `hermes gateway` or
                            `hermes gateway install` for systemd. See
                            `hermes cron status` for the same check.
✓ 2 active job(s)           next run: 2026-05-22T23:55:00
⚠ 1 of 3 job(s) have a failing last_status
                            see jobs.json or `hermes cron status` for full detail
ℹ   - audit-prs (2026-05-22T22:01:00): authentication failed — set API key

Test plan

  • 8 new tests in tests/hermes_cli/test_doctor_cron_scheduler.py:
    • silent when jobs.json absent
    • silent when jobs list empty
    • corrupted jobs.json (RuntimeError) surfaces a fail row
    • gateway running + active jobs → ok rows with PID + next run
    • gateway not running → fail row
    • failing jobs → warn + up to 5 info rows
    • disabled / paused jobs excluded from active count
    • all-disabled jobs → info row instead of ok
  • pytest tests/hermes_cli/test_doctor_cron_scheduler.py tests/hermes_cli/test_doctor*.py → 75 passed (67 existing + 8 new)
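
The three file-state cases in the list above (absent, empty, corrupted) could be exercised against something like the sketch below. It is illustrative only: `load_jobs`, `cron_rows`, the row tuples, and the assumption that jobs.json is an object with a top-level `"jobs"` list are all hypothetical and not taken from the merged code.

```python
import json
from pathlib import Path
from typing import List, Tuple


def load_jobs(path: Path) -> list:
    """Hypothetical loader: raises RuntimeError when the file is not valid JSON."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"corrupted jobs file: {exc}") from exc
    # Assumption: jobs live under a top-level "jobs" key.
    return data.get("jobs", []) if isinstance(data, dict) else []


def cron_rows(path: Path) -> List[Tuple[str, str]]:
    """Return (level, message) rows for the three file states in the test plan."""
    if not path.exists():
        return []  # silent when jobs.json is absent
    try:
        jobs = load_jobs(path)
    except RuntimeError as exc:
        return [("fail", f"Could not read jobs.json: {exc}")]  # corrupted file
    if not jobs:
        return []  # silent when the jobs list is empty
    return [("ok", f"{len(jobs)} job(s) configured")]
```

With pytest, each case maps naturally onto a `tmp_path` fixture: write the file (or not), call the probe, and assert on the returned rows.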

Filed by hermes-maintainer (PowerCreek).

Auditing the gateway/cron/batch_runner surfaces for diagnostic
gaps analogous to the devagentic-graph probe (#17/#18).
`hermes doctor` made zero cron-related checks even when
~/.hermes/cron/jobs.json is populated.

Two signals are now surfaced (same shape as cron_status, but in
the canonical "tell me what's wrong" surface):

1. Gateway PID — cron only fires when the gateway runs. When PIDs
   are absent and jobs are configured, doctor now fails with a
   pointer to `hermes gateway install`.
2. Recent failures — every job tracks last_status / last_error.
   Doctor warns on any job whose last_status is not in
   {ok, skipped, pending, ""}, and lists up to the first 5 with
   the failing job's name + last_run_at + truncated error.

Inert when jobs.json doesn't exist or contains no jobs — the
byte-stable default for users who never wired cron.

Closes #26.
@PowerCreek PowerCreek merged commit 2c86087 into main May 22, 2026
@PowerCreek PowerCreek deleted the doctor-cron-scheduler-probe branch May 22, 2026 23:31
PowerCreek added a commit that referenced this pull request May 22, 2026
)

Completing the gateway/cron/batch_runner audit sweep:
  - cron probe shipped in #27 (gateway PID + recent failures)
  - batch_runner failure stats shipped in #29 (per-prompt failures)
  - this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was
enabled. The gateway already maintains a rich runtime status file
at gateway/status.py:read_runtime_status() — keyed by
gateway_state ∈ {starting, running, draining, stopped,
startup_failed, degraded}, with exit_reason, start_time,
active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single
check_info line; any fatal platform becomes check_fail with
error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never
started — byte-stable default).
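
The state-to-severity mapping above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `_check_gateway_runtime()`: `classify_gateway`, the row tuples, the message wording, and treating `active_agents` as a count are hypothetical. Per-platform health is omitted because the shape of that data is not described here.

```python
from typing import Dict, List, Optional, Tuple

TRANSIENT_STATES = {"starting", "draining"}


def classify_gateway(status: Optional[Dict], pid: Optional[int] = None) -> List[Tuple[str, str]]:
    """Hypothetical helper: map a runtime status dict to (level, message) rows."""
    if status is None:
        return []  # inert: no runtime status file, gateway never started

    state = status.get("gateway_state")
    if state == "running":
        agents = status.get("active_agents", 0)  # assumption: a simple count
        return [("ok", f"Gateway running (PID {pid}, {agents} active agent(s))")]
    if state == "degraded":
        return [("warn", "Gateway degraded, see platform errors below")]
    if state == "startup_failed":
        reason = status.get("exit_reason", "unknown")
        return [("fail", f"Gateway startup failed: {reason}")]
    if state == "stopped":
        return [("info", "Gateway stopped")]  # intentional stop, not an alert
    if state in TRANSIENT_STATES:
        return [("info", f"Gateway {state}")]
    if pid is not None:
        # PID present but no recognised state: old build or stale status file.
        return [("warn", "Gateway PID present but no runtime state")]
    return []
```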

Closes #30.
Development

Successfully merging this pull request may close these issues.

hermes doctor doesn't probe cron scheduler health — silent failures when gateway down or jobs error