doctor: probe cron scheduler health when jobs are configured by PowerCreek · Pull Request #27 · TechDevGroup/hermes-agent

PowerCreek · 2026-05-22T23:31:10Z

Summary

Closes #26. This PR comes out of an audit of the gateway/cron/batch_runner surfaces for diagnostic gaps analogous to the devagentic-graph probe (#17/#18). `hermes doctor` made zero cron-related checks even when `~/.hermes/cron/jobs.json` is populated, so operators had to know to run `hermes cron status` to see whether jobs were actually firing.

What's surfaced

Two signals, same shape as the existing cron_status output but in the canonical "tell me what's wrong" doctor surface:

1. Gateway PID: cron only fires when the gateway runs. When `find_gateway_pids()` returns empty and jobs are configured, doctor now reports `check_fail("Gateway not running", "cron jobs will NOT fire …")`.
2. Recent failures: every job tracks `last_status` / `last_error`. Doctor warns on any job whose `last_status` is not in `{ok, skipped, pending, ""}` and lists up to the first 5 as `name (last_run_at): truncated error`.

The check is inert when jobs.json doesn't exist or contains no jobs, which preserves the byte-stable default output for users who never wired cron.
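The recent-failures rule can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `failing_jobs` and `failure_rows`, the 60-character truncation length, and treating a missing `last_status` as the empty string are all assumptions. Only the healthy-status set, the limit of 5, and the `name (last_run_at): error` shape come from the description above.

```python
from typing import Any, Dict, List

# Statuses treated as healthy, as listed in the PR description.
HEALTHY_STATUSES = {"ok", "skipped", "pending", ""}
# The description says doctor lists "up to the first 5" failing jobs.
MAX_LISTED = 5


def failing_jobs(jobs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return jobs whose last_status is outside the healthy set.

    Assumption: a missing or None last_status is treated like "".
    """
    return [
        job for job in jobs
        if (job.get("last_status") or "") not in HEALTHY_STATUSES
    ]


def failure_rows(jobs: List[Dict[str, Any]], max_error_len: int = 60) -> List[str]:
    """Format up to MAX_LISTED failing jobs as 'name (last_run_at): error'.

    The truncation length is hypothetical; the PR only says "truncated error".
    """
    rows = []
    for job in failing_jobs(jobs)[:MAX_LISTED]:
        error = str(job.get("last_error") or "")
        if len(error) > max_error_len:
            error = error[:max_error_len] + "..."
        name = job.get("name", "?")
        last_run = job.get("last_run_at", "?")
        rows.append(f"- {name} ({last_run}): {error}")
    return rows
```

Using a set of healthy statuses, rather than a set of failing ones, means any unrecognised status counts as a failure and gets reported.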

Example output

─── Cron Scheduler ───
✗ Gateway not running       cron jobs will NOT fire — run `hermes gateway` or
                            `hermes gateway install` for systemd. See
                            `hermes cron status` for the same check.
✓ 2 active job(s)           next run: 2026-05-22T23:55:00
⚠ 1 of 3 job(s) have a failing last_status
                            see jobs.json or `hermes cron status` for full detail
ℹ   - audit-prs (2026-05-22T22:01:00): authentication failed — set API key

Test plan

8 new tests in tests/hermes_cli/test_doctor_cron_scheduler.py:
- silent when jobs.json absent
- silent when jobs list empty
- corrupted jobs.json (RuntimeError) surfaces a fail row
- gateway running + active jobs → ok rows with PID + next run
- gateway not running → fail row
- failing jobs → warn + up to 5 info rows
- disabled / paused jobs excluded from active count
- all-disabled jobs → info row instead of ok
pytest tests/hermes_cli/test_doctor_cron_scheduler.py tests/hermes_cli/test_doctor*.py → 75 passed (67 existing + 8 new)
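The first few cases in the test plan (absent file, empty list, corrupted file, gateway up or down) describe a decision flow that might look like the sketch below. Everything here is hypothetical: the function name `cron_rows`, the `(level, message)` row shape, the assumption that jobs.json is either a list or an object with a `jobs` key, and the use of plain `json` parsing are illustrative only and are not taken from the hermes codebase.

```python
import json
from pathlib import Path
from typing import List, Tuple

Row = Tuple[str, str]  # (level, message); a hypothetical shape for doctor rows


def cron_rows(jobs_path: Path, gateway_pids: List[int]) -> List[Row]:
    """Return doctor rows for the cron section, or [] when the check is inert."""
    if not jobs_path.exists():
        return []  # silent when jobs.json is absent
    try:
        data = json.loads(jobs_path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError
        return [("fail", f"jobs.json unreadable: {exc}")]
    # Assumption: the file holds either a bare list or {"jobs": [...]}
    if isinstance(data, dict):
        jobs = data.get("jobs", [])
    elif isinstance(data, list):
        jobs = data
    else:
        return [("fail", "jobs.json has an unexpected shape")]
    if not jobs:
        return []  # silent when the jobs list is empty
    if gateway_pids:
        return [("ok", f"Gateway running (PID {gateway_pids[0]})")]
    return [("fail", "Gateway not running")]
```

Returning an empty list for the absent and empty cases is what keeps the doctor output unchanged for users who never configured cron.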

Filed by hermes-maintainer (PowerCreek).

Auditing the gateway/cron/batch_runner surfaces for diagnostic gaps analogous to the devagentic-graph probe (#17/#18). `hermes doctor` made zero cron-related checks even when ~/.hermes/cron/jobs.json is populated.

Two signals are now surfaced (same shape as cron_status, but in the canonical "tell me what's wrong" surface):

1. Gateway PID: cron only fires when the gateway runs. When PIDs are absent and jobs are configured, doctor now fails with a pointer to `hermes gateway install`.
2. Recent failures: every job tracks last_status / last_error. Doctor warns on any job whose last_status is not in {ok, skipped, pending, ""}, and lists up to the first 5 with the failing job's name + last_run_at + truncated error.

Inert when jobs.json doesn't exist or contains no jobs, which is the byte-stable default for users who never wired cron.

Closes #26.

Completing the gateway/cron/batch_runner audit sweep:

- cron probe shipped in #27 (gateway PID + recent failures)
- batch_runner failure stats shipped in #29 (per-prompt failures)
- this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was enabled. The gateway already maintains a rich runtime status file at gateway/status.py:read_runtime_status(), keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded}, with exit_reason, start_time, active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single check_info line; any fatal platform becomes check_fail with error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never started, the byte-stable default).

Closes #30.
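The state handling described in the gateway runtime commit message above amounts to a small lookup from `gateway_state` to a check level. The sketch below is an illustration under assumptions, not the body of `_check_gateway_runtime()`: the function name `classify_gateway`, the string levels, the `None` return for "emit nothing", and the handling of an unknown state or of a missing state with no PID are all hypothetical.

```python
from typing import Any, Dict, Optional

# State-to-level mapping taken from the commit message.
STATE_TO_LEVEL = {
    "running": "ok",
    "degraded": "warn",
    "startup_failed": "fail",
    "stopped": "info",    # intentional stop, not an alert
    "starting": "info",   # transient
    "draining": "info",   # transient
}


def classify_gateway(status: Optional[Dict[str, Any]], pid_present: bool) -> Optional[str]:
    """Return the check level for the gateway, or None when nothing is reported.

    status is the parsed runtime status, or None when the status file is absent.
    """
    if status is None:
        return None  # inert: gateway never started
    state = status.get("gateway_state")
    if not state:
        # PID present but no state: old build or stale status file.
        # Assumption: with no PID and no state, stay silent.
        return "warn" if pid_present else None
    # Assumption: an unrecognised state is surfaced as a warning.
    return STATE_TO_LEVEL.get(state, "warn")
```

For example, `classify_gateway({"gateway_state": "startup_failed"}, False)` yields `"fail"`, while `classify_gateway(None, True)` yields `None`.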

PowerCreek merged commit 2c86087 into main May 22, 2026

PowerCreek deleted the doctor-cron-scheduler-probe branch May 22, 2026 23:31

This was referenced May 22, 2026

hermes doctor surfaces gateway linger but not runtime state (state / uptime / active_agents / per-platform health) #30

Closed

doctor: probe gateway runtime state (state / uptime / per-platform) #31

Merged
