hermes doctor doesn't probe cron scheduler health — silent failures when gateway down or jobs error #26

@PowerCreek

Context

Auditing the gateway/cron/batch_runner boundaries for diagnostic visibility gaps analogous to the devagentic-graph probe shipped in #17/#18. Found one: hermes doctor does not surface cron scheduler health.

Operators wire hermes cron create …, expect autonomous execution, and have no signal from hermes doctor when:

  1. The gateway isn't running. Cron jobs only fire when the gateway process is up (cron_status checks this via find_gateway_pids()). If the gateway crashed or was never installed, jobs sit dormant.
  2. Recent jobs are failing. Every job stores last_status + last_error in ~/.hermes/cron/jobs.json. Several jobs in an error state usually point to a broken environment (missing API key, expired auth, misconfigured schedule). None of this is currently visible to doctor.

hermes cron status shows both of these — but only when the operator knows to run it. Doctor is the canonical "tell me what's wrong" surface and should flag them.
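The second condition can be detected without touching the scheduler, by reading the stored job records. A minimal sketch, not the actual implementation: the file path and the last_status / last_error field names come from the description above, while the container layout (a top-level list or a "jobs" key), the literal status value "error", and the helper names load_job_records and failed_jobs are assumptions made for illustration.

```python
import json
from pathlib import Path

# Path taken from the issue text; the record layout below is an assumption.
JOBS_PATH = Path.home() / ".hermes" / "cron" / "jobs.json"


def load_job_records(path=JOBS_PATH):
    """Return the list of job records, or [] if the file is absent or unreadable."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return []
    # Assumption: jobs live in a top-level list or under a "jobs" key.
    if isinstance(data, dict):
        data = data.get("jobs", [])
    if not isinstance(data, list):
        return []
    return [job for job in data if isinstance(job, dict)]


def failed_jobs(records):
    """Select records whose last_status marks an error (value "error" is assumed)."""
    return [r for r in records if r.get("last_status") == "error"]


if __name__ == "__main__":
    for job in failed_jobs(load_job_records()):
        print(job.get("last_error"))
```

Both helpers only read, which keeps them compatible with a diagnostic-only doctor check.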

Fix

Add _check_cron_scheduler() to hermes_cli/doctor.py. The check is inert when ~/.hermes/cron/jobs.json doesn't exist, so doctor output stays byte-identical for users who never configured cron. When jobs are configured:

─── Cron Scheduler ───
  ✓ Gateway running          PID 12345
  ✓ 3 active job(s)          next run: 2026-05-22T23:55:00

# or, if anything's broken:

─── Cron Scheduler ───
  ✗ Gateway not running       cron jobs will NOT fire — see `hermes cron status`
  ⚠ 2 of last 10 runs failed
    - job-a (2026-05-22T22:33): timeout
    - job-b (2026-05-22T22:01): authentication failed

Re-uses primitives that already exist: cron.jobs.load_jobs, hermes_cli.gateway.find_gateway_pids. Imports are wrapped to fail silently if either path is missing — same defensive pattern the existing _check_devagentic_graph uses.
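The shape of the check might look roughly like the sketch below. It is illustrative only: the module paths cron.jobs and hermes_cli.gateway are from this issue, but the assumption that load_jobs() and find_gateway_pids() take no arguments and return a list of dicts and a list of PIDs respectively is unverified, as are the "name" field, the status value "error", the exact wording of the report lines, and the primitives parameter (added here only so the logic can be exercised without a hermes install).

```python
from pathlib import Path


def _import_primitives():
    """Import the existing helpers, or return None if either module is missing."""
    try:
        from cron.jobs import load_jobs
        from hermes_cli.gateway import find_gateway_pids
    except ImportError:
        return None
    return load_jobs, find_gateway_pids


def check_cron_scheduler(jobs_path, primitives=None):
    """Return report lines for the Cron Scheduler section; [] means stay silent."""
    if not Path(jobs_path).exists():
        return []  # cron never configured: doctor output stays unchanged
    primitives = primitives or _import_primitives()
    if primitives is None:
        return []  # helpers unavailable: fail silently
    load_jobs, find_gateway_pids = primitives

    lines = []
    pids = find_gateway_pids()  # assumed: no arguments, returns a list of PIDs
    if pids:
        lines.append(f"✓ Gateway running (PID {pids[0]})")
    else:
        lines.append("✗ Gateway not running: cron jobs will NOT fire")

    jobs = load_jobs()  # assumed: no arguments, returns a list of dicts
    failed = [j for j in jobs if j.get("last_status") == "error"]
    if failed:
        lines.append(f"⚠ {len(failed)} of {len(jobs)} job(s) last ended in error")
        for job in failed:
            name = job.get("name", "?")  # "name" is a hypothetical field
            lines.append(f"    - {name}: {job.get('last_error', 'unknown error')}")
    else:
        lines.append(f"✓ {len(jobs)} job(s), none reporting an error")
    return lines
```

Returning an empty list in both the "no jobs file" and "helpers missing" cases lets the caller skip the section header entirely, which is what keeps the default output unchanged.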

Out of scope

  • Healing failed jobs. Doctor's job is diagnostic, not remediation.
  • Probing the scheduler tick cadence. That belongs to hermes cron tick and the gateway's internal scheduler; doctor would need to read scheduler state, which is more invasive than its read-only "show me what's broken" contract.

Filed by hermes-maintainer (PowerCreek). PR incoming.