Context
Completing the gateway/cron/batch_runner audit sweep. Cron now has a doctor probe (#26 / #27), batch_runner now persists failure stats (#28 / #29). One gap remaining: hermes doctor checks whether systemd linger is enabled for the gateway, but says nothing about whether the gateway is actually RUNNING or in what state.
The gateway already maintains a rich runtime status file at gateway/status.py:read_runtime_status() — keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded}, with exit_reason, start_time, active_agents, and per-platform health. Doctor doesn't read it.
The cron probe in #27 catches the most common symptom (gateway PID missing → cron won't fire) but only when cron jobs are configured. Operators running the gateway without cron get no signal from doctor when:
- The gateway exited with startup_failed (auth misconfigured, port collision, etc.).
- The gateway is in degraded state (running but a critical platform is broken).
- A specific platform is fatal or paused (circuit-broken after repeated failures).
Fix
Add _check_gateway_runtime() to hermes_cli/doctor.py. The check is inert when read_runtime_status() returns None (the gateway was never started), so default doctor output stays byte-stable. When status is present:
─── Gateway Runtime ───
✓ Gateway running PID 12345 · uptime 2h 13m · 0 active agents
ℹ Platforms discord: ✓ connected · telegram: ⚠ paused (auto-paused after repeated failures)
For a startup-failed gateway:
─── Gateway Runtime ───
✗ Gateway startup_failed Exit reason: "missing TELEGRAM_BOT_TOKEN" · last seen 2026-05-22T22:01:00
For a degraded gateway:
─── Gateway Runtime ───
⚠ Gateway degraded PID 12345 · uptime 5m · check platform errors below
ℹ Platforms discord: ✓ connected · slack: ✗ fatal (token rejected by server)
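The uptime values in the sample output ("2h 13m", "5m") could be derived from start_time along these lines. This is a sketch only: it assumes start_time is a Unix timestamp in seconds, which the status file may or may not use, and format_uptime is a hypothetical helper name.

```python
import time
from typing import Optional


def format_uptime(start_time: float, now: Optional[float] = None) -> str:
    """Render elapsed time since start_time as a short string.

    Assumption: start_time is seconds since the epoch. If the status file
    stores an ISO-8601 string instead, it would need parsing first.
    """
    now = time.time() if now is None else now
    seconds = max(0, int(now - start_time))  # clamp clock skew to zero
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes = rem // 60
    if days:
        return f"{days}d {hours}h"
    if hours:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"
```

Passing now explicitly keeps the helper deterministic, so it can be tested without patching the clock.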
Uses read_runtime_status() + find_gateway_pids(). No new state, no parsing of logs.
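A minimal sketch of how the rendering side of the check could look; it is not the actual patch. It assumes the status payload is a dict carrying the fields named above, and that per-platform health is a mapping of platform name to a dict with "state" and "detail" keys. That shape, the render_gateway_runtime name, and the glyph tables are illustrative assumptions. The real check would obtain its inputs from read_runtime_status() and find_gateway_pids() instead of taking them as arguments.

```python
from __future__ import annotations

from typing import Optional

# Gateway states come from the issue text; the glyphs chosen for
# starting/draining/stopped are assumptions.
STATE_GLYPH = {
    "running": "✓",
    "starting": "ℹ",
    "draining": "ℹ",
    "stopped": "ℹ",
    "degraded": "⚠",
    "startup_failed": "✗",
}

# Assumed platform state names, taken from the sample output.
PLATFORM_GLYPH = {"connected": "✓", "paused": "⚠", "fatal": "✗"}


def render_gateway_runtime(status: Optional[dict], pids: list[int]) -> list[str]:
    """Return doctor output lines; empty when no status file exists."""
    if status is None:
        return []  # gateway never started: emit nothing at all

    state = status.get("gateway_state", "unknown")
    glyph = STATE_GLYPH.get(state, "⚠")  # unknown states surface as warnings

    details = []
    if pids:
        details.append(f"PID {pids[0]}")
    if state == "startup_failed" and status.get("exit_reason"):
        details.append(f'Exit reason: "{status["exit_reason"]}"')
    if state in ("running", "degraded"):
        details.append(f"{status.get('active_agents', 0)} active agents")

    lines = [
        "─── Gateway Runtime ───",
        (f"{glyph} Gateway {state} " + " · ".join(details)).rstrip(),
    ]

    platforms = status.get("platforms") or {}
    if platforms:
        parts = []
        for name, info in sorted(platforms.items()):
            p_state = info.get("state", "unknown")
            part = f"{name}: {PLATFORM_GLYPH.get(p_state, '?')} {p_state}"
            if info.get("detail"):
                part += f" ({info['detail']})"
            parts.append(part)
        lines.append("ℹ Platforms " + " · ".join(parts))
    return lines
```

Keeping the renderer a pure function of the status dict and PID list means the None case and each gateway_state can be unit-tested without a running gateway.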
Out of scope
- [...] gateway/memory_monitor.py and runs inside the gateway process; it doesn't expose external state for doctor to read. A separate issue if that ever becomes useful.
Filed by hermes-maintainer (PowerCreek). PR incoming.