doctor: probe gateway runtime state (state / uptime / per-platform) #31

Merged: PowerCreek merged 1 commit into main from doctor-gateway-runtime-probe on May 22, 2026

@PowerCreek

Summary

Closes #30. This completes the gateway/cron/batch_runner audit sweep.

hermes doctor previously checked only whether systemd linger was enabled. The gateway already maintains a rich runtime status file, read via gateway/status.py:read_runtime_status(). It is keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded} and carries exit_reason, start_time, active_agents, and per-platform health. Doctor did not read it.

What's now surfaced

| State | Doctor row | Detail |
|---|---|---|
| running | ✓ ok | PID · uptime · N active agents |
| degraded | ⚠ warn | "see platform errors below" |
| startup_failed | ✗ fail | exit_reason + updated_at |
| stopped | ℹ info | exit_reason (intentional stop is not an alert) |
| starting / draining | ℹ info | transient |
| PID present, no state | ⚠ warn | old build / stale status file |
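
The state-to-row mapping above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `classify_gateway_state` and the severity strings are hypothetical names, standing in for the `check_ok` / `check_warn` / `check_fail` / `check_info` helpers that doctor uses.

```python
from typing import Optional

# Hypothetical severity labels; the real doctor calls check_ok / check_warn /
# check_fail / check_info instead of returning strings.
STATE_SEVERITY = {
    "running": "ok",
    "degraded": "warn",
    "startup_failed": "fail",
    "stopped": "info",    # an intentional stop is not an alert
    "starting": "info",   # transient
    "draining": "info",   # transient
}


def classify_gateway_state(state: Optional[str], pid: Optional[int]) -> Optional[str]:
    """Return the severity for a gateway_state value, or None for no row."""
    if state in STATE_SEVERITY:
        return STATE_SEVERITY[state]
    if pid is not None:
        # A PID with no recognised state suggests an old build or a stale file.
        return "warn"
    return None
```

Keeping the mapping in a table rather than an if/elif chain makes it easy to see at a glance that only `startup_failed` is treated as a failure.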

Per-platform health: connected platforms listed in one info row; any fatal platform becomes check_fail with error_message; any paused or retrying platform becomes check_warn.
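
A minimal sketch of the per-platform rule, under stated assumptions: the per-platform entries are assumed to be dicts with a `state` key (the actual key name in the status file is not shown in this PR), and `classify_platforms` is a hypothetical helper. Only `error_message` and the fatal / paused / retrying / connected distinctions come from the description above.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str]  # (severity, label, detail)


def classify_platforms(platforms: Dict[str, dict]) -> List[Row]:
    """Turn per-platform health entries into doctor-style rows."""
    rows: List[Row] = []

    # All connected platforms collapse into a single info row.
    connected = sorted(
        name for name, info in platforms.items() if info.get("state") == "connected"
    )
    if connected:
        rows.append(("info", "Platforms connected:", ", ".join(connected)))

    # Each unhealthy platform gets its own row.
    for name in sorted(platforms):
        info = platforms[name]
        state = info.get("state")  # assumed key name
        detail = info.get("error_message") or ""
        if state == "fatal":
            rows.append(("fail", f"Platform {name}: fatal", detail))
        elif state in ("paused", "retrying"):
            rows.append(("warn", f"Platform {name}: {state}", detail))
    return rows
```

Collapsing healthy platforms into one row keeps the output short, so that fatal or paused platforms stand out.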

The check is inert when the runtime status file is absent (the gateway never started), so the default doctor output stays byte-stable.
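
The inert-when-absent behaviour, together with the "read raises → check_warn" case from the test plan, might look roughly like the sketch below. It assumes the reader returns `None` (or an empty value) when no status file exists, which this PR does not confirm. `probe_gateway_runtime` is a hypothetical name, and the reader is passed in as a callable so the sketch does not depend on the real `gateway.status` module.

```python
from typing import Callable, List, Optional, Tuple


def probe_gateway_runtime(
    read_status: Callable[[], Optional[dict]],
) -> List[Tuple[str, str]]:
    """Return (severity, message) rows; an empty list means print nothing."""
    try:
        status = read_status()
    except Exception as exc:
        # A broken status file should produce a warning, not crash doctor.
        return [("warn", f"Could not read gateway runtime status: {exc}")]

    if not status:
        # Assumed: no status file means the gateway never started, so emit no
        # section at all and leave the default output unchanged.
        return []

    return [("info", f"Gateway state: {status.get('gateway_state', 'unknown')}")]
```

Injecting the reader also makes the "silent when no status file" and "read raises" cases easy to test without touching the filesystem.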

Example output

Healthy:

─── Gateway Runtime ───
✓ Gateway running           PID 12345 · uptime 2h 13m · 2 active agent(s)
ℹ Platforms connected:      discord, telegram

Broken:

─── Gateway Runtime ───
⚠ Gateway degraded          PID 12345 · uptime 5m · see platform errors below
ℹ Platforms connected:      discord
✗ Platform slack: fatal     token rejected by server

Test plan

  • 12 new tests in tests/hermes_cli/test_doctor_gateway_runtime.py:
    • _format_uptime covers seconds / minutes / hours
    • silent when no status file
    • running state with PID
    • startup_failed surfaces exit_reason
    • degraded state warns
    • stopped state is informational
    • starting state shows pending
    • platform fatal → check_fail
    • platform paused → check_warn
    • PID without state → check_warn (stale status)
    • read raises → check_warn (rather than crashing)
  • pytest tests/hermes_cli/test_doctor*.py → 88 passed (76 existing + 12 new)
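
For reference, an uptime formatter consistent with the example output above ("2h 13m", "5m") could be sketched as follows. This is not the `_format_uptime` from the PR: the function name, the `42s` form for sub-minute values, and the choice not to roll hours into days are all assumptions.

```python
def format_uptime_sketch(seconds: float) -> str:
    """Format a duration as 'Ns', 'Nm', or 'Nh Mm' (assumed formats)."""
    total = max(0, int(seconds))  # clamp negative values from clock skew
    if total < 60:
        return f"{total}s"
    minutes = total // 60
    if minutes < 60:
        return f"{minutes}m"
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m"
```

A test in the style of the plan above would assert one value per branch, for example 42 seconds, 300 seconds, and 7980 seconds.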

Filed by hermes-maintainer (PowerCreek).

Completing the gateway/cron/batch_runner audit sweep:
  - cron probe shipped in #27 (gateway PID + recent failures)
  - batch_runner failure stats shipped in #29 (per-prompt failures)
  - this PR: gateway runtime state itself

`hermes doctor` previously only checked whether systemd linger was
enabled. The gateway already maintains a rich runtime status file
at gateway/status.py:read_runtime_status() — keyed by
gateway_state ∈ {starting, running, draining, stopped,
startup_failed, degraded}, with exit_reason, start_time,
active_agents, and per-platform health. Doctor didn't read it.

Add `_check_gateway_runtime()` covering:

- `running` → check_ok with PID + uptime + active_agents
- `degraded` → check_warn pointing at platform errors below
- `startup_failed` → check_fail with exit_reason + updated_at
- `stopped` → check_info (intentional stop, not an alert)
- `starting`/`draining` → check_info (transient)
- PID present but no state → check_warn (old build, stale status)

Plus per-platform health: connected platforms listed as a single
check_info line; any fatal platform becomes check_fail with
error_message; any paused/retrying platform becomes check_warn.

Inert when the runtime status file is absent (gateway never
started — byte-stable default).

Closes #30.
@PowerCreek merged commit f6c0c95 into main on May 22, 2026
@PowerCreek deleted the doctor-gateway-runtime-probe branch on May 22, 2026 23:43

Development

Successfully merging this pull request may close these issues.

hermes doctor surfaces gateway linger but not runtime state (state / uptime / active_agents / per-platform health)
