hermes doctor surfaces gateway linger but not runtime state (state / uptime / active_agents / per-platform health) #30

@PowerCreek

Context

Completing the gateway/cron/batch_runner audit sweep. Cron now has a doctor probe (#26 / #27), and batch_runner now persists failure stats (#28 / #29). One gap remains: hermes doctor checks whether systemd linger is enabled for the gateway, but reports nothing about whether the gateway is actually running or what state it is in.

The gateway already maintains a rich runtime status file, read via gateway/status.py:read_runtime_status(). It is keyed by gateway_state ∈ {starting, running, draining, stopped, startup_failed, degraded} and carries exit_reason, start_time, active_agents, and per-platform health. Doctor doesn't read it.

The cron probe in #27 catches the most common symptom (gateway PID missing → cron won't fire) but only when cron jobs are configured. Operators running the gateway without cron get no signal from doctor when:

  • The gateway exited with startup_failed (auth misconfigured, port collision, etc.).
  • The gateway is in degraded state (running but a critical platform is broken).
  • A specific platform is fatal or paused (circuit-broken after repeated failures).
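The three blind spots above amount to a small state-to-severity table. A minimal sketch, assuming the gateway_state values listed earlier in this issue; the names `STATE_GLYPHS` and `severity_for_state` are hypothetical, and only the glyphs for running, degraded, and startup_failed are taken from the sample output in this issue (the rest are guesses):

```python
# Hypothetical mapping from gateway_state to a doctor glyph.
# running / degraded / startup_failed follow the sample output in this issue;
# the remaining entries are assumptions made for illustration only.
STATE_GLYPHS = {
    "running": "✓",
    "degraded": "⚠",
    "startup_failed": "✗",
    "starting": "ℹ",   # assumption
    "draining": "⚠",   # assumption
    "stopped": "ℹ",    # assumption
}


def severity_for_state(state: str) -> str:
    """Return the glyph for a gateway_state; unknown states get a warning."""
    return STATE_GLYPHS.get(state, "⚠")
```

Falling back to a warning for unknown states means a future gateway_state value would still surface in doctor output rather than being dropped silently.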

Fix

Add _check_gateway_runtime() to hermes_cli/doctor.py. The check is inert when read_runtime_status() returns None (the gateway was never started), so the default doctor output stays byte-stable. When a status is present:

─── Gateway Runtime ───
✓ Gateway running          PID 12345 · uptime 2h 13m · 0 active agents
ℹ Platforms                discord: ✓ connected · telegram: ⚠ paused (auto-paused after repeated failures)

For a startup-failed gateway:

─── Gateway Runtime ───
✗ Gateway startup_failed   Exit reason: "missing TELEGRAM_BOT_TOKEN" · last seen 2026-05-22T22:01:00

For a degraded gateway:

─── Gateway Runtime ───
⚠ Gateway degraded         PID 12345 · uptime 5m · check platform errors below
ℹ Platforms                discord: ✓ connected · slack: ✗ fatal (token rejected by server)
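The Platforms line in the samples above could be assembled roughly as follows. This is a sketch under stated assumptions: the per-platform health is taken to be a dict of platform name to a dict with `state` and optional `detail` keys, which is a guess at the shape rather than the actual status schema, and `format_platforms` is a hypothetical helper name.

```python
# Glyphs for the three platform states that appear in this issue.
PLATFORM_GLYPHS = {"connected": "✓", "paused": "⚠", "fatal": "✗"}


def format_platforms(platforms: dict) -> str:
    """Render per-platform health as a single ' · '-separated line.

    Assumes each value looks like {"state": str, "detail": str | None};
    both key names are illustrative, not confirmed by the gateway code.
    """
    parts = []
    for name, info in sorted(platforms.items()):
        state = info.get("state", "unknown")
        glyph = PLATFORM_GLYPHS.get(state, "?")
        text = f"{name}: {glyph} {state}"
        detail = info.get("detail")
        if detail:
            text += f" ({detail})"
        parts.append(text)
    return " · ".join(parts)
```

Sorting by platform name keeps the line stable across runs, which matters if doctor output is ever diffed or snapshot-tested.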

Uses read_runtime_status() + find_gateway_pids(). No new state, no parsing of logs.
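A rough sketch of how the check might be shaped, to make the "inert on None" behaviour concrete. To stay self-contained it takes the status dict and PID list as arguments instead of calling read_runtime_status() and find_gateway_pids() directly. The function names, the assumption that start_time is an ISO 8601 string, and the assumption that active_agents is an integer count are all hypothetical; the real status schema may differ.

```python
from datetime import datetime, timezone


def format_uptime(seconds: float) -> str:
    """Format a duration as '2h 13m', or '5m' when under an hour."""
    hours, minutes = divmod(int(seconds) // 60, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"


def gateway_runtime_lines(status, pids, now=None):
    """Return doctor output lines; an empty list when no status exists."""
    if status is None:
        return []  # gateway never started: emit nothing, output unchanged

    state = status.get("gateway_state", "unknown")
    label = f"Gateway {state}"
    lines = ["─── Gateway Runtime ───"]

    if state == "startup_failed":
        reason = status.get("exit_reason", "unknown")
        lines.append(f'✗ {label:<25}Exit reason: "{reason}"')
        return lines

    # Assumption: start_time is ISO 8601; naive timestamps are treated as UTC.
    started = datetime.fromisoformat(status["start_time"])
    if started.tzinfo is None:
        started = started.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    uptime = format_uptime((now - started).total_seconds())

    glyph = {"running": "✓", "degraded": "⚠"}.get(state, "ℹ")
    pid = pids[0] if pids else "?"
    agents = status.get("active_agents", 0)  # assumed to be an int count
    lines.append(
        f"{glyph} {label:<25}PID {pid} · uptime {uptime} · {agents} active agents"
    )
    return lines
```

Returning a list of lines rather than printing keeps the function easy to test, and the early return on None is what would keep doctor output unchanged for installs that never ran the gateway.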

Out of scope

Filed by hermes-maintainer (PowerCreek). PR incoming.
