hermes doctor surfaces gateway linger but not runtime state (state / uptime / active_agents / per-platform health)

## Context

Completing the gateway/cron/batch_runner audit sweep. Cron now has a doctor probe (#26 / #27), and batch_runner now persists failure stats (#28 / #29). One gap remains: `hermes doctor` checks **whether systemd linger is enabled** for the gateway, but says nothing about whether the gateway is actually running or what state it is in.

The gateway already maintains a rich runtime status file, exposed through `read_runtime_status()` in `gateway/status.py`. It is keyed by `gateway_state` ∈ {`starting`, `running`, `draining`, `stopped`, `startup_failed`, `degraded`} and carries `exit_reason`, `start_time`, `active_agents`, and per-platform health. Doctor doesn't read it.

The cron probe in #27 catches the most common symptom (gateway PID missing → cron won't fire) but only when cron jobs are configured. Operators running the gateway without cron get no signal from doctor when:

- The gateway exited with `startup_failed` (auth misconfigured, port collision, etc.).
- The gateway is in `degraded` state (running but a critical platform is broken).
- A specific platform is `fatal` or `paused` (circuit-broken after repeated failures).
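The three blind spots above reduce to a small classification over the runtime status. A minimal sketch, assuming the status is a dict with a `gateway_state` key and a hypothetical `platforms` mapping of platform name to a dict with a `state` key; the real layout of the status file may differ, and `classify` is an illustrative name, not an existing function:

```python
from typing import Optional

# Platform states named in the list above; treating them as warnings is an assumption.
BAD_PLATFORM_STATES = {"fatal", "paused"}


def classify(status: Optional[dict]) -> str:
    """Return 'skip', 'ok', 'warn', or 'fail' for a runtime status dict."""
    if status is None:
        return "skip"  # gateway was never started: nothing to report

    state = status.get("gateway_state")
    if state == "startup_failed":
        return "fail"

    # 'platforms' is a hypothetical key used only for this illustration.
    platforms = status.get("platforms") or {}
    broken = [
        name
        for name, info in platforms.items()
        if info.get("state") in BAD_PLATFORM_STATES
    ]

    if state == "running" and not broken:
        return "ok"
    # Everything else (degraded, a broken platform, or another
    # non-running state) is surfaced as a warning.
    return "warn"
```

With this shape, a gateway that reports `running` but has a paused platform still produces a warning, which is the case the cron probe cannot see.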

## Fix

Add `_check_gateway_runtime()` to `hermes_cli/doctor.py`. It is inert when `read_runtime_status()` returns None (the gateway was never started), so the default doctor output stays byte-stable. When status is present:

```
─── Gateway Runtime ───
✓ Gateway running          PID 12345 · uptime 2h 13m · 0 active agents
ℹ Platforms                discord: ✓ connected · telegram: ⚠ paused (auto-paused after repeated failures)
```

For a startup-failed gateway:

```
─── Gateway Runtime ───
✗ Gateway startup_failed   Exit reason: "missing TELEGRAM_BOT_TOKEN" · last seen 2026-05-22T22:01:00
```

For a degraded gateway:

```
─── Gateway Runtime ───
⚠ Gateway degraded         PID 12345 · uptime 5m · check platform errors below
ℹ Platforms                discord: ✓ connected · slack: ✗ fatal (token rejected by server)
```

Uses `read_runtime_status()` + `find_gateway_pids()`. No new state, no parsing of logs.
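A rough sketch of how the probe could assemble those lines. The two readers are passed in as callables so the example runs on its own; the assumptions here are that `start_time` is epoch seconds, `active_agents` is a count, and `find_gateway_pids()` returns a list of integers. None of these are confirmed by the gateway code, and the platform line is omitted for brevity:

```python
import time
from typing import Callable, List, Optional


def format_uptime(seconds: float) -> str:
    """Render a duration as '2h 13m' or '5m', as in the mock-ups above."""
    total_minutes = max(0, int(seconds)) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m" if hours else f"{minutes}m"


def check_gateway_runtime(
    read_status: Callable[[], Optional[dict]],
    find_pids: Callable[[], List[int]],
    now: Optional[float] = None,
) -> List[str]:
    """Return the doctor output lines, or [] when there is no status."""
    status = read_status()
    if status is None:
        return []  # inert: gateway was never started

    now = time.time() if now is None else now
    state = status.get("gateway_state", "unknown")
    lines = ["─── Gateway Runtime ───"]

    if state == "startup_failed":
        reason = status.get("exit_reason") or "unknown"
        lines.append(f'✗ Gateway {state}   Exit reason: "{reason}"')
        return lines

    pids = find_pids()
    pid_text = f"PID {pids[0]}" if pids else "no PID found"
    # Assumes start_time is epoch seconds; falls back to zero uptime if absent.
    uptime = format_uptime(now - status.get("start_time", now))
    agents = status.get("active_agents", 0)
    mark = "✓" if state == "running" and pids else "⚠"
    lines.append(
        f"{mark} Gateway {state}   {pid_text} · uptime {uptime} · {agents} active agents"
    )
    return lines
```

Injecting the readers keeps the probe testable without a live gateway; in doctor itself they would presumably be bound to `read_runtime_status` and `find_gateway_pids` directly.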

## Out of scope

- Triggering a restart. Doctor diagnoses, doesn't remediate (cf. #26).
- Reading log files. The runtime status JSON is enough for the surface a probe should cover.
- A separate "Memory Monitor" section. The memory monitor lives in `gateway/memory_monitor.py` and runs inside the gateway process; it doesn't expose external state for doctor to read. A separate issue if that ever becomes useful.

Filed by hermes-maintainer (PowerCreek). PR incoming.
