Problem
Today the only way to know the dispatcher is cycling is to watch `~/.hermes/logs/gateway.log` for `kanban dispatcher [default]: spawned=X ...` lines. There is no positive heartbeat we can monitor. When the dispatch loop dies (#6) or hangs, nothing external surfaces it: the gateway PID is still alive and the platform-retry threads keep logging.
Why this matters
Sister issue to #5 and #6, which each fix one specific failure mode. This one makes the next, not-yet-known failure mode detectable as well.
For layer 2+ (real customer code), silent stalls in autonomous chains are unacceptable.
Acceptance criteria
- Dispatcher writes a heartbeat row to `~/.hermes/state/dispatcher_health.json` every cycle (or every N seconds), with: `last_cycle_ts`, `spawned_count`, `blocked_count`, `ready_queue_depth`, `providers_unhealthy`
- `hermes gateway status` or similar reads this and shows fresh-vs-stale
- (Optional) hook for alerting: warn if the age of `last_cycle_ts` exceeds 3 × the normal cycle interval
- Same metric exposed via `hermes status` global overview
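
A minimal sketch of how the heartbeat write and the fresh-vs-stale check could look, assuming the dispatcher is Python and the health file holds a single JSON object. The function names `write_heartbeat` and `heartbeat_status`, the `interval_s` and `stale_factor` parameters, and the `"missing"` state are hypothetical and not part of the existing codebase; only the file path and field names come from the criteria above. The write goes through a temp file plus rename so a reader never sees a half-written file.

```python
import json
import os
import tempfile
import time
from pathlib import Path

# Path from the acceptance criteria; all function names below are hypothetical.
HEALTH_PATH = Path.home() / ".hermes" / "state" / "dispatcher_health.json"


def write_heartbeat(path, *, spawned_count, blocked_count, ready_queue_depth,
                    providers_unhealthy, now=None):
    """Write one heartbeat record atomically (temp file, then rename)."""
    record = {
        "last_cycle_ts": time.time() if now is None else now,
        "spawned_count": spawned_count,
        "blocked_count": blocked_count,
        "ready_queue_depth": ready_queue_depth,
        # Assumption: stored as given; could be a count or a list of names.
        "providers_unhealthy": providers_unhealthy,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return record


def heartbeat_status(path, *, interval_s, stale_factor=3.0, now=None):
    """Return 'fresh', 'stale', or 'missing' for the heartbeat file at path."""
    try:
        record = json.loads(Path(path).read_text())
        last = float(record["last_cycle_ts"])
    except (OSError, ValueError, KeyError, TypeError):
        # No file, unreadable JSON, or no usable timestamp.
        return "missing"
    now = time.time() if now is None else now
    # Stale once the heartbeat is older than stale_factor x the normal interval.
    return "stale" if now - last > stale_factor * interval_s else "fresh"
```

The dispatcher would call `write_heartbeat(HEALTH_PATH, ...)` at the end of each cycle, and the status commands would call `heartbeat_status(HEALTH_PATH, interval_s=...)` with whatever the configured cycle interval is. Treating a missing or unreadable file as its own state, rather than as stale, lets the status output distinguish "dispatcher never started" from "dispatcher stopped cycling".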
Related