Dispatcher: emit explicit liveness/heartbeat metric so silent-stall conditions are detectable from outside #7

@jarvis-stark-ops

Problem

Today the only way to know the dispatcher is cycling is to watch ~/.hermes/logs/gateway.log for kanban dispatcher [default]: spawned=X ... lines. There is no positive heartbeat we can monitor. When the dispatch loop dies (#6) or hangs, nothing surfaces it externally: the gateway PID is still alive and the platform-retry threads keep logging, so the process looks healthy from outside.

Why this matters

Sister issue to #5 and #6, which each fix one specific failure mode. This issue makes the next failure mode, whatever it turns out to be, detectable as well.

For layer 2+ (real customer code), silent stalls in autonomous chains are unacceptable.

Acceptance criteria

  • Dispatcher writes a heartbeat record to ~/.hermes/state/dispatcher_health.json every cycle (or every N seconds), with: last_cycle_ts, spawned_count, blocked_count, ready_queue_depth, providers_unhealthy
  • hermes gateway status (or a similar command) reads this file and reports whether the heartbeat is fresh or stale
  • (Optional) hook for alerting: warn if the age of last_cycle_ts (now minus last_cycle_ts) exceeds 3 × the normal cycle interval
  • Same metric exposed via hermes status global overview
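
A minimal sketch of the criteria above, in Python, offered as a starting point rather than the intended implementation. The file path and field names come from this issue; the function names write_heartbeat and heartbeat_status, the normal_interval_s and stale_factor parameters, and the choice to store providers_unhealthy as a list of names are all assumptions made for illustration. The write goes through a temporary file and a rename so that a status reader never sees a half-written file.

```python
import json
import os
import tempfile
import time
from pathlib import Path

# Path taken from the acceptance criteria; all names below are hypothetical.
HEALTH_PATH = Path.home() / ".hermes" / "state" / "dispatcher_health.json"


def write_heartbeat(path, *, spawned_count, blocked_count, ready_queue_depth,
                    providers_unhealthy, now=None):
    """Atomically write one heartbeat record; call at the end of each cycle."""
    record = {
        "last_cycle_ts": time.time() if now is None else now,
        "spawned_count": spawned_count,
        "blocked_count": blocked_count,
        "ready_queue_depth": ready_queue_depth,
        # Assumption: a list of provider names (a plain count would also work).
        "providers_unhealthy": list(providers_unhealthy),
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Temp file in the same directory, so the final rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
        os.replace(tmp, path)  # atomic swap of old file for new
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return record


def heartbeat_status(path, normal_interval_s, now=None, stale_factor=3):
    """Return 'fresh', 'stale', or 'missing' for the heartbeat file at path."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            record = json.load(f)
        age = now - float(record["last_cycle_ts"])
    except (OSError, ValueError, KeyError, TypeError):
        # No file, unreadable JSON, or no usable timestamp.
        return "missing"
    return "stale" if age > stale_factor * normal_interval_s else "fresh"
```

Under these assumptions, the dispatcher would call write_heartbeat(HEALTH_PATH, ...) once per cycle, and a status command or external monitor would call heartbeat_status(HEALTH_PATH, normal_interval_s) with the configured cycle interval. Treating a missing or unparseable file as its own state keeps "never started" distinguishable from "started, then stalled".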

Metadata

Labels: enhancement (New feature or request)
