feat(gateway): dispatcher heartbeat — detect silent stalls from outside by jarvis-stark-ops · Pull Request #8 · 1Team-Engineering/hermes-agent

jarvis-stark-ops · 2026-06-07T23:32:59Z

Summary

Closes #7 (Dispatcher: emit explicit liveness/heartbeat metric so silent-stall conditions are detectable from outside).
Writes a JSON heartbeat to $HERMES_HOME/state/dispatcher_health.json at end of every kanban dispatcher cycle (success / exception / cancellation paths all covered).
Unblocks investigation of #5 (Dispatcher: detect HTTP 429 in worker stderr and delayed-retry instead of protocol_violation auto-block) and #6 (Dispatcher: detect xAI OAuth crashes (and similar transient auth failures) and skip provider instead of auto-blocking task). Both were stuck on "is the dispatcher even running?" being unanswerable from outside the process.

Why this matters

Twice on 2026-06-07 the dispatcher silently stopped cycling while the gateway PID stayed alive. No log line appeared and no exception trace bubbled up; the loop just stopped emitting its periodic kanban dispatcher [default]: spawned=X ... line. The only debug path was tailing gateway.log and noticing the silence. This patch closes the observability gap: a monitor (or a human running hermes gateway status, once that command reads the file) can detect a stall within ~120 seconds, i.e. 2 × the default 60-second interval.

Schema v1 (stable contract)

{
  "schema_version": 1,
  "last_cycle_ts": 1780849800.0,
  "last_cycle_iso": "2026-06-07T16:30:00Z",
  "cycle_started_at": 1780849798.5,
  "cycle_duration_seconds": 1.5,
  "interval_seconds": 60.0,
  "cycles_since_start": 42,
  "any_spawned_this_cycle": true,
  "spawned_total_this_cycle": 3,
  "ready_pending": false,
  "consecutive_bad_ticks": 0,
  "gateway_pid": 12345,
  "cycle_error": null
}
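
A minimal sketch of how a payload of this shape could be assembled and written crash-safely. The PR uses the repository's existing atomic_json_write, whose signature is not shown here; `build_heartbeat` and `write_json_atomically` below are hypothetical stand-ins, and the temp-file-plus-rename approach is an assumption about one common way to get an atomic write with a single fsync.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_heartbeat(cycle_started_at, cycle_ended_at, interval_seconds,
                    cycles_since_start, any_spawned, spawned_total,
                    ready_pending, consecutive_bad_ticks, cycle_error=None):
    """Assemble a schema-v1 payload (hypothetical helper, not the PR's code)."""
    iso = datetime.fromtimestamp(cycle_ended_at, tz=timezone.utc)
    return {
        "schema_version": 1,
        "last_cycle_ts": float(cycle_ended_at),
        "last_cycle_iso": iso.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "cycle_started_at": float(cycle_started_at),
        "cycle_duration_seconds": float(cycle_ended_at - cycle_started_at),
        "interval_seconds": float(interval_seconds),
        "cycles_since_start": cycles_since_start,
        "any_spawned_this_cycle": any_spawned,
        "spawned_total_this_cycle": spawned_total,
        "ready_pending": ready_pending,
        "consecutive_bad_ticks": consecutive_bad_ticks,
        "gateway_pid": os.getpid(),
        "cycle_error": cycle_error,
    }


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write to a temp file beside the target, fsync once, then rename over it."""
    path.parent.mkdir(parents=True, exist_ok=True)  # auto-create state/ if missing
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic on the same filesystem, so readers never see a partial file.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename, then re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because each write replaces the file, a reader only ever sees the most recent cycle, which matches the overwrite semantics listed in the test plan.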

Detection rule for monitors:

stall = time.time() - last_cycle_ts > 2 × interval_seconds
dead = time.time() - last_cycle_ts > 5 minutes AND gateway PID still alive
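
The rule above could be applied by an external monitor roughly as follows. This is a sketch, not part of the PR: `classify`, `pid_alive`, and `check` are hypothetical names, the PID probe is POSIX-only, and the rule does not define a state for a stale heartbeat whose gateway PID has exited, so the sketch reports that case as "stalled".

```python
import json
import os
import time
from pathlib import Path

DEAD_AFTER_SECONDS = 300  # the "5 minutes" in the detection rule


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks that the process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(heartbeat: dict, now: float, gateway_alive: bool) -> str:
    """Return "ok", "stalled", or "dead" for a schema-v1 heartbeat."""
    age = now - heartbeat["last_cycle_ts"]
    if age <= 2 * heartbeat["interval_seconds"]:
        return "ok"
    if age > DEAD_AFTER_SECONDS and gateway_alive:
        return "dead"
    # Stale heartbeat that is either younger than 5 minutes or whose gateway PID is gone.
    return "stalled"


def check(path: Path) -> str:
    """Read the heartbeat file and classify it against the current time."""
    heartbeat = json.loads(path.read_text(encoding="utf-8"))
    return classify(heartbeat, time.time(), pid_alive(heartbeat["gateway_pid"]))
```

Keeping `classify` free of I/O makes the thresholds easy to test with fixed timestamps; `check` is the thin wrapper a cron job or watchdog would call.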

Test plan

5 unit tests in tests/gateway/test_dispatcher_heartbeat.py — all passing (Schema-v1 contract, cycle-error path, overwrite semantics, first-cycle-is-1 contract, auto-mkdir state/)
python3 -c "import ast; ast.parse(...)" clean
Manual: restart gateway after merge, tail dispatcher_health.json for one cycle to confirm format in real environment
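
For the manual check, or for a monitor that wants to validate the file before trusting it, the Schema-v1 contract could be verified along these lines. This is an illustrative sketch and not the contents of tests/gateway/test_dispatcher_heartbeat.py; `SCHEMA_V1_TYPES` and `schema_v1_problems` are hypothetical names, and the strict float check assumes the writer always emits float-typed fields with a decimal point.

```python
import re

# Expected Python types after json.loads for each of the 13 schema-v1 keys.
SCHEMA_V1_TYPES = {
    "schema_version": int,
    "last_cycle_ts": float,
    "last_cycle_iso": str,
    "cycle_started_at": float,
    "cycle_duration_seconds": float,
    "interval_seconds": float,
    "cycles_since_start": int,
    "any_spawned_this_cycle": bool,
    "spawned_total_this_cycle": int,
    "ready_pending": bool,
    "consecutive_bad_ticks": int,
    "gateway_pid": int,
    "cycle_error": (str, type(None)),
}

ISO_UTC_Z = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")


def schema_v1_problems(heartbeat: dict) -> list:
    """Return human-readable contract violations; an empty list means the payload conforms."""
    problems = []
    for key in sorted(SCHEMA_V1_TYPES.keys() - heartbeat.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(heartbeat.keys() - SCHEMA_V1_TYPES.keys()):
        problems.append(f"unexpected key: {key}")
    for key, expected in SCHEMA_V1_TYPES.items():
        if key not in heartbeat:
            continue
        value = heartbeat[key]
        # bool is a subclass of int, so reject it wherever bool is not the expected type.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    if heartbeat.get("schema_version") != 1:
        problems.append("schema_version is not 1")
    iso = heartbeat.get("last_cycle_iso")
    if isinstance(iso, str) and not ISO_UTC_Z.match(iso):
        problems.append("last_cycle_iso is not UTC ISO 8601 with a Z suffix")
    return problems
```

Returning a list of problems rather than raising on the first one makes the output more useful when eyeballing a file from a live gateway.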

Code-review focus

Cancellation path — except asyncio.CancelledError writes a final heartbeat (cycle_error="cancelled") then re-raises. Synchronous atomic_json_write cannot be interrupted by another CancelledError mid-write.
Failure isolation — heartbeat-write call is wrapped in try/except. KeyboardInterrupt/SystemExit (BaseException) correctly NOT caught.
Variable scoping — all 6 loop-locals initialized at top of while body BEFORE any try block. No UnboundLocalError risk.
Performance — 60s cadence, ~400-byte payload, single fsync. Negligible vs SQLite WAL traffic.
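
The first three review points can be pictured with a stripped-down loop. This is a hedged sketch of the control flow only, not the code in gateway/run.py: `dispatcher_loop`, `safe_heartbeat`, and the `dispatch_once` / `write_heartbeat` callables are hypothetical, and the real watcher tracks more locals than the three shown here.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


def safe_heartbeat(write_heartbeat, **fields):
    """Failure isolation: a broken heartbeat write must never stop the dispatcher."""
    try:
        write_heartbeat(**fields)
    except Exception:
        # Exception, not BaseException: KeyboardInterrupt and SystemExit still propagate.
        logger.exception("dispatcher heartbeat write failed")


async def dispatcher_loop(dispatch_once, write_heartbeat, interval_seconds=60.0):
    """dispatch_once: async callable returning a spawned count; write_heartbeat: sync callable."""
    cycles_since_start = 0
    while True:
        # Loop-locals are assigned before any try block, so every path below can read them.
        cycles_since_start += 1
        spawned_total = 0
        cycle_error = None
        try:
            try:
                spawned_total = await dispatch_once()
            except Exception as exc:
                # asyncio.CancelledError derives from BaseException (Python 3.8+),
                # so it is not swallowed here.
                cycle_error = str(exc)
            safe_heartbeat(write_heartbeat, cycles_since_start=cycles_since_start,
                           spawned_total=spawned_total, cycle_error=cycle_error)
            await asyncio.sleep(interval_seconds)
        except asyncio.CancelledError:
            # Final synchronous write, then re-raise so the task still ends as cancelled.
            safe_heartbeat(write_heartbeat, cycles_since_start=cycles_since_start,
                           spawned_total=spawned_total, cycle_error="cancelled")
            raise
```

Because the final write is a plain synchronous call with no await inside it, there is no suspension point at which a second cancellation could be delivered mid-write.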

Follow-up (separate issues, not blocking)

Extend hermes gateway status to read dispatcher_health.json and show stall age
Schema v2: add gateway_started_at for uptime without ps
Optional Prometheus textfile exposition

🤖 Generated with Claude Code

Closes #7.

Problem

The kanban dispatcher loop in `_kanban_dispatcher_watcher` can silently stop cycling while the gateway process stays alive; this was observed twice in one day (2026-06-07). Cause is unclear (possibly hung `dispatch_once()`, sqlite lock, or event-loop livelock). Without instrumentation we can't tell a dead loop from an idle gateway.

This unblocks investigation of #5 (HTTP 429 → delayed retry) and #6 (provider auth crashes → skip-not-stall), both of which were blocked on "is the dispatcher even running?" being answerable.

Solution

Write a heartbeat JSON to `$HERMES_HOME/state/dispatcher_health.json` at the end of every dispatcher cycle (success AND exception paths AND cancellation).

Schema (v1, stable contract):

  schema_version: 1
  last_cycle_ts: float             # unix seconds, end of cycle
  last_cycle_iso: str              # UTC ISO 8601 with Z suffix
  cycle_started_at: float
  cycle_duration_seconds: float
  interval_seconds: float          # configured cadence (default 60)
  cycles_since_start: int          # monotonic; first cycle = 1, 0 means "never wrote"
  any_spawned_this_cycle: bool
  spawned_total_this_cycle: int    # count across all boards
  ready_pending: bool              # ready queue non-empty
  consecutive_bad_ticks: int       # mirrors the existing HEALTH_WINDOW counter
  gateway_pid: int
  cycle_error: str | null          # exception text if cycle errored

Detection rule for monitors:

  stall = time.time() - last_cycle_ts > 2 × interval_seconds
  dead = last_cycle_ts > 5 minutes ago AND gateway PID still alive

Implementation

- New method `GatewayRunner._write_dispatcher_heartbeat` (gateway/run.py). Uses existing `atomic_json_write` for crash-safe writes.
- Called from `_kanban_dispatcher_watcher` at end of every iteration.
- Heartbeat-write failures are wrapped in try/except so they can NEVER kill the dispatcher (heartbeat is a diagnostic, not a dependency).
- Cancellation path also writes a final heartbeat with cycle_error="cancelled" before re-raising, so monitors can distinguish clean shutdown from crash.
- Locals (`cycles_since_start`, `any_spawned`, etc.) initialized at top of the loop body BEFORE any try block so they're defined for the heartbeat call even if zombie-reap or main tick throws.

Tests (5/5 passing)

- Schema-v1 contract pinned (all 13 keys, types, ISO Z suffix)
- Cycle-error path recorded correctly
- Two writes overwrite (not append)
- First-cycle-is-1 contract (monitors treat 0 as "never wrote")
- Auto-creates `state/` dir if missing

Follow-up (separate issues)

- Extend `hermes gateway status` to read this file and show stall age
- Schema v2: add `gateway_started_at` for uptime computation without ps
- Optional Prometheus textfile exposition for node_exporter setups

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jarvis-stark-ops merged commit 2aed6f5 into main Jun 7, 2026

jarvis-stark-ops deleted the wt/dispatcher-heartbeat-metric branch June 7, 2026 23:33
