feat(gateway): dispatcher heartbeat - detect silent stalls from outside #8
Merged
Closes #7.

Problem

The kanban dispatcher loop in `_kanban_dispatcher_watcher` can silently stop cycling while the gateway process stays alive (observed twice in one day, 2026-06-07). The cause is unclear: possibly a hung `dispatch_once()`, a sqlite lock, or an event-loop livelock. Without instrumentation we can't tell a dead loop from an idle gateway.

This unblocks investigation of #5 (HTTP 429 → delayed retry) and #6 (provider auth crashes → skip-not-stall), both of which were blocked on "is the dispatcher even running?" being answerable.

Solution

Write a heartbeat JSON to `$HERMES_HOME/state/dispatcher_health.json` at the end of every dispatcher cycle (success AND exception paths AND cancellation).

Schema (v1, stable contract):

    schema_version: 1
    last_cycle_ts: float            # unix seconds, end of cycle
    last_cycle_iso: str             # UTC ISO 8601 with Z suffix
    cycle_started_at: float
    cycle_duration_seconds: float
    interval_seconds: float         # configured cadence (default 60)
    cycles_since_start: int         # monotonic; first cycle = 1, 0 means "never wrote"
    any_spawned_this_cycle: bool
    spawned_total_this_cycle: int   # count across all boards
    ready_pending: bool             # ready queue non-empty
    consecutive_bad_ticks: int      # mirrors the existing HEALTH_WINDOW counter
    gateway_pid: int
    cycle_error: str | null         # exception text if cycle errored

Detection rule for monitors:

    stall = time.time() - last_cycle_ts > 2 × interval_seconds
    dead  = last_cycle_ts is more than 5 minutes old AND gateway PID still alive

Implementation

- New method `GatewayRunner._write_dispatcher_heartbeat` (gateway/run.py). Uses existing `atomic_json_write` for crash-safe writes.
- Called from `_kanban_dispatcher_watcher` at end of every iteration.
- Heartbeat-write failures are wrapped in try/except so they can NEVER kill the dispatcher (heartbeat is a diagnostic, not a dependency).
- Cancellation path also writes a final heartbeat with cycle_error="cancelled" before re-raising, so monitors can distinguish clean shutdown from crash.
- Locals (`cycles_since_start`, `any_spawned`, etc.) initialized at top of the loop body BEFORE any try block so they're defined for the heartbeat call even if zombie-reap or main tick throws.

Tests (5/5 passing)

- Schema-v1 contract pinned (all 13 keys, types, ISO Z suffix)
- Cycle-error path recorded correctly
- Two writes overwrite (not append)
- First-cycle-is-1 contract (monitors treat 0 as "never wrote")
- Auto-creates `state/` dir if missing

Follow-up (separate issues)

- Extend `hermes gateway status` to read this file and show stall age
- Schema v2: add `gateway_started_at` for uptime computation without ps
- Optional Prometheus textfile exposition for node_exporter setups

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
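The write path described under Implementation can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in gateway/run.py: `write_heartbeat`, `watcher`, and the `max_cycles` parameter are hypothetical names, the temp-file-plus-`os.replace` step stands in for the project's `atomic_json_write` (whose signature is not shown in this PR), and `ready_pending` and `consecutive_bad_ticks` are filled with placeholders.

```python
import asyncio
import json
import os
import tempfile
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Awaitable, Callable, Optional


def write_heartbeat(state_dir: Path, *, cycle_started_at: float,
                    interval_seconds: float, cycles_since_start: int,
                    spawned_total: int, cycle_error: Optional[str]) -> None:
    """Overwrite dispatcher_health.json with a schema-v1 payload."""
    now = time.time()
    iso = datetime.fromtimestamp(now, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    payload = {
        "schema_version": 1,
        "last_cycle_ts": now,
        "last_cycle_iso": iso,
        "cycle_started_at": cycle_started_at,
        "cycle_duration_seconds": now - cycle_started_at,
        "interval_seconds": float(interval_seconds),
        "cycles_since_start": cycles_since_start,
        "any_spawned_this_cycle": spawned_total > 0,
        "spawned_total_this_cycle": spawned_total,
        "ready_pending": False,      # placeholder: real value comes from the ready queue
        "consecutive_bad_ticks": 0,  # placeholder: real value mirrors HEALTH_WINDOW
        "gateway_pid": os.getpid(),
        "cycle_error": cycle_error,
    }
    state_dir.mkdir(parents=True, exist_ok=True)  # auto-create state/ if missing
    # Stand-in for atomic_json_write: temp file in the same directory, then rename.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    os.replace(tmp_path, state_dir / "dispatcher_health.json")


async def watcher(state_dir: Path, dispatch_once: Callable[[], Awaitable[int]],
                  interval_seconds: float = 60.0,
                  max_cycles: Optional[int] = None) -> None:
    """Run dispatch cycles, writing a heartbeat at the end of every one."""
    cycles_since_start = 0
    while max_cycles is None or cycles_since_start < max_cycles:
        # Locals are set before any try block so the heartbeat call always sees them.
        cycles_since_start += 1
        cycle_started_at = time.time()
        spawned_total = 0
        cycle_error: Optional[str] = None

        def beat() -> None:
            try:
                write_heartbeat(state_dir, cycle_started_at=cycle_started_at,
                                interval_seconds=interval_seconds,
                                cycles_since_start=cycles_since_start,
                                spawned_total=spawned_total, cycle_error=cycle_error)
            except Exception:
                pass  # a failed heartbeat write must never kill the dispatcher

        try:
            spawned_total = await dispatch_once()
        except asyncio.CancelledError:
            cycle_error = "cancelled"
            beat()  # final heartbeat so monitors can tell shutdown from crash
            raise
        except Exception as exc:
            cycle_error = f"{type(exc).__name__}: {exc}"
        beat()
        await asyncio.sleep(interval_seconds)
```

Because the file is replaced rather than appended to, a reader only ever sees the most recent complete cycle, which is what the overwrite test in the list above pins down.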
Summary
- Write a heartbeat JSON to `$HERMES_HOME/state/dispatcher_health.json` at end of every kanban dispatcher cycle (success / exception / cancellation paths all covered).

Why this matters

Twice on 2026-06-07 the dispatcher silently stopped cycling while the gateway PID stayed alive. No log line, no exception trace bubbled up; the loop just stopped emitting its periodic `kanban dispatcher [default]: spawned=X ...` line. The only debug path was tailing gateway.log and noticing the silence. This patch closes the observability gap so a monitor (or a human running `hermes gateway status`, once that consumes the file) can detect stalls within ~120 seconds.

Schema v1 (stable contract)

    {
      "schema_version": 1,
      "last_cycle_ts": 1780000000.0,
      "last_cycle_iso": "2026-06-07T16:30:00Z",
      "cycle_started_at": 1779999998.5,
      "cycle_duration_seconds": 1.5,
      "interval_seconds": 60.0,
      "cycles_since_start": 42,
      "any_spawned_this_cycle": true,
      "spawned_total_this_cycle": 3,
      "ready_pending": false,
      "consecutive_bad_ticks": 0,
      "gateway_pid": 12345,
      "cycle_error": null
    }

Detection rule for monitors: `time.time() - last_cycle_ts > 2 × interval_seconds`

Test plan

- `tests/gateway/test_dispatcher_heartbeat.py`: all passing (Schema-v1 contract, cycle-error path, overwrite semantics, first-cycle-is-1 contract, auto-mkdir state/)
- `python3 -c "import ast; ast.parse(...)"` clean
- Watch `dispatcher_health.json` for one cycle to confirm format in real environment

Code-review focus

- `except asyncio.CancelledError` writes a final heartbeat (`cycle_error="cancelled"`), then re-raises. Synchronous `atomic_json_write` cannot be interrupted by another CancelledError mid-write.
- Locals are initialized at the top of the `while` body BEFORE any try block. No UnboundLocalError risk.

Follow-up (separate issues, not blocking)

- Extend `hermes gateway status` to read `dispatcher_health.json` and show stall age
- Schema v2: add `gateway_started_at` for uptime without `ps`

🤖 Generated with Claude Code
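On the consuming side, an external monitor could apply the detection rule along these lines. This is a minimal sketch, not part of the patch: `evaluate`, `check_file`, and `pid_alive` are hypothetical helper names, the 300-second constant encodes the 5-minute "dead" threshold from the commit message, and the `os.kill(pid, 0)` probe is POSIX-only (on Windows signal 0 has a different meaning, so a different liveness check would be needed there).

```python
import json
import os
import time
from pathlib import Path
from typing import Callable, Optional

DEAD_AFTER_SECONDS = 300.0  # the "5 minutes" threshold from the detection rule


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def evaluate(heartbeat: dict, now: Optional[float] = None,
             is_alive: Callable[[int], bool] = pid_alive) -> dict:
    """Apply the stall / dead rules to one parsed heartbeat."""
    now = time.time() if now is None else now
    age = now - heartbeat["last_cycle_ts"]
    return {
        "age_seconds": age,
        "stall": age > 2 * heartbeat["interval_seconds"],
        "dead": age > DEAD_AFTER_SECONDS and is_alive(heartbeat["gateway_pid"]),
    }


def check_file(path: Path) -> Optional[dict]:
    """Return the evaluation, or None if the dispatcher has never written."""
    if not path.exists():
        return None
    return evaluate(json.loads(path.read_text(encoding="utf-8")))
```

With the default 60-second interval, `stall` becomes true once the heartbeat is more than 120 seconds old, which matches the ~120 second detection window stated above. Passing `now` and `is_alive` explicitly keeps the rule testable without a live gateway process.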