feat(gateway): dispatcher heartbeat — detect silent stalls from outside #8

Merged
jarvis-stark-ops merged 1 commit into main from wt/dispatcher-heartbeat-metric on Jun 7, 2026

@jarvis-stark-ops

Summary

Why this matters

Twice on 2026-06-07 the dispatcher silently stopped cycling while the gateway PID stayed alive. No log line was written and no exception trace bubbled up; the loop just stopped emitting its periodic kanban dispatcher [default]: spawned=X ... line. The only debug path was tailing gateway.log and noticing the silence. This patch closes that observability gap so that a monitor (or a human running hermes gateway status, once that command consumes the file) can detect a stall within ~120 seconds, i.e. 2 × the default 60-second interval.

Schema v1 (stable contract)

{
  "schema_version": 1,
  "last_cycle_ts": 1780000000.0,
  "last_cycle_iso": "2026-06-07T16:30:00Z",
  "cycle_started_at": 1779999998.5,
  "cycle_duration_seconds": 1.5,
  "interval_seconds": 60.0,
  "cycles_since_start": 42,
  "any_spawned_this_cycle": true,
  "spawned_total_this_cycle": 3,
  "ready_pending": false,
  "consecutive_bad_ticks": 0,
  "gateway_pid": 12345,
  "cycle_error": null
}

Detection rule for monitors:

  • stall = time.time() - last_cycle_ts > 2 × interval_seconds
  • dead = time.time() - last_cycle_ts > 5 minutes AND gateway PID still alive
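
A minimal sketch of how an external monitor might apply these two rules. The helper names (`pid_alive`, `classify`, `check`) are hypothetical and not part of this PR, the PID probe uses `os.kill(pid, 0)` and therefore assumes a POSIX host, and the 5-minute threshold is taken as 300 seconds.

```python
import json
import os
import time
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(health: dict, now: float, alive: bool, dead_after: float = 300.0) -> str:
    """Apply the stall/dead rules to one parsed heartbeat."""
    age = now - health["last_cycle_ts"]
    if age > dead_after and alive:
        return "dead"
    if age > 2 * health["interval_seconds"]:
        return "stall"
    return "ok"


def check(path: Path) -> str:
    """Read the heartbeat file and classify it against the current time."""
    health = json.loads(path.read_text())
    return classify(health, time.time(), pid_alive(health["gateway_pid"]))
```

Keeping `classify` free of I/O means the thresholds can be tested with fixed timestamps, while `check` is the only part that touches the file and the process table.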

Test plan

  • 5 unit tests in tests/gateway/test_dispatcher_heartbeat.py — all passing (Schema-v1 contract, cycle-error path, overwrite semantics, first-cycle-is-1 contract, auto-mkdir state/)
  • python3 -c "import ast; ast.parse(...)" clean
  • Manual: restart gateway after merge, tail dispatcher_health.json for one cycle to confirm format in real environment

Code-review focus

  1. Cancellation path: except asyncio.CancelledError writes a final heartbeat (cycle_error="cancelled") then re-raises. Synchronous atomic_json_write cannot be interrupted by another CancelledError mid-write.
  2. Failure isolation: the heartbeat-write call is wrapped in try/except. KeyboardInterrupt/SystemExit (BaseException) are correctly NOT caught.
  3. Variable scoping: all 6 loop-locals are initialized at the top of the while body BEFORE any try block. No UnboundLocalError risk.
  4. Performance: 60s cadence, ~400-byte payload, single fsync. Negligible vs SQLite WAL traffic.
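
The body of `atomic_json_write` is not shown in this PR. For readers unfamiliar with the pattern, the sketch below shows one common way such a helper is written: write to a temp file, fsync it, then rename it into place. The function name `atomic_json_write_sketch` is hypothetical, and the repository's real helper may differ in its details.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_json_write_sketch(path: Path, payload: dict) -> None:
    """Write payload to path so readers never observe a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # one fsync per write
        os.replace(tmp_name, path)  # replaces the old file in a single step
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

Because the function contains no `await`, an asyncio cancellation cannot be delivered partway through it, which is the property review item 1 relies on.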

Follow-up (separate issues, not blocking)

  • Extend hermes gateway status to read dispatcher_health.json and show stall age
  • Schema v2: add gateway_started_at for uptime without ps
  • Optional Prometheus textfile exposition

🤖 Generated with Claude Code

Closes #7.

Problem
The kanban dispatcher loop in `_kanban_dispatcher_watcher` can silently stop
cycling while the gateway process stays alive — observed twice in one day
(2026-06-07). Cause is unclear (possibly hung `dispatch_once()`, sqlite lock,
or event-loop livelock). Without instrumentation we can't tell a dead loop
from an idle gateway.

This unblocks investigation of #5 (HTTP 429 → delayed retry) and #6 (provider
auth crashes → skip-not-stall), both of which were blocked on "is the
dispatcher even running?" being answerable.

Solution
Write a heartbeat JSON to `$HERMES_HOME/state/dispatcher_health.json` at the
end of every dispatcher cycle (success AND exception paths AND cancellation).

Schema (v1, stable contract):
  schema_version: 1
  last_cycle_ts: float        # unix seconds, end of cycle
  last_cycle_iso: str         # UTC ISO 8601 with Z suffix
  cycle_started_at: float
  cycle_duration_seconds: float
  interval_seconds: float     # configured cadence (default 60)
  cycles_since_start: int     # monotonic; first cycle = 1, 0 means "never wrote"
  any_spawned_this_cycle: bool
  spawned_total_this_cycle: int   # count across all boards
  ready_pending: bool         # ready queue non-empty
  consecutive_bad_ticks: int  # mirrors the existing HEALTH_WINDOW counter
  gateway_pid: int
  cycle_error: str | null     # exception text if cycle errored
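
As an illustration of what pinning this contract could look like on the consumer side, here is a small validator sketch. The field names and types come from the schema above; `validate_v1` and `SCHEMA_V1_TYPES` are hypothetical names, and accepting `int` where the schema says `float` is an assumption made because JSON does not distinguish `60` from `60.0`.

```python
NUMBER = (int, float)  # assumption: tolerate 60 as well as 60.0

SCHEMA_V1_TYPES = {
    "schema_version": int,
    "last_cycle_ts": NUMBER,
    "last_cycle_iso": str,
    "cycle_started_at": NUMBER,
    "cycle_duration_seconds": NUMBER,
    "interval_seconds": NUMBER,
    "cycles_since_start": int,
    "any_spawned_this_cycle": bool,
    "spawned_total_this_cycle": int,
    "ready_pending": bool,
    "consecutive_bad_ticks": int,
    "gateway_pid": int,
    "cycle_error": (str, type(None)),
}


def validate_v1(payload: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in sorted(SCHEMA_V1_TYPES.keys() - payload.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(payload.keys() - SCHEMA_V1_TYPES.keys()):
        problems.append(f"unexpected key: {key}")
    for key, expected in SCHEMA_V1_TYPES.items():
        if key not in payload:
            continue
        value = payload[key]
        # bool is a subclass of int in Python, so reject it explicitly
        # everywhere the schema does not ask for a bool.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"{key}: bool is not allowed here")
        elif not isinstance(value, expected):
            problems.append(f"{key}: unexpected type {type(value).__name__}")
    if "schema_version" in payload and payload["schema_version"] != 1:
        problems.append("schema_version: expected 1")
    iso = payload.get("last_cycle_iso")
    if isinstance(iso, str) and not iso.endswith("Z"):
        problems.append("last_cycle_iso: missing Z suffix")
    return problems
```

A monitor could run this check before applying the detection rule, so that a malformed or future-version file is reported as such instead of being misread as a stall.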

Detection rule for monitors:
  stall = time.time() - last_cycle_ts > 2 × interval_seconds
  dead  = time.time() - last_cycle_ts > 5 minutes AND gateway PID still alive

Implementation
- New method `GatewayRunner._write_dispatcher_heartbeat` (gateway/run.py).
  Uses existing `atomic_json_write` for crash-safe writes.
- Called from `_kanban_dispatcher_watcher` at end of every iteration.
- Heartbeat-write failures are wrapped in try/except so they can NEVER
  kill the dispatcher (heartbeat is a diagnostic, not a dependency).
- Cancellation path also writes a final heartbeat with cycle_error="cancelled"
  before re-raising — so monitors can distinguish clean shutdown from crash.
- Locals (`cycles_since_start`, `any_spawned`, etc.) initialized at top of
  the loop body BEFORE any try block so they're defined for the heartbeat
  call even if zombie-reap or main tick throws.
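
The control flow described in these bullets can be sketched as follows. This is a simplified stand-in, not the code in gateway/run.py: `dispatcher_loop`, `tick`, `write_heartbeat`, `build_payload` and `safe_heartbeat` are hypothetical names, the payload carries only a subset of the v1 fields, and the sketch assumes Python 3.8+, where `asyncio.CancelledError` is not a subclass of `Exception`.

```python
import asyncio
import logging
import os
import time

log = logging.getLogger(__name__)


def build_payload(cycles, started_at, spawned_total, error) -> dict:
    """Assemble a subset of the v1 heartbeat fields."""
    now = time.time()
    return {
        "last_cycle_ts": now,
        "cycle_started_at": started_at,
        "cycle_duration_seconds": now - started_at,
        "cycles_since_start": cycles,
        "spawned_total_this_cycle": spawned_total,
        "gateway_pid": os.getpid(),
        "cycle_error": error,
    }


def safe_heartbeat(write_heartbeat, payload: dict) -> None:
    """A failed diagnostic write is logged, never raised."""
    try:
        write_heartbeat(payload)
    except Exception:  # KeyboardInterrupt/SystemExit still propagate
        log.exception("heartbeat write failed")


async def dispatcher_loop(tick, write_heartbeat, interval: float) -> None:
    cycles_since_start = 0
    while True:
        # Bind every value the heartbeat needs before any try block.
        cycle_started_at = time.time()
        cycles_since_start += 1
        spawned_total = 0
        cycle_error = None
        try:
            try:
                spawned_total = await tick()
            except Exception as exc:
                cycle_error = f"{type(exc).__name__}: {exc}"
            safe_heartbeat(write_heartbeat, build_payload(
                cycles_since_start, cycle_started_at, spawned_total, cycle_error))
            await asyncio.sleep(interval)
        except asyncio.CancelledError:
            # Final heartbeat so monitors can tell shutdown from a crash.
            safe_heartbeat(write_heartbeat, build_payload(
                cycles_since_start, cycle_started_at, spawned_total, "cancelled"))
            raise
```

In this sketch the outer `try` also covers the sleep, so a cancellation that arrives between cycles still produces the final `cycle_error="cancelled"` heartbeat.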

Tests (5/5 passing)
- Schema-v1 contract pinned (all 13 keys, types, ISO Z suffix)
- Cycle-error path recorded correctly
- Two writes overwrite (not append)
- First-cycle-is-1 contract (monitors treat 0 as "never wrote")
- Auto-creates `state/` dir if missing

Follow-up (separate issues)
- Extend `hermes gateway status` to read this file and show stall age
- Schema v2: add `gateway_started_at` for uptime computation without ps
- Optional Prometheus textfile exposition for node_exporter setups

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jarvis-stark-ops jarvis-stark-ops merged commit 2aed6f5 into main Jun 7, 2026
@jarvis-stark-ops jarvis-stark-ops deleted the wt/dispatcher-heartbeat-metric branch June 7, 2026 23:33
Development

Successfully merging this pull request may close these issues.

Dispatcher: emit explicit liveness/heartbeat metric so silent-stall conditions are detectable from outside
