fix(kanban): bridge worker runtime activity to board heartbeat (#31752) by teknium1 · Pull Request #34418 · NousResearch/hermes-agent

teknium1 · 2026-05-29T06:25:59Z

Summary

The dispatcher watchdog (release_stale_claims) reads tasks.last_heartbeat_at to decide whether to reclaim a running task. The agent already maintains an in-process _last_activity_ts updated on every chunk and tool result, but those liveness ticks never reach the board unless the model explicitly calls kanban_heartbeat — so a worker actively executing a long run without tool-level heartbeats can be reclaimed mid-flight, returning the task to ready and orphaning the in-flight worker's progress.

Changes

tools/kanban_tools.py — new heartbeat_current_worker_from_env() helper: rate-limited (60s), best-effort (never raises), no-op outside dispatcher-spawned worker context. Identity from HERMES_KANBAN_TASK + HERMES_KANBAN_RUN_ID + HERMES_KANBAN_CLAIM_LOCK.
run_agent.py _touch_activity — calls the helper when HERMES_KANBAN_TASK is set. Lazy import so non-kanban runs pay zero cost.
scripts/release.py — AUTHOR_MAP entries for @faisfamilytravel and @kweiner.

The explicit kanban_heartbeat tool stays unchanged for workers that want to attach a note or pre-emptively extend a claim across a known-long single tool call. No durable note on auto-heartbeats (that's the tool's job).
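The helper's contract described above (env-gated, rate-limited, best-effort) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR. The `write_heartbeat` callback is a hypothetical stand-in for the real `heartbeat_claim` / `heartbeat_worker` DB calls, whose signatures are not shown here. The module-level `_last_write_ts`, the integer parsing of the run id, and the `now` parameter are likewise assumptions made for the example.

```python
import logging
import os
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)

HEARTBEAT_MIN_INTERVAL_SECONDS = 60.0  # rate limit stated in the PR
_last_write_ts: Optional[float] = None  # assumed: per-process rate-limit state


def heartbeat_from_env_sketch(
    write_heartbeat: Callable[[str, Optional[int], Optional[str]], None],
    now: Optional[float] = None,
) -> bool:
    """Return True if a board write succeeded, False if skipped or failed."""
    global _last_write_ts
    try:
        task_id = os.environ.get("HERMES_KANBAN_TASK")
        if not task_id:
            return False  # not a dispatcher-spawned worker: no-op

        now = time.monotonic() if now is None else now
        if (
            _last_write_ts is not None
            and now - _last_write_ts < HEARTBEAT_MIN_INTERVAL_SECONDS
        ):
            return False  # rate-limited: skip the DB write

        # Assumption: the run id is an integer; garbage is ignored, not raised.
        try:
            run_id: Optional[int] = int(os.environ.get("HERMES_KANBAN_RUN_ID", ""))
        except ValueError:
            run_id = None

        claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
        _last_write_ts = now  # record the attempt even if the write fails
        write_heartbeat(task_id, run_id, claim_lock)
        return True
    except Exception:
        # Best-effort: log at debug and return, never propagate.
        logger.debug("auto-heartbeat skipped", exc_info=True)
        return False
```

Recording the attempt time before the write means a failing database is retried at most once per interval, which keeps a broken board from being hit on every chunk.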

Validation

E2E test verified:

| Scenario | Result |
| --- | --- |
| No `HERMES_KANBAN_TASK` → no-op | ✓ `last_heartbeat_at` stays null |
| With env → board heartbeat + claim extended | ✓ `last_heartbeat_at` set, `claim_expires` pushed +900s |
| Rate limit (immediate retry) | ✓ skipped, no DB write |
| Garbage `HERMES_KANBAN_RUN_ID` | ✓ ignored, doesn't crash |
| Stale run_id | ✓ swallowed at heartbeat_worker level |
| `_touch_activity` wiring | ✓ in-process state still updated + board write fires |
| Targeted suites: kanban_tools, kanban_db, kanban_cli, approval_heartbeat, delegate | 470 tests pass |

Closes #31752. Credit @faisfamilytravel for the diagnosis and the patch outline.

github-actions · 2026-05-29T06:27:19Z

🔎 Lint report: `fix/kanban-runtime-heartbeat-bridge-31752` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9425 on HEAD, 9425 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4891 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

alt-glitch · 2026-05-29T06:53:35Z

Supersedes #31773 — same auto-heartbeat bridge fix with rate limiting and env-based identity.

The dispatcher watchdog (release_stale_claims) reads tasks.last_heartbeat_at to decide whether to reclaim a running task. The agent maintains its own in-process `_last_activity_ts` for every chunk/tool result, but those liveness ticks never reach the board unless the model explicitly calls the `kanban_heartbeat` tool, so a worker actively executing a long run without tool-level heartbeats can be reclaimed mid-flight as 'stale', returning the task to ready and orphaning the in-flight worker's progress.

Fix: in `_touch_activity` (the canonical 'we just did work' hook in run_agent.py), call a new `heartbeat_current_worker_from_env` helper in `tools/kanban_tools.py` that:

- No-ops outside dispatcher-spawned worker context (no HERMES_KANBAN_TASK).
- Rate-limited to one DB write per 60s (runtime activity ticks too often to faithfully mirror; we just need the watchdog to see liveness).
- Best-effort: never raises. heartbeat_claim + heartbeat_worker calls are individually try/except'd; any DB error logs at debug and returns.
- Uses worker env identity: HERMES_KANBAN_TASK + HERMES_KANBAN_RUN_ID + HERMES_KANBAN_CLAIM_LOCK (all pinned by the dispatcher at spawn time).
- No durable note on auto-heartbeats; that's reserved for the explicit `kanban_heartbeat` tool which carries a model-supplied note.

The explicit `kanban_heartbeat` tool stays available unchanged for workers that want to attach a note or pre-emptively extend a claim across a known-long single tool call.

Co-authored-by: faisfamilytravel <223516181+faisfamilytravel@users.noreply.github.com>
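The wiring into `_touch_activity` might look something like the sketch below. The `AgentSketch` class and the `_last_activity_desc` attribute are hypothetical stand-ins for the real agent in run_agent.py, and calling `heartbeat_current_worker_from_env()` with no arguments is an assumption, since the commit message does not show its signature. The point is the ordering: in-process state is updated first, and the import only happens inside a kanban worker.

```python
import os
import time


class AgentSketch:
    """Hypothetical stand-in for the agent class in run_agent.py."""

    def __init__(self) -> None:
        self._last_activity_ts = time.time()
        self._last_activity_desc = ""

    def _touch_activity(self, desc: str) -> None:
        # In-process liveness state is always updated first.
        self._last_activity_ts = time.time()
        self._last_activity_desc = desc

        # Non-kanban runs return here and never pay for the import.
        if not os.environ.get("HERMES_KANBAN_TASK"):
            return
        try:
            # Lazy import: resolved only inside a dispatcher-spawned worker.
            from tools.kanban_tools import heartbeat_current_worker_from_env

            heartbeat_current_worker_from_env()  # assumed zero-argument call
        except Exception:
            # Best-effort: a heartbeat failure must never break the agent loop.
            pass
```

Outside the repository the import fails and is swallowed, so the sketch runs standalone while still showing where the board write would fire.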

…29747)

Reporter diagnosed three independent gaps that together allowed infinite 'unblock → re-stuck' loops with no surfacing or escalation:

GAP 1: `_rule_stuck_in_blocked` resets timer on any `commented`/`unblocked` event, so a task that cycles every few minutes is invisible to it regardless of how many times it cycles.

Fix: new `_rule_block_unblock_cycling` rule (`hermes_cli/kanban_diagnostics.py`) that counts block→unblock cycles in a sliding window. Default threshold 3 cycles within 24h, configurable via `block_cycle_threshold` / `block_cycle_window_seconds`. Walks events in arrival order (event id) since multiple events can share the same `created_at` second. Fires as a warning with a CLI hint to inspect the block reasons.

GAP 2: Iteration-budget-exhausted runs in kanban workers map to `kanban_block` (status=blocked, but a clean exit from the kernel's perspective). `_rule_repeated_failures` reads `consecutive_failures`, which `_record_task_failure` increments only for crashed/timed_out/spawn_failed; the `blocked` outcome bypasses the failure counter, so the `kanban.failure_limit` circuit breaker never trips on budget-exhaustion loops.

Fix: `agent/conversation_loop.py` budget-exhaustion path now calls `_record_task_failure(outcome="timed_out")` instead of `kanban_block`. Budget exhaustion is genuinely a timeout-shaped failure (the task ran out of allowed iterations), so this is more honest semantics; it also routes through the unified failure counter, so repeated budget exhaustions trip the circuit breaker and the task auto-blocks with `gave_up` after `failure_limit` retries.

GAP 3: `release_stale_claims` uses `_pid_alive(worker_pid)` only and ignores `last_heartbeat_at`. Reporter observed a 91-min run that held its claim with frozen heartbeat because the worker entered a logic loop with no tool calls; `_pid_alive` kept returning True so the claim was extended every 15 minutes indefinitely.

Fix: heartbeat-stale backstop. If `last_heartbeat_at` is set AND older than `DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS` (default 1h), reclaim even if the PID is alive. NULL `last_heartbeat_at` preserves backward compatibility (no heartbeat yet = extend, as before). The reclaim event payload now includes a `heartbeat_stale` boolean so operators see why a live-PID worker was reclaimed.

This works cleanly in concert with PR #34418 (#31752 runtime → heartbeat bridge): once `_touch_activity` keeps `last_heartbeat_at` fresh as a side effect of normal API traffic, the backstop only fires for genuinely wedged workers (no chunks, no tool results, no progress at all).

Co-authored-by: baofuen <45189813+baofuen@users.noreply.github.com>
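The GAP 3 backstop amounts to a small decision rule, which the following sketch tries to capture. It is an illustration under stated assumptions rather than the body of `release_stale_claims`: the function name, the use of epoch-second floats, and the tuple return shape are all hypothetical, and only the 1h default and the NULL-heartbeat behaviour come from the commit message.

```python
from typing import Optional, Tuple

# "default 1h" per the commit message
DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS = 3600.0


def should_reclaim_sketch(
    pid_alive: bool,
    last_heartbeat_at: Optional[float],
    now: float,
    max_stale_seconds: float = DEFAULT_CLAIM_HEARTBEAT_MAX_STALE_SECONDS,
) -> Tuple[bool, bool]:
    """Return (reclaim, heartbeat_stale) for a claim whose lease has expired."""
    if not pid_alive:
        # Dead worker: reclaim, as before the backstop existed.
        return True, False
    if last_heartbeat_at is None:
        # No heartbeat recorded yet: extend the claim (backward compatible).
        return False, False
    # Live PID: reclaim only when the heartbeat is older than the limit.
    stale = (now - last_heartbeat_at) > max_stale_seconds
    return stale, stale
```

With these inputs, the reported 91-minute frozen heartbeat would be reclaimed with `heartbeat_stale` set, while a worker whose heartbeat was refreshed a minute ago keeps its claim.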

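The GAP 1 rule (counting block→unblock cycles in a sliding window, walking events by id) could be approximated as below. The event tuple layout, the `"blocked"` event kind, and both function names are assumptions made for this sketch; only the `unblocked` kind, the default threshold of 3, the 24h window, and the ordering by event id are taken from the commit message.

```python
from typing import Iterable, Tuple

# Assumed layout: (event_id, created_at_epoch_seconds, kind)
Event = Tuple[int, float, str]

DAY_SECONDS = 24 * 3600.0


def count_block_unblock_cycles(
    events: Iterable[Event], now: float, window_seconds: float = DAY_SECONDS
) -> int:
    """Count completed block -> unblock pairs inside the sliding window."""
    cutoff = now - window_seconds
    cycles = 0
    blocked = False
    # Order by event id, since several events can share one created_at second.
    for _event_id, created_at, kind in sorted(events, key=lambda e: e[0]):
        if created_at < cutoff:
            continue  # outside the window
        if kind == "blocked":
            blocked = True
        elif kind == "unblocked" and blocked:
            cycles += 1
            blocked = False
    return cycles


def is_cycling(
    events: Iterable[Event],
    now: float,
    threshold: int = 3,
    window_seconds: float = DAY_SECONDS,
) -> bool:
    """True when the task has cycled at least `threshold` times in the window."""
    return count_block_unblock_cycles(events, now, window_seconds) >= threshold
```

Sorting by id rather than timestamp is what keeps a block and unblock recorded in the same second from being paired in the wrong order.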

teknium1 mentioned this pull request May 29, 2026

fix(kanban): close three blocked/iteration-exhausted handling gaps (#29747) #34428

Merged

alt-glitch added the labels P3 (Low: cosmetic, nice to have), type/bug (Something isn't working), and comp/agent (Core agent loop, run_agent.py, prompt builder) May 29, 2026

teknium1 force-pushed the fix/kanban-runtime-heartbeat-bridge-31752 branch from e84d011 to a90e224 May 29, 2026 06:57

teknium1 merged commit bc31ee5 into main May 29, 2026
24 checks passed

teknium1 deleted the fix/kanban-runtime-heartbeat-bridge-31752 branch May 29, 2026 07:06

teknium1 mentioned this pull request May 29, 2026

Kanban worker runtime activity does not update board heartbeat, causing stale reclaim of active workers #31752

Closed

teknium1 mentioned this pull request May 29, 2026

Kanban needs first-class waiting states for human, approval, and review gates #29171

Closed

teknium1 mentioned this pull request May 29, 2026

kanban: three gaps in blocked / iteration-exhausted handling that permit infinite "unblock → re-stuck" loops #29747

Closed

alt-glitch mentioned this pull request May 31, 2026

Umbrella: Kanban orchestration gaps — stale detection, silent recovery, orphan sweep, subagent supervision, and related reliability issues #35986

Open
