Kanban worker runtime activity does not update board heartbeat, causing stale reclaim of active workers #31752

@faisfamilytravel

Kanban worker activity does not update board heartbeat, causing stale reclaim of active workers

Summary

Dispatcher-spawned Kanban workers can remain active in the agent runtime while the Kanban board still shows tasks.last_heartbeat_at = NULL and task_runs.last_heartbeat_at = NULL. The dispatcher watchdog reads the board heartbeat fields, not the agent's in-process activity timestamp, so long-running active workers can be reclaimed and respawned as stale.

Local fix candidate: bridge AIAgent._touch_activity() to a non-tool Kanban heartbeat helper when HERMES_KANBAN_TASK is set. The bridge updates the board heartbeat and claim TTL, is rate-limited to one write per 60 seconds, does not persist activity descriptions, and is non-fatal on DB errors.

Observed failure mode

  • Workers were actively executing probe work.
  • Board liveness stayed null: tasks.last_heartbeat_at / task_runs.last_heartbeat_at were not updated unless the model explicitly called kanban_heartbeat.
  • detect_stale_running() saw a long-running task with null heartbeat and reclaimed it.
  • The task was returned to ready and could be re-spawned, causing worker context loss.

This is distinct from task_events rows with kind='heartbeat' and payload=NULL. A null event payload is expected when a heartbeat carries no note. The defect is that the board liveness fields stay null while the worker is active.

Root cause

Hermes currently has two separate liveness signals:

  1. Agent/runtime activity: AIAgent._touch_activity(desc) updates in-process activity fields.
  2. Kanban watchdog liveness: tasks.last_heartbeat_at and task_runs.last_heartbeat_at.

If the model does not explicitly call the kanban_heartbeat tool, ordinary runtime activity does not reach the Kanban DB. The watchdog then acts correctly on incomplete liveness data and reclaims an active worker.
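The gap between the two signals can be illustrated with a minimal sketch. It is not the real detect_stale_running(): the RunningTask type, the is_stale() helper, the fallback to the start time, and the 900-second timeout are all assumptions made for illustration. It only shows why a watchdog that reads the board field alone cannot tell a silent worker from a dead one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningTask:
    """Hypothetical, simplified view of a running board task."""
    task_id: str
    started_at: float
    last_heartbeat_at: Optional[float]  # None mirrors a NULL board field


def is_stale(task: RunningTask, now: float, stale_timeout: float) -> bool:
    # Assumed rule: with no heartbeat on the board, fall back to the start
    # time, so any long-running task with a NULL heartbeat looks stale.
    last_seen = task.last_heartbeat_at
    if last_seen is None:
        last_seen = task.started_at
    return now - last_seen > stale_timeout


now = 10_000.0
# Both workers started an hour ago and are busy in the agent runtime.
silent = RunningTask("t_silent", started_at=now - 3600, last_heartbeat_at=None)
bridged = RunningTask("t_bridged", started_at=now - 3600, last_heartbeat_at=now - 30)

stale_ids = [t.task_id for t in (silent, bridged) if is_stale(t, now, stale_timeout=900)]
print(stale_ids)
```

Only the worker whose activity never reached the board is flagged, even though both are equally active in-process.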

Local patch shape

Files changed locally:

  • run_agent.py

    • AIAgent._touch_activity() now calls a best-effort Kanban heartbeat bridge when HERMES_KANBAN_TASK is set.
    • Write rate limit: 60 seconds minimum between auto-heartbeat DB writes.
    • Exceptions are swallowed and logged at debug level.
    • Runtime activity descriptions are not written to durable task events.
  • tools/kanban_tools.py

    • Adds heartbeat_current_worker_from_env() helper.
    • Uses worker env identity:
      • HERMES_KANBAN_TASK
      • HERMES_KANBAN_RUN_ID
      • HERMES_KANBAN_CLAIM_LOCK
    • Calls heartbeat_claim() and heartbeat_worker().

Explicit kanban_heartbeat remains unchanged and remains the correct path for worker-provided human-readable heartbeat notes.
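The run_agent.py side of the patch can be sketched roughly as follows. This is an illustration of the described behavior, not the patch itself: ActivityHeartbeatBridge, its constructor arguments, and the injected write_heartbeat callable are hypothetical names. Only HERMES_KANBAN_TASK, the 60-second limit, and the debug-level logging come from the description above. The sketch also assumes a failed write does not consume the rate-limit window, which the actual patch may or may not do.

```python
import logging
import os
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ActivityHeartbeatBridge:
    """Hypothetical rate-limited, best-effort bridge for _touch_activity()."""

    def __init__(
        self,
        write_heartbeat: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
        min_interval: float = 60.0,
    ) -> None:
        self._write = write_heartbeat
        self._clock = clock
        self._min_interval = min_interval
        self._last_write: Optional[float] = None

    def touch(self) -> bool:
        """Return True only when a board heartbeat was actually written."""
        if not os.environ.get("HERMES_KANBAN_TASK"):
            return False  # not a Kanban worker: never touch the board
        now = self._clock()
        if self._last_write is not None and now - self._last_write < self._min_interval:
            return False  # rate-limited: skip the DB write
        try:
            # No activity description is passed, so none can be persisted.
            self._write()
        except Exception:
            # Non-fatal by design: liveness reporting must not crash the worker.
            logger.debug("kanban auto-heartbeat failed", exc_info=True)
            return False
        self._last_write = now
        return True
```

Injecting the clock and the write callable keeps the rate limit testable without sleeping or opening a real database.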

Tests added locally

New focused tests in tests/run_agent/test_kanban_auto_heartbeat.py verify that:

  1. _touch_activity() in a Kanban worker sets tasks.last_heartbeat_at and task_runs.last_heartbeat_at.
  2. _touch_activity() outside a Kanban worker does not connect to or mutate Kanban.
  3. Auto-heartbeat is rate-limited to prevent write churn.
  4. Auto-heartbeat extends claim_expires through the claim heartbeat path.
  5. A long-running task older than stale timeout but recently auto-heartbeated is not reclaimed by detect_stale_running().
  6. Heartbeat bridge failures are non-fatal to _touch_activity().
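As a rough illustration of what cases 1, 2, and 4 exercise, the sketch below runs a stand-in helper against a throwaway in-memory SQLite database. Everything about the schema is an assumption (table columns, the claim_lock column, the 900-second TTL), and heartbeat_from_env() is a hypothetical stand-in for heartbeat_current_worker_from_env(); the real helper delegates to heartbeat_claim() and heartbeat_worker(), whose signatures are not shown in this issue.

```python
import sqlite3


def heartbeat_from_env(conn, env, now, claim_ttl=900):
    """Hypothetical stand-in: refresh board liveness for the env-identified worker."""
    task_id = env.get("HERMES_KANBAN_TASK")
    if not task_id:
        return False  # outside a Kanban worker: do not mutate the board
    run_id = env.get("HERMES_KANBAN_RUN_ID")
    lock = env.get("HERMES_KANBAN_CLAIM_LOCK")
    with conn:  # commit both updates together
        cur = conn.execute(
            "UPDATE tasks SET last_heartbeat_at = ?, claim_expires = ? "
            "WHERE id = ? AND claim_lock = ? AND status = 'running'",
            (now, now + claim_ttl, task_id, lock),
        )
        if cur.rowcount != 1:
            return False  # claim no longer held by this worker
        if run_id is not None:
            conn.execute(
                "UPDATE task_runs SET last_heartbeat_at = ? WHERE id = ?",
                (now, int(run_id)),
            )
    return True


# Assumed minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT,
                    claim_expires INTEGER, last_heartbeat_at INTEGER);
CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id TEXT,
                        last_heartbeat_at INTEGER);
INSERT INTO tasks VALUES ('t_demo', 'running', 'lock-1', 100, NULL);
INSERT INTO task_runs VALUES (1, 't_demo', NULL);
""")

env = {
    "HERMES_KANBAN_TASK": "t_demo",
    "HERMES_KANBAN_RUN_ID": "1",
    "HERMES_KANBAN_CLAIM_LOCK": "lock-1",
}
wrote = heartbeat_from_env(conn, env, now=1000)
task_row = conn.execute(
    "SELECT last_heartbeat_at, claim_expires FROM tasks WHERE id = 't_demo'"
).fetchone()
run_row = conn.execute("SELECT last_heartbeat_at FROM task_runs WHERE id = 1").fetchone()
skipped = heartbeat_from_env(conn, {}, now=2000)  # no worker env: no write
```

Matching on the claim lock means a worker whose task was already reclaimed cannot refresh a claim it no longer owns; whether the real primitives enforce this the same way is an assumption here.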

Local validation:

venv/bin/python -m pytest tests/run_agent/test_kanban_auto_heartbeat.py -q
6 passed

venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -q
172 passed

venv/bin/python -m pytest tests/run_agent/test_run_agent.py -q
339 passed

venv/bin/python -m pytest tests/tools/test_kanban_tools.py -q
81 passed

Local live validation

Mission Control was restarted to load the patched runtime. A synthetic live Kanban task then ran without any explicit kanban_heartbeat call. Runtime activity alone set both the task and run heartbeat timestamps, the heartbeat event payload contained no activity text, and detect_stale_running() did not reclaim the task.

Validation task: t_f887eedb
Validation run: 184

Result:

{
  "heartbeat_event_count": 1,
  "heartbeat_payload_contains_activity_text": false,
  "reclaimed": [],
  "run_id": 184,
  "run_last_heartbeat_at": 1779670296,
  "task_id": "t_f887eedb",
  "task_last_heartbeat_at": 1779670296,
  "task_status": "running"
}
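A result of this shape can be checked mechanically. The sketch below is one hypothetical way to do that: check_result() is not part of the patch, and it assumes the heartbeat fields are Unix epoch seconds.

```python
import json
from datetime import datetime, timezone

result = json.loads("""
{
  "heartbeat_event_count": 1,
  "heartbeat_payload_contains_activity_text": false,
  "reclaimed": [],
  "run_id": 184,
  "run_last_heartbeat_at": 1779670296,
  "task_id": "t_f887eedb",
  "task_last_heartbeat_at": 1779670296,
  "task_status": "running"
}
""")


def check_result(r):
    """Return a list of violated expectations (empty means the run looks good)."""
    problems = []
    if r["task_last_heartbeat_at"] is None or r["run_last_heartbeat_at"] is None:
        problems.append("board heartbeat is still null")
    if r["heartbeat_payload_contains_activity_text"]:
        problems.append("activity text was persisted")
    if r["task_id"] in r["reclaimed"]:
        problems.append("active task was reclaimed")
    if r["task_status"] != "running":
        problems.append("task is no longer running")
    return problems


problems = check_result(result)
# Assumes epoch seconds; converts the heartbeat to a readable UTC timestamp.
heartbeat_utc = datetime.fromtimestamp(
    result["task_last_heartbeat_at"], tz=timezone.utc
).isoformat()
print(problems, heartbeat_utc)
```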

Relation to prior local fixes

Related local context:

This heartbeat fix is independent of those dispatcher/WAL fixes. It adds rate-limited worker-side writes and does not change dispatcher connection caching or WAL-safe close behavior.

Local commit:

82e66d75a09ca1e83f6e1b7cb88934f04385245a fix(kanban): bridge worker activity to heartbeat

Why this should be upstreamable

The fix is not site-specific:

  • It uses existing worker env variables.
  • It uses existing DB primitives (heartbeat_claim, heartbeat_worker).
  • It preserves the explicit heartbeat tool.
  • It avoids durable storage of runtime activity text.
  • It rate-limits writes to avoid WAL churn.
  • It leaves watchdog reclaim semantics intact.

Suggested acceptance criteria

  • Dispatcher-spawned workers update board heartbeat timestamps during normal runtime activity, even if the model never explicitly calls kanban_heartbeat.
  • Auto-heartbeat writes are rate-limited.
  • Auto-heartbeat failure cannot crash worker execution.
  • Activity descriptions are not persisted as heartbeat event payloads.
  • Existing explicit kanban_heartbeat behavior remains unchanged.

Metadata

Assignees

No one assigned

    Labels

    P3: Low (cosmetic, nice to have)
    comp/agent: Core agent loop, run_agent.py, prompt builder
    comp/plugins: Plugin system and bundled plugins
    type/bug: Something isn't working
