Kanban worker activity does not update board heartbeat, causing stale reclaim of active workers
Summary
Dispatcher-spawned Kanban workers can remain active in the agent runtime while the Kanban board still shows tasks.last_heartbeat_at = NULL and task_runs.last_heartbeat_at = NULL. The dispatcher watchdog reads the board heartbeat fields, not the agent's in-process activity timestamp, so long-running active workers can be reclaimed and respawned as stale.
Local fix candidate: bridge AIAgent._touch_activity() to a non-tool Kanban heartbeat helper when HERMES_KANBAN_TASK is set. The bridge updates the board heartbeat and claim TTL, is rate-limited to one write per 60 seconds, does not persist activity descriptions, and is non-fatal on DB errors.
Observed failure mode
- Workers were actively executing probe work.
- Board liveness stayed null: tasks.last_heartbeat_at / task_runs.last_heartbeat_at were not updated unless the model explicitly called kanban_heartbeat.
- detect_stale_running() saw a long-running task with null heartbeat and reclaimed it.
- The task was returned to ready and could be re-spawned, causing worker context loss.
This is distinct from task_events.kind='heartbeat' rows with payload=NULL. Null event payload is expected when a heartbeat has no note. The defect is null board liveness while the worker is active.
Root cause
Hermes currently has two separate liveness signals:
- Agent/runtime activity: AIAgent._touch_activity(desc) updates in-process activity fields.
- Kanban watchdog liveness: tasks.last_heartbeat_at and task_runs.last_heartbeat_at.
If the model does not explicitly call the kanban_heartbeat tool, ordinary runtime activity does not reach the Kanban DB. The watchdog then acts correctly on incomplete liveness data and reclaims an active worker.
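The reclaim decision described above can be illustrated with a minimal sketch. This is not the actual detect_stale_running() implementation: the function name looks_stale, the timeout value, and the fallback to the task start time when the heartbeat is NULL are all assumptions made for illustration.

```python
from typing import Optional

STALE_TIMEOUT_SECONDS = 15 * 60  # assumed value, for illustration only


def looks_stale(started_at: int, last_heartbeat_at: Optional[int], now: int,
                timeout: int = STALE_TIMEOUT_SECONDS) -> bool:
    """Hypothetical stand-in for the watchdog's per-task staleness test."""
    # Assumption: with no heartbeat recorded, the start time is the only
    # liveness evidence the watchdog has.
    last_seen = last_heartbeat_at if last_heartbeat_at is not None else started_at
    return now - last_seen > timeout


now = 10_000
# Active worker that never called kanban_heartbeat: board heartbeat is NULL.
stale_without_heartbeat = looks_stale(started_at=now - 3600, last_heartbeat_at=None, now=now)
# Same worker, but a heartbeat reached the board 30 seconds ago.
stale_with_heartbeat = looks_stale(started_at=now - 3600, last_heartbeat_at=now - 30, now=now)
print(stale_without_heartbeat, stale_with_heartbeat)
```

Under these assumptions the watchdog has no way to tell a busy worker from a dead one once the task outlives the timeout, which is why the fix targets the missing write rather than the reclaim logic.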
Local patch shape
Files changed locally:
- run_agent.py
  - AIAgent._touch_activity() now calls a best-effort Kanban heartbeat bridge when HERMES_KANBAN_TASK is set.
  - Write rate limit: 60 seconds minimum between auto-heartbeat DB writes.
  - Exceptions are swallowed and logged at debug level.
  - Runtime activity descriptions are not written to durable task events.
- tools/kanban_tools.py
  - Adds heartbeat_current_worker_from_env() helper.
  - Uses worker env identity: HERMES_KANBAN_TASK, HERMES_KANBAN_RUN_ID, HERMES_KANBAN_CLAIM_LOCK.
  - Calls heartbeat_claim() and heartbeat_worker().
Explicit kanban_heartbeat remains unchanged and remains the correct path for worker-provided human-readable heartbeat notes.
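As a rough sketch of how a helper might resolve worker identity from the environment before touching the board, consider the following. The WorkerIdentity class and worker_identity_from_env function are hypothetical names, and the assumption that the run id is an integer is inferred only from the validation run number in this report; the real heartbeat_current_worker_from_env() may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerIdentity:  # hypothetical container, not part of Hermes
    task_id: str
    run_id: Optional[int]
    claim_lock: Optional[str]


def worker_identity_from_env(env: Mapping[str, str] = os.environ) -> Optional[WorkerIdentity]:
    """Return the worker identity, or None when not running as a Kanban worker."""
    task_id = env.get("HERMES_KANBAN_TASK", "").strip()
    if not task_id:
        return None  # not a Kanban worker: the caller should never open the DB
    raw_run_id = env.get("HERMES_KANBAN_RUN_ID", "").strip()
    # Assumption: run ids are non-negative integers; anything else is ignored.
    run_id = int(raw_run_id) if raw_run_id.isdigit() else None
    claim_lock = env.get("HERMES_KANBAN_CLAIM_LOCK") or None
    return WorkerIdentity(task_id, run_id, claim_lock)


print(worker_identity_from_env({}))
print(worker_identity_from_env({"HERMES_KANBAN_TASK": "t_example", "HERMES_KANBAN_RUN_ID": "7"}))
```

Returning None early keeps the non-worker path free of any database access, which matches the behavior the tests below check for.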
Tests added locally
New focused tests in tests/run_agent/test_kanban_auto_heartbeat.py prove:
- _touch_activity() in a Kanban worker sets tasks.last_heartbeat_at and task_runs.last_heartbeat_at.
- _touch_activity() outside a Kanban worker does not connect to or mutate Kanban.
- Auto-heartbeat is rate-limited to prevent write churn.
- Auto-heartbeat extends claim_expires through the claim heartbeat path.
- A long-running task older than the stale timeout but recently auto-heartbeated is not reclaimed by detect_stale_running().
- Heartbeat bridge failures are non-fatal to _touch_activity().
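The rate-limit and non-fatal properties listed above can be exercised against a small stand-in. The HeartbeatBridge class below is a hypothetical sketch, not the code in run_agent.py: the injected write callable stands for the board update, the injectable clock exists only to make the example deterministic, and counting failed attempts toward the rate limit is an assumption.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class HeartbeatBridge:
    """Hypothetical sketch: rate-limited, best-effort heartbeat forwarding."""

    def __init__(self, write: Callable[[], None], min_interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._write = write
        self._min_interval = min_interval
        self._clock = clock
        self._last_attempt: Optional[float] = None  # no write attempted yet

    def touch(self) -> bool:
        """Attempt a heartbeat write; return True only if a write succeeded."""
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < self._min_interval:
            return False  # inside the rate-limit window: skip the DB entirely
        # Assumption: failed attempts also count, so DB errors cannot cause churn.
        self._last_attempt = now
        try:
            self._write()
        except Exception:
            logger.debug("auto-heartbeat failed", exc_info=True)
            return False
        return True


# Deterministic usage with a fake clock instead of real time.
fake_now = [0.0]
writes = []
bridge = HeartbeatBridge(lambda: writes.append(fake_now[0]), clock=lambda: fake_now[0])
bridge.touch()      # first call writes
fake_now[0] = 30.0
bridge.touch()      # skipped: only 30 s since the last attempt
fake_now[0] = 61.0
bridge.touch()      # writes again
print(writes)
```

Injecting the clock and the write callable keeps such a test independent of wall-clock time and of a real Kanban database.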
Local validation:
venv/bin/python -m pytest tests/run_agent/test_kanban_auto_heartbeat.py -q
6 passed
venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -q
172 passed
venv/bin/python -m pytest tests/run_agent/test_run_agent.py -q
339 passed
venv/bin/python -m pytest tests/tools/test_kanban_tools.py -q
81 passed
Local live validation
Mission Control was restarted to load the patched runtime. A synthetic live Kanban task validated that runtime activity without explicit kanban_heartbeat set both task and run heartbeat timestamps, did not persist activity text in the heartbeat event payload, and was not reclaimed by detect_stale_running().
Validation task: t_f887eedb
Validation run: 184
Result:
{
  "heartbeat_event_count": 1,
  "heartbeat_payload_contains_activity_text": false,
  "reclaimed": [],
  "run_id": 184,
  "run_last_heartbeat_at": 1779670296,
  "task_id": "t_f887eedb",
  "task_last_heartbeat_at": 1779670296,
  "task_status": "running"
}
Relation to prior local fixes
Related local context:
This heartbeat fix is independent of those dispatcher/WAL fixes. It adds rate-limited worker-side writes and does not change dispatcher connection caching or WAL-safe close behavior.
Local commit:
82e66d75a09ca1e83f6e1b7cb88934f04385245a fix(kanban): bridge worker activity to heartbeat
Why this should be upstreamable
The fix is not site-specific:
- It uses existing worker env variables.
- It uses existing DB primitives (heartbeat_claim, heartbeat_worker).
- It preserves the explicit heartbeat tool.
- It avoids durable storage of runtime activity text.
- It rate-limits writes to avoid WAL churn.
- It leaves watchdog reclaim semantics intact.
Suggested acceptance criteria
- Dispatcher-spawned workers update board heartbeat timestamps during normal runtime activity, even if the model never explicitly calls kanban_heartbeat.
- Auto-heartbeat writes are rate-limited.
- Auto-heartbeat failure cannot crash worker execution.
- Activity descriptions are not persisted as heartbeat event payloads.
- Existing explicit kanban_heartbeat behavior remains unchanged.