Summary
The kanban_heartbeat tool that workers call (registered via tools/kanban_tools.py) only updates last_heartbeat_at — it does not extend claim_expires. As a result, a diligent worker that loops kanban_heartbeat while running a long synchronous tool call (e.g. xcodebuild archive, large flutter test, training loop) still gets reclaimed at the default 15-minute claim TTL and re-spawned by the dispatcher. The function name and its docstring imply otherwise.
This is likely the underlying cause of the "reclaims & respawns were exactly 15 minutes apart" symptom reported in #21141. That issue addresses the post-reclaim cleanup (old worker not killed), so the two are complementary fixes, not duplicates: this issue keeps diligent workers from being reclaimed in the first place; #21141 ensures that when reclamation does happen (truly stuck worker), the old process is actually terminated.
Repro
- Create a task with default settings:
    hermes kanban create "long task" --assignee my-profile --workspace dir:/tmp/foo
- Worker is dispatched. In its loop it calls kanban_heartbeat every 30 s.
- Worker's current shell command runs longer than DEFAULT_CLAIM_TTL_SECONDS (15 min).
- Dispatcher's release_stale_claims() (kanban_db.py:1846) reclaims the task because claim_expires < now, even though last_heartbeat_at is fresh.
- A new worker is spawned for the same task, with a risk of duplicate work or corruption on shared workspaces.
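The sequence above can be modelled in isolation. This is a minimal sketch, not the hermes code: the table layout, the function names (claim, worker_heartbeat, stale_task_ids, simulate) and the integer clock are assumptions made for illustration. Only the 15-minute TTL and the claim_expires < now condition are taken from the report.

```python
import sqlite3

CLAIM_TTL_SECONDS = 15 * 60  # mirrors the 15-minute default described above


def make_db() -> sqlite3.Connection:
    # Hypothetical minimal schema; the real tasks table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    return conn


def claim(conn: sqlite3.Connection, tid: str, now: int) -> None:
    # Claiming sets the expiry once, TTL seconds into the future.
    conn.execute(
        "INSERT INTO tasks VALUES (?, 'running', ?, ?)",
        (tid, now + CLAIM_TTL_SECONDS, now),
    )


def worker_heartbeat(conn: sqlite3.Connection, tid: str, now: int) -> None:
    # Models the reported behaviour: only last_heartbeat_at moves.
    conn.execute(
        "UPDATE tasks SET last_heartbeat_at = ? WHERE id = ?", (now, tid)
    )


def stale_task_ids(conn: sqlite3.Connection, now: int) -> list:
    # Models the reclaim condition from the repro: claim_expires < now.
    rows = conn.execute(
        "SELECT id FROM tasks WHERE status = 'running' AND claim_expires < ?",
        (now,),
    ).fetchall()
    return [row[0] for row in rows]


def simulate(minutes: int, interval: int = 30) -> list:
    # Claim at t=0, heartbeat every `interval` seconds, then ask which
    # tasks the stale-claim check would pick up at the end.
    conn = make_db()
    claim(conn, "t1", now=0)
    end = minutes * 60
    for now in range(interval, end + 1, interval):
        worker_heartbeat(conn, "t1", now)
    return stale_task_ids(conn, end)
```

In this model the heartbeat frequency is irrelevant: a run of 14 minutes is never flagged, and a run of 16 minutes is always flagged, because nothing in the heartbeat path writes claim_expires.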
Root cause
tools/kanban_tools.py:317-348 (the _handle_heartbeat function) calls kb.heartbeat_worker(...):
    ok = kb.heartbeat_worker(
        conn,
        tid,
        note=note,
        expected_run_id=_worker_run_id(tid),
    )
heartbeat_worker (hermes_cli/kanban_db.py:2641-2691) only updates last_heartbeat_at on tasks and task_runs, plus appends a heartbeat event. It never touches claim_expires.
The TTL-extending function is heartbeat_claim (hermes_cli/kanban_db.py:1817-1844). Its docstring even states the contract:
"Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership."
But no caller in the worker tool path invokes it. Workers can't call it themselves either — heartbeat_claim is not exposed via any tool.
Test gap
The kanban_heartbeat tool tests (tests/tools/test_kanban_tools.py:202-218) only check the tool returns ok: true — they don't verify claim_expires actually moves. The heartbeat_claim function is well-tested in isolation (tests/hermes_cli/test_kanban_db.py:231 test_heartbeat_extends_claim), but the integration through the tool is unverified, which is how this regression slipped past CI.
Proposed fix
In tools/kanban_tools.py, _handle_heartbeat should also extend the claim. Two-line change:
    import os  # needed for os.environ; add at module top if not already imported

    def _handle_heartbeat(args: dict, **kw) -> str:
        tid = _default_task_id(args.get("task_id"))
        if not tid:
            return tool_error(...)
        ownership_err = _enforce_worker_task_ownership(tid)
        if ownership_err:
            return ownership_err
        note = args.get("note")
        try:
            kb, conn = _connect()
            try:
                # Extend the claim TTL. Without this, a worker that heartbeats
                # diligently still gets reclaimed at DEFAULT_CLAIM_TTL_SECONDS.
                # The claim_lock check inside heartbeat_claim prevents extending
                # a claim we no longer own.
                claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
                kb.heartbeat_claim(conn, tid, claimer=claim_lock)
                ok = kb.heartbeat_worker(
                    conn, tid, note=note,
                    expected_run_id=_worker_run_id(tid),
                )
                if not ok:
                    return tool_error(
                        f"could not heartbeat {tid} (unknown id or not running)"
                    )
                return _ok(task_id=tid)
            finally:
                conn.close()
        except Exception as e:
            logger.exception("kanban_heartbeat failed")
            return tool_error(f"kanban_heartbeat: {e}")
The dispatcher already sets HERMES_KANBAN_CLAIM_LOCK in the worker env (hermes_cli/kanban_db.py:3293), so claim_lock is the right value to pass. If heartbeat_claim returns False (the worker no longer owns the claim — was reclaimed), we let heartbeat_worker also fail and the tool surfaces the standard "not running" error to the worker, who can then exit cleanly.
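To illustrate why passing the claim lock matters, here is a rough model of an ownership-guarded TTL extension. It is a sketch under stated assumptions, not the body of heartbeat_claim: the schema, demo_db, extend_claim and the integer clock are hypothetical, and the real function may check more than this.

```python
import sqlite3
from typing import Optional


def demo_db() -> sqlite3.Connection:
    # Hypothetical schema with one running task owned by "lock-a",
    # whose claim expires at t=900.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
        "claim_lock TEXT, claim_expires INTEGER)"
    )
    conn.execute("INSERT INTO tasks VALUES ('t1', 'running', 'lock-a', 900)")
    return conn


def extend_claim(
    conn: sqlite3.Connection,
    tid: str,
    claimer: Optional[str],
    now: int,
    ttl: int = 900,
) -> bool:
    # The WHERE clause is the ownership guard: the row is only updated
    # when the caller's lock matches the stored one. A claimer of None
    # binds as NULL, and "claim_lock = NULL" never matches in SQL, so a
    # worker without the env var cannot extend anything.
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? "
        "WHERE id = ? AND claim_lock = ? AND status = 'running'",
        (now + ttl, tid, claimer),
    )
    return cur.rowcount == 1
```

With this shape, a worker whose task has been reclaimed under a different lock gets False back and leaves the new owner's expiry untouched, which is the property the proposed fix relies on.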
Test that would have caught this
    import json
    import time

    def test_heartbeat_extends_claim(worker_env):
        """The kanban_heartbeat tool must extend claim_expires, not just
        update last_heartbeat_at — otherwise long-running workers are reclaimed
        despite heartbeating."""
        from tools import kanban_tools as kt
        from hermes_cli import kanban_db as kb
        conn = kb.connect()
        try:
            before = conn.execute(
                "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
            ).fetchone()["claim_expires"]
        finally:
            conn.close()
        time.sleep(1)  # ensure now() > before
        out = kt._handle_heartbeat({"note": "still alive"})
        assert json.loads(out)["ok"] is True
        conn = kb.connect()
        try:
            after = conn.execute(
                "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
            ).fetchone()["claim_expires"]
        finally:
            conn.close()
        assert after > before, (
            f"claim_expires did not advance ({before} -> {after}); "
            f"worker would be reclaimed at TTL despite heartbeating"
        )
Severity
Medium. Workers that finish under 15 min are unaffected. Workers that exceed 15 min on a single tool call (Xcode Archive, large image generation, dataset processing) experience silent re-spawn — they appear to "loop" from the user's perspective and their first run's progress is discarded. Particularly painful when combined with --max-runtime since the per-task wall budget is consumed by the reclaimed first run, leaving the re-spawn with less budget than expected.
Related
tools/kanban_tools.py:317-348 — bug site
hermes_cli/kanban_db.py:1817-1844 — heartbeat_claim
hermes_cli/kanban_db.py:2641-2691 — heartbeat_worker
hermes_cli/kanban_db.py:1846+ — release_stale_claims (the function that reclaims)
hermes_cli/kanban_db.py:3293 — dispatcher sets HERMES_KANBAN_CLAIM_LOCK in worker env
I'm happy to follow up with a PR if useful.