[Bug]: kanban_heartbeat tool doesn't extend claim TTL — diligent workers reclaimed at 15min

## Summary

The `kanban_heartbeat` tool that workers call (registered via `tools/kanban_tools.py`) only updates `last_heartbeat_at`; it does **not** extend `claim_expires`. As a result, a diligent worker that loops `kanban_heartbeat` while running a long synchronous tool call (e.g. `xcodebuild archive`, a large `flutter test`, a training loop) still gets reclaimed at the default 15-minute claim TTL and re-spawned by the dispatcher. The tool's name and docstring imply that heartbeating keeps the claim alive.

This is likely the underlying cause of the "reclaims & respawns were exactly 15 minutes apart" symptom reported in #21141, which addresses the post-reclaim cleanup (the old worker is not killed). The two issues are complementary, not duplicates: this issue keeps diligent workers from being reclaimed in the first place, while #21141 ensures that when reclamation *does* happen (a truly stuck worker), the old process is actually terminated.

## Repro

1. Create a task with default settings: `hermes kanban create "long task" --assignee my-profile --workspace dir:/tmp/foo`
2. Worker is dispatched. In its loop it calls `kanban_heartbeat` every 30 s.
3. Worker's current shell command runs longer than `DEFAULT_CLAIM_TTL_SECONDS` (15 min).
4. Dispatcher's `release_stale_claims()` (`kanban_db.py:1846`) reclaims the task because `claim_expires < now`, even though `last_heartbeat_at` is fresh.
5. A new worker is spawned for the same task — duplicate work / corruption risk on shared workspaces.
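
The timeline in the steps above can be sketched as a small simulation. This is only an illustration of the described behaviour, not the dispatcher's code: `simulate` and its parameters are hypothetical, and the only facts taken from this report are the 15-minute TTL, the 30 s heartbeat interval, and the `claim_expires < now` check.

```python
CLAIM_TTL_SECONDS = 15 * 60      # default claim TTL described in this report
HEARTBEAT_INTERVAL_SECONDS = 30  # heartbeat cadence from repro step 2


def simulate(run_seconds, heartbeat_extends_claim):
    """Hypothetical model of one worker run.

    Returns (reclaimed_at, last_heartbeat_at); reclaimed_at is None when
    the claim never goes stale during the run.
    """
    claim_expires = CLAIM_TTL_SECONDS  # task claimed at t = 0
    last_heartbeat_at = 0
    for now in range(run_seconds + 1):
        if now > 0 and now % HEARTBEAT_INTERVAL_SECONDS == 0:
            last_heartbeat_at = now
            if heartbeat_extends_claim:
                claim_expires = now + CLAIM_TTL_SECONDS
        # Stale check as described in step 4: only claim_expires is consulted,
        # so a fresh last_heartbeat_at does not protect the worker.
        if claim_expires < now:
            return now, last_heartbeat_at
    return None, last_heartbeat_at


current = simulate(60 * 60, heartbeat_extends_claim=False)
proposed = simulate(60 * 60, heartbeat_extends_claim=True)
print(current)   # reclaimed one second past the TTL, one second after a heartbeat
print(proposed)  # never reclaimed
```

In this model the current behaviour reclaims the task one second after a heartbeat landed, which matches the "exactly 15 minutes apart" symptom.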

## Root cause

`tools/kanban_tools.py:317-348` (the `_handle_heartbeat` function) calls `kb.heartbeat_worker(...)`:

```python
ok = kb.heartbeat_worker(
    conn,
    tid,
    note=note,
    expected_run_id=_worker_run_id(tid),
)
```

`heartbeat_worker` (`hermes_cli/kanban_db.py:2641-2691`) only updates `last_heartbeat_at` on `tasks` and `task_runs`, and appends a `heartbeat` event. It never touches `claim_expires`.

The TTL-extending function is `heartbeat_claim` (`hermes_cli/kanban_db.py:1817-1844`). Its docstring even states the contract:

> *"Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership."*

But no caller in the worker tool path invokes it. Workers can't call it themselves either — `heartbeat_claim` is not exposed via any tool.
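
To make the distinction concrete, here is a minimal sketch against an in-memory SQLite table. The schema and the helpers `touch_heartbeat`, `extend_claim`, and `stale_ids` are hypothetical stand-ins invented for illustration; they are not the actual `heartbeat_worker`, `heartbeat_claim`, or `release_stale_claims` implementations, only a model of the behaviour described above.

```python
import sqlite3

TTL = 15 * 60  # seconds; mirrors the 15-minute default from this report


def make_db():
    # Hypothetical, heavily reduced schema for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, claim_lock TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    conn.execute("INSERT INTO tasks VALUES ('t1', 'lock-a', ?, 0)", (TTL,))
    return conn


def touch_heartbeat(conn, tid, now):
    # Stand-in for the current tool path: liveness timestamp only.
    conn.execute("UPDATE tasks SET last_heartbeat_at = ? WHERE id = ?", (now, tid))


def extend_claim(conn, tid, claimer, now):
    # Stand-in for a TTL extension guarded by the claim lock.
    cur = conn.execute(
        "UPDATE tasks SET claim_expires = ? WHERE id = ? AND claim_lock = ?",
        (now + TTL, tid, claimer),
    )
    return cur.rowcount == 1


def stale_ids(conn, now):
    # Stand-in for the reclaim check: claim_expires < now.
    rows = conn.execute("SELECT id FROM tasks WHERE claim_expires < ?", (now,))
    return [row[0] for row in rows]


conn = make_db()
touch_heartbeat(conn, "t1", now=890)
print(stale_ids(conn, now=901))   # task is still considered stale
print(extend_claim(conn, "t1", "lock-a", now=890))
print(stale_ids(conn, now=901))   # no longer stale once the claim is extended
```

The lock guard in `extend_claim` reflects the ownership check this report attributes to `heartbeat_claim`: a caller holding a different lock gets `False` and the expiry is left alone.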

## Test gap

The kanban_heartbeat tool tests (`tests/tools/test_kanban_tools.py:202-218`) only check the tool returns `ok: true` — they don't verify `claim_expires` actually moves. The `heartbeat_claim` function is well-tested in isolation (`tests/hermes_cli/test_kanban_db.py:231 test_heartbeat_extends_claim`), but the integration through the tool is unverified, which is how this regression slipped past CI.

## Proposed fix

In `tools/kanban_tools.py`, `_handle_heartbeat` should also extend the claim. Two-line change:

```python
def _handle_heartbeat(args: dict, **kw) -> str:
    tid = _default_task_id(args.get("task_id"))
    if not tid:
        return tool_error(...)
    ownership_err = _enforce_worker_task_ownership(tid)
    if ownership_err:
        return ownership_err
    note = args.get("note")
    try:
        kb, conn = _connect()
        try:
            # Extend the claim TTL — without this, a worker that heartbeats
            # diligently still gets reclaimed at DEFAULT_CLAIM_TTL_SECONDS.
            # The claim_lock check inside heartbeat_claim prevents extending
            # a claim we no longer own.
            # NOTE: needs `import os` at module top if not already imported.
            claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
            kb.heartbeat_claim(conn, tid, claimer=claim_lock)

            ok = kb.heartbeat_worker(
                conn, tid, note=note,
                expected_run_id=_worker_run_id(tid),
            )
            if not ok:
                return tool_error(
                    f"could not heartbeat {tid} (unknown id or not running)"
                )
            return _ok(task_id=tid)
        finally:
            conn.close()
    except Exception as e:
        logger.exception("kanban_heartbeat failed")
        return tool_error(f"kanban_heartbeat: {e}")
```

The dispatcher already sets `HERMES_KANBAN_CLAIM_LOCK` in the worker env (`hermes_cli/kanban_db.py:3293`), so `claim_lock` is the right value to pass. If `heartbeat_claim` returns False (the worker no longer owns the claim because it was reclaimed), its return value is deliberately ignored: `heartbeat_worker` is left to fail as well, and the tool surfaces the standard "not running" error to the worker, which can then exit cleanly.

## Test that would have caught this

```python
def test_heartbeat_extends_claim(worker_env):
    """The kanban_heartbeat tool must extend claim_expires, not just
    update last_heartbeat_at — otherwise long-running workers are reclaimed
    despite heartbeating."""
    import json
    import time

    from tools import kanban_tools as kt
    from hermes_cli import kanban_db as kb

    conn = kb.connect()
    try:
        before = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    time.sleep(1)  # ensure now() > before
    out = kt._handle_heartbeat({"note": "still alive"})
    assert json.loads(out)["ok"] is True

    conn = kb.connect()
    try:
        after = conn.execute(
            "SELECT claim_expires FROM tasks WHERE id = ?", (worker_env,)
        ).fetchone()["claim_expires"]
    finally:
        conn.close()

    assert after > before, (
        f"claim_expires did not advance ({before} -> {after}); "
        f"worker would be reclaimed at TTL despite heartbeating"
    )
```

## Severity

Medium. Workers that finish under 15 min are unaffected. Workers that exceed 15 min on a single tool call (Xcode Archive, large image generation, dataset processing) are silently re-spawned: from the user's perspective they appear to "loop", and the first run's progress is discarded. This is particularly painful when combined with `--max-runtime`, since the per-task wall-clock budget is consumed by the reclaimed first run, leaving the re-spawn with less budget than expected.

## Related

- `tools/kanban_tools.py:317-348` — bug site
- `hermes_cli/kanban_db.py:1817-1844` — `heartbeat_claim`
- `hermes_cli/kanban_db.py:2641-2691` — `heartbeat_worker`
- `hermes_cli/kanban_db.py:1846+` — `release_stale_claims` (the function that reclaims)
- `hermes_cli/kanban_db.py:3293` — dispatcher sets `HERMES_KANBAN_CLAIM_LOCK` in worker env

I'm happy to follow up with a PR if useful.
