fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at #21153

Closed
stephen0110 wants to merge 1 commit into NousResearch:main from stephen0110:fix/kanban-heartbeat-extends-claim

@stephen0110

Summary

  • kanban_heartbeat tool now extends claim_expires by also calling heartbeat_claim, fixing the bug where diligent workers calling the tool in a loop were still reclaimed at the 15-minute TTL.
  • Adds a regression test in tests/tools/test_kanban_tools.py that fails against the unfixed code.

Closes #21147 (the issue I just filed describing the root cause). This is also the underlying cause of the symptom in #21141; #21141 itself covers the orthogonal post-reclaim cleanup half.

The bug

tools/kanban_tools.py::_handle_heartbeat was calling heartbeat_worker (records a heartbeat event + updates last_heartbeat_at) but never heartbeat_claim (extends claim_expires). release_stale_claims reads claim_expires, not last_heartbeat_at, so a worker looping the tool perfectly was still reclaimed at the default 15-minute TTL.
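The mismatch can be reproduced in a toy model. The following is a minimal sketch, not the project's code: the `tasks` table, its columns other than `claim_expires` and `last_heartbeat_at`, and the helper names `claim`, `heartbeat_worker_only`, and `release_stale` are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3

TTL = 15 * 60  # assumed to match the 15-minute default claim TTL, in seconds


def make_db():
    # Hypothetical single-table schema; the real kanban schema is not shown in this PR.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    return conn


def claim(conn, tid, now):
    # Claiming a task starts a lease that ends at now + TTL.
    conn.execute(
        "INSERT INTO tasks VALUES (?, 'running', ?, ?)", (tid, now + TTL, now)
    )


def heartbeat_worker_only(conn, tid, now):
    # Stand-in for the unfixed tool path: the liveness timestamp moves,
    # but the lease (claim_expires) is never touched.
    conn.execute("UPDATE tasks SET last_heartbeat_at = ? WHERE id = ?", (now, tid))


def release_stale(conn, now):
    # Stand-in for the reaper: it consults claim_expires only.
    cur = conn.execute(
        "UPDATE tasks SET status = 'ready' "
        "WHERE status = 'running' AND claim_expires < ?",
        (now,),
    )
    return cur.rowcount


conn = make_db()
claim(conn, "t1", now=0)
# The worker heartbeats every minute for 16 minutes.
for t in range(60, TTL + 61, 60):
    heartbeat_worker_only(conn, "t1", now=t)
reclaimed = release_stale(conn, now=TTL + 60)
```

In this model the task is reclaimed even though its last heartbeat is only seconds old, because the reaper and the heartbeat write to different columns.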

heartbeat_claim's docstring even spells out the contract:

"Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership."

…but no caller in the worker tool path invoked it, and heartbeat_claim itself isn't exposed as a tool.

The fix

# Extend the claim TTL first. The dispatcher pins
# HERMES_KANBAN_CLAIM_LOCK in the worker env at spawn time
# (see _default_spawn in kanban_db.py); falling back to the
# default _claimer_id() covers locally-driven workers that
# never went through the dispatcher path.
claim_lock = os.environ.get("HERMES_KANBAN_CLAIM_LOCK")
kb.heartbeat_claim(conn, tid, claimer=claim_lock)

ok = kb.heartbeat_worker(conn, tid, ...)

The dispatcher already pins HERMES_KANBAN_CLAIM_LOCK in the worker env at spawn time (hermes_cli/kanban_db.py:3291), so the env-read produces an exact lock match in the dispatcher path. Locally-driven (non-dispatcher) workers fall back to _claimer_id(), which is what claim_task itself uses by default — same identity, same match.

If heartbeat_claim returns False (worker no longer owns the claim), we don't error out — we let the subsequent heartbeat_worker call surface the standard "not running" error so the worker can exit cleanly. This preserves existing tool behavior on the "you've already been reclaimed" path.
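The identity lookup and the call order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_claimer`, `handle_heartbeat`, the dict-based task, and the error payload are hypothetical, and the fallback is done in the caller here for clarity, whereas the PR passes the possibly-unset env value straight to heartbeat_claim.

```python
import os


def resolve_claimer(env, default_claimer):
    # Prefer the lock pinned in the worker environment at spawn time;
    # otherwise fall back to the local default identity.
    return env.get("HERMES_KANBAN_CLAIM_LOCK") or default_claimer()


def handle_heartbeat(task, claimer, now, ttl=15 * 60):
    # Step 1: try to extend the lease. A lock mismatch is deliberately
    # not treated as an error at this point.
    if task["status"] == "running" and task["claim_lock"] == claimer:
        task["claim_expires"] = now + ttl

    # Step 2: the liveness update is what reports failure, so a worker
    # that was already reclaimed sees the same error as before.
    if task["status"] != "running":
        return {"ok": False, "error": "not running"}  # hypothetical payload
    task["last_heartbeat_at"] = now
    return {"ok": True}


# Usage: a locally-driven worker with no pinned lock uses the default identity.
local_claimer = resolve_claimer(os.environ, lambda: "local-worker")
```

Keeping the failure signal in the second step means the tool's observable behavior on the already-reclaimed path does not change; only the successful path gains the lease extension.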

Test

tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires rewinds claim_expires into the past via direct SQL, calls the tool, and asserts the new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2.

Verified the test fails against the old code (claim_expires did not advance: 1 -> 1) and passes against the fix. The full tests/tools/test_kanban_tools.py + tests/hermes_cli/test_kanban_db.py run is green (99 passed).
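The shape of that regression test can be sketched in a self-contained form. This is an approximation, not the test added in this PR: the `tasks` schema and the `heartbeat_tool` stand-in are hypothetical, and the TTL constant is assumed to be 15 minutes based on the description above.

```python
import sqlite3
import time

DEFAULT_CLAIM_TTL_SECONDS = 15 * 60  # assumed from the 15-minute TTL described above


def heartbeat_tool(conn, tid, claimer):
    # Hypothetical stand-in for the fixed tool: extend the lease first,
    # then record liveness.
    now = int(time.time())
    conn.execute(
        "UPDATE tasks SET claim_expires = ? "
        "WHERE id = ? AND claim_lock = ? AND status = 'running'",
        (now + DEFAULT_CLAIM_TTL_SECONDS, tid, claimer),
    )
    cur = conn.execute(
        "UPDATE tasks SET last_heartbeat_at = ? WHERE id = ? AND status = 'running'",
        (now, tid),
    )
    return cur.rowcount == 1


def test_heartbeat_extends_claim_expires():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, claim_lock TEXT, "
        "claim_expires INTEGER, last_heartbeat_at INTEGER)"
    )
    conn.execute("INSERT INTO tasks VALUES ('t1', 'running', 'w1', 0, 0)")

    # Rewind the lease into the past via direct SQL.
    conn.execute("UPDATE tasks SET claim_expires = 1 WHERE id = 't1'")

    before = int(time.time())
    assert heartbeat_tool(conn, "t1", "w1")

    (after,) = conn.execute(
        "SELECT claim_expires FROM tasks WHERE id = 't1'"
    ).fetchone()
    # Half a TTL of slack keeps the check robust to clock jitter while
    # still failing if the lease was left at the rewound value.
    assert after >= before + DEFAULT_CLAIM_TTL_SECONDS // 2, (
        f"claim_expires did not advance (1 -> {after})"
    )
    return after
```

Rewinding to a fixed past value rather than sleeping makes the failure mode deterministic: against a heartbeat that never touches the lease, the value stays at 1 and the assertion fails immediately.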

Test plan

  • pytest tests/tools/test_kanban_tools.py -k heartbeat -v — 4 passed
  • pytest tests/tools/test_kanban_tools.py tests/hermes_cli/test_kanban_db.py — 99 passed
  • Sanity-checked new test fails against the unfixed code, then passes against the fix

…t_at

The kanban_heartbeat tool called heartbeat_worker but never
heartbeat_claim, so a worker that loops the tool while a single tool
call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got
reclaimed by release_stale_claims. The function name and
heartbeat_claim's own docstring imply otherwise:

  "Workers that know they'll exceed 15 minutes should call this
   every few minutes to keep ownership."

But there was no caller in the worker tool path. Workers couldn't
invoke heartbeat_claim themselves either — it isn't exposed as a tool.

Fix: _handle_heartbeat now calls heartbeat_claim first, reading
HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins
this in _default_spawn). Falls back to _claimer_id() for locally-
driven workers that didn't go through dispatcher spawn.

Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires
rewinds claim_expires into the past, calls the tool, and asserts the
new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to
fail against the unfixed code (claim_expires stays at the rewound
value).

Closes the root cause underlying the symptom in NousResearch#21141 (15-min
respawns of long-running workers). NousResearch#21141 separately addresses
post-reclaim cleanup; this fixes the upstream "shouldn't have been
reclaimed in the first place" half.
@alt-glitch added the labels type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/tools (Tool registry, model_tools, toolsets) on May 7, 2026
@teknium1

teknium1 commented May 7, 2026

Merged via #21183 with your commit cherry-picked onto current main — authorship preserved in git log via rebase merge. Thanks @stephen0110! Closes #21147.

@teknium1 teknium1 closed this May 7, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: kanban_heartbeat tool doesn't extend claim TTL — diligent workers reclaimed at 15min