fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at#21153
Closed
stephen0110 wants to merge 1 commit into
Closed
fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at#21153stephen0110 wants to merge 1 commit into
stephen0110 wants to merge 1 commit into
Conversation
…t_at The kanban_heartbeat tool called heartbeat_worker but never heartbeat_claim, so a worker that loops the tool while a single tool call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got reclaimed by release_stale_claims. The function name and heartbeat_claim's own docstring imply otherwise: "Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership." But there was no caller in the worker tool path. Workers couldn't invoke heartbeat_claim themselves either — it isn't exposed as a tool. Fix: _handle_heartbeat now calls heartbeat_claim first, reading HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins this in _default_spawn). Falls back to _claimer_id() for locally- driven workers that didn't go through dispatcher spawn. Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires rewinds claim_expires into the past, calls the tool, and asserts the new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to fail against the unfixed code (claim_expires stays at the rewound value). Closes the root cause underlying the symptom in NousResearch#21141 (15-min respawns of long-running workers). NousResearch#21141 separately addresses post-reclaim cleanup; this fixes the upstream "shouldn't have been reclaimed in the first place" half.
Contributor
|
Merged via #21183 with your commit cherry-picked onto current main — authorship preserved in git log via rebase merge. Thanks @stephen0110! Closes #21147. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
kanban_heartbeattool now extendsclaim_expiresby also callingheartbeat_claim, fixing the bug where diligent workers calling the tool in a loop were still reclaimed at the 15-minute TTL.tests/tools/test_kanban_tools.pythat fails against the unfixed code.Closes #21147 (the issue I just filed describing the root cause). Underlying cause for the symptom in #21141 (#21141 covers the orthogonal post-reclaim cleanup half).
The bug
tools/kanban_tools.py::_handle_heartbeatwas callingheartbeat_worker(records a heartbeat event + updateslast_heartbeat_at) but neverheartbeat_claim(extendsclaim_expires).release_stale_claimsreadsclaim_expires, notlast_heartbeat_at, so a worker looping the tool perfectly was still reclaimed at the default 15-minute TTL.heartbeat_claim's docstring even spells the contract:…but no caller in the worker tool path invoked it, and
heartbeat_claimitself isn't exposed as a tool.The fix
The dispatcher already pins
HERMES_KANBAN_CLAIM_LOCKin the worker env at spawn time (hermes_cli/kanban_db.py:3291), so the env-read produces an exact lock match in the dispatcher path. Locally-driven (non-dispatcher) workers fall back to_claimer_id(), which is whatclaim_taskitself uses by default — same identity, same match.If
heartbeat_claimreturns False (worker no longer owns the claim), we don't error out — we let the subsequentheartbeat_workercall surface the standard "not running" error so the worker can exit cleanly. This preserves existing tool behavior on the "you've already been reclaimed" path.Test
tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expiresrewindsclaim_expiresinto the past via direct SQL, calls the tool, and asserts the new value is at leastnow + DEFAULT_CLAIM_TTL_SECONDS // 2.Verified the test fails against the old code (
claim_expires did not advance (1 -> 1)) and passes against the fix. Fulltests/tools/test_kanban_tools.py+tests/hermes_cli/test_kanban_db.pyrun green (99 passed).Test plan
pytest tests/tools/test_kanban_tools.py -k heartbeat -v— 4 passedpytest tests/tools/test_kanban_tools.py tests/hermes_cli/test_kanban_db.py— 99 passed