fix(kanban): guard stale workers before startup #23183
…ics-preflight-22921 # Conflicts: # tests/tools/test_kanban_tools.py
…-dispatch-recovery-followup
…worker-lifecycle-guard
|
Closing all three of your stacked PRs (#22974, #23154, #23183) in favor of asking for a single focused re-submission. Apologies for the bulk close: the work itself is good, the structure is the problem.

Why three closes: the PRs are git-stacked supersets of each other (#23183 contains every commit from #23154, which contains every commit from #22974). Reviewing them independently is impossible without first deciding on the previous one, and #23183 ends up bundling ~1000 LOC of redundant work (the create-time skills validation, which landed on main via PR #23273 using the live toolset registry rather than your hardcoded …).

What's already on main:
What's genuinely new in this stack and worth shipping (~300 LOC if carved out):
Ask: could you re-submit those three pieces as one focused PR against current main? Either the whole bundle (~300 net LOC) or split further (startup guard separately, diagnostics+recovery together), whichever you prefer. The constraint is just that we need ONE PR per concern instead of three stacked supersets.

For the startup guard specifically: it touches …

Thanks for the work @qWaitCrypto. The diagnostics and startup-guard ideas are good; they just need to land as a focused unit.

References:
|
Summary
Stacked on top of #22933 and the PR #22974 / #23154 follow-up branch.
This PR adds a worker startup guard for dispatcher-spawned Kanban workers.
Before entering the model loop, a worker now verifies that the task is still
running, that the active `current_run_id` still matches the spawned run, and
that the claim lock still belongs to this worker.
If the task was reclaimed, blocked, archived, or superseded by a newer run in
the claim-to-spawn gap, the worker exits benignly before making any API calls.
That keeps the existing `expected_run_id` tool-call gate as the secondary
defense, while preventing stale workers from being misclassified later as
protocol-violation crashes.
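
The three checks described above can be sketched as a pure function. This is only an illustration under stated assumptions: `TaskSnapshot`, `GuardResult`, `startup_guard`, and the reason strings are hypothetical names, not the signature of `kanban_db.check_worker_startup_guard(...)`, and the real guard reads its state from the Kanban database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSnapshot:
    """Hypothetical read-only view of a task row (field names are assumptions)."""
    status: str
    current_run_id: Optional[int]
    claim_lock: Optional[str]


@dataclass(frozen=True)
class GuardResult:
    ok: bool
    reason: str


def startup_guard(task: Optional[TaskSnapshot], run_id: int, claim_lock: str) -> GuardResult:
    """Decide whether a freshly spawned worker still owns its task.

    Performs no writes: it only compares the snapshot with what the
    dispatcher handed to this worker at spawn time.
    """
    if task is None:
        return GuardResult(False, "task_missing")
    if task.status != "running":
        # Blocked, archived, or otherwise moved on during the claim-to-spawn gap.
        return GuardResult(False, "task_not_running")
    if task.current_run_id != run_id:
        # A newer run superseded the one this worker was spawned for.
        return GuardResult(False, "run_superseded")
    if task.claim_lock != claim_lock:
        # The task was reclaimed; another worker holds the lock now.
        return GuardResult(False, "claim_lost")
    return GuardResult(True, "ok")
```

A worker would call this once before its first model request and return quietly when `ok` is false, leaving the `expected_run_id` gate to catch anything that changes later.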
This follow-up also tightens two recovery edge cases that were still too loose
on the stacked branch:
- … disabling part of the ownership check
- `kanban edit --clear-claim` now only closes an active running run; it no
  longer rewrites terminal run outcomes when clearing stale task-level claim
  residue
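
The `--clear-claim` rule above can be illustrated with a small SQLite sketch. The schema here (a `tasks` table with a `claim_lock` column, a `task_runs` table with a `status` column) and the `clear_claim` helper are assumptions made for the example; the real tables and columns may differ.

```python
import sqlite3


def clear_claim(conn: sqlite3.Connection, task_id: int) -> int:
    """Clear the task-level claim and close only a live running run.

    Returns the number of run rows that were closed as 'reclaimed'.
    Rows that already reached a terminal status are left untouched.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE task_runs SET status = 'reclaimed' "
            "WHERE task_id = ? AND status = 'running'",
            (task_id,),
        )
        conn.execute("UPDATE tasks SET claim_lock = NULL WHERE id = ?", (task_id,))
    return cur.rowcount


def demo() -> list:
    """Build a tiny in-memory database and show the effect on both run rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, claim_lock TEXT);
        CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id INTEGER, status TEXT);
        INSERT INTO tasks VALUES (1, 'worker-a');
        INSERT INTO task_runs VALUES (1, 1, 'completed'), (2, 1, 'running');
        """
    )
    clear_claim(conn, 1)
    return conn.execute("SELECT id, status FROM task_runs ORDER BY id").fetchall()
```

In this sketch the `completed` row keeps its outcome and only the `running` row becomes `reclaimed`; a second call to `clear_claim` closes nothing.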
What changed
- … `kanban_db.check_worker_startup_guard(...)` as a read-only ownership
  preflight for dispatcher-spawned workers
- … `AIAgent.run_conversation()` for `HERMES_KANBAN_TASK` workers
- … `HERMES_KANBAN_RUN_ID` / claim lock) as a benign startup-guard skip
  instead of silently disabling the ownership check
- … `HERMES_KANBAN_CLAIM_LOCK` for the startup ownership guard instead of
  skipping the claim check when it is missing
- … `kanban edit --clear-claim` from changing terminal `task_runs` rows; only
  a live running run is closed as `reclaimed`
- … `run_agent` tests proving stale workers and malformed ownership env skip
  before any API call
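
The environment handling in the list above might look roughly like the following. The three variable names come from this PR; the `read_ownership_env` helper, its return shape, the reason strings, and the assumption that the run id is an integer are hypothetical.

```python
from typing import Mapping, Optional, Tuple

Ownership = Tuple[str, int, str]  # (task id, run id, claim lock)


def read_ownership_env(env: Mapping[str, str]) -> Tuple[Optional[Ownership], Optional[str]]:
    """Parse the dispatcher-provided ownership variables.

    Returns (ownership, None) when everything is well formed,
    (None, reason) when the worker should skip benignly, and
    (None, None) when the process is not a Kanban worker at all.
    """
    task_id = env.get("HERMES_KANBAN_TASK")
    if not task_id:
        return None, None  # not dispatcher-spawned: the guard does not apply

    try:
        # Assumption: run ids are integers; int("") also raises ValueError.
        run_id = int(env.get("HERMES_KANBAN_RUN_ID", ""))
    except ValueError:
        return None, "malformed_run_id"

    claim_lock = env.get("HERMES_KANBAN_CLAIM_LOCK", "").strip()
    if not claim_lock:
        # A missing lock is a skip, not permission to bypass the claim check.
        return None, "missing_claim_lock"

    return (task_id, run_id, claim_lock), None
```

Passing `os.environ` as `env` gives the real behaviour, while a plain dict keeps the function easy to test. The point of the design is that a malformed value produces an explicit skip reason rather than a weaker check.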
Scope
This PR does not:
- … `required_toolsets`
- … `running` workers

Verification