Skip to content

fix(kanban): guard stale workers before startup#23183

Closed
qWaitCrypto wants to merge 10 commits into
NousResearch:mainfrom
qWaitCrypto:fix/kanban-worker-lifecycle-guard
Closed

fix(kanban): guard stale workers before startup#23183
qWaitCrypto wants to merge 10 commits into
NousResearch:mainfrom
qWaitCrypto:fix/kanban-worker-lifecycle-guard

Conversation

@qWaitCrypto

@qWaitCrypto qWaitCrypto commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Stacked on top of #22933 and PR #22974 #23154 follow-up branch.

This PR adds a worker startup guard for dispatcher-spawned Kanban workers.
Before entering the model loop, a worker now verifies that the task is still
running, that the active current_run_id still matches the spawned run, and
that the claim lock still belongs to this worker.

If the task was reclaimed, blocked, archived, or superseded by a newer run in
the claim-to-spawn gap, the worker exits benignly before making any API calls.
That keeps the existing expected_run_id tool-call gate as the secondary
defense, while preventing stale workers from being misclassified later as
protocol-violation crashes.

This follow-up also tightens two recovery edge cases that were still too loose
on the stacked branch:

  • malformed Kanban worker ownership env now skips benignly instead of silently
    disabling part of the ownership check
  • kanban edit --clear-claim now only closes an active running run; it no
    longer rewrites terminal run outcomes when clearing stale task-level claim
    residue

What changed

  • added kanban_db.check_worker_startup_guard(...) as a read-only ownership
    preflight for dispatcher-spawned workers
  • added an early-return guard in AIAgent.run_conversation() for
    HERMES_KANBAN_TASK workers
  • treat malformed Kanban worker ownership env (HERMES_KANBAN_RUN_ID /
    claim lock) as a benign startup-guard skip instead of silently disabling
    the ownership check
  • require HERMES_KANBAN_CLAIM_LOCK for the startup ownership guard instead
    of skipping the claim check when it is missing
  • keep kanban edit --clear-claim from changing terminal task_runs rows;
    only a live running run is closed as reclaimed
  • added tests for reclaimed and superseded runs
  • added run_agent tests proving stale workers and malformed ownership env
    skip before any API call

Scope

This PR does not:

  • change dispatcher scheduling behavior
  • add a capability model or required_toolsets
  • change default profile toolsets
  • change crash accounting for real running workers

Verification

PYTHONWARNINGS=ignore pytest -q \
  tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_on_non_running_task \
  tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_keeps_terminal_run_terminal \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_reclaimed_run \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_superseded_run_without_failure \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_requires_claim_lock \
  tests/run_agent/test_kanban_worker_startup_guard.py \
  tests/hermes_cli/test_kanban_core_functionality.py::test_detect_crashed_workers_protocol_violation_auto_blocks \
  tests/hermes_cli/test_kanban_core_functionality.py::test_detect_crashed_workers_nonzero_exit_uses_default_limit \
  tests/hermes_cli/test_kanban_core_functionality.py::test_stale_run_cannot_complete_new_attempt \
  tests/hermes_cli/test_kanban_core_functionality.py::test_stale_run_cannot_block_or_heartbeat_new_attempt

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins comp/agent Core agent loop, run_agent.py, prompt builder labels May 10, 2026
@teknium1

Copy link
Copy Markdown
Contributor

Closing all three of your stacked PRs (#22974, #23154, #23183) in favor of asking for a single focused re-submission. Apologies for the bulk close — the work itself is good, the structure is the problem.

Why three closes: the PRs are git-stacked supersets of each other (#23183 contains every commit from #23154 which contains every commit from #22974). Reviewing them independently is impossible without first deciding on the previous one, and #23183 ends up bundling ~1000 LOC of redundant work (the create-time skills validation, which landed on main via PR #23273 using the live toolset registry rather than your hardcoded INVALID_TASK_SKILL_NAMES list) with ~300 LOC of genuinely novel work (the WorkerStartupGuard in run_agent.py).

What's already on main:

What's genuinely new in this stack and worth shipping (~300 LOC if carved out):

  1. kb.check_worker_startup_guard() + WorkerStartupGuard dataclass — preflight in run_conversation() that runs before any model API call, closes the claim→spawn race window where a worker boots after the task was reclaimed/blocked.
  2. Three new diagnostic rules (invalid_task_skills, assignee_profile_not_found, stale_running_claim).
  3. hermes kanban edit --clear-skills / --reset-failures / --clear-claim recovery CLI flags backed by edit_task_recovery_fields() and reset_task_failures().

Ask: could you re-submit those three pieces as one focused PR against current main? Either the whole bundle (~300 net LOC) or split further (startup guard separately, diagnostics+recovery together) — whichever you prefer. The constraint is just that we need ONE PR per concern instead of three stacked supersets.

For the startup guard specifically: it touches run_agent.py which is on the protected-files list (~7500 lines, lots of repeated patterns), so we need to look at the dict-construction carefully. The current ~95-line hand-rolled return dict mirroring run_conversation's normal return shape is a maintenance hazard — factoring a _terminal_run_result() helper that builds it from kwargs would be cleaner. Happy to discuss the shape on the new PR.

Thanks for the work @qWaitCrypto — the diagnostics and startup-guard ideas are good, just need to land them as a focused unit.

References:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants