fix(kanban): guard stale workers before startup by qWaitCrypto · Pull Request #23183 · NousResearch/hermes-agent

qWaitCrypto · 2026-05-10T12:45:07Z

Summary

Stacked on top of #22933 and the follow-up branches from PRs #22974 and #23154.

This PR adds a worker startup guard for dispatcher-spawned Kanban workers.
Before entering the model loop, a worker now verifies that the task is still
running, that the active current_run_id still matches the spawned run, and
that the claim lock still belongs to this worker.

If the task was reclaimed, blocked, archived, or superseded by a newer run in
the claim-to-spawn gap, the worker exits benignly before making any API calls.
That keeps the existing expected_run_id tool-call gate as the secondary
defense, while preventing stale workers from being misclassified later as
protocol-violation crashes.
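To make the three checks concrete, here is a minimal sketch of the ownership preflight. It is not the code in this PR: `TaskSnapshot`, `GuardResult`, `check_startup`, the reason strings, and the assumption that run IDs are integers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSnapshot:
    """Hypothetical read-only view of a task row (field names are assumed)."""
    status: str
    current_run_id: Optional[int]
    claim_lock: Optional[str]


@dataclass(frozen=True)
class GuardResult:
    """Outcome of the preflight: ok=True means the worker may start."""
    ok: bool
    reason: str = ""


def check_startup(task: Optional[TaskSnapshot], run_id: int, claim_lock: str) -> GuardResult:
    """Return ok=True only if this worker still owns the task it was spawned for."""
    if task is None:
        return GuardResult(False, "task_missing")
    # 1. The task must still be running (not blocked, archived, reclaimed, ...).
    if task.status != "running":
        return GuardResult(False, f"task_not_running:{task.status}")
    # 2. The active run must be the run this worker was spawned for.
    if task.current_run_id != run_id:
        return GuardResult(False, "run_superseded")
    # 3. The claim lock must still belong to this worker.
    if task.claim_lock != claim_lock:
        return GuardResult(False, "claim_lock_mismatch")
    return GuardResult(True)
```

A caller would run this once before the model loop and return early, without error, whenever `ok` is false.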

This follow-up also tightens two recovery edge cases that were still too loose
on the stacked branch:

- malformed Kanban worker ownership env now skips benignly instead of silently
  disabling part of the ownership check
- kanban edit --clear-claim now only closes an active running run; it no
  longer rewrites terminal run outcomes when clearing stale task-level claim
  residue
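The malformed-env rule can be sketched as a small classifier. The environment variable names come from this PR; the function name `read_worker_ownership`, the three-way return value, and the assumption that the run ID is an integer are hypothetical.

```python
from typing import Mapping, Optional, Tuple


def read_worker_ownership(env: Mapping[str, str]) -> Tuple[str, Optional[Tuple[int, str]]]:
    """Classify the environment of a process that may be a Kanban worker.

    Returns one of:
      ("not_a_worker", None)          -> no guard needed
      ("skip", None)                  -> malformed ownership env, exit benignly
      ("check", (run_id, claim_lock)) -> run the ownership check with these values
    """
    if not env.get("HERMES_KANBAN_TASK"):
        return "not_a_worker", None

    raw_run_id = env.get("HERMES_KANBAN_RUN_ID", "").strip()
    claim_lock = env.get("HERMES_KANBAN_CLAIM_LOCK", "").strip()

    try:
        run_id = int(raw_run_id)  # assumption: run IDs are integers
    except ValueError:
        # A missing or non-numeric run ID must not disable the check.
        return "skip", None

    if not claim_lock:
        # The claim lock is required; a missing lock is treated as malformed.
        return "skip", None

    return "check", (run_id, claim_lock)
```

The point of the design is that a bad value never falls through to "start anyway with a weaker check"; it always resolves to a benign skip.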

What changed

- added kanban_db.check_worker_startup_guard(...) as a read-only ownership
  preflight for dispatcher-spawned workers
- added an early-return guard in AIAgent.run_conversation() for
  HERMES_KANBAN_TASK workers
- treat malformed Kanban worker ownership env (HERMES_KANBAN_RUN_ID /
  claim lock) as a benign startup-guard skip instead of silently disabling
  the ownership check
- require HERMES_KANBAN_CLAIM_LOCK for the startup ownership guard instead
  of skipping the claim check when it is missing
- keep kanban edit --clear-claim from changing terminal task_runs rows;
  only a live running run is closed as reclaimed
- added tests for reclaimed and superseded runs
- added run_agent tests proving stale workers and malformed ownership env
  skip before any API call
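The --clear-claim rule (terminal task_runs rows stay untouched, only a live running run is closed) can be illustrated with an in-memory SQLite database. The schema below is a simplified assumption and not the real kanban_db layout; `make_demo_db` and `clear_claim` are hypothetical helpers.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tasks (id INTEGER PRIMARY KEY, claim_lock TEXT);
        CREATE TABLE task_runs (id INTEGER PRIMARY KEY, task_id INTEGER, status TEXT);
        INSERT INTO tasks VALUES (1, 'worker-a');
        INSERT INTO task_runs VALUES (10, 1, 'completed');  -- terminal run
        INSERT INTO task_runs VALUES (11, 1, 'running');    -- live run
        """
    )
    return conn


def clear_claim(conn: sqlite3.Connection, task_id: int) -> int:
    """Clear the task-level claim and close only a live running run.

    Returns the number of runs that were closed as reclaimed.
    """
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("UPDATE tasks SET claim_lock = NULL WHERE id = ?", (task_id,))
        cur = conn.execute(
            # The status filter is what protects terminal rows from rewrites.
            "UPDATE task_runs SET status = 'reclaimed' "
            "WHERE task_id = ? AND status = 'running'",
            (task_id,),
        )
    return cur.rowcount
```

Calling `clear_claim` a second time on the same task closes nothing, because no running run is left to match the filter.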

Scope

This PR does not:

- change dispatcher scheduling behavior
- add a capability model or required_toolsets
- change default profile toolsets
- change crash accounting for real running workers

Verification

PYTHONWARNINGS=ignore pytest -q \
  tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_on_non_running_task \
  tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_keeps_terminal_run_terminal \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_reclaimed_run \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_superseded_run_without_failure \
  tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_requires_claim_lock \
  tests/run_agent/test_kanban_worker_startup_guard.py \
  tests/hermes_cli/test_kanban_core_functionality.py::test_detect_crashed_workers_protocol_violation_auto_blocks \
  tests/hermes_cli/test_kanban_core_functionality.py::test_detect_crashed_workers_nonzero_exit_uses_default_limit \
  tests/hermes_cli/test_kanban_core_functionality.py::test_stale_run_cannot_complete_new_attempt \
  tests/hermes_cli/test_kanban_core_functionality.py::test_stale_run_cannot_block_or_heartbeat_new_attempt

teknium1 · 2026-05-10T16:10:18Z

Closing all three of your stacked PRs (#22974, #23154, #23183) in favor of asking for a single focused re-submission. Apologies for the bulk close — the work itself is good, the structure is the problem.

Why three closes: the PRs are git-stacked supersets of each other (#23183 contains every commit from #23154 which contains every commit from #22974). Reviewing them independently is impossible without first deciding on the previous one, and #23183 ends up bundling ~1000 LOC of redundant work (the create-time skills validation, which landed on main via PR #23273 using the live toolset registry rather than your hardcoded INVALID_TASK_SKILL_NAMES list) with ~300 LOC of genuinely novel work (the WorkerStartupGuard in run_agent.py).

What's already on main:

- Skills validation at create time (PR #23273, fix(kanban): reject toolset names in task skills field (salvage #22933)): uses KNOWN_TOOLSET_NAMES = frozenset(name.casefold() for name in get_toolset_names()), a dynamic registry lookup, and aggregates all typos before raising. Functionally identical to your _normalize_task_skills but with a live source of truth.
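For readers who have not seen #23273, the pattern described above looks roughly like the sketch below. The `frozenset(name.casefold() ...)` expression is quoted from this comment; everything else is an assumption: `get_toolset_names` here is a stand-in stub with made-up names, and `reject_toolset_names` is a hypothetical function, not the one on main.

```python
from typing import Iterable, List


def get_toolset_names() -> List[str]:
    """Stand-in for the live toolset registry; these names are made up."""
    return ["Web", "terminal", "browser"]


def reject_toolset_names(skills: Iterable[str]) -> List[str]:
    """Return the skills unchanged, or raise once listing every offending entry."""
    skills = list(skills)
    # Looked up at call time so newly registered toolsets are picked up.
    known = frozenset(name.casefold() for name in get_toolset_names())
    offenders = [s for s in skills if s.strip().casefold() in known]
    if offenders:
        # Aggregate all bad entries into one error instead of failing on the first.
        raise ValueError(
            "toolset names are not valid task skills: " + ", ".join(sorted(offenders))
        )
    return skills
```

Reporting every offender in one error saves the user from a fix-one-retry loop when several entries are wrong.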

What's genuinely new in this stack and worth shipping (~300 LOC if carved out):

- kb.check_worker_startup_guard() + WorkerStartupGuard dataclass: a preflight in run_conversation() that runs before any model API call and closes the claim→spawn race window where a worker boots after the task was reclaimed/blocked.
- Three new diagnostic rules (invalid_task_skills, assignee_profile_not_found, stale_running_claim).
- hermes kanban edit --clear-skills / --reset-failures / --clear-claim recovery CLI flags backed by edit_task_recovery_fields() and reset_task_failures().

Ask: could you re-submit those three pieces as one focused PR against current main? Either the whole bundle (~300 net LOC) or split further (startup guard separately, diagnostics+recovery together) — whichever you prefer. The constraint is just that we need ONE PR per concern instead of three stacked supersets.

For the startup guard specifically: it touches run_agent.py which is on the protected-files list (~7500 lines, lots of repeated patterns), so we need to look at the dict-construction carefully. The current ~95-line hand-rolled return dict mirroring run_conversation's normal return shape is a maintenance hazard — factoring a _terminal_run_result() helper that builds it from kwargs would be cleaner. Happy to discuss the shape on the new PR.

Thanks for the work @qWaitCrypto — the diagnostics and startup-guard ideas are good, just need to land them as a focused unit.

References:

- PR #23273, fix(kanban): reject toolset names in task skills field (salvage #22933): KNOWN_TOOLSET_NAMES create-time validation
- AGENTS.md "Critical File Restrictions": run_agent.py editing rules

qWaitCrypto added 10 commits May 10, 2026 11:31

7b635aa fix(kanban): validate task skills and surface stuck task diagnostics
22c6fed fix(kanban): tighten edit recovery semantics
a8104b4 test(kanban): isolate dashboard diagnostics profile checks
a215786 fix(kanban): skip bad task skills and add recovery commands
fd03875 fix(kanban): tighten dispatch recovery invariants
5505ad8 fix(kanban): guard stale workers before startup
53024fd fix(kanban): tighten worker startup ownership checks
c3e621b Merge remote-tracking branch 'upstream/main' into fix/kanban-diagnostics-preflight-22921 (Conflicts: tests/tools/test_kanban_tools.py)
00ab664 Merge branch 'fix/kanban-diagnostics-preflight-22921' into fix/kanban-dispatch-recovery-followup
dddbf93 Merge branch 'fix/kanban-dispatch-recovery-followup' into fix/kanban-worker-lifecycle-guard

qWaitCrypto mentioned this pull request May 10, 2026

[Feature]: Harden Kanban task validity, dispatch preflight, and worker ownership #23209

Open

1 task

alt-glitch added labels May 10, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), comp/agent (Core agent loop, run_agent.py, prompt builder)

This was referenced May 10, 2026

fix(kanban): validate task skills and surface stuck task diagnostics #22974

Closed

fix(kanban): skip invalid task skills and add recovery commands #23154

Closed

teknium1 closed this May 10, 2026

qWaitCrypto mentioned this pull request May 10, 2026

fix(kanban): harden worker ownership and recovery paths #23334

Closed

19 tasks

fix(kanban): guard stale workers before startup #23183
qWaitCrypto wants to merge 10 commits into NousResearch:main from qWaitCrypto:fix/kanban-worker-lifecycle-guard
