fix(kanban): validate task skills and surface stuck task diagnostics by qWaitCrypto · Pull Request #22974 · NousResearch/hermes-agent

qWaitCrypto · 2026-05-10T03:54:48Z

Bug Description

Kanban tasks could persist invalid worker skill configuration and only fail much later at dispatch/runtime.

Before this change:

hermes kanban create --skill web --skill browser ... was accepted even though web / browser are toolset names, not SKILL bundle names.
Dispatcher-spawned workers eventually received those values through --skills ... and failed startup with unknown-skill errors.
Existing bad rows in tasks.skills were hard to diagnose from the board alone.
Operators did not have a supported CLI recovery path to clear invalid persisted skills without editing SQLite directly.
Tasks that were still marked running after their claim TTL expired also had no explicit diagnostic explaining the stuck state.

This patch also surfaces related operator-visible failure modes:

tasks assigned to profiles that do not exist
running tasks whose claim has already expired

Fixes #22921
Related to #22922, #22924, #22925, #22926

Root Cause

Kanban was treating tasks.skills as an opaque string list.

That missed an important invariant:

tasks.skills are SKILL bundle names forwarded to hermes --skills ...
toolset names belong on the assignee profile config, not on the task row

Because the invariant was not enforced at create time, bad tasks could be written successfully and only fail later during worker spawn. Recovery and diagnostics were also incomplete: existing malformed rows and stale running claims were not clearly surfaced from the Kanban diagnostics subsystem, and there was no minimal supported repair command for invalid skills.

Fix

This PR keeps the behavioral scope narrow and focuses on validation + visibility + minimal recovery:

Reject obvious toolset names in create_task() / kanban create / kanban_create when they are passed via skills.
Add Kanban diagnostics for:
- invalid_task_skills
- assignee_profile_not_found
- stale_running_claim
Add hermes kanban edit <task_id> --clear-skills as a minimal recovery path for already-persisted bad task rows.
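
As a rough illustration of the create-time check described above, the sketch below rejects toolset names passed as skills and reports every offending name in one error. It is not the code from this PR: `ASSUMED_TOOLSET_NAMES` and `normalize_task_skills` are hypothetical names, and the set contains only the two toolset names mentioned in the bug description.

```python
from typing import Iterable, List

# Hypothetical hardcoded set. Only `web` and `browser` are named in this PR
# description; the real list of rejected names may be longer.
ASSUMED_TOOLSET_NAMES = frozenset({"web", "browser"})


def normalize_task_skills(skills: Iterable[str]) -> List[str]:
    """Strip and de-duplicate skill names, rejecting known toolset names.

    All rejected names are collected first so the caller sees every
    problem in a single error instead of fixing them one at a time.
    """
    cleaned: List[str] = []
    rejected: List[str] = []
    for raw in skills:
        name = raw.strip()
        if not name:
            continue  # ignore empty entries
        if name.casefold() in ASSUMED_TOOLSET_NAMES:
            if name not in rejected:
                rejected.append(name)
        elif name not in cleaned:
            cleaned.append(name)
    if rejected:
        raise ValueError(
            "task skills must be SKILL bundle names, not toolset names: "
            + ", ".join(rejected)
        )
    return cleaned
```

Raising before the row is written is the point of the change: the bad value never reaches tasks.skills, so the dispatcher never forwards it to a worker.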

Notably, this PR does not:

change dispatcher scheduling behavior
change default profile toolsets
introduce required-toolset manifests
attempt to solve the separate worker-startup race class

How to Verify

1. Reproduce the pre-fix bad input path:
   hermes kanban create "bad task" --assignee worker --skill web
2. Confirm the command now fails immediately with a message explaining that task skills must be SKILL bundle names, not toolset names.
3. Create or patch a task row with invalid persisted skills and run:
   hermes kanban diagnostics --task <task_id>
4. Confirm it reports invalid_task_skills.
5. Run:
   hermes kanban edit <task_id> --clear-skills
6. Confirm the task's skills field is cleared and the recovery event is recorded.
7. For a running task whose claim_expires is in the past, confirm diagnostics reports stale_running_claim.
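
The three diagnostics checked in these steps can be pictured as independent read-only rules over a task row. The sketch below is an assumption-heavy illustration, not the PR's implementation: the `TaskRow` shape, the `diagnose` function, and the use of epoch seconds for `claim_expires` are all hypothetical. Only the three diagnostic names come from this PR.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set


@dataclass
class TaskRow:
    # Hypothetical row shape; the real tasks table may differ.
    task_id: str
    status: str
    assignee: Optional[str] = None
    skills: List[str] = field(default_factory=list)
    claim_expires: Optional[float] = None  # epoch seconds (assumed unit)


def diagnose(
    task: TaskRow,
    known_profiles: Set[str],
    toolset_names: FrozenSet[str],
    now: float,
) -> List[str]:
    """Return the diagnostic codes that apply to one task. Read-only."""
    codes: List[str] = []
    # A persisted skill that is really a toolset name.
    if any(s.casefold() in toolset_names for s in task.skills):
        codes.append("invalid_task_skills")
    # Assigned to a profile that does not exist.
    if task.assignee is not None and task.assignee not in known_profiles:
        codes.append("assignee_profile_not_found")
    # Still marked running although the claim TTL has passed.
    if (
        task.status == "running"
        and task.claim_expires is not None
        and task.claim_expires < now
    ):
        codes.append("stale_running_claim")
    return codes
```

Because each rule only inspects the row and returns a code, running diagnostics cannot change task state, which matches the read-only claim in the risk assessment.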

Test Plan

Added regression test for create-time invalid toolset-name rejection
Added regression tests for diagnostics on invalid skills, missing profiles, and stale running claims
Added regression tests for edit --clear-skills
Existing targeted Kanban tests still pass
Manual verification of the fix

Targeted test command run for this PR:

pytest tests/hermes_cli/test_kanban_diagnostics.py \
       tests/hermes_cli/test_kanban_db.py \
       tests/hermes_cli/test_kanban_core_functionality.py \
       tests/tools/test_kanban_tools.py -q

Result:

281 passed in 51.36s

Risk Assessment

Low / Medium — validation is intentionally narrow and only rejects a small explicit set of known toolset names when used in the skills field. The new diagnostics are read-only, and the new recovery path only clears skills on non-running tasks.
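
One way to keep the recovery path safe for running tasks is to put the status guard into the UPDATE itself. The sketch below uses the standard library `sqlite3` module against a made-up schema. The column names, the JSON-text encoding of skills, clearing to NULL, and both function names are assumptions for illustration, not the actual hermes-agent schema or code.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with a hypothetical tasks table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, skills TEXT)")
    conn.executemany(
        "INSERT INTO tasks (id, status, skills) VALUES (?, ?, ?)",
        [
            ("t-ready", "ready", '["web"]'),
            ("t-running", "running", '["web"]'),
        ],
    )
    conn.commit()
    return conn


def clear_task_skills(conn: sqlite3.Connection, task_id: str) -> bool:
    """Clear skills for one task unless it is running.

    The status check lives in the WHERE clause, so a task that starts
    running between a read and this write is still left untouched.
    Returns True only if a row was actually updated.
    """
    cur = conn.execute(
        "UPDATE tasks SET skills = NULL WHERE id = ? AND status != 'running'",
        (task_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

A False return covers both "task is running" and "task not found". A real command would likely distinguish the two and also record the recovery event mentioned in the verification steps.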

teknium1 · 2026-05-10T16:10:14Z

Closing in favor of #23183 below — your three stacked PRs (#22974, #23154, #23183) each strictly contain the previous one's commits, with #23183 being the latest superset. See the combined comment on #23183 for the full picture.

Quick summary for this PR specifically: the create-time skills validation half (INVALID_TASK_SKILL_NAMES + _normalize_task_skills) is now redundant. main shipped equivalent logic via PR #23273 (commit 1f5983c4c) using KNOWN_TOOLSET_NAMES = frozenset(name.casefold() for name in get_toolset_names()), which derives the list dynamically from the toolset registry rather than hardcoding it. It also aggregates all toolset-name typos before raising, the same shape as your _normalize_task_skills.
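
To make the difference from a hardcoded list concrete, here is a small sketch of the registry-derived approach. The frozenset expression mirrors the one quoted above; everything else is a stand-in. `get_toolset_names_stub` replaces the real `get_toolset_names()`, whose return values are not shown in this thread, and `is_toolset_name` is a hypothetical helper.

```python
from typing import FrozenSet, List


def get_toolset_names_stub() -> List[str]:
    """Stand-in for the real registry lookup; the names are illustrative."""
    return ["Web", "Browser"]


def build_known_toolset_names() -> FrozenSet[str]:
    # Case-folding at build time makes every later lookup case-insensitive.
    return frozenset(name.casefold() for name in get_toolset_names_stub())


def is_toolset_name(candidate: str, known: FrozenSet[str]) -> bool:
    """True if the candidate matches a registered toolset, ignoring case."""
    return candidate.strip().casefold() in known
```

Deriving the set from the registry means a newly added toolset is rejected as a skill name without anyone having to update a second list.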

The genuinely novel work in your PR — three new diagnostic rules (invalid_task_skills, assignee_profile_not_found, stale_running_claim) plus the --clear-skills recovery CLI flag — is real and complementary, but it's bundled in the larger #23183 stack. See that PR's comment for the path forward.

Thanks @qWaitCrypto!

fix(kanban): validate task skills and surface stuck task diagnostics

7b635aa

alt-glitch added labels type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) on May 10, 2026

qWaitCrypto added 2 commits May 10, 2026 12:48

fix(kanban): tighten edit recovery semantics

22c6fed

test(kanban): isolate dashboard diagnostics profile checks

a8104b4

alt-glitch mentioned this pull request May 10, 2026

fix(kanban): reject toolset names in kanban_create skills field (#22921) #23105

Closed

qWaitCrypto mentioned this pull request May 10, 2026

fix(kanban): guard stale workers before startup #23183

Closed

Merge remote-tracking branch 'upstream/main' into fix/kanban-diagnostics-preflight-22921

c3e621b

Conflicts: tests/tools/test_kanban_tools.py

qWaitCrypto mentioned this pull request May 10, 2026

[Feature]: Harden Kanban task validity, dispatch preflight, and worker ownership #23209

Open

1 task

teknium1 closed this May 10, 2026

teknium1 mentioned this pull request May 10, 2026

fix(kanban): skip invalid task skills and add recovery commands #23154

Closed

19 tasks

qWaitCrypto mentioned this pull request May 10, 2026

fix(kanban): harden worker ownership and recovery paths #23334

Closed

19 tasks

qWaitCrypto wants to merge 4 commits into NousResearch:main from qWaitCrypto:fix/kanban-diagnostics-preflight-22921
