fix(kanban): validate task skills and surface stuck task diagnostics#22974

Closed
qWaitCrypto wants to merge 4 commits into NousResearch:main from qWaitCrypto:fix/kanban-diagnostics-preflight-22921

@qWaitCrypto

Contributor

Bug Description

Kanban tasks could persist invalid worker skill configuration and only fail much later at dispatch/runtime.

Before this change:

  • hermes kanban create --skill web --skill browser ... was accepted even though web / browser are toolset names, not SKILL bundle names.
  • Dispatcher-spawned workers eventually received those values through --skills ... and failed startup with unknown-skill errors.
  • Existing bad rows in tasks.skills were hard to diagnose from the board alone.
  • Operators did not have a supported CLI recovery path to clear invalid persisted skills without editing SQLite directly.
  • Tasks that were still marked running after their claim TTL expired also had no explicit diagnostic explaining the stuck state.

This patch also surfaces related operator-visible failure modes:

  • tasks assigned to profiles that do not exist
  • running tasks whose claim has already expired

Fixes #22921
Related to #22922, #22924, #22925, #22926

Root Cause

Kanban was treating tasks.skills as an opaque string list.

That missed an important invariant:

  • tasks.skills are SKILL bundle names forwarded to hermes --skills ...
  • toolset names belong on the assignee profile config, not on the task row

Because the invariant was not enforced at create time, bad tasks could be written successfully and only fail later during worker spawn. Recovery and diagnostics were also incomplete: existing malformed rows and stale running claims were not clearly surfaced from the Kanban diagnostics subsystem, and there was no minimal supported repair command for invalid skills.

Fix

This PR keeps the behavioral scope narrow and focuses on validation + visibility + minimal recovery:

  • Reject obvious toolset names in create_task() / kanban create / kanban_create when they are passed via skills.
  • Add Kanban diagnostics for:
    • invalid_task_skills
    • assignee_profile_not_found
    • stale_running_claim
  • Add hermes kanban edit <task_id> --clear-skills as a minimal recovery path for already-persisted bad task rows.
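
The create-time check described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the contents of the rejected set, the function name `normalize_task_skills`, and the error wording are all assumptions made for the example.

```python
from typing import Iterable, List, Optional

# Illustrative only: the real set of rejected names lives in the PR and is
# not reproduced here. "web" and "browser" come from the bug description.
INVALID_TASK_SKILL_NAMES = frozenset({"web", "browser"})


def normalize_task_skills(skills: Optional[Iterable[str]]) -> List[str]:
    """Strip and de-duplicate skills, rejecting known toolset names."""
    cleaned: List[str] = []
    rejected: List[str] = []
    for raw in skills or []:
        name = raw.strip()
        if not name:
            continue  # ignore empty entries
        if name.casefold() in INVALID_TASK_SKILL_NAMES:
            if name not in rejected:
                rejected.append(name)
        elif name not in cleaned:
            cleaned.append(name)
    if rejected:
        # Report every offending name at once instead of failing on the first.
        raise ValueError(
            "task skills must be SKILL bundle names, not toolset names: "
            + ", ".join(rejected)
        )
    return cleaned
```

Running the check before the row is written means a bad value never reaches tasks.skills, so the failure shows up at the CLI rather than at worker spawn.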

Notably, this PR does not:

  • change dispatcher scheduling behavior
  • change default profile toolsets
  • introduce required-toolset manifests
  • attempt to solve the separate worker-startup race class

How to Verify

  1. Reproduce the pre-fix bad input path:
    hermes kanban create "bad task" --assignee worker --skill web
  2. Confirm the command now fails immediately with a message explaining that task skills must be SKILL bundle names, not toolset names.
  3. Create or patch a task row with invalid persisted skills and run:
    hermes kanban diagnostics --task <task_id>
    Confirm it reports invalid_task_skills.
  4. Run:
    hermes kanban edit <task_id> --clear-skills
    Confirm the task's skills field is cleared and the recovery event is recorded.
  5. For a running task whose claim_expires is in the past, confirm diagnostics reports stale_running_claim.
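
For orientation, the three diagnostics checked in the steps above could be expressed along these lines. This is a hedged sketch, not the PR's diagnostics code: the dict-shaped task, the `status`, `assignee`, and `skills` keys, the epoch-seconds `claim_expires` value, and the `diagnose_task` helper are assumptions for illustration.

```python
import time
from typing import Dict, List, Optional, Set

# Illustrative stand-in for the toolset names the real check rejects.
TOOLSET_NAMES = frozenset({"web", "browser"})


def diagnose_task(
    task: Dict,
    known_profiles: Set[str],
    now: Optional[float] = None,
) -> List[str]:
    """Return the diagnostic codes that apply to one task (read-only)."""
    now = time.time() if now is None else now
    findings: List[str] = []

    # Persisted skills that are really toolset names.
    if any(s.casefold() in TOOLSET_NAMES for s in task.get("skills") or []):
        findings.append("invalid_task_skills")

    # Assignee that does not match any existing profile.
    assignee = task.get("assignee")
    if assignee and assignee not in known_profiles:
        findings.append("assignee_profile_not_found")

    # Still marked running although the claim TTL has already passed.
    expires = task.get("claim_expires")
    if task.get("status") == "running" and expires is not None and expires < now:
        findings.append("stale_running_claim")

    return findings
```

The function only reads the task and returns codes, which matches the read-only nature of the diagnostics described in this PR.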

Test Plan

  • Added regression test for create-time invalid toolset-name rejection
  • Added regression tests for diagnostics on invalid skills, missing profiles, and stale running claims
  • Added regression tests for edit --clear-skills
  • Existing targeted Kanban tests still pass
  • Manual verification of the fix

Targeted test command run for this PR:

pytest tests/hermes_cli/test_kanban_diagnostics.py \
       tests/hermes_cli/test_kanban_db.py \
       tests/hermes_cli/test_kanban_core_functionality.py \
       tests/tools/test_kanban_tools.py -q

Result:

281 passed in 51.36s

Risk Assessment

Low / Medium — validation is intentionally narrow and only rejects a small explicit set of known toolset names when used in the skills field. The new diagnostics are read-only, and the new recovery path only clears skills on non-running tasks.
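
As a rough illustration of why the recovery path is low risk, the non-running guard might look like the sketch below. The table layout (`id`, `status`, `skills` stored as JSON text) and the `clear_task_skills` helper are assumptions for the example and are not taken from the actual schema or code; recording the recovery event is left out.

```python
import json
import sqlite3


def clear_task_skills(conn: sqlite3.Connection, task_id: str) -> None:
    """Reset a task's skills to an empty list unless the task is running."""
    row = conn.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no such task: {task_id}")
    if row[0] == "running":
        # A live worker may already have been started with these skills.
        raise RuntimeError(f"refusing to clear skills on running task {task_id}")
    conn.execute(
        "UPDATE tasks SET skills = ? WHERE id = ?",
        (json.dumps([]), task_id),
    )
    conn.commit()
```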

@alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) labels on May 10, 2026
…ics-preflight-22921

# Conflicts:
#	tests/tools/test_kanban_tools.py
@teknium1

Contributor

Closing in favor of #23183 below — your three stacked PRs (#22974, #23154, #23183) each strictly contain the previous one's commits, with #23183 being the latest superset. See the combined comment on #23183 for the full picture.

Quick summary for this PR specifically: the create-time skills validation half (INVALID_TASK_SKILL_NAMES + _normalize_task_skills) is now redundant. Main shipped equivalent logic via PR #23273 (commit 1f5983c4c) using KNOWN_TOOLSET_NAMES = frozenset(name.casefold() for name in get_toolset_names()), which derives the list dynamically from the toolset registry rather than hardcoding it. It also aggregates all toolset-name typos before raising, the same shape as your _normalize_task_skills.

The genuinely novel work in your PR — three new diagnostic rules (invalid_task_skills, assignee_profile_not_found, stale_running_claim) plus the --clear-skills recovery CLI flag — is real and complementary, but it's bundled in the larger #23183 stack. See that PR's comment for the path forward.

Thanks @qWaitCrypto!

Development

Successfully merging this pull request may close these issues.

[Bug]: kanban_create accepts toolset names in skills field, causing immediate worker crash
