fix(kanban): validate task skills and surface stuck task diagnostics #22974
qWaitCrypto wants to merge 4 commits
…ics-preflight-22921 # Conflicts: # tests/tools/test_kanban_tools.py
Closing in favor of #23183 below: your three stacked PRs (#22974, #23154, #23183) each strictly contain the previous one's commits, with #23183 being the latest superset. See the combined comment on #23183 for the full picture.

Quick summary for this PR specifically: the create-time skills validation half ( The genuinely novel work in your PR, three new diagnostic rules (

Thanks @qWaitCrypto!
Bug Description
Kanban tasks could persist invalid worker skill configuration and only fail much later at dispatch/runtime.
Before this change:
- `hermes kanban create --skill web --skill browser ...` was accepted even though `web`/`browser` are toolset names, not SKILL bundle names.
- Workers were later started with `--skills ...` and failed startup with unknown-skill errors.
- Tasks with invalid `tasks.skills` were hard to diagnose from the board alone.
- Tasks left in `running` after their claim TTL expired also had no explicit diagnostic explaining the stuck state.

This patch also surfaces related operator-visible failure modes:
Fixes #22921
Related to #22922, #22924, #22925, #22926
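The stuck-state condition described above (a task still `running` after its claim TTL expired) can be sketched as a small read-only check. This is an illustrative sketch, not the PR's code: the `Task` dataclass, the `stale_running_claim()` function signature, the returned dict shape, and the assumption that `claim_expires` is stored as a Unix timestamp are all hypothetical. Only the rule name `stale_running_claim` and the field name `claim_expires` come from the PR description.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    # Hypothetical minimal task shape; the real Kanban row has more fields.
    id: str
    status: str
    claim_expires: Optional[float]  # assumed to be a Unix timestamp


def stale_running_claim(task: Task, now: Optional[float] = None) -> Optional[dict]:
    """Return a diagnostic if the task is running but its claim has expired.

    Read-only: the task is never modified.
    """
    now = time.time() if now is None else now
    if task.status != "running" or task.claim_expires is None:
        return None
    if task.claim_expires >= now:
        return None  # claim is still valid
    return {
        "rule": "stale_running_claim",
        "task_id": task.id,
        "expired_for_seconds": now - task.claim_expires,
    }
```

Passing `now` explicitly keeps the check deterministic in tests; a caller would normally omit it and let the function read the current time.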
Root Cause
Kanban was treating `tasks.skills` as an opaque string list.

That missed an important invariant:

- `tasks.skills` are SKILL bundle names forwarded to `hermes --skills ...`

Because the invariant was not enforced at create time, bad tasks could be written successfully and only fail later during worker spawn. Recovery and diagnostics were also incomplete: existing malformed rows and stale running claims were not clearly surfaced from the Kanban diagnostics subsystem, and there was no minimal supported repair command for invalid `skills`.

Fix
This PR keeps the behavioral scope narrow and focuses on validation + visibility + minimal recovery:
- Reject known toolset names in `create_task()` / `kanban create` / `kanban_create` when they are passed via `skills`.
- Add three diagnostic rules: `invalid_task_skills`, `assignee_profile_not_found`, `stale_running_claim`.
- Add `hermes kanban edit <task_id> --clear-skills` as a minimal recovery path for already-persisted bad task rows.

Notably, this PR does not:
How to Verify
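1. Run `hermes kanban create "bad task" --assignee worker --skill web` and confirm it is rejected, because `web` is a toolset name.
2. For an already-persisted task with invalid `skills`, run `hermes kanban diagnostics --task <task_id>`. Confirm it reports `invalid_task_skills`.
3. Run `hermes kanban edit <task_id> --clear-skills`. Confirm the task's `skills` field is cleared and the recovery event is recorded.
4. For a `running` task whose `claim_expires` is in the past, confirm diagnostics reports `stale_running_claim`.

Test Plan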
- Tests covering `edit --clear-skills`

Targeted test command run for this PR:

    pytest tests/hermes_cli/test_kanban_diagnostics.py \
      tests/hermes_cli/test_kanban_db.py \
      tests/hermes_cli/test_kanban_core_functionality.py \
      tests/tools/test_kanban_tools.py -q

Result:
Risk Assessment
Low / Medium: validation is intentionally narrow and only rejects a small explicit set of known toolset names when used in the `skills` field. The new diagnostics are read-only, and the new recovery path only clears `skills` on non-running tasks.
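The two narrow behaviors named in this assessment can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the names `KNOWN_TOOLSET_NAMES`, `validate_task_skills()`, and `clear_skills()`, the dict-based task shape, and the choice of exception types are all hypothetical. The set contains only `web` and `browser` because those are the two toolset names the description mentions; the actual rejected set is not shown in the PR text.

```python
from typing import Iterable, List

# Assumption: only the two toolset names mentioned in the PR description.
# The real explicit set used by the PR is not shown there.
KNOWN_TOOLSET_NAMES = frozenset({"web", "browser"})


def validate_task_skills(skills: Iterable[str]) -> List[str]:
    """Reject known toolset names passed as skills; accept everything else.

    Deliberately narrow: names are not checked against installed SKILL bundles.
    """
    skills = list(skills)
    rejected = [name for name in skills if name in KNOWN_TOOLSET_NAMES]
    if rejected:
        raise ValueError(
            "toolset names are not valid skills: " + ", ".join(rejected)
        )
    return skills


def clear_skills(task: dict) -> dict:
    """Return a copy of the task with skills cleared, unless it is running."""
    if task.get("status") == "running":
        # Assumption: how the real command refuses running tasks is not
        # described; raising here is just one plausible behavior.
        raise RuntimeError("refusing to clear skills on a running task")
    updated = dict(task)
    updated["skills"] = []
    return updated
```

Returning a copy from `clear_skills()` rather than mutating in place is a choice made for the sketch so that the caller decides when to persist the change.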