fix(kanban): harden worker ownership and recovery paths #23334

Closed
qWaitCrypto wants to merge 3 commits into NousResearch:main from qWaitCrypto:fix/kanban-worker-ownership-recovery

@qWaitCrypto

Contributor

What does this PR do?

This PR is a focused Kanban resubmission against current main.

Create-time validation for toolset names in task.skills has already landed
separately in #23273. This PR intentionally does not reimplement that path.

It resubmits only the still-novel pieces from the earlier stacked PRs:

  • diagnostics for stuck / undispatchable tasks
  • narrow recovery commands and defensive dispatch preflight
  • worker startup ownership guard before any model API call
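
As a rough illustration of the first item, a read-only diagnostic pass might look like the sketch below. The three diagnostic codes are the ones named under Changes Made; everything else (the `Task` shape, the `KNOWN_TOOLSETS` and `KNOWN_PROFILES` sets, and the `CLAIM_TTL_SECONDS` staleness threshold) is a hypothetical stand-in, not the PR's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-ins for the real toolset registry and profile store.
KNOWN_TOOLSETS = {"web", "terminal"}
KNOWN_PROFILES = {"default", "researcher"}
CLAIM_TTL_SECONDS = 300.0  # assumed threshold for calling a claim stale


@dataclass
class Task:
    id: str
    status: str
    skills: List[str] = field(default_factory=list)
    assignee: Optional[str] = None
    claimed_at: Optional[float] = None  # epoch seconds of the last claim


def diagnose(task: Task, now: float) -> List[str]:
    """Return diagnostic codes for one task without mutating it."""
    codes = []
    # A toolset name persisted in skills is treated as invalid.
    if any(skill in KNOWN_TOOLSETS for skill in task.skills):
        codes.append("invalid_task_skills")
    # The assignee must refer to a profile that still exists.
    if task.assignee is not None and task.assignee not in KNOWN_PROFILES:
        codes.append("assignee_profile_not_found")
    # A running task whose claim is older than the threshold looks stuck.
    if (
        task.status == "running"
        and task.claimed_at is not None
        and now - task.claimed_at > CLAIM_TTL_SECONDS
    ):
        codes.append("stale_running_claim")
    return codes
```

Because the function only reads the task, it is safe to run against live state; repair is left to the explicit recovery commands.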

Related Issue

Related to #22925, #22926, #22927.

Follow-up to #23209.

Builds on #23273, which already landed create-time validation for toolset
names in task.skills.

Supersedes the earlier stacked PRs #22974, #23154, and #23183 by resubmitting
only the still-novel Kanban hardening pieces against current main.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Add read-only diagnostics for invalid_task_skills, assignee_profile_not_found, and stale_running_claim
  • Add narrow operator recovery commands:
    • hermes kanban edit <task> --clear-skills
    • hermes kanban edit <task> --reset-failures
    • hermes kanban edit <task> --clear-claim
  • Teach dispatch to hard-skip ready tasks whose persisted state already proves
    they cannot spawn correctly:
    • tasks with historical invalid persisted skills
    • tasks assigned to missing profiles
  • Add a read-only worker startup ownership guard before any model API call
  • Require the Kanban startup guard to validate task status, run ownership, and claim lock ownership
  • Treat malformed Kanban worker ownership env as a benign startup-guard skip instead of silently disabling the check
  • Factor the startup-guard early return through a shared helper in run_agent.py instead of hand-rolling large inline result dicts
  • Add targeted regression tests for diagnostics, recovery commands, dispatch skip behavior, and worker startup guard behavior
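
The startup guard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in run_agent.py: the environment variable names, the `TaskRow` fields, and the reason strings are all hypothetical, and the handling of malformed env (refuse to start, but without recording a failure) is one reading of the "benign skip" bullet.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TaskRow:
    status: str
    current_run_id: Optional[int]
    claim_lock: Optional[str]


@dataclass(frozen=True)
class GuardResult:
    proceed: bool
    reason: str


def startup_guard(env: Mapping[str, str], task: Optional[TaskRow]) -> GuardResult:
    """Decide whether a worker may start; read-only, no model API call."""
    # Hypothetical variable names for the ownership info passed to workers.
    task_id = env.get("KANBAN_TASK_ID")
    run_id_raw = env.get("KANBAN_RUN_ID")
    claim_lock = env.get("KANBAN_CLAIM_LOCK")

    # No ownership env at all: not a dispatcher-spawned worker.
    if not (task_id or run_id_raw or claim_lock):
        return GuardResult(True, "not_a_kanban_worker")

    # Partial or non-numeric ownership env: skip benignly, never run unchecked.
    if not (task_id and run_id_raw and claim_lock):
        return GuardResult(False, "malformed_ownership_env")
    try:
        run_id = int(run_id_raw)
    except ValueError:
        return GuardResult(False, "malformed_ownership_env")

    # All three checks must pass: task status, run ownership, claim lock.
    if task is None or task.status != "running":
        return GuardResult(False, "task_not_running")
    if task.current_run_id != run_id:
        return GuardResult(False, "run_superseded")
    if task.claim_lock != claim_lock:
        return GuardResult(False, "claim_lock_mismatch")
    return GuardResult(True, "owned")
```

Returning one small result object from every exit path mirrors the intent of the shared early-return helper: callers branch on `proceed` and log `reason` instead of assembling result dicts inline.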

How to Test

  1. Run targeted Kanban tests:
    pytest -q \
      tests/hermes_cli/test_kanban_diagnostics.py \
      tests/hermes_cli/test_kanban_db.py::test_reset_task_failures_clears_counter_and_emits_event \
      tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_on_non_running_task \
      tests/hermes_cli/test_kanban_db.py::test_edit_task_recovery_fields_clear_claim_keeps_terminal_run_terminal \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_on_non_running_task \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_running_task \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_skills_rejects_result_fields \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_reset_failures_rejects_result_fields \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim \
      tests/hermes_cli/test_kanban_core_functionality.py::test_cli_edit_clear_claim_rejects_result_fields \
      tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_reclaimed_run \
      tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_rejects_superseded_run_without_failure \
      tests/hermes_cli/test_kanban_core_functionality.py::test_worker_startup_guard_requires_claim_lock \
      tests/run_agent/test_kanban_worker_startup_guard.py
  2. Run the focused dispatch-preflight regression:
    pytest -q tests/hermes_cli/test_kanban_db.py::test_dispatch_skips_invalid_task_skills_and_keeps_ready
  3. Create or patch a ready task so tasks.skills contains a toolset name, then run hermes kanban dispatch --json and verify it is reported under skipped_invalid_skills and remains ready
  4. Use hermes kanban edit <task_id> --reset-failures and hermes kanban edit <task_id> --clear-claim to verify both recovery actions succeed on eligible tasks
  5. Start a dispatcher-spawned worker against a reclaimed or superseded task and confirm the worker exits before any model API call
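
For orientation, the behavior that step 3 checks can be modeled with the toy preflight below. Only the `skipped_invalid_skills` key comes from this PR; the `spawned` and `skipped_missing_profile` keys, the dict-based task shape, and the two lookup sets are hypothetical, and the real dispatcher's JSON output may be structured differently.

```python
import json

# Hypothetical stand-ins for the toolset registry and profile store.
KNOWN_TOOLSETS = {"web", "terminal"}
KNOWN_PROFILES = {"default"}


def dispatch_preflight(tasks):
    """Spawn eligible ready tasks; hard-skip ones that cannot spawn correctly."""
    report = {
        "spawned": [],                  # hypothetical key
        "skipped_invalid_skills": [],   # key named in this PR
        "skipped_missing_profile": [],  # hypothetical key
    }
    for task in tasks:
        if task["status"] != "ready":
            continue
        if any(skill in KNOWN_TOOLSETS for skill in task["skills"]):
            # Skipped tasks are reported but left in the ready state.
            report["skipped_invalid_skills"].append(task["id"])
        elif task["assignee"] not in KNOWN_PROFILES:
            report["skipped_missing_profile"].append(task["id"])
        else:
            task["status"] = "running"
            report["spawned"].append(task["id"])
    return report


if __name__ == "__main__":
    tasks = [
        {"id": "a", "status": "ready", "skills": ["web"], "assignee": "default"},
        {"id": "b", "status": "ready", "skills": [], "assignee": "default"},
    ]
    print(json.dumps(dispatch_preflight(tasks), indent=2))
```

The point of the skip is that a task with bad persisted state is neither spawned nor failed: it stays ready until an operator repairs it, for example with the recovery commands listed under Changes Made.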

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux (WSL-style dev environment)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • Targeted verification passed: 44 passed in 38.48s
  • Focused dispatch-preflight regression passed locally

@qWaitCrypto

Contributor Author

@teknium1 I resubmitted the Kanban hardening work as a focused PR against current main, following your feedback on #22974 / #23154 / #23183.

@alt-glitch added the labels type/bug (Something isn't working), comp/cron (Cron scheduler and job management), and P3 (Low: cosmetic, nice to have) on May 10, 2026.
