fix(kanban): suppress dispatcher stuck-warn for non-spawnable assignees by Brecht-H · Pull Request #20134 · NousResearch/hermes-agent

Brecht-H · 2026-05-05T09:44:04Z

Summary

Stacked on #20105. After that PR landed locally on a multi-lane Hermes setup (orion-cc + orion-research terminal lanes pulling kanban tasks via claim_task), the gateway emits a noisy WARN every 5 min:

WARNING gateway.run: kanban dispatcher stuck: ready queue non-empty for
6/N consecutive ticks but 0 workers spawned. Check profile health (venv,
PATH, credentials) and `hermes kanban list --status ready`.

The warning is correct in spirit: it catches a broken PATH, a missing venv, or credential loss for a real Hermes profile. On a multi-lane host, however, it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn (the assignee is a control-plane lane, not a Hermes profile), and there is nothing for the operator to fix.

This PR splits the bookkeeping so health telemetry only fires on real failures.

Changes

DispatchResult gains a skipped_nonspawnable field (separate from skipped_unassigned). PR #20105 (fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile) lumped both cases into skipped_unassigned, although they have different operator semantics:
- skipped_unassigned — task has no assignee → operator should route it.
- skipped_nonspawnable — task is owned by a control-plane lane → terminal will pull it via claim_task, expected steady-state.
dispatch_once routes the not profile_exists(assignee) skip into the new bucket.
New helper kanban_db.has_spawnable_ready(conn) returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when profile_exists is unimportable so degraded installs still surface the original warn.
The gateway dispatcher (gateway/run.py) and the CLI standalone daemon (hermes_cli/kanban.py) both swap their cheap ready_nonempty probe to use has_spawnable_ready. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
CLI dispatch output prints Skipped (non-spawnable assignee — terminal lane, OK) for visibility without alarm.
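
The bucketing described above can be sketched roughly as follows. This is a minimal illustration, not the code from hermes_cli/kanban_db.py: the reduced DispatchResult shape, the bucket_ready_tasks helper, the dict-based task rows, and the spawned list are all assumptions made for the example. Only the two skip-bucket names and the "profile_exists is None means legacy behaviour" fallback come from the PR text.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DispatchResult:
    # Hypothetical, reduced shape; the real class carries more fields.
    spawned: list = field(default_factory=list)
    skipped_unassigned: list = field(default_factory=list)    # no assignee at all
    skipped_nonspawnable: list = field(default_factory=list)  # assignee is not a real profile


def bucket_ready_tasks(
    tasks: list,
    profile_exists: Optional[Callable[[str], bool]],
) -> DispatchResult:
    """Classify ready tasks the way the PR describes (illustrative helper)."""
    result = DispatchResult()
    for task in tasks:
        assignee = task.get("assignee")
        if not assignee:
            # Operator should route this task.
            result.skipped_unassigned.append(task["id"])
        elif profile_exists is not None and not profile_exists(assignee):
            # Owned by a control-plane lane; a terminal pulls it via claim_task.
            result.skipped_nonspawnable.append(task["id"])
        else:
            # Real profile, or profile_exists unavailable (legacy behaviour).
            result.spawned.append(task["id"])
    return result


tasks = [
    {"id": "t1", "assignee": None},
    {"id": "t2", "assignee": "orion-cc"},
    {"id": "t3", "assignee": "daily"},
]
result = bucket_ready_tasks(tasks, lambda name: name == "daily")
fallback = bucket_ready_tasks(tasks, None)
```

Keeping the two skip reasons in separate lists lets callers print or alert on them differently, which is what the CLI output change relies on.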

Tests

New has_spawnable_ready cases: empty queue, terminal-lane only, mixed real+terminal.
New test_dispatch_skips_nonspawnable_into_separate_bucket verifies the bucketing change.
Updated test_dispatch_skips_unassigned to assert no cross-leak between buckets.
Added all_assignees_spawnable fixture in tests/hermes_cli/conftest.py and threaded it through dispatcher tests that use synthetic assignees (alice, bob). PR #20105 (fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile) silently broke 8 such tests by routing those assignees into the skip path instead of spawning; this PR repairs them as part of the same code area:
- test_dispatch_dry_run_does_not_claim
- test_dispatch_promotes_ready_and_spawns
- test_dispatch_spawn_failure_releases_claim
- test_spawn_failure_auto_blocks_after_limit
- test_successful_spawn_resets_failure_counter
- test_workspace_resolution_failure_also_counts
- test_spawn_failure_circuit_breaker_emits_gave_up
- test_spawned_event_emitted_with_pid
- test_run_on_spawn_failure_records_failed_runs
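
For readers unfamiliar with the pattern, here is a rough sketch of what a fixture like all_assignees_spawnable has to do. It is not the fixture from tests/hermes_cli/conftest.py: it uses unittest.mock instead of pytest's monkeypatch, and profiles_standin is a made-up module standing in for hermes_cli.profiles so the example runs on its own. It does show why the dispatcher's local import makes the patch effective: the name is looked up at call time.

```python
import sys
import types
from contextlib import contextmanager
from unittest import mock

# Stand-in for hermes_cli.profiles (hypothetical module name) so the sketch
# runs anywhere; only the profile "daily" exists here.
_profiles = types.ModuleType("profiles_standin")
_profiles.profile_exists = lambda name: name in {"daily"}
sys.modules["profiles_standin"] = _profiles


def is_spawnable(assignee: str) -> bool:
    # Local import, as in the dispatcher: the attribute is resolved on every
    # call, so a patch applied by a test is picked up.
    from profiles_standin import profile_exists
    return profile_exists(assignee)


@contextmanager
def all_assignees_spawnable():
    """Treat every assignee as a real profile for the duration of the block."""
    with mock.patch("profiles_standin.profile_exists", lambda name: True):
        yield


before = is_spawnable("alice")   # synthetic assignee, no such profile
with all_assignees_spawnable():
    during = is_spawnable("alice")
after = is_spawnable("alice")    # patch is undone on exit
```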

Test plan

pytest tests/hermes_cli/test_kanban_db.py — 54/54 pass.
pytest tests/hermes_cli/test_kanban_{db,cli,boards,core_functionality}.py — 246/246 pass.
Reviewer: confirm 30+ minutes of gateway uptime on a multi-lane host shows zero "dispatcher stuck" warnings while the ready queue is steadily full of orion-cc / orion-research tasks.
Reviewer: create a ready task with assignee=daily (or any real profile name on the host) without supplying credentials and confirm the warn STILL fires after HEALTH_WINDOW=6 ticks — the original safety net is intact.
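
The two reviewer checks above amount to exercising a consecutive-tick counter. A minimal sketch, assuming the warn is driven by a counter that resets whenever a tick either spawns something or has no spawnable work; the StuckDetector class and its observe method are hypothetical names, and only HEALTH_WINDOW=6 and the has_spawnable_ready probe come from the PR text:

```python
HEALTH_WINDOW = 6  # ticks, per the test plan


class StuckDetector:
    """Hypothetical helper: flags N consecutive ticks with spawnable work but no spawn."""

    def __init__(self, window: int = HEALTH_WINDOW) -> None:
        self.window = window
        self.stuck_ticks = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        # spawnable_ready would come from has_spawnable_ready(conn).
        if spawnable_ready and spawned == 0:
            self.stuck_ticks += 1
        else:
            self.stuck_ticks = 0
        return self.stuck_ticks >= self.window


# Multi-lane host: queue holds only orion-cc / orion-research tasks.
lanes_only = StuckDetector()
lane_warns = [lanes_only.observe(spawnable_ready=False, spawned=0) for _ in range(10)]

# Real profile without credentials: spawnable work exists, nothing starts.
broken = StuckDetector()
broken_warns = [broken.observe(spawnable_ready=True, spawned=0) for _ in range(6)]
```

Under these assumptions the first host never warns, while the second reaches the threshold on the sixth tick.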

Merge order

⚠️ Merge after #20105.

🤖 Generated with Claude Code

…l profile

The kanban dispatcher's `_default_spawn` invokes ``hermes -p <task.assignee> chat -q ...``. When ``assignee`` names a control-plane lane (e.g. an interactive Claude Code terminal like ``orion-cc`` / ``orion-research``) instead of a real Hermes profile, the subprocess fails on startup with "Profile 'X' does not exist", gets reaped as a zombie, the TTL/crash detector marks the task back to ``ready``, and the next tick re-spawns the same crashing worker. Result: a permanent crash loop emitting ``spawned=2 crashed=2 every tick`` in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

    # 1. Create a kanban task whose assignee names a non-profile.
    hermes kanban create --assignee orion-cc --status ready \
        --title "Review PR #N" --body "..."

    # 2. Start the gateway with the embedded dispatcher.
    hermes gateway run
    # gateway.log lines every minute:
    #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...

    # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---

``dispatch_once()`` now pre-checks ``hermes_cli.profiles.profile_exists(assignee)`` before claiming. If False, the row is added to ``skipped_unassigned`` (it's effectively "unassigned-to-an-executable-profile") and the dispatcher moves on without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test isolation, profile module restructured), ``profile_exists`` falls back to ``None`` and the original behaviour is preserved unchanged.

This addresses the explicit hint in the kanban task body (``t_2bab06e3``): "Should ready-state tasks auto-spawn at all, or only on explicit orion-cc claim? If spurious, gate the auto-spawn behind a config flag (e.g. only assignee=hermes or assignee=auto)."

Profile-existence is a tighter gate than a config flag: it self-documents (the user already knows whether they have an ``orion-cc`` profile), and it doesn't require Mac to maintain an allowlist as new lane names appear. New lanes that ARE real profiles (created via ``hermes profile create``) auto-qualify the moment the profile dir is created.

Validated live
--------------

On Orion's hermes-agent install, two ``orion-research``-assigned tasks (Bug A and Bug C investigations) had been crash-looping since 2026-05-05 06:58 local. After applying the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and has ticked silently for 2+ minutes: no spawned=, crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``, ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick; current code re-imports ``profile_exists`` per row so a freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ly non-spawnable assignees

After PR NousResearch#20105 (dispatcher skips ready tasks whose assignee fails ``profile_exists()`` to prevent the orion-cc/orion-research crash loop), the gateway and CLI emit a spurious "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned" warning every 5 minutes on multi-lane setups where the queue is steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH, missing venv, credential loss for a real Hermes profile). On a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn, and there is nothing for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field (separate from ``skipped_unassigned``) so callers can distinguish "task missing an owner; operator should route it" from "task owned by a control-plane lane; terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when ``profile_exists`` is unimportable so degraded installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone daemon (``hermes_cli/kanban.py``) both swap their cheap ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee — terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket`` verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no cross-leak.
* Added ``all_assignees_spawnable`` fixture in ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher tests that use synthetic assignees ("alice", "bob"). PR NousResearch#20105 (the parent commit) silently broke 8 such tests by routing those assignees into ``skipped_nonspawnable`` instead of spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05 (PR NousResearch#20105). Reviewer: this PR is meant to merge AFTER NousResearch#20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR refines kanban dispatcher bookkeeping and health telemetry so “dispatcher stuck” warnings only fire when there is spawnable ready work (i.e., tasks assigned to real Hermes profiles), avoiding persistent false-positive warnings on multi-lane hosts where ready tasks may be intentionally owned by terminal/control-plane lanes and pulled via claim_task.

Changes:

Added DispatchResult.skipped_nonspawnable and updated dispatch_once() to bucket “assignee is not a real profile” separately from truly unassigned tasks.
Introduced kanban_db.has_spawnable_ready(conn) and switched both the gateway dispatcher and CLI daemon health probes to use it.
Expanded/adjusted tests and added a shared all_assignees_spawnable fixture to keep dispatcher spawn tests using synthetic assignees stable.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`hermes_cli/kanban_db.py`	Adds `skipped_nonspawnable`, implements `has_spawnable_ready()`, and updates `dispatch_once()` skip logic.
`hermes_cli/kanban.py`	Extends CLI dispatch output (JSON + human) and updates daemon stuck-warn probe to use `has_spawnable_ready()`.
`gateway/run.py`	Updates embedded dispatcher health telemetry probe across boards to use `has_spawnable_ready()`.
`tests/hermes_cli/test_kanban_db.py`	Adds targeted tests for new skip bucket and `has_spawnable_ready()` behavior; threads fixture into spawn-related tests.
`tests/hermes_cli/test_kanban_core_functionality.py`	Threads `all_assignees_spawnable` into spawn/circuit-breaker tests that use synthetic assignees.
`tests/hermes_cli/conftest.py`	Adds shared `all_assignees_spawnable` fixture to monkeypatch `profile_exists()` for synthetic assignees in tests.

+        # the task would loop back to ``ready`` on next tick, and we'd
+        # burn CPU forever (#kanban-dispatcher-crash-loop 2026-05-05).
+        try:
+            from hermes_cli.profiles import profile_exists  # local import: avoids cycle
+        except Exception:
+            profile_exists = None  # type: ignore[assignment]
+        if profile_exists is not None and not profile_exists(row["assignee"]):
+            # Bucket separately from skipped_unassigned: the operator


+    rows = conn.execute(
+        "SELECT DISTINCT assignee FROM tasks "
+        "WHERE status = 'ready' AND assignee IS NOT NULL "
+        "    AND claim_lock IS NULL"
+    ).fetchall()
+    if not rows:
+        return False
+    try:
+        from hermes_cli.profiles import profile_exists  # local import: avoids cycle
+    except Exception:
+        # Can't introspect — assume spawnable, preserve legacy behavior.
+        return True
+    for row in rows:
+        if profile_exists(row["assignee"]):
+            return True
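
The hunk above is cut off by the review view. A self-contained approximation of the helper, runnable against an in-memory SQLite database, might look like the sketch below. It reuses the query from the hunk; the four-column tasks schema is an assumption, and profile_exists is passed as a parameter here for testability, whereas the hunk imports it locally from hermes_cli.profiles.

```python
import sqlite3
from typing import Callable, Optional


def has_spawnable_ready(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]] = None,
) -> bool:
    """True iff some ready, assigned, unclaimed task belongs to a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "AND claim_lock IS NULL"
    ).fetchall()
    if not rows:
        return False
    if profile_exists is None:
        # Cannot introspect profiles: assume spawnable, preserving legacy behaviour.
        return True
    return any(profile_exists(row["assignee"]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["assignee"], as in the hunk
# Assumed minimal schema; the real tasks table has more columns.
conn.execute("CREATE TABLE tasks (id TEXT, status TEXT, assignee TEXT, claim_lock TEXT)")


def real_profiles(name: str) -> bool:
    return name == "daily"


empty_queue = has_spawnable_ready(conn, real_profiles)

conn.execute("INSERT INTO tasks VALUES ('t1', 'ready', 'orion-cc', NULL)")
lanes_only = has_spawnable_ready(conn, real_profiles)
degraded = has_spawnable_ready(conn, None)

conn.execute("INSERT INTO tasks VALUES ('t2', 'ready', 'daily', NULL)")
mixed = has_spawnable_ready(conn, real_profiles)
```

The three probes mirror the test cases named in the PR: empty queue, terminal-lane only, and mixed real+terminal.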


teknium1 · 2026-05-05T11:13:51Z

Merged via PR #20165 (commit f25d3ec). The skipped_nonspawnable bucket, has_spawnable_ready() helper, gateway + CLI probe swap, and the 8 test repairs via the all_assignees_spawnable fixture all landed on main with your authorship preserved in git log. Stacking the telemetry fix on top of #20105 with the test repairs in the same PR made this a clean salvage — thanks.

Hermes Sovereign AgentCore and others added 2 commits May 5, 2026 07:47

Copilot AI review requested due to automatic review settings May 5, 2026 09:44

Copilot started reviewing on behalf of Brecht-H May 5, 2026 09:44

Copilot AI reviewed May 5, 2026

alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/cli (CLI entry point, hermes_cli/, setup wizard), and comp/plugins (Plugin system and bundled plugins) labels May 5, 2026

kallidean mentioned this pull request May 5, 2026

Feature Request: First-Class Persistent Kanban Worker Lanes #20157

Closed

teknium1 mentioned this pull request May 5, 2026

fix(kanban): skip dispatch for tasks assigned to non-profile lanes (salvages #20105, #20134) #20165

Merged

teknium1 closed this in #20165 May 5, 2026

Brecht-H wants to merge 2 commits into NousResearch:main from Brecht-H:fix/kanban-dispatcher-suppress-stuck-warn-nonspawnable-2026-05-05
