fix(kanban): suppress dispatcher stuck-warn for non-spawnable assignees #20134
…l profile
The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2`` every tick
in the gateway log and burning CPU forever.
Reproduce on a fresh Hermes-agent install:
# 1. Create a kanban task whose assignee names a non-profile.
hermes kanban create --assignee orion-cc --status ready \
--title "Review PR #N" --body "..."
# 2. Start the gateway with the embedded dispatcher.
hermes gateway run
# gateway.log lines every minute:
# kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
# 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.
Fix
---
``dispatch_once()`` now pre-checks
``hermes_cli.profiles.profile_exists(assignee)`` before claiming. If False, the row
is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher
moves on without claiming, spawning, or counting a crash.
The check fails open: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None`` and the original behaviour is preserved
unchanged.
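A minimal sketch of that fail-open guard, not the actual patch. It assumes ``profile_exists`` takes a profile name and returns a bool; the helper names ``load_profile_exists`` and ``should_spawn`` are hypothetical and exist only for illustration.

```python
from typing import Callable, Optional

ProfileCheck = Callable[[str], bool]


def load_profile_exists() -> Optional[ProfileCheck]:
    """Return the real checker, or None when it cannot be imported."""
    try:
        # Import path taken from the commit message; imported lazily.
        from hermes_cli.profiles import profile_exists
    except Exception:
        return None  # fail open: the caller keeps the legacy behaviour
    return profile_exists


def should_spawn(assignee: Optional[str],
                 profile_exists: Optional[ProfileCheck]) -> bool:
    """Decide whether the dispatcher may claim and spawn for this assignee."""
    if not assignee:
        return False  # nothing to spawn for an unassigned task
    if profile_exists is None:
        return True  # checker unavailable: behave as before the guard
    return bool(profile_exists(assignee))
```

With a checker that only knows a ``daily`` profile, ``should_spawn("orion-cc", checker)`` is False, while passing ``None`` as the checker restores the pre-patch behaviour of always spawning.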
This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):
"Should ready-state tasks auto-spawn at all, or only on
explicit orion-cc claim? If spurious, gate the auto-spawn
behind a config flag (e.g. only assignee=hermes or
assignee=auto)."
Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``) auto-
qualify the moment the profile dir is created.
Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-
assigned tasks (Bug A and Bug C investigations) had been
crash-looping since 2026-05-05 06:58 local. After applying
the patch + restarting the gateway:
- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
has ticked silently for 2+ minutes — no spawned=,
crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.
Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly non-spawnable assignees

After PR NousResearch#20105 (dispatcher skips ready tasks whose assignee fails ``profile_exists()`` to prevent the orion-cc/orion-research crash loop), the gateway and CLI emit a spurious "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned" warning every 5 minutes on multi-lane setups where the queue is steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH, missing venv, credential loss for a real Hermes profile). On a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn, and there is nothing for the operator to fix.

Changes:
* ``DispatchResult`` gains a ``skipped_nonspawnable`` field (separate from ``skipped_unassigned``) so callers can distinguish "task missing an owner: operator should route it" from "task owned by a control-plane lane: terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when ``profile_exists`` is unimportable so degraded installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone daemon (``hermes_cli/kanban.py``) both swap their cheap ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee — terminal lane, OK)`` for visibility without alarm.

Tests:
* New ``has_spawnable_ready`` cases (empty queue, terminal-lane only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket`` verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no cross-leak.
* Added ``all_assignees_spawnable`` fixture in ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher tests that use synthetic assignees ("alice", "bob"). PR NousResearch#20105 (the parent commit) silently broke 8 such tests by routing those assignees into ``skipped_nonspawnable`` instead of spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05 (PR NousResearch#20105). Reviewer: this PR is meant to merge AFTER NousResearch#20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
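
The bucketing described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in ``hermes_cli/kanban_db.py``: only the field names ``skipped_unassigned`` and ``skipped_nonspawnable`` come from the commit message, while the ``spawned`` list, the ``(task_id, assignee)`` tuples and the ``classify_ready`` helper are hypothetical, and the real ``DispatchResult`` likely carries more fields.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class DispatchResult:
    # Hypothetical subset of the real result object.
    spawned: List[str] = field(default_factory=list)
    skipped_unassigned: List[str] = field(default_factory=list)
    skipped_nonspawnable: List[str] = field(default_factory=list)


def classify_ready(
    tasks: Iterable[Tuple[str, Optional[str]]],
    profile_exists: Optional[Callable[[str], bool]],
) -> DispatchResult:
    """Sort ready tasks into spawn / unassigned / non-spawnable buckets."""
    result = DispatchResult()
    for task_id, assignee in tasks:
        if not assignee:
            # No owner at all: the operator should route this task.
            result.skipped_unassigned.append(task_id)
        elif profile_exists is not None and not profile_exists(assignee):
            # Owned by a terminal lane: expected steady state, not an error.
            result.skipped_nonspawnable.append(task_id)
        else:
            # Real profile, or checker unavailable (legacy behaviour).
            result.spawned.append(task_id)
    return result
```

Keeping the two skip lists apart is what lets a caller print a calm message for terminal-lane tasks while still flagging ownerless ones.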
Pull request overview
This PR refines kanban dispatcher bookkeeping and health telemetry so “dispatcher stuck” warnings only fire when there is spawnable ready work (i.e., tasks assigned to real Hermes profiles), avoiding persistent false-positive warnings on multi-lane hosts where ready tasks may be intentionally owned by terminal/control-plane lanes and pulled via claim_task.
Changes:
- Added `DispatchResult.skipped_nonspawnable` and updated `dispatch_once()` to bucket “assignee is not a real profile” separately from truly unassigned tasks.
- Introduced `kanban_db.has_spawnable_ready(conn)` and switched both the gateway dispatcher and CLI daemon health probes to use it.
- Expanded/adjusted tests and added a shared `all_assignees_spawnable` fixture to keep dispatcher spawn tests using synthetic assignees stable.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `hermes_cli/kanban_db.py` | Adds `skipped_nonspawnable`, implements `has_spawnable_ready()`, and updates `dispatch_once()` skip logic. |
| `hermes_cli/kanban.py` | Extends CLI dispatch output (JSON + human) and updates daemon stuck-warn probe to use `has_spawnable_ready()`. |
| `gateway/run.py` | Updates embedded dispatcher health telemetry probe across boards to use `has_spawnable_ready()`. |
| `tests/hermes_cli/test_kanban_db.py` | Adds targeted tests for new skip bucket and `has_spawnable_ready()` behavior; threads fixture into spawn-related tests. |
| `tests/hermes_cli/test_kanban_core_functionality.py` | Threads `all_assignees_spawnable` into spawn/circuit-breaker tests that use synthetic assignees. |
| `tests/hermes_cli/conftest.py` | Adds shared `all_assignees_spawnable` fixture to monkeypatch `profile_exists()` for synthetic assignees in tests. |
Review context in `hermes_cli/kanban_db.py` (`dispatch_once()` skip logic):

    # the task would loop back to ``ready`` on next tick, and we'd
    # burn CPU forever (#kanban-dispatcher-crash-loop 2026-05-05).
    try:
        from hermes_cli.profiles import profile_exists  # local import: avoids cycle
    except Exception:
        profile_exists = None  # type: ignore[assignment]
    if profile_exists is not None and not profile_exists(row["assignee"]):
        # Bucket separately from skipped_unassigned: the operator

Review context in `hermes_cli/kanban_db.py` (`has_spawnable_ready()`):

    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "  AND claim_lock IS NULL"
    ).fetchall()
    if not rows:
        return False
    try:
        from hermes_cli.profiles import profile_exists  # local import: avoids cycle
    except Exception:
        # Can't introspect — assume spawnable, preserve legacy behavior.
        return True
    for row in rows:
        if profile_exists(row["assignee"]):
            return True
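
To make the probe's behaviour easier to try in isolation, here is a self-contained sketch built around the same query. It is an approximation, not the shipped helper: it assumes an SQLite connection with `sqlite3.Row` rows, a much-reduced `tasks` schema, and it takes the profile checker as a second parameter (the real `has_spawnable_ready(conn)` imports it internally). `make_demo_db` is a hypothetical helper for the demo only.

```python
import sqlite3
from typing import Callable, Iterable, Optional


def has_spawnable_ready(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]],
) -> bool:
    """True if some ready, assigned, unclaimed task maps to a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "  AND claim_lock IS NULL"
    ).fetchall()
    if not rows:
        return False
    if profile_exists is None:
        # Checker unavailable: assume spawnable, as the legacy probe did.
        return True
    return any(profile_exists(row["assignee"]) for row in rows)


def make_demo_db(assignees: Iterable[str]) -> sqlite3.Connection:
    """Build an in-memory DB holding one ready, unclaimed task per assignee."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows row["assignee"] lookups
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, "
        "assignee TEXT, claim_lock TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (status, assignee, claim_lock) "
        "VALUES ('ready', ?, NULL)",
        [(name,) for name in assignees],
    )
    return conn
```

The three cases named in the PR's tests (empty queue, terminal-lane only, mixed real+terminal) map directly onto calls such as `has_spawnable_ready(make_demo_db(["orion-cc", "daily"]), checker)`.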
Merged via PR #20165 (commit f25d3ec). The …
Summary
Stacked on #20105. After that PR landed locally on a multi-lane Hermes setup (orion-cc + orion-research terminal lanes pulling kanban tasks via `claim_task`), the gateway emits a noisy WARN every 5 min: "kanban dispatcher stuck: ready queue non-empty for N consecutive ticks but 0 workers spawned".

The warning is correct in spirit (it catches broken PATH / missing venv / credential loss for a real Hermes profile), but on a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn (assignee is a control-plane lane, not a Hermes profile), and there is nothing for the operator to fix.
This PR splits the bookkeeping so health telemetry only fires on real failures.
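
As a rough illustration of how such a stuck-warn can be gated on spawnable work, the sketch below counts consecutive idle ticks. It is a simplified model, not the gateway's code: the `StuckDetector` class and its `observe` method are hypothetical, and only the `HEALTH_WINDOW = 6` value and the "N consecutive ticks but 0 workers spawned" condition come from this PR.

```python
HEALTH_WINDOW = 6  # consecutive idle ticks before warning (value from the PR)


class StuckDetector:
    """Track ticks where spawnable work existed but nothing was spawned."""

    def __init__(self, window: int = HEALTH_WINDOW) -> None:
        self.window = window
        self.idle_ticks = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        """Record one dispatcher tick; return True when a warning is due."""
        if spawnable_ready and spawned == 0:
            self.idle_ticks += 1
        else:
            # Either nothing spawnable was waiting, or a worker started.
            self.idle_ticks = 0
        return self.idle_ticks >= self.window
```

Feeding `spawnable_ready` from a probe that ignores terminal-lane assignees means a queue holding only `orion-cc` tasks never advances the counter, while a real profile that fails to start still trips the warning after the window elapses.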
Changes
- `DispatchResult` gains a `skipped_nonspawnable` field (separate from `skipped_unassigned`). PR #20105 (fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile) was lumping both into `skipped_unassigned`, which has different operator semantics:
  - `skipped_unassigned`: task has no assignee → operator should route it.
  - `skipped_nonspawnable`: task is owned by a control-plane lane → terminal will pull it via `claim_task`, expected steady-state.
- `dispatch_once` routes the `not profile_exists(assignee)` skip into the new bucket.
- `kanban_db.has_spawnable_ready(conn)` returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when `profile_exists` is unimportable so degraded installs still surface the original warn.
- The gateway dispatcher (`gateway/run.py`) and the CLI standalone daemon (`hermes_cli/kanban.py`) both swap their cheap `ready_nonempty` probe to use `has_spawnable_ready`. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
- CLI dispatch output prints `Skipped (non-spawnable assignee — terminal lane, OK)` for visibility without alarm.

Tests
- New `has_spawnable_ready` cases: empty queue, terminal-lane only, mixed real+terminal.
- New `test_dispatch_skips_nonspawnable_into_separate_bucket` verifies the bucketing change.
- Updated `test_dispatch_skips_unassigned` to assert no cross-leak between buckets.
- Added `all_assignees_spawnable` fixture in `tests/hermes_cli/conftest.py` and threaded it through dispatcher tests that use synthetic assignees (`alice`, `bob`). PR #20105 silently broke 8 such tests by routing those assignees into the skip path instead of spawning; this PR repairs them as part of the same code area:
  - `test_dispatch_dry_run_does_not_claim`
  - `test_dispatch_promotes_ready_and_spawns`
  - `test_dispatch_spawn_failure_releases_claim`
  - `test_spawn_failure_auto_blocks_after_limit`
  - `test_successful_spawn_resets_failure_counter`
  - `test_workspace_resolution_failure_also_counts`
  - `test_spawn_failure_circuit_breaker_emits_gave_up`
  - `test_spawned_event_emitted_with_pid`
  - `test_run_on_spawn_failure_records_failed_runs`

Test plan
- `pytest tests/hermes_cli/test_kanban_db.py`: 54/54 pass.
- `pytest tests/hermes_cli/test_kanban_{db,cli,boards,core_functionality}.py`: 246/246 pass.
- … `orion-cc`/`orion-research` tasks.
- … `assignee=daily` (or any real profile name on the host) without supplying credentials and confirm the warn STILL fires after `HEALTH_WINDOW=6` ticks; the original safety net is intact.

Merge order
Merge after #20105; this PR stacks on top of it.
🤖 Generated with Claude Code