
fix(kanban): suppress dispatcher stuck-warn for non-spawnable assignees#20134

Closed
Brecht-H wants to merge 2 commits into
NousResearch:mainfrom
Brecht-H:fix/kanban-dispatcher-suppress-stuck-warn-nonspawnable-2026-05-05

@Brecht-H Brecht-H commented May 5, 2026

Contributor

Summary

Stacked on #20105. After that PR landed locally on a multi-lane Hermes setup (orion-cc + orion-research terminal lanes pulling kanban tasks via claim_task), the gateway emits a noisy WARN every 5 min:

WARNING gateway.run: kanban dispatcher stuck: ready queue non-empty for
6/N consecutive ticks but 0 workers spawned. Check profile health (venv,
PATH, credentials) and `hermes kanban list --status ready`.

The warning is correct in spirit — catches broken PATH / missing venv / credential loss for a real Hermes profile — but on a multi-lane host it fires forever even though everything is healthy: the dispatcher correctly chose not to spawn (assignee is a control-plane lane, not a Hermes profile), and there is nothing for the operator to fix.

This PR splits the bookkeeping so health telemetry only fires on real failures.

Changes

  • DispatchResult gains a skipped_nonspawnable field (separate from skipped_unassigned). PR #20105 (fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile) lumped both into skipped_unassigned, but the two cases have different operator semantics:
    • skipped_unassigned — task has no assignee → operator should route it.
    • skipped_nonspawnable — task is owned by a control-plane lane → terminal will pull it via claim_task, expected steady-state.
  • dispatch_once routes the not profile_exists(assignee) skip into the new bucket.
  • New helper kanban_db.has_spawnable_ready(conn) returns True iff at least one ready+assigned+unclaimed task in the DB has an assignee that maps to a real Hermes profile. Falls back to legacy "any ready+assigned" when profile_exists is unimportable so degraded installs still surface the original warn.
  • The gateway dispatcher (gateway/run.py) and the CLI standalone daemon (hermes_cli/kanban.py) both swap their cheap ready_nonempty probe to use has_spawnable_ready. Stuck-warn now fires only when there is genuine spawnable work the dispatcher failed to start.
  • CLI dispatch output prints Skipped (non-spawnable assignee — terminal lane, OK) for visibility without alarm.
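
The bucketing change above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_db.py: the `route_skip` helper, the `id` key, and the list-of-ids field shape are assumptions made for the example. Only the two field names and the `profile_exists` check come from the PR description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DispatchResult:
    # Field names follow the PR; storing task ids in lists is an assumption.
    skipped_unassigned: List[str] = field(default_factory=list)
    skipped_nonspawnable: List[str] = field(default_factory=list)


def route_skip(
    task: Dict[str, Optional[str]],
    result: DispatchResult,
    profile_exists: Optional[Callable[[str], bool]],
) -> bool:
    """Hypothetical helper: record a skip and return True, else return False."""
    assignee = task.get("assignee")
    if not assignee:
        # No owner at all: the operator should route this task.
        result.skipped_unassigned.append(task["id"])
        return True
    if profile_exists is not None and not profile_exists(assignee):
        # Owned by a control-plane lane: a terminal pulls it via claim_task.
        result.skipped_nonspawnable.append(task["id"])
        return True
    return False  # spawnable: the dispatcher would go on to claim and spawn


# Illustrative data only; "daily" stands in for a real profile on the host.
real_profiles = {"daily"}
result = DispatchResult()
tasks = [
    {"id": "t1", "assignee": None},
    {"id": "t2", "assignee": "orion-cc"},
    {"id": "t3", "assignee": "daily"},
]
spawnable = [t["id"] for t in tasks
             if not route_skip(t, result, real_profiles.__contains__)]
```

Keeping the two lists separate lets callers print the terminal-lane skips as informational while still treating unassigned tasks as something to act on.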

Tests

  • New has_spawnable_ready cases: empty queue, terminal-lane only, mixed real+terminal.
  • New test_dispatch_skips_nonspawnable_into_separate_bucket verifies the bucketing change.
  • Updated test_dispatch_skips_unassigned to assert no cross-leak between buckets.
  • Added all_assignees_spawnable fixture in tests/hermes_cli/conftest.py and threaded it through dispatcher tests that use synthetic assignees (alice, bob). PR #20105 (fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile) silently broke 8 such tests by routing those assignees into the skip path instead of spawning; this PR repairs them as part of the same code area:
    • test_dispatch_dry_run_does_not_claim
    • test_dispatch_promotes_ready_and_spawns
    • test_dispatch_spawn_failure_releases_claim
    • test_spawn_failure_auto_blocks_after_limit
    • test_successful_spawn_resets_failure_counter
    • test_workspace_resolution_failure_also_counts
    • test_spawn_failure_circuit_breaker_emits_gave_up
    • test_spawned_event_emitted_with_pid
    • test_run_on_spawn_failure_records_failed_runs
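
The three has_spawnable_ready cases listed above (empty queue, terminal-lane only, mixed real+terminal) can be reproduced against an in-memory SQLite database. This is a sketch under assumptions: the four-column `tasks` schema and the `make_db` helper are invented for the example, and `profile_exists` is passed in as a parameter here, whereas the PR's helper takes only `conn` and imports the check itself. The query mirrors the one quoted in the review thread.

```python
import sqlite3
from typing import Callable, Iterable, Tuple


def has_spawnable_ready(conn: sqlite3.Connection,
                        profile_exists: Callable[[str], bool]) -> bool:
    # Distinct assignees of ready, assigned, unclaimed tasks.
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "  AND claim_lock IS NULL"
    ).fetchall()
    return any(profile_exists(row[0]) for row in rows)


def make_db(tasks: Iterable[Tuple[str, str, str, None]]) -> sqlite3.Connection:
    """Hypothetical test helper with a simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT, status TEXT, assignee TEXT, claim_lock TEXT)"
    )
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?)", list(tasks))
    return conn


# "daily" stands in for a real profile; the orion-* names are terminal lanes.
is_real_profile = {"daily"}.__contains__

empty_queue = has_spawnable_ready(make_db([]), is_real_profile)
terminal_only = has_spawnable_ready(
    make_db([("t1", "ready", "orion-cc", None),
             ("t2", "ready", "orion-research", None)]),
    is_real_profile,
)
mixed = has_spawnable_ready(
    make_db([("t1", "ready", "orion-cc", None),
             ("t2", "ready", "daily", None)]),
    is_real_profile,
)
```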

Test plan

  • pytest tests/hermes_cli/test_kanban_db.py — 54/54 pass.
  • pytest tests/hermes_cli/test_kanban_{db,cli,boards,core_functionality}.py — 246/246 pass.
  • Reviewer: confirm 30+ minutes of gateway uptime on a multi-lane host shows zero "dispatcher stuck" warnings while the ready queue is steadily full of orion-cc / orion-research tasks.
  • Reviewer: create a ready task with assignee=daily (or any real profile name on the host) without supplying credentials and confirm the warn STILL fires after HEALTH_WINDOW=6 ticks — the original safety net is intact.
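
The two reviewer checks above amount to the following behaviour of a consecutive-tick counter. This is a hedged sketch, not the gateway's implementation: the `StuckDetector` class and its `observe` method are hypothetical, and any rate limiting of the actual log line is left out. Only `HEALTH_WINDOW = 6` and the "spawnable work but 0 workers spawned" condition come from the PR text.

```python
HEALTH_WINDOW = 6  # value quoted in the test plan


class StuckDetector:
    """Hypothetical tracker for consecutive ticks with spawnable work but no spawns."""

    def __init__(self, window: int = HEALTH_WINDOW) -> None:
        self.window = window
        self.stuck_ticks = 0

    def observe(self, spawnable_ready: bool, spawned: int) -> bool:
        """Return True when the stuck warning should fire for this tick."""
        if spawnable_ready and spawned == 0:
            self.stuck_ticks += 1
        else:
            # Either nothing spawnable was waiting or a worker started.
            self.stuck_ticks = 0
        return self.stuck_ticks >= self.window


# Multi-lane host: the queue is full, but only of terminal-lane tasks,
# so the probe reports no spawnable work and the counter never advances.
lane_host = StuckDetector()
lane_only = [lane_host.observe(spawnable_ready=False, spawned=0) for _ in range(10)]

# Real profile with broken credentials: spawnable work exists, nothing starts.
broken_host = StuckDetector()
broken = [broken_host.observe(spawnable_ready=True, spawned=0) for _ in range(6)]
```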

Merge order

⚠️ Merge after #20105.

🤖 Generated with Claude Code

Hermes Sovereign AgentCore and others added 2 commits May 5, 2026 07:47
…l profile

The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2`` every tick
in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

  # 1. Create a kanban task whose assignee names a non-profile.
  hermes kanban create --assignee orion-cc --status ready \
      --title "Review PR #N" --body "..."
  # 2. Start the gateway with the embedded dispatcher.
  hermes gateway run
  # gateway.log lines every minute:
  #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
  # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks
``hermes_cli.profiles.profile_exists(assignee)`` before claiming.
If False, the row is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher
moves on without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None`` and the original behaviour is preserved
unchanged.
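
The opt-in-safe import described above follows a common optional-dependency pattern, sketched below. The `load_profile_exists` and `should_skip` helpers and the placeholder module name are hypothetical, introduced only so the example runs without Hermes installed; the commit itself uses an inline try/except around the import.

```python
import importlib
from typing import Callable, Optional


def load_profile_exists(module_name: str) -> Optional[Callable[[str], bool]]:
    """Hypothetical loader: return the module's profile_exists, or None on any failure."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, "profile_exists")
    except Exception:
        # Import failed or the attribute is missing: signal "cannot check".
        return None


def should_skip(assignee: str,
                profile_exists: Optional[Callable[[str], bool]]) -> bool:
    # Skip only when the check is available AND it says the profile is missing.
    return profile_exists is not None and not profile_exists(assignee)


# Placeholder module name that is assumed not to exist on the system.
missing_check = load_profile_exists("no_such_module_for_this_sketch")

legacy_behaviour = should_skip("orion-cc", missing_check)       # check unavailable
guarded_behaviour = should_skip("orion-cc", lambda name: False)  # check says "no profile"
```

With the check unavailable the guard never skips, so the dispatcher behaves as it did before the patch.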

This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):

  "Should ready-state tasks auto-spawn at all, or only on
  explicit orion-cc claim? If spurious, gate the auto-spawn
  behind a config flag (e.g. only assignee=hermes or
  assignee=auto)."

Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``) auto-
qualify the moment the profile dir is created.

Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-
assigned tasks (Bug A and Bug C investigations) had been
crash-looping since 2026-05-05 06:58 local. After applying
the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
  has ticked silently for 2+ minutes — no spawned=,
  crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
  ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ly non-spawnable assignees

After PR NousResearch#20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
  (separate from ``skipped_unassigned``) so callers can distinguish
  "task missing an owner — operator should route it" from "task
  owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
  into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
  one ready+assigned+unclaimed task in the DB has an assignee that
  maps to a real Hermes profile. Falls back to legacy "any
  ready+assigned" when ``profile_exists`` is unimportable so degraded
  installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
  daemon (``hermes_cli/kanban.py``) both swap their cheap
  ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn
  now fires only when there is genuine spawnable work the dispatcher
  failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
  terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
  only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
  verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
  cross-leak.
* Added ``all_assignees_spawnable`` fixture in
  ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
  tests that use synthetic assignees ("alice", "bob"). PR NousResearch#20105
  (the parent commit) silently broke 8 such tests by routing those
  assignees into the skip path (``skipped_unassigned``) instead of
  spawning; this PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR NousResearch#20105). Reviewer: this PR is meant to merge AFTER NousResearch#20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 09:44

Copilot AI left a comment


Pull request overview

This PR refines kanban dispatcher bookkeeping and health telemetry so “dispatcher stuck” warnings only fire when there is spawnable ready work (i.e., tasks assigned to real Hermes profiles), avoiding persistent false-positive warnings on multi-lane hosts where ready tasks may be intentionally owned by terminal/control-plane lanes and pulled via claim_task.

Changes:

  • Added DispatchResult.skipped_nonspawnable and updated dispatch_once() to bucket “assignee is not a real profile” separately from truly unassigned tasks.
  • Introduced kanban_db.has_spawnable_ready(conn) and switched both the gateway dispatcher and CLI daemon health probes to use it.
  • Expanded/adjusted tests and added a shared all_assignees_spawnable fixture to keep dispatcher spawn tests using synthetic assignees stable.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| hermes_cli/kanban_db.py | Adds skipped_nonspawnable, implements has_spawnable_ready(), and updates dispatch_once() skip logic. |
| hermes_cli/kanban.py | Extends CLI dispatch output (JSON + human) and updates daemon stuck-warn probe to use has_spawnable_ready(). |
| gateway/run.py | Updates embedded dispatcher health telemetry probe across boards to use has_spawnable_ready(). |
| tests/hermes_cli/test_kanban_db.py | Adds targeted tests for new skip bucket and has_spawnable_ready() behavior; threads fixture into spawn-related tests. |
| tests/hermes_cli/test_kanban_core_functionality.py | Threads all_assignees_spawnable into spawn/circuit-breaker tests that use synthetic assignees. |
| tests/hermes_cli/conftest.py | Adds shared all_assignees_spawnable fixture to monkeypatch profile_exists() for synthetic assignees in tests. |


Comment thread hermes_cli/kanban_db.py
Comment on lines +2558 to +2565
    # the task would loop back to ``ready`` on next tick, and we'd
    # burn CPU forever (#kanban-dispatcher-crash-loop 2026-05-05).
    try:
        from hermes_cli.profiles import profile_exists  # local import: avoids cycle
    except Exception:
        profile_exists = None  # type: ignore[assignment]
    if profile_exists is not None and not profile_exists(row["assignee"]):
        # Bucket separately from skipped_unassigned: the operator
Comment thread hermes_cli/kanban_db.py
Comment on lines +2485 to +2499
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "  AND claim_lock IS NULL"
    ).fetchall()
    if not rows:
        return False
    try:
        from hermes_cli.profiles import profile_exists  # local import: avoids cycle
    except Exception:
        # Can't introspect — assume spawnable, preserve legacy behavior.
        return True
    for row in rows:
        if profile_exists(row["assignee"]):
            return True
@alt-glitch added the type/bug, P3, comp/cli, and comp/plugins labels May 5, 2026

teknium1 commented May 5, 2026

Contributor

Merged via PR #20165 (commit f25d3ec). The skipped_nonspawnable bucket, has_spawnable_ready() helper, gateway + CLI probe swap, and the 8 test repairs via the all_assignees_spawnable fixture all landed on main with your authorship preserved in git log. Stacking the telemetry fix on top of #20105 with the test repairs in the same PR made this a clean salvage — thanks.
