fix(kanban): dispatcher skips ready tasks whose assignee is not a real profile #20105

Closed
Brecht-H wants to merge 1 commit into NousResearch:main from Brecht-H:fix/kanban-dispatcher-skip-missing-profile-2026-05-05

Brecht-H commented May 5, 2026

Contributor

Summary

The kanban dispatcher's _default_spawn invokes hermes -p <task.assignee> chat -q .... When assignee names a control-plane lane (e.g. an interactive Claude Code terminal like orion-cc / orion-research) instead of a real Hermes profile, the subprocess fails on startup with Profile 'X' does not exist, gets reaped as a zombie, the TTL/crash detector reclaims the task back to ready, and the next tick re-spawns the same crashing worker.

Result: a permanent crash loop emitting spawned=N reclaimed=0 crashed=N in the gateway log every minute, two zombie processes per affected task, and CPU burn until someone notices.

Reproduce

# 1. Create a kanban task whose assignee names a non-profile.
hermes kanban create --assignee orion-cc --status ready \
    --title "Review PR #N" --body "..."
# 2. Start the gateway with the embedded dispatcher.
hermes gateway run

# gateway.log emits every minute:
#   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
# Per-task log /home/<u>/.hermes/<profile>/kanban/logs/<task_id>.log:
#   Error: Profile 'orion-cc' does not exist. Create it with:
#       hermes profile create orion-cc
# ps -ef | grep '[h]ermes.*defunct' — zombies pile up until reaped.

Fix

dispatch_once() now pre-checks hermes_cli.profiles.profile_exists(assignee) before claiming. If the profile does NOT exist, the row is appended to skipped_unassigned (semantically: it's unassigned to an executable profile) and the dispatcher moves on without claiming, spawning, or counting a crash.

The import is locally scoped and wrapped in try/except, so if profile_exists is missing or fails to import (test isolation, future module restructure), the original behaviour is preserved unchanged.
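
As a minimal sketch of the guard described above, the following standalone Python shows the skip path and the import fallback. `DispatchResult` is trimmed to two fields, and the `rows`, `spawn`, and `load_profile_exists` names are illustrative assumptions; the real `dispatch_once()` in `hermes_cli/kanban_db.py` also claims rows and does bookkeeping that is omitted here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DispatchResult:
    """Hypothetical, trimmed-down result record for one dispatcher tick."""
    spawned: list = field(default_factory=list)
    skipped_unassigned: list = field(default_factory=list)


def load_profile_exists() -> Optional[Callable[[str], bool]]:
    """Return the profile lookup, or None if it cannot be imported."""
    try:
        # Local import, as in the PR; unavailable outside a Hermes install.
        from hermes_cli.profiles import profile_exists
    except Exception:
        return None
    return profile_exists


def dispatch_once(rows, spawn, profile_exists) -> DispatchResult:
    """Spawn one worker per ready row, skipping assignees with no profile."""
    result = DispatchResult()
    for row in rows:
        if profile_exists is not None and not profile_exists(row["assignee"]):
            # Not claimed, not spawned, not counted as a crash.
            result.skipped_unassigned.append(row["id"])
            continue
        spawn(row)
        result.spawned.append(row["id"])
    return result


# Illustrative data: "daily" stands in for a real profile, "orion-cc" for a lane.
known_profiles = {"daily"}
rows = [{"id": "t1", "assignee": "daily"}, {"id": "t2", "assignee": "orion-cc"}]

res = dispatch_once(rows, spawn=lambda row: None,
                    profile_exists=lambda name: name in known_profiles)
legacy = dispatch_once(rows, spawn=lambda row: None, profile_exists=None)
print(res.spawned, res.skipped_unassigned, legacy.spawned)
```

Passing `None` for the lookup reproduces the pre-patch behaviour (every row is spawned), which is what the guarded import is meant to preserve. In a real install the caller would pass `load_profile_exists()` instead of the lambda.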

Why profile-existence over a config flag

The kanban task body (t_2bab06e3 on Brecht-H's local kanban) hinted at gating behind a config flag like assignee=hermes|auto. Profile-existence is a strictly tighter check:

  • Self-documenting — the operator already knows whether they have an orion-cc profile; no allowlist to maintain.
  • Forward-compatible — the moment a new lane gets a real hermes profile create <name>, it auto-qualifies for spawn.
  • No new config surface — zero new keys in config.yaml.

Operators who want the "config flag" semantics can still opt in by creating an empty placeholder profile.

Validated live (Orion machine)

Two orion-research-assigned tasks (t_a14dc1d5 Bug-C investigation, t_646c96f2 provider-routing validation) had been crash-looping since 2026-05-05 06:58 UTC after Mac switched the lane workflow to kanban-pull-by-terminal. Pre-patch:

2026-05-05 07:30:05 INFO gateway.run: kanban dispatcher: tick spawned=2 reclaimed=0 crashed=2 timed_out=0 promoted=0 auto_blocked=0
2026-05-05 07:31:05 INFO gateway.run: kanban dispatcher: tick spawned=2 reclaimed=0 crashed=2 timed_out=0 promoted=0 auto_blocked=0
... (every minute, 2 hours+)

Post-patch (gateway restart at 07:41:39):

2026-05-05 07:41:39 INFO gateway.run: kanban dispatcher: embedded in gateway (interval=60.0s)
( silent — spawn_any=False on every tick, log line guarded behind `if res.spawned` )
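
The silence after the restart follows from the log guard mentioned above. A rough sketch of that pattern, assuming a hypothetical `TickResult` record and a shortened log format (the real tick line carries more counters):

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("gateway.run")


@dataclass
class TickResult:
    """Hypothetical per-tick summary; only the fields used in this sketch."""
    spawned: list = field(default_factory=list)
    crashed: list = field(default_factory=list)


def log_tick(res: TickResult) -> bool:
    """Emit the tick summary only when at least one worker was spawned."""
    if res.spawned:
        logger.info(
            "kanban dispatcher: tick spawned=%d crashed=%d",
            len(res.spawned), len(res.crashed),
        )
        return True
    return False


# Task ids here are placeholders, not the ids from the live system.
pre_patch = TickResult(spawned=["t_a", "t_b"], crashed=["t_a", "t_b"])
post_patch = TickResult()  # every ready row was skipped, nothing spawned
logged = [log_tick(pre_patch), log_tick(post_patch)]
```

With every lane-assigned row skipped, `spawned` stays empty on each tick, so no summary line is written.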

Live state:

  • Stale running claims auto-reclaimed to ready on the first post-patch tick.
  • Tasks now sit at status=ready, claim_lock=None, worker_pid=None, spawn_failures=0 — clean, ready for terminal pull.
  • Dashboard / telegram / freqtrade / committee_listener all unaffected (only the dispatcher path changed).

Test plan

  • Live verification on Orion: 2-hour crash loop terminated, dispatcher silent, no defunct processes piling up
  • Tasks reclaim cleanly to ready post-restart
  • Existing well-behaved tasks (assignee=daily) still spawn (counterfactual: profile_exists("daily") = True confirmed via Python REPL)
  • Defensive import — if hermes_cli.profiles ever moves, fall-through to original behaviour

🤖 Generated with Claude Code

…l profile

The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2 every tick``
in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

  # 1. Create a kanban task whose assignee names a non-profile.
  hermes kanban create --assignee orion-cc --status ready \
      --title "Review PR #N" --body "..."
  # 2. Start the gateway with the embedded dispatcher.
  hermes gateway run
  # gateway.log lines every minute:
  #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
  # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks ``hermes_cli.profiles.
profile_exists(assignee)`` before claiming. If False, the row
is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher
moves on without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None`` and the original behaviour is preserved
unchanged.

This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):

  "Should ready-state tasks auto-spawn at all, or only on
  explicit orion-cc claim? If spurious, gate the auto-spawn
  behind a config flag (e.g. only assignee=hermes or
  assignee=auto)."

Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``) auto-
qualify the moment the profile dir is created.

Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-
assigned tasks (Bug A and Bug C investigations) had been
crash-looping since 2026-05-05 06:58 local. After applying
the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
  has ticked silently for 2+ minutes — no spawned=,
  crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
  ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 5, 2026 07:47

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the embedded kanban dispatcher to avoid repeatedly spawning crashing hermes -p <assignee> subprocesses when a task’s assignee refers to a non-existent Hermes profile (e.g., a control-plane “lane” name), preventing crash loops, zombie buildup, and unnecessary CPU usage.

Changes:

  • Add a pre-check in dispatch_once() to skip ready tasks whose assignee does not correspond to an on-disk Hermes profile (best-effort via a guarded local import).
  • Treat these skipped tasks as “unassigned” for dispatcher result accounting by appending them to skipped_unassigned.

Comment thread hermes_cli/kanban_db.py
Comment on lines +2509 to +2516
# Skip ready tasks whose assignee is not a real Hermes profile.
# `_default_spawn` invokes ``hermes -p <assignee>`` which fails
# with "Profile 'X' does not exist" when the assignee names a
# control-plane lane (e.g. an interactive Claude Code terminal
# like ``orion-cc`` / ``orion-research``) rather than a Hermes
# profile. Those task lanes are pulled by terminals via
# ``claim_task`` directly and should NEVER auto-spawn — the
# subprocess would crash on startup, get reaped as a zombie,
Comment thread hermes_cli/kanban_db.py
Comment on lines +2523 to +2525
if profile_exists is not None and not profile_exists(row["assignee"]):
    result.skipped_unassigned.append(row["id"])
    continue
Comment thread hermes_cli/kanban_db.py
Comment on lines +2519 to +2522
try:
    from hermes_cli.profiles import profile_exists  # local import: avoids cycle
except Exception:
    profile_exists = None  # type: ignore[assignment]
alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), and comp/cli (CLI entry point, hermes_cli/, setup wizard) labels May 5, 2026
@alt-glitch

Collaborator

Related to #20054 — stricter fix that checks profile existence rather than just config readiness.

Brecht-H commented May 5, 2026

Contributor Author

Thanks for the deep diagnosis. Live patch deployed locally on Mac/Allaert's Hermes 2026-05-05 — gateway dispatcher silent for 5+ min after restart vs 2 crashes/min for the prior 2+ hours. Root cause confirmed: _default_spawn invoking hermes -p <assignee> against assignee strings that aren't real Hermes profiles (we use orion-cc and orion-research as logical lane names for human terminal pulling, not as profile names).

The profile_exists() skip is the right shape — keeps auto-spawn semantics intact for assignee=hermes (autonomous tasks) while letting human-lane assignees (orion-cc, orion-research, mac, allaert, etc.) flow through to manual claim.

Side note for downstream Hermes operators: this also unblocks the kanban-as-handover-queue pattern where Mac creates tasks for human terminals to pull. Without this fix, every status='ready' row crashes the dispatcher.

No blockers — green to merge from our end.

teknium1 pushed a commit that referenced this pull request May 5, 2026
…ly non-spawnable assignees

After PR #20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.

The warn is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.

Changes:

* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
  (separate from ``skipped_unassigned``) so callers can distinguish
  "task missing an owner — operator should route it" from "task
  owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
  into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
  one ready+assigned+unclaimed task in the DB has an assignee that
  maps to a real Hermes profile. Falls back to legacy "any
  ready+assigned" when ``profile_exists`` is unimportable so degraded
  installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
  daemon (``hermes_cli/kanban.py``) both swap their cheap
  ``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn
  now fires only when there is genuine spawnable work the dispatcher
  failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
  terminal lane, OK)`` for visibility without alarm.

Tests:

* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
  only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
  verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
  cross-leak.
* Added ``all_assignees_spawnable`` fixture in
  ``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
  tests that use synthetic assignees ("alice", "bob"). PR #20105
  (the parent commit) silently broke 8 such tests by routing those
  assignees into ``skipped_nonspawnable`` instead of spawning; this
  PR repairs them as part of the same code area.

Verified locally: 246/246 kanban-suite tests pass.

Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR #20105). Reviewer: this PR is meant to merge AFTER #20105.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
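
To make the `has_spawnable_ready` probe from the commit message above concrete, here is a hedged sketch against an in-memory SQLite table. The `tasks` schema and column names are assumptions for illustration, and the lookup is passed in as a parameter so the example is self-contained; the commit describes the real helper as taking only `conn`.

```python
import sqlite3
from typing import Callable, Optional


def has_spawnable_ready(
    conn: sqlite3.Connection,
    profile_exists: Optional[Callable[[str], bool]],
) -> bool:
    """True if some ready, assigned, unclaimed task maps to a real profile."""
    rows = conn.execute(
        "SELECT DISTINCT assignee FROM tasks "
        "WHERE status = 'ready' AND assignee IS NOT NULL "
        "AND assignee != '' AND claim_lock IS NULL"
    ).fetchall()
    if profile_exists is None:
        # Degraded install: fall back to the legacy "any ready+assigned" probe.
        return bool(rows)
    return any(profile_exists(assignee) for (assignee,) in rows)


# Hypothetical minimal schema; the real kanban table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    "id TEXT PRIMARY KEY, status TEXT, assignee TEXT, claim_lock TEXT)"
)
real_profiles = {"daily"}


def exists(name: str) -> bool:
    return name in real_profiles


empty_queue = has_spawnable_ready(conn, exists)

conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [("t1", "ready", "orion-cc", None), ("t2", "ready", "orion-research", None)],
)
lanes_only = has_spawnable_ready(conn, exists)
legacy_probe = has_spawnable_ready(conn, None)

conn.execute("INSERT INTO tasks VALUES ('t3', 'ready', 'daily', NULL)")
mixed = has_spawnable_ready(conn, exists)
```

A caller would increment its stuck counter only on ticks where this probe is true and nothing was spawned, so a queue holding only terminal-lane tasks no longer triggers the warning.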
teknium1 commented May 5, 2026

Contributor

Merged via PR #20165 (commit f25d3ec). Both your commits (the profile_exists guard in dispatch_once and the stuck-warn suppression stacked on top) were cherry-picked onto current main with your authorship preserved in git log. Thanks for the clean fix, detailed repro, and live-verification on Orion — really solid work.

#20165

nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
…ly non-spawnable assignees

rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
…ly non-spawnable assignees

JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
…ly non-spawnable assignees

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…ly non-spawnable assignees

jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…ly non-spawnable assignees

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…ly non-spawnable assignees

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…ly non-spawnable assignees
