feat: single gateway, multiple agents (multi-agent MVP) #25008

Closed
02356abc wants to merge 6997 commits into NousResearch:main from 02356abc:feat/single-gateway-multi-agent

@02356abc

Contributor

Summary

This PR implements single-gateway multi-agent routing — the ability to run multiple isolated AI agents from a single gateway process, each with its own model, personality (SOUL.md), memory, skills, and sessions.

Key Changes

  1. Foundation: SessionSource.agent_id + build_session_key rewrite + SessionEntry.agent_id + SessionDB migration. All defaults preserve the existing agent:main:... key format.
  2. AgentProfile + ContextVar: New agent/profile.py with an AgentProfile dataclass and a use_profile() context manager. Path getters (get_hermes_home, get_memory_dir, etc.) read the ContextVar first and fall back to the environment, so single-agent installs see zero behavior change.
  3. Routing: New gateway/agent_routing.py with declarative route matching (platform, chat_id, thread_id, user_id, guild_id). First match wins. A select_agent plugin hook covers custom logic.
  4. Adapter Wiring: BasePlatformAdapter._attach_agent_id() stamps agent_id on event.source before build_session_key runs, so all platforms share one routing point.
  5. GatewayRunner: Loads the agent registry from config, wraps _handle_message_with_agent in use_profile(), and passes the registry to the cron scheduler and delivery router.
  6. Cron + Delivery: CronJob.agent_id + DeliveryTarget.agent_id. The scheduler and router switch profiles per job/target. get_all_due_jobs(registry) iterates all profiles.
  7. Hooks: All invoke_hook call sites pass an agent_id= kwarg.
  8. CLI: New hermes agent subcommand: list, show, add, remove.
  9. Docs: Multi-agent routing guide, cli-config.yaml.example updates, README mention.
  10. Tests: 62 new tests for routing, ContextVar isolation, and session keys. Gateway tests pass apart from 8 pre-existing failures unrelated to this PR.
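To make item 2 concrete, here is a minimal sketch of the ContextVar pattern it describes. AgentProfile, use_profile and get_hermes_home are names from this PR, but the field names, the private variable name and the `~/.hermes` default are assumptions made for illustration, not the actual contents of agent/profile.py.

```python
import os
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional


@dataclass(frozen=True)
class AgentProfile:
    # Illustrative fields only; the PR's dataclass may carry more.
    agent_id: str
    home_dir: Path


# Profile for the current task/thread; None means no profile is active.
_active_profile: ContextVar[Optional[AgentProfile]] = ContextVar(
    "active_profile", default=None
)


@contextmanager
def use_profile(profile: AgentProfile) -> Iterator[AgentProfile]:
    """Activate `profile` for the duration of the with-block."""
    token = _active_profile.set(profile)
    try:
        yield profile
    finally:
        # Restore whatever was active before, even if the body raised.
        _active_profile.reset(token)


def get_hermes_home() -> Path:
    """Read the ContextVar first, then fall back to the HERMES_HOME env var."""
    profile = _active_profile.get()
    if profile is not None:
        return profile.home_dir
    # Assumed default location when neither a profile nor the env var is set.
    return Path(os.environ.get("HERMES_HOME", str(Path.home() / ".hermes")))
```

Because each asyncio task runs with its own copy of the context, two messages handled concurrently under different profiles should not see each other's home directory, which is presumably why the PR prefers a ContextVar over mutating the environment.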

Backward Compatibility

| Scenario | Behavior |
| --- | --- |
| No agents: / routes: config | Everything routes to main, session keys unchanged |
| Existing SQLite sessions | Migrated with agent_id column default "main" |
| Existing sessions.json | from_dict defaults "main" |
| HERMES_HOME env | Still governs default profile home |
| Single-agent installs | Zero behavior change |
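For the SQLite row above, an additive migration of this kind can be sketched as follows. The table name `sessions` and the function name are hypothetical; only the agent_id column and its "main" default come from the PR description.

```python
import sqlite3


def migrate_add_agent_id(conn: sqlite3.Connection, table: str = "sessions") -> bool:
    """Add an agent_id column defaulting to 'main'; return True if it was added.

    `table` is interpolated into SQL, so it must come from trusted code.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if "agent_id" in columns:
        return False  # already migrated, so repeated calls are harmless
    conn.execute(
        f"ALTER TABLE {table} ADD COLUMN agent_id TEXT NOT NULL DEFAULT 'main'"
    )
    conn.commit()
    return True
```

Existing rows pick up the default without being rewritten, which matches the claim that old sessions keep routing to main.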

Configuration Example

default_agent: main
agents:
  coder:
    model: "anthropic/claude-opus-4-6"
    home_dir: ~/.hermes/profiles/coder
  research:
    model: "anthropic/claude-sonnet-4-6"
routes:
  - match: { platform: telegram, chat_id: "-1001234", thread_id: "42" }
    agent: coder
  - match: { platform: slack, guild_id: "T0ABC" }
    agent: coder
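As a rough illustration of how a routes list like the one above could be evaluated, the sketch below implements first-match-wins over the five match keys named in the summary. resolve_agent_id is a name taken from the test descriptions in this PR; its signature here, the dict-based source, and the decision to skip malformed routes are assumptions.

```python
from typing import Any, Mapping, Optional, Sequence

MATCH_KEYS = ("platform", "chat_id", "thread_id", "user_id", "guild_id")


def _matches(source: Mapping[str, Optional[str]], match: Mapping[str, Any]) -> bool:
    """True when every key in `match` is present on the source with an equal value."""
    for key, expected in match.items():
        actual = source.get(key)
        if actual is None or str(actual) != str(expected):
            return False
    return True


def resolve_agent_id(
    source: Mapping[str, Optional[str]],
    routes: Sequence[Mapping[str, Any]],
    default_agent: str = "main",
) -> str:
    """Return the agent of the first matching route, else the default agent."""
    for route in routes:
        match = route.get("match")
        agent = route.get("agent")
        # Assumption: malformed routes are skipped rather than raising.
        if not isinstance(match, Mapping) or not match or not agent:
            continue
        if any(key not in MATCH_KEYS for key in match):
            continue
        if _matches(source, match):
            return str(agent)
    return default_agent
```

Since declaration order decides the winner, more specific routes (for example one that pins a thread_id) need to appear before broader ones for the same platform.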

Smoke Tested

  • Gateway running with weixin → main and wecom → wecom-agent
  • Session keys correctly prefixed: agent:main:weixin:... and agent:wecom-agent:wecom:...
  • Memory isolation: separate MEMORY.md / USER.md / SOUL.md per profile
  • Cron jobs execute under correct profile context and deliver to correct channels
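The key prefixes in the smoke test can be illustrated with a small sketch. build_session_key is the PR's function name, but this signature is hypothetical, and the segments after the platform are treated as opaque because the PR description elides them.

```python
def build_session_key(platform: str, *scope: str, agent_id: str = "main") -> str:
    """Join the agent prefix, platform, and opaque scope segments with colons.

    The default agent_id keeps the pre-existing "agent:main:" prefix.
    """
    return ":".join(("agent", agent_id, platform, *scope))
```

With this shape, a caller that never passes agent_id still produces keys under the `agent:main:` prefix, which is the backward-compatibility property the summary describes.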

Test Results

gateway tests:  5370 passed, 8 failed (pre-existing)
agent tests:    2726 passed, 3 failed (pre-existing)
new tests:      62 passed (0 failed)

🤖 Generated with Claude Code

teknium1 and others added 30 commits April 30, 2026 10:31
restore_skill() in tools/skill_usage.py used archive_root.iterdir(), which
only walked the top level of .archive/. Skills archived under nested layouts
(e.g. .archive/openclaw-imports/<skill>/ from older archive paths or
external imports) were invisible to both the exact-match and prefix-match
candidate scans, surfacing as a misleading "skill '<name>' not found in
archive" error even though the directory existed on disk.

Switch both candidate scans to archive_root.rglob('*') so the lookup
descends into category subdirectories.

Fixes NousResearch#17942
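The difference between the two scans in this commit can be sketched as below. This is not the body of restore_skill(); the function name and the ordering of results are assumptions, and only the switch from iterdir() to rglob('*') comes from the commit message.

```python
from pathlib import Path
from typing import List


def find_archived_skill(archive_root: Path, name: str) -> List[Path]:
    """Return archived skill directories matching `name`, exact matches first.

    rglob("*") descends into nested category directories, whereas iterdir()
    would only see the top level of the archive.
    """
    dirs = [p for p in archive_root.rglob("*") if p.is_dir()]
    exact = [p for p in dirs if p.name == name]
    prefix = [p for p in dirs if p.name.startswith(name) and p.name != name]
    return sorted(exact) + sorted(prefix)
```

A skill stored at `.archive/openclaw-imports/<skill>/` is found by this scan but would be missed by a top-level iterdir() walk.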
Treat skill views and edits as activity when curator reports and applies lifecycle transitions, so recently loaded or patched skills are not displayed or transitioned as never used.

Adds regression tests for activity derivation, automatic transitions, and CLI status output.
it feels so nice :3 just a lil popup ! doesn't get in the way or take
any focus or anything, and directs users to /help for more info :3
…r status` (NousResearch#18033)

Alongside the existing 'least recently used' section, surface two more
rankings so users can see which of their agent-created skills actually
get exercised:

- 'most used (top 5)' — sorted by use_count descending. Hidden when every
  skill has use_count=0 (noise suppression on fresh installs).
- 'least used (top 5)' — sorted by use_count ascending. Always shown
  when the catalog is non-empty.

use_count started tracking real agent skill activation in PR NousResearch#17932
(bump_use wired into skill_view tool + slash invocation + --skill
preload), so these rankings are now meaningful.

Tests: 3 new in tests/hermes_cli/test_curator_status.py — happy path
with mixed use_counts, zero-use suppression of the most-used section,
and the no-skills clean-empty case.
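The ranking rules in this commit can be sketched as a small pure function. The function name, the return shape, and the alphabetical tie-break are assumptions; the top-5 cutoff and the zero-use suppression follow the commit message.

```python
from typing import Dict, List, Tuple

Ranking = List[Tuple[str, int]]


def rank_skills(use_counts: Dict[str, int], top_n: int = 5) -> Dict[str, Ranking]:
    """Build the 'most used' and 'least used' sections from use counts."""
    sections: Dict[str, Ranking] = {}
    if not use_counts:
        return sections  # empty catalog: show nothing
    # Ascending by count; ties broken by name so output is deterministic.
    ascending = sorted(use_counts.items(), key=lambda kv: (kv[1], kv[0]))
    sections["least used"] = ascending[:top_n]
    # Suppress 'most used' when every skill has use_count == 0.
    if any(count > 0 for count in use_counts.values()):
        descending = sorted(use_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        sections["most used"] = descending[:top_n]
    return sections
```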
feat(tui): add a mini help menu when u write ? in the input field
Builds on NousResearch#16855 (@lsdsjy) which fixed DeepSeek v4 reasoning_content
replay via model_extra fallback + capturing tool_calls at method entry.
Kimi / Moonshot thinking mode enforces the same echo-back contract and
hits the same 400 when a tool-call turn is persisted without
reasoning_content.

- _build_assistant_message: pad branch now uses _needs_thinking_reasoning_pad()
  (DeepSeek OR Kimi) instead of _needs_deepseek_tool_reasoning() alone.
- Extract _needs_thinking_reasoning_pad() and reuse it in
  _copy_reasoning_content_for_api so both sites share one predicate.
- tests/run_agent/test_deepseek_reasoning_content_echo.py: add
  TestBuildAssistantMessagePadsStrictProviders parametrized over DeepSeek
  (attr=None, attr-absent), Kimi (attr=None), Moonshot (via base_url),
  and an OpenRouter negative control that must NOT pad. Proven to fail
  2/5 cases on Kimi/Moonshot without this change.
- scripts/release.py: add AUTHOR_MAP entries for lsdsjy and season179.

Refs NousResearch#17400.

Co-authored-by: season179 <season.saw@gmail.com>
The Curator release — Hermes Agent now maintains itself. Autonomous
background Curator grades, prunes, and consolidates the skill library;
self-improvement loop substantially upgraded; four new inference
providers; Microsoft Teams (via pluggable platforms) + Yuanbao as 18th
and 19th messaging platforms; Spotify + Google Meet native integrations;
ComfyUI + TouchDesigner-MCP bundled by default; Humanizer skill ported;
~57% cut to visible TUI cold start.

Stats since v0.11.0: 1,096 commits, 550 merged PRs, 1,270 files
changed, 217,776 insertions, 213 community contributors.
…persist-user-message-test-mocks

test(acp): accept prompt persistence kwargs in MCP E2E mocks
…board-profiles-hms-coder

feat(dashboard): add profiles management page
Replace the tsc + babel pipeline with a single esbuild invocation that
produces a self-contained dist/entry.js. The nix TUI derivation no
longer copies node_modules — only dist/ + package.json ship, shrinking
the output from hundreds of MB to ~2.9 MB.

- ui-tui/scripts/build.mjs: new esbuild bundler. Aliases @hermes/ink
  to source (esbuild's __esm helper doesn't await nested async init,
  which breaks lazy-assigned exports like 'render' when re-exporting
  through a prebuilt submodule). Stubs react-devtools-core (dev-only).
  Injects a createRequire shim for transitive CJS deps. Strips the
  shebang from src/entry.tsx because Nix patchShebangs mangles
  '/usr/bin/env -S node --max-old-space-size=8192 --expose-gc' — it
  drops the 'node' token. The Python launcher always invokes node
  explicitly, so the shebang is redundant.
- nix/tui.nix: installPhase no longer copies node_modules or the
  @hermes/ink packages dir.
- nix/checks.nix: drop the 'node_modules present' assertion.
- hermes_cli/main.py: _tui_need_npm_install short-circuits when
  dist/entry.js exists and no package-lock.json is present. That is
  the prebuilt-bundle layout (nix / packaged release) and there is
  nothing to install. Without this, the launcher tried to npm install
  in a non-existent site-packages/ui-tui path.
the esbuild pipeline (scripts/build.mjs) already bundles ink into a
single self-contained dist/entry.js.

remove the Dockerfile steps that manually copied packages/hermes-ink
into node_modules/@hermes/ink and ran a nested
npm install there.

- Dockerfile: simplify TUI build step to just 'npm run build'
- hermes_cli/main.py: _tui_build_needed now checks dist/entry.js
staleness against source files before falling back to the old
ink-bundle.js logic
- tests: update TUI npm install tests and drop the Dockerfile contract
test for the removed ink materialization step
Update all platform enumeration lists to include Teams:
index.md, quickstart.md, integrations/index.md, sessions.md,
slash-commands.md, updating.md, hooks.md, hermes-agent skill.

Skipped PII redaction docs — Teams uses AAD object IDs, not
phone numbers, so redaction doesn't apply there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add to platform description and intro paragraph
- Add row to platform comparison table (images + typing)
- Add node to architecture mermaid diagram
- Add TEAMS_ALLOWED_USERS to security examples
- Add to platform-specific toolsets table
- Add to Next Steps links

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire reply_to into send() using App.reply(conv_id, msg_id, content)
which constructs the threaded conversation ID internally.
Threads supported in channels and group chats.

Update comparison table: Threads ✅

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Group chats return 400 for threaded sends. Catch the error and
fall back to a flat send so messages always get delivered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The SDK requires Python >=3.12 so CI (3.11) falls to the except
ImportError branch, leaving TypingActivityInput=None. After loading
the adapter module, explicitly restore it from the mock so
test_send_typing doesn't silently no-op.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous bare except swallowed every exception from app.reply()
silently. Log at debug so real failures (auth, chat gone) leave a
trace while keeping the group-chat 400 fallback working. Also fix
the Teams entry's indentation in the messaging flowchart.
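The threaded-send fallback described in these Teams commits follows a common pattern, sketched here in synchronous form. The callables stand in for the SDK's reply and flat-send calls; their signatures and the function name are assumptions, and the real adapter is likely asynchronous.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def send_with_thread_fallback(
    reply: Callable[[str, str, str], str],
    send_flat: Callable[[str, str], str],
    conv_id: str,
    content: str,
    reply_to: Optional[str] = None,
) -> str:
    """Try a threaded reply first; on any failure, log it and send flat."""
    if reply_to is not None:
        try:
            return reply(conv_id, reply_to, content)
        except Exception as exc:
            # Logged at debug so real failures leave a trace without
            # breaking the fallback path for group chats.
            logger.debug("threaded reply failed, falling back to flat send: %s", exc)
    return send_flat(conv_id, content)
```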
…ousResearch#20144)

The fix-lockfiles script used 'nix build .#tui.npmDeps' to detect stale
hashes. This always succeeds when the OLD derivation is cached in Cachix
or cache.nixos.org — even when the source package-lock.json has changed.

Fix: use prefetch-npm-deps to compute the hash directly from the lockfile
and compare against what's in the nix file. Falls back to nix build only
if prefetch-npm-deps fails.
hermes setup / hermes model used to silently skip the key prompt when
any value was present in .env — even a malformed paste — leaving users
with a stuck '✓' and no way to recover without hand-editing .env.

Replace the silent acknowledgement at all three API-key provider flows
(Kimi, Stepfun, generic) with a single [K]eep / [R]eplace / [C]lear
menu via a shared `_prompt_api_key` helper.

- K / Enter / Ctrl-C / unknown input → keep (never destroys the key)
- R → getpass for new key; empty input cancels and preserves existing
- C → clears the env var, tells user to rerun hermes setup, aborts flow

LM Studio's no-auth-placeholder substitution stays on first-time entry
only; on Replace an empty input means 'cancel', not 'overwrite with
dummy key'.

11 unit tests cover all branches incl. garbage-input-keeps-key, Ctrl-C
at the choice prompt, Replace-cancel preserving the old key, Clear
wiping only the target env var, and lmstudio placeholder semantics.
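The branch rules listed above can be sketched as a function with injectable prompts. This is not the PR's `_prompt_api_key` helper; the name, signature, and return shape are hypothetical, chosen so the branches can be exercised without a terminal.

```python
from typing import Callable, Optional, Tuple


def prompt_api_key(
    existing: str,
    ask: Callable[[str], str],
    ask_secret: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Return (action, key) where action is 'keep', 'replace', or 'clear'."""
    try:
        choice = ask("[K]eep / [R]eplace / [C]lear: ").strip().lower()
    except KeyboardInterrupt:
        return "keep", existing  # Ctrl-C never destroys the key
    if choice == "r":
        new_key = ask_secret("New API key: ").strip()
        if not new_key:
            return "keep", existing  # empty input cancels the replace
        return "replace", new_key
    if choice == "c":
        return "clear", None
    # 'k', Enter, and any unrecognised input all keep the existing key.
    return "keep", existing
```

In interactive use, `input` and `getpass.getpass` could be passed as `ask` and `ask_secret`.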

Fixes NousResearch#16394
Reshapes NousResearch#18355 — original PR pasted the menu inline at 3 sites with
no tests; this consolidates to one helper (+88/-66) with coverage.

Co-authored-by: Feranmi10 <89228157+Feranmi10@users.noreply.github.com>
…l profile

The kanban dispatcher's `_default_spawn` invokes
``hermes -p <task.assignee> chat -q ...``. When ``assignee``
names a control-plane lane (e.g. an interactive Claude Code
terminal like ``orion-cc`` / ``orion-research``) instead of a
real Hermes profile, the subprocess fails on startup with
"Profile 'X' does not exist", gets reaped as a zombie, the
TTL/crash detector marks the task back to ``ready``, and the
next tick re-spawns the same crashing worker. Result: a
permanent crash loop emitting ``spawned=2 crashed=2 every tick``
in the gateway log and burning CPU forever.

Reproduce on a fresh Hermes-agent install:

  # 1. Create a kanban task whose assignee names a non-profile.
  hermes kanban create --assignee orion-cc --status ready \
      --title "Review PR #N" --body "..."
  # 2. Start the gateway with the embedded dispatcher.
  hermes gateway run
  # gateway.log lines every minute:
  #   kanban dispatcher: tick spawned=1 reclaimed=0 crashed=1 ...
  # 3. ps -ef | grep '[h]ermes.*defunct' shows zombies.

Fix
---
``dispatch_once()`` now pre-checks ``hermes_cli.profiles.
profile_exists(assignee)`` before claiming. If False, the row
is added to ``skipped_unassigned`` (it's effectively
"unassigned-to-an-executable-profile") and the dispatcher
moves on without claiming, spawning, or counting a crash.

The check is opt-in safe: if the import fails (e.g. test
isolation, profile module restructured), ``profile_exists``
falls back to ``None`` and the original behaviour is preserved
unchanged.
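A minimal sketch of the guard described in this Fix section, reduced to a pure function. The function name, the task dict shape, and the treatment of empty assignees are assumptions; `profile_exists` being None models the import-failure fallback.

```python
from typing import Callable, Dict, Iterable, List, Optional


def partition_ready_tasks(
    tasks: Iterable[Dict[str, str]],
    profile_exists: Optional[Callable[[str], bool]],
) -> Dict[str, List[str]]:
    """Split ready tasks into ones to spawn and ones to skip without claiming."""
    result: Dict[str, List[str]] = {"spawn": [], "skipped_unassigned": []}
    for task in tasks:
        assignee = task.get("assignee")
        if not assignee:
            result["skipped_unassigned"].append(task["id"])
            continue
        # When the check is unavailable (None), keep the original behaviour.
        if profile_exists is not None and not profile_exists(assignee):
            result["skipped_unassigned"].append(task["id"])
            continue
        result["spawn"].append(task["id"])
    return result
```

Skipped tasks are never claimed, so they cannot enter the spawn, crash, reclaim cycle described earlier in the commit message.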

This addresses the explicit hint in the kanban task body
(``t_2bab06e3``):

  "Should ready-state tasks auto-spawn at all, or only on
  explicit orion-cc claim? If spurious, gate the auto-spawn
  behind a config flag (e.g. only assignee=hermes or
  assignee=auto)."

Profile-existence is a tighter gate than a config flag — it
self-documents (the user already knows whether they have an
``orion-cc`` profile), and it doesn't require Mac to maintain
an allowlist as new lane names appear. New lanes that ARE
real profiles (created via ``hermes profile create``) auto-
qualify the moment the profile dir is created.

Validated live
--------------
On Orion's hermes-agent install, two ``orion-research``-
assigned tasks (Bug A and Bug C investigations) had been
crash-looping since 2026-05-05 06:58 local. After applying
the patch + restarting the gateway:

- Stale ``running`` claims released to ``ready`` cleanly.
- New gateway emitted ``kanban dispatcher: embedded`` and
  has ticked silently for 2+ minutes — no spawned=,
  crashed=, or stuck= log lines (all spawn skips are quiet).
- Tasks remain ``ready`` with ``claim_lock=None``,
  ``worker_pid=None``, ``spawn_failures=0``.
- Dashboard + telegram + freqtrade unaffected.

Confidence: high (live verified on Orion).
Scope-risk: narrow (additive guard inside one function).
Not-tested: behaviour when a profile is renamed mid-tick —
current code re-imports ``profile_exists`` per row so a
freshly created profile auto-qualifies on the next tick.
Machine: orion-terminal

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
02356abc and others added 6 commits May 13, 2026 13:38
When create_job(agent_id='xxx') is called from a context whose
ContextVar still points to main (e.g. direct script invocation or
gateway handlers), the job was previously saved into main's jobs.json
while stamped with agent_id='xxx'.  This caused mark_job_run to fail
because it looked in the wrong profile directory.

Now create_job detects the mismatch and switches to the target
profile's context before load_jobs/save_jobs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All plugin hooks now receive an agent_id kwarg so callbacks can
branch on which agent profile fired the event:

- gateway/run.py: on_session_finalize (shutdown + expiry),
  pre_gateway_dispatch, on_session_reset
- run_agent.py: on_session_start, pre_llm_call, pre_api_request,
  post_api_request, transform_llm_output, post_llm_call, on_session_end
- model_tools.py: post_tool_call, transform_tool_result
- tools/approval.py: all approval hooks
- tools/terminal_tool.py: transform_terminal_output
- tools/delegate_tool.py: subagent_stop
- cli.py: on_session_finalize, on_session_end
- tui_gateway/server.py: session lifecycle hooks
- gateway/platforms/base.py: select_agent
- hermes_cli/plugins.py: pre_tool_call

The agent_id is resolved from the active ContextVar profile at each
fire point; when no profile is active (bare CLI, tests) it defaults
to None.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New commands:
- hermes agent list    — table of agents with model/routes/home
- hermes agent show    — paths, routes, SOUL.md preview
- hermes agent add     — create agent, optionally clone profile
- hermes agent remove  — delete agent with orphan-route warning

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add comprehensive multi-agent routing guide to website docs
- Update cli-config.yaml.example with agents/routes examples
- Add multi-agent feature mention to README
- Add multi-agent link to messaging gateway index

Smoke tested: gateway routes weixin→main and wecom→wecom-agent
with isolated memory, skills, SOUL.md, and sessions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- test_agent_routing.py: 25 tests for resolve_agent_id, route matching,
  declaration order, invalid route handling, all match keys
- test_profile_contextvar.py: 25 tests for AgentProfile, ContextVar,
  use_profile context manager, async isolation (gather, sibling tasks),
  load_agent_registry
- test_session.py: add 12 tests for build_session_key with agent_id
  across DM, group, thread, WhatsApp, shared group modes

All 62 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- gateway/run.py: use getattr(self, '_agent_registry', None) so tests
  that mock GatewayRunner without setting _agent_registry don't crash
- test_session_boundary_hooks.py: expect agent_id=None in invoke_hook
  assertions (matches commit 6 hook kwargs change)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@alt-glitch added labels on May 13, 2026: type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/gateway (Gateway runner, session dispatch, delivery), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/cli (CLI entry point, hermes_cli/, setup wizard), comp/cron (Cron scheduler and job management), comp/tools (Tool registry, model_tools, toolsets)
02356abc and others added 2 commits May 13, 2026 21:33
…ept, clarify select_agent hook

- gateway/run.py: Remove duplicate set_routing_context() call
- hermes_constants.py: Change broad `except Exception` to `except ImportError`
  for lazy get_active_profile import
- gateway/platforms/base.py: Remove agent_id=None from select_agent hook
  and add explanatory comment about why it's intentionally omitted

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR NousResearch#25008 (single gateway, multi-agent MVP)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@liuhao1024

Contributor

CI Fix Needed: test_cron_context_from.py broken by HERMES_DIR removal

The PR removes the module-level constants HERMES_DIR, CRON_DIR, JOBS_FILE, OUTPUT_DIR from cron/jobs.py (replacing them with dynamic _get_cron_dir() / _get_jobs_file() / _get_output_dir() functions), but tests/cron/test_cron_context_from.py still monkeypatches the old names:

# Lines 21-24 in test_cron_context_from.py (these attributes no longer exist):
monkeypatch.setattr(jobs_mod, "HERMES_DIR", hermes_home)
monkeypatch.setattr(jobs_mod, "CRON_DIR", hermes_home / "cron")
monkeypatch.setattr(jobs_mod, "JOBS_FILE", hermes_home / "cron" / "jobs.json")
monkeypatch.setattr(jobs_mod, "OUTPUT_DIR", hermes_home / "cron" / "output")

This causes AttributeError: module has no attribute HERMES_DIR for all 12 tests in that file.

Suggested fix — monkeypatch get_hermes_home instead, since the new functions resolve paths through it:

import hermes_constants as hc

monkeypatch.setattr(hc, "get_hermes_home", lambda: hermes_home)

Everything else in the diff looks solid — the ContextVar approach and the routing table design are well thought out.

…compat

Commit 5 replaced module-level path constants with dynamic _get_cron_dir()
functions for per-agent ContextVar support.  Existing tests monkeypatch
CRON_DIR and HERMES_DIR directly, causing 150 test errors across:
  - tests/cron/test_jobs.py
  - tests/cron/test_rewrite_skill_refs.py
  - tests/hermes_cli/test_cron.py
  - tests/tools/test_cronjob_tools.py
  - tests/test_timezone.py
  - tests/agent/test_curator_*.py

Fix: keep the constants as backwards-compatible fallbacks (resolved at
import time from the default profile), while production code continues
to use _get_cron_dir() for dynamic per-agent resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@02356abc force-pushed the feat/single-gateway-multi-agent branch from 8c84c93 to 1c8f824 on May 14, 2026 01:06
@02356abc

Contributor Author

@liuhao1024 Thanks for catching this! Fixed in 1c8f824: restored `HERMES_DIR`, `CRON_DIR`, `JOBS_FILE`, and `OUTPUT_DIR` as backwards-compatible module-level constants so test monkeypatching works again. Production code continues to use the dynamic path-getter functions for per-agent ContextVar resolution.

@02356abc

Contributor Author

Small correction to the above: the restored constants are `HERMES_DIR`, `CRON_DIR`, `JOBS_FILE`, and `OUTPUT_DIR`.

…for agent_id

- hermes_cli/main.py: Add 'agent' to _BUILTIN_SUBCOMMANDS frozenset so
  plugin discovery is skipped for the new hermes agent subcommand.
- tests/test_model_tools.py: Update expected hook call signatures to
  include agent_id=None (added in commit 6 for multi-agent support).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@02356abc force-pushed the feat/single-gateway-multi-agent branch from ce10b91 to 41c6915 on May 14, 2026 02:19
@02356abc requested a review from a team on May 14, 2026 02:19
@02356abc

Contributor Author

Closing this PR to resolve the git history rewrite caused by filter-branch. Will reopen a clean PR after local verification.

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/cron: Cron scheduler and job management
  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/tools: Tool registry, model_tools, toolsets
  • P3: Low (cosmetic, nice to have)
  • type/feature: New feature or request
