feat: single gateway, multiple agents (MVP) #25660
Force-pushed: 4d8a642 to b856e06
|
Tracked follow-up technical debt from this PR:
|
CI Test Failure Analysis
The
Verified locally
All tests related to this PR pass locally (313+ tests):
CI failure breakdown (all pre-existing)
None of these failures are related to the multi-agent changes introduced in this PR. |
|
@discolotus Thanks for tracking these follow-ups! All four items are already documented in the DESIGN.md file under the "Non-Goals (Future PRs)" section with the same issue numbers you listed. The design doc explicitly scopes them out of this MVP to keep the PR reviewable. |
E2E Test Report: Multi-Agent Routing Validation
We completed end-to-end validation of the multi-agent routing feature. Here is the summary:
Test Matrix
Configuration Used
default_agent: main
agents:
  main: {}
  wecom-agent:
    home_dir: /root/.hermes/profiles/wecom-agent
  code:
    model: kimi-for-coding
    provider: moonshot
    home_dir: /root/.hermes/profiles/code
routes:
  - match: { platform: wecom }
    agent: wecom-agent
  - match: { platform: matrix }
    agent: code
Kanban Subsystem Impact Analysis
The Kanban subsystem requires zero code changes. Key findings:
Configuration convention: Kanban task Full details: |
|
@alt-glitch This PR is ready for review. Here's a summary of what's been addressed since the initial submission:
Changes since last review
Key design decisions for reviewer attention
Please let me know if you'd like any section expanded or if there are specific areas you'd like me to walk through. |
Force-pushed: 673123a to 48894d4
|
Force-pushed: rewrote commit history from 11 commits to 7 focused commits.
Line count breakdown by category
Key point: tests + docs together account for 53.6% of the diff.
Production code surface (1,422 lines)
Only 16 files contain production code changes; the rest are tests, docs, or config:
What was removed vs the previous 11-commit version
Verification
|
Force-pushed: 48894d4 to 730d92c
|
Force-pushed (rebased onto latest main + fixed run_agent.py refactor migration).
What changed in this push
Rebased onto latest main
Line count breakdown by category
PR status
|
|
Rebased onto latest main (519657a) and resolved conflicts from the run_agent.py refactor. What changed since last push:
Test results after rebase:
Commit breakdown (7 commits):
Ready for review. |
Introduce AgentProfile dataclass and a ContextVar (_current_agent_profile) that lets path getters (get_hermes_home, get_skills_dir, get_memory_dir) resolve to the active agent's home directory under asyncio.
- agent/profile.py: AgentProfile, use_profile() context manager, load_agent_registry() from GatewayConfig
- hermes_constants.py: get_hermes_home() reads ContextVar before env fallback
- tests/agent/test_profile_contextvar.py: ContextVar isolation under asyncio.gather, nested contexts, registry loading
Single-agent installs see zero change: with no profile bound, path getters fall back to the HERMES_HOME env var as before.
Add agent_id field to SessionSource and SessionEntry, and prefix session keys with agent:<id>: in build_session_key. Default "main" preserves every historical key string for single-agent installs.
- gateway/session.py: SessionSource.agent_id, SessionEntry.agent_id, build_session_key prefixing
- hermes_state.py: sessions table migration (agent_id TEXT DEFAULT 'main'), new idx_sessions_agent index
- tests/gateway/test_session.py: build_session_key prefixing for all chat_type × agent_id combinations
- tests/*/test_session_boundary_hooks.py: hook payload agent_id kwarg
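The sessions table migration named in this commit might look roughly like the sketch below. The function name migrate_sessions_add_agent_id and the idempotency check are assumptions for illustration; only the column definition (agent_id TEXT DEFAULT 'main') and the index name idx_sessions_agent come from the commit message.

```python
import sqlite3


def migrate_sessions_add_agent_id(conn: sqlite3.Connection) -> None:
    """Add sessions.agent_id (default 'main') and its index; safe to run twice."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(sessions)")}
    if "agent_id" not in columns:
        conn.execute(
            "ALTER TABLE sessions ADD COLUMN agent_id TEXT DEFAULT 'main'"
        )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sessions_agent ON sessions(agent_id)"
    )
    conn.commit()
```

Because the column is added with a constant default, rows written before the migration read back as agent_id = 'main', which is what lets single-agent installs keep their existing sessions.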
… hook
Add declarative routing (routes: match → agent) and a select_agent plugin
hook. _attach_agent_id injects the resolved agent_id into event.source
before build_session_key. Seven platform adapters get pre-injection for
batching paths; the rest inherit it from base.py.
- gateway/agent_routing.py: resolve_agent_id(), _route_matches()
- gateway/config.py: agents, routes, default_agent schema
- gateway/platforms/base.py: _attach_agent_id(), set_routing_context()
- gateway/platforms/{telegram,discord,slack,matrix,feishu,wecom,yuanbao}.py:
pre-batch injection
- hermes_cli/plugins.py: select_agent hook registration
- tests/gateway/test_agent_routing.py: declared-order matching, hook chain,
default fallback, profile isolation
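Declared-order, first-match-wins routing with a default fallback can be sketched as below. The function names resolve_agent_id and _route_matches are taken from the commit message, but the signatures, the Route dataclass, and the plain-dict representation of the message source are assumptions for this example; the select_agent plugin hook is left out.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence


@dataclass(frozen=True)
class Route:
    # Hypothetical in-memory form of one entry under `routes:` in the config.
    match: Mapping[str, Any]
    agent: str


def _route_matches(match: Mapping[str, Any], source: Mapping[str, Any]) -> bool:
    """A route matches when every key it declares equals the source's value."""
    return all(source.get(key) == value for key, value in match.items())


def resolve_agent_id(
    source: Mapping[str, Any],
    routes: Sequence[Route],
    default_agent: str = "main",
) -> str:
    """Return the agent of the first matching route, else the default agent."""
    for route in routes:
        if _route_matches(route.match, source):
            return route.agent
    return default_agent


ROUTES = [
    Route(match={"platform": "wecom"}, agent="wecom-agent"),
    Route(match={"platform": "matrix"}, agent="code"),
]
```

With these routes, a source of {"platform": "matrix"} resolves to code and an unmatched platform falls through to main. Note that in this sketch a route with an empty match dict matches everything, so it only makes sense as the last entry.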
…s agent_id to hooks
GatewayRunner loads the agent registry at init and wraps every inbound
message in use_profile(). AIAgent accepts an optional profile= kwarg.
All invoke_hook call sites gain agent_id= kwarg. _handle_message is
split into _handle_message (ContextVar plumbing) + _handle_message_inner
(legacy logic) so tests that grep the source body continue to work.
- gateway/run.py: registry loading, use_profile() wrapping, hook kwargs
- run_agent.py: AIAgent(profile=), profile-aware model/toolset resolution
- model_tools.py, tools/{approval,terminal,delegate}.py: hook agent_id
- cli.py, tui_gateway/server.py: session boundary hook agent_id
- tests/gateway/test_profile_overrides.py: per-agent model/toolset overrides
- tests/test_model_tools.py: hook payload verification
- tests/gateway/test_{update,title,reasoning}_command.py: adapt to
_handle_message split
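The profile-aware model resolution in this commit follows the precedence chain stated in the PR summary (session /model override, then profile override, then gateway default). A minimal sketch of that chain, assuming a hypothetical resolve_model helper and treating an unset value as None or an empty string:

```python
from typing import Optional


def resolve_model(
    session_override: Optional[str],
    profile_model: Optional[str],
    gateway_default: str,
) -> str:
    """Pick the first value that is set: session, then profile, then gateway."""
    for candidate in (session_override, profile_model):
        if candidate:
            return candidate
    return gateway_default
```

A profile with no model set (such as the default "main" overlay) contributes nothing to the chain, so the result is the same as before the profile layer existed.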
…veries
Cron tick and delivery routing now bind the correct profile before execution. jobs.py does NOT persist agent_id in JSON; the directory is the identity. Delivery uses nullcontext() for the unrouted case.
- cron/jobs.py: in-memory agent_id stamping at read time, directory-based identity (no JSON field)
- cron/scheduler.py: use_profile() wrapper in tick path
- gateway/delivery.py: use_profile() wrapper per delivery target
- tests/cron/test_scheduler.py: agent_id propagation in delivery targets
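The "directory is the identity" rule can be illustrated with the sketch below: agent_id is stamped onto each job when the file is read and stripped again before writing. The helper names load_jobs and save_jobs, and the assumption that jobs.json holds a JSON list of objects, are hypothetical and not taken from the PR.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jobs(jobs_file: Path, agent_id: str) -> List[Dict[str, Any]]:
    """Read jobs and stamp the owning agent in memory only."""
    if not jobs_file.exists():
        return []
    jobs = json.loads(jobs_file.read_text(encoding="utf-8"))
    return [{**job, "agent_id": agent_id} for job in jobs]


def save_jobs(jobs_file: Path, jobs: List[Dict[str, Any]]) -> None:
    """Write jobs without agent_id; the profile directory already implies it."""
    payload = [
        {key: value for key, value in job.items() if key != "agent_id"}
        for job in jobs
    ]
    jobs_file.parent.mkdir(parents=True, exist_ok=True)
    jobs_file.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Keeping the field out of the file means a profile directory can be copied or renamed without leaving a stale agent_id behind in its jobs.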
New hermes agent subcommand group: list, show, add, remove. Manages agent profiles and routing config in ~/.hermes/config.yaml.
- hermes_cli/agent.py: cmd_agent_list, cmd_agent_show, cmd_agent_add, cmd_agent_remove with profile cloning and route cleanup
- hermes_cli/main.py: parser registration
- tests/hermes_cli/test_agent_cli.py: list/show/add/remove coverage, route orphan warnings, SOUL summarization
Force-pushed: 730d92c to 49789b1
|
I noticed an issue with multi-agent cron: when the scheduler calls get_all_due_jobs(registry), it iterates ALL agents — so the same cron job (with a fixed deliver target) gets executed N times, producing duplicate deliveries to the same chat. In my setup with 5 agents (main, coder, reviewer, wife, matrix), every cron job with deliver: telegram:<my_chat_id> was delivered 5 times. The explicit Proposed fix: add an optional python Users would then set in config.yaml: yaml Happy to submit a PR if this aligns with the direction. |
|
Hi @02356abc — thanks for this PR; the architecture here is exactly what self-hosted multi-agent setups need. We've been building on it locally (test install + a follow-on patch wiring This PR is in CONFLICTING state vs current main — we'd like to help if useful. A few ways we could:
To unblock our downstream work, we need this architecture landed in some form by 2026-05-30 (two days from this comment). If we haven't heard from you by then, we'll go with option 3 to keep things moving — but our strong preference is collaborating with you directly. The architecture is sound; we just need the base PR in mergeable shape. Happy to discuss tradeoffs or design questions if any of #25695-#25698 are blockers from your side. |
|
Hi @02356abc — thanks for this PR; the architecture is exactly what's needed for self-hosted multi-agent setups. We have a follow-on patch wiring Upstream has moved 1174 commits since the PR's base (
All 38 routing tests still pass ( If it'd help, we're happy to:
Happy to do whichever works for you. We'd like this architecture to land so we can build on it. |
|
Thanks @davidgut1982 — option 3 works great for me. Go ahead and open the follow-on PR with the rebased base + your The more the community can build on top of this architecture, the better. Would love to see more contributors involved — if anyone else has patches or ideas around the multi-agent surface (#25695–#25698), now is a good time to jump in. @azharkov78 — the cron duplication issue you raised (#25695 area) is real and worth fixing. If you want to submit a PR targeting the follow-on, that would be a great way to contribute. |
The OpenAI-compatible HTTP adapter was the one inbound surface from PR NousResearch#25660 that never called ``_attach_agent_id``: every ``/v1/chat/completions``, ``/v1/responses``, and ``/v1/runs`` request fell through to ``default_agent`` regardless of the configured routes, silently undermining the multi-agent guarantee on any deployment that exposes the API server.
Add a single routing entry point, ``_resolve_agent_profile``, that:
* Reads ``X-Hermes-Chat-Id`` / ``X-Hermes-User-Id`` / ``X-Hermes-Thread-Id`` from the request (sanitised through the same length + control-char caps as the existing ``X-Hermes-Session-Id`` / ``X-Hermes-Session-Key``).
* Builds a synthetic ``SessionSource(platform=API_SERVER, …)`` and pipes it through the shared ``_attach_agent_id`` hook so declarative routes *and* the ``select_agent`` plugin hook fire identically to every other adapter.
* Looks up the resolved ``agent_id`` in ``self._gateway_ref._agent_registry`` and returns the matching ``AgentProfile`` (or ``None`` for legacy single-agent installs).
The three agent-invoking handlers (chat completions, responses, runs) now resolve the profile up front and bind it via ``use_profile`` for the duration of the run. Binding happens twice, once on the asyncio side and once inside the executor thread, because asyncio's default executor does not propagate ContextVars.
Behaviour is fully backward compatible: requests with no routing headers (the existing OpenAI-API contract) resolve to ``default_agent``, exactly the current behaviour.
New tests in ``tests/gateway/test_api_server_routing.py`` cover:
* Header sanitisation (CRLF rejection, length caps, whitespace).
* Route resolution: matching, no-header fall-through, unmatched header fall-through, ``platform``-only catch-all, ``user_id`` and ``thread_id`` routes, route-order precedence.
* Resilience: missing gateway reference, empty registry.
* ContextVar isolation under ``asyncio.gather`` so two concurrent HTTP requests with different chat_ids stay isolated.
Refs: PR NousResearch#25660 (single-gateway multi-agent).
|
+1 — strongly in favor of this landing. Adding a real-world data point: I've been running exactly this architecture in OpenClaw for months: a single gateway process hosting 8 agents, each with its own Telegram bot token, personality, model config, and isolated memory. One process polls all 8 bots, routes inbound by bot/chat, and operationally it's one daemon to install, watch, and restart instead of eight. I've started building agents in Hermes and want to migrate fully — but the one-gateway-per-profile model is the blocker. Recreating my setup today means 8 separate gateway services, 8 restart paths, and 8 chances for the PID/launchd races already reported elsewhere in the tracker. That's a hard sell when the single-gateway model demonstrably works at this scale day-to-day. The design here (per-agent profile + declarative routes, zero behavior change for existing single-agent installs) maps 1:1 to how I'd consolidate. Happy to test this MVP against a real 8-bot Telegram fleet if useful. |
Summary
Enable a single hermes gateway run process to host N isolated AI agents, routing inbound messages by platform/chat/thread/user metadata while keeping each agent's memory, skills, SOUL.md, and model config fully separate.
Fixes the bottleneck behind #23735, #7517, #9514, and #12099.
Deployment scenario matrix
Architecture (8 commits)
- agent_id in SessionSource/SessionEntry, build_session_key prefix, SQLite migration
- use_profile() propagates through async chains
- routes: list with 9 match keys, first-match-wins; select_agent plugin hook override
- _apply_profile_runtime_overrides, _apply_profile_toolsets
- CronJob.agent_id, per-profile storage, DeliveryTarget.agent_id
- hermes agent list/add/remove/show
Precedence chain
Session /model override → Profile override → Gateway default
The default "main" profile is a no-op overlay; existing single-agent installs see zero behavior change.
Migration Guide
Existing single-agent users (no action required)
No configuration changes needed. The default setting default_agent: main ensures all existing behavior is preserved. Your existing ~/.hermes/ directory continues to work as the main agent profile.
Adding a second agent
Consolidating multiple gateway processes
Before this PR: hermes -p coder gateway run + hermes -p research gateway run
After this PR:
~/.hermes/profiles/<name>/config.yaml
Performance Impact
_agent_cache
agent:main:... (+9 chars)
agent:<id>:...
No measurable throughput regression for single-agent configs.
Tests
- tests/agent/test_profile_contextvar.py
- tests/gateway/test_agent_routing.py
- tests/gateway/test_session.py: build_session_key with agent_id across all chat types
- tests/gateway/test_profile_overrides.py
- tests/hermes_cli/test_agent_cli.py: hermes agent list/show/add/remove commands
- tests/gateway/test_session_boundary_hooks.py: agent_id assertions
- tests/test_model_tools.py: agent_id
Multi-agent suite: 181 passed
Full regression: 22677 passed / 38 failed (pre-existing env issues) / 105 skipped
E2E Validation
Matrix → code agent routing validated with local Dendrite homeserver:
code agent
main/wecom-agent)
agent:code:matrix:dm:... session keys
Full report: docs/plans/2026-05-15-multi-agent-matrix-e2e-report.md
Non-goals (future PRs)
Verification commands
Manual smoke checklist
- main, session_key agent:main:...
- coder, session_key agent:coder:...
- /new in topic 42 → on_session_finalize receives agent_id="coder"
- profiles/coder/cron/jobs.json
- "research" from select_agent hook → overrides route match
- agent_id
- agents: and routes: from config → all messages route to main
- "main"