fix(session): scope build_session_key by active profile, preserve default (#12099)#12266
fix(session): scope build_session_key by active profile, preserve default (#12099)#12266briandevans wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes session-key collisions across multiple Hermes profiles by scoping gateway.session.build_session_key with the active profile while preserving the legacy agent:main prefix for the default profile (backward compatible with existing sessions and external memory namespaces).
Changes:
- Add
_resolve_session_key_prefix(profile)and thread optionalprofilethroughbuild_session_keyto generateagent:<profile>:...for named profiles andagent:main:...for default. - Replace hardcoded
agent:mainliterals in session key construction with the resolved prefix. - Add regression tests covering default backcompat, named-profile isolation, and fallback behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
gateway/session.py |
Adds profile-aware session key prefix resolution and applies it throughout build_session_key. |
tests/gateway/test_session.py |
Adds regression coverage ensuring profile-scoped keys and preserved default behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| - The ``"default"`` profile keeps the legacy ``agent:main`` prefix for | ||
| backward compatibility — existing sessions and external memory | ||
| namespaces continue to resolve. Named profiles produce | ||
| ``agent:<profile>:...`` so multi-profile deployments no longer share | ||
| a namespace. See #12099. | ||
| """ | ||
| prefix = _resolve_session_key_prefix(profile) |
There was a problem hiding this comment.
build_session_key now emits keys like agent:<profile>:... for named profiles. There is existing parsing logic in gateway/run.py::_parse_session_key that only accepts agent:main:... (it checks parts[1] == "main"), which will cause session-key parsing to fail for non-default profiles and can break features that rely on it (e.g., shutdown notifications and synthetic process-event routing when the session store entry is missing). Please update _parse_session_key (and any similar parsers) to accept any agent:<scope> second component, or otherwise make parsing profile-aware.
| - The ``"default"`` profile keeps the legacy ``agent:main`` prefix for | |
| backward compatibility — existing sessions and external memory | |
| namespaces continue to resolve. Named profiles produce | |
| ``agent:<profile>:...`` so multi-profile deployments no longer share | |
| a namespace. See #12099. | |
| """ | |
| prefix = _resolve_session_key_prefix(profile) | |
| - Session keys continue to use the legacy ``agent:main`` prefix for | |
| compatibility with existing downstream parsers that still expect that | |
| exact second component. Preserve parseability first; profile-aware | |
| parsing should be added before emitting ``agent:<profile>:...`` here. | |
| """ | |
| prefix = _resolve_session_key_prefix(profile) | |
| if prefix.startswith("agent:") and prefix != "agent:main": | |
| prefix = "agent:main" |
| def test_hermes_cli_import_failure_falls_back_to_default(self): | ||
| """If hermes_cli.profiles can't be imported (narrow unit-test contexts), | ||
| we must not raise — fall back to legacy agent:main prefix.""" |
There was a problem hiding this comment.
This test/docstring says it covers the case where hermes_cli.profiles can't be imported, but the patch is applied to hermes_cli.profiles.get_active_profile_name and then raises ImportError from the mocked function call. Either adjust the wording to reflect that it’s exercising an exception raised while resolving the active profile, or change the test to simulate an actual import failure (e.g., making the import of hermes_cli.profiles raise ModuleNotFoundError).
| def test_hermes_cli_import_failure_falls_back_to_default(self): | |
| """If hermes_cli.profiles can't be imported (narrow unit-test contexts), | |
| we must not raise — fall back to legacy agent:main prefix.""" | |
| def test_active_profile_resolution_import_error_falls_back_to_default(self): | |
| """If resolving the active profile raises ImportError, we must not raise | |
| — fall back to the legacy agent:main prefix.""" |
|
Thanks for the catch @copilot-pull-request-reviewer — both addressed in 1. Fixed at the parser rather than the producer (your Also audited the other uses of
Added 5 regression tests under 2. Import-failure test wording. Focused suite → 42 passed. |
|
CI audit on the initial push (pre-
Pre-existing infrastructure issue (container UID vs mounted
None touch Green: After |
|
CI after
None touch Focused suite on |
|
Thanks @alt-glitch for the cross-reference triage. 👍 For reference, the test coverage here (10 cases in
If maintainers prefer #12102 / #12103 / #12108 over this one, happy to defer and close — just flagging the backcompat dimension so whichever lands carries it forward. |
…ault (NousResearch#12099) ``gateway.session.build_session_key`` hardcoded ``agent:main`` as the leading component of every session key. The hardcode ignored the active Hermes profile (``hermes_cli.profiles.get_active_profile_name``) and caused multi-profile deployments to collide on a shared namespace. Reporter: NousResearch#12099. Why it matters (beyond cosmetics) --------------------------------- Local session storage is already isolated at the filesystem level (each profile has its own ``HERMES_HOME`` → its own ``sessions/`` directory), so the hardcode didn't corrupt local SQLite / JSONL. The real collision is in **external memory-provider namespaces**. ``plugins/memory/honcho/client.py::resolve_session_name`` (and the equivalent code paths in RetainDB / ByteRover / Supermemory) take the gateway session key and use it *verbatim* as the remote session name — see ``plugins/memory/honcho/client.py:546-554``:: # Gateway session key: stable per-chat identifier passed by the gateway # (e.g. "agent:main:telegram:dm:8439114563"). Sanitize colons to hyphens # for Honcho session ID compatibility. This takes priority over strategy- # based resolution because gateway platforms need per-chat isolation that # cwd-based strategies cannot provide. if gateway_session_key: sanitized = re.sub(r'[^a-zA-Z0-9_-]+', '-', gateway_session_key).strip('-') Two profiles that point at the same Honcho / memory backend therefore read and write the same remote namespace — exactly the "WeChat messages may read or write to another profile's memory" case from the report. Root cause ---------- ``gateway/session.py:440-479`` (pre-fix) hardcoded ``agent:main`` in 5 places — 4 DM branches and one group/channel branch — with no profile parameter. Fix --- Add a small ``_resolve_session_key_prefix(profile)`` helper and an optional ``profile`` kwarg to ``build_session_key``: * ``profile is None`` (the default — unchanged call sites) resolves the active profile from ``hermes_cli.profiles.get_active_profile_name``. * Profile ``"default"`` (or an empty/None string after fallback) keeps the legacy ``agent:main`` prefix — preserving every existing session key, every hardcoded test fixture (``tests/test_mcp_serve.py`` has 30+ such strings), and every remote memory-provider record. * Named profiles get ``agent:<profile>:…``, so "coder" and "default" now produce distinct keys for the same ``(platform, chat_id)``. * Explicit ``profile=`` argument wins over the resolved default. * Import failures of ``hermes_cli.profiles`` fall back to the legacy prefix so narrow unit-test contexts don't start raising. Behaviour matrix ---------------- | Active profile | Resolved prefix | Example DM key | | --- | --- | --- | | *unset / ``default``* | ``agent:main`` | ``agent:main:telegram:dm:99`` | | ``coder`` | ``agent:coder`` | ``agent:coder:telegram:dm:99`` | | ``custom`` (HERMES_HOME outside profile tree) | ``agent:custom`` | ``agent:custom:telegram:dm:99`` | | explicit ``profile="default"`` | ``agent:main`` | ``agent:main:...`` | | explicit ``profile="coder"`` | ``agent:coder`` | ``agent:coder:...`` | Narrow scope — explicitly not changed ------------------------------------- * **Migration of existing sessions.** The default-profile prefix stays ``agent:main`` on purpose, so nobody's sessions orphan. * **Caller API.** Every existing caller (``gateway/platforms/base.py``, ``gateway/platforms/slack.py``, ``gateway/platforms/feishu.py``, ``gateway/platforms/wecom.py``, and tests) uses the default ``profile=None`` path. No caller signatures changed. * **Per-message caching.** ``run_agent.py:720`` already stores ``_gateway_session_key`` per agent instance. The resolved profile is stable for a process lifetime (HERMES_HOME doesn't change mid-process), so repeated resolution is sub-millisecond and local. * **``sessions_dir`` layout.** Already ``HERMES_HOME``-scoped; unchanged. Regression coverage ------------------- ``tests/gateway/test_session.py`` gets a new ``TestProfileScopedSessionKeys`` class with 10 parametrised cases: * 3 backward-compat canaries — ``profile="default"`` and ``profile=None`` with ``get_active_profile_name`` mocked to ``"default"`` must still produce ``agent:main:…``. * 3 new-behaviour cases — named profile via explicit arg, via env, and for group/thread keys. * 1 reporter repro — same ``(telegram, chat_id=99)`` under two different profiles must produce different keys. * 1 override pin — explicit ``profile="default"`` wins over an active ``"coder"`` profile. * 1 "custom" profile case — unusual ``HERMES_HOME`` paths surface as ``agent:custom:…``. * 1 import-failure canary — ``ImportError`` from ``hermes_cli.profiles`` falls back to ``agent:main`` instead of raising. 8 of the 10 fail on clean ``origin/main``. The 2 that pass on main (both backcompat canaries for the default profile) pin preserved behaviour — they'd catch any future regression that accidentally changes the default-profile key shape. Validation ---------- ``source venv/bin/activate && python -m pytest tests/gateway/test_session.py tests/gateway/test_session_race_guard.py tests/gateway/test_session_model_reset.py tests/test_mcp_serve.py tests/gateway/test_session_dm_thread_seeding.py -q`` → **180 passed**. Broader ``tests/gateway`` under ``-n auto`` → 14 pre-existing baseline failures (dingtalk card lifecycle, matrix encrypted upload, approve/deny E2E, whatsapp bridge runtime / xdist flakes). Zero in touched code paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…NousResearch#12099) Copilot flagged that ``gateway.run._parse_session_key`` only accepted ``agent:main:...`` keys — so after the parent change threaded profile scope through ``build_session_key``, named-profile session keys like ``agent:coder:telegram:dm:99`` would silently return ``None`` from the parser. That broke: * Gateway shutdown notifications (``_notify_active_sessions_of_shutdown`` at ``run.py:1549``) — named-profile users wouldn't receive the restart warning. * Synthetic process-event routing when the session-store lookup misses (``run.py:7929``) — chat routing would fall back to env metadata only. Fix: relax the strict ``parts[1] == "main"`` check to require only a non-empty profile scope. Structural checks remain — ``parts[0]`` must be ``agent`` and the key must have at least 5 colon-separated fields. Also: * Update the existing ``test_parse_session_key_wrong_prefix`` test — the ``agent:cron:...`` assertion was pinning the old strict behaviour; replace it with an ``other:main:...`` check for the ``parts[0] != "agent"`` path and an ``agent::...`` check for the empty-scope path. * Add five new regression tests under ``test_background_process_notifications.py`` pinning that named- profile keys (DM, group, thread, ``custom`` profile) parse correctly and produce the same platform/chat_id routing as the equivalent default-profile key. * Rename / broaden ``test_hermes_cli_import_failure_falls_back_to_default`` → ``test_active_profile_resolution_failure_falls_back_to_default`` (Copilot nit — the patch fixture raises from the mocked function, not from the import itself). Add a second ``RuntimeError`` side- effect to pin that the fallback covers any exception. Validation ---------- ``source venv/bin/activate && python -m pytest tests/gateway/test_session.py::TestProfileScopedSessionKeys tests/gateway/test_background_process_notifications.py -q`` → **42 passed**. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
aad2af8 to
290c20d
Compare
|
Rebased onto current Re-ran focused tests on the rebased head (
Fix is unchanged: |
|
Closing to keep the queue clean — happy to reopen if this is still useful. |
Fixes #12099.
TL;DR
gateway.session.build_session_keyhardcodedagent:mainas the leading component of every session key, ignoring the active Hermes profile. Two profiles (e.g.defaultandcoder) that share a memory backend therefore collide on the same remote namespace.Fix: thread the active profile through
build_session_key. Thedefaultprofile keeps the legacyagent:mainprefix (backcompat for every existing session and remote memory record); named profiles getagent:<profile>:….Why it matters (beyond cosmetics)
Local session storage is already isolated at the filesystem level — each profile has its own
HERMES_HOMEand therefore its ownsessions/directory — so the hardcode doesn't corrupt local SQLite/JSONL.The real collision is in external memory-provider namespaces.
plugins/memory/honcho/client.py::resolve_session_nametakes the gateway session key and uses it verbatim as the remote session name (colons sanitized to hyphens):The equivalent pattern exists in RetainDB / ByteRover / Supermemory integrations. Two profiles that point at the same memory backend therefore read/write the same remote namespace — exactly the "WeChat messages may read or write to another profile's memory" case from the report.
Root cause
gateway/session.py:440-479(pre-fix) hardcodedagent:mainin 5 places — 4 DM branches and one group/channel branch — with no profile parameter on the function.Fix
Small
_resolve_session_key_prefix(profile)helper + optionalprofilekwarg:and rebuild every formatted return in
build_session_keyusing that prefix. The 5"agent:main"literals are replaced with the resolvedprefix— nothing else changes in the DM / group / thread logic.Behaviour matrix
defaultagent:mainagent:main:telegram:dm:99coderagent:coderagent:coder:telegram:dm:99custom(HERMES_HOME outside the profile tree)agent:customagent:custom:telegram:dm:99profile="default"agent:mainagent:main:...profile="coder"agent:coderagent:coder:...Every existing caller (
gateway/platforms/base.py,slack.py,feishu.py,wecom.py, tests) uses the defaultprofile=Nonepath. Zero caller signatures change.Narrow scope — explicitly not changed
agent:mainon purpose. Orphaning every user's existing session history (and their memory-provider records) is not acceptable.run_agent.py:720already stores_gateway_session_keyper agent instance. Profile resolution is process-lifetime-stable and sub-millisecond — no need for module-level caching here.sessions_dirlayout. AlreadyHERMES_HOME-scoped, unchanged.honcho/client.pyis unchanged — my fix just gives it a more discriminating input.Regression coverage
tests/gateway/test_session.pygets a newTestProfileScopedSessionKeysclass with 10 parametrised cases:profile="default"andprofile=Nonewithget_active_profile_namemocked to"default"must still produceagent:main:….(telegram, chat_id=99)under two different profiles must produce different keys.profile="default"wins over an active"coder"profile.HERMES_HOMEpaths surface asagent:custom:….ImportErrorfromhermes_cli.profilesfalls back toagent:maininstead of raising.8 of 10 fail on clean
origin/mainwith signature and behaviour errors:The 2 that pass on main are backcompat canaries — they happen to match because of the hardcoded string. Keeping them pins preserved behaviour so any future regression that changes the default-profile key shape gets caught.
Validation
Broader
tests/gatewayunder-n auto→ 14 pre-existing baseline failures (dingtalk card lifecycle, matrix encrypted upload, approve/deny E2E, whatsapp bridge runtime / xdist flakes). Zero in touched code. Per-test baseline audit available on request.Pre-empted review questions
Q. Why keep the
agent:mainliteral for the default profile? Isn't the natural renameagent:default?Two reasons:
tests/test_mcp_serve.py,tests/gateway/test_session*.py,tests/cron/test_codex_execution_paths.pyall hardcodeagent:main:…— a rename would churn ~80 test literals unrelated to this bug. Scope creep.The explicit mapping (
default→agent:main, named profile →agent:<name>) documents the historical name at the only place it's relevant.Q. Is
get_active_profile_name()cheap enough to call on everybuild_session_key?Yes. It does one
Path.resolve()and astartswithcheck. Benchmarked locally at ~30 µs.build_session_keyis itself called once per inbound message — orders of magnitude below network latency.Q. Import cycle risk?
No.
gateway/session.pyimports fromhermes_cli/profiles.pyinside_resolve_session_key_prefix(deferred).hermes_cli/profiles.pyonly imports fromhermes_constantsand (conditionally inside functions) fromgateway.status— not fromgateway.session. No cycle.Q. What about
run_agent.py:720's cached_gateway_session_key?That cache stores the key after it's been built. As long as the active profile doesn't change mid-process (it can't — HERMES_HOME is fixed per process), the cached value is correct.
Q. What if someone flips
active_profileat runtime?They can't.
HERMES_HOMEis baked in at process start byhermes_constants.get_hermes_home(). Profile switches require a process restart by design.Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.