Skip to content

fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute #217

Merged
teknium1 merged 1 commit into
NousResearch:mainfrom
Farukest:fix/gateway-session-attribute-bugs
Mar 2, 2026
Merged

fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute #217
teknium1 merged 1 commit into
NousResearch:mainfrom
Farukest:fix/gateway-session-attribute-bugs

Conversation

@Farukest

Copy link
Copy Markdown
Contributor

/retry and /undo set session_entry.conversation_history which does not exist on SessionEntry. The truncated history was never written to disk, so the next message reload picked up the full unmodified transcript - both commands appeared to work but did nothing.

/reset accessed self.session_store._sessions which does not exist on SessionStore (the correct attribute is _entries). The resulting AttributeError was caught by a bare except, silently skipping the memory flush before reset.

Changes

  • Added SessionStore.rewrite_transcript() that persists changes to both the JSONL file and SQLite database
  • Updated /retry and /undo to call rewrite_transcript() instead of setting a non-existent attribute
  • Fixed /reset to use _entries instead of _sessions

Tests

Added 3 regression tests to tests/gateway/test_session.py:

  • rewrite_transcript replaces JSONL content correctly
  • rewrite_transcript with empty list clears transcript
  • SessionStore has _entries attribute, not _sessions

All 20 tests pass.

Closes #210

/retry and /undo set session_entry.conversation_history which does not
exist on SessionEntry. The truncated history was never written to disk,
so the next message reload picked up the full unmodified transcript.

Added SessionStore.rewrite_transcript() that persists changes to both
the JSONL file and SQLite database, and updated both commands to use it.

/reset accessed self.session_store._sessions which does not exist on
SessionStore (the correct attribute is _entries). Also replaced the
hand-coded session key with _generate_session_key() to fix WhatsApp DM
sessions using the wrong key format.

Closes NousResearch#210
@teknium1

teknium1 commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Merged into main via 719f2ee — thanks for the excellent work @Farukest!

We had already landed a fix for the core bugs in 698b359 (from your issue #210), but your PR brought two things we didn't have:

  1. Regression tests — the 3 tests in tests/gateway/test_session.py are great, all 20 pass
  2. _generate_session_key() refactor — replacing the hardcoded session key string in /reset with the existing method is a nice cleanup

We kept our SessionDB.clear_messages() approach (proper public method) over the direct _conn access, and we also caught an additional instance of the same bug in /compress that your PR didn't cover.

Final result is a combination of both contributions. Great bug hunting!

@Farukest

Farukest commented Mar 3, 2026

Copy link
Copy Markdown
Contributor Author

Final result is a combination of both contributions. Great bug hunting!

Thanks for the merge and the follow up fix. Happy to contribute @teknium1

kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
…stance (G1) (NousResearch#217, NousResearch#44)

The singleton ACP transport pool landing in this branch hands one
ToolBridgeServer to many AIAgents. The pre-change bridge held a single
``_dispatcher`` slot plus a single ``_snapshot`` field, both mutated by
whichever AIAgent most recently asked gemini-cli to start a session.
With a shared bridge that means session B's update silently overwrites
session A's routing — a tool call from a connection spawned during
session A would route into session B's AIAgent and read session B's
tool list. Privacy-grade cross-session leak.

This commit closes the leak by keying both fields off the per-connection
binding instead of off the bridge instance:

- ``publish(tools, *, binding, dispatcher)`` registers a per-session
  binding under an opaque id. Pool callers pass both; un-pooled
  callers omit them and keep using the legacy single-tenant slots.
- The mcpServers payload now carries ``HERMES_ACP_SESSION_BINDING`` so
  gemini-cli's proxy receives the id and echoes it in its hello frame.
- ``_handle_client`` stores ``conn_binding`` as a local on the
  connection task. ``_handle_tools_call`` and ``tools/list`` look the
  dispatcher and snapshot up under that id, never via a shared
  attribute. Missing binding at call-time returns an error envelope
  rather than routing into someone else's session.
- ``unpublish(binding)`` lets the pool drop a binding when an agent
  releases its hold so the dispatcher closure (which pins the
  AIAgent) gets GC'd.
- ``stop()`` clears both ``_bindings`` and ``_snapshot`` so a restarted
  bridge starts cold.

Unpooled callers (the entire current production path) are unchanged:
no binding env-var, no binding in hello, same per-instance dispatcher
and snapshot. Sets up but does not enable the pool — that comes in a
later commit gated by ``agent.transport_pool.enabled``.

Existing tests/agent/test_gemini_acp_tool_bridge.py still passes (7/7)
since the legacy path is preserved.
kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
, NousResearch#44)

New module ``agent/transports/acp_pool.py``. Phase 1 of the NousResearch#44 cold-
start latency fix: lift the gemini-acp subprocess + tool bridge out of
per-AIAgent ownership into a gateway-scoped singleton keyed by
(provider, model, working_dir).

Design notes captured in the module docstring:

- ``ACPPool.get()`` returns the same ``GeminiACPClient`` for repeat
  calls with the same key; first miss spawns, concurrent miss attaches
  to the in-flight spawn future.
- ``_Slot`` mirrors the NousResearch#74 TOCTOU-race pattern from
  ``tools.mcp_tool``: states are ``creating`` / ``ready`` / ``closed``,
  ``waiters`` is a loop-agnostic ``concurrent.futures.Future`` so
  cross-thread callers serialize on one spawn instead of racing into
  two subprocesses.
- The spawn path eagerly calls ``_ensure_underlying()`` so the slot is
  only marked ``ready`` after the gemini-cli subprocess is up — without
  that the first real request would still pay the ~4 s boot.
- Pool teardown is owned exclusively by ``ACPPool.shutdown()``; the
  per-agent release/close paths must NOT touch a pooled transport (a
  follow-up commit enforces this in ``run_agent.py``).
- Currently only ``provider="gemini-acp"`` is wired — codex-app-server
  pooling is named in the plan but lives in its own module and is not
  in this phase. Calls for other providers raise ``ValueError`` so
  misuse fails loudly rather than silently bypassing the pool.

This commit only adds the type; the gateway is not yet wiring it up.
Landing it standalone keeps the diff reviewable and produces a clean
no-op until the next commits flip the feature flag.
kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
…usResearch#217)

Phase 1 of NousResearch#44 cold-start latency fix.

* GatewayRunner.__init__ constructs a singleton ACPPool unconditionally so
  its shutdown hook is always wired. Whether AIAgents actually pull from
  it is gated by ``agent.transport_pool.enabled`` (default False) or the
  ``HERMES_TRANSPORT_POOL_ENABLED`` env-var. Constructor never spawns a
  subprocess on its own; the pool stays cold until first ``get()`` or
  pre-warm.

* ``_start_transport_pool_prewarm`` reads the gateway's effective default
  via ``_resolve_session_agent_runtime``; if the operator has configured
  gemini-acp as the default it spawns a daemon thread that calls
  ``pool.get()`` once with that (provider, model, cwd) tuple. Errors land
  at DEBUG only - pre-warm is a latency optimization, never a correctness
  requirement. Skips silently when the flag is off so prod sees no change.

* Pre-warm scheduling fires in ``_start`` right after the ripgrep boot
  check, off the gateway boot path's critical section.

* ``_stop_impl`` calls ``self._transport_pool.shutdown()`` after the idle
  agent cache cleanup so any in-flight tool-dispatcher closures have been
  released first. This is the ONLY place pooled gemini-cli subprocesses
  and tool-bridge sockets get torn down; per-agent close/release paths
  deliberately skip pooled clients.

Landing this commit is a no-op until ``agent.transport_pool.enabled`` is
flipped on - the pool is constructed but no AIAgent consults it yet
(that wiring lands in the next commit).
kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
…earch#217)

Phase 1 of NousResearch#44 cold-start latency fix.

* ``init_agent`` and ``AIAgent.__init__`` accept ``transport_pool`` +
  ``transport_pool_enabled``. Stashed on the agent as ``_transport_pool``,
  ``_transport_pool_enabled``, ``_transport_from_pool`` (bool), and
  ``_acp_binding`` (per-agent id). Subagents inherit nothing here yet
  - that wiring lands when subagent spawn paths get explicit pool
  forwarding (not in scope for this commit).

* ``create_openai_client`` (gemini-acp branch) now consults the pool
  when both pool ref and the feature flag are set. Pool returns a
  pre-warmed singleton ``GeminiACPClient``; each AIAgent gets a unique
  binding token (``secrets.token_hex(16)``) and pre-registers a NULL-
  tools binding on the bridge so route-by-binding is wired before the
  first publish. Pool failure falls back to the legacy per-AIAgent
  spawn (and logs at ERROR) - we never ship a half-wired pooled agent.

* ``_create_request_openai_client`` injects the binding +
  dispatcher into ``api_kwargs`` for pooled gemini-acp. The facade's
  ``_create_chat_completion`` pops them and forwards to
  ``ToolBridgeServer.publish(binding=..., dispatcher=...)`` - this is
  the G1 routing path the previous commit's per-connection
  ``_ConnectionBinding`` was designed to receive.

* G2 fix (session-id leak): the facade's ``_session_id``, ``_tools_fp``
  and ``_first_prompt_measured`` are now per-binding for pooled callers
  (``_binding_sessions`` / ``_binding_tools_fp`` /
  ``_binding_first_prompt_measured``). Without this two AIAgents
  sharing a pooled facade would collide on a single session-id and
  one's prompts would route into the other's gemini-cli session.
  Un-pooled callers (binding=None) keep using the legacy instance
  fields so the pre-pool path is byte-identical.

* ``GeminiACPClient.release_binding`` drops a binding from the bridge
  (so the dispatcher closure can be GC'd) and clears its per-session
  state on the facade. Called from ``release_clients``, ``close``, and
  ``_replace_primary_openai_client`` instead of ``client.close()``
  when ``_transport_from_pool`` is True - hard close of a pooled
  client would orphan every sibling AIAgent that shares it. The pool
  alone owns the subprocess lifecycle (torn down in
  ``GatewayRunner._stop_impl``).

* All four ``AIAgent(...)`` construction sites in ``gateway/run.py``
  now pass ``transport_pool=self._transport_pool`` +
  ``transport_pool_enabled=self._transport_pool_enabled``.

Cold-start telemetry buckets (``_last_*_ms``) stay facade-wide and
will race noisily across concurrent pooled turns; the consumer
(commit 5's breakdown log) reads them inside the writing turn so the
race is bounded to noisy numbers, not corruption. Per-binding
telemetry is deferred to the metrics commit. gemini-cli has no
session-close RPC, so released bindings leak a dead session-id in
the subprocess's memory; documented and accepted for phase 1
(bounded by subprocess restart).

Landing this commit is still a no-op until
``agent.transport_pool.enabled`` is flipped on - the legacy branch
runs whenever the pool ref or the flag is missing.
kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
…search#217)

Phase 1 of NousResearch#44 cold-start latency fix.

Add ``transport=pool|agent`` to the per-turn ``cold_start_breakdown``
log so operators can verify the pool is doing its job in production:
filter on ``transport=pool`` and the ``spawn`` column should collapse
to near-zero once the pre-warm task has run, while ``transport=agent``
turns continue to pay the full ~10s respawn cost the legacy path
incurs on cache eviction.

Also exposed as ``transport_from_pool`` (bool) in the structured log
``extra`` so structured-log pipelines can split histograms.

No new buckets, no new schema fields - the existing five
spawn/bridge/session_new/first_tool/generation columns are preserved
verbatim.
kkrivonos added a commit to kkrivonos/hermes-agent that referenced this pull request Jun 12, 2026
…rch#217)

Pins the new gateway-scoped pool's contract and guards the two privacy
gates introduced alongside it:

- ACPPool: identity-on-repeat-get, key separation by
  (provider, model, working_dir), spawn-kwargs propagation, _Slot race
  serialization (8 concurrent misses collapse to one spawn), no-close
  on per-agent release_binding, hard close at shutdown (idempotent),
  post-shutdown get raises, failed spawn pops the slot for retry,
  unsupported provider raises.
- G1 (ToolBridgeServer): with one bridge serving two bindings,
  tools/list returns each connection's own snapshot; tools/call routes
  into each connection's own dispatcher; re-publishing binding A does
  not redirect binding B's in-flight calls; unpublish revokes a
  binding mid-session with an error envelope rather than a stale
  route.
- G2 (GeminiACPClient): per-binding _session_id / _tools_fp /
  _first_prompt_measured track independently when one facade serves
  two bindings (no cross-binding session-id clobber); release_binding
  drops only the target binding's per-binding state and the bridge
  unpublish, leaves the sibling untouched, and is safe after close().

The pool tests stub GeminiACPClient via monkeypatch on the lazy import
inside _spawn_gemini_acp, so no gemini-cli subprocess is involved. The
G1 test runs a real ToolBridgeServer over a unix socket. The G2 test
hand-builds a GeminiACPClient via __new__ and patches the bridge +
underlying ACPClient (same pattern used by tests/agent/
test_gemini_acp_tool_wiring.py).

15 tests, all passing locally. Existing tests/agent/
test_gemini_acp_tool_bridge.py (7) and tests/agent/
test_cold_start_instrumentation.py (1) re-run clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/retry and /undo gateway commands have no effect, /reset silently loses memories

2 participants