fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute by Farukest · Pull Request #217 · NousResearch/hermes-agent

Farukest · 2026-02-28T22:33:29Z

/retry and /undo set session_entry.conversation_history which does not exist on SessionEntry. The truncated history was never written to disk, so the next message reload picked up the full unmodified transcript - both commands appeared to work but did nothing.

/reset accessed self.session_store._sessions which does not exist on SessionStore (the correct attribute is _entries). The resulting AttributeError was caught by a bare except, silently skipping the memory flush before reset.

Changes

Added SessionStore.rewrite_transcript() that persists changes to both the JSONL file and SQLite database
Updated /retry and /undo to call rewrite_transcript() instead of setting a non-existent attribute
Fixed /reset to use _entries instead of _sessions

Tests

Added 3 regression tests to tests/gateway/test_session.py:

rewrite_transcript replaces JSONL content correctly
rewrite_transcript with empty list clears transcript
SessionStore has _entries attribute, not _sessions

All 20 tests pass.

Closes #210

/retry and /undo set session_entry.conversation_history which does not exist on SessionEntry. The truncated history was never written to disk, so the next message reload picked up the full unmodified transcript. Added SessionStore.rewrite_transcript() that persists changes to both the JSONL file and SQLite database, and updated both commands to use it. /reset accessed self.session_store._sessions which does not exist on SessionStore (the correct attribute is _entries). Also replaced the hand-coded session key with _generate_session_key() to fix WhatsApp DM sessions using the wrong key format. Closes NousResearch#210

teknium1 · 2026-03-02T08:18:58Z

Merged into main via 719f2ee — thanks for the excellent work @Farukest!

We had already landed a fix for the core bugs in 698b359 (from your issue #210), but your PR brought two things we didn't have:

Regression tests — the 3 tests in tests/gateway/test_session.py are great, all 20 pass
_generate_session_key() refactor — replacing the hardcoded session key string in /reset with the existing method is a nice cleanup

We kept our SessionDB.clear_messages() approach (proper public method) over the direct _conn access, and we also caught an additional instance of the same bug in /compress that your PR didn't cover.

Final result is a combination of both contributions. Great bug hunting!

Farukest · 2026-03-03T19:03:30Z

Final result is a combination of both contributions. Great bug hunting!

Thanks for the merge and the follow up fix. Happy to contribute @teknium1

…stance (G1) (NousResearch#217, NousResearch#44) The singleton ACP transport pool landing in this branch hands one ToolBridgeServer to many AIAgents. The pre-change bridge held a single ``_dispatcher`` slot plus a single ``_snapshot`` field, both mutated by whichever AIAgent most recently asked gemini-cli to start a session. With a shared bridge that means session B's update silently overwrites session A's routing — a tool call from a connection spawned during session A would route into session B's AIAgent and read session B's tool list. Privacy-grade cross-session leak. This commit closes the leak by keying both fields off the per-connection binding instead of off the bridge instance: - ``publish(tools, *, binding, dispatcher)`` registers a per-session binding under an opaque id. Pool callers pass both; un-pooled callers omit them and keep using the legacy single-tenant slots. - The mcpServers payload now carries ``HERMES_ACP_SESSION_BINDING`` so gemini-cli's proxy receives the id and echoes it in its hello frame. - ``_handle_client`` stores ``conn_binding`` as a local on the connection task. ``_handle_tools_call`` and ``tools/list`` look the dispatcher and snapshot up under that id, never via a shared attribute. Missing binding at call-time returns an error envelope rather than routing into someone else's session. - ``unpublish(binding)`` lets the pool drop a binding when an agent releases its hold so the dispatcher closure (which pins the AIAgent) gets GC'd. - ``stop()`` clears both ``_bindings`` and ``_snapshot`` so a restarted bridge starts cold. Unpooled callers (the entire current production path) are unchanged: no binding env-var, no binding in hello, same per-instance dispatcher and snapshot. Sets up but does not enable the pool — that comes in a later commit gated by ``agent.transport_pool.enabled``. Existing tests/agent/test_gemini_acp_tool_bridge.py still passes (7/7) since the legacy path is preserved.

, NousResearch#44) New module ``agent/transports/acp_pool.py``. Phase 1 of the NousResearch#44 cold- start latency fix: lift the gemini-acp subprocess + tool bridge out of per-AIAgent ownership into a gateway-scoped singleton keyed by (provider, model, working_dir). Design notes captured in the module docstring: - ``ACPPool.get()`` returns the same ``GeminiACPClient`` for repeat calls with the same key; first miss spawns, concurrent miss attaches to the in-flight spawn future. - ``_Slot`` mirrors the NousResearch#74 TOCTOU-race pattern from ``tools.mcp_tool``: states are ``creating`` / ``ready`` / ``closed``, ``waiters`` is a loop-agnostic ``concurrent.futures.Future`` so cross-thread callers serialize on one spawn instead of racing into two subprocesses. - The spawn path eagerly calls ``_ensure_underlying()`` so the slot is only marked ``ready`` after the gemini-cli subprocess is up — without that the first real request would still pay the ~4 s boot. - Pool teardown is owned exclusively by ``ACPPool.shutdown()``; the per-agent release/close paths must NOT touch a pooled transport (a follow-up commit enforces this in ``run_agent.py``). - Currently only ``provider="gemini-acp"`` is wired — codex-app-server pooling is named in the plan but lives in its own module and is not in this phase. Calls for other providers raise ``ValueError`` so misuse fails loudly rather than silently bypassing the pool. This commit only adds the type; the gateway is not yet wiring it up. Landing it standalone keeps the diff reviewable and produces a clean no-op until the next commits flip the feature flag.

…usResearch#217) Phase 1 of NousResearch#44 cold-start latency fix. * GatewayRunner.__init__ constructs a singleton ACPPool unconditionally so its shutdown hook is always wired. Whether AIAgents actually pull from it is gated by ``agent.transport_pool.enabled`` (default False) or the ``HERMES_TRANSPORT_POOL_ENABLED`` env-var. Constructor never spawns a subprocess on its own; the pool stays cold until first ``get()`` or pre-warm. * ``_start_transport_pool_prewarm`` reads the gateway's effective default via ``_resolve_session_agent_runtime``; if the operator has configured gemini-acp as the default it spawns a daemon thread that calls ``pool.get()`` once with that (provider, model, cwd) tuple. Errors land at DEBUG only - pre-warm is a latency optimization, never a correctness requirement. Skips silently when the flag is off so prod sees no change. * Pre-warm scheduling fires in ``_start`` right after the ripgrep boot check, off the gateway boot path's critical section. * ``_stop_impl`` calls ``self._transport_pool.shutdown()`` after the idle agent cache cleanup so any in-flight tool-dispatcher closures have been released first. This is the ONLY place pooled gemini-cli subprocesses and tool-bridge sockets get torn down; per-agent close/release paths deliberately skip pooled clients. Landing this commit is a no-op until ``agent.transport_pool.enabled`` is flipped on - the pool is constructed but no AIAgent consults it yet (that wiring lands in the next commit).

…earch#217) Phase 1 of NousResearch#44 cold-start latency fix. * ``init_agent`` and ``AIAgent.__init__`` accept ``transport_pool`` + ``transport_pool_enabled``. Stashed on the agent as ``_transport_pool``, ``_transport_pool_enabled``, ``_transport_from_pool`` (bool), and ``_acp_binding`` (per-agent id). Subagents inherit nothing here yet - that wiring lands when subagent spawn paths get explicit pool forwarding (not in scope for this commit). * ``create_openai_client`` (gemini-acp branch) now consults the pool when both pool ref and the feature flag are set. Pool returns a pre-warmed singleton ``GeminiACPClient``; each AIAgent gets a unique binding token (``secrets.token_hex(16)``) and pre-registers a NULL- tools binding on the bridge so route-by-binding is wired before the first publish. Pool failure falls back to the legacy per-AIAgent spawn (and logs at ERROR) - we never ship a half-wired pooled agent. * ``_create_request_openai_client`` injects the binding + dispatcher into ``api_kwargs`` for pooled gemini-acp. The facade's ``_create_chat_completion`` pops them and forwards to ``ToolBridgeServer.publish(binding=..., dispatcher=...)`` - this is the G1 routing path the previous commit's per-connection ``_ConnectionBinding`` was designed to receive. * G2 fix (session-id leak): the facade's ``_session_id``, ``_tools_fp`` and ``_first_prompt_measured`` are now per-binding for pooled callers (``_binding_sessions`` / ``_binding_tools_fp`` / ``_binding_first_prompt_measured``). Without this two AIAgents sharing a pooled facade would collide on a single session-id and one's prompts would route into the other's gemini-cli session. Un-pooled callers (binding=None) keep using the legacy instance fields so the pre-pool path is byte-identical. * ``GeminiACPClient.release_binding`` drops a binding from the bridge (so the dispatcher closure can be GC'd) and clears its per-session state on the facade. Called from ``release_clients``, ``close``, and ``_replace_primary_openai_client`` instead of ``client.close()`` when ``_transport_from_pool`` is True - hard close of a pooled client would orphan every sibling AIAgent that shares it. The pool alone owns the subprocess lifecycle (torn down in ``GatewayRunner._stop_impl``). * All four ``AIAgent(...)`` construction sites in ``gateway/run.py`` now pass ``transport_pool=self._transport_pool`` + ``transport_pool_enabled=self._transport_pool_enabled``. Cold-start telemetry buckets (``_last_*_ms``) stay facade-wide and will race noisily across concurrent pooled turns; the consumer (commit 5's breakdown log) reads them inside the writing turn so the race is bounded to noisy numbers, not corruption. Per-binding telemetry is deferred to the metrics commit. gemini-cli has no session-close RPC, so released bindings leak a dead session-id in the subprocess's memory; documented and accepted for phase 1 (bounded by subprocess restart). Landing this commit is still a no-op until ``agent.transport_pool.enabled`` is flipped on - the legacy branch runs whenever the pool ref or the flag is missing.

…search#217) Phase 1 of NousResearch#44 cold-start latency fix. Add ``transport=pool|agent`` to the per-turn ``cold_start_breakdown`` log so operators can verify the pool is doing its job in production: filter on ``transport=pool`` and the ``spawn`` column should collapse to near-zero once the pre-warm task has run, while ``transport=agent`` turns continue to pay the full ~10s respawn cost the legacy path incurs on cache eviction. Also exposed as ``transport_from_pool`` (bool) in the structured log ``extra`` so structured-log pipelines can split histograms. No new buckets, no new schema fields - the existing five spawn/bridge/session_new/first_tool/generation columns are preserved verbatim.

…rch#217) Pins the new gateway-scoped pool's contract and guards the two privacy gates introduced alongside it: - ACPPool: identity-on-repeat-get, key separation by (provider, model, working_dir), spawn-kwargs propagation, _Slot race serialization (8 concurrent misses collapse to one spawn), no-close on per-agent release_binding, hard close at shutdown (idempotent), post-shutdown get raises, failed spawn pops the slot for retry, unsupported provider raises. - G1 (ToolBridgeServer): with one bridge serving two bindings, tools/list returns each connection's own snapshot; tools/call routes into each connection's own dispatcher; re-publishing binding A does not redirect binding B's in-flight calls; unpublish revokes a binding mid-session with an error envelope rather than a stale route. - G2 (GeminiACPClient): per-binding _session_id / _tools_fp / _first_prompt_measured track independently when one facade serves two bindings (no cross-binding session-id clobber); release_binding drops only the target binding's per-binding state and the bridge unpublish, leaves the sibling untouched, and is safe after close(). The pool tests stub GeminiACPClient via monkeypatch on the lazy import inside _spawn_gemini_acp, so no gemini-cli subprocess is involved. The G1 test runs a real ToolBridgeServer over a unix socket. The G2 test hand-builds a GeminiACPClient via __new__ and patches the bridge + underlying ACPClient (same pattern used by tests/agent/ test_gemini_acp_tool_wiring.py). 15 tests, all passing locally. Existing tests/agent/ test_gemini_acp_tool_bridge.py (7) and tests/agent/ test_cold_start_instrumentation.py (1) re-run clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Farukest force-pushed the fix/gateway-session-attribute-bugs branch from 9f550df to b7f8a17 Compare February 28, 2026 22:43

Farukest mentioned this pull request Feb 28, 2026

/retry and /undo gateway commands have no effect, /reset silently loses memories #210

Closed

teknium1 merged commit 719f2ee into NousResearch:main Mar 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute #217

fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute #217
teknium1 merged 1 commit into
NousResearch:mainfrom
Farukest:fix/gateway-session-attribute-bugs

Farukest commented Feb 28, 2026

Uh oh!

teknium1 commented Mar 2, 2026

Uh oh!

Farukest commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Farukest commented Feb 28, 2026

Changes

Tests

Uh oh!

teknium1 commented Mar 2, 2026

Uh oh!

Farukest commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants