fix(gateway): persist transcript changes in /retry, /undo and fix /reset attribute #217
Merged
teknium1 merged 1 commit intoMar 2, 2026
Conversation
/retry and /undo set session_entry.conversation_history which does not exist on SessionEntry. The truncated history was never written to disk, so the next message reload picked up the full unmodified transcript. Added SessionStore.rewrite_transcript() that persists changes to both the JSONL file and SQLite database, and updated both commands to use it. /reset accessed self.session_store._sessions which does not exist on SessionStore (the correct attribute is _entries). Also replaced the hand-coded session key with _generate_session_key() to fix WhatsApp DM sessions using the wrong key format. Closes NousResearch#210
9f550df to
b7f8a17
Compare
Contributor
|
Merged into main via 719f2ee — thanks for the excellent work @Farukest! We had already landed a fix for the core bugs in 698b359 (from your issue #210), but your PR brought two things we didn't have:
We kept our Final result is a combination of both contributions. Great bug hunting! |
Contributor
Author
Thanks for the merge and the follow up fix. Happy to contribute @teknium1 |
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
…stance (G1) (NousResearch#217, NousResearch#44) The singleton ACP transport pool landing in this branch hands one ToolBridgeServer to many AIAgents. The pre-change bridge held a single ``_dispatcher`` slot plus a single ``_snapshot`` field, both mutated by whichever AIAgent most recently asked gemini-cli to start a session. With a shared bridge that means session B's update silently overwrites session A's routing — a tool call from a connection spawned during session A would route into session B's AIAgent and read session B's tool list. Privacy-grade cross-session leak. This commit closes the leak by keying both fields off the per-connection binding instead of off the bridge instance: - ``publish(tools, *, binding, dispatcher)`` registers a per-session binding under an opaque id. Pool callers pass both; un-pooled callers omit them and keep using the legacy single-tenant slots. - The mcpServers payload now carries ``HERMES_ACP_SESSION_BINDING`` so gemini-cli's proxy receives the id and echoes it in its hello frame. - ``_handle_client`` stores ``conn_binding`` as a local on the connection task. ``_handle_tools_call`` and ``tools/list`` look the dispatcher and snapshot up under that id, never via a shared attribute. Missing binding at call-time returns an error envelope rather than routing into someone else's session. - ``unpublish(binding)`` lets the pool drop a binding when an agent releases its hold so the dispatcher closure (which pins the AIAgent) gets GC'd. - ``stop()`` clears both ``_bindings`` and ``_snapshot`` so a restarted bridge starts cold. Unpooled callers (the entire current production path) are unchanged: no binding env-var, no binding in hello, same per-instance dispatcher and snapshot. Sets up but does not enable the pool — that comes in a later commit gated by ``agent.transport_pool.enabled``. Existing tests/agent/test_gemini_acp_tool_bridge.py still passes (7/7) since the legacy path is preserved.
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
, NousResearch#44) New module ``agent/transports/acp_pool.py``. Phase 1 of the NousResearch#44 cold- start latency fix: lift the gemini-acp subprocess + tool bridge out of per-AIAgent ownership into a gateway-scoped singleton keyed by (provider, model, working_dir). Design notes captured in the module docstring: - ``ACPPool.get()`` returns the same ``GeminiACPClient`` for repeat calls with the same key; first miss spawns, concurrent miss attaches to the in-flight spawn future. - ``_Slot`` mirrors the NousResearch#74 TOCTOU-race pattern from ``tools.mcp_tool``: states are ``creating`` / ``ready`` / ``closed``, ``waiters`` is a loop-agnostic ``concurrent.futures.Future`` so cross-thread callers serialize on one spawn instead of racing into two subprocesses. - The spawn path eagerly calls ``_ensure_underlying()`` so the slot is only marked ``ready`` after the gemini-cli subprocess is up — without that the first real request would still pay the ~4 s boot. - Pool teardown is owned exclusively by ``ACPPool.shutdown()``; the per-agent release/close paths must NOT touch a pooled transport (a follow-up commit enforces this in ``run_agent.py``). - Currently only ``provider="gemini-acp"`` is wired — codex-app-server pooling is named in the plan but lives in its own module and is not in this phase. Calls for other providers raise ``ValueError`` so misuse fails loudly rather than silently bypassing the pool. This commit only adds the type; the gateway is not yet wiring it up. Landing it standalone keeps the diff reviewable and produces a clean no-op until the next commits flip the feature flag.
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
…usResearch#217) Phase 1 of NousResearch#44 cold-start latency fix. * GatewayRunner.__init__ constructs a singleton ACPPool unconditionally so its shutdown hook is always wired. Whether AIAgents actually pull from it is gated by ``agent.transport_pool.enabled`` (default False) or the ``HERMES_TRANSPORT_POOL_ENABLED`` env-var. Constructor never spawns a subprocess on its own; the pool stays cold until first ``get()`` or pre-warm. * ``_start_transport_pool_prewarm`` reads the gateway's effective default via ``_resolve_session_agent_runtime``; if the operator has configured gemini-acp as the default it spawns a daemon thread that calls ``pool.get()`` once with that (provider, model, cwd) tuple. Errors land at DEBUG only - pre-warm is a latency optimization, never a correctness requirement. Skips silently when the flag is off so prod sees no change. * Pre-warm scheduling fires in ``_start`` right after the ripgrep boot check, off the gateway boot path's critical section. * ``_stop_impl`` calls ``self._transport_pool.shutdown()`` after the idle agent cache cleanup so any in-flight tool-dispatcher closures have been released first. This is the ONLY place pooled gemini-cli subprocesses and tool-bridge sockets get torn down; per-agent close/release paths deliberately skip pooled clients. Landing this commit is a no-op until ``agent.transport_pool.enabled`` is flipped on - the pool is constructed but no AIAgent consults it yet (that wiring lands in the next commit).
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
…earch#217) Phase 1 of NousResearch#44 cold-start latency fix. * ``init_agent`` and ``AIAgent.__init__`` accept ``transport_pool`` + ``transport_pool_enabled``. Stashed on the agent as ``_transport_pool``, ``_transport_pool_enabled``, ``_transport_from_pool`` (bool), and ``_acp_binding`` (per-agent id). Subagents inherit nothing here yet - that wiring lands when subagent spawn paths get explicit pool forwarding (not in scope for this commit). * ``create_openai_client`` (gemini-acp branch) now consults the pool when both pool ref and the feature flag are set. Pool returns a pre-warmed singleton ``GeminiACPClient``; each AIAgent gets a unique binding token (``secrets.token_hex(16)``) and pre-registers a NULL- tools binding on the bridge so route-by-binding is wired before the first publish. Pool failure falls back to the legacy per-AIAgent spawn (and logs at ERROR) - we never ship a half-wired pooled agent. * ``_create_request_openai_client`` injects the binding + dispatcher into ``api_kwargs`` for pooled gemini-acp. The facade's ``_create_chat_completion`` pops them and forwards to ``ToolBridgeServer.publish(binding=..., dispatcher=...)`` - this is the G1 routing path the previous commit's per-connection ``_ConnectionBinding`` was designed to receive. * G2 fix (session-id leak): the facade's ``_session_id``, ``_tools_fp`` and ``_first_prompt_measured`` are now per-binding for pooled callers (``_binding_sessions`` / ``_binding_tools_fp`` / ``_binding_first_prompt_measured``). Without this two AIAgents sharing a pooled facade would collide on a single session-id and one's prompts would route into the other's gemini-cli session. Un-pooled callers (binding=None) keep using the legacy instance fields so the pre-pool path is byte-identical. * ``GeminiACPClient.release_binding`` drops a binding from the bridge (so the dispatcher closure can be GC'd) and clears its per-session state on the facade. Called from ``release_clients``, ``close``, and ``_replace_primary_openai_client`` instead of ``client.close()`` when ``_transport_from_pool`` is True - hard close of a pooled client would orphan every sibling AIAgent that shares it. The pool alone owns the subprocess lifecycle (torn down in ``GatewayRunner._stop_impl``). * All four ``AIAgent(...)`` construction sites in ``gateway/run.py`` now pass ``transport_pool=self._transport_pool`` + ``transport_pool_enabled=self._transport_pool_enabled``. Cold-start telemetry buckets (``_last_*_ms``) stay facade-wide and will race noisily across concurrent pooled turns; the consumer (commit 5's breakdown log) reads them inside the writing turn so the race is bounded to noisy numbers, not corruption. Per-binding telemetry is deferred to the metrics commit. gemini-cli has no session-close RPC, so released bindings leak a dead session-id in the subprocess's memory; documented and accepted for phase 1 (bounded by subprocess restart). Landing this commit is still a no-op until ``agent.transport_pool.enabled`` is flipped on - the legacy branch runs whenever the pool ref or the flag is missing.
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
…search#217) Phase 1 of NousResearch#44 cold-start latency fix. Add ``transport=pool|agent`` to the per-turn ``cold_start_breakdown`` log so operators can verify the pool is doing its job in production: filter on ``transport=pool`` and the ``spawn`` column should collapse to near-zero once the pre-warm task has run, while ``transport=agent`` turns continue to pay the full ~10s respawn cost the legacy path incurs on cache eviction. Also exposed as ``transport_from_pool`` (bool) in the structured log ``extra`` so structured-log pipelines can split histograms. No new buckets, no new schema fields - the existing five spawn/bridge/session_new/first_tool/generation columns are preserved verbatim.
kkrivonos
added a commit
to kkrivonos/hermes-agent
that referenced
this pull request
Jun 12, 2026
…rch#217) Pins the new gateway-scoped pool's contract and guards the two privacy gates introduced alongside it: - ACPPool: identity-on-repeat-get, key separation by (provider, model, working_dir), spawn-kwargs propagation, _Slot race serialization (8 concurrent misses collapse to one spawn), no-close on per-agent release_binding, hard close at shutdown (idempotent), post-shutdown get raises, failed spawn pops the slot for retry, unsupported provider raises. - G1 (ToolBridgeServer): with one bridge serving two bindings, tools/list returns each connection's own snapshot; tools/call routes into each connection's own dispatcher; re-publishing binding A does not redirect binding B's in-flight calls; unpublish revokes a binding mid-session with an error envelope rather than a stale route. - G2 (GeminiACPClient): per-binding _session_id / _tools_fp / _first_prompt_measured track independently when one facade serves two bindings (no cross-binding session-id clobber); release_binding drops only the target binding's per-binding state and the bridge unpublish, leaves the sibling untouched, and is safe after close(). The pool tests stub GeminiACPClient via monkeypatch on the lazy import inside _spawn_gemini_acp, so no gemini-cli subprocess is involved. The G1 test runs a real ToolBridgeServer over a unix socket. The G2 test hand-builds a GeminiACPClient via __new__ and patches the bridge + underlying ACPClient (same pattern used by tests/agent/ test_gemini_acp_tool_wiring.py). 15 tests, all passing locally. Existing tests/agent/ test_gemini_acp_tool_bridge.py (7) and tests/agent/ test_cold_start_instrumentation.py (1) re-run clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
/retryand/undosetsession_entry.conversation_historywhich does not exist onSessionEntry. The truncated history was never written to disk, so the next message reload picked up the full unmodified transcript - both commands appeared to work but did nothing./resetaccessedself.session_store._sessionswhich does not exist onSessionStore(the correct attribute is_entries). The resultingAttributeErrorwas caught by a bare except, silently skipping the memory flush before reset.Changes
SessionStore.rewrite_transcript()that persists changes to both the JSONL file and SQLite database/retryand/undoto callrewrite_transcript()instead of setting a non-existent attribute/resetto use_entriesinstead of_sessionsTests
Added 3 regression tests to
tests/gateway/test_session.py:rewrite_transcriptreplaces JSONL content correctlyrewrite_transcriptwith empty list clears transcriptSessionStorehas_entriesattribute, not_sessionsAll 20 tests pass.
Closes #210