Summary
When the gateway reuses a cached AIAgent, the per-agent SessionDB flush cursor can leak across turns.
GatewayRunner._init_cached_agent_for_turn() resets some per-turn state, but it does not realign agent._last_flushed_db_idx to the history actually passed into the new turn.
As a result, a reused agent can compute a flush_from offset that is too large for the current turn and silently skip persisting the assistant reply into state.db.
This leaves a transcript with multiple consecutive user rows and missing assistant rows, which then causes later turns to replay stale questions and produce "repeated" or blended answers.
This looks related to, but distinct from:
Observed Behavior
In a live gateway session, the user reported that each new reply kept dragging prior questions into the current answer.
Inspecting the session transcript showed a pattern like:
- assistant
- user
- user
- user
- user
- user
- user
- assistant
So the platform visibly delivered multiple assistant replies over time, but the durable SQLite transcript only retained the final one. On subsequent turns, Hermes loaded this broken history, triggered consecutive-user repair, and effectively merged several old user turns into the next prompt.
Root Cause Hypothesis
The critical pieces are:
-
gateway/run.py reuses cached agents:
if cached and cached[1] == _sig:
agent = cached[0]
self._init_cached_agent_for_turn(agent, _interrupt_depth)
-
_init_cached_agent_for_turn() currently resets only:
_last_activity_ts
_last_activity_desc
_api_call_count
It does not reset or realign _last_flushed_db_idx.
-
run_agent.py::_flush_messages_to_session_db() later computes:
start_idx = len(conversation_history) if conversation_history else 0
flush_from = max(start_idx, self._last_flushed_db_idx)
for msg in messages[flush_from:]:
... append_message(...)
If the cached agent still carries _last_flushed_db_idx from the previous turn, the new turn can start flushing from a later index than the current conversation_history boundary. Then the assistant message for this turn is silently skipped.
Why This Causes Repeated Answers
On the gateway success path, transcript writes assume the agent already persisted the DB rows:
agent_persisted = self._session_db is not None
append_to_transcript(..., skip_db=agent_persisted)
So if the agent-side flush skips the assistant row, the gateway does not backfill it. The next inbound message then reloads a transcript containing several consecutive user rows with the assistant rows missing.
That broken replay state matches the repeated-answer symptom exactly:
- Hermes repairs/merges consecutive
user messages
- old unanswered-looking questions get folded into the next prompt
- the new reply appears to repeat or drag in previous topics
Minimal Regression Shape
A focused test should simulate:
- Create a cached
AIAgent for a gateway session
- Run one turn so
_last_flushed_db_idx becomes non-zero
- Reuse the same cached agent for a second turn with freshly loaded
history
- Do not reset
_last_flushed_db_idx
- Persist the second turn
- Assert that the second turn's assistant row is missing from SessionDB
Then apply the fix and assert the assistant row is present.
Suggested Fix
When reusing a cached agent, realign the flush cursor to the history actually being replayed for this turn.
Two plausible fixes:
-
In the gateway path, after agent_history is built for the current turn, set:
agent._last_flushed_db_idx = len(agent_history)
-
Or more defensively, inside persistence, clamp / recompute flush_from so stale cached-agent state cannot skip the current turn.
The first option seems the most direct because _last_flushed_db_idx is turn-local persistence state, and cached-agent reuse is precisely where the stale value crosses turn boundaries.
Expected Invariant
For every successful gateway turn:
If a visible assistant response is produced, the session transcript for that turn must contain the corresponding assistant row.
Environment
- Hermes gateway on macOS
- Profile-scoped gateway session
- Cached-agent reuse enabled in gateway
- SessionDB (
state.db) is the canonical transcript store
Summary
When the gateway reuses a cached
AIAgent, the per-agent SessionDB flush cursor can leak across turns.GatewayRunner._init_cached_agent_for_turn()resets some per-turn state, but it does not realignagent._last_flushed_db_idxto the history actually passed into the new turn.As a result, a reused agent can compute a
flush_fromoffset that is too large for the current turn and silently skip persisting the assistant reply intostate.db.This leaves a transcript with multiple consecutive
userrows and missingassistantrows, which then causes later turns to replay stale questions and produce "repeated" or blended answers.This looks related to, but distinct from:
messagesvsconversation_historyObserved Behavior
In a live gateway session, the user reported that each new reply kept dragging prior questions into the current answer.
Inspecting the session transcript showed a pattern like:
So the platform visibly delivered multiple assistant replies over time, but the durable SQLite transcript only retained the final one. On subsequent turns, Hermes loaded this broken history, triggered consecutive-user repair, and effectively merged several old user turns into the next prompt.
Root Cause Hypothesis
The critical pieces are:
gateway/run.pyreuses cached agents:_init_cached_agent_for_turn()currently resets only:_last_activity_ts_last_activity_desc_api_call_countIt does not reset or realign
_last_flushed_db_idx.run_agent.py::_flush_messages_to_session_db()later computes:If the cached agent still carries
_last_flushed_db_idxfrom the previous turn, the new turn can start flushing from a later index than the currentconversation_historyboundary. Then the assistant message for this turn is silently skipped.Why This Causes Repeated Answers
On the gateway success path, transcript writes assume the agent already persisted the DB rows:
So if the agent-side flush skips the assistant row, the gateway does not backfill it. The next inbound message then reloads a transcript containing several consecutive
userrows with the assistant rows missing.That broken replay state matches the repeated-answer symptom exactly:
usermessagesMinimal Regression Shape
A focused test should simulate:
AIAgentfor a gateway session_last_flushed_db_idxbecomes non-zerohistory_last_flushed_db_idxThen apply the fix and assert the assistant row is present.
Suggested Fix
When reusing a cached agent, realign the flush cursor to the history actually being replayed for this turn.
Two plausible fixes:
In the gateway path, after
agent_historyis built for the current turn, set:Or more defensively, inside persistence, clamp / recompute
flush_fromso stale cached-agent state cannot skip the current turn.The first option seems the most direct because
_last_flushed_db_idxis turn-local persistence state, and cached-agent reuse is precisely where the stale value crosses turn boundaries.Expected Invariant
For every successful gateway turn:
Environment
state.db) is the canonical transcript store