Skip to content

fix(gateway,agent): persist delivered responses that recovery paths drop from the transcript#44120

Open
AIalliAI wants to merge 1 commit into
NousResearch:mainfrom
AIalliAI:fix/44100-persist-delivered-response
Open

fix(gateway,agent): persist delivered responses that recovery paths drop from the transcript#44120
AIalliAI wants to merge 1 commit into
NousResearch:mainfrom
AIalliAI:fix/44100-persist-delivered-response

Conversation

@AIalliAI

@AIalliAI AIalliAI commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Problem

Gateway delivers assistant responses to the platform (confirmed in gateway.log), but the session DB ends up with no assistant rows between the user messages. When the next message arrives, the model loads a transcript full of "unanswered" user messages and re-answers all of them in one turn.

Root cause

Two pieces interact:

  1. Agent — partial-stream recovery drops the assistant turn. In agent/conversation_loop.py, when the final assembled assistant message has no visible content but text was already streamed to the user, the recovery path sets final_response from the streamed text and breaks without appending an assistant message to messages. The turn-end _persist_session() then flushes a transcript whose tail is the user message — the user row survives (written by the turn-start crash-resilience flush), the assistant row never exists. This matches the issue's evidence exactly: response ready (…, 48 chars) logs a non-empty final_response while the DB has zero assistant rows.

  2. Gateway — every fallback write is a silent no-op. Since state.db became the canonical transcript store (spec 002), append_to_transcript(..., skip_db=True) does nothing at all. The gateway skips all post-turn DB writes via skip_db=agent_persisted to avoid the bug: SQLite session transcript accumulates duplicate messages (3-4x token inflation) #860/Bug: User messages stored twice in state.db when agent and gateway both write to SQLite #42039 duplicate-write bug — correct for messages the agent flushed, but it means the gateway cannot backfill anything the agent's flush missed. The delivered response is silently dropped with no error anywhere.

Fix

One invariant, enforced at both layers: a delivered final_response must end up in the session transcript.

The gateway backfill also covers the fallback_prior_turn_content recovery (response sourced from an earlier tool-call turn's content, transcript tail ends at a tool message) and any future agent path that returns a response without representing it in messages — with an INFO log so occurrences are visible instead of silent.

Tests

tests/gateway/ + the neighboring run_agent persistence/streaming suites pass; the handful of failures present are identical with and without this diff (pre-existing, environment-dependent: shutdown forensics/systemd, Telegram MarkdownV2 escaping).

Fixes #44100

…rop from the transcript

Gateway delivered assistant responses to the platform but never persisted
them to the session DB, so the model saw consecutive "unanswered" user
messages and re-answered all of them on the next turn (NousResearch#44100).

Two layers, one invariant — a delivered final_response must end up in the
session transcript:

1. agent: the partial-stream recovery path (final message empty/thinking-
   only but content already streamed to the user) set final_response and
   broke out of the loop WITHOUT appending an assistant message. The
   turn-end _persist_session then wrote no assistant row — only the user
   message (persisted by the turn-start crash-resilience flush) survived.
   Append the recovered text as a real assistant turn before breaking.

2. gateway: state.db is the canonical transcript store (spec 002), so
   append_to_transcript(..., skip_db=True) is a complete no-op — the
   gateway's "fallback" writes could never backfill anything. When a
   turn's new messages contain no assistant text but a response was
   delivered, write the assistant row with skip_db=False. A response
   generated this turn cannot already be in the loaded history, so the
   NousResearch#860/NousResearch#42039 duplicate-write protection (which concerns the user entry
   and agent-flushed messages) is preserved — covered by regression
   tests.

Fixes NousResearch#44100

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery labels Jun 11, 2026
@liuhao1024

Copy link
Copy Markdown
Contributor

Verification: thorough fix for transcript persistence gap with comprehensive regression tests

Reviewed both the conversation_loop.py and gateway/run.py changes — the fix correctly addresses #44100 from two complementary angles:

Agent-side fix (conversation_loop.py): When partial-stream recovery fires (empty final message but content already streamed), the recovered text is now appended as a real assistant message before break. Without this, _persist_session writes no assistant row and the transcript loses role alternation.

Gateway-side fix (gateway/run.py): Three changes:

  1. skip_db=agent_persistedskip_db=False for the assistant response write in the new_messages path — a response generated this turn cannot be in loaded history, so there's no duplicate risk
  2. A safety-net backfill when the turn's new_messages contain no assistant text — catches any recovery path the agent-side fix might miss
  3. The not new_messages fallback also uses skip_db=False

The safety-net check _turn_has_assistant_text correctly filters for role == "assistant" with non-empty content and no tool_calls, so it won't false-positive on tool-call assistant messages.

Test coverage is strong: 4 gateway tests (backfill/no-backfill/tool-turn/fallback) + 2 agent tests (recovery persistence/normal control). The _bootstrap helper correctly mocks the session store to verify skip_db values.

@AIalliAI

Copy link
Copy Markdown
Contributor Author

Requesting maintainer review — this is ready to land from my side. Standalone fork CI is pending first-run approval here; the rollup branch in #44061 carrying this session's batch is fully green on upstream CI (all test shards, typecheck, e2e).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram: assistant responses not persisted to session DB (model re-answers old messages)

3 participants