fix(gateway,agent): persist delivered responses that recovery paths drop from the transcript#44120
Conversation
…rop from the transcript Gateway delivered assistant responses to the platform but never persisted them to the session DB, so the model saw consecutive "unanswered" user messages and re-answered all of them on the next turn (NousResearch#44100). Two layers, one invariant — a delivered final_response must end up in the session transcript: 1. agent: the partial-stream recovery path (final message empty/thinking- only but content already streamed to the user) set final_response and broke out of the loop WITHOUT appending an assistant message. The turn-end _persist_session then wrote no assistant row — only the user message (persisted by the turn-start crash-resilience flush) survived. Append the recovered text as a real assistant turn before breaking. 2. gateway: state.db is the canonical transcript store (spec 002), so append_to_transcript(..., skip_db=True) is a complete no-op — the gateway's "fallback" writes could never backfill anything. When a turn's new messages contain no assistant text but a response was delivered, write the assistant row with skip_db=False. A response generated this turn cannot already be in the loaded history, so the NousResearch#860/NousResearch#42039 duplicate-write protection (which concerns the user entry and agent-flushed messages) is preserved — covered by regression tests. Fixes NousResearch#44100 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Verification: thorough fix for transcript persistence gap with comprehensive regression tests Reviewed both the Agent-side fix ( Gateway-side fix (
The safety-net check Test coverage is strong: 4 gateway tests (backfill/no-backfill/tool-turn/fallback) + 2 agent tests (recovery persistence/normal control). The |
|
Requesting maintainer review — this is ready to land from my side. Standalone fork CI is pending first-run approval here; the rollup branch in #44061 carrying this session's batch is fully green on upstream CI (all test shards, typecheck, e2e). |
Problem
Gateway delivers assistant responses to the platform (confirmed in gateway.log), but the session DB ends up with no assistant rows between the user messages. When the next message arrives, the model loads a transcript full of "unanswered" user messages and re-answers all of them in one turn.
Root cause
Two pieces interact:
Agent — partial-stream recovery drops the assistant turn. In
agent/conversation_loop.py, when the final assembled assistant message has no visible content but text was already streamed to the user, the recovery path setsfinal_responsefrom the streamed text andbreaks without appending an assistant message tomessages. The turn-end_persist_session()then flushes a transcript whose tail is the user message — the user row survives (written by the turn-start crash-resilience flush), the assistant row never exists. This matches the issue's evidence exactly:response ready (…, 48 chars)logs a non-emptyfinal_responsewhile the DB has zero assistant rows.Gateway — every fallback write is a silent no-op. Since state.db became the canonical transcript store (spec 002),
append_to_transcript(..., skip_db=True)does nothing at all. The gateway skips all post-turn DB writes viaskip_db=agent_persistedto avoid the bug: SQLite session transcript accumulates duplicate messages (3-4x token inflation) #860/Bug: User messages stored twice in state.db when agent and gateway both write to SQLite #42039 duplicate-write bug — correct for messages the agent flushed, but it means the gateway cannot backfill anything the agent's flush missed. The delivered response is silently dropped with no error anywhere.Fix
One invariant, enforced at both layers: a delivered
final_responsemust end up in the session transcript.agent/conversation_loop.py: the partial-stream recovery path appends the recovered text as a real assistant turn before breaking, so_persist_session()writes it and role alternation is preserved.gateway/run.py: when the turn's new messages contain no assistant text but a response was delivered, the gateway backfills the assistant row withskip_db=False. A response generated this turn cannot already be in the loaded history, so this cannot double-write; the bug: SQLite session transcript accumulates duplicate messages (3-4x token inflation) #860/Bug: User messages stored twice in state.db when agent and gateway both write to SQLite #42039 protections for user entries and agent-flushed messages are untouched (pinned by regression tests). Same reasoning for the existingnot new_messagesfallback branch, whose assistant write was also a no-op.The gateway backfill also covers the
fallback_prior_turn_contentrecovery (response sourced from an earlier tool-call turn's content, transcript tail ends at atoolmessage) and any future agent path that returns a response without representing it inmessages— with an INFO log so occurrences are visible instead of silent.Tests
tests/run_agent/test_44100_partial_recovery_persistence.py— partial-stream recovery appends the recovered assistant turn; normal turns unchanged (exactly one assistant message).tests/gateway/test_44100_assistant_backfill.py— backfill fires when the turn has no assistant text (plain and tool-call turns), does NOT fire when the agent persisted the message itself (bug: SQLite session transcript accumulates duplicate messages (3-4x token inflation) #860/Bug: User messages stored twice in state.db when agent and gateway both write to SQLite #42039 protection intact), and thenot new_messagesfallback writes withskip_db=False.tests/gateway/+ the neighboringrun_agentpersistence/streaming suites pass; the handful of failures present are identical with and without this diff (pre-existing, environment-dependent: shutdown forensics/systemd, Telegram MarkdownV2 escaping).Fixes #44100