Fix/repair on api copy 24187#43905
Open
ccabrerah wants to merge 3 commits into
Open
Conversation
…ical transcript repair_message_sequence() ran on the canonical messages list before each API call and rewrote it in place. When repair shrank the list (merging consecutive user turns, dropping stray tool results), the positional persistence cursor in _flush_messages_to_session_db() — anchored at len(conversation_history) — pointed past the surviving entries, so the first K rows of the newly-appended region were silently never written to SessionDB (K = entries removed by repair). In gateway deployments (fresh AIAgent per inbound message, history reloaded from SessionDB each turn) this is self-reinforcing: every lost assistant reply leaves a new user/user violation in the stored history, which makes the next repair shrink the list further. One seed violation (e.g. a turn interrupted before its reply was appended) escalated into an hour-long replay loop where the agent re-answered old topics on every new message, until a manual session reset. Observed in production: 5 consecutive user rows in state.db while gateway logs show all 5 assistant replies were generated and delivered. Fix: move the repair call onto the per-call api_messages copy, joining sanitize_api_messages() and drop_thinking_only_and_merge_users() which already follow the API-call-time-only pattern. The canonical transcript keeps the raw event sequence — consecutive user turns are legal there; only the wire copy needs strict provider alternation. The flush cursor can no longer overshoot, so assistant/tool rows always persist, and sessions already corrupted on disk self-heal after one turn. Fixes NousResearch#24187 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eal run_conversation() The repair-on-canonical fix landed with function-level regression tests (test_message_sequence_repair.py), but nothing exercised the bug through the actual run_conversation() call site — flagged as a caveat at the time. This adds that end-to-end test. test_run_conversation_gateway_loop_persists_every_reply seeds one user/user violation in a real SessionDB, then runs five normally-paced gateway turns through the real run_conversation() (fresh AIAgent per inbound message, history reloaded each turn, provider mocked following the test_token_persistence_non_cli.py pattern). It asserts every assistant reply persists and the repair counter stays flat. It is bug-sensitive: run against the pre-fix wiring (parent commit a72bb03) it fails with zero assistant rows written across all five turns — each reply generated and returned but never persisted — and the repair counter climbing 1->5, reproducing the production fingerprint through the real loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bearing details The repair-on-canonical mechanism was explained in full in four places (the repair_message_sequence docstring, the conversation_loop call site, the test_message_sequence_repair header, and the test_24187_replay_spiral module docstring). Keep the docstring as the single authoritative source; reduce the other three to their local point plus a pointer. No behavior or test change — comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
030a6b3 to
7dd8419
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes a silent data-loss bug where assistant replies were never persisted to SessionDB, sending gateway deployments into an hour-long replay loop.
repair_message_sequence()ran on the canonicalmessageslist before each API call and rewrote it in place (messages[:] = merged). When repair shrinks the list — merging consecutive user turns, dropping stray tool results — the positional persistence cursor in_flush_messages_to_session_db()(anchored atlen(conversation_history)) points past the surviving entries, so the current turn's assistant/tool rows are silently never written.In gateway deployments (fresh
AIAgentper inbound message, history reloaded from SessionDB each turn) this is self-reinforcing: each lost reply leaves a fresh user/user violation, so the next repair shrinks the list further. Observed in production (Matrix gateway): one interrupted turn seeded a single violation, after which every normally-paced turn lost its reply — 5 consecutiveuserrows instate.dbwhile the gateway logs show all 5 replies were generated and delivered, theRepaired N message-alternation violationscounter climbing 1→10, and the agent re-answering hour-old topics until a manual session reset.Why this approach:
conversation_loop.pyalready treats every other provider-shape fix as an API-call-time-only projection on the per-callapi_messagescopy (sanitize_api_messages(),drop_thinking_only_and_merge_users(), whitespace/tool-call normalization).repair_message_sequencewas the lone pass mutating the canonical list. Moving the repair call ontoapi_messagesconforms it to that existing convention: the canonical transcript keeps the raw event sequence (consecutive user turns are legal there; only the wire copy needs strict alternation), the flush cursor can no longer overshoot, and sessions already corrupted on disk self-heal after one turn.This dissolves the case the other open approaches try to patch around persistence: the cursor clamp (#24326) still yields an empty flush slice; falling back to
_last_flushed_db_idx(#24196) re-appends the whole history in the fresh-agent flow; thereplace_messages()rewrite (#24211/#24419) is consistent but canonicalizes the lossy merge into stored history on every pass. Keeping the merge off the canonical list removes the need for any of them.Related Issue
Fixes #24187
Type of Change
Changes Made
agent/conversation_loop.py— move the_repair_message_sequence()call from the canonicalmessages(pre-copy) to the per-callapi_messagescopy, right afterdrop_thinking_only_and_merge_users().agent/agent_runtime_helpers.py— docstring/comments onrepair_message_sequence(): it mutates the list it is handed; callers must pass the API copy, never the canonical transcript (with the [Bug]: SessionDB silently skips current turn when message repair shortens conversation history #24187 rationale).tests/run_agent/test_24187_replay_spiral.py— new. End-to-end test driving the realrun_conversation()across a multi-turn gateway loop against a realSessionDB.tests/run_agent/test_message_sequence_repair.py— 3 function-level regression tests pinning both halves of the contract (canonical untouched; current turn persists with violated history; the pre-fix zero-rows failure as executable documentation).How to Test
SessionDB, then runs five normally-paced turns through the realrun_conversation()(mocked provider); every reply persists and the repair counter stays flat.conversation_loop.py, temporarily point the repair call back at the canonical list (agent._repair_message_sequence(messages)before theapi_messagescopy is built) and rerun step 1.Checklist
Code
fix(agent):,test(agent):,docs(agent):)pytest tests/ -qand all tests pass : multiple unrelated tests fail, settled on running "scripts/run_tests.sh tests/run_agent/ "Documentation & Housekeeping
docs/, docstrings) — docstrings onrepair_message_sequence()updated; no README/docs/impactcli-config.yaml.exampleif I added/changed config keys — N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/AScreenshots / Logs
We are not providing real logs with this PR, but added a test that contains steps to trigger the bug.