Skip to content

Fix/repair on api copy 24187#43905

Open
ccabrerah wants to merge 3 commits into
NousResearch:mainfrom
ccabrerah:fix/repair-on-api-copy-24187
Open

Fix/repair on api copy 24187#43905
ccabrerah wants to merge 3 commits into
NousResearch:mainfrom
ccabrerah:fix/repair-on-api-copy-24187

Conversation

@ccabrerah

@ccabrerah ccabrerah commented Jun 11, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes a silent data-loss bug where assistant replies were never persisted to SessionDB, sending gateway deployments into an hour-long replay loop.

repair_message_sequence() ran on the canonical messages list before each API call and rewrote it in place (messages[:] = merged). When repair shrinks the list — merging consecutive user turns, dropping stray tool results — the positional persistence cursor in _flush_messages_to_session_db() (anchored at len(conversation_history)) points past the surviving entries, so the current turn's assistant/tool rows are silently never written.

In gateway deployments (fresh AIAgent per inbound message, history reloaded from SessionDB each turn) this is self-reinforcing: each lost reply leaves a fresh user/user violation, so the next repair shrinks the list further. Observed in production (Matrix gateway): one interrupted turn seeded a single violation, after which every normally-paced turn lost its reply — 5 consecutive user rows in state.db while the gateway logs show all 5 replies were generated and delivered, the Repaired N message-alternation violations counter climbing 1→10, and the agent re-answering hour-old topics until a manual session reset.

Why this approach: conversation_loop.py already treats every other provider-shape fix as an API-call-time-only projection on the per-call api_messages copy (sanitize_api_messages(), drop_thinking_only_and_merge_users(), whitespace/tool-call normalization). repair_message_sequence was the lone pass mutating the canonical list. Moving the repair call onto api_messages conforms it to that existing convention: the canonical transcript keeps the raw event sequence (consecutive user turns are legal there; only the wire copy needs strict alternation), the flush cursor can no longer overshoot, and sessions already corrupted on disk self-heal after one turn.

This dissolves the case the other open approaches try to patch around persistence: the cursor clamp (#24326) still yields an empty flush slice; falling back to _last_flushed_db_idx (#24196) re-appends the whole history in the fresh-agent flow; the replace_messages() rewrite (#24211/#24419) is consistent but canonicalizes the lossy merge into stored history on every pass. Keeping the merge off the canonical list removes the need for any of them.

Related Issue

Fixes #24187

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/conversation_loop.py — move the _repair_message_sequence() call from the canonical messages (pre-copy) to the per-call api_messages copy, right after drop_thinking_only_and_merge_users().
  • agent/agent_runtime_helpers.py — docstring/comments on repair_message_sequence(): it mutates the list it is handed; callers must pass the API copy, never the canonical transcript (with the [Bug]: SessionDB silently skips current turn when message repair shortens conversation history #24187 rationale).
  • tests/run_agent/test_24187_replay_spiral.pynew. End-to-end test driving the real run_conversation() across a multi-turn gateway loop against a real SessionDB.
  • tests/run_agent/test_message_sequence_repair.py — 3 function-level regression tests pinning both halves of the contract (canonical untouched; current turn persists with violated history; the pre-fix zero-rows failure as executable documentation).

How to Test

  1. Run the reproduction + contract tests:
    scripts/run_tests.sh tests/run_agent/test_24187_replay_spiral.py tests/run_agent/test_message_sequence_repair.py
    
    15 passed. The end-to-end test seeds one user/user violation in a real SessionDB, then runs five normally-paced turns through the real run_conversation() (mocked provider); every reply persists and the repair counter stays flat.
  2. Confirm the test actually catches the bug: in conversation_loop.py, temporarily point the repair call back at the canonical list (agent._repair_message_sequence(messages) before the api_messages copy is built) and rerun step 1.
  3. → the end-to-end test fails with zero assistant rows persisted across all five turns and the repair counter climbing 1→5 — the production fingerprint, reproduced through the real loop. Revert the change.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(agent):, test(agent):, docs(agent):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (4 files; verified against the merge-base)
  • I've run pytest tests/ -q and all tests pass : multiple unrelated tests fail, settled on running "scripts/run_tests.sh tests/run_agent/ "
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS Tahoe 26.5.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — docstrings on repair_message_sequence() updated; no README/docs/ impact
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) — N/A (pure in-memory list/persistence logic; no platform-specific paths, processes, or filesystem modes)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

We are not providing real logs with this PR, but added a test that contains steps to trigger the bug.

Carlos Cabrera H. and others added 3 commits June 10, 2026 21:14
…ical transcript

repair_message_sequence() ran on the canonical messages list before each
API call and rewrote it in place. When repair shrank the list (merging
consecutive user turns, dropping stray tool results), the positional
persistence cursor in _flush_messages_to_session_db() — anchored at
len(conversation_history) — pointed past the surviving entries, so the
first K rows of the newly-appended region were silently never written
to SessionDB (K = entries removed by repair).

In gateway deployments (fresh AIAgent per inbound message, history
reloaded from SessionDB each turn) this is self-reinforcing: every lost
assistant reply leaves a new user/user violation in the stored history,
which makes the next repair shrink the list further. One seed violation
(e.g. a turn interrupted before its reply was appended) escalated into
an hour-long replay loop where the agent re-answered old topics on
every new message, until a manual session reset. Observed in
production: 5 consecutive user rows in state.db while gateway logs
show all 5 assistant replies were generated and delivered.

Fix: move the repair call onto the per-call api_messages copy, joining
sanitize_api_messages() and drop_thinking_only_and_merge_users() which
already follow the API-call-time-only pattern. The canonical transcript
keeps the raw event sequence — consecutive user turns are legal there;
only the wire copy needs strict provider alternation. The flush cursor
can no longer overshoot, so assistant/tool rows always persist, and
sessions already corrupted on disk self-heal after one turn.

Fixes NousResearch#24187

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eal run_conversation()

The repair-on-canonical fix landed with function-level regression tests
(test_message_sequence_repair.py), but nothing exercised the bug through
the actual run_conversation() call site — flagged as a caveat at the time.

This adds that end-to-end test. test_run_conversation_gateway_loop_persists_every_reply
seeds one user/user violation in a real SessionDB, then runs five
normally-paced gateway turns through the real run_conversation() (fresh
AIAgent per inbound message, history reloaded each turn, provider mocked
following the test_token_persistence_non_cli.py pattern). It asserts every
assistant reply persists and the repair counter stays flat.

It is bug-sensitive: run against the pre-fix wiring (parent commit a72bb03)
it fails with zero assistant rows written across all five turns — each
reply generated and returned but never persisted — and the repair counter
climbing 1->5, reproducing the production fingerprint through the real loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bearing details

The repair-on-canonical mechanism was explained in full in four places
(the repair_message_sequence docstring, the conversation_loop call site,
the test_message_sequence_repair header, and the test_24187_replay_spiral
module docstring). Keep the docstring as the single authoritative source;
reduce the other three to their local point plus a pointer. No behavior or
test change — comments only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ccabrerah ccabrerah force-pushed the fix/repair-on-api-copy-24187 branch from 030a6b3 to 7dd8419 Compare June 11, 2026 02:20
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround labels Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: SessionDB silently skips current turn when message repair shortens conversation history

2 participants