Skip to content

fix: preserve messages after compaction split, keep busy follow-ups as separate turns#43067

Open
KiruyaMomochi wants to merge 1 commit into
NousResearch:mainfrom
KiruyaMomochi:fix/post-compaction-message-loss
Open

fix: preserve messages after compaction split, keep busy follow-ups as separate turns#43067
KiruyaMomochi wants to merge 1 commit into
NousResearch:mainfrom
KiruyaMomochi:fix/post-compaction-message-loss

Conversation

@KiruyaMomochi

@KiruyaMomochi KiruyaMomochi commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Fixes #43066 — assistant messages lost after context compaction, and user follow-ups merged into single turns.

Root Cause

Three related persistence failures:

  1. Stale offset after compression split — compress_context() rotates to a child session but _flush_messages_to_session_db still computes start_idx from the pre-compression history length, making messages[start_idx:] empty.

  2. Stale flush cursor on cached gateway agents — A long-lived gateway AIAgent retains _last_flushed_db_idx from a longer transcript. After compaction shortens the DB, the cursor exceeds actual row count and subsequent flushes skip all new messages.

  3. No gateway-side verification — When agent_persisted=True, the gateway skips its own DB writes. If the agent flush silently fails, assistant messages are delivered but never persisted.

Fix

  1. compress_context() sets a one-shot flag; flush consumes it to write from index 0.
  2. Flush cursor clamped when exceeding DB row count (duplicate-write protection preserved).
  3. Gateway verifies DB tail after each turn; mirrors current turn if absent.
  4. Busy-mode text follow-ups route through FIFO queue instead of newline-merge.

Tests

  • test_compression_persistence.py — child session persistence
  • test_session_db_flush_cursor.py — stale cursor clamp
  • test_42039_duplicate_user_message.py — gateway fallback
  • test_busy_session_ack.py — FIFO text follow-ups

All 183 targeted tests pass.

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification comment — reviewed the diff (3 files, +113/-74 lines).

The compression-split message-loss fix is well-designed:

  1. _ignore_conversation_history_on_next_flush flag — correctly placed at the split point in _compress_context() and consumed exactly once in _flush_messages_to_session_db(). The flag resets itself after use, preventing stale-offset issues on subsequent flushes.

  2. _queue_or_replace_pending_event refactor — replacing merge_pending_message_event with per-event FIFO queuing preserves individual message boundaries through compression. The test test_interrupt_mode_preserves_rapid_text_followups_as_fifo validates this with two sequential messages.

  3. Test rename (test_flush_with_stale_history_loses_messagestest_flush_with_stale_history_after_compression_preserves_messages) — correctly reflects that the bug condition is now fixed, not just documented.

One observation: the _ignore_conversation_history_on_next_flush attribute is set via direct assignment (agent._ignore_conversation_history_on_next_flush = True) without initialization in __init__. The getattr(self, ..., False) default in the consumer handles this safely, but adding an explicit False init in __init__ would make the attribute's existence discoverable. Not a blocker — the getattr default is sufficient.

Overall: clean fix for a real data-loss bug. LGTM. ✅

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery labels Jun 9, 2026
…lush cursor

Three-layer fix for post-compaction message persistence loss:

1. After compress_context() rotates to a child session, mark the next
   flush to ignore the stale conversation_history offset. The flush
   method consumes this flag once, writing the full compressed child
   transcript from index 0.

2. Clamp the cached SessionDB flush cursor when it exceeds actual DB
   row count (stale gateway AIAgent instances retaining a longer
   pre-compaction index). Compares against canonical DB rows before
   clamping to preserve duplicate-write protection.

3. Gateway-side fallback: after each agent turn, verify the assistant
   response was actually persisted to SessionDB. If the DB tail does
   not contain the current turn, mirror it directly (skip_db=False),
   preventing silent assistant message loss.

Also routes busy-mode text follow-ups through FIFO queue instead of
newline-merging, preserving separate user turns.

Closes NousResearch#43066
@liuhao1024

Copy link
Copy Markdown
Contributor

Verification review — compaction persistence, no issues found.

The _ignore_conversation_history_on_next_flush flag correctly prevents the stale pre-compression history offset from skipping the child session's summary+tail after a compression split. The cursor-clamping logic in _flush_messages_to_session_db compares persisted DB rows against the cached cursor span before clamping, so repeated finalizer calls with the same message list still deduplicate correctly. The _agent_db_has_current_turn tail-comparison in gateway/run.py catches cases where the agent's SessionDB write was incomplete (e.g. codex early-return), falling back to gateway-side SQLite mirroring. Good test coverage for both the stale-history and stale-cursor scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Context compaction loses assistant messages and merges user follow-ups

3 participants