Skip to content

fix(session): persist assistant after repair compacts message list#44838

Closed
JerryLiu369 wants to merge 2 commits into
NousResearch:mainfrom
JerryLiu369:fix/session-db-flush-after-repair
Closed

fix(session): persist assistant after repair compacts message list#44838
JerryLiu369 wants to merge 2 commits into
NousResearch:mainfrom
JerryLiu369:fix/session-db-flush-after-repair

Conversation

@JerryLiu369

@JerryLiu369 JerryLiu369 commented Jun 12, 2026

Copy link
Copy Markdown

Summary

Fixes #44837 — gateway sessions where the model sees multiple user messages merged with \n\n while platform inbound is single-message.

Rebased onto latest main. Only the delta beyond merged #44327 / #44518:

  • Clamp _last_flushed_db_idx after repair_message_sequence compacts the in-memory list.
  • Guard _flush_messages_to_session_db when the cursor overshoots len(messages).

#44518 fixes stale cursor across cached-agent turns; this PR addresses the same-turn early-persist → repair → turn-end flush gap.

Root cause

  1. turn_context early-persists inbound user → flush cursor advances.
  2. First API call merges consecutive users via repair_message_sequence → list shrinks.
  3. Turn-end flush can skip the first assistant row → orphan user in state.db.
  4. Next turn repair merges again → \n\n blob grows.

Related open PRs (same area, not duplicates)

This PR is intentionally minimal; happy to adjust if maintainers prefer a different merge order or layering.

Test plan

  • tests/run_agent/test_early_persist_repair_flush.py
  • tests/run_agent/test_860_dedup.py — no dedup regression
pytest tests/run_agent/test_early_persist_repair_flush.py \
       tests/run_agent/test_860_dedup.py -q -o addopts=

Copilot AI review requested due to automatic review settings June 12, 2026 10:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds regression coverage and cursor-handling fixes to ensure SessionDB flushing remains correct when message lists are compacted (repair/compression) after an early persist, especially for cached-agent turns.

Changes:

  • Clamp/reset _last_flushed_db_idx to prevent missing assistant/tool flushes after message-sequence compaction.
  • Reset stale flush cursor at the start of a fresh cached-agent turn (interrupt depth 0).
  • Add focused regression tests for both the repair-compaction scenario and cached-agent turn initialization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/run_agent/test_early_persist_repair_flush.py New regression tests covering early persist + repair compaction and stale flush cursor clamping.
tests/gateway/test_agent_cache.py New tests validating flush cursor reset/preservation based on interrupt_depth.
run_agent.py Adjusts SessionDB flush logic when _last_flushed_db_idx is beyond current messages length.
gateway/run.py Resets _last_flushed_db_idx on fresh cached-agent turns to avoid skipping persistence.
agent/agent_runtime_helpers.py Clamps _last_flushed_db_idx when repair_message_sequence compacts messages.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread run_agent.py Outdated
Comment on lines +1247 to +1251
flush_from = max(start_idx, self._last_flushed_db_idx)
# repair_message_sequence / compression may shrink `messages`
# after early persist advanced the cursor past the new length.
if self._last_flushed_db_idx > len(messages):
flush_from = start_idx

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — agreed. Replaced the `flush_from = start_idx` fallback with clamping `_last_flushed_db_idx` to `len(messages)` before computing `flush_from`, so we avoid re-appending when `start_idx == 0`. Compression tests still pass.

Comment thread agent/agent_runtime_helpers.py Outdated
Comment on lines +442 to +445
if hasattr(agent, "_last_flushed_db_idx"):
agent._last_flushed_db_idx = min(
agent._last_flushed_db_idx, len(messages)
)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense — clamp now runs only when the cursor is an `int` (74457d1 + follow-up commit).

Early turn-start user persist advances _last_flushed_db_idx before
repair_message_sequence merges consecutive users and compacts the list.
Turn-end flush then skipped the first assistant row, leaving orphan users
that grew via \n\n merge on the next gateway turn.

Clamp the flush cursor after repair and when idx overshoots list length;
also reset _last_flushed_db_idx on cached-agent fresh turns (NousResearch#44327).

Co-authored-by: Cursor <cursoragent@cursor.com>
@JerryLiu369 JerryLiu369 force-pushed the fix/session-db-flush-after-repair branch from 164104e to 74457d1 Compare June 12, 2026 10:38
@alt-glitch alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround labels Jun 12, 2026
@JerryLiu369

Copy link
Copy Markdown
Author

Rebased onto latest main — diff is now 3 files (~14 LOC production + regression test). Removed the duplicate #44327 hunk since that already landed in #44518.

Scope: same-turn gap after early user persist → repair_message_sequence shrinks the list → turn-end flush can skip the first assistant row. The test reproduces that shape directly.

Not claiming to supersede #24211 or #24196 — those go further on transcript rewrite / flush overshoot in related cases. This is meant as a minimal, test-backed patch for the assistant-drop path seen in production logs. Totally fine to combine, defer, or redirect if a broader approach is preferred.

Address Copilot review: resetting flush_from to start_idx when
conversation_history is None could duplicate DB rows. Clamp
_last_flushed_db_idx to len(messages) instead; guard isinstance
in repair_message_sequence clamp.

Co-authored-by: Cursor <cursoragent@cursor.com>
@liuhao1024

Copy link
Copy Markdown
Contributor

Code Review: Clean ✅

Reviewed the full diff across agent/agent_runtime_helpers.py, run_agent.py, and the new test file.

What was checked:

  • Dual clamp location: The clamp in repair_message_sequence (early, during in-place mutation) and the defensive clamp in _flush_messages_to_session_db (at flush time) are complementary, not redundant. The first handles the case where repair happens before a flush; the second handles any other code path that advances _last_flushed_db_idx past the new list length.
  • Guard semantics: isinstance(_last, int) guards against None (attribute unset), and min(_last, len(messages)) prevents index overshoot without resetting the cursor to 0 (which would re-append already-persisted rows).
  • Test coverage: Both the regression scenario (early persist → repair compaction → tool turn) and the direct cursor-clamp case are covered. The test creates a real SessionDB and verifies the actual persisted rows, not just the cursor value.

Architectural note: The _last_flushed_db_idx cursor pattern (advance-on-persist, read-at-flush) is a classic "stale pointer after compaction" bug. The fix is minimal and correctly scoped — no behavioral change to the happy path, only bounds correction when the list shrinks.

@JerryLiu369

Copy link
Copy Markdown
Author

Closing — #45260 (salvage of #44870) has already been merged with a broader fix for the same root cause. Thanks for the review @liuhao1024.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Session DB turn-end flush drops assistant after repair_message_sequence compacts list (orphan user → \n\n merge)

4 participants