Skip to content

fix(agent): prevent silent data loss when message repair shortens history#24196

Open
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/session-db-repair-flush-skip
Open

fix(agent): prevent silent data loss when message repair shortens history#24196
liuhao1024 wants to merge 1 commit into
NousResearch:mainfrom
liuhao1024:fix/session-db-repair-flush-skip

Conversation

@liuhao1024

@liuhao1024 liuhao1024 commented May 12, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

_flush_messages_to_session_db() silently skips persisting the current turn when _repair_message_sequence() or _drop_trailing_empty_response_scaffolding() shortens messages below the original conversation_history length.

Root Cause

_flush_messages_to_session_db() computes:

start_idx = len(conversation_history)  # e.g. 120
flush_from = max(start_idx, self._last_flushed_db_idx)
for msg in messages[flush_from:]:

When in-place repair (dropping orphan tool messages, merging consecutive user messages) shortens messages from 122 to 116, flush_from = 120 > 116 = len(messages), so messages[120:] returns []. The current user turn and assistant response are silently not persisted.

Gateway integrations that create a fresh AIAgent per inbound message rely on SessionDB for continuity. Once this happens, the session loads stale history and follow-up messages resolve against old context.

Related Issue

N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • See commit messages for detailed changes

How to Test

  1. Run pytest tests/ -q — all tests should pass
  2. Verify the specific scenario described above is resolved

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.4.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture and workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A

@M3NT8L-One

M3NT8L-One commented May 12, 2026

Copy link
Copy Markdown

CI failure is from the repository-wide Windows footgun checker, not ruff/ty and not this PR's session-db repair logic:

tools/process_registry.py:588: [bare os.killpg]
tools/process_registry.py:588: [bare signal.SIGKILL]

This is a known upstream-main false positive around an intentionally POSIX-only branch in tools/process_registry.py.

An existing upstream PR already addresses the shared blocker: #23816 (fix: suppress guarded process registry Windows footgun). That PR keeps the process-group kill behind not _IS_WINDOWS, uses getattr(signal, "SIGKILL", signal.SIGTERM), and adds the checker suppression.

So this branch likely should not carry a duplicate patch unless maintainers want each open PR to unblock itself before #23816 lands.

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 12, 2026
@liuhao1024 liuhao1024 force-pushed the fix/session-db-repair-flush-skip branch 3 times, most recently from 71ecff2 to 4135350 Compare May 12, 2026 13:54
@liuhao1024

Copy link
Copy Markdown
Contributor Author

🔄 Rebased onto upstream/main

This branch has been rebased onto the latest (dd0923b) to remove unrelated mixed-in changes from the previous stale fork base.

The diff now contains only the intended fix. Please re-review.

…tory

When _repair_message_sequence() or _drop_trailing_empty_response_scaffolding()
mutates messages in-place, the list may become shorter than the original
conversation_history.  _flush_messages_to_session_db() computed
flush_from = max(len(conversation_history), _last_flushed_db_idx) which could
exceed len(messages), causing messages[flush_from:] to return an empty list
and silently skip persisting the current turn.

Fix: detect when start_idx > len(messages), cap _last_flushed_db_idx to the
current list length, and fall back to _last_flushed_db_idx as the start
boundary.  This ensures the current turn is always persisted while preserving
the dedup logic for subsequent calls.

Fixes SessionDB silently skips current turn when message repair shortens conversation history NousResearch#24187
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants