Fix/repair on api copy 24187 by ccabrerah · Pull Request #43905 · NousResearch/hermes-agent

ccabrerah · 2026-06-11T02:07:06Z

What does this PR do?

Fixes a silent data-loss bug where assistant replies were never persisted to SessionDB, sending gateway deployments into an hour-long replay loop.

repair_message_sequence() ran on the canonical messages list before each API call and rewrote it in place (messages[:] = merged). When repair shrinks the list — merging consecutive user turns, dropping stray tool results — the positional persistence cursor in _flush_messages_to_session_db() (anchored at len(conversation_history)) points past the surviving entries, so the current turn's assistant/tool rows are silently never written.

In gateway deployments (fresh AIAgent per inbound message, history reloaded from SessionDB each turn) this is self-reinforcing: each lost reply leaves a fresh user/user violation, so the next repair shrinks the list further. Observed in production (Matrix gateway): one interrupted turn seeded a single violation, after which every normally-paced turn lost its reply — 5 consecutive user rows in state.db while the gateway logs show all 5 replies were generated and delivered, the Repaired N message-alternation violations counter climbing 1→10, and the agent re-answering hour-old topics until a manual session reset.

Why this approach: conversation_loop.py already treats every other provider-shape fix as an API-call-time-only projection on the per-call api_messages copy (sanitize_api_messages(), drop_thinking_only_and_merge_users(), whitespace/tool-call normalization). repair_message_sequence was the lone pass mutating the canonical list. Moving the repair call onto api_messages conforms it to that existing convention: the canonical transcript keeps the raw event sequence (consecutive user turns are legal there; only the wire copy needs strict alternation), the flush cursor can no longer overshoot, and sessions already corrupted on disk self-heal after one turn.

This dissolves the case the other open approaches try to patch around persistence: the cursor clamp (#24326) still yields an empty flush slice; falling back to _last_flushed_db_idx (#24196) re-appends the whole history in the fresh-agent flow; the replace_messages() rewrite (#24211/#24419) is consistent but canonicalizes the lossy merge into stored history on every pass. Keeping the merge off the canonical list removes the need for any of them.

Related Issue

Fixes #24187

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

agent/conversation_loop.py — move the _repair_message_sequence() call from the canonical messages (pre-copy) to the per-call api_messages copy, right after drop_thinking_only_and_merge_users().
agent/agent_runtime_helpers.py — docstring/comments on repair_message_sequence(): it mutates the list it is handed; callers must pass the API copy, never the canonical transcript (with the [Bug]: SessionDB silently skips current turn when message repair shortens conversation history #24187 rationale).
tests/run_agent/test_24187_replay_spiral.py — new. End-to-end test driving the real run_conversation() across a multi-turn gateway loop against a real SessionDB.
tests/run_agent/test_message_sequence_repair.py — 3 function-level regression tests pinning both halves of the contract (canonical untouched; current turn persists with violated history; the pre-fix zero-rows failure as executable documentation).

How to Test

Run the reproduction + contract tests:
```
scripts/run_tests.sh tests/run_agent/test_24187_replay_spiral.py tests/run_agent/test_message_sequence_repair.py
```
→ 15 passed. The end-to-end test seeds one user/user violation in a real SessionDB, then runs five normally-paced turns through the real run_conversation() (mocked provider); every reply persists and the repair counter stays flat.
Confirm the test actually catches the bug: in conversation_loop.py, temporarily point the repair call back at the canonical list (agent._repair_message_sequence(messages) before the api_messages copy is built) and rerun step 1.
→ the end-to-end test fails with zero assistant rows persisted across all five turns and the repair counter climbing 1→5 — the production fingerprint, reproduced through the real loop. Revert the change.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(agent):, test(agent):, docs(agent):)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix (4 files; verified against the merge-base)
I've run pytest tests/ -q and all tests pass : multiple unrelated tests fail, settled on running "scripts/run_tests.sh tests/run_agent/ "
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS Tahoe 26.5.1

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — docstrings on repair_message_sequence() updated; no README/docs/ impact
I've updated cli-config.yaml.example if I added/changed config keys — N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) — N/A (pure in-memory list/persistence logic; no platform-specific paths, processes, or filesystem modes)
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

We are not providing real logs with this PR, but added a test that contains steps to trigger the bug.

…ical transcript repair_message_sequence() ran on the canonical messages list before each API call and rewrote it in place. When repair shrank the list (merging consecutive user turns, dropping stray tool results), the positional persistence cursor in _flush_messages_to_session_db() — anchored at len(conversation_history) — pointed past the surviving entries, so the first K rows of the newly-appended region were silently never written to SessionDB (K = entries removed by repair). In gateway deployments (fresh AIAgent per inbound message, history reloaded from SessionDB each turn) this is self-reinforcing: every lost assistant reply leaves a new user/user violation in the stored history, which makes the next repair shrink the list further. One seed violation (e.g. a turn interrupted before its reply was appended) escalated into an hour-long replay loop where the agent re-answered old topics on every new message, until a manual session reset. Observed in production: 5 consecutive user rows in state.db while gateway logs show all 5 assistant replies were generated and delivered. Fix: move the repair call onto the per-call api_messages copy, joining sanitize_api_messages() and drop_thinking_only_and_merge_users() which already follow the API-call-time-only pattern. The canonical transcript keeps the raw event sequence — consecutive user turns are legal there; only the wire copy needs strict provider alternation. The flush cursor can no longer overshoot, so assistant/tool rows always persist, and sessions already corrupted on disk self-heal after one turn. Fixes NousResearch#24187 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…eal run_conversation() The repair-on-canonical fix landed with function-level regression tests (test_message_sequence_repair.py), but nothing exercised the bug through the actual run_conversation() call site — flagged as a caveat at the time. This adds that end-to-end test. test_run_conversation_gateway_loop_persists_every_reply seeds one user/user violation in a real SessionDB, then runs five normally-paced gateway turns through the real run_conversation() (fresh AIAgent per inbound message, history reloaded each turn, provider mocked following the test_token_persistence_non_cli.py pattern). It asserts every assistant reply persists and the repair counter stays flat. It is bug-sensitive: run against the pre-fix wiring (parent commit a72bb03) it fails with zero assistant rows written across all five turns — each reply generated and returned but never persisted — and the repair counter climbing 1->5, reproducing the production fingerprint through the real loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…bearing details The repair-on-canonical mechanism was explained in full in four places (the repair_message_sequence docstring, the conversation_loop call site, the test_message_sequence_repair header, and the test_24187_replay_spiral module docstring). Keep the docstring as the single authoritative source; reduce the other three to their local point plus a pointer. No behavior or test change — comments only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Carlos Cabrera H. and others added 3 commits June 10, 2026 21:14

ccabrerah mentioned this pull request Jun 11, 2026

[Bug]: SessionDB silently skips current turn when message repair shortens conversation history #24187

Open

ccabrerah force-pushed the fix/repair-on-api-copy-24187 branch from 030a6b3 to 7dd8419 Compare June 11, 2026 02:20

alt-glitch added type/bug Something isn't working comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround labels Jun 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/repair on api copy 24187#43905

Fix/repair on api copy 24187#43905
ccabrerah wants to merge 3 commits into
NousResearch:mainfrom
ccabrerah:fix/repair-on-api-copy-24187

ccabrerah commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ccabrerah commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Screenshots / Logs

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ccabrerah commented Jun 11, 2026 •

edited

Loading