Skip to content

[Bug]: SessionDB silently skips current turn when message repair shortens conversation history #24187

@samyzhh

Description

@samyzhh

Summary

AIAgent._repair_message_sequence(messages) can shorten the in-memory messages list before persistence, but _flush_messages_to_session_db(messages, conversation_history) still uses len(conversation_history) as the skip offset.

If repair removes or merges enough messages from the historical portion, flush_from can become greater than len(messages). Python slicing then returns an empty list, so the current user turn and assistant response are silently not persisted to SessionDB.

Impact

Gateway-style integrations that create a fresh AIAgent per inbound message rely on SessionDB for continuity. Once this happens, the same session keeps loading stale history, causing follow-up messages like "yes", "check again", or "continue" to resolve against old context.

Observed symptom:

  • user asks about weather
  • assistant asks whether to check again
  • user replies "check"
  • model answers an unrelated old topic because the recent weather turn was never persisted

Root Cause

The relevant flow is:

  1. Gateway loads persisted history from SessionDB.

  2. run_conversation() builds:

    messages = list(conversation_history)
    messages.append(current_user_message)
  3. Before the API call, Hermes runs:

    repaired_seq = self._repair_message_sequence(messages)
  4. That repair can mutate messages in place by:

    • dropping stray/orphan tool messages
    • merging consecutive user messages
  5. On exit, persistence still computes:

    start_idx = len(conversation_history)
    flush_from = max(start_idx, self._last_flushed_db_idx)
    for msg in messages[flush_from:]:
        self._session_db.append_message(...)

If conversation_history had 120 entries, but repair shortens messages to 116, then:

messages[120:] == []

No exception is raised, and the current turn is skipped.

Minimal Reproduction Shape

A simplified reproduction:

history_len=120
messages before repair = 122
repair removes/merges 6 historical entries
messages after repair = 116
flush_from = len(conversation_history) = 120
messages[120:] = []
flushed_rows = 0

Observed local reproduction output:

history_len=120 before_repair=122 repairs=6 after_repair=116 flushed_rows=0

Expected Behavior

The current user message and assistant response should always be persisted, even if historical messages are repaired before the model call.

At minimum, _flush_messages_to_session_db() should not silently skip persistence when:

flush_from > len(messages)

Suggested Fix

Do not use the original len(conversation_history) as the only persistence boundary after in-place repair.

Possible approaches:

  1. Track the current-turn boundary explicitly after repair.
  2. Adjust the persistence offset when _repair_message_sequence() mutates messages.
  3. Persist the current turn separately from historical replay.
  4. Add a warning/error if flush_from > len(messages).

Suggested Regression Test

Add a test where:

  1. conversation_history contains malformed historical entries.
  2. messages = conversation_history + [current_user, assistant_reply].
  3. _repair_message_sequence(messages) shortens messages.
  4. _flush_messages_to_session_db(messages, conversation_history) is called.
  5. Assert the current user and assistant reply are still written to SessionDB.

This should prevent silent context loss in gateway integrations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High — major feature broken, no workaroundcomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions