
fix(gateway): realign _last_flushed_db_idx on cached-agent reuse to prevent skipped transcript rows #44425

Closed
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/cached-agent-flush-cursor-realign

Conversation

@liuhao1024

Contributor

What does this PR do?

Realigns _last_flushed_db_idx to the current agent_history length when the gateway reuses a cached AIAgent, preventing the DB-flush cursor from a previous turn from skipping the current turn's assistant reply in the session transcript.

Related Issue

Fixes #44327

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py: After _build_gateway_agent_history() returns, set agent._last_flushed_db_idx = len(agent_history) so a stale cursor from the previous turn does not cause _flush_messages_to_session_db() to skip the new turn's assistant row.
  • tests/gateway/test_agent_cache.py: Added TestCachedAgentFlushCursorRealign with 3 tests:
    • test_stale_flush_cursor_realigns_to_agent_history: verifies the cursor is set to len(agent_history)
    • test_flush_after_realign_persists_new_turn_messages: end-to-end test showing the fix prevents message skipping
    • test_stale_cursor_without_realign_skips_messages: demonstrates the bug (without the fix, messages are silently dropped)
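The failure mode and the realignment described above can be sketched with a toy model. This is not the project's code: `ToyAgent`, `flush`, and `run_turn` are hypothetical stand-ins, and the flush offset is assumed to be `max(len(conversation_history), _last_flushed_db_idx)`, as quoted later in this thread. The sketch only shows how a cursor left over from a longer turn can push the offset past the end of a shorter turn's message list.

```python
class ToyAgent:
    """Hypothetical stand-in for the cached agent; models only the flush cursor."""

    def __init__(self):
        self._last_flushed_db_idx = 0
        self.db_rows = []  # stands in for the session transcript table

    def flush(self, messages, conversation_history):
        # Assumed offset rule: never re-write history, never re-write rows
        # that were already flushed earlier in the same turn.
        flush_from = max(len(conversation_history), self._last_flushed_db_idx)
        written = messages[flush_from:]
        self.db_rows.extend(written)
        self._last_flushed_db_idx = len(messages)
        return written


def run_turn(agent, agent_history, user_text, realign):
    if realign:
        # The one-line change proposed in this PR.
        agent._last_flushed_db_idx = len(agent_history)
    messages = agent_history + [f"user: {user_text}", f"assistant: reply to {user_text}"]
    return agent.flush(messages, agent_history)


long_history = [f"msg {i}" for i in range(10)]
short_history = long_history[:4]  # e.g. a rebuilt, shorter history on the next turn

# Without realignment: the first turn leaves the cursor at 12, so the second
# turn (only 6 messages long) has nothing past the offset and writes no rows.
stale = ToyAgent()
run_turn(stale, long_history, "first", realign=False)
skipped = run_turn(stale, short_history, "second", realign=False)

# With realignment: the cursor is pulled back to len(agent_history) == 4,
# so the second turn's two new messages are written.
fixed = ToyAgent()
run_turn(fixed, long_history, "first", realign=True)
persisted = run_turn(fixed, short_history, "second", realign=True)
```

In this toy run `skipped` is empty, while `persisted` holds the second turn's user and assistant messages, which mirrors the scenario the three tests are described as covering.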

How to Test

  1. Run pytest tests/gateway/test_agent_cache.py::TestCachedAgentFlushCursorRealign -xvs — all 3 tests should pass
  2. Run pytest tests/gateway/test_agent_cache.py -x — all 66 tests should pass (no regressions)
  3. Run pytest tests/run_agent/test_compression_persistence.py tests/run_agent/test_860_dedup.py -x — existing persistence tests still pass

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: gateway/run.py::_process_message_background (caller of _build_gateway_agent_history and _flush_messages_to_session_db)
  • Analyzed: run_agent.py::_flush_messages_to_session_db (uses _last_flushed_db_idx to compute flush offset)
  • Blast radius: LOW — single assignment line, no control flow change
  • Related patterns: cli_commands_mixin.py:712-713 and cli.py:5921-5922 also realign _last_flushed_db_idx on session reset; this PR applies the same pattern to the gateway cached-agent path

…revent skipped transcript rows

When the gateway reuses a cached AIAgent, _last_flushed_db_idx from the
previous turn can be larger than the current turn's agent_history length.
This causes _flush_messages_to_session_db() to compute a flush_from offset
that skips the current turn's assistant reply, leaving a transcript with
consecutive user rows and no assistant response.

Fix: realign agent._last_flushed_db_idx = len(agent_history) immediately
after _build_gateway_agent_history() returns, before the agent processes
the new turn.

Fixes NousResearch#44327
@alt-glitch added labels on Jun 11, 2026: type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), P2 (Medium: degraded but workaround exists), duplicate (This issue or pull request already exists)
@alt-glitch

Collaborator

Duplicate of #32760 — same fix (realign _last_flushed_db_idx on cached-agent reuse, #44327). #32760 is the earlier open PR; closed twin #44354.

@liuhao1024

Contributor Author

Thanks for the flag @alt-glitch. Comparing the two PRs:

This PR includes regression tests that verify the fix prevents the stale index from causing message skips on agent reuse. Keeping open.

@liuhao1024

Contributor Author

Thanks for flagging @alt-glitch. Comparing the two PRs:

This PR includes regression tests verifying the behavior. Keeping open for the more complete fix.

@kshitijk4poor

Collaborator

Thanks for this @liuhao1024 — your diagnosis of the bug was spot on: a cached agent carrying a stale _last_flushed_db_idx from a longer previous turn makes _flush_messages_to_session_db compute flush_from = max(len(conversation_history), _last_flushed_db_idx) too high and silently skip the new turn's assistant row (#44327).

We landed #44518 (@kyssta-exe) for this, in a942bfd. It fixes the same bug but resets the cursor in _init_cached_agent_for_turn — the function whose job is resetting per-turn cached-agent state (right beside _api_call_count = 0) — and gates the reset on interrupt_depth == 0, so the cursor is preserved on interrupt-recursive re-entry (where the max() guard against duplicate in-turn writes still needs to hold). Setting it to 0 lets flush_from fall back to len(conversation_history), which is the exact pre-turn boundary, rather than realigning to len(agent_history) (which has to stay equal to the flush-time conversation_history length to be correct).
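To make the contrast concrete, here is a minimal sketch of the gated reset described above. It is an illustration under assumptions, not the code from #44518: `ToyAgent`, the standalone `init_cached_agent_for_turn` and `flush_from` helpers, and the way `interrupt_depth` is passed in are all hypothetical. Only the `max()` offset rule and the `interrupt_depth == 0` condition come from this thread.

```python
from dataclasses import dataclass


@dataclass
class ToyAgent:
    """Hypothetical stand-in that carries only the flush cursor."""
    _last_flushed_db_idx: int = 0


def init_cached_agent_for_turn(agent: ToyAgent, interrupt_depth: int) -> None:
    # Reset the cursor only at the start of a genuinely new turn. On
    # interrupt-recursive re-entry (depth > 0) the cursor is kept so rows
    # already written in this turn are not written twice.
    if interrupt_depth == 0:
        agent._last_flushed_db_idx = 0


def flush_from(agent: ToyAgent, conversation_history: list) -> int:
    # Offset rule as quoted in this thread.
    return max(len(conversation_history), agent._last_flushed_db_idx)


history = ["msg"] * 4

# New turn: a stale cursor of 12 is reset to 0, so the offset falls back to
# len(conversation_history), the pre-turn boundary.
fresh = ToyAgent(_last_flushed_db_idx=12)
init_cached_agent_for_turn(fresh, interrupt_depth=0)
fresh_offset = flush_from(fresh, history)

# Interrupt re-entry: the in-turn cursor of 6 is preserved, so the max()
# guard still skips the rows that were already flushed.
reentrant = ToyAgent(_last_flushed_db_idx=6)
init_cached_agent_for_turn(reentrant, interrupt_depth=1)
reentrant_offset = flush_from(reentrant, history)
```

Here `fresh_offset` comes out as 4 and `reentrant_offset` as 6. Resetting to 0 avoids depending on `len(agent_history)` matching the flush-time `conversation_history` length, which is the correctness condition noted above for the realignment approach.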

Your fix targets the right root cause; the difference is placement + interrupt-depth awareness. Closing as resolved by #44518 — thanks for the careful write-up, the scenario in your test description matched the real failure exactly.



Development

Successfully merging this pull request may close these issues.

Gateway cached-agent reuse can leak _last_flushed_db_idx across turns and skip assistant transcript rows

3 participants