Skip to content

fix(gateway): backfill missing assistant transcript rows#43853

Open
rungmc357 wants to merge 1 commit into
NousResearch:mainfrom
rungmc357:fix/gateway-assistant-persist
Open

fix(gateway): backfill missing assistant transcript rows#43853
rungmc357 wants to merge 1 commit into
NousResearch:mainfrom
rungmc357:fix/gateway-assistant-persist

Conversation

@rungmc357

Copy link
Copy Markdown

Summary

  • add a gateway persistence guard that backfills the visible assistant response into state.db when agent-side DB flush misses it
  • preserve the existing duplicate-user/duplicate-transcript protection by only writing when no assistant row exists after the pre-turn DB tail
  • add regression tests for the user-only backlog replay failure mode

Why

A gateway turn can persist the inbound user message early, return/send a visible assistant response, but skip DB fallback writes because SessionDB exists. If the assistant row is missing from SQLite, the next gateway replay can look like a long user-only backlog, causing the agent to treat already answered messages as unresolved.

This restores the invariant: if the gateway visibly sends an assistant response, replay history must contain an assistant row for that response.

Closes #43849

Test Plan

  • python -m pytest -q -o addopts='' tests/gateway/test_gateway_assistant_persistence.py
  • python -m pytest -q -o addopts='' tests/gateway/test_gateway_assistant_persistence.py tests/run_agent/test_860_dedup.py tests/gateway/test_telegram_group_gating.py

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification review — reviewed the full diff (2 files, +157/-1).

Clean port of the EXDEV/EBUSY fallback from gemini-cli#21541:

  1. Error scoping — only EXDEV (cross-device) and EBUSY (bind-mount busy) trigger the fallback; other OSError codes propagate unchanged. This is the correct narrow scope.

  2. Fallback sequenceshutil.copyfileshutil.copymodeos.fsyncos.unlink. The fsync ensures data is durable before the temp file is removed. Permission copy failures are caught and ignored (correct — the file is still readable).

  3. Symlink preservation — the fallback writes to real_path (the resolved target), so symlinked configs survive cross-device writes. This maintains the [Bug]: atomic writes to HERMES_HOME files replace symlinked targets (config.yaml/SOUL.md) #16743 invariant.

  4. Test coverage — parameterized EXDEV/EBUSY, symlink preservation, other-OSError propagation, and a real cross-device E2E test using /dev/shm vs tmpdir.

No issues found. Clean utility fix.

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification review — reviewed the full diff (2 files, +160/-0).

The assistant transcript backfill guard is well-scoped:

  1. Tail-ID tracking_db_tail_id() captures the last message ID before the turn starts. After the turn, _ensure_visible_response_persisted() only inspects messages newer than that tail, avoiding false positives from pre-existing transcript.

  2. Duplicate prevention — the guard checks any(message.get("role") == "assistant" for message in new_messages) before writing. If the agent already persisted its response, the guard is a no-op. Correct dedup.

  3. Failure mode — the entire guard is wrapped in try/except with logger.debug. A persistence failure is non-fatal (the response was already sent to the platform). This is the right degradation for a defensive backfill.

  4. Test coverage — three cases: backfill when agent missed, no-op when agent persisted, no-op on empty response. The _FakeSessionDB mock correctly simulates the append + get_messages interface.

No issues found. Clean gateway resilience fix.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway can deliver assistant response without persisting assistant row

3 participants