Summary
maintain() calls rewriteTranscriptEntries(), which rewrites the JSONL session file via branch-and-reappend, changing the file size, mtime, and entry IDs. However, conversation_bootstrap_state is only updated during bootstrap(), not after maintenance rewrites, so the stored checkpoint no longer describes the file on disk.
On the next gateway restart, bootstrap sees the stale checkpoint:
1. Fast path 1 (exact size+mtime) fails: the file shrank by bytesFreed
2. Fast path 2 (append-only) fails: it requires size > storedSize, but the file is smaller
3. Bootstrap falls through to the reconcileSessionTail() full content-based reconcile
4. On conversations with many identical (role, content) pairs (empty assistant messages and repeated tool outputs, both common in agent workflows), the occurrence-counting anchor matches at the wrong position, and thousands of duplicate messages are imported
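The path selection described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `StoredCheckpoint` and `CurrentFile` shapes and the `chooseBootstrapPath` name are hypothetical, and the real append-only check may verify more than the size.

```typescript
// Hypothetical shapes: the real conversation_bootstrap_state columns are not shown in this report.
interface StoredCheckpoint {
  size: number;    // file size in bytes recorded at the last bootstrap
  mtimeMs: number; // file modification time recorded at the last bootstrap
}

interface CurrentFile {
  size: number;
  mtimeMs: number;
}

type BootstrapPath = "exact-match" | "append-only" | "full-reconcile";

// Mirrors the order described above: exact match, then append-only, then full reconcile.
function chooseBootstrapPath(stored: StoredCheckpoint, current: CurrentFile): BootstrapPath {
  if (current.size === stored.size && current.mtimeMs === stored.mtimeMs) {
    return "exact-match";
  }
  if (current.size > stored.size) {
    return "append-only";
  }
  return "full-reconcile";
}

// A rewrite that frees bytes leaves the file smaller than the stored checkpoint.
const stored: StoredCheckpoint = { size: 10_000, mtimeMs: 1_000 };
const bytesFreed = 2_500;
const afterRewrite: CurrentFile = { size: stored.size - bytesFreed, mtimeMs: 2_000 };
console.log(chooseBootstrapPath(stored, afterRewrite)); // prints "full-reconcile"
```

Because a GC rewrite can only shrink the file, neither fast path can succeed against a checkpoint taken before the rewrite.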
Reproduction
1. Active LCM conversation with repeated patterns (empty assistant messages, identical tool outputs)
2. maintain() runs transcript GC (called after bootstrap, turns, compaction)
3. rewriteTranscriptEntries() shrinks the JSONL; the checkpoint is now stale
4. Gateway restart → both fast paths fail → full reconcile → anchor mismatch → flood
Impact
19+ flood events on a single conversation (17K messages) between March 29 and April 5, 2026. In the worst case, 5,733 duplicate messages were imported in one second, context_items grew 41×, and the flood caused a fleet-wide outage.
Fix
Two parts:
1. Primary: Update checkpoint after maintain() rewrite
After rewriteTranscriptEntries() returns changed: true, call upsertConversationBootstrapState() with the new file size, mtime, and hash. This ensures the next bootstrap hits a fast path instead of falling through to the full reconcile.
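A minimal sketch of that step, assuming the checkpoint write can be injected as a dependency. The `FileSnapshot`, `RewriteResult`, and `CheckpointDeps` shapes, the `snapshotFile` helper, and the argument list given to `upsertConversationBootstrapState` are all hypothetical; only the function name and the `changed` flag come from the description above.

```typescript
// Hypothetical shapes for illustration only.
interface FileSnapshot {
  size: number;
  mtimeMs: number;
  hash: string;
}

interface RewriteResult {
  changed: boolean;
}

interface CheckpointDeps {
  // Assumed helper: reads the current size, mtime, and hash of the session file.
  snapshotFile: (sessionPath: string) => FileSnapshot;
  // Assumed signature: the real upsertConversationBootstrapState() may differ.
  upsertConversationBootstrapState: (conversationId: string, snapshot: FileSnapshot) => void;
}

// Returns true when the checkpoint was refreshed, false when the rewrite was a no-op.
function refreshCheckpointAfterRewrite(
  result: RewriteResult,
  conversationId: string,
  sessionPath: string,
  deps: CheckpointDeps,
): boolean {
  if (!result.changed) {
    return false; // the file is untouched, so the stored checkpoint is still valid
  }
  // Snapshot after the rewrite so the checkpoint describes the file as it now exists.
  const snapshot = deps.snapshotFile(sessionPath);
  deps.upsertConversationBootstrapState(conversationId, snapshot);
  return true;
}
```

Taking the snapshot only after the rewrite has finished matters: a snapshot taken earlier would reproduce the stale checkpoint this change is meant to eliminate.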
2. Defense-in-depth: Import cap in reconcileSessionTail()
If reconcile would import more than 20% of the existing DB message count (min 50), abort and log an error instead of importing. This prevents catastrophic floods even if the checkpoint fix somehow fails.
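The cap arithmetic can be sketched as below. This reads "(min 50)" as a floor on the cap itself, which is an assumption; the constant and function names are hypothetical, and 17,000 stands in for the roughly 17K-message conversation from the incident.

```typescript
// Assumed interpretation: cap = max(50, 20% of the existing message count).
const IMPORT_CAP_PERCENT = 20;
const IMPORT_CAP_FLOOR = 50;

// Integer arithmetic avoids floating-point rounding surprises on large counts.
function importCap(existingCount: number): number {
  const proportional = Math.floor((existingCount * IMPORT_CAP_PERCENT) / 100);
  return Math.max(IMPORT_CAP_FLOOR, proportional);
}

// True when the reconcile should abort and log an error instead of importing.
function exceedsImportCap(pendingImports: number, existingCount: number): boolean {
  return pendingImports > importCap(existingCount);
}

console.log(importCap(17_000)); // prints 3400
console.log(exceedsImportCap(5_733, 17_000)); // prints true
console.log(exceedsImportCap(40, 100)); // prints false
```

Under these assumptions the worst-case flood of 5,733 messages would have been rejected, while the floor of 50 keeps small conversations from tripping the cap on ordinary catch-up imports.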
Related Issues
Environment