fix: stale bootstrap checkpoint after maintain() JSONL rewrite causes duplicate flood #276

@liu51115

Summary

maintain() calls rewriteTranscriptEntries(), which rewrites the JSONL session file via branch-and-reappend, changing the file size, mtime, and entry IDs. However, conversation_bootstrap_state is only updated during bootstrap(), not after maintenance rewrites.

On the next gateway restart, bootstrap sees the stale checkpoint:

  • Fast path 1 (exact size+mtime) fails — file shrank by bytesFreed
  • Fast path 2 (append-only) fails — requires size > storedSize, but file is smaller
  • Falls through to reconcileSessionTail() full content-based reconcile
  • On conversations with many identical (role, content) pairs (empty assistant messages, repeated tool outputs — common in agent workflows), the occurrence-counting anchor matches at the wrong position → thousands of duplicate messages imported
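The path selection described above can be sketched roughly as follows. This is a simplified illustration, not the actual lossless-claw code: the `Checkpoint` and `FileStat` shapes and the `chooseBootstrapPath` helper are hypothetical names, and the real append-only check may verify more than size.

```typescript
// Hypothetical shapes; the real conversation_bootstrap_state schema may differ.
interface Checkpoint {
  size: number;    // file size recorded at the last bootstrap()
  mtimeMs: number; // file mtime recorded at the last bootstrap()
}

interface FileStat {
  size: number;
  mtimeMs: number;
}

type BootstrapPath = "unchanged" | "append-only" | "full-reconcile";

// Decide which bootstrap path applies for the current JSONL file.
function chooseBootstrapPath(cp: Checkpoint, stat: FileStat): BootstrapPath {
  // Fast path 1: exact size + mtime match, nothing to import.
  if (stat.size === cp.size && stat.mtimeMs === cp.mtimeMs) return "unchanged";
  // Fast path 2: file only grew, so only the appended tail needs importing.
  if (stat.size > cp.size) return "append-only";
  // Otherwise (e.g. the file shrank after a maintain() rewrite) fall back
  // to the full content-based reconcile.
  return "full-reconcile";
}
```

With a checkpoint taken before the rewrite, a file that shrank by bytesFreed matches neither fast path and lands on the full reconcile, which is where the anchor mismatch occurs.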

Reproduction

  1. Active LCM conversation with repeated patterns (empty assistant messages, identical tool outputs)
  2. maintain() runs transcript GC (called after bootstrap, turns, compaction)
  3. rewriteTranscriptEntries() shrinks the JSONL — checkpoint now stale
  4. Gateway restart → both fast paths fail → full reconcile → anchor mismatch → flood
  5. Flood inflates context → doom loop (see #268, "compactFullSweep unbounded while(true) loop causes 50-minute session lockout") → cascading failure

Impact

19+ flood events on a single conversation (17K messages) between March 29 and April 5, 2026. Worst case: 5,733 duplicate messages were imported in one second, context_items exploded 41×, and the result was a fleet-wide outage.

Fix

Two parts:

1. Primary: Update checkpoint after maintain() rewrite

After rewriteTranscriptEntries() returns changed: true, call upsertConversationBootstrapState() with the new file size/mtime/hash. This ensures the next bootstrap hits the fast path.
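A minimal sketch of the proposed flow, with the dependencies injected so it stays self-contained. The function names rewriteTranscriptEntries and upsertConversationBootstrapState come from this issue, but their signatures here are assumptions, as are the `MaintainDeps` interface, the `statAndHash` helper, and the `maintainAndRefreshCheckpoint` wrapper. The real calls are probably asynchronous; the sketch is synchronous only for brevity.

```typescript
// Assumed result shape: the issue only confirms a `changed` flag and bytesFreed.
interface RewriteResult {
  changed: boolean;
  bytesFreed: number;
}

// Assumed checkpoint fields: size, mtime, and hash of the session file.
interface BootstrapState {
  conversationId: string;
  size: number;
  mtimeMs: number;
  hash: string;
}

// Hypothetical dependency bundle so the sketch does not touch disk or the DB.
interface MaintainDeps {
  rewriteTranscriptEntries(): RewriteResult;
  statAndHash(): { size: number; mtimeMs: number; hash: string }; // hypothetical helper
  upsertConversationBootstrapState(state: BootstrapState): void;
}

// Run the rewrite and, if the file changed, refresh the checkpoint so the
// next bootstrap() sees a matching size/mtime. Returns true when the
// checkpoint was updated.
function maintainAndRefreshCheckpoint(conversationId: string, deps: MaintainDeps): boolean {
  const result = deps.rewriteTranscriptEntries();
  if (!result.changed) return false; // file untouched, checkpoint still valid

  // Re-read the file metadata after the rewrite, not before it.
  const { size, mtimeMs, hash } = deps.statAndHash();
  deps.upsertConversationBootstrapState({ conversationId, size, mtimeMs, hash });
  return true;
}
```

The key ordering constraint is that the stat and hash are taken after the rewrite completes; a checkpoint captured beforehand would be just as stale as the current one.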

2. Defense-in-depth: Import cap in reconcileSessionTail()

If reconcile would import more than 20% of the existing DB count (with a floor of 50), abort and log an error instead of importing. This prevents catastrophic floods even if the checkpoint fix somehow fails.
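One possible reading of the cap, sketched below: the limit is 20% of the existing count, but never lower than 50, so small conversations can still import a reasonable tail. The constant and function names (`IMPORT_CAP_RATIO`, `IMPORT_CAP_FLOOR`, `importCap`, `shouldAbortImport`) are hypothetical, and treating "min 50" as a floor on the threshold is an assumption.

```typescript
// Assumed constants taken from the proposal: 20% ratio, floor of 50.
const IMPORT_CAP_RATIO = 0.2;
const IMPORT_CAP_FLOOR = 50;

// Maximum number of messages a single reconcile may import.
function importCap(existingCount: number): number {
  return Math.max(IMPORT_CAP_FLOOR, Math.floor(existingCount * IMPORT_CAP_RATIO));
}

// True when the reconcile should abort and log instead of importing.
function shouldAbortImport(existingCount: number, pendingImports: number): boolean {
  return pendingImports > importCap(existingCount);
}
```

Under this reading, a conversation with 17,000 existing messages would have a cap of 3,400, so the 5,733-message import from the worst case above would be rejected.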

Related Issues

Environment

  • OpenClaw 3.28, lossless-claw 0.6.1
  • macOS 26.3.1, M2, 24GB
