Skip to content

fix: update bootstrap checkpoint after maintain() to prevent replay floods#278

Closed
liu51115 wants to merge 1 commit into
Martian-Engineering:mainfrom
liu51115:fix/bootstrap-checkpoint-stale
Closed

fix: update bootstrap checkpoint after maintain() to prevent replay floods#278
liu51115 wants to merge 1 commit into
Martian-Engineering:mainfrom
liu51115:fix/bootstrap-checkpoint-stale

Conversation

@liu51115

@liu51115 liu51115 commented Apr 5, 2026

Copy link
Copy Markdown
Contributor

Root cause of #268, #271, #276

This is the root fix for the recurring bootstrap replay flood that causes doom loops (#268), exponential message accumulation (#271), and session lockouts (#276).

The bug

maintain() calls rewriteTranscriptEntries() after every successful turn. This rewrites the JSONL via branch-and-reappend — changing file size, mtime, and entry IDs. But conversation_bootstrap_state was only updated during bootstrap(), never after maintenance rewrites.

On the next gateway restart:

  1. Fast path 1 fails (size/mtime mismatch from stale checkpoint)
  2. Fast path 2 fails (entry at stored offset is now a different entry, or a non-message type like openclaw.cache-ttlreadLastJsonlEntryBeforeOffset returns null, hash comparison fails)
  3. Falls through to reconcileSessionTail() — content-based anchor matching using occurrence counting on (role, content) identity

On conversations with repeated identical messages (empty assistant turns, duplicate tool outputs, NO_REPLY patterns), the occurrence-count anchor lands thousands of entries too early, re-importing everything after the false anchor as duplicates.

Evidence from production

Our conv 1 (17K messages) experienced 19 flood events across 453 gateway restarts since LCM activation on March 29. Every flood maps 1:1 to a restart where the JSONL had been rewritten by maintain() since the last checkpoint update. The final flood (April 5) imported 5,733 duplicates in one second, inflating context to 1.7M tokens and triggering a 50-minute compaction doom loop that cascaded into a fleet-wide outage.

The doom loop (#268) is a consequence, not the cause. compactFullSweep runs correctly — the problem is that it's asked to compact 1.7M tokens that shouldn't exist. Fix the replay flood and the doom loop doesn't trigger.

Fix

Two changes:

  1. Checkpoint update after maintain() (primary fix): When rewriteTranscriptEntries() returns changed: true, stat the session file and call upsertConversationBootstrapState() with fresh size/mtime/offset/hash. Next restart hits the fast path. Wrapped in try/catch so a checkpoint failure doesn't break the rewrite.

  2. Import cap in reconcileSessionTail() (defense-in-depth): If the reconcile would import more than max(existingDbCount * 0.2, 50) messages, abort with a warning log. First bootstrap (existingDbCount === 0) is exempt. This caps damage if reconcile fires for any reason we haven't anticipated.

Testing

All 533 tests pass (546 minus 13 pinnedFiles tests on separate branch).

Related issues

maintain() calls rewriteTranscriptEntries() which rewrites the JSONL via
branch-and-reappend, changing file size, mtime, and entry IDs. But the
conversation_bootstrap_state was only updated during bootstrap(), leaving
a stale checkpoint after every maintenance rewrite.

On the next gateway restart, bootstrap saw the stale checkpoint:
- Fast path 1 (exact size+mtime match) failed (file shrank)
- Fast path 2 (append-only, size > stored) failed (file shrank)
- Fell through to reconcileSessionTail() full content-based reconcile
- On conversations with many identical (role, content) pairs, the
  occurrence-counting anchor matched at the wrong position by coincidence
- Result: thousands of duplicate messages imported in one second

Now maintain() updates the checkpoint after a successful rewrite, so the
next bootstrap hits the fast path instead of falling through to reconcile.

Also adds a defense-in-depth import cap to reconcileSessionTail(): if the
reconcile would import more than 20% of the existing DB message count
(minimum 50), it aborts and logs an error instead of blindly importing.

Root cause of 19+ bootstrap flood events across March 29 - April 5, 2026.

Fixes Martian-Engineering#271

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@liu51115 liu51115 changed the title fix: update bootstrap checkpoint after maintain() JSONL rewrite fix: update bootstrap checkpoint after maintain() to prevent replay floods Apr 5, 2026
@liu51115

liu51115 commented Apr 5, 2026

Copy link
Copy Markdown
Contributor Author

Closing — checkpoint update alone is insufficient. Non-message entries (cache-ttl) at the stored offset still cause hash=null → full reconcile → flood. Need to also fix readLastJsonlEntryBeforeOffset to skip non-message entries. Will reopen with complete fix.

@liu51115 liu51115 closed this Apr 5, 2026
@liu51115 liu51115 reopened this Apr 5, 2026
@liu51115

liu51115 commented Apr 5, 2026

Copy link
Copy Markdown
Contributor Author

Superseded by new PR from clean fix/bootstrap-flood-prevention branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant