fix: update bootstrap checkpoint after maintain() to prevent replay floods by liu51115 · Pull Request #278 · Martian-Engineering/lossless-claw

liu51115 · 2026-04-05T16:39:41Z

Root cause of #268, #271, #276

This is the root fix for the recurring bootstrap replay flood that causes doom loops (#268), exponential message accumulation (#271), and session lockouts (#276).

The bug

maintain() calls rewriteTranscriptEntries() after every successful turn. This rewrites the JSONL via branch-and-reappend — changing file size, mtime, and entry IDs. But conversation_bootstrap_state was only updated during bootstrap(), never after maintenance rewrites.

On the next gateway restart:

Fast path 1 fails (size/mtime mismatch from stale checkpoint)
Fast path 2 fails: the entry at the stored offset is now either a different entry (so the hash comparison fails) or a non-message type such as openclaw.cache-ttl (so readLastJsonlEntryBeforeOffset returns null)
Falls through to reconcileSessionTail() — content-based anchor matching using occurrence counting on (role, content) identity
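
To make the fall-through easier to follow, here is a minimal sketch of the decision as a pure function. It is not the plugin's code: the type names, field names, and `chooseBootstrapPath` itself are hypothetical, and only the three outcomes and the "size > stored" condition come from this PR.

```typescript
// Hypothetical shapes: the PR describes the checks, not the real types.
interface StoredCheckpoint {
  sizeBytes: number;
  mtimeMs: number;
  lastEntryHash: string;
}

interface SessionFileState {
  sizeBytes: number;
  mtimeMs: number;
  // Hash of the message entry read at the checkpoint's stored offset,
  // or null when no message entry could be read there.
  hashAtStoredOffset: string | null;
}

type BootstrapPath = "unchanged" | "append-only" | "reconcile";

function chooseBootstrapPath(cp: StoredCheckpoint, file: SessionFileState): BootstrapPath {
  // Fast path 1: size and mtime both match, so there is nothing to import.
  if (file.sizeBytes === cp.sizeBytes && file.mtimeMs === cp.mtimeMs) {
    return "unchanged";
  }
  // Fast path 2: the file only grew and the checkpointed entry is still in
  // place. A null hash never equals the stored hash, so it fails here too.
  if (file.sizeBytes > cp.sizeBytes && file.hashAtStoredOffset === cp.lastEntryHash) {
    return "append-only";
  }
  // Anything else falls through to the content-based reconcile.
  return "reconcile";
}
```

With a stale checkpoint both fast-path conditions are false after a rewrite, so every such restart ends in the `"reconcile"` branch.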

On conversations with repeated identical messages (empty assistant turns, duplicate tool outputs, NO_REPLY patterns), the occurrence-count anchor lands thousands of entries too early, re-importing everything after the false anchor as duplicates.
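
A toy model of why repeated identities make the anchor ambiguous is sketched below. This is an illustration under stated assumptions, not reconcileSessionTail() itself: `findAnchorIndex`, the `Msg` shape, and the rule "find the N-th file occurrence of the last DB message, where N is its count in the DB" are all hypothetical simplifications.

```typescript
interface Msg {
  role: string;
  content: string;
}

// Identity is (role, content) only; entry IDs are ignored because the
// rewrite changes them.
const identity = (m: Msg): string => `${m.role}\u0000${m.content}`;

// Toy anchor rule (an assumption): count how often the last DB message's
// identity occurs in the DB, then pick the same-numbered occurrence in the file.
function findAnchorIndex(db: Msg[], file: Msg[]): number {
  const last = db[db.length - 1];
  if (last === undefined) return -1;
  const target = identity(last);
  const wanted = db.filter((m) => identity(m) === target).length;
  let seen = 0;
  for (let i = 0; i < file.length; i++) {
    const entry = file[i];
    if (entry !== undefined && identity(entry) === target) {
      seen += 1;
      if (seen === wanted) return i;
    }
  }
  return -1;
}

const user = (content: string): Msg => ({ role: "user", content });
const emptyAssistant: Msg = { role: "assistant", content: "" };

// The file holds three empty assistant turns; suppose the DB recorded only two.
const fileMsgs = [user("a"), emptyAssistant, user("b"), emptyAssistant, user("c"), emptyAssistant];
const dbMsgs = [user("a"), emptyAssistant, user("b"), user("c"), emptyAssistant];

const anchor = findAnchorIndex(dbMsgs, fileMsgs);
const reimported = fileMsgs.slice(anchor + 1);
```

In this model the true tail position is index 5, but the anchor lands on index 3, so `reimported` contains two messages the DB already has. When every message is unique the count is always 1 and the ambiguity disappears.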

Evidence from production

Our conv 1 (17K messages) experienced 19 flood events across 453 gateway restarts since LCM activation on March 29. Every flood maps 1:1 to a restart where the JSONL had been rewritten by maintain() since the last checkpoint update. The final flood (April 5) imported 5,733 duplicates in one second, inflating context to 1.7M tokens and triggering a 50-minute compaction doom loop that cascaded into a fleet-wide outage.

The doom loop (#268) is a consequence, not the cause. compactFullSweep runs correctly — the problem is that it's asked to compact 1.7M tokens that shouldn't exist. Fix the replay flood and the doom loop doesn't trigger.

Fix

Two changes:

Checkpoint update after maintain() (primary fix): When rewriteTranscriptEntries() returns changed: true, stat the session file and call upsertConversationBootstrapState() with fresh size/mtime/offset/hash. Next restart hits the fast path. Wrapped in try/catch so a checkpoint failure doesn't break the rewrite.
Import cap in reconcileSessionTail() (defense-in-depth): If the reconcile would import more than max(existingDbCount * 0.2, 50) messages, abort with a warning log. First bootstrap (existingDbCount === 0) is exempt. This caps damage if reconcile fires for any reason we haven't anticipated.
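
The two changes can be sketched roughly as follows. This is a simplified, synchronous sketch rather than the diff in this PR: the `MaintainDeps` interface, `probeSessionFile`, `maintainWithCheckpoint`, `exceedsImportCap`, and the checkpoint field names are hypothetical, and the real functions may well be async with different signatures.

```typescript
// All shapes below are hypothetical; the PR names the functions, not their signatures.
interface BootstrapCheckpoint {
  sizeBytes: number;
  mtimeMs: number;
  lastEntryOffset: number;
  lastEntryHash: string | null;
}

interface MaintainDeps {
  rewriteTranscriptEntries(): { changed: boolean };
  // Assumed helper: stats the session file and hashes its last entry.
  probeSessionFile(): BootstrapCheckpoint;
  upsertConversationBootstrapState(checkpoint: BootstrapCheckpoint): void;
  warn(message: string): void;
}

// Change 1: refresh the checkpoint whenever maintenance rewrote the file.
function maintainWithCheckpoint(deps: MaintainDeps): { changed: boolean } {
  const result = deps.rewriteTranscriptEntries();
  if (result.changed) {
    try {
      deps.upsertConversationBootstrapState(deps.probeSessionFile());
    } catch (err) {
      // A failed checkpoint write must not fail the rewrite itself.
      deps.warn(`bootstrap checkpoint refresh failed: ${String(err)}`);
    }
  }
  return result;
}

// Change 2: refuse a reconcile import that is implausibly large.
function exceedsImportCap(existingDbCount: number, pendingImportCount: number): boolean {
  if (existingDbCount === 0) return false; // first bootstrap is exempt
  return pendingImportCount > Math.max(existingDbCount * 0.2, 50);
}
```

Assuming roughly 17,000 existing messages, the cap works out to 3,400, so an import the size of the April 5 flood (5,733 messages) would be rejected.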

Testing

All 533 tests pass (546 total minus the 13 pinnedFiles tests, which live on a separate branch).

Related issues

#268 "compactFullSweep unbounded while(true) loop causes 50-minute session lockout (doom loop)": compactFullSweep doom loop (consequence of flood inflating context)
#271 "reconcileSessionTail: exponential message accumulation when session contains many identical messages (empty assistant, NO_REPLY, generic tool outputs)": reconcileSessionTail exponential accumulation with identical messages (this is the mechanism)
#276 "fix: stale bootstrap checkpoint after maintain() JSONL rewrite causes duplicate flood": bootstrap checkpoint stale after maintain() (this PR)
#263 "fix: stop conv replay pollution in lossless-claw": afterTurn dedup fix (fixed a different code path; bootstrap path was unfixed)

maintain() calls rewriteTranscriptEntries() which rewrites the JSONL via branch-and-reappend, changing file size, mtime, and entry IDs. But the conversation_bootstrap_state was only updated during bootstrap(), leaving a stale checkpoint after every maintenance rewrite.

On the next gateway restart, bootstrap saw the stale checkpoint:
- Fast path 1 (exact size+mtime match) failed (file shrank)
- Fast path 2 (append-only, size > stored) failed (file shrank)
- Fell through to reconcileSessionTail() full content-based reconcile
- On conversations with many identical (role, content) pairs, the occurrence-counting anchor matched at the wrong position by coincidence
- Result: thousands of duplicate messages imported in one second

Now maintain() updates the checkpoint after a successful rewrite, so the next bootstrap hits the fast path instead of falling through to reconcile.

Also adds a defense-in-depth import cap to reconcileSessionTail(): if the reconcile would import more than 20% of the existing DB message count (minimum 50), it aborts and logs an error instead of blindly importing.

Root cause of 19+ bootstrap flood events across March 29 - April 5, 2026.

Fixes Martian-Engineering#271

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

liu51115 · 2026-04-05T16:55:51Z

Closing — checkpoint update alone is insufficient. Non-message entries (cache-ttl) at the stored offset still cause hash=null → full reconcile → flood. Need to also fix readLastJsonlEntryBeforeOffset to skip non-message entries. Will reopen with complete fix.
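
For reference, skipping non-message entries while scanning backward could look roughly like the sketch below. It is not the code from the follow-up PR: `lastMessageEntryBefore`, the `type` field, and the `"message"` value are assumptions about the JSONL shape, and the offset is treated as a string index rather than a byte offset to keep the example short.

```typescript
// Assumed entry shape: every JSONL line is an object with a `type` field.
interface JsonlEntry {
  type: string;
  [key: string]: unknown;
}

// Returns the last entry of type "message" that ends before `offset`,
// skipping blank lines, unparsable fragments, and non-message entries.
function lastMessageEntryBefore(jsonl: string, offset: number): JsonlEntry | null {
  const lines = jsonl.slice(0, offset).split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = (lines[i] ?? "").trim();
    if (line === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // a line cut off by the offset is ignored
    }
    if (typeof parsed === "object" && parsed !== null && (parsed as JsonlEntry).type === "message") {
      return parsed as JsonlEntry;
    }
  }
  return null;
}
```

With this behaviour, a trailing openclaw.cache-ttl entry no longer produces a null hash as long as an earlier message entry exists.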

liu51115 · 2026-04-05T17:36:03Z

Superseded by new PR from clean fix/bootstrap-flood-prevention branch.

liu51115 changed the title ~~fix: update bootstrap checkpoint after maintain() JSONL rewrite~~ fix: update bootstrap checkpoint after maintain() to prevent replay floods Apr 5, 2026

liu51115 mentioned this pull request Apr 5, 2026

fix: cap compactFullSweep Phase 1 leaf passes to prevent doom loop #279

Closed

liu51115 closed this Apr 5, 2026

liu51115 reopened this Apr 5, 2026

liu51115 closed this Apr 5, 2026

liu51115 mentioned this pull request Apr 5, 2026

fix: prevent bootstrap replay flood after maintain() JSONL rewrite #280

Merged

3 tasks

liu51115 wants to merge 1 commit into Martian-Engineering:main from liu51115:fix/bootstrap-checkpoint-stale
