fix: update bootstrap checkpoint after maintain() to prevent replay floods #278
Closed
liu51115 wants to merge 1 commit into
maintain() calls rewriteTranscriptEntries(), which rewrites the JSONL via branch-and-reappend, changing file size, mtime, and entry IDs. But the conversation_bootstrap_state was only updated during bootstrap(), leaving a stale checkpoint after every maintenance rewrite.

On the next gateway restart, bootstrap saw the stale checkpoint:
- Fast path 1 (exact size+mtime match) failed (file shrank)
- Fast path 2 (append-only, size > stored) failed (file shrank)
- Fell through to reconcileSessionTail() full content-based reconcile
- On conversations with many identical (role, content) pairs, the occurrence-counting anchor matched at the wrong position by coincidence
- Result: thousands of duplicate messages imported in one second

Now maintain() updates the checkpoint after a successful rewrite, so the next bootstrap hits the fast path instead of falling through to reconcile.

Also adds a defense-in-depth import cap to reconcileSessionTail(): if the reconcile would import more than 20% of the existing DB message count (minimum 50), it aborts and logs an error instead of blindly importing.

Root cause of 19+ bootstrap flood events across March 29 - April 5, 2026.

Fixes Martian-Engineering#271

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
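As a rough illustration of the restart path and the import cap described above, here is a minimal TypeScript sketch. The type and function names (`Checkpoint`, `FileState`, `chooseBootstrapPath`, `exceedsImportCap`) are hypothetical and not taken from the diff, and the hash check in fast path 2 is an assumption about how the append-only path validates the stored offset.

```typescript
// All names below are hypothetical; they mirror the description, not the actual diff.
interface Checkpoint {
  size: number;          // file size recorded at the last checkpoint
  mtimeMs: number;       // file mtime recorded at the last checkpoint
  lastEntryHash: string; // hash of the last entry seen at the stored offset
}

interface FileState {
  size: number;
  mtimeMs: number;
  hashAtStoredOffset: string | null; // null if no usable entry was found there
}

type BootstrapPath = "exact-match" | "append-only" | "reconcile";

function chooseBootstrapPath(cp: Checkpoint, file: FileState): BootstrapPath {
  // Fast path 1: the file is exactly where the checkpoint left it.
  if (file.size === cp.size && file.mtimeMs === cp.mtimeMs) return "exact-match";
  // Fast path 2 (assumed shape): the file only grew and the stored entry is unchanged.
  if (file.size > cp.size && file.hashAtStoredOffset === cp.lastEntryHash) {
    return "append-only";
  }
  // Anything else, such as a file that shrank after a rewrite, needs a full reconcile.
  return "reconcile";
}

// Defense-in-depth cap: refuse imports larger than max(20% of existing rows, 50).
function exceedsImportCap(existingDbCount: number, wouldImport: number): boolean {
  // First bootstrap: the DB is empty, so any import size is legitimate.
  if (existingDbCount === 0) return false;
  const cap = Math.max(existingDbCount * 0.2, 50);
  return wouldImport > cap;
}
```

With a stale checkpoint of 1,000 bytes and a rewritten file of 800 bytes, `chooseBootstrapPath` returns `"reconcile"`, which is the fall-through this change avoids by refreshing the checkpoint after maintain(). For the cap, a conversation with about 17,000 existing messages gets a limit of 3,400, so an import of 5,733 messages would be refused.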
Closing: checkpoint update alone is insufficient. Non-message entries (cache-ttl) at the stored offset still cause hash=null → full reconcile → flood. Need to also fix readLastJsonlEntryBeforeOffset to skip non-message entries. Will reopen with complete fix.
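A minimal sketch of the follow-up fix mentioned in this comment, assuming each JSONL line is an object with a `type` field and that message entries use `type: "message"`. The function name `lastMessageEntryBefore`, the entry shape, and the use of a character offset rather than a byte offset are illustrative assumptions; this is not the real readLastJsonlEntryBeforeOffset.

```typescript
// Hypothetical entry shape; the real transcript schema may differ.
interface TranscriptEntry {
  type: string; // e.g. "message" or "openclaw.cache-ttl" (assumed field)
  id?: string;
}

// Returns the last entry of type "message" that appears before `offset`,
// or null if there is none. `offset` is a character index into `jsonl` here;
// real code would more likely use a byte offset into the file.
function lastMessageEntryBefore(jsonl: string, offset: number): TranscriptEntry | null {
  const lines = jsonl.slice(0, offset).split("\n");
  // Walk backwards so the first match is the entry closest to the offset.
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i].trim();
    if (line === "") continue;
    let entry: TranscriptEntry;
    try {
      entry = JSON.parse(line) as TranscriptEntry;
    } catch {
      continue; // skip a partial or corrupt line
    }
    // Skip non-message entries instead of giving up with null.
    if (entry.type === "message") return entry;
  }
  return null;
}
```

The design point is that a trailing non-message entry no longer produces a null result, so the hash comparison can still be made against the nearest preceding message.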
Superseded by new PR from clean fix/bootstrap-flood-prevention branch.
Root cause of #268, #271, #276
This is the root fix for the recurring bootstrap replay flood that causes doom loops (#268), exponential message accumulation (#271), and session lockouts (#276).
The bug
`maintain()` calls `rewriteTranscriptEntries()` after every successful turn. This rewrites the JSONL via branch-and-reappend, changing file size, mtime, and entry IDs. But `conversation_bootstrap_state` was only updated during `bootstrap()`, never after maintenance rewrites.

On the next gateway restart:
- The fast paths miss (when the entry at the stored offset is a non-message entry such as `openclaw.cache-ttl`, `readLastJsonlEntryBeforeOffset` returns `null` and the hash comparison fails)
- Bootstrap falls through to `reconcileSessionTail()`: content-based anchor matching using occurrence counting on `(role, content)` identity

On conversations with repeated identical messages (empty assistant turns, duplicate tool outputs, `NO_REPLY` patterns), the occurrence-count anchor lands thousands of entries too early, re-importing everything after the false anchor as duplicates.

Evidence from production
Our conv 1 (17K messages) experienced 19 flood events across 453 gateway restarts since LCM activation on March 29. Every flood maps 1:1 to a restart where the JSONL had been rewritten by `maintain()` since the last checkpoint update. The final flood (April 5) imported 5,733 duplicates in one second, inflating context to 1.7M tokens and triggering a 50-minute compaction doom loop that cascaded into a fleet-wide outage.

The doom loop (#268) is a consequence, not the cause. `compactFullSweep` runs correctly; the problem is that it's asked to compact 1.7M tokens that shouldn't exist. Fix the replay flood and the doom loop doesn't trigger.

Fix
Two changes:
- Checkpoint update after maintain() (primary fix): When `rewriteTranscriptEntries()` returns `changed: true`, stat the session file and call `upsertConversationBootstrapState()` with fresh size/mtime/offset/hash. Next restart hits the fast path. Wrapped in try/catch so a checkpoint failure doesn't break the rewrite.
- Import cap in reconcileSessionTail() (defense-in-depth): If the reconcile would import more than `max(existingDbCount * 0.2, 50)` messages, abort with a warning log. First bootstrap (`existingDbCount === 0`) is exempt. This caps damage if reconcile fires for any reason we haven't anticipated.

Testing
All 533 tests pass (546 minus 13 pinnedFiles tests on separate branch).
Related issues