fix: prevent bootstrap replay flood after maintain() JSONL rewrite #280

Merged
jalehman merged 2 commits into Martian-Engineering:main from liu51115:fix/bootstrap-flood-prevention
Apr 5, 2026
Conversation

@liu51115 liu51115 commented Apr 5, 2026

Contributor

Summary

maintain() calls rewriteTranscriptEntries(), which rewrites the JSONL via branch-and-reappend, changing the file's size, mtime, and entry IDs. However, conversation_bootstrap_state was only updated during bootstrap(), so every maintenance rewrite left a stale checkpoint behind.

On the next gateway restart, bootstrap saw the stale checkpoint, fell through both fast paths, and reached reconcileSessionTail(), where occurrence counting anchored at the wrong position and imported thousands of duplicate messages within one second.

Three fixes

  1. Checkpoint update after maintain() — After a successful JSONL rewrite, update the bootstrap checkpoint so the next bootstrap() hits the fast path instead of falling through to reconcile.

  2. messageOnly skip in readLastJsonlEntryBeforeOffset() — The function now accepts a messageOnly flag so it skips non-message JSONL entries (cache-ttl, tool-result, session-meta) when computing the tail entry hash for checkpoint comparison. Previously, a trailing non-message entry would cause the hash check to fail even when the checkpoint was otherwise correct.

  3. Import cap in reconcileSessionTail() — If reconcile would import more than 20% of the existing DB message count (minimum 50), abort and log an error instead of blindly importing. Defense-in-depth against any future stale-checkpoint scenario.
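
The messageOnly skip in fix 2 can be illustrated with a minimal sketch. It is not the code from this PR: the function below works on an in-memory string rather than a file, and the `type: "message"` field used to recognize message entries is an assumption made for illustration.

```typescript
// Illustrative sketch only; names and the entry shape are assumptions,
// not the actual readLastJsonlEntryBeforeOffset() implementation.
interface JsonlEntry {
  type: string;
  [key: string]: unknown;
}

function lastEntryBeforeOffset(
  jsonl: string,
  offset: number,
  messageOnly = false,
): JsonlEntry | undefined {
  // Only consider content before the checkpointed offset.
  const lines = jsonl.slice(0, offset).split("\n");
  // Walk backwards so the first acceptable entry is the tail entry.
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i].trim();
    if (line === "") continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // partial or corrupt line, e.g. cut off by the offset
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const entry = parsed as JsonlEntry;
    // With messageOnly set, skip entries such as cache-ttl or session-meta.
    if (messageOnly && entry.type !== "message") continue;
    return entry;
  }
  return undefined;
}
```

With a transcript whose last line is a cache-ttl entry, the default call returns that trailing entry, while `messageOnly = true` returns the last message before it, so a tail hash computed from it is unaffected by trailing bookkeeping entries.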

This was the root cause of 19+ bootstrap flood events between March 29 and April 5, 2026.

Supersedes #278.
Fixes #271
Relates to #268, #276

Test plan

  • 11 new tests for readLastJsonlEntryBeforeOffset with messageOnly flag (test/bootstrap-message-only.test.ts)
  • All 556 existing tests pass (npm test)
  • Manual verification: restart gateway after maintain() runs, confirm bootstrap hits fast path (no reconcile log lines)

🤖 Generated with Claude Code

maintain() calls rewriteTranscriptEntries() which rewrites the JSONL via
branch-and-reappend, changing file size, mtime, and entry IDs. But the
conversation_bootstrap_state was only updated during bootstrap(), leaving
a stale checkpoint after every maintenance rewrite.

On the next gateway restart, bootstrap saw the stale checkpoint:
- Fast path 1 (exact size+mtime match) failed (file shrank)
- Fast path 2 (append-only, size > stored) failed (file shrank)
- Fell through to reconcileSessionTail() full content-based reconcile
- On conversations with many identical (role, content) pairs, the
  occurrence-counting anchor matched at the wrong position by coincidence
- Result: thousands of duplicate messages imported in one second
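
The fall-through described above can be sketched as a small decision function. This is a simplified illustration, not the repository's bootstrap(): the type and function names are hypothetical, and the tail-entry hash check that accompanies the checkpoint comparison is omitted.

```typescript
// Hypothetical types and names for illustration; not the PR's code.
interface Checkpoint {
  size: number;
  mtimeMs: number;
}

type BootstrapPath = "unchanged" | "append-only" | "reconcile";

function chooseBootstrapPath(
  stored: Checkpoint | undefined,
  current: Checkpoint,
): BootstrapPath {
  if (!stored) return "reconcile";
  // Fast path 1: exact size + mtime match, nothing to import.
  if (stored.size === current.size && stored.mtimeMs === current.mtimeMs) {
    return "unchanged";
  }
  // Fast path 2: file only grew, so read what was appended.
  if (current.size > stored.size) return "append-only";
  // File shrank or otherwise diverged: full content-based reconcile.
  return "reconcile";
}
```

A maintain() rewrite that shrinks the file fails both checks against the old checkpoint, which is why refreshing the checkpoint right after the rewrite keeps the next bootstrap on the first fast path.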

Three fixes:

1. Checkpoint update after maintain(): After a successful JSONL
   rewrite, update the bootstrap checkpoint so the next bootstrap()
   hits the fast path instead of falling through to reconcile.

2. messageOnly skip in readLastJsonlEntryBeforeOffset(): The function
   now accepts a messageOnly flag so it skips non-message JSONL entries
   (cache-ttl, tool-result, session-meta) when computing the tail
   entry hash for checkpoint comparison.

3. Import cap in reconcileSessionTail(): If reconcile would import
   more than 20% of the existing DB message count (minimum 50), abort
   and log an error instead of blindly importing.
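
As a rough sketch of the threshold in fix 3, assuming the cap is max(20% of the existing DB message count, 50) as stated above; the helper names are hypothetical and not taken from the diff:

```typescript
// Hypothetical helpers illustrating the import cap arithmetic.
function importCap(existingDbCount: number): number {
  // 20% of the existing message count, but never below 50.
  return Math.max(existingDbCount * 0.2, 50);
}

function exceedsImportCap(wouldImport: number, existingDbCount: number): boolean {
  return wouldImport > importCap(existingDbCount);
}
```

For a conversation with 100 stored messages the 50-message minimum applies, so importing 51 messages would abort; with 1,000 stored messages the cap is 200.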

Root cause of 19+ bootstrap flood events across March 29 - April 5, 2026.

Fixes Martian-Engineering#271
Relates to Martian-Engineering#268
Relates to Martian-Engineering#276

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@liu51115 liu51115 force-pushed the fix/bootstrap-flood-prevention branch from 5bf3b6b to 4e8c262 Compare April 5, 2026 17:35
Keep maintain() checkpoint writes aligned with bootstrap state semantics by normalizing the stored mtime, and prevent capped reconcile aborts from advancing bootstrap state as though the transcript had been fully processed. Add regression coverage for the unchanged maintain->bootstrap fast path and the capped reconcile retry path, plus the required patch changeset.

Regeneration-Prompt: |
  Address the two review findings on PR 280 without broadening scope. Make bootstrap checkpoint updates after transcript maintenance use the same timestamp semantics as the existing bootstrap path, and ensure the replay-safety import cap in reconcileSessionTail does not mark the transcript as fully processed when it aborts. Add targeted regression tests proving maintain() leaves the next unchanged bootstrap on the fast path and that a capped reconcile preserves the stale checkpoint so a later retry is still possible. Include the required patch changeset and rerun the repository's relevant test and packaging gates before pushing back to the contributor fork.
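
The second review finding, that a capped reconcile must not advance bootstrap state, can be sketched as follows. The types and the function name are hypothetical, and the mtime normalization mentioned above is left out because its exact form is not described here.

```typescript
// Hypothetical sketch; not the repository's actual state handling.
interface Checkpoint {
  size: number;
  mtimeMs: number;
}

interface ReconcileOutcome {
  aborted: boolean; // true when the import cap stopped the reconcile
  imported: number;
}

function checkpointAfterReconcile(
  previous: Checkpoint,
  current: Checkpoint,
  outcome: ReconcileOutcome,
): Checkpoint {
  // An aborted reconcile keeps the stale checkpoint so a later
  // bootstrap can retry instead of treating the file as processed.
  return outcome.aborted ? previous : current;
}
```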
@jalehman jalehman merged commit 9a2c3e1 into Martian-Engineering:main Apr 5, 2026
1 check passed
@github-actions github-actions Bot mentioned this pull request Apr 5, 2026
liu51115 pushed a commit to liu51115/lossless-claw that referenced this pull request Apr 7, 2026
Round-trip integration test: create conv → maintain() rewrites JSONL → bootstrap() → assert 0 re-imports. Also tests import cap on stale checkpoint.
Covers both PR Martian-Engineering#280 fixes (checkpoint update + import cap).
liu51115 pushed a commit to liu51115/lossless-claw that referenced this pull request Apr 7, 2026


Regression test covering both fixes from PR Martian-Engineering#280:
1. maintain() updates checkpoint after rewriteTranscriptEntries() — prevents stale checkpoint on restart
2. Import cap blocks mass re-imports when checkpoint is stale (>max(existingDbCount*0.2, 50))

Tests:
- Round-trip: create conv → maintain() → bootstrap() → assert 0 re-imports
- Import cap: corrupt checkpoint → append flood messages → assert cap blocks
- Defense-in-depth: both fixes working together
jalehman pushed a commit that referenced this pull request Apr 7, 2026
Regression test covering both fixes from PR #280:
1. maintain() updates checkpoint after rewriteTranscriptEntries() — prevents stale checkpoint on restart
2. Import cap blocks mass re-imports when checkpoint is stale (>max(existingDbCount*0.2, 50))

Tests:
- Round-trip: create conv → maintain() → bootstrap() → assert 0 re-imports
- Import cap: corrupt checkpoint → append flood messages → assert cap blocks
- Defense-in-depth: both fixes working together

Co-authored-by: Claw Liu <liu51115claw@brun.taild04815.ts.net>

Development

Successfully merging this pull request may close these issues.

reconcileSessionTail: exponential message accumulation when session contains many identical messages (empty assistant, NO_REPLY, generic tool outputs)

2 participants