Skip to content

fix: add file-level bootstrap cache guard as defensive fallback#392

Merged
jalehman merged 2 commits into
Martian-Engineering:mainfrom
GodsBoy:fix/bootstrap-checkpoint-refresh-and-cache-guard
Apr 14, 2026
Merged

fix: add file-level bootstrap cache guard as defensive fallback#392
jalehman merged 2 commits into
Martian-Engineering:mainfrom
GodsBoy:fix/bootstrap-checkpoint-refresh-and-cache-guard

Conversation

@GodsBoy

@GodsBoy GodsBoy commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Context

The primary fix for #386 (missing refreshBootstrapState() call after afterTurn ingestion) landed in #387. This PR has been rebased on top of that merge and now only contains the second, independent layer of protection.

Problem

Even with #387 merged, the append-only fast path can still fall through to a full readLeafPathMessages() call under conditions that are invisible to the checkpoint hash:

  • Token estimator drift across versions invalidating the stored hash
  • Hash format changes or corruption
  • Any edge case where the checkpoint hash stops matching a byte-identical file

When this happens on a large (10+ MB, 2,500+ message) session, every bootstrap re-reads the entire transcript — ~22-50 seconds per call — even though the file on disk has not changed and zero new messages will be imported. Evidence from production logs:

[lcm] bootstrap: full transcript read conversation=213 ... duration=48322ms
[lcm] bootstrap: done ... importedMessages=0 reason=already bootstrapped duration=49382ms

Fix

Add an in-memory Map<conversationId, { size, mtimeMs }> that tracks file metadata from the last successful full read. Before falling through to readLeafPathMessages(), check: if the conversation is already bootstrapped and the file metadata matches the cached state, skip the full read and just refresh the bootstrap state.

This is a defence-in-depth layer on top of #387. #387 keeps the happy path fast; this guard prevents pathological cases from silently regressing to 20-50s full reads.

Cache invalidation

  • Set only after successful full reads (never after import-capped aborts)
  • Cleared on session file rotation via purgeConversationForBootstrapRotation
  • In-memory only — cleared on engine restart (worst case: one extra full read on startup)

Why this is still worth merging after #387

Scenario #387 alone #387 + this PR
Normal real-turn bootstrap ✅ Fast path (checkpoint hit) ✅ Fast path (checkpoint hit)
Checkpoint hash mismatch on unchanged file ❌ Full re-read (~30s) ✅ Cache guard skips full read
Version upgrade that drifts token estimator ❌ Full re-read per session ✅ Single full read, then cached
Engine restart ⚠️ One full read per session (expected) ⚠️ One full read per session (expected)

The checkpoint hash is a correctness signal ("did the DB frontier change"). The file (size, mtime) is an identity signal ("did the source of truth change"). Both signals agreeing is a much stronger skip condition than either alone, and it's the only one that survives hash-format changes.

Tests

Two regression tests cover the cache guard:

  • skips full read when file is unchanged and conversation is already bootstrapped — verifies the guard fires when the file metadata matches
  • file-level cache guard allows full read when file changes — verifies the guard correctly permits full reads when the file grows

The refreshes bootstrap checkpoint after afterTurn test from the original PR was removed because #387 already landed equivalent coverage.

All 146 engine tests pass.

Related

Post-Deploy Monitoring & Validation

  • Log query (guard active): bootstrap: skipped full read (file unchanged) — presence confirms the cache guard is firing on unchanged files
  • Log query (fallback): bootstrap: full transcript read — frequency should drop toward zero on persistent sessions; any sustained presence after fix: refresh bootstrap checkpoint after afterTurn message ingestion #387 + this PR indicates a new bootstrap path regression
  • Expected healthy signal: Bootstrap duration <1s on unchanged large sessions, even when the append-only fast path misses
  • Failure signal: If bootstrap duration stays in the 20-50s range on unchanged sessions, investigate whether cache invalidation is firing too aggressively
  • Validation window: 24 hours of normal activity on sessions with 500+ messages and >5MB JSONL files

@GodsBoy GodsBoy force-pushed the fix/bootstrap-checkpoint-refresh-and-cache-guard branch from 2e689ba to 5050b7d Compare April 12, 2026 07:15
@GodsBoy GodsBoy changed the title fix: refresh bootstrap checkpoint after afterTurn and add file-level cache guard fix: add file-level bootstrap cache guard as defensive fallback Apr 12, 2026
@GodsBoy

GodsBoy commented Apr 12, 2026

Copy link
Copy Markdown
Contributor Author

@jalehman — rebased on top of main now that #387 has merged. The duplicate checkpoint-refresh commit is gone; this PR has been narrowed to just the file-level cache guard.

TL;DR on why it's still worth merging as a separate, smaller PR:

  • fix: refresh bootstrap checkpoint after afterTurn message ingestion #387 fixes the common case: afterTurn now keeps the checkpoint aligned with the DB frontier, so the append-only fast path works on every real turn.
  • This PR fixes the pathological case: when the checkpoint hash stops matching a byte-identical file (token-estimator drift across versions, hash-format changes, corruption), bootstrap silently falls through to a 20-50s full readLeafPathMessages(). The (size, mtime) guard short-circuits that, so the worst case becomes "one full read after an upgrade" instead of "a full read every turn".

The two signals are doing different jobs:

  • Checkpoint hash = "did the DB frontier change" (correctness)
  • File (size, mtime) = "did the source of truth change" (identity)

Either alone is enough for the happy path, but only the combination survives a hash-format change without regressing performance. It's ~30 lines + 2 focused tests and no new surface area.

If you'd rather land this as part of a broader hardening sweep (there are related issues #390 and #391 from the same investigation), happy to fold them together instead — just let me know which way you'd prefer.

@GodsBoy

GodsBoy commented Apr 12, 2026

Copy link
Copy Markdown
Contributor Author

Ran a full review pass against the rebased branch and caught two bugs I introduced during the rebase resolution. Fixed and re-pushed:

1. afterTurn refreshBootstrapState was in the wrong position. When I resolved the rebase conflict, I kept the copy from the original PR (earlier, before the heartbeat-ack short-circuit) and deleted the copy from #387 (later, after the conversation lookup miss check). Net effect: I was silently relocating #387's code into a new position. Reverted — the diff against main now shows zero changes to afterTurn.

2. purgeConversationForBootstrapRotation became dead code. The original PR had a caller at the rotation-detection branch in bootstrap(), but the rebase auto-merge dropped it and I didn't notice. Rather than restore a 50-line cascade-delete method with a single use site, I inlined the one line that matters for this PR — this.lastFullReadFileState.delete(conversationId) — directly at the rotation branch. The invariant "cache is cleared on rotation" is now load-bearing in code rather than coincidentally true via a downstream overwrite.

Also applied:

  • Tightened the "file changes" test assertion from toBeGreaterThanOrEqual(0) (vacuous) to toBeGreaterThanOrEqual(1), and added a negative assertion that the "skipped full read" log did not fire. Regression that caused the full read to return 0 messages would now actually fail.
  • Added an explanatory comment on the rotation-branch invalidation so the reason the cache-clear belongs there is obvious to the next reader.

Final diff: 42 lines in src/engine.ts, 151 in test/engine.test.ts, down from 250/11 previously. All 146 engine tests pass.

The remaining review findings are acknowledged but not acted on in this PR:

  • The guard still calls persistBootstrapState on cache-hit, which does one extra DB SELECT + UPSERT. Measurable but the 20-50s readLeafPathMessages skip dominates. Happy to remove it in a follow-up if you'd like the fast path to be truly zero-I/O.
  • Unbounded growth of lastFullReadFileState. ~40 bytes per conversation, P3 at worst. Can add an LRU cap if desired.
  • More negative tests (engine restart, existingCount === 0, bootstrappedAt === null) would lock the gate more tightly.

Let me know if you want any of those folded in.

@jalehman jalehman self-assigned this Apr 14, 2026
GodsBoy and others added 2 commits April 14, 2026 09:05
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This
PR adds a second, independent layer of protection: an in-memory
file-level cache guard in bootstrap() that tracks the session file's
(size, mtime) from the last successful full read.

When the conversation is already bootstrapped and the JSONL file is
byte-for-byte unchanged since the last full read, the expensive
readLeafPathMessages() call is skipped entirely and the bootstrap
state is just refreshed. This guard fires even when both the
append-only checkpoint fast path and the hash-verified fast path fail
(hash corruption, cross-version token estimation drift, hash-format
changes, etc.), so a large session cannot silently regress back to
a 20+ second full-read cycle.

Cache invalidation:
- Set only after successful full reads (never after import-capped aborts)
- Cleared on session file rotation via purgeConversationForBootstrapRotation
- In-memory only — cleared on engine restart (worst case: one extra full read)

Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
Document the bootstrap fallback guard in release notes so PR Martian-Engineering#392 is merge-ready after the rebase onto origin/main.

Regeneration-Prompt: |
  Rebase the contributor branch for PR Martian-Engineering#392 onto current origin/main without changing the intended bootstrap fix. The code change should remain the in-memory file metadata guard that skips an expensive full transcript reread when checkpoint-based bootstrap fast paths miss on an unchanged session file. After rebasing, add the missing patch changeset required by this repo for user-visible behavior changes, run the test suite on the rebased branch, and keep unrelated local review artifacts out of the commit and push.
@jalehman jalehman force-pushed the fix/bootstrap-checkpoint-refresh-and-cache-guard branch from f611c68 to 8598c04 Compare April 14, 2026 16:06
@jalehman

Copy link
Copy Markdown
Contributor

Thank you!

@jalehman jalehman merged commit 00d1fa2 into Martian-Engineering:main Apr 14, 2026
1 check passed
@github-actions github-actions Bot mentioned this pull request Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Perf] Add file-level cache guard to skip redundant full bootstrap reads when JSONL unchanged

2 participants