fix: add file-level bootstrap cache guard as defensive fallback#392
Conversation
2e689ba to
5050b7d
Compare
|
@jalehman — rebased on top of TL;DR on why it's still worth merging as a separate, smaller PR:
The two signals are doing different jobs:
Either alone is enough for the happy path, but only the combination survives a hash-format change without regressing performance. It's ~30 lines + 2 focused tests and no new surface area. If you'd rather land this as part of a broader hardening sweep (there are related issues #390 and #391 from the same investigation), happy to fold them together instead — just let me know which way you'd prefer. |
5050b7d to
f611c68
Compare
|
Ran a full review pass against the rebased branch and caught two bugs I introduced during the rebase resolution. Fixed and re-pushed: 1. 2. Also applied:
Final diff: 42 lines in The remaining review findings are acknowledged but not acted on in this PR:
Let me know if you want any of those folded in. |
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This PR adds a second, independent layer of protection: an in-memory file-level cache guard in bootstrap() that tracks the session file's (size, mtime) from the last successful full read. When the conversation is already bootstrapped and the JSONL file is byte-for-byte unchanged since the last full read, the expensive readLeafPathMessages() call is skipped entirely and the bootstrap state is just refreshed. This guard fires even when both the append-only checkpoint fast path and the hash-verified fast path fail (hash corruption, cross-version token estimation drift, hash-format changes, etc.), so a large session cannot silently regress back to a 20+ second full-read cycle. Cache invalidation: - Set only after successful full reads (never after import-capped aborts) - Cleared on session file rotation via purgeConversationForBootstrapRotation - In-memory only — cleared on engine restart (worst case: one extra full read) Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
Document the bootstrap fallback guard in release notes so PR Martian-Engineering#392 is merge-ready after the rebase onto origin/main. Regeneration-Prompt: | Rebase the contributor branch for PR Martian-Engineering#392 onto current origin/main without changing the intended bootstrap fix. The code change should remain the in-memory file metadata guard that skips an expensive full transcript reread when checkpoint-based bootstrap fast paths miss on an unchanged session file. After rebasing, add the missing patch changeset required by this repo for user-visible behavior changes, run the test suite on the rebased branch, and keep unrelated local review artifacts out of the commit and push.
f611c68 to
8598c04
Compare
|
Thank you! |
Context
The primary fix for #386 (missing
refreshBootstrapState()call afterafterTurningestion) landed in #387. This PR has been rebased on top of that merge and now only contains the second, independent layer of protection.Problem
Even with #387 merged, the append-only fast path can still fall through to a full
readLeafPathMessages()call under conditions that are invisible to the checkpoint hash:When this happens on a large (10+ MB, 2,500+ message) session, every bootstrap re-reads the entire transcript — ~22-50 seconds per call — even though the file on disk has not changed and zero new messages will be imported. Evidence from production logs:
Fix
Add an in-memory
Map<conversationId, { size, mtimeMs }>that tracks file metadata from the last successful full read. Before falling through toreadLeafPathMessages(), check: if the conversation is already bootstrapped and the file metadata matches the cached state, skip the full read and just refresh the bootstrap state.This is a defence-in-depth layer on top of #387. #387 keeps the happy path fast; this guard prevents pathological cases from silently regressing to 20-50s full reads.
Cache invalidation
purgeConversationForBootstrapRotationWhy this is still worth merging after #387
The checkpoint hash is a correctness signal ("did the DB frontier change"). The file (size, mtime) is an identity signal ("did the source of truth change"). Both signals agreeing is a much stronger skip condition than either alone, and it's the only one that survives hash-format changes.
Tests
Two regression tests cover the cache guard:
skips full read when file is unchanged and conversation is already bootstrapped— verifies the guard fires when the file metadata matchesfile-level cache guard allows full read when file changes— verifies the guard correctly permits full reads when the file growsThe
refreshes bootstrap checkpoint after afterTurntest from the original PR was removed because #387 already landed equivalent coverage.All 146 engine tests pass.
Related
Post-Deploy Monitoring & Validation
bootstrap: skipped full read (file unchanged)— presence confirms the cache guard is firing on unchanged filesbootstrap: full transcript read— frequency should drop toward zero on persistent sessions; any sustained presence after fix: refresh bootstrap checkpoint after afterTurn message ingestion #387 + this PR indicates a new bootstrap path regression