fix: refresh bootstrap checkpoint after afterTurn message ingestion#387
Conversation
The append-only fast path introduced in v0.7.0 uses a DB message hash to verify the bootstrap checkpoint. refreshBootstrapState() is called after heartbeat pruning and after maintain(), but never after regular message ingestion in afterTurn(). This means every real conversation turn advances the DB frontier past the checkpoint hash, causing the next bootstrap to fall back to a full JSONL transcript read. On large sessions this adds 20+ seconds per turn. The fix adds a refreshBootstrapState() call after successful ingest, before compaction, keeping the checkpoint aligned with the DB frontier. Fixes Martian-Engineering#386
GodsBoy
left a comment
There was a problem hiding this comment.
Review: fix/refresh-bootstrap-after-ingest
Placement: correct
The call is positioned after:
ingestBatch()succeeds (line ~3470 on main) — if ingest fails,afterTurnreturns early, so the checkpoint is never stale-advanced.- The heartbeat-pruning early-return (line ~3512) — heartbeats that prune >0 messages already call
refreshBootstrapStateand return. Ifpruned === 0, or the batch is not a heartbeat, execution falls through. - The
conversationlookup guard (line ~3540) —conversationis guaranteed non-null.
And it runs before compaction evaluation, which is the correct order: checkpoint must reflect the new frontier before we decide whether to compact.
Error handling: solid
The try/catch with warn-level logging matches the exact pattern used in the heartbeat-pruning path (line ~3504) and the maintain() path. A checkpoint refresh failure is non-fatal — it just means the next bootstrap falls back to full reconcile, which is the pre-v0.7.0 behavior. Compaction is not blocked.
Codepath coverage analysis
| afterTurn codepath | refreshBootstrapState called? | Notes |
|---|---|---|
| Empty batch (dedup clears all) | N/A — returns before ingest | No DB mutation, checkpoint still valid |
| Ingest failure | N/A — returns before this code | No DB mutation, checkpoint still valid |
| Heartbeat with pruned > 0 | Yes — existing call | Returns early, new code not reached (correct) |
| Heartbeat with pruned == 0 | Yes — new code | Falls through to non-heartbeat path, now fixed |
| Heartbeat with no conversation | N/A — falls through, lookup misses, returns | No meaningful frontier change |
| Normal conversation turn | Yes — new code | The primary fix |
This looks comprehensive. The one subtle case worth noting: a heartbeat-looking batch where pruned === 0 still ingested messages via ingestBatch() above, so it does advance the frontier and does need the refresh. The fix covers this correctly.
Missing: regression test
There is no test included. The existing test suite does not appear to cover refreshBootstrapState at all (no hits in *.test.* files). A regression test would strengthen confidence, though the fix is mechanically simple (same pattern already used 4 other places in the same file). Consider adding one as a follow-up — something like:
- Ingest a batch via
afterTurn()on a session with a valid bootstrap checkpoint. - Assert that the bootstrap state's frontier hash matches the post-ingest DB state.
- Verify the next
bootstrap()call takes the fast path (no full JSONL read).
Minor nit
The sessionLabel used in the log message (afterTurn: bootstrap checkpoint refresh failed for ${sessionLabel}) is the session=X sessionKey=Y format, whereas the heartbeat-pruning path uses sessionContext (which includes conversationId). Including conversationId in the log line would make debugging easier, since by this point conversation is resolved. Not blocking.
Verdict
The fix is correct, minimal, and follows existing patterns. Approving pending the author's call on whether to add a regression test in this PR or as a follow-up.
Add a regression test for the normal afterTurn-to-bootstrap append-only fast path and include a patch changeset for the user-visible performance fix in PR Martian-Engineering#387. Regeneration-Prompt: | Follow up on lossless-claw PR Martian-Engineering#387 by addressing review findings only. Keep the code change narrow: add one direct regression test that proves a normal real-turn afterTurn refreshes the bootstrap checkpoint so the next bootstrap stays on the append-only fast path without reconcileSessionTail, and add a patch changeset because the fix changes user-visible runtime performance. Run focused engine tests for the new normal-turn case and the existing heartbeat checkpoint case before pushing back to the contributor branch if maintainer edits are allowed.
|
Thank you! |
|
Will this fix the 10+ second pause after every message were the gateway logs freeze and then after 4-6 seconds it floods the logs with 6+ LCM messages every turn? After v0.8.0 I noticed in telegram my agent will show the thinking animation immediately after being messaged but then it stops for 10+ seconds, then comes back until it responds later. Seems to add some latency every turn. |
|
Yes, this should fix that. It's now out in the latest release. |
Wow, what a massive improvement. Thanks! |
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This PR adds a second, independent layer of protection: an in-memory file-level cache guard in bootstrap() that tracks the session file's (size, mtime) from the last successful full read. When the conversation is already bootstrapped and the JSONL file is byte-for-byte unchanged since the last full read, the expensive readLeafPathMessages() call is skipped entirely and the bootstrap state is just refreshed. This guard fires even when both the append-only checkpoint fast path and the hash-verified fast path fail (hash corruption, cross-version token estimation drift, hash-format changes, etc.), so a large session cannot silently regress back to a 20+ second full-read cycle. Cache invalidation: - Set only after successful full reads (never after import-capped aborts) - Cleared on session file rotation via purgeConversationForBootstrapRotation - In-memory only — cleared on engine restart (worst case: one extra full read) Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This PR adds a second, independent layer of protection: an in-memory file-level cache guard in bootstrap() that tracks the session file's (size, mtime) from the last successful full read. When the conversation is already bootstrapped and the JSONL file is byte-for-byte unchanged since the last full read, the expensive readLeafPathMessages() call is skipped entirely and the bootstrap state is just refreshed. This guard fires even when both the append-only checkpoint fast path and the hash-verified fast path fail (hash corruption, cross-version token estimation drift, hash-format changes, etc.), so a large session cannot silently regress back to a 20+ second full-read cycle. Cache invalidation: - Set only after successful full reads (never after import-capped aborts) - Cleared on session file rotation via purgeConversationForBootstrapRotation - In-memory only — cleared on engine restart (worst case: one extra full read) Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This PR adds a second, independent layer of protection: an in-memory file-level cache guard in bootstrap() that tracks the session file's (size, mtime) from the last successful full read. When the conversation is already bootstrapped and the JSONL file is byte-for-byte unchanged since the last full read, the expensive readLeafPathMessages() call is skipped entirely and the bootstrap state is just refreshed. This guard fires even when both the append-only checkpoint fast path and the hash-verified fast path fail (hash corruption, cross-version token estimation drift, hash-format changes, etc.), so a large session cannot silently regress back to a 20+ second full-read cycle. Cache invalidation: - Set only after successful full reads (never after import-capped aborts) - Cleared on session file rotation via purgeConversationForBootstrapRotation - In-memory only — cleared on engine restart (worst case: one extra full read) Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
* fix: add file-level bootstrap cache guard as defensive fallback The primary afterTurn checkpoint refresh for #386 landed in #387. This PR adds a second, independent layer of protection: an in-memory file-level cache guard in bootstrap() that tracks the session file's (size, mtime) from the last successful full read. When the conversation is already bootstrapped and the JSONL file is byte-for-byte unchanged since the last full read, the expensive readLeafPathMessages() call is skipped entirely and the bootstrap state is just refreshed. This guard fires even when both the append-only checkpoint fast path and the hash-verified fast path fail (hash corruption, cross-version token estimation drift, hash-format changes, etc.), so a large session cannot silently regress back to a 20+ second full-read cycle. Cache invalidation: - Set only after successful full reads (never after import-capped aborts) - Cleared on session file rotation via purgeConversationForBootstrapRotation - In-memory only — cleared on engine restart (worst case: one extra full read) Related to #386. The primary fix is #387; this is defence-in-depth. * chore: add changeset for bootstrap cache guard Document the bootstrap fallback guard in release notes so PR #392 is merge-ready after the rebase onto origin/main. Regeneration-Prompt: | Rebase the contributor branch for PR #392 onto current origin/main without changing the intended bootstrap fix. The code change should remain the in-memory file metadata guard that skips an expensive full transcript reread when checkpoint-based bootstrap fast paths miss on an unchanged session file. After rebasing, add the missing patch changeset required by this repo for user-visible behavior changes, run the test suite on the rebased branch, and keep unrelated local review artifacts out of the commit and push. --------- Co-authored-by: Josh Lehman <josh@martian.engineering>
Problem
The append-only bootstrap fast path (introduced in v0.7.0, PR #329) never works after real conversation turns. Every turn triggers a full JSONL transcript read instead.
Root cause:
refreshBootstrapState()is called after heartbeat pruning and aftermaintain(), but never after regular message ingestion inafterTurn(). After ingesting new messages, the DB frontier advances past the stored checkpoint hash. The next bootstrap sees the mismatch and falls back to a full reconcile.On a 4.4 MB session JSONL with constrained hardware (1-core LXC), this adds ~22 seconds per real conversation turn. Heartbeat-only turns are unaffected because pruning ingests zero messages and the checkpoint stays valid.
Fix
Add a
refreshBootstrapState()call after successful message ingestion inafterTurn(), before compaction. The method, params, and plumbing all exist — it is already called on the heartbeat pruning path in the same function.+8 lines, no new methods or interfaces.
Evidence
Logs from a persistent session (v0.8.0, ~2,500 messages):
100% of real conversation turns hit the slow path. 100% of heartbeat-only turns hit the fast path.
Regression scope
Fixes #386