Skip to content

fix: refresh bootstrap checkpoint after afterTurn message ingestion#387

Merged
jalehman merged 2 commits into
Martian-Engineering:mainfrom
oguzbilgic:fix/refresh-bootstrap-after-ingest
Apr 11, 2026
Merged

fix: refresh bootstrap checkpoint after afterTurn message ingestion#387
jalehman merged 2 commits into
Martian-Engineering:mainfrom
oguzbilgic:fix/refresh-bootstrap-after-ingest

Conversation

@oguzbilgic

Copy link
Copy Markdown
Contributor

Problem

The append-only bootstrap fast path (introduced in v0.7.0, PR #329) never works after real conversation turns. Every turn triggers a full JSONL transcript read instead.

Root cause: refreshBootstrapState() is called after heartbeat pruning and after maintain(), but never after regular message ingestion in afterTurn(). After ingesting new messages, the DB frontier advances past the stored checkpoint hash. The next bootstrap sees the mismatch and falls back to a full reconcile.

On a 4.4 MB session JSONL with constrained hardware (1-core LXC), this adds ~22 seconds per real conversation turn. Heartbeat-only turns are unaffected because pruning ingests zero messages and the checkpoint stays valid.

Fix

Add a refreshBootstrapState() call after successful message ingestion in afterTurn(), before compaction. The method, params, and plumbing all exist — it is already called on the heartbeat pruning path in the same function.

+8 lines, no new methods or interfaces.

Evidence

Logs from a persistent session (v0.8.0, ~2,500 messages):

Turn type Bootstrap path Time
Conversation Full reconcile 22,441ms
Conversation Full reconcile 21,883ms
Heartbeat Checkpoint hit 8ms
Conversation Full reconcile 21,997ms

100% of real conversation turns hit the slow path. 100% of heartbeat-only turns hit the fast path.

Regression scope

  • v0.5.3 and earlier: Not affected (used file-stat-only checkpoint, no hash verification)
  • v0.7.0+: Affected on every real conversation turn in every persistent session

Fixes #386

The append-only fast path introduced in v0.7.0 uses a DB message hash
to verify the bootstrap checkpoint. refreshBootstrapState() is called
after heartbeat pruning and after maintain(), but never after regular
message ingestion in afterTurn().

This means every real conversation turn advances the DB frontier past
the checkpoint hash, causing the next bootstrap to fall back to a full
JSONL transcript read. On large sessions this adds 20+ seconds per turn.

The fix adds a refreshBootstrapState() call after successful ingest,
before compaction, keeping the checkpoint aligned with the DB frontier.

Fixes Martian-Engineering#386

@GodsBoy GodsBoy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix/refresh-bootstrap-after-ingest

Placement: correct

The call is positioned after:

  1. ingestBatch() succeeds (line ~3470 on main) — if ingest fails, afterTurn returns early, so the checkpoint is never stale-advanced.
  2. The heartbeat-pruning early-return (line ~3512) — heartbeats that prune >0 messages already call refreshBootstrapState and return. If pruned === 0, or the batch is not a heartbeat, execution falls through.
  3. The conversation lookup guard (line ~3540) — conversation is guaranteed non-null.

And it runs before compaction evaluation, which is the correct order: checkpoint must reflect the new frontier before we decide whether to compact.

Error handling: solid

The try/catch with warn-level logging matches the exact pattern used in the heartbeat-pruning path (line ~3504) and the maintain() path. A checkpoint refresh failure is non-fatal — it just means the next bootstrap falls back to full reconcile, which is the pre-v0.7.0 behavior. Compaction is not blocked.

Codepath coverage analysis

afterTurn codepath refreshBootstrapState called? Notes
Empty batch (dedup clears all) N/A — returns before ingest No DB mutation, checkpoint still valid
Ingest failure N/A — returns before this code No DB mutation, checkpoint still valid
Heartbeat with pruned > 0 Yes — existing call Returns early, new code not reached (correct)
Heartbeat with pruned == 0 Yes — new code Falls through to non-heartbeat path, now fixed
Heartbeat with no conversation N/A — falls through, lookup misses, returns No meaningful frontier change
Normal conversation turn Yes — new code The primary fix

This looks comprehensive. The one subtle case worth noting: a heartbeat-looking batch where pruned === 0 still ingested messages via ingestBatch() above, so it does advance the frontier and does need the refresh. The fix covers this correctly.

Missing: regression test

There is no test included. The existing test suite does not appear to cover refreshBootstrapState at all (no hits in *.test.* files). A regression test would strengthen confidence, though the fix is mechanically simple (same pattern already used 4 other places in the same file). Consider adding one as a follow-up — something like:

  1. Ingest a batch via afterTurn() on a session with a valid bootstrap checkpoint.
  2. Assert that the bootstrap state's frontier hash matches the post-ingest DB state.
  3. Verify the next bootstrap() call takes the fast path (no full JSONL read).

Minor nit

The sessionLabel used in the log message (afterTurn: bootstrap checkpoint refresh failed for ${sessionLabel}) is the session=X sessionKey=Y format, whereas the heartbeat-pruning path uses sessionContext (which includes conversationId). Including conversationId in the log line would make debugging easier, since by this point conversation is resolved. Not blocking.

Verdict

The fix is correct, minimal, and follows existing patterns. Approving pending the author's call on whether to add a regression test in this PR or as a follow-up.

Add a regression test for the normal afterTurn-to-bootstrap append-only fast path and include a patch changeset for the user-visible performance fix in PR Martian-Engineering#387.

Regeneration-Prompt: |
  Follow up on lossless-claw PR Martian-Engineering#387 by addressing review findings only. Keep the code change narrow: add one direct regression test that proves a normal real-turn afterTurn refreshes the bootstrap checkpoint so the next bootstrap stays on the append-only fast path without reconcileSessionTail, and add a patch changeset because the fix changes user-visible runtime performance. Run focused engine tests for the new normal-turn case and the existing heartbeat checkpoint case before pushing back to the contributor branch if maintainer edits are allowed.
@jalehman

Copy link
Copy Markdown
Contributor

Thank you!

@jalehman jalehman merged commit 5113044 into Martian-Engineering:main Apr 11, 2026
1 check passed
@github-actions github-actions Bot mentioned this pull request Apr 11, 2026
@DBostik

DBostik commented Apr 11, 2026

Copy link
Copy Markdown

Will this fix the 10+ second pause after every message were the gateway logs freeze and then after 4-6 seconds it floods the logs with 6+ LCM messages every turn?

After v0.8.0 I noticed in telegram my agent will show the thinking animation immediately after being messaged but then it stops for 10+ seconds, then comes back until it responds later.

Seems to add some latency every turn.

@oguzbilgic oguzbilgic deleted the fix/refresh-bootstrap-after-ingest branch April 11, 2026 17:09
@jalehman

Copy link
Copy Markdown
Contributor

Yes, this should fix that. It's now out in the latest release.

@DBostik

DBostik commented Apr 11, 2026

Copy link
Copy Markdown

Yes, this should fix that. It's now out in the latest release.

Wow, what a massive improvement. Thanks!

GodsBoy added a commit to GodsBoy/lossless-claw that referenced this pull request Apr 12, 2026
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This
PR adds a second, independent layer of protection: an in-memory
file-level cache guard in bootstrap() that tracks the session file's
(size, mtime) from the last successful full read.

When the conversation is already bootstrapped and the JSONL file is
byte-for-byte unchanged since the last full read, the expensive
readLeafPathMessages() call is skipped entirely and the bootstrap
state is just refreshed. This guard fires even when both the
append-only checkpoint fast path and the hash-verified fast path fail
(hash corruption, cross-version token estimation drift, hash-format
changes, etc.), so a large session cannot silently regress back to
a 20+ second full-read cycle.

Cache invalidation:
- Set only after successful full reads (never after import-capped aborts)
- Cleared on session file rotation via purgeConversationForBootstrapRotation
- In-memory only — cleared on engine restart (worst case: one extra full read)

Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
GodsBoy added a commit to GodsBoy/lossless-claw that referenced this pull request Apr 12, 2026
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This
PR adds a second, independent layer of protection: an in-memory
file-level cache guard in bootstrap() that tracks the session file's
(size, mtime) from the last successful full read.

When the conversation is already bootstrapped and the JSONL file is
byte-for-byte unchanged since the last full read, the expensive
readLeafPathMessages() call is skipped entirely and the bootstrap
state is just refreshed. This guard fires even when both the
append-only checkpoint fast path and the hash-verified fast path fail
(hash corruption, cross-version token estimation drift, hash-format
changes, etc.), so a large session cannot silently regress back to
a 20+ second full-read cycle.

Cache invalidation:
- Set only after successful full reads (never after import-capped aborts)
- Cleared on session file rotation via purgeConversationForBootstrapRotation
- In-memory only — cleared on engine restart (worst case: one extra full read)

Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
jalehman pushed a commit to GodsBoy/lossless-claw that referenced this pull request Apr 14, 2026
The primary afterTurn checkpoint refresh for Martian-Engineering#386 landed in Martian-Engineering#387. This
PR adds a second, independent layer of protection: an in-memory
file-level cache guard in bootstrap() that tracks the session file's
(size, mtime) from the last successful full read.

When the conversation is already bootstrapped and the JSONL file is
byte-for-byte unchanged since the last full read, the expensive
readLeafPathMessages() call is skipped entirely and the bootstrap
state is just refreshed. This guard fires even when both the
append-only checkpoint fast path and the hash-verified fast path fail
(hash corruption, cross-version token estimation drift, hash-format
changes, etc.), so a large session cannot silently regress back to
a 20+ second full-read cycle.

Cache invalidation:
- Set only after successful full reads (never after import-capped aborts)
- Cleared on session file rotation via purgeConversationForBootstrapRotation
- In-memory only — cleared on engine restart (worst case: one extra full read)

Related to Martian-Engineering#386. The primary fix is Martian-Engineering#387; this is defence-in-depth.
jalehman added a commit that referenced this pull request Apr 14, 2026
* fix: add file-level bootstrap cache guard as defensive fallback

The primary afterTurn checkpoint refresh for #386 landed in #387. This
PR adds a second, independent layer of protection: an in-memory
file-level cache guard in bootstrap() that tracks the session file's
(size, mtime) from the last successful full read.

When the conversation is already bootstrapped and the JSONL file is
byte-for-byte unchanged since the last full read, the expensive
readLeafPathMessages() call is skipped entirely and the bootstrap
state is just refreshed. This guard fires even when both the
append-only checkpoint fast path and the hash-verified fast path fail
(hash corruption, cross-version token estimation drift, hash-format
changes, etc.), so a large session cannot silently regress back to
a 20+ second full-read cycle.

Cache invalidation:
- Set only after successful full reads (never after import-capped aborts)
- Cleared on session file rotation via purgeConversationForBootstrapRotation
- In-memory only — cleared on engine restart (worst case: one extra full read)

Related to #386. The primary fix is #387; this is defence-in-depth.

* chore: add changeset for bootstrap cache guard

Document the bootstrap fallback guard in release notes so PR #392 is merge-ready after the rebase onto origin/main.

Regeneration-Prompt: |
  Rebase the contributor branch for PR #392 onto current origin/main without changing the intended bootstrap fix. The code change should remain the in-memory file metadata guard that skips an expensive full transcript reread when checkpoint-based bootstrap fast paths miss on an unchanged session file. After rebasing, add the missing patch changeset required by this repo for user-visible behavior changes, run the test suite on the rebased branch, and keep unrelated local review artifacts out of the commit and push.

---------

Co-authored-by: Josh Lehman <josh@martian.engineering>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Append-only bootstrap path always fails: DB message hash never matches JSONL checkpoint

4 participants