feat(agent): deterministic tool-result pruning + cache-TTL-aware cold-resume maintenance#3968
Merged
Merged
Conversation
added 3 commits
June 10, 2026 22:43
Stale tool results are re-derivable (files re-read, commands re-run), so eliding them is a free, lossless alternative to the paid summarize fold. Prune runs only where a cache reset is already being paid: at the compact trigger, where it skips the fold entirely when eliding alone clears the threshold, and on resume after the provider prefix cache has expired (cacheColdAfter), where rewriting history costs no extra misses and directly shrinks the full-price first request. No message is ever removed, so tool_call/result pairing and signed reasoning stay intact by construction; originals are archived like fold drops.
…r comprehension Three real-API scenarios for the prune work: seed two identical-shape fat sessions, A/B the cold-restart miss tokens with and without pruning after the provider cache has expired, and verify the model re-reads a file behind a prune placeholder instead of answering from nothing.
…default Pruning a still-cached session costs ~4x the miss tokens of leaving it alone (measured warm-cache A/B), while a threshold that is too large only forgoes a free prune. Default to 24h until the cache-ttl probe pins the real retention; never set below it.
added 2 commits
June 10, 2026 23:18
…o/path-injection) The os.Stat mtime fallback fed a user-influenced path straight into a filesystem call. Branch meta is guaranteed for every session the controller has snapshotted, so the fallback only ever covered never-saved imports — those now skip one prune until their first snapshot creates the meta.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Multiple users report runaway token consumption on 1.x (#3098 closed: ~$20/hour; one commenter measured 18M input tokens on a single task at 99% cache hit; #3615 / #3007 / #3329 circle the same pain). Root cause analysis:
context_windowand the 0.8 trigger, a session effectively never folds — the prompt grows unbounded and every tool round resends all of it.So this does not fold earlier (compression-eval already showed early folding costs more). It shrinks the prompt deterministically, and schedules rewrites into windows where the cache is already cold so they cost nothing.
What
Prune primitive (
internal/agent/prune.go): replace theContentof tool results older than the protected recent tail with a one-line placeholder. Deterministic, zero LLM calls, originals archived first. No message is ever removed, sotool_calls/tool_call_idpairing cannot break and assistant messages (including signed reasoning) are never touched.Two trigger points, both at moments where a cache reset is already being paid:
maybeCompact): prune first; skip the paid summarize call entirely when eliding alone clears the trigger.Controller.Resume): when idle time (branch-metaUpdatedAt) exceedscacheColdAfter, the cache is gone anyway — pruning is free in cache terms and directly shrinks the full-price first request. Pruned transcript is snapshotted atomically; a crash before the snapshot just replays the prune on next resume (idempotent).Between maintenance points the history stays strictly append-only; warm sessions are never rewritten.
Real-API validation (deepseek-v4-flash)
read_fileand answered with the exact value in every trial — no hallucination (benchmarks/context-maintenance-e2e comprehension).cacheColdAfterdefaults to a risk-asymmetric 24h (too small burns a live cache; too large only forgoes a free prune).benchmarks/cache-ttl-probe): unique ~18k-token prefixes re-probed after idle intervals; full hits through 30 min so far, longer intervals still collecting. Follow-up will tightencacheColdAfterfrom the measured retention and add the cold-resume A/B numbers (two 88k sessions are already seeded and going cold).Tests
compact_loop_e2e_test.go): a 20-turn tool-heavy session now stays bounded by pruning alone — zero paid folds, stuck-guard never trips; a new assistant-text-heavy variant keeps the fold path under regression.go test ./...green exceptTestModelSwitchRefreshesCustomStatusline, which fails identically on a pristineorigin/main-v2checkout in this environment (upstream, unrelated).