fix(tui): stop compacting tool outputs on session save/load to preserve LLM cache#2388
Conversation
There was a problem hiding this comment.
Code Review
This pull request removes the session tool output compaction logic (compact_session_tool_outputs) during session saving, loading, and checkpointing. By preserving the raw, large tool outputs in the saved sessions instead of replacing them with tool output receipts, the application ensures better cache fidelity when resuming LLM sessions. The corresponding unit tests have been updated to verify that raw outputs are preserved. There are no review comments to address, so I have no additional feedback to provide.
|
Thanks @LeoAlex0, this is a clean and useful cache-fidelity fix. I verified the branch merged cleanly on current main and ran: I also read Greptile’s checkpoint-size concern. That is the real tradeoff here, but the PR body calls it out and the behavior matches the goal: persistence should preserve the exact message stream for reliable resume/KV-cache behavior, while live context compaction remains separate. Merging this now; appreciate the focused fix and the bilingual writeup. |
|
@Hmbown FYI: There's still a Remaining Risk: MAX_PERSISTED_MESSAGES performs tail truncation which will break ALL LLM cache, I've edit this PR message. |
Summary / 概述
Remove
compact_session_tool_outputsfrom session save/load paths to preserve message fidelity for LLM KV-cache hits on resume.从 session 的 save/load 路径中移除
compact_session_tool_outputs,保持消息完整性以确保 resume 时 LLM KV-cache 命中。Problem / 问题
Session compaction replaces large tool outputs (>12K chars) with compact receipts during save and load. This changes message content between sessions, breaking the LLM KV-cache prefix on resume and causing near-zero cache hit rates for continued conversations.
Session 持久化时,
compact_session_tool_outputs会将超过 12K 字符的大型 tool output 替换为简短的 receipt。这导致 save/load 后消息内容发生变化,LLM 的 KV-cache 前缀不再匹配,resume 时缓存命中率几乎为零。Fix / 修复
Remove compaction from
save_session,save_checkpoint,/savecommandRemove compaction from
load_session,load_checkpoint,/loadcommandRemove the now-unused
compact_session_tool_outputswrapperLive compaction (
compact_messages_for_persistencein tui/ui.rs) is unaffected — context budgets remain protected during active conversationRemove unnecessary
muton loaded session variables移除
save_session、save_checkpoint、/save命令中的 compaction移除
load_session、load_checkpoint、/load命令中的 compaction移除已无调用的
compact_session_tool_outputs包装函数交互时的实时 compaction(
tui/ui.rs中的compact_messages_for_persistence)不受影响,对话过程中的上下文预算保护仍然有效移除 loaded session 变量上不必要的
mutTradeoffs / 权衡
Pros / 优点:
Resume 后的消息内容与原始完全一致 → 每次 continue 都能命中 LLM 前缀缓存;持久化层不再静默修改消息内容;代码更简洁。
Cons / 代价:
recover_context_overflow) on the first turn — this is reactive but bounded by the existingcap_messages(500) and cycle-archive safety nets包含大量 tool output 的会话 JSON 文件可能变大;切换到小上下文模型 resume 时首次发送可能触发紧急压缩(
recover_context_overflow),但受现有cap_messages(500) 和 cycle-archive 安全网保护,不会无限循环。MAX_PERSISTED_MESSAGESThis PR fixes one source of cache invalidation, but a bigger one remains.
本次 PR 修复了缓存失效的一个来源,但一个更大的问题依然存在。
The problem / 问题分析
MAX_PERSISTED_MESSAGES = 500(session_manager.rs:25) performs tail truncation when a session exceeds 500 messages — the oldest messages are silently dropped, while the most recent 500 are kept.Since LLM KV-cache is prefix-based (keyed by the full message sequence from the start), dropping oldest messages changes the entire cache prefix. The first message after truncation becomes a new "first message", so the cache key is completely different. Every resume of a session with >500 messages results in a 100% cache miss — even worse than the compaction issue this PR fixes.
LLM 的 KV-cache 是按前缀(prefix)构建的,缓存键基于从第一条消息开始的完整序列。
cap_messages保留最新 500 条(尾部),丢弃最旧消息(头部)。截断后的第一条消息变成了新的"第一条",缓存前缀完全改变 → 100% 缓存失效,比本 PR 修复的 compaction 问题更严重。Why 500 is arbitrary / 为什么 500 不合理
500 messages can be well under the budget for large-context models (1M tokens) → wastes cache unnecessarily
500 messages can already overflow small-context models (32K tokens) → does not actually protect them
The limit is driven by file-size concerns, not model capability
对大上下文模型(1M tokens),500 条消息远未触及预算 → 白白浪费缓存
对小上下文模型(32K tokens),500 条消息可能已经超了 → 实际上也保护不了
这个限制来自文件大小的工程考量,而非模型能力
Recommended direction / 建议方向
Replace
MAX_PERSISTED_MESSAGESwith a token-budget-aware cap that uses the saved model's context window as the bound, or move truncation to load time so the same model on resume gets the full, cacheable message list. The persistence layer should store faithfully; context limits should be enforced at send time by the engine.建议将
MAX_PERSISTED_MESSAGES替换为基于 token 预算 的智能裁剪——使用模型上下文窗口作为上限,或在 load 时按当前模型预算动态裁剪。持久化层应忠实存储,上下文限制应由 engine 在发送时按模型能力执行。Tests / 测试
Updated two existing tests to verify raw output preservation instead of compaction
cargo test -p codewhale-tui -- session_manager: 39 passed, 0 failed更新两个已有测试,验证原始输出保留而非压缩
cargo test -p codewhale-tui -- session_manager: 39 passed, 0 failed