Skip to content

fix(tui): stop compacting tool outputs on session save/load to preserve LLM cache#2388

Merged
Hmbown merged 1 commit into
Hmbown:mainfrom
LeoAlex0:main
May 31, 2026
Merged

fix(tui): stop compacting tool outputs on session save/load to preserve LLM cache#2388
Hmbown merged 1 commit into
Hmbown:mainfrom
LeoAlex0:main

Conversation

@LeoAlex0

@LeoAlex0 LeoAlex0 commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary / 概述

Remove compact_session_tool_outputs from session save/load paths to preserve message fidelity for LLM KV-cache hits on resume.

从 session 的 save/load 路径中移除 compact_session_tool_outputs,保持消息完整性以确保 resume 时 LLM KV-cache 命中。

Problem / 问题

Session compaction replaces large tool outputs (>12K chars) with compact receipts during save and load. This changes message content between sessions, breaking the LLM KV-cache prefix on resume and causing near-zero cache hit rates for continued conversations.

Session 持久化时,compact_session_tool_outputs 会将超过 12K 字符的大型 tool output 替换为简短的 receipt。这导致 save/load 后消息内容发生变化,LLM 的 KV-cache 前缀不再匹配,resume 时缓存命中率几乎为零。

Fix / 修复

  • Remove compaction from save_session, save_checkpoint, /save command

  • Remove compaction from load_session, load_checkpoint, /load command

  • Remove the now-unused compact_session_tool_outputs wrapper

  • Live compaction (compact_messages_for_persistence in tui/ui.rs) is unaffected — context budgets remain protected during active conversation

  • Remove unnecessary mut on loaded session variables

  • 移除 save_sessionsave_checkpoint/save 命令中的 compaction

  • 移除 load_sessionload_checkpoint/load 命令中的 compaction

  • 移除已无调用的 compact_session_tool_outputs 包装函数

  • 交互时的实时 compaction(tui/ui.rs 中的 compact_messages_for_persistence)不受影响,对话过程中的上下文预算保护仍然有效

  • 移除 loaded session 变量上不必要的 mut

Tradeoffs / 权衡

Pros / 优点:

  • Resumed sessions preserve exact message content → LLM prefix-cache hits on every continue
  • No silent content mutation across save/load round-trips
  • Simpler code: persistence layer stores what the user sees, context management stays in the engine

Resume 后的消息内容与原始完全一致 → 每次 continue 都能命中 LLM 前缀缓存;持久化层不再静默修改消息内容;代码更简洁。

Cons / 代价:

  • Session JSON files may grow larger for conversations with many large tool outputs
  • Resuming a session on a smaller-context model may trigger emergency compaction (recover_context_overflow) on the first turn — this is reactive but bounded by the existing cap_messages (500) and cycle-archive safety nets

包含大量 tool output 的会话 JSON 文件可能变大;切换到小上下文模型 resume 时首次发送可能触发紧急压缩(recover_context_overflow),但受现有 cap_messages(500) 和 cycle-archive 安全网保护,不会无限循环。


⚠️ Remaining Risk / 仍存在的风险: MAX_PERSISTED_MESSAGES

This PR fixes one source of cache invalidation, but a bigger one remains.

本次 PR 修复了缓存失效的一个来源,但一个更大的问题依然存在。

The problem / 问题分析

MAX_PERSISTED_MESSAGES = 500 (session_manager.rs:25) performs tail truncation when a session exceeds 500 messages — the oldest messages are silently dropped, while the most recent 500 are kept.

Since LLM KV-cache is prefix-based (keyed by the full message sequence from the start), dropping oldest messages changes the entire cache prefix. The first message after truncation becomes a new "first message", so the cache key is completely different. Every resume of a session with >500 messages results in a 100% cache miss — even worse than the compaction issue this PR fixes.

LLM 的 KV-cache 是按前缀(prefix)构建的,缓存键基于从第一条消息开始的完整序列。cap_messages 保留最新 500 条(尾部),丢弃最旧消息(头部)。截断后的第一条消息变成了新的"第一条",缓存前缀完全改变 → 100% 缓存失效,比本 PR 修复的 compaction 问题更严重。

Why 500 is arbitrary / 为什么 500 不合理

  • 500 messages can be well under the budget for large-context models (1M tokens) → wastes cache unnecessarily

  • 500 messages can already overflow small-context models (32K tokens) → does not actually protect them

  • The limit is driven by file-size concerns, not model capability

  • 对大上下文模型(1M tokens),500 条消息远未触及预算 → 白白浪费缓存

  • 对小上下文模型(32K tokens),500 条消息可能已经超了 → 实际上也保护不了

  • 这个限制来自文件大小的工程考量,而非模型能力

Recommended direction / 建议方向

Replace MAX_PERSISTED_MESSAGES with a token-budget-aware cap that uses the saved model's context window as the bound, or move truncation to load time so the same model on resume gets the full, cacheable message list. The persistence layer should store faithfully; context limits should be enforced at send time by the engine.

建议将 MAX_PERSISTED_MESSAGES 替换为基于 token 预算 的智能裁剪——使用模型上下文窗口作为上限,或在 load 时按当前模型预算动态裁剪。持久化层应忠实存储,上下文限制应由 engine 在发送时按模型能力执行。

Tests / 测试

  • Updated two existing tests to verify raw output preservation instead of compaction

  • cargo test -p codewhale-tui -- session_manager: 39 passed, 0 failed

  • 更新两个已有测试,验证原始输出保留而非压缩

  • cargo test -p codewhale-tui -- session_manager: 39 passed, 0 failed

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the session tool output compaction logic (compact_session_tool_outputs) during session saving, loading, and checkpointing. By preserving the raw, large tool outputs in the saved sessions instead of replacing them with tool output receipts, the application ensures better cache fidelity when resuming LLM sessions. The corresponding unit tests have been updated to verify that raw outputs are preserved. There are no review comments to address, so I have no additional feedback to provide.

@Hmbown

Hmbown commented May 31, 2026

Copy link
Copy Markdown
Owner

Thanks @LeoAlex0, this is a clean and useful cache-fidelity fix. I verified the branch merged cleanly on current main and ran: cargo fmt --all -- --check, git diff --check HEAD, python3 scripts/check-provider-registry.py, cargo test -p codewhale-tui -- session_manager --nocapture (39 passed), and cargo check -p codewhale-tui --all-features --locked.

I also read Greptile’s checkpoint-size concern. That is the real tradeoff here, but the PR body calls it out and the behavior matches the goal: persistence should preserve the exact message stream for reliable resume/KV-cache behavior, while live context compaction remains separate. Merging this now; appreciate the focused fix and the bilingual writeup.

@Hmbown Hmbown merged commit 85764b9 into Hmbown:main May 31, 2026
2 checks passed
@LeoAlex0

LeoAlex0 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

@Hmbown FYI: There's still a Remaining Risk: MAX_PERSISTED_MESSAGES performs tail truncation which will break ALL LLM cache, I've edit this PR message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants