Bug Description
When context compression triggers during a session that contains large tool results (terminal output, git logs, file diffs, search results), the tail protection mechanism correctly preserves recent messages by token budget — but the budget is almost entirely consumed by the large tool outputs. The user's actual conversation messages (questions, instructions, task context) get pushed out of the tail and into the compressed summary region.
The result feels like "the conversation disappeared": after compression, the agent sees tool call history but loses the conversational context of what was being discussed. The user's most recent messages are buried in the LLM-generated summary rather than being in the active context window.
Technical Details
How the tail budget works
In agent/context_compressor.py, _find_tail_cut_by_tokens() walks backward from the end of the message list, accumulating tokens until tail_token_budget is reached:
# Derived budgets (128K context model example):
tail_token_budget = threshold_tokens × summary_target_ratio
= 64000 × 0.20
= ~12,800 tokens
# protect_last_n = 20 (hard minimum floor)
The problem
- A single large
terminal tool result (e.g. npm test output, build log, git diff) can easily be 3,000-8,000+ tokens
- The backward walk accumulates these large tool results first (they're at the end)
- After 2-3 large tool results, the entire ~12.8K token budget is exhausted
- The boundary (
cut_idx) is placed such that user/assistant conversation messages just before those tool results fall into the "middle" region — which gets summarized
_ensure_last_user_message_in_tail() only protects the single most recent user message; earlier but still-recent user messages and assistant responses are lost to the summary
Concrete scenario
messages[40]: user: "Now run the test suite and fix any failures" ← pushed to summary
messages[41]: assistant: "Running tests..." ← pushed to summary
messages[42]: tool: [terminal] npm test → 5000 lines of output ← in tail (8K tokens)
messages[43]: assistant: "3 tests failed, fixing..." ← pushed to summary
messages[44]: tool: [terminal] npm test → 5000 lines of output ← in tail (6K tokens)
messages[45]: user: "Also check the lint warnings" ← in tail (barely)
After compression, the agent sees ~14K tokens of test output but has lost the conversational thread about why tests were being run and what was being fixed.
Code References
agent/context_compressor.py — _find_tail_cut_by_tokens() (~line 420): backward walk with token budget
agent/context_compressor.py — _prune_old_tool_results(): pre-pass pruning only affects messages outside the tail boundary
agent/context_compressor.py — _ensure_last_user_message_in_tail(): only anchors the last user message
tail_token_budget derived in __init__(): int(threshold_tokens * summary_target_ratio)
Proposed Solution: Pre-truncate Tool Results in Tail Before Budget Calculation
Option A — Conversation message floor (complementary):
Add a guarantee that the last N user/assistant text messages (excluding tool results) are always preserved in the tail, regardless of tool result sizes. This acts as a safety net:
# In _find_tail_cut_by_tokens(), after the backward token walk:
# Count conversation messages (user + assistant without tool_calls) in the tail
conv_msgs_in_tail = sum(
1 for m in messages[cut_idx:]
if m["role"] in ("user", "assistant") and not m.get("tool_calls")
)
# If fewer than CONVERSATION_FLOOR, expand the tail backward
CONVERSATION_FLOOR = 6 # guarantee at least 6 conversational turns
while conv_msgs_in_tail < CONVERSATION_FLOOR and cut_idx > head_end + 1:
cut_idx -= 1
m = messages[cut_idx]
if m["role"] in ("user", "assistant") and not m.get("tool_calls"):
conv_msgs_in_tail += 1
Option B — Truncate tool results in tail before budget calculation (primary fix):
Before calculating the tail boundary, cap tool results in the tail region to a reasonable size. This ensures the budget is spent on a mix of conversation + tool context:
# Before _find_tail_cut_by_tokens():
MAX_TOOL_RESULT_TAIL_TOKENS = 2000 # per tool result
# Create a temporary view where tool results are truncated
# Use this truncated view for tail boundary calculation
# Then apply the boundary to the ORIGINAL messages (keeping full tool results in the tail)
This way:
- Full tool results are still sent to the model (they're in the tail)
- But the boundary calculation isn't skewed by oversized outputs
- Conversation messages are more likely to be included in the tail
Recommended: Implement both — Option B as the primary fix, Option A as a safety net.
Impact
- Severity: High — causes task amnesia in long sessions with heavy tool use
- Frequency: Common during SWE/coding workflows (build-fix-test loops, git operations, large file reads)
- User impact: Agent appears to "forget" what it was doing and needs task re-explanation after every compression cycle
Environment
- Any model with context compression enabled
- Most noticeable on 128K context models where
tail_token_budget ≈ 12.8K tokens
- Exacerbated by tools that produce large outputs (terminal, search_files, read_file on large files)
Related Issues
Bug Description
When context compression triggers during a session that contains large tool results (terminal output, git logs, file diffs, search results), the tail protection mechanism correctly preserves recent messages by token budget — but the budget is almost entirely consumed by the large tool outputs. The user's actual conversation messages (questions, instructions, task context) get pushed out of the tail and into the compressed summary region.
The result feels like "the conversation disappeared": after compression, the agent sees tool call history but loses the conversational context of what was being discussed. The user's most recent messages are buried in the LLM-generated summary rather than being in the active context window.
Technical Details
How the tail budget works
In
agent/context_compressor.py,_find_tail_cut_by_tokens()walks backward from the end of the message list, accumulating tokens untiltail_token_budgetis reached:The problem
terminaltool result (e.g.npm testoutput, build log,git diff) can easily be 3,000-8,000+ tokenscut_idx) is placed such that user/assistant conversation messages just before those tool results fall into the "middle" region — which gets summarized_ensure_last_user_message_in_tail()only protects the single most recent user message; earlier but still-recent user messages and assistant responses are lost to the summaryConcrete scenario
After compression, the agent sees ~14K tokens of test output but has lost the conversational thread about why tests were being run and what was being fixed.
Code References
agent/context_compressor.py—_find_tail_cut_by_tokens()(~line 420): backward walk with token budgetagent/context_compressor.py—_prune_old_tool_results(): pre-pass pruning only affects messages outside the tail boundaryagent/context_compressor.py—_ensure_last_user_message_in_tail(): only anchors the last user messagetail_token_budgetderived in__init__():int(threshold_tokens * summary_target_ratio)Proposed Solution: Pre-truncate Tool Results in Tail Before Budget Calculation
Option A — Conversation message floor (complementary):
Add a guarantee that the last N user/assistant text messages (excluding tool results) are always preserved in the tail, regardless of tool result sizes. This acts as a safety net:
Option B — Truncate tool results in tail before budget calculation (primary fix):
Before calculating the tail boundary, cap tool results in the tail region to a reasonable size. This ensures the budget is spent on a mix of conversation + tool context:
This way:
Recommended: Implement both — Option B as the primary fix, Option A as a safety net.
Impact
Environment
tail_token_budget≈ 12.8K tokensRelated Issues
_ensure_last_user_message_in_tail, but doesn't cover the multi-message case described here)