Skip to content

[Bug]: Large tool results consume entire tail token budget — conversation messages lost to summary on compression #13164

@shiro-jwjang

Description

@shiro-jwjang

Bug Description

When context compression triggers during a session that contains large tool results (terminal output, git logs, file diffs, search results), the tail protection mechanism correctly preserves recent messages by token budget — but the budget is almost entirely consumed by the large tool outputs. The user's actual conversation messages (questions, instructions, task context) get pushed out of the tail and into the compressed summary region.

The result feels like "the conversation disappeared": after compression, the agent sees tool call history but loses the conversational context of what was being discussed. The user's most recent messages are buried in the LLM-generated summary rather than being in the active context window.

Technical Details

How the tail budget works

In agent/context_compressor.py, _find_tail_cut_by_tokens() walks backward from the end of the message list, accumulating tokens until tail_token_budget is reached:

# Derived budgets (128K context model example):
tail_token_budget = threshold_tokens × summary_target_ratio
                 = 64000 × 0.20
                 = ~12,800 tokens

# protect_last_n = 20 (hard minimum floor)

The problem

  1. A single large terminal tool result (e.g. npm test output, build log, git diff) can easily be 3,000-8,000+ tokens
  2. The backward walk accumulates these large tool results first (they're at the end)
  3. After 2-3 large tool results, the entire ~12.8K token budget is exhausted
  4. The boundary (cut_idx) is placed such that user/assistant conversation messages just before those tool results fall into the "middle" region — which gets summarized
  5. _ensure_last_user_message_in_tail() only protects the single most recent user message; earlier but still-recent user messages and assistant responses are lost to the summary

Concrete scenario

messages[40]: user: "Now run the test suite and fix any failures"     ← pushed to summary
messages[41]: assistant: "Running tests..."                            ← pushed to summary  
messages[42]: tool: [terminal] npm test → 5000 lines of output        ← in tail (8K tokens)
messages[43]: assistant: "3 tests failed, fixing..."                   ← pushed to summary
messages[44]: tool: [terminal] npm test → 5000 lines of output        ← in tail (6K tokens)
messages[45]: user: "Also check the lint warnings"                     ← in tail (barely)

After compression, the agent sees ~14K tokens of test output but has lost the conversational thread about why tests were being run and what was being fixed.

Code References

  • agent/context_compressor.py_find_tail_cut_by_tokens() (~line 420): backward walk with token budget
  • agent/context_compressor.py_prune_old_tool_results(): pre-pass pruning only affects messages outside the tail boundary
  • agent/context_compressor.py_ensure_last_user_message_in_tail(): only anchors the last user message
  • tail_token_budget derived in __init__(): int(threshold_tokens * summary_target_ratio)

Proposed Solution: Pre-truncate Tool Results in Tail Before Budget Calculation

Option A — Conversation message floor (complementary):
Add a guarantee that the last N user/assistant text messages (excluding tool results) are always preserved in the tail, regardless of tool result sizes. This acts as a safety net:

# In _find_tail_cut_by_tokens(), after the backward token walk:
# Count conversation messages (user + assistant without tool_calls) in the tail
conv_msgs_in_tail = sum(
    1 for m in messages[cut_idx:]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls")
)
# If fewer than CONVERSATION_FLOOR, expand the tail backward
CONVERSATION_FLOOR = 6  # guarantee at least 6 conversational turns
while conv_msgs_in_tail < CONVERSATION_FLOOR and cut_idx > head_end + 1:
    cut_idx -= 1
    m = messages[cut_idx]
    if m["role"] in ("user", "assistant") and not m.get("tool_calls"):
        conv_msgs_in_tail += 1

Option B — Truncate tool results in tail before budget calculation (primary fix):
Before calculating the tail boundary, cap tool results in the tail region to a reasonable size. This ensures the budget is spent on a mix of conversation + tool context:

# Before _find_tail_cut_by_tokens():
MAX_TOOL_RESULT_TAIL_TOKENS = 2000  # per tool result

# Create a temporary view where tool results are truncated
# Use this truncated view for tail boundary calculation
# Then apply the boundary to the ORIGINAL messages (keeping full tool results in the tail)

This way:

  • Full tool results are still sent to the model (they're in the tail)
  • But the boundary calculation isn't skewed by oversized outputs
  • Conversation messages are more likely to be included in the tail

Recommended: Implement both — Option B as the primary fix, Option A as a safety net.

Impact

  • Severity: High — causes task amnesia in long sessions with heavy tool use
  • Frequency: Common during SWE/coding workflows (build-fix-test loops, git operations, large file reads)
  • User impact: Agent appears to "forget" what it was doing and needs task re-explanation after every compression cycle

Environment

  • Any model with context compression enabled
  • Most noticeable on 128K context models where tail_token_budget ≈ 12.8K tokens
  • Exacerbated by tools that produce large outputs (terminal, search_files, read_file on large files)

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions