Skip to content

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective #28053

@laoli-no1

Description

@laoli-no1

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

Description

Bug Description

_find_tail_cut_by_tokens() in agent/context_compressor.py uses a simplified token estimation (content_len // _CHARS_PER_TOKEN + 10) that severely underestimates assistant messages with tool_calls, causing the tail protection region to grow far beyond the intended soft_ceiling. This makes compression ineffective — the "compressed" context remains 60-80% of the original size.

Steps to Reproduce

  1. Run a session with many tool calls (e.g., data analysis, file operations, debugging)
  2. Continue until context compression triggers (~139K tokens with Qwen3.6-27B-FP8, 262K context)
  3. Observe the compression result in agent.log:
Context compression triggered (139,106 tokens >= 131072 threshold)
Compressed: 322 -> 207 messages (~56,192 tokens saved, 40%)

Instead of the expected ~80% savings, only ~40% is achieved. In some cases, compression becomes nearly useless (368 -> 335, only 33 messages removed).

Expected Behavior

Compression should reduce context to ~20% of original (target_ratio: 0.2), leaving only the head + tail + summary. With a 26K tail budget, the total compressed context should be ~50K tokens (head + summary + tail).

Actual Behavior

The tail region ends up at 48K tokens instead of the intended 26K, and the summary adds another ~10K. Total compressed context is ~83K tokens — only 60% of the original. The effective compression ratio is ~40% instead of the expected ~80%.

Root Cause

_find_tail_cut_by_tokens() uses this estimation for each message:

msg_tokens = content_len // _CHARS_PER_TOKEN + 10  # line 1433

For assistant messages with tool_calls, it only adds the arguments string length:

for tc in msg.get("tool_calls") or []:
    if isinstance(tc, dict):
        args = tc.get("function", {}).get("arguments", "")
        msg_tokens += len(args) // _CHARS_PER_TOKEN  # only arguments, missing metadata

But the actual tokenizer-based estimation (_estimate_message_chars in model_metadata.py) serializes the entire message dict, including:

  • tool_calls[].id (UUID strings ~36 chars each)
  • tool_calls[].type
  • tool_calls[].function.name (function names)
  • Dict structure overhead (keys, quotes, brackets)

Measured Impact

Empirical analysis of a real session (370 messages, 137 in tail):

Role Simple Estimate Real Estimate (tokenizer) Deviation
assistant 14,407 32,715 2.27x
tool 26,325 28,318 1.08x
user 218 335 1.54x

Individual assistant messages with multiple tool_calls can deviate by 10-15x:

  • Message with 4 tool_calls: simple=73 vs real=1,090 (14.93x)
  • Message with 3 tool_calls: simple=133 vs real=1,330 (10.00x)

The cumulative 47% underestimation (simple=40,950 vs real=60,047) causes the tail to overshoot by ~20K tokens, making compression 40-50% less effective than intended.

Impact

  • Compression effectiveness drops from expected ~80% to ~40%
  • Multiple compression cycles become necessary, increasing token costs
  • In extreme cases (large sessions with many tool calls), compression can become nearly useless (368 -> 335 messages)
  • User sees "compressed" context that is still 80K+ tokens

Proposed Fix

Two options:

Option A: Use tokenizer-based estimation in _find_tail_cut_by_tokens

# Instead of simple estimate:
msg_tokens = content_len // _CHARS_PER_TOKEN + 10
# Use:
msg_tokens = _estimate_message_chars(msg) // _CHARS_PER_TOKEN + 10

This is consistent with how _prune_old_tool_results() already works and would eliminate the deviation entirely.

Option B: Apply a safety multiplier to the soft_ceiling

soft_ceiling = int(token_budget * 1.5 * 0.7)  # Reduce by 30% to compensate for underestimation

This is a quick fix but doesn't address the root cause.

Recommended: Option A for correctness, as _estimate_message_chars already handles all edge cases (multimodal, multimodal tool results, base64 images).

Environment

  • Hermes Agent: v0.8.x+ (current main)
  • Model: Qwen/Qwen3.6-27B-FP8 (262K context)
  • OS: macOS 26.4.1

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions