[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective
Description
Bug Description
_find_tail_cut_by_tokens() in agent/context_compressor.py uses a simplified token estimation (content_len // _CHARS_PER_TOKEN + 10) that severely underestimates assistant messages with tool_calls, causing the tail protection region to grow far beyond the intended soft_ceiling. This makes compression ineffective — the "compressed" context remains 60-80% of the original size.
Steps to Reproduce
- Run a session with many tool calls (e.g., data analysis, file operations, debugging)
- Continue until context compression triggers (~139K tokens with Qwen3.6-27B-FP8, 262K context)
- Observe the compression result in agent.log:
Context compression triggered (139,106 tokens >= 131072 threshold)
Compressed: 322 -> 207 messages (~56,192 tokens saved, 40%)
Instead of the expected ~80% savings, only ~40% is achieved. In some cases, compression becomes nearly useless (368 -> 335, only 33 messages removed).
Expected Behavior
Compression should reduce context to ~20% of original (target_ratio: 0.2), leaving only the head + tail + summary. With a 26K tail budget, the total compressed context should be ~50K tokens (head + summary + tail).
Actual Behavior
The tail region ends up at 48K tokens instead of the intended 26K, and the summary adds another ~10K. Total compressed context is ~83K tokens — only 60% of the original. The effective compression ratio is ~40% instead of the expected ~80%.
Root Cause
_find_tail_cut_by_tokens() uses this estimation for each message:
msg_tokens = content_len // _CHARS_PER_TOKEN + 10 # line 1433
For assistant messages with tool_calls, it only adds the arguments string length:
for tc in msg.get("tool_calls") or []:
if isinstance(tc, dict):
args = tc.get("function", {}).get("arguments", "")
msg_tokens += len(args) // _CHARS_PER_TOKEN # only arguments, missing metadata
But the actual tokenizer-based estimation (_estimate_message_chars in model_metadata.py) serializes the entire message dict, including:
tool_calls[].id (UUID strings ~36 chars each)
tool_calls[].type
tool_calls[].function.name (function names)
- Dict structure overhead (keys, quotes, brackets)
Measured Impact
Empirical analysis of a real session (370 messages, 137 in tail):
| Role |
Simple Estimate |
Real Estimate (tokenizer) |
Deviation |
| assistant |
14,407 |
32,715 |
2.27x |
| tool |
26,325 |
28,318 |
1.08x |
| user |
218 |
335 |
1.54x |
Individual assistant messages with multiple tool_calls can deviate by 10-15x:
- Message with 4 tool_calls: simple=73 vs real=1,090 (14.93x)
- Message with 3 tool_calls: simple=133 vs real=1,330 (10.00x)
The cumulative 47% underestimation (simple=40,950 vs real=60,047) causes the tail to overshoot by ~20K tokens, making compression 40-50% less effective than intended.
Impact
- Compression effectiveness drops from expected ~80% to ~40%
- Multiple compression cycles become necessary, increasing token costs
- In extreme cases (large sessions with many tool calls), compression can become nearly useless (368 -> 335 messages)
- User sees "compressed" context that is still 80K+ tokens
Proposed Fix
Two options:
Option A: Use tokenizer-based estimation in _find_tail_cut_by_tokens
# Instead of simple estimate:
msg_tokens = content_len // _CHARS_PER_TOKEN + 10
# Use:
msg_tokens = _estimate_message_chars(msg) // _CHARS_PER_TOKEN + 10
This is consistent with how _prune_old_tool_results() already works and would eliminate the deviation entirely.
Option B: Apply a safety multiplier to the soft_ceiling
soft_ceiling = int(token_budget * 1.5 * 0.7) # Reduce by 30% to compensate for underestimation
This is a quick fix but doesn't address the root cause.
Recommended: Option A for correctness, as _estimate_message_chars already handles all edge cases (multimodal, multimodal tool results, base64 images).
Environment
- Hermes Agent: v0.8.x+ (current main)
- Model: Qwen/Qwen3.6-27B-FP8 (262K context)
- OS: macOS 26.4.1
Related Issues
[Bug]:
_find_tail_cut_by_tokensunderestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffectiveDescription
Bug Description
_find_tail_cut_by_tokens()inagent/context_compressor.pyuses a simplified token estimation (content_len // _CHARS_PER_TOKEN + 10) that severely underestimates assistant messages with tool_calls, causing the tail protection region to grow far beyond the intendedsoft_ceiling. This makes compression ineffective — the "compressed" context remains 60-80% of the original size.Steps to Reproduce
Instead of the expected ~80% savings, only ~40% is achieved. In some cases, compression becomes nearly useless (368 -> 335, only 33 messages removed).
Expected Behavior
Compression should reduce context to ~20% of original (
target_ratio: 0.2), leaving only the head + tail + summary. With a 26K tail budget, the total compressed context should be ~50K tokens (head + summary + tail).Actual Behavior
The tail region ends up at 48K tokens instead of the intended 26K, and the summary adds another ~10K. Total compressed context is ~83K tokens — only 60% of the original. The effective compression ratio is ~40% instead of the expected ~80%.
Root Cause
_find_tail_cut_by_tokens()uses this estimation for each message:For assistant messages with tool_calls, it only adds the
argumentsstring length:But the actual tokenizer-based estimation (
_estimate_message_charsinmodel_metadata.py) serializes the entire message dict, including:tool_calls[].id(UUID strings ~36 chars each)tool_calls[].typetool_calls[].function.name(function names)Measured Impact
Empirical analysis of a real session (370 messages, 137 in tail):
Individual assistant messages with multiple tool_calls can deviate by 10-15x:
The cumulative 47% underestimation (simple=40,950 vs real=60,047) causes the tail to overshoot by ~20K tokens, making compression 40-50% less effective than intended.
Impact
Proposed Fix
Two options:
Option A: Use tokenizer-based estimation in
_find_tail_cut_by_tokensThis is consistent with how
_prune_old_tool_results()already works and would eliminate the deviation entirely.Option B: Apply a safety multiplier to the soft_ceiling
This is a quick fix but doesn't address the root cause.
Recommended: Option A for correctness, as
_estimate_message_charsalready handles all edge cases (multimodal, multimodal tool results, base64 images).Environment
Related Issues