[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

# [Bug]: `_find_tail_cut_by_tokens` underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

## Description

### Bug Description

`_find_tail_cut_by_tokens()` in `agent/context_compressor.py` uses a simplified token estimation (`content_len // _CHARS_PER_TOKEN + 10`) that **severely underestimates assistant messages with tool_calls**, causing the tail protection region to grow far beyond the intended `soft_ceiling`. This makes compression ineffective — the "compressed" context remains 60-80% of the original size.

### Steps to Reproduce

1. Run a session with many tool calls (e.g., data analysis, file operations, debugging)
2. Continue until context compression triggers (~139K tokens with Qwen3.6-27B-FP8, 262K context)
3. Observe the compression result in agent.log:

```
Context compression triggered (139,106 tokens >= 131072 threshold)
Compressed: 322 -> 207 messages (~56,192 tokens saved, 40%)
```

Instead of the expected ~80% savings, only ~40% is achieved. In some cases, compression becomes nearly useless (368 -> 335, only 33 messages removed).

### Expected Behavior

Compression should reduce context to ~20% of original (`target_ratio: 0.2`), leaving only the head + tail + summary. With a 26K tail budget, the total compressed context should be ~50K tokens (head + summary + tail).

### Actual Behavior

The tail region ends up at 48K tokens **instead of the intended 26K**, and the summary adds another ~10K. Total compressed context is ~83K tokens — only 60% of the original. The effective compression ratio is ~40% instead of the expected ~80%.

### Root Cause

`_find_tail_cut_by_tokens()` uses this estimation for each message:

```python
msg_tokens = content_len // _CHARS_PER_TOKEN + 10  # line 1433
```

For assistant messages with tool_calls, it only adds the `arguments` string length:

```python
for tc in msg.get("tool_calls") or []:
    if isinstance(tc, dict):
        args = tc.get("function", {}).get("arguments", "")
        msg_tokens += len(args) // _CHARS_PER_TOKEN  # only arguments, missing metadata
```

But the actual tokenizer-based estimation (`_estimate_message_chars` in `model_metadata.py`) serializes the **entire message dict**, including:
- `tool_calls[].id` (UUID strings ~36 chars each)
- `tool_calls[].type` 
- `tool_calls[].function.name` (function names)
- Dict structure overhead (keys, quotes, brackets)

### Measured Impact

Empirical analysis of a real session (370 messages, 137 in tail):

| Role | Simple Estimate | Real Estimate (tokenizer) | Deviation |
|------|----------------|---------------------------|-----------|
| assistant | 14,407 | **32,715** | **2.27x** |
| tool | 26,325 | 28,318 | 1.08x |
| user | 218 | 335 | 1.54x |

Individual assistant messages with multiple tool_calls can deviate by **10-15x**:
- Message with 4 tool_calls: simple=73 vs real=1,090 (14.93x)
- Message with 3 tool_calls: simple=133 vs real=1,330 (10.00x)

The cumulative 47% underestimation (simple=40,950 vs real=60,047) causes the tail to overshoot by ~20K tokens, making compression 40-50% less effective than intended.

### Impact

- Compression effectiveness drops from expected ~80% to ~40%
- Multiple compression cycles become necessary, increasing token costs
- In extreme cases (large sessions with many tool calls), compression can become nearly useless (368 -> 335 messages)
- User sees "compressed" context that is still 80K+ tokens

### Proposed Fix

Two options:

**Option A: Use tokenizer-based estimation in `_find_tail_cut_by_tokens`**
```python
# Instead of simple estimate:
msg_tokens = content_len // _CHARS_PER_TOKEN + 10
# Use:
msg_tokens = _estimate_message_chars(msg) // _CHARS_PER_TOKEN + 10
```
This is consistent with how `_prune_old_tool_results()` already works and would eliminate the deviation entirely.

**Option B: Apply a safety multiplier to the soft_ceiling**
```python
soft_ceiling = int(token_budget * 1.5 * 0.7)  # Reduce by 30% to compensate for underestimation
```
This is a quick fix but doesn't address the root cause.

**Recommended:** Option A for correctness, as `_estimate_message_chars` already handles all edge cases (multimodal, multimodal tool results, base64 images).

### Environment

- Hermes Agent: v0.8.x+ (current main)
- Model: Qwen/Qwen3.6-27B-FP8 (262K context)
- OS: macOS 26.4.1

### Related Issues

- #13164 (tool results consuming tail budget)
- #16087 (multimodal message estimation — different but related)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective #28053

[Bug]: `_find_tail_cut_by_tokens` underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

Description

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause

Measured Impact

Impact

Proposed Fix

Environment

Related Issues

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Role	Simple Estimate	Real Estimate (tokenizer)	Deviation
assistant	14,407	32,715	2.27x
tool	26,325	28,318	1.08x
user	218	335	1.54x

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective #28053

Description

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

Description

Bug Description

Steps to Reproduce

Expected Behavior

Actual Behavior

Root Cause

Measured Impact

Impact

Proposed Fix

Environment

Related Issues

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

[Bug]: `_find_tail_cut_by_tokens` underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective