Skip to content

fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#43293

Open
basilalshukaili wants to merge 2 commits into
NousResearch:mainfrom
basilalshukaili:fix/compressor-tool-call-token-budget
Open

fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#43293
basilalshukaili wants to merge 2 commits into
NousResearch:mainfrom
basilalshukaili:fix/compressor-tool-call-token-budget

Conversation

@basilalshukaili

Copy link
Copy Markdown
Contributor

What & why

Fixes #28053. The compression tail-protection budget walks estimate each message's tokens from content + function.arguments only — they drop every other field on each tool_call (id, type, function.name, and the JSON structure). For assistant turns that fan out into parallel tool calls, those fields are the bulk of the real cost, so the estimate undercounts by 2–15× (the issue measures a 4-tool-call turn at 73 estimated vs ~1,090 real tokens, and ~2.27× on assistant messages overall).

The effect: the protected tail overshoots tail_token_budget (e.g. a ~26K budget ends up holding ~48K), so compression runs well below its intended ratio and context keeps growing across turns — exactly on the multi-tool turns that are Hermes's normal case.

The fix

The same walk was duplicated in three places (_prune_old_tool_results, and the two passes inside _find_tail_cut_by_tokens). This consolidates them into one helper:

def _estimate_msg_budget_tokens(msg: dict) -> int:
    content_len = _content_length_for_budget(msg.get("content") or "")
    tokens = content_len // _CHARS_PER_TOKEN + 10
    for tc in msg.get("tool_calls") or []:
        if isinstance(tc, dict):
            tokens += len(str(tc)) // _CHARS_PER_TOKEN   # full envelope, not just arguments
    return tokens

len(str(tc)) counts the whole tool_call (id/type/function.name + structure), consistent with how _estimate_message_chars already estimates message size elsewhere. Content/image handling is unchanged (_content_length_for_budget is reused as-is). Net −22/+18 in the source, and the three walks can no longer drift apart.

This does not touch the prompt-caching path or alter past context — only the local pre-compression budget estimate.

Relationship to #28074

#28074 shared this diagnosis but was closed by its author ("to focus the queue on security work… happy to reopen if maintainers want this picked up"). This picks it up with a smaller, de-duplicated implementation (one helper across all three sites rather than inlining the extra fields in each) plus a regression test that isolates the behavioral effect.

Test plan

New tests/agent/test_compressor_tool_call_budget.py:

  • the helper counts the full envelope (not just arguments) and scales with parallel-call count;
  • _find_tail_cut_by_tokens now stops on a tool-call-heavy tail where the old arguments-only walk would have protected the entire transcript;
  • plain messages and non-dict tool_calls behave unchanged / safely.

Ran on native Windows (Python 3.11): the new file plus the existing compression suite (test_context_compressor, test_compressor_image_tokens, test_compress_focus, test_context_compressor_cross_session_guard, test_infinite_compaction_loop, gateway/test_compress_plugin_engine) — 209 passed. No new dependencies.

basilalshukaili and others added 2 commits June 10, 2026 08:13
…te (NousResearch#28053)

The tail-protection budget walks estimated an assistant message's tokens from content + function.arguments only, dropping each tool_call's id, type and function.name (plus JSON structure). Assistant turns that fan out into parallel tool calls were undercounted by 2-15x (a 4-tool-call turn measures ~73 vs ~1,090 real tokens), so the protected tail overshot tail_token_budget and compression ran far below its intended ratio — context kept growing.

Consolidate the three duplicated budget walks (_prune_old_tool_results and the two passes in _find_tail_cut_by_tokens) into a single _estimate_msg_budget_tokens() helper that counts the full tool_call envelope via len(str(tc)), consistent with how _estimate_message_chars estimates message size elsewhere.

Tested on Windows: new tests/agent/test_compressor_tool_call_budget.py plus the existing compression suite (test_context_compressor, compressor_image_tokens, cross_session_guard, infinite_compaction_loop) — 209 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
check-attribution requires every PR author email to resolve via scripts/release.py AUTHOR_MAP. The existing entry maps the GitHub noreply form; this adds the plain commit email so attribution resolves for this and future commits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@basilalshukaili

Copy link
Copy Markdown
Contributor Author

Measured before/after on realistic multi-tool assistant turns — token counts via tiktoken (o200k_base) on the serialized tool_calls, to show the undercount concretely:

parallel tool_calls old (args-only) new (envelope) real old undercount
1 16 44 44 2.8×
2 27 83 91 3.4×
4 41 152 176 4.3×
5 50 188 227 4.5×

The old walk undercounts a 4-call turn ~4.3× (41 vs 176); the envelope estimate lands within ~16% of the tokenizer, and the gap widens with parallel-call count — precisely when tail protection matters most. These use short arguments; with large arguments the absolute miss is bigger still (matching the 73 vs ~1,090 figure in #28053).

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Jun 10, 2026
@basilalshukaili

Copy link
Copy Markdown
Contributor Author

On output quality / information loss — since this moves a compression boundary, worth being explicit rather than hand-waving it:

  • It does not change when compression fires. should_compress() gates on real last_prompt_tokens vs threshold_tokens; these budget walks only run inside compress() to place the protect/summarize boundary. Compression frequency is unchanged.
  • What changes: on tool-call-heavy turns the tail now reaches tail_token_budget at roughly the intended size instead of ~2× over it, so slightly more of that recent tail becomes eligible for summarization — i.e. it converges on the budget the compressor was designed around.
  • The floors are preserved: min_protect / protect_last_n still hold and _ensure_last_user_message_in_tail keeps the most recent user turn verbatim, so the tail can't be stripped below the floor.
  • The behavior it removes was itself a quality risk: the over-protection made compression run well below its target ratio, so context kept growing toward the hard window — where the provider truncates uncontrollably (often dropping the system prompt / oldest turns) and "lost-in-the-middle" degrades reasoning. Honoring the budget avoids that failure mode.

Net: this restores the designed tail budget rather than making compression more aggressive. If the effective ~2× over-protection turns out to be desirable, the clean lever is to raise tail_token_budget explicitly — not to lean on the estimation undercount. Happy to add that as a follow-up knob if you'd prefer.

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification Review

Approach: Correct fix for a long-standing compression budget undercount. The tail-protection walks estimated tokens from content + function.arguments only, dropping the tool_call envelope (id, type, function.name, JSON structure). For parallel tool-call turns this undercounted by 2-15x, causing the protected tail to overshoot tail_token_budget.

Strengths:

  • Single extraction point: _estimate_msg_budget_tokens() replaces 3 inline copies, eliminating drift
  • Full envelope: counts len(str(tc)) for each tool_call, covering id/type/name/JSON structure
  • Defensive: non-dict tool_calls are silently skipped (no crash on malformed data)
  • Good test coverage: 5 tests covering envelope vs arguments-only comparison, scaling with parallel calls, plain messages, and non-dict edge cases
  • Real-world validation: the tail-cut test uses 20 heavy turns to demonstrate the budget now stops early

No issues found. The fix is well-scoped and the tests validate the exact regression scenario from #28053.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

3 participants