fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#43293
Conversation
…te (NousResearch#28053) The tail-protection budget walks estimated an assistant message's tokens from content + function.arguments only, dropping each tool_call's id, type and function.name (plus JSON structure). Assistant turns that fan out into parallel tool calls were undercounted by 2-15x (a 4-tool-call turn measures ~73 vs ~1,090 real tokens), so the protected tail overshot tail_token_budget and compression ran far below its intended ratio — context kept growing. Consolidate the three duplicated budget walks (_prune_old_tool_results and the two passes in _find_tail_cut_by_tokens) into a single _estimate_msg_budget_tokens() helper that counts the full tool_call envelope via len(str(tc)), consistent with how _estimate_message_chars estimates message size elsewhere. Tested on Windows: new tests/agent/test_compressor_tool_call_budget.py plus the existing compression suite (test_context_compressor, compressor_image_tokens, cross_session_guard, infinite_compaction_loop) — 209 passed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
check-attribution requires every PR author email to resolve via scripts/release.py AUTHOR_MAP. The existing entry maps the GitHub noreply form; this adds the plain commit email so attribution resolves for this and future commits. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Measured before/after on realistic multi-tool assistant turns — token counts via
The old walk undercounts a 4-call turn ~4.3× (41 vs 176); the envelope estimate lands within ~16% of the tokenizer, and the gap widens with parallel-call count — precisely when tail protection matters most. These use short arguments; with large arguments the absolute miss is bigger still (matching the 73 vs ~1,090 figure in #28053). |
|
On output quality / information loss — since this moves a compression boundary, worth being explicit rather than hand-waving it:
Net: this restores the designed tail budget rather than making compression more aggressive. If the effective ~2× over-protection turns out to be desirable, the clean lever is to raise |
Verification ReviewApproach: Correct fix for a long-standing compression budget undercount. The tail-protection walks estimated tokens from content + function.arguments only, dropping the tool_call envelope (id, type, function.name, JSON structure). For parallel tool-call turns this undercounted by 2-15x, causing the protected tail to overshoot tail_token_budget. Strengths:
No issues found. The fix is well-scoped and the tests validate the exact regression scenario from #28053. |
What & why
Fixes #28053. The compression tail-protection budget walks estimate each message's tokens from
content + function.argumentsonly — they drop every other field on eachtool_call(id,type,function.name, and the JSON structure). For assistant turns that fan out into parallel tool calls, those fields are the bulk of the real cost, so the estimate undercounts by 2–15× (the issue measures a 4-tool-call turn at 73 estimated vs ~1,090 real tokens, and ~2.27× on assistant messages overall).The effect: the protected tail overshoots
tail_token_budget(e.g. a ~26K budget ends up holding ~48K), so compression runs well below its intended ratio and context keeps growing across turns — exactly on the multi-tool turns that are Hermes's normal case.The fix
The same walk was duplicated in three places (
_prune_old_tool_results, and the two passes inside_find_tail_cut_by_tokens). This consolidates them into one helper:len(str(tc))counts the whole tool_call (id/type/function.name + structure), consistent with how_estimate_message_charsalready estimates message size elsewhere. Content/image handling is unchanged (_content_length_for_budgetis reused as-is). Net −22/+18 in the source, and the three walks can no longer drift apart.This does not touch the prompt-caching path or alter past context — only the local pre-compression budget estimate.
Relationship to #28074
#28074 shared this diagnosis but was closed by its author ("to focus the queue on security work… happy to reopen if maintainers want this picked up"). This picks it up with a smaller, de-duplicated implementation (one helper across all three sites rather than inlining the extra fields in each) plus a regression test that isolates the behavioral effect.
Test plan
New
tests/agent/test_compressor_tool_call_budget.py:_find_tail_cut_by_tokensnow stops on a tool-call-heavy tail where the old arguments-only walk would have protected the entire transcript;Ran on native Windows (Python 3.11): the new file plus the existing compression suite (
test_context_compressor,test_compressor_image_tokens,test_compress_focus,test_context_compressor_cross_session_guard,test_infinite_compaction_loop,gateway/test_compress_plugin_engine) — 209 passed. No new dependencies.