fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#28074
fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#28074briandevans wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds more accurate token estimation for tail-budget pruning by including tool call “envelope” overhead (id/type/function.name), and introduces regression tests to prevent undercounting.
Changes:
- Introduce
_estimate_msg_tokens_for_budgetto estimate per-message token usage including tool call metadata. - Refactor tail-pruning walks to use the shared estimator instead of counting only tool arguments.
- Add tests covering envelope accounting and SDK-like tool_call objects.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/agent/test_context_compressor.py | Adds regression/unit tests asserting tool_call envelopes contribute to token estimates. |
| agent/context_compressor.py | Adds a shared token estimator and wires it into tail-token budget walks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| raw_content = msg.get("content") or "" | ||
| msg_tokens = _content_length_for_budget(raw_content) // _CHARS_PER_TOKEN + 10 | ||
| for tc in msg.get("tool_calls") or []: | ||
| # ``str(tc)`` captures the full envelope for both dict-shaped and | ||
| # SDK-object tool_calls. ``arguments`` alone misses the ~30-50 | ||
| # chars of metadata per tc, which is the dominant blind spot when | ||
| # an assistant turn fans out into multiple parallel tool calls. | ||
| msg_tokens += len(str(tc)) // _CHARS_PER_TOKEN | ||
| return msg_tokens |
| Setup: 40 messages alternating ``assistant`` (3 short tool_calls each, | ||
| ``arguments="{}"``) and ``user`` (short text). Budget=200, | ||
| soft_ceiling=300. |
| messages = [ | ||
| {"role": "user", "content": "head1"}, | ||
| {"role": "assistant", "content": "head2"}, | ||
| ] |
| # The added envelope is ~80 chars of JSON metadata → ≥ 15 tokens. | ||
| # The old args-only path would have produced diff == 0. | ||
| assert diff >= 15, ( |
|
BoardJames triage: no branch-local failure found so far. Current checks at 2026-05-18T16:04Z: Changed files are limited to |
e242974 to
6cbc358
Compare
6cbc358 to
290bf80
Compare
`_find_tail_cut_by_tokens` and `_prune_old_tool_results` both rolled their own token estimate that counted `content_length + len(arguments)` and dropped the rest of every `tool_call`: `id`, `type`, `function.name`, and the JSON-envelope syntax around them. For an assistant turn that fans out into 3-5 short tool calls — common during data analysis, file ops, or web search — that is the bulk of the real token cost. Effect on the budget walks: - `_find_tail_cut_by_tokens` underestimates the tail, walks too far back, and protects ~2-3x more messages than the soft_ceiling intends. - `_prune_old_tool_results` does the same — older tool results stay protected and compression saves ~40% instead of the ~80% target. Issue measurement (370-message session): assistant role 14,407 vs 32,715 simple→real (2.27x), individual assistant turns 10-15x off. Compression ratio drops from ~80% expected to ~40%. Fix introduces `_estimate_msg_tokens_for_budget(msg)` that keeps the existing `_content_length_for_budget` accounting (so multimodal image parts still count via `_IMAGE_TOKEN_ESTIMATE`) and folds in `len(str(tc))` for each tool_call — capturing the full envelope for both dict-shaped and SDK-object tool calls. Both call sites switch to the new helper. Tests - new `test_tool_calls_envelope_counted_in_token_estimate` — 40-message alternating assistant/user transcript; buggy walk protects 30 tail messages, fixed walk protects < 20. - new `test_estimate_msg_tokens_envelope_counted` — direct unit test asserting id/type/function.name add ≥ 15 tokens of overhead. - new `test_estimate_msg_tokens_handles_sdk_tool_calls` — guards the non-dict (openai SDK) tool_call path the old `isinstance(tc, dict)` branch silently skipped. Existing TestTokenBudgetTailProtection (12 tests, including NousResearch#16087 multimodal regression) all pass — the fix preserves the prior char-sum behaviour for content and only adds the envelope contribution. Fixes NousResearch#28053. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
290bf80 to
ed11b00
Compare
|
Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up. |
What does this PR do?
Fixes the compression tail-budget walk to include the full
tool_callenvelope (id,type,function.name, plus JSON syntax overhead) when estimating per-message tokens. The previous estimate ofcontent_length + len(arguments)dropped every other field on eachtool_call, which for assistant turns that fan out into 3-5 parallel tool calls is the bulk of the real token cost — the issue measures 2.27× divergence on assistant messages overall and 10-15× on individual turns (4 tool_calls: 73 vs 1,090 tokens). The walks in_find_tail_cut_by_tokensand_prune_old_tool_resultsover-protect the tail (137 msgs kept where the budget targets ~25), and compression saves ~40% instead of ~80%. This PR introduces a shared_estimate_msg_tokens_for_budget(msg)helper, uses it at both sites, and folds inlen(str(tc))per tool_call to capture the envelope for both dict-shaped and SDK-object tool calls.Related Issue
Fixes #28053
Type of Change
Changes Made
agent/context_compressor.py— extract_estimate_msg_tokens_for_budget(msg)helper. Preserves_content_length_for_budget(so the multimodalimage_urlaccounting from [Bug]: underestimates token count for multimodal messages, causing oversized tail protection and ineffective context compression #16087 still works) and addslen(str(tc)) // _CHARS_PER_TOKENper tool_call._find_tail_cut_by_tokensand_prune_old_tool_resultsnow call the shared helper.tests/agent/test_context_compressor.py— new regressionTestTokenBudgetTailProtection::test_tool_calls_envelope_counted_in_token_estimate(40-msg transcript: buggy walk protects 30 tail msgs, fixed walk protects <20) + helper unit tests (test_estimate_msg_tokens_envelope_counted,test_estimate_msg_tokens_handles_sdk_tool_calls).How to Test
uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/agent/test_context_compressor.py -v— 15/15 inTestTokenBudgetTailProtectionpass.assert 30 < 20. Restore → passes.Checklist
Code
fix(scope):,feat(scope):, etc.)Documentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — or N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — or N/ASibling Audit
Related
len(content)fix) which patched a different root cause in the same function; this PR keeps that fix intact via_content_length_for_budget._content_length_for_budgetpath.