Skip to content

fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#28074

Closed
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/compressor-tool-calls-tokens-28053
Closed

fix(compressor): count tool_call envelope in tail-budget token estimate (#28053)#28074
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/compressor-tool-calls-tokens-28053

Conversation

@briandevans

@briandevans briandevans commented May 18, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes the compression tail-budget walk to include the full tool_call envelope (id, type, function.name, plus JSON syntax overhead) when estimating per-message tokens. The previous estimate of content_length + len(arguments) dropped every other field on each tool_call, which for assistant turns that fan out into 3-5 parallel tool calls is the bulk of the real token cost — the issue measures 2.27× divergence on assistant messages overall and 10-15× on individual turns (4 tool_calls: 73 vs 1,090 tokens). The walks in _find_tail_cut_by_tokens and _prune_old_tool_results over-protect the tail (137 msgs kept where the budget targets ~25), and compression saves ~40% instead of ~80%. This PR introduces a shared _estimate_msg_tokens_for_budget(msg) helper, uses it at both sites, and folds in len(str(tc)) per tool_call to capture the envelope for both dict-shaped and SDK-object tool calls.

Related Issue

Fixes #28053

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/context_compressor.py — extract _estimate_msg_tokens_for_budget(msg) helper. Preserves _content_length_for_budget (so the multimodal image_url accounting from [Bug]: underestimates token count for multimodal messages, causing oversized tail protection and ineffective context compression #16087 still works) and adds len(str(tc)) // _CHARS_PER_TOKEN per tool_call. _find_tail_cut_by_tokens and _prune_old_tool_results now call the shared helper.
  • tests/agent/test_context_compressor.py — new regression TestTokenBudgetTailProtection::test_tool_calls_envelope_counted_in_token_estimate (40-msg transcript: buggy walk protects 30 tail msgs, fixed walk protects <20) + helper unit tests (test_estimate_msg_tokens_envelope_counted, test_estimate_msg_tokens_handles_sdk_tool_calls).

How to Test

  1. uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/agent/test_context_compressor.py -v — 15/15 in TestTokenBudgetTailProtection pass.
  2. Regression guard: temporarily revert the production change → new test fails with assert 30 < 20. Restore → passes.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run focused tests for the touched code and all pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.x

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Sibling Audit

The issue names _find_tail_cut_by_tokens only, but the same inline pattern lived in _prune_old_tool_results (line 681-686). Both call sites are fixed by the same helper so the invariant — token-budget walks must account for tool_call envelope — holds across the compressor. No other compressor call sites compute a per-message budget rough. No widening needed.

Related

Copilot AI review requested due to automatic review settings May 18, 2026 15:54

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds more accurate token estimation for tail-budget pruning by including tool call “envelope” overhead (id/type/function.name), and introduces regression tests to prevent undercounting.

Changes:

  • Introduce _estimate_msg_tokens_for_budget to estimate per-message token usage including tool call metadata.
  • Refactor tail-pruning walks to use the shared estimator instead of counting only tool arguments.
  • Add tests covering envelope accounting and SDK-like tool_call objects.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
tests/agent/test_context_compressor.py Adds regression/unit tests asserting tool_call envelopes contribute to token estimates.
agent/context_compressor.py Adds a shared token estimator and wires it into tail-token budget walks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +122 to +130
raw_content = msg.get("content") or ""
msg_tokens = _content_length_for_budget(raw_content) // _CHARS_PER_TOKEN + 10
for tc in msg.get("tool_calls") or []:
# ``str(tc)`` captures the full envelope for both dict-shaped and
# SDK-object tool_calls. ``arguments`` alone misses the ~30-50
# chars of metadata per tc, which is the dominant blind spot when
# an assistant turn fans out into multiple parallel tool calls.
msg_tokens += len(str(tc)) // _CHARS_PER_TOKEN
return msg_tokens
Comment on lines +1667 to +1669
Setup: 40 messages alternating ``assistant`` (3 short tool_calls each,
``arguments="{}"``) and ``user`` (short text). Budget=200,
soft_ceiling=300.
Comment on lines +1690 to +1693
messages = [
{"role": "user", "content": "head1"},
{"role": "assistant", "content": "head2"},
]
Comment on lines +1739 to +1741
# The added envelope is ~80 chars of JSON metadata → ≥ 15 tokens.
# The old args-only path would have produced diff == 0.
assert diff >= 15, (
@BoardJames-Bot

Copy link
Copy Markdown

BoardJames triage: no branch-local failure found so far. Current checks at 2026-05-18T16:04Z: Docker Build and Publish / build-arm64, build-amd64, lint, nix, supply-chain, attribution, history, and Tests / e2e are green; only Tests / test is still in progress on head 49657c79d (run 26044643860). The earlier snapshot's arm64 concern has cleared.

Changed files are limited to agent/context_compressor.py plus regression tests in tests/agent/test_context_compressor.py. Focused local validation passed in .triage/wt-28074: python -m pytest tests/agent/test_context_compressor.py::TestTokenBudgetTailProtection -q --tb=short => 15 passed. Root cause/status: currently just the shared long full-suite test job still running; no safe branch-local fix indicated unless that job later emits a concrete traceback.

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder labels May 18, 2026
@briandevans briandevans force-pushed the fix/compressor-tool-calls-tokens-28053 branch 4 times, most recently from e242974 to 6cbc358 Compare May 27, 2026 10:13
@briandevans briandevans force-pushed the fix/compressor-tool-calls-tokens-28053 branch from 6cbc358 to 290bf80 Compare May 29, 2026 11:13
`_find_tail_cut_by_tokens` and `_prune_old_tool_results` both rolled
their own token estimate that counted `content_length + len(arguments)`
and dropped the rest of every `tool_call`: `id`, `type`, `function.name`,
and the JSON-envelope syntax around them.  For an assistant turn that
fans out into 3-5 short tool calls — common during data analysis, file
ops, or web search — that is the bulk of the real token cost.

Effect on the budget walks:
- `_find_tail_cut_by_tokens` underestimates the tail, walks too far
  back, and protects ~2-3x more messages than the soft_ceiling intends.
- `_prune_old_tool_results` does the same — older tool results stay
  protected and compression saves ~40% instead of the ~80% target.

Issue measurement (370-message session): assistant role
14,407 vs 32,715 simple→real (2.27x), individual assistant turns
10-15x off.  Compression ratio drops from ~80% expected to ~40%.

Fix introduces `_estimate_msg_tokens_for_budget(msg)` that keeps the
existing `_content_length_for_budget` accounting (so multimodal image
parts still count via `_IMAGE_TOKEN_ESTIMATE`) and folds in
`len(str(tc))` for each tool_call — capturing the full envelope for
both dict-shaped and SDK-object tool calls.  Both call sites switch
to the new helper.

Tests
- new `test_tool_calls_envelope_counted_in_token_estimate` — 40-message
  alternating assistant/user transcript; buggy walk protects 30 tail
  messages, fixed walk protects < 20.
- new `test_estimate_msg_tokens_envelope_counted` — direct unit test
  asserting id/type/function.name add ≥ 15 tokens of overhead.
- new `test_estimate_msg_tokens_handles_sdk_tool_calls` — guards the
  non-dict (openai SDK) tool_call path the old `isinstance(tc, dict)`
  branch silently skipped.

Existing TestTokenBudgetTailProtection (12 tests, including NousResearch#16087
multimodal regression) all pass — the fix preserves the prior char-sum
behaviour for content and only adds the envelope contribution.

Fixes NousResearch#28053.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@briandevans briandevans force-pushed the fix/compressor-tool-calls-tokens-28053 branch from 290bf80 to ed11b00 Compare May 30, 2026 11:12
@briandevans

Copy link
Copy Markdown
Contributor Author

Closing to focus the queue on security/file-safety work where civilian merges are landing. Happy to reopen if maintainers want this picked up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: _find_tail_cut_by_tokens underestimates assistant message tokens by 2-15x — tail protection overshoots and compression becomes ineffective

4 participants