Skip to content

feat(agent): context-aware tool result budgeting#6339

Open
jbarket wants to merge 6 commits into
NousResearch:mainfrom
jbarket:feat/tool-budget
Open

feat(agent): context-aware tool result budgeting#6339
jbarket wants to merge 6 commits into
NousResearch:mainfrom
jbarket:feat/tool-budget

Conversation

@jbarket

@jbarket jbarket commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Adds context-aware budgeting for tool results. When a tool's output would exceed the model's available context, the result is spilled to disk and the model gets a bounded preview with pagination instructions (use read_file with offset/limit for more).

Budget = max(floor, min(baseline, available_context))

On a 32K model, this prevents the exact scenario where ps aux or a large file read returns 36K tokens into a 32K window — producing an HTTP 400 "request exceeds context size" error. On 128K+ models, the budget is generous enough that it effectively never triggers.

When context is tight, the agent compacts conversation history before accepting a small budget, so the model gets a useful chunk rather than drip-feeding 5%-of-context slivers.

Scaling

Model Context Per-result baseline Behavior
Gemma 4 31B 32K ~8K tokens (~32K chars) Active — large results paginated
GPT-5.4 128K ~32K tokens (~128K chars) Rarely triggers
Claude 4 Opus 200K ~50K tokens (~200K chars) Almost never triggers
Gemini 2.5 1M ~250K tokens (~1M chars) Invisible

Related Issues

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)

Changes Made

New files

  • agent/tool_budget.pyToolBudget class: budget calculation, spill-to-disk, preview generation with pagination metadata
  • tests/agent/test_tool_budget.py — 25 unit tests for budget calculation, compaction triggers, apply/spill logic
  • tests/test_tool_budget_integration.py — 15 integration tests for agent wiring and end-to-end behavior
  • tests/tools/test_read_file_budget.py — 3 tests verifying read_file exemption removed
  • website/docs/developer-guide/tool-budgets.md — Developer documentation

Modified files

  • run_agent.py — Init ToolBudget after compressor, _apply_tool_budget() wrapper with compaction-before-spill, intercept in both concurrent and sequential dispatch paths, pass dynamic turn budget to enforce_turn_budget()
  • tools/budget_config.py — Remove read_file: float("inf") from PINNED_THRESHOLDS (budget layer makes it unnecessary)
  • tools/file_tools.py — Remove max_result_size_chars=float('inf') from read_file registration
  • tools/tool_result_storage.py — Updated existing test from infinity assertion to default threshold assertion
  • cli-config.yaml.example — Added tool_budgets config block

How to Test

Unit tests

pytest tests/agent/test_tool_budget.py tests/test_tool_budget_integration.py \
       tests/tools/test_read_file_budget.py tests/tools/test_tool_result_storage.py -v

43 new tests + 41 existing storage tests all pass. Zero regressions across 1023 locally-runnable tests.

Manual (32K model)

  1. Configure a model with ≤32K context (e.g., local Gemma 4 via llama.cpp)
  2. Run commands that produce large output (ps aux, find /, cat a big file)
  3. Verify results are paginated with read_file instructions, no HTTP 400 errors
  4. Verify the model can page through with read_file offset=N

Manual (128K+ model)

  1. Configure a large-context model
  2. Same commands — verify results pass through unchanged, budget never triggers

Manual (eviction + compaction interaction)

  1. Use a 32K model, have a long conversation
  2. Run a command producing large output when context is ~90% full
  3. Check logs — compaction should fire before spill, freeing room for a useful chunk

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (feat(agent):, fix(tools):, test(tools):, docs:)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (43 new tests across 3 test files)
  • I've tested on my platform: Ubuntu (kernel 6.17.0-19-generic), Python 3.13.7, RTX 5090 w/ Gemma 4 31B Q4_K_M via llama.cpp (32K context)

Documentation & Housekeeping

  • I've updated relevant documentation — website/docs/developer-guide/tool-budgets.md
  • I've updated cli-config.yaml.example — added tool_budgets block
  • I've considered cross-platform impact — pure Python, no OS-specific code, no new dependencies
  • N/A — no changes to tool descriptions/schemas for existing tools

Design Notes

Why centralized (not per-tool)

Hermes has a plugin-style tool ecosystem — anyone can add tools. A centralized budget layer protects ALL tools automatically, including third-party ones that don't know about context limits. Tool authors never need to think about budgets.

Why read_file for pagination

The model already knows read_file with offset/limit. No new tools to register or maintain. Improvements to read_file benefit both file reading and result pagination. One tool, one improvement path.

Prompt caching compatibility

The budget layer runs at insertion time — before a result enters the message array. Once a message is in the array, it never changes. This preserves Anthropic/OpenAI prefix cache hits across turns (addressing the concern raised in #415).

Self-regulating behavior

The feature is designed to be invisible on large-context models:

  • baseline = context_length × 0.25 — on a 200K model that's 50K tokens, far above typical tool output
  • Compaction only fires when context is genuinely tight
  • Spill only happens when the result actually exceeds available space
  • The floor ensures the model always gets a useful chunk, never a useless sliver

Made with Cursor

@jbarket jbarket force-pushed the feat/tool-budget branch from 9ccc11c to 756291b Compare April 9, 2026 00:27
jbarket added 5 commits April 9, 2026 10:23
Budget = max(floor, min(baseline, available_context))
- baseline: 25% of context window (absolute ceiling per result)
- available: remaining context * 4 chars/token (dynamic)
- floor: 2000 tokens minimum (never return useless slivers)

Oversized results spill to disk with pagination metadata.

Made-with: Cursor
- Init ToolBudget after compressor (uses real context_length)
- _apply_tool_budget() intercepts results in both dispatch paths
- Compaction-before-spill when context is tight
- Dynamic turn budget passed to enforce_turn_budget()

Made-with: Cursor
The budget layer now provides context-aware protection, making the
infinity exemption unnecessary. read_file falls back to the default
100K char inner limit with the budget layer as the outer guard.

Made-with: Cursor
- cli-config.yaml.example: tool_budgets block with result_pct,
  turn_pct, floor_tokens, compact_before_spill
- Developer guide explaining budget calculation, scaling, and
  interaction with existing systems

Made-with: Cursor
The test previously asserted read_file had float('inf') threshold.
Updated to verify it now uses DEFAULT_RESULT_SIZE_CHARS since the
budget layer provides context-aware protection.

Made-with: Cursor
@jbarket jbarket force-pushed the feat/tool-budget branch from 756291b to 048253d Compare April 9, 2026 15:23
@alt-glitch alt-glitch added P3 Low — cosmetic, nice to have type/feature New feature or request comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P3 Low — cosmetic, nice to have type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants