Skip to content

Feature: Lazy Tool Schema Loading — Two-Pass Tool Injection to Reduce Token Overhead #6839

@jarviszomine

Description

@jarviszomine

Problem

Every API call injects full tool schemas for ALL enabled toolsets. With 50+ tools across terminal, file, web, browser, delegate, vision, memory, and more, this consumes ~3,500-5,000 tokens per call — regardless of whether the conversation needs those tools.

On local models, tool-formatted prompts are 10x slower to process than plain text (benchmarked: 1,230 tok/s vs 134 tok/s with 8 tools — see #5544). Even on cloud providers, this is wasted tokens at scale.

For simple conversational turns ("hi", "what model are you using?"), the model doesn't need to know about browser_click or web_crawl or delegate_task. But it gets all of them anyway.

Proposed Solution

Two-pass lazy tool loading:

Pass 1 (every call): Send tool names + one-line descriptions only (~300-500 tokens vs ~4,000)
Pass 2 (on demand): When the model picks a tool, send the full schema in a follow-up call

Flow:

  1. User sends message
  2. Hermes sends system prompt + conversation history + ABBREVIATED tool list (name + 1-line description)
  3. Model either:
    a. Responds normally (no tools needed) → done in 1 API call, saved ~3,500 tokens
    b. Requests a tool by name → Hermes sends a second call with that tool's full schema injected
    c. The model executes the tool, result comes back, continues normally

Config:

tools:
  loading: lazy    # "eager" (current, default) or "lazy"

Token savings estimate:

Scenario Current Lazy Savings
Simple chat (no tools) ~5,000 tokens base ~1,500 tokens base ~70%
One tool call ~5,000 + response ~1,500 + 2,000 + response ~30%
Multi-tool session ~5,000 per call ~1,500 first call, then ~2,000 after 30-60%

Implementation sketch:

  • Add a new tool (e.g. request_tool) that accepts a tool name and returns confirmation + full schema injection
  • In run_agent.py, when lazy loading is enabled:
    • Build abbreviated tool list: just {"name": ..., "description": "Call this to ..."} for each tool
    • On first pass, inject abbreviated list + request_tool as the only real tool
    • When model calls request_tool, inject the requested tool's full schema and re-submit
  • Backward compatible: default is eager (current behavior)

Trade-offs:

  • Pro: Massive token savings on conversational turns (the majority of messages)
  • Pro: Faster on local models (less prompt processing)
  • Pro: Lower API costs on cloud providers
  • Con: +1 API round trip when tools ARE needed (adds ~1-2s latency)
  • Con: Slightly more complex agent loop

Related Issues

Environment

  • Hermes Agent: v0.8.0 (HEAD)
  • Model: GLM-5 Turbo via Z.ai
  • OS: macOS (Apple Silicon M1 Max)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P3Low — cosmetic, nice to havecomp/agentCore agent loop, run_agent.py, prompt buildercomp/toolsTool registry, model_tools, toolsetstype/perfPerformance improvement or optimization

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions