Feature: Lazy Tool Schema Loading — Two-Pass Tool Injection to Reduce Token Overhead #6839

## Problem

Every API call injects full tool schemas for ALL enabled toolsets. With 50+ tools across terminal, file, web, browser, delegate, vision, memory, and more, this consumes ~3,500-5,000 tokens per call — regardless of whether the conversation needs those tools.

On local models, tool-formatted prompts are roughly 10x slower to process than plain text (benchmarked: 1,230 tok/s for plain text vs 134 tok/s with 8 tools, about 9x; see #5544). Even on cloud providers, the unused schemas are wasted tokens at scale.

For simple conversational turns ("hi", "what model are you using?"), the model doesn't need to know about `browser_click`, `web_crawl`, or `delegate_task`. But it gets all of them anyway.

## Proposed Solution

Two-pass lazy tool loading:

- **Pass 1 (every call):** Send tool names + one-line descriptions only (~300-500 tokens vs ~4,000)
- **Pass 2 (on demand):** When the model picks a tool, send the full schema in a follow-up call

### Flow:
1. User sends message
2. Hermes sends system prompt + conversation history + ABBREVIATED tool list (name + 1-line description)
3. Model either:
   a. Responds normally (no tools needed) → done in 1 API call, saving ~3,500 tokens
   b. Requests a tool by name → Hermes sends a second call with that tool's full schema injected
4. If a tool was requested, the model calls it using the full schema, the result comes back, and the turn continues normally
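
The flow above can be sketched as a small two-pass function. This is only an illustration under stated assumptions: `run_turn`, `call_model`, the `requested_tool` reply key, and the stand-in `fake_model` are hypothetical names, not part of Hermes or of any provider API.

```python
from typing import Callable

# Hypothetical model interface: (messages, tool_info) -> reply dict.
Model = Callable[[list, list], dict]


def run_turn(messages: list, abbreviated: list, full_schemas: dict, call_model: Model):
    """Run one user turn and return (reply, number_of_api_calls)."""
    # Pass 1: only tool names and one-line descriptions are sent.
    reply = call_model(messages, abbreviated)
    requested = reply.get("requested_tool")
    if requested is None:
        return reply, 1  # plain answer, done in a single call
    if requested not in full_schemas:
        raise KeyError(f"unknown tool requested: {requested}")
    # Pass 2: re-submit with the full schema of the requested tool only.
    reply = call_model(messages, [full_schemas[requested]])
    return reply, 2


def fake_model(messages, tools):
    """Stand-in model: requests web_crawl when the user message contains a URL."""
    has_full_schema = any("parameters" in t for t in tools)
    text = messages[-1]["content"]
    if "http" in text and not has_full_schema:
        return {"requested_tool": "web_crawl"}
    if has_full_schema:
        return {"tool_call": {"name": "web_crawl", "arguments": {"url": text.split()[-1]}}}
    return {"content": "Hello!"}


ABBREVIATED = [{"name": "web_crawl", "description": "Fetch a web page."}]
FULL = {
    "web_crawl": {
        "name": "web_crawl",
        "description": "Fetch a web page.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    }
}
```

With this stub, a message such as "hi" finishes in one call, while a message ending in a URL takes two calls and ends with a tool call built from the full schema.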

### Config:
```yaml
tools:
  loading: lazy    # "eager" (current, default) or "lazy"
```
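
A minimal sketch of how the setting might be read once the YAML has been parsed into a dict. The function name `tool_loading_mode` is hypothetical; the only parts taken from the proposal are the `tools.loading` key, its two values, and the `eager` default.

```python
VALID_MODES = ("eager", "lazy")


def tool_loading_mode(config: dict) -> str:
    """Return the configured loading mode, defaulting to the current eager behavior."""
    tools_section = config.get("tools") or {}
    mode = tools_section.get("loading", "eager")
    if mode not in VALID_MODES:
        raise ValueError(f"tools.loading must be one of {VALID_MODES}, got {mode!r}")
    return mode
```

Rejecting unknown values early keeps a typo such as `lazzy` from silently falling back to eager loading.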

### Token savings estimate:
| Scenario | Current | Lazy | Savings |
|----------|---------|------|---------|
| Simple chat (no tools) | ~5,000 tokens base | ~1,500 tokens base | ~70% |
| One tool call | ~5,000 + response | ~1,500 + 2,000 + response | ~30% |
| Multi-tool session | ~5,000 per call | ~1,500 first call, then ~2,000 after | 30-60% |
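
The percentages in the table follow directly from the token estimates. A quick check, using only the approximate figures from the table (the helper name `savings_pct` is illustrative):

```python
def savings_pct(current_tokens: int, lazy_tokens: int) -> int:
    """Percentage of tokens saved by lazy loading, rounded to the nearest integer."""
    return round(100 * (current_tokens - lazy_tokens) / current_tokens)


simple_chat = savings_pct(5000, 1500)           # no tool needed
one_tool_call = savings_pct(5000, 1500 + 2000)  # base plus one full schema
```

Here `simple_chat` evaluates to 70 and `one_tool_call` to 30, matching the first two rows.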

### Implementation sketch:
- Add a new tool (e.g. `request_tool`) that accepts a tool name and returns confirmation + full schema injection
- In `run_agent.py`, when lazy loading is enabled:
  - Build abbreviated tool list: just `{"name": ..., "description": "Call this to ..."}` for each tool
  - On first pass, inject abbreviated list + `request_tool` as the only real tool
  - When model calls `request_tool`, inject the requested tool's full schema and re-submit
- Backward compatible: default is `eager` (current behavior)
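
A rough sketch of the pieces listed above, assuming tool schemas are dicts with `name`, `description`, and JSON Schema `parameters` (the common function-calling shape). The helper names, the `tool_name` argument of `request_tool`, and the example schema are all assumptions for illustration, not the actual code in `run_agent.py`.

```python
def abbreviate(schema: dict) -> dict:
    """Keep only the tool name and the first line of its description."""
    lines = schema.get("description", "").strip().splitlines()
    return {"name": schema["name"], "description": lines[0] if lines else ""}


# Hypothetical schema for the proposed request_tool.
REQUEST_TOOL = {
    "name": "request_tool",
    "description": "Request the full schema of a listed tool before calling it.",
    "parameters": {
        "type": "object",
        "properties": {"tool_name": {"type": "string"}},
        "required": ["tool_name"],
    },
}


def first_pass_payload(full_schemas: list[dict]) -> dict:
    """Abbreviated list goes in as text; request_tool is the only real tool."""
    short = [abbreviate(s) for s in full_schemas]
    catalogue = "\n".join(f"- {t['name']}: {t['description']}" for t in short)
    return {"catalogue": catalogue, "tools": [REQUEST_TOOL]}


def second_pass_tools(full_schemas: list[dict], tool_name: str) -> list[dict]:
    """Return the full schema of the requested tool for the re-submitted call."""
    for schema in full_schemas:
        if schema["name"] == tool_name:
            return [schema]
    raise KeyError(f"unknown tool: {tool_name}")


EXAMPLE_SCHEMAS = [
    {
        "name": "web_crawl",
        "description": "Crawl a web page.\nSupports depth limits and link filters.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    }
]
```

Calling `first_pass_payload(EXAMPLE_SCHEMAS)` yields a one-line catalogue entry for `web_crawl` plus `request_tool`, and `second_pass_tools(EXAMPLE_SCHEMAS, "web_crawl")` returns the untouched full schema for the follow-up call.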

### Trade-offs:
- **Pro:** Massive token savings on conversational turns (the majority of messages)
- **Pro:** Faster on local models (less prompt processing)
- **Pro:** Lower API costs on cloud providers
- **Con:** +1 API round trip when tools ARE needed (adds ~1-2s latency)
- **Con:** Slightly more complex agent loop

## Related Issues
- #5544 — Memory tools auto-injected, 10x latency on local models
- #2045 — Lazy skill loading (similar concept for skills, not tools)
- #499 — Context compaction quality overhaul

## Environment
- Hermes Agent: v0.8.0 (HEAD)
- Model: GLM-5 Turbo via Z.ai
- OS: macOS (Apple Silicon M1 Max)
