Token overhead analysis: 73% of each API call is fixed overhead (~13.9K tokens) — data + suggestions

## Summary

I built a [monitoring dashboard](https://github.com/Bichev/hermes-dashboard) to profile token consumption on a Hermes v0.6.0 deployment running Telegram + WhatsApp + Cron gateways. After analyzing 6 request dumps from `~/.hermes/sessions/`, I found that **73% of every API call is fixed overhead** that doesn't change between requests — regardless of which model or provider is used.

## Data

### Per-Request Token Breakdown (from request dumps)

| Component | Tokens | % of Avg Request |
|-----------|--------|------------------|
| Tool definitions (31 tools) | 8,759 | 46.1% |
| System prompt (SOUL.md + skills catalog) | 5,176 | 27.2% |
| Messages (conversation context) | 3,000–8,775 | 26.7% avg |
| **Total per request** | **~17,000–23,000** | |

The fixed overhead (system prompt + tool definitions) is **~13,935 tokens per API call**, paid before any conversation content is processed.
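Because the message portion varies between requests, the overhead share varies too. The following minimal sketch recomputes it from the table values; the function names are illustrative and not part of Hermes.

```python
# Token counts taken from the per-request breakdown table above.
TOOL_DEFINITION_TOKENS = 8_759
SYSTEM_PROMPT_TOKENS = 5_176


def fixed_overhead() -> int:
    """Tokens sent on every call before any conversation content."""
    return TOOL_DEFINITION_TOKENS + SYSTEM_PROMPT_TOKENS


def overhead_share(message_tokens: int) -> float:
    """Fraction of one request's input that is fixed overhead."""
    fixed = fixed_overhead()
    return fixed / (fixed + message_tokens)


# Low and high ends of the observed message range.
for message_tokens in (3_000, 8_775):
    share = overhead_share(message_tokens)
    print(f"{message_tokens:>5} message tokens: {share:.1%} fixed overhead")
```

For the observed message range of 3,000 to 8,775 tokens, the fixed-overhead share runs from roughly 82% down to 61%, which brackets the 73% average.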

### Tool Definition Costs

All 31 tools from `_HERMES_CORE_TOOLS` are loaded for every platform. The table covers 21 of them:

| Tool | Tokens |
|------|--------|
| `cronjob` | 729 |
| `delegate_task` | 699 |
| `skill_manage` | 699 |
| `terminal` | 693 |
| `execute_code` | 629 |
| `session_search` | 563 |
| `memory` | 536 |
| `search_files` | 438 |
| `patch` | 375 |
| `todo` | 339 |
| 11x `browser_*` tools | 1,258 combined |
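As a cross-check, the listed entries can be summed and compared with the 8,759-token total for all 31 tools. This is plain arithmetic on the numbers above; the variable names are illustrative.

```python
# Per-tool token counts copied from the table above.
LISTED_TOOL_TOKENS = {
    "cronjob": 729,
    "delegate_task": 699,
    "skill_manage": 699,
    "terminal": 693,
    "execute_code": 629,
    "session_search": 563,
    "memory": 536,
    "search_files": 438,
    "patch": 375,
    "todo": 339,
}
BROWSER_TOOLS_COMBINED = 1_258  # 11 browser_* tools
ALL_TOOLS_TOTAL = 8_759         # 31 tools, from the per-request breakdown

listed_total = sum(LISTED_TOOL_TOKENS.values()) + BROWSER_TOOLS_COMBINED
unlisted_total = ALL_TOOLS_TOTAL - listed_total
browser_share = BROWSER_TOOLS_COMBINED / ALL_TOOLS_TOTAL

print(f"listed: {listed_total}, unlisted: {unlisted_total}, browser share: {browser_share:.1%}")
```

The 21 listed tools come to 6,958 tokens, so the 10 unlisted tools account for the remaining 1,801. The `browser_*` group alone is about 14% of all tool-definition tokens.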

### Real-World Impact

In a single evening with 3 active gateway sessions:

| Session | Platform | Messages | Est. API Calls | Est. Input Tokens |
|---------|----------|----------|----------------|-------------------|
| Chat session | Telegram | 168 | ~84 | ~1.6M |
| Group chat | WhatsApp | 122 | ~61 | ~1.2M |
| Group chat | WhatsApp | 64 | ~32 | ~574K |

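**Total estimated input for the evening: ~3.9M tokens across 10 sessions (~207 API calls).** The three sessions listed above account for ~177 of those calls and ~3.4M of those tokens.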
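The per-session figures appear consistent with about one API call per two stored messages and roughly 18K to 20K input tokens per call. That relationship is inferred from the table, not a documented formula, so treat this sketch as a sanity check on the estimates and not as the method used to produce them.

```python
# (messages, estimated API calls, estimated input tokens) from the table above.
SESSIONS = {
    "Telegram chat": (168, 84, 1_600_000),
    "WhatsApp group 1": (122, 61, 1_200_000),
    "WhatsApp group 2": (64, 32, 574_000),
}


def calls_from_messages(messages: int) -> int:
    # Assumption: one API call per user/assistant message pair.
    return messages // 2


def tokens_per_call(input_tokens: int, api_calls: int) -> float:
    """Average input tokens per API call implied by the estimates."""
    return input_tokens / api_calls


for name, (messages, calls, tokens) in SESSIONS.items():
    matches = calls_from_messages(messages) == calls
    print(f"{name}: ~{tokens_per_call(tokens, calls):,.0f} tokens/call, "
          f"calls match messages/2: {matches}")
```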

For agentic coding tasks with hundreds of tool calls, the overhead compounds further:

| Scenario | API Calls | Fixed Overhead Alone | Est. Total Input |
|----------|-----------|---------------------|------------------|
| Feature implementation | 100 | 1.4M tokens | ~4M tokens |
| Large refactor | 500 | 7M tokens | ~25M tokens |
| Full project build | 1,000 | 14M tokens | ~60M tokens |

The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter.
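The fixed-overhead column is the per-call overhead multiplied by the number of calls. A small sketch of that arithmetic follows; `price_per_million_usd` is a hypothetical parameter for whatever input price a given provider charges, and no real price is assumed here.

```python
FIXED_OVERHEAD_TOKENS = 13_935  # system prompt + tool definitions, per call

SCENARIOS = {
    "Feature implementation": 100,
    "Large refactor": 500,
    "Full project build": 1_000,
}


def fixed_overhead_tokens(api_calls: int) -> int:
    """Total fixed-overhead tokens sent across a number of API calls."""
    return api_calls * FIXED_OVERHEAD_TOKENS


def fixed_overhead_cost(api_calls: int, price_per_million_usd: float) -> float:
    """Cost of the fixed overhead alone, for a caller-supplied input price."""
    return fixed_overhead_tokens(api_calls) / 1_000_000 * price_per_million_usd


for scenario, calls in SCENARIOS.items():
    millions = fixed_overhead_tokens(calls) / 1_000_000
    print(f"{scenario}: {millions:.1f}M tokens of fixed overhead")
```

Plugging in a provider's actual input price per million tokens turns the token column into a dollar figure for that model.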

## Root Causes

### 1. All platforms share `_HERMES_CORE_TOOLS` (toolsets.py)

`hermes-telegram`, `hermes-whatsapp`, `hermes-discord`, `hermes-slack`, and `hermes-signal` all resolve to the same `_HERMES_CORE_TOOLS` list. This means a WhatsApp message in a group chat loads all 11 `browser_*` tools (1,258 tokens) even though browser automation isn't usable from a messaging platform.

The infrastructure for per-platform differentiation already exists — `platform_toolsets` in config.yaml maps platforms to toolset names, and `toolsets.py` has modular toolset definitions. But the actual toolset definitions (`hermes-telegram`, `hermes-whatsapp`, etc.) all point to the same shared list.

### 2. Skills catalog injected into every system prompt

The skills index adds ~2,200 tokens to the system prompt on every request, regardless of whether the conversation needs any skills. Skills are already accessible on-demand via `skill_view` / `skills_list` tools.

### 3. Compression threshold may be too conservative

Default `compression.threshold: 0.5` means context doesn't get compressed until it exceeds 50% of max tokens. Combined with `protect_last_n: 20`, long messaging sessions accumulate full message history for many turns before compression kicks in.
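To make the interaction between the two settings concrete, here is a minimal sketch of how a threshold-plus-protected-tail policy behaves. It is not Hermes's actual implementation: `should_compress`, `split_for_compression`, and the 200,000-token context window are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def should_compress(current_tokens: int, max_tokens: int, threshold: float = 0.5) -> bool:
    """True once the context exceeds `threshold` of the maximum."""
    return current_tokens > max_tokens * threshold


def split_for_compression(
    messages: Sequence[str], protect_last_n: int = 20
) -> Tuple[List[str], List[str]]:
    """Return (compressible, protected); the newest messages are left intact."""
    if protect_last_n <= 0:
        return list(messages), []
    return list(messages[:-protect_last_n]), list(messages[-protect_last_n:])


ASSUMED_MAX_TOKENS = 200_000  # illustrative context window, not a Hermes default

print(should_compress(90_000, ASSUMED_MAX_TOKENS))   # below the 100,000-token trigger
print(should_compress(110_000, ASSUMED_MAX_TOKENS))  # above the trigger

history = [f"message {i}" for i in range(30)]
compressible, protected = split_for_compression(history)
print(len(compressible), len(protected))
```

With these assumed numbers, nothing is compressed until the context passes 100,000 tokens, and even then the most recent 20 messages stay uncompressed.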

## Suggestions

These are ordered by impact-to-effort ratio:

### Quick win: Platform-aware tool filtering

Create lighter toolset variants for messaging platforms that exclude tools not usable in that context:

```python
# Assumes _HERMES_CORE_TOOLS is a list of tool-name strings; drops every browser_* tool.
_HERMES_MESSAGING_TOOLS = [t for t in _HERMES_CORE_TOOLS if not t.startswith("browser_")]
```

This saves **~1,258 tokens per request** on all messaging platforms with a 1-line change. The existing `platform_toolsets` config already maps platforms to toolset names, so users could override this.

### Medium effort: Lazy skills loading

Don't inject the skills index into the system prompt. The agent already has `skills_list` and `skill_view` tools — it can discover skills on demand. This saves **~2,200 tokens per request**.
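Taken together with the browser-tool filter, the effect on the fixed overhead can be estimated directly from the figures in this issue. The sketch below is arithmetic only and assumes the two savings are independent of each other.

```python
FIXED_OVERHEAD_TOKENS = 13_935  # system prompt + tool definitions, per call
BROWSER_TOOL_TOKENS = 1_258     # 11 browser_* tool definitions
SKILLS_INDEX_TOKENS = 2_200     # approximate size of the skills catalog

combined_savings = BROWSER_TOOL_TOKENS + SKILLS_INDEX_TOKENS
remaining_overhead = FIXED_OVERHEAD_TOKENS - combined_savings
reduction = combined_savings / FIXED_OVERHEAD_TOKENS

print(f"saved per call: {combined_savings}")
print(f"remaining fixed overhead: {remaining_overhead}")
print(f"reduction: {reduction:.1%}")
```

That leaves about 10,477 tokens of fixed overhead per call on messaging platforms, a reduction of roughly 25%.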

### Low effort: Document compression tuning

The current defaults (`threshold: 0.5`, `protect_last_n: 20`) are conservative for messaging use cases where conversations can be 100+ messages. Documenting recommended settings for high-volume messaging (e.g., `threshold: 0.3`, `protect_last_n: 10`) would help users reduce costs.

## Methodology

I built a [transparent API proxy + dashboard](https://github.com/Bichev/hermes-dashboard) that:
1. Intercepts API calls via a transparent proxy
2. Logs token usage, cost, latency, and model to SQLite
3. Reads `~/.hermes/state.db` for session metadata (platform, message counts, tool calls)
4. Analyzes `~/.hermes/sessions/request_dump_*.json` files for per-component token breakdown

The dashboard is open source and works with any Hermes installation, any model, any provider.
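Step 4 can be approximated with a few lines of Python. This is a rough sketch, not the dashboard's code: it assumes each dump is an OpenAI-style chat payload with top-level `messages` and `tools` keys, and it estimates tokens at about 4 characters each instead of running a real tokenizer. `approx_tokens`, `breakdown`, and `analyze_dumps` are hypothetical names.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def approx_tokens(obj: Any) -> int:
    """Rough estimate: ~4 characters of serialized JSON per token (a heuristic)."""
    return len(json.dumps(obj, ensure_ascii=False)) // 4


def breakdown(payload: Dict[str, Any]) -> Dict[str, int]:
    """Split one request payload into tool, system-prompt, and message estimates."""
    messages = payload.get("messages", [])
    system = [m for m in messages if m.get("role") == "system"]
    conversation = [m for m in messages if m.get("role") != "system"]
    return {
        "tools": approx_tokens(payload.get("tools", [])),
        "system": approx_tokens(system),
        "messages": approx_tokens(conversation),
    }


def analyze_dumps(directory: Path) -> List[Dict[str, int]]:
    """Run the breakdown over every request dump found in a directory."""
    results = []
    for path in sorted(directory.glob("request_dump_*.json")):
        with path.open(encoding="utf-8") as fh:
            results.append(breakdown(json.load(fh)))
    return results


if __name__ == "__main__":
    for row in analyze_dumps(Path.home() / ".hermes" / "sessions"):
        print(row)
```

If the real dump format nests the request differently, only `breakdown` needs to change; an exact count would also require the tokenizer of the model in use.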

## Environment

- Hermes Agent v0.6.0
- Ubuntu 24.04, DigitalOcean VPS
- Model: `anthropic/claude-sonnet-4-5` via OpenRouter (findings apply to any model — overhead is in the prompt, not model-specific)
- Gateways: Telegram, WhatsApp, Cron
- Honcho: not active (zero overhead confirmed)

