Token overhead analysis: 73% of each API call is fixed overhead (~13.9K tokens) — data + suggestions #4379

@Bichev

Summary

I built a monitoring dashboard to profile token consumption on a Hermes v0.6.0 deployment running Telegram + WhatsApp + Cron gateways. After analyzing 6 request dumps from ~/.hermes/sessions/, I found that 73% of every API call is fixed overhead that doesn't change between requests — regardless of which model or provider is used.

Data

Per-Request Token Breakdown (from request dumps)

| Component | Tokens | % of avg request |
|---|---|---|
| Tool definitions (31 tools) | 8,759 | 46.1% |
| System prompt (SOUL.md + skills catalog) | 5,176 | 27.2% |
| Messages (conversation context) | 3,000–8,775 | 26.7% (avg) |
| Total per request | ~17,000–23,000 | |

The fixed overhead (system prompt + tool definitions) is ~13,935 tokens per API call, paid before any conversation content is processed.
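The arithmetic behind that figure can be checked with a short sketch. The two token counts come from the table above; the function name is only illustrative. For the observed message range, the fixed share works out to roughly 61% to 82% of a request.

```python
# Token counts copied from the per-request breakdown in this issue.
TOOL_DEFINITION_TOKENS = 8_759
SYSTEM_PROMPT_TOKENS = 5_176

FIXED_OVERHEAD_TOKENS = TOOL_DEFINITION_TOKENS + SYSTEM_PROMPT_TOKENS


def fixed_overhead_share(message_tokens: int) -> float:
    """Fraction of one request's input that is fixed overhead."""
    return FIXED_OVERHEAD_TOKENS / (FIXED_OVERHEAD_TOKENS + message_tokens)


print(FIXED_OVERHEAD_TOKENS)  # 13935
# Low and high ends of the observed conversation-context range.
for message_tokens in (3_000, 8_775):
    print(message_tokens, f"{fixed_overhead_share(message_tokens):.1%}")
```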

Tool Definition Costs

All 31 tools from _HERMES_CORE_TOOLS are loaded for every platform. The 21 tools listed below account for 6,958 of the 8,759 tokens:

| Tool | Tokens |
|---|---|
| cronjob | 729 |
| delegate_task | 699 |
| skill_manage | 699 |
| terminal | 693 |
| execute_code | 629 |
| session_search | 563 |
| memory | 536 |
| search_files | 438 |
| patch | 375 |
| todo | 339 |
| 11 browser_* tools | 1,258 (combined) |

Real-World Impact

In a single evening with 3 active gateway sessions:

| Session | Platform | Messages | Est. API calls | Est. input tokens |
|---|---|---|---|---|
| Chat session | Telegram | 168 | ~84 | ~1.6M |
| Group chat | WhatsApp | 122 | ~61 | ~1.2M |
| Group chat | WhatsApp | 64 | ~32 | ~574K |

Total estimated input for the evening: ~3.9M tokens across 10 sessions (~207 API calls). The table above shows three of those sessions.

For agentic coding tasks with hundreds of tool calls, the overhead compounds further:

| Scenario | API calls | Fixed overhead alone | Est. total input |
|---|---|---|---|
| Feature implementation | 100 | 1.4M tokens | ~4M tokens |
| Large refactor | 500 | 7M tokens | ~25M tokens |
| Full project build | 1,000 | 14M tokens | ~60M tokens |

The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter.
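The "fixed overhead alone" column is just the per-call overhead multiplied by the number of calls, as this sketch shows. The price passed to `fixed_overhead_cost` is a made-up placeholder, not a quote from any provider; substitute the input-token rate of the model in use.

```python
FIXED_OVERHEAD_TOKENS = 13_935  # system prompt + tool definitions, from this issue


def fixed_overhead_total(api_calls: int) -> int:
    """Fixed-overhead input tokens paid across a number of API calls."""
    return api_calls * FIXED_OVERHEAD_TOKENS


def fixed_overhead_cost(api_calls: int, usd_per_million_input_tokens: float) -> float:
    """Dollar cost of the fixed overhead at a caller-supplied input price."""
    return fixed_overhead_total(api_calls) / 1_000_000 * usd_per_million_input_tokens


scenarios = [
    ("Feature implementation", 100),
    ("Large refactor", 500),
    ("Full project build", 1_000),
]
for name, calls in scenarios:
    tokens = fixed_overhead_total(calls)
    # 3.0 USD per million input tokens is a hypothetical example price.
    cost = fixed_overhead_cost(calls, 3.0)
    print(f"{name}: {tokens / 1e6:.1f}M tokens, ${cost:.2f} at the example price")
```

The token totals agree with the table to within rounding.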

Root Causes

1. All platforms share _HERMES_CORE_TOOLS (toolsets.py)

hermes-telegram, hermes-whatsapp, hermes-discord, hermes-slack, and hermes-signal all resolve to the same _HERMES_CORE_TOOLS list. This means a WhatsApp message in a group chat loads all 11 browser_* tools (1,258 tokens) even though browser automation isn't usable from a messaging platform.

The infrastructure for per-platform differentiation already exists — platform_toolsets in config.yaml maps platforms to toolset names, and toolsets.py has modular toolset definitions. But the actual toolset definitions (hermes-telegram, hermes-whatsapp, etc.) all point to the same shared list.

2. Skills catalog injected into every system prompt

The skills index adds ~2,200 tokens to the system prompt on every request, regardless of whether the conversation needs any skills. Skills are already accessible on-demand via skill_view / skills_list tools.

3. Compression threshold may be too conservative

Default compression.threshold: 0.5 means context doesn't get compressed until it exceeds 50% of max tokens. Combined with protect_last_n: 20, long messaging sessions accumulate full message history for many turns before compression kicks in.

Suggestions

These are ordered by impact-to-effort ratio:

Quick win: Platform-aware tool filtering

Create lighter toolset variants for messaging platforms that exclude tools not usable in that context:

```python
# Messaging variant: every core tool except the browser_* family.
_HERMES_MESSAGING_TOOLS = [t for t in _HERMES_CORE_TOOLS if not t.startswith("browser_")]
```

This saves ~1,258 tokens per request on all messaging platforms with a 1-line change. The existing platform_toolsets config already maps platforms to toolset names, so users could override this.

Medium effort: Lazy skills loading

Don't inject the skills index into the system prompt. The agent already has skills_list and skill_view tools — it can discover skills on demand. This saves ~2,200 tokens per request.
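One possible shape for this, as a sketch only. `build_system_prompt`, its parameters, and the hint wording are hypothetical and do not reflect how the Hermes prompt builder is actually structured. The idea is to replace the full index with a one-line pointer to the existing tools.

```python
SKILLS_HINT = (
    "Skills are available on demand: call skills_list to browse them "
    "and skill_view to read one."
)


def build_system_prompt(soul: str, skills_index: str, lazy_skills: bool = True) -> str:
    """Assemble a system prompt, optionally leaving out the full skills index.

    Hypothetical helper: with lazy_skills=True only a short hint is added,
    so the skills index is not sent on every request.
    """
    if lazy_skills:
        return f"{soul}\n\n{SKILLS_HINT}"
    return f"{soul}\n\n{skills_index}"


soul = "You are Hermes."
index = "skill-a: ...\nskill-b: ..."  # placeholder index content
print(len(build_system_prompt(soul, index, lazy_skills=True)))
print(len(build_system_prompt(soul, index, lazy_skills=False)))
```

A config flag for this would let users who rely on the injected index keep the current behavior.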

Low effort: Document compression tuning

The current defaults (threshold: 0.5, protect_last_n: 20) are conservative for messaging use cases where conversations can be 100+ messages. Documenting recommended settings for high-volume messaging (e.g., threshold: 0.3, protect_last_n: 10) would help users reduce costs.
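To make the difference concrete, here is a rough sketch of where compression would start under each setting. It assumes the threshold is a simple fraction of the model's context window, that the fixed overhead counts toward it, and a 200,000-token window. All three are assumptions for illustration, not confirmed Hermes behavior.

```python
FIXED_OVERHEAD_TOKENS = 13_935  # system prompt + tool definitions, from this issue
ASSUMED_CONTEXT_WINDOW = 200_000  # hypothetical context size, for illustration only


def compression_trigger_tokens(max_context_tokens: int, threshold: float) -> int:
    """Context size at which compression would start (assumed semantics)."""
    return round(max_context_tokens * threshold)


def message_headroom(max_context_tokens: int, threshold: float) -> int:
    """Conversation tokens that fit before the trigger, if overhead counts toward it."""
    return compression_trigger_tokens(max_context_tokens, threshold) - FIXED_OVERHEAD_TOKENS


for threshold in (0.5, 0.3):
    trigger = compression_trigger_tokens(ASSUMED_CONTEXT_WINDOW, threshold)
    headroom = message_headroom(ASSUMED_CONTEXT_WINDOW, threshold)
    print(f"threshold={threshold}: trigger at {trigger:,} tokens, "
          f"{headroom:,} tokens of messages before compression")
```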

Methodology

I built an API proxy and dashboard that:

  1. Intercepts API calls transparently, without changes to Hermes
  2. Logs token usage, cost, latency, and model to SQLite
  3. Reads ~/.hermes/state.db for session metadata (platform, message counts, tool calls)
  4. Analyzes ~/.hermes/sessions/request_dump_*.json files for per-component token breakdown
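The per-component breakdown in step 4 can be approximated along these lines. This is a simplified sketch, not the dashboard's code. It assumes each dump is an OpenAI-style chat payload with top-level `tools` and `messages` keys, and it estimates tokens at about 4 characters each rather than using a real tokenizer, so absolute numbers will differ from the tables above.

```python
import json
from pathlib import Path


def estimate_tokens(obj) -> int:
    """Rough token estimate: ~4 characters of serialized JSON per token (heuristic)."""
    return len(json.dumps(obj, ensure_ascii=False)) // 4


def breakdown(dump: dict) -> dict:
    """Split one request dump into tool, system-prompt, and message token estimates.

    Assumes an OpenAI-style payload: dump["tools"] and dump["messages"],
    with the system prompt stored as messages whose role is "system".
    """
    messages = dump.get("messages", [])
    system = [m for m in messages if m.get("role") == "system"]
    conversation = [m for m in messages if m.get("role") != "system"]
    return {
        "tools": estimate_tokens(dump.get("tools", [])),
        "system": estimate_tokens(system),
        "messages": estimate_tokens(conversation),
    }


def summarize_dumps(sessions_dir: Path) -> None:
    """Print a breakdown for every request dump found in a sessions directory."""
    for path in sorted(sessions_dir.glob("request_dump_*.json")):
        dump = json.loads(path.read_text(encoding="utf-8"))
        print(path.name, breakdown(dump))
```

Calling `summarize_dumps(Path.home() / ".hermes" / "sessions")` would then print one line per dump.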

The dashboard is open source and works with any Hermes installation, any model, any provider.

Environment

  • Hermes Agent v0.6.0
  • Ubuntu 24.04, DigitalOcean VPS
  • Model: anthropic/claude-sonnet-4-5 via OpenRouter (findings apply to any model — overhead is in the prompt, not model-specific)
  • Gateways: Telegram, WhatsApp, Cron
  • Honcho: not active (zero overhead confirmed)

Labels: P3 (Low: cosmetic, nice to have), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/tools (Tool registry, model_tools, toolsets), type/perf (Performance improvement or optimization)
