Summary
I built a monitoring dashboard to profile token consumption on a Hermes v0.6.0 deployment running Telegram + WhatsApp + Cron gateways. After analyzing 6 request dumps from ~/.hermes/sessions/, I found that about 73% of the input tokens in an average API call are fixed overhead (tool definitions plus system prompt) that doesn't change between requests, regardless of which model or provider is used.
Data
Per-Request Token Breakdown (from request dumps)
| Component | Tokens | % of Avg Request |
|---|---|---|
| Tool definitions (31 tools) | 8,759 | 46.1% |
| System prompt (SOUL.md + skills catalog) | 5,176 | 27.2% |
| Messages (conversation context) | 3,000–8,775 | 26.7% avg |
| Total per request | ~17,000–23,000 | |
The fixed overhead (system prompt + tool definitions) is ~13,935 tokens per API call, paid before any conversation content is processed.
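The arithmetic behind that figure can be checked with a short sketch. The token counts come from the table above; the function name is only illustrative.

```python
# Token counts taken from the per-request breakdown table.
TOOL_DEFINITION_TOKENS = 8_759
SYSTEM_PROMPT_TOKENS = 5_176
FIXED_OVERHEAD = TOOL_DEFINITION_TOKENS + SYSTEM_PROMPT_TOKENS


def fixed_share(message_tokens: int) -> float:
    """Fraction of one request's input tokens taken up by the fixed overhead."""
    return FIXED_OVERHEAD / (FIXED_OVERHEAD + message_tokens)


# Low and high ends of the observed message range from the table.
for message_tokens in (3_000, 8_775):
    total = FIXED_OVERHEAD + message_tokens
    share = fixed_share(message_tokens)
    print(f"{message_tokens:>5} message tokens -> {total:,} total, {share:.1%} fixed")
```

The fixed share is larger for short conversations and smaller for long ones, which is why the 73% figure is an average rather than a constant.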
Tool Definition Costs
All 31 tools from _HERMES_CORE_TOOLS are loaded for every platform:
| Tool | Tokens |
|---|---|
| cronjob | 729 |
| delegate_task | 699 |
| skill_manage | 699 |
| terminal | 693 |
| execute_code | 629 |
| session_search | 563 |
| memory | 536 |
| search_files | 438 |
| patch | 375 |
| todo | 339 |
| 11x browser_* tools | 1,258 combined |
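For anyone who wants to rank tool definitions on their own install, here is a rough sketch. It assumes the definitions are OpenAI-style function schemas (a `function` object with a `name`) and uses a crude "about 4 characters per token" heuristic, so the numbers will not match a real tokenizer or the figures above. The example definitions are made up.

```python
import json


def estimate_tokens(obj) -> int:
    """Rough estimate: about 4 characters per token over the compact JSON form."""
    return len(json.dumps(obj, separators=(",", ":"))) // 4


def rank_tools(tools: list[dict]) -> list[tuple[str, int]]:
    """Return (tool name, estimated tokens) pairs, largest first."""
    sized = [(tool["function"]["name"], estimate_tokens(tool)) for tool in tools]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)


# Hypothetical tool definitions, for illustration only.
example_tools = [
    {
        "type": "function",
        "function": {
            "name": "todo",
            "description": "Manage a task list.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "terminal",
            "description": "Run a shell command and return its output. " * 5,
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
            },
        },
    },
]

for name, tokens in rank_tools(example_tools):
    print(f"{name:<10} ~{tokens} tokens")
```

Swapping `estimate_tokens` for the provider's actual tokenizer would give exact counts; the ranking logic stays the same.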
Real-World Impact
In a single evening with 3 active gateway sessions:
| Session | Platform | Messages | Est. API Calls | Est. Input Tokens |
|---|---|---|---|---|
| Chat session | Telegram | 168 | ~84 | ~1.6M |
| Group chat | WhatsApp | 122 | ~61 | ~1.2M |
| Group chat | WhatsApp | 64 | ~32 | ~574K |
Total estimated input for the evening: ~3.9M tokens across 10 sessions (~207 API calls), of which the table above lists three.
For agentic coding tasks with hundreds of tool calls, the overhead compounds further:
| Scenario | API Calls | Fixed Overhead Alone | Est. Total Input |
|---|---|---|---|
| Feature implementation | 100 | 1.4M tokens | ~4M tokens |
| Large refactor | 500 | 7M tokens | ~25M tokens |
| Full project build | 1,000 | 14M tokens | ~60M tokens |
The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter.
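The "Fixed Overhead Alone" column is just the per-call overhead multiplied by the number of calls. A minimal sketch of that projection, using the 13,935-token figure measured above (the function name is illustrative):

```python
FIXED_OVERHEAD_TOKENS = 13_935  # system prompt + tool definitions, per call


def fixed_overhead_total(api_calls: int) -> int:
    """Input tokens spent on fixed overhead alone over a number of API calls."""
    return api_calls * FIXED_OVERHEAD_TOKENS


scenarios = [
    ("Feature implementation", 100),
    ("Large refactor", 500),
    ("Full project build", 1_000),
]

for name, calls in scenarios:
    millions = fixed_overhead_total(calls) / 1e6
    print(f"{name:<24} {calls:>5} calls -> {millions:.2f}M tokens")
```

Multiplying the result by a provider's per-token input price gives the dollar figure for a specific model.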
Root Causes
1. All platforms share _HERMES_CORE_TOOLS (toolsets.py)
hermes-telegram, hermes-whatsapp, hermes-discord, hermes-slack, and hermes-signal all resolve to the same _HERMES_CORE_TOOLS list. This means a WhatsApp message in a group chat loads all 11 browser_* tools (1,258 tokens) even though browser automation isn't usable from a messaging platform.
The infrastructure for per-platform differentiation already exists — platform_toolsets in config.yaml maps platforms to toolset names, and toolsets.py has modular toolset definitions. But the actual toolset definitions (hermes-telegram, hermes-whatsapp, etc.) all point to the same shared list.
2. Skills catalog injected into every system prompt
The skills index adds ~2,200 tokens to the system prompt on every request, regardless of whether the conversation needs any skills. Skills are already accessible on-demand via skill_view / skills_list tools.
3. Compression threshold may be too conservative
Default compression.threshold: 0.5 means context doesn't get compressed until it exceeds 50% of max tokens. Combined with protect_last_n: 20, long messaging sessions accumulate full message history for many turns before compression kicks in.
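To get a feel for how long a session runs before compression starts, here is a back-of-the-envelope sketch. It is not Hermes code: it assumes compression triggers once total context exceeds `threshold * max_tokens`, that the fixed overhead counts toward that total, and it uses a hypothetical 200,000-token context window and 100 tokens per message.

```python
import math


def messages_before_compression(
    max_tokens: int,
    threshold: float,
    fixed_overhead: int,
    tokens_per_message: int,
) -> int:
    """Messages that fit before context exceeds threshold * max_tokens."""
    budget = threshold * max_tokens - fixed_overhead
    return max(0, math.floor(budget / tokens_per_message))


# Hypothetical values: 200K context window, 100 tokens per message.
for threshold in (0.5, 0.3):
    n = messages_before_compression(
        max_tokens=200_000,
        threshold=threshold,
        fixed_overhead=13_935,
        tokens_per_message=100,
    )
    print(f"threshold={threshold}: about {n} messages before compression")
```

Under these assumptions a lower threshold roughly halves the number of messages sent in full on every call, though the real trigger logic may differ.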
Suggestions
These are ordered by impact-to-effort ratio:
Quick win: Platform-aware tool filtering
Create lighter toolset variants for messaging platforms that exclude tools not usable in that context:
_HERMES_MESSAGING_TOOLS = [t for t in _HERMES_CORE_TOOLS if not t.startswith("browser_")]
This saves ~1,258 tokens per request on all messaging platforms with a 1-line change. The existing platform_toolsets config already maps platforms to toolset names, so users could override this.
Medium effort: Lazy skills loading
Don't inject the skills index into the system prompt. The agent already has skills_list and skill_view tools — it can discover skills on demand. This saves ~2,200 tokens per request.
Low effort: Document compression tuning
The current defaults (threshold: 0.5, protect_last_n: 20) are conservative for messaging use cases where conversations can be 100+ messages. Documenting recommended settings for high-volume messaging (e.g., threshold: 0.3, protect_last_n: 10) would help users reduce costs.
Methodology
Built a transparent API proxy + dashboard that:
- Intercepts API calls via a transparent proxy
- Logs token usage, cost, latency, and model to SQLite
- Reads ~/.hermes/state.db for session metadata (platform, message counts, tool calls)
- Analyzes ~/.hermes/sessions/request_dump_*.json files for per-component token breakdown
The dashboard is open source and works with any Hermes installation, any model, any provider.
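For readers who want to reproduce the logging step without the dashboard, a minimal sketch using Python's built-in `sqlite3` module is below. The table name, columns, and sample values are assumptions for illustration and are not the dashboard's actual schema or measured data.

```python
import sqlite3
import time

# Hypothetical schema; the real dashboard's tables may look different.
SCHEMA = """
CREATE TABLE IF NOT EXISTS api_calls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL NOT NULL,
    model TEXT NOT NULL,
    input_tokens INTEGER NOT NULL,
    output_tokens INTEGER NOT NULL,
    cost_usd REAL,
    latency_ms REAL
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the log database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_call(conn, model, input_tokens, output_tokens, cost_usd, latency_ms):
    """Record one proxied API call."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO api_calls "
            "(ts, model, input_tokens, output_tokens, cost_usd, latency_ms) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (time.time(), model, input_tokens, output_tokens, cost_usd, latency_ms),
        )


def total_input_tokens(conn) -> int:
    """Sum of input tokens over all logged calls."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(input_tokens), 0) FROM api_calls"
    ).fetchone()
    return total


# Made-up sample values, for illustration only.
conn = open_log()
log_call(conn, "anthropic/claude-sonnet-4-5", 19_000, 250, None, 1800.0)
log_call(conn, "anthropic/claude-sonnet-4-5", 17_000, 120, None, 950.0)
print(total_input_tokens(conn))
```

Passing a file path instead of `:memory:` keeps the log across restarts.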
Environment
- Hermes Agent v0.6.0
- Ubuntu 24.04, DigitalOcean VPS
- Model: anthropic/claude-sonnet-4-5 via OpenRouter (findings apply to any model; the overhead is in the prompt, not model-specific)
- Gateways: Telegram, WhatsApp, Cron
- Honcho: not active (zero overhead confirmed)