fix(compression): include system prompt + tool schemas in token estimates by teknium1 · Pull Request #18265 · NousResearch/hermes-agent

teknium1 · 2026-05-01T05:51:09Z

Auto-compression banners and the post-compression last_prompt_tokens writeback now report real request pressure instead of a transcript-only char/4 estimate — which was missing the system prompt and tool schemas and could underestimate by 200x+ on sessions with many tools.

Root cause

estimate_messages_tokens_rough(messages) only counts sum(len(str(msg)) for msg in messages) / 4. With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like 45 tokens to that estimator is really ~10,550 tokens of real request pressure — a 234x gap.

User-facing symptoms this closes

#6217 (reported by @Jackten) — /compress banner shows compression triggering at a tiny number like ~4,462 tokens even though the real pressure is much higher, and can even report the post-compression count as larger than the pre-compression count because a dense handoff summary replaces many short turns. Also reported by @codecovenant on X (2026-04-30) as the trigger for this PR: 'tells you its happening at a number much lower than threshold.'

#14695 (reported by @devilardis) — last_prompt_tokens writeback after _compress_context() omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate. Compression triggers late and can exceed the model's context limit on small-context models.

Fix

Swap estimate_messages_tokens_rough() → estimate_request_tokens_rough(messages, system_prompt=..., tools=...) everywhere a user-visible number is shown or the compressor's internal tracking is updated. The correct estimator already existed for exactly this purpose.

Changes

run_agent.py — post-compression last_prompt_tokens writeback (fixes BUG: Post-compression token estimate excludes tools schema, delaying next compression cycle #14695); post-tool-call should_compress() fallback when provider usage is missing
cli.py — /compress banner + before/after summary
gateway/run.py — gateway /compress banner + summary
tui_gateway/server.py — TUI /compress status line + summary
acp_adapter/server.py — ACP /compact before/after
agent/manual_compression_feedback.py — relabel 'Rough transcript estimate' → 'Approx request size' (the metric changed)

Intentionally NOT changed

Session-hygiene fallback and the 'no agent' /status fallback in gateway/run.py — no agent is in scope to query for system prompt / tools, and the existing 30–50% overestimate wobble in hygiene is safety-accepted (see comment at gateway/run.py:5582).
Verbose-mode Request size logging — api_messages already contains the system prompt in index 0, so it's not user-visible-misleading.

Validation

E2E with realistic fixture (15KB system prompt, 30 tool schemas, 4 short messages):

	Before fix	After fix
`/compress` banner shows	`~45 tokens`	`~10,552 tokens`
Post-compression `last_prompt_tokens`	75,000	105,000
`should_compress()` at 100K threshold	`False` (delayed)	`True` (on time)

Targeted tests — all passing on this branch:

tests/cli/test_manual_compress.py — 4/4
tests/gateway/test_compress_command.py — 4/4
tests/test_cli_manual_compress.py — 1/1
tests/acp/test_server.py::test_compact_compresses_context — pass
tests/tui_gateway/ — 189/189
tests/agent/test_context_compressor.py + friends — 115/115

The 2 pre-existing failures in tests/acp/test_server.py::test_send_available_commands_update and tests/run_agent/test_concurrent_interrupt.py also fail on clean origin/main — unrelated.

Credits

Diagnosis in BUG: Post-compression token estimate excludes tools schema, delaying next compression cycle #14695 by @devilardis
Diagnosis in /compress can report higher token counts after successful compaction because the banner uses a rough transcript-only estimate #6217 by @Jackten
Report via X by @codecovenant

Closes #14695
Closes #6217

@codecovenant

…ates The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; #6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (#14695) and @Jackten (#6217); user report @codecovenant on X (2026-04-30). Closes #14695 Closes #6217

alt-glitch · 2026-05-01T06:05:15Z

Closes #14695 and #6217. Supersedes #15433 (same fix, narrower scope).

alt-glitch · 2026-05-01T06:06:21Z

Closes #14695 and #6217. Supersedes #15433 (same fix, narrower scope).

…ation Cherry-picked from NousResearch/hermes-agent: 1. f0dc919 - fix(compression): include system prompt + tool schemas in token estimates (NousResearch#18265). Replaces estimate_messages_tokens_rough() with estimate_request_tokens_rough() so that tool schema tokens (20-30K with 50+ tools) are counted, preventing compression from being skipped past its threshold. 2. c5b4c48 - fix: lazy session creation — defer DB row until first message (NousResearch#18370). Prevents empty/ghost session rows from accumulating. Adds prune_empty_ghost_sessions() for one-time cleanup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@codecovenant

…ates (NousResearch#18265) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; NousResearch#6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (NousResearch#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217); user report @codecovenant on X (2026-04-30). Closes NousResearch#14695 Closes NousResearch#6217

@codecovenant

…ates (NousResearch#18265) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; NousResearch#6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (NousResearch#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217); user report @codecovenant on X (2026-04-30). Closes NousResearch#14695 Closes NousResearch#6217

@codecovenant

…ates (NousResearch#18265) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; NousResearch#6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (NousResearch#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217); user report @codecovenant on X (2026-04-30). Closes NousResearch#14695 Closes NousResearch#6217

@codecovenant

…ates (NousResearch#18265) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; NousResearch#6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (NousResearch#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217); user report @codecovenant on X (2026-04-30). Closes NousResearch#14695 Closes NousResearch#6217

@codecovenant

…ates (NousResearch#18265) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; NousResearch#6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (NousResearch#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217); user report @codecovenant on X (2026-04-30). Closes NousResearch#14695 Closes NousResearch#6217

teknium1 merged commit f0dc919 into main May 1, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-c71b9b2e branch May 1, 2026 06:03

github-actions Bot mentioned this pull request May 8, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.30 to v2026.5.7 Docker-Hub-sirmark/docker-hermes-agent#5

Merged

heathley mentioned this pull request May 11, 2026

fix(compression): separate provider-exact vs projected token state #23934

Open

Tranquil-Flow mentioned this pull request May 19, 2026

fix(agent): include tools schema in post-compression token estimate (#14695) #15433

Closed

19 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compression): include system prompt + tool schemas in token estimates#18265

fix(compression): include system prompt + tool schemas in token estimates#18265
teknium1 merged 1 commit into
mainfrom
hermes/hermes-c71b9b2e

teknium1 commented May 1, 2026

Uh oh!

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

teknium1 commented May 1, 2026

Root cause

User-facing symptoms this closes

Fix

Changes

Intentionally NOT changed

Validation

Credits

Uh oh!

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants