Skip to content

fix(compression): include system prompt + tool schemas in token estimates#18265

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-c71b9b2e
May 1, 2026
Merged

fix(compression): include system prompt + tool schemas in token estimates#18265
teknium1 merged 1 commit into
mainfrom
hermes/hermes-c71b9b2e

Conversation

@teknium1

@teknium1 teknium1 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Auto-compression banners and the post-compression last_prompt_tokens writeback now report real request pressure instead of a transcript-only char/4 estimate — which was missing the system prompt and tool schemas and could underestimate by 200x+ on sessions with many tools.

Root cause

estimate_messages_tokens_rough(messages) only counts sum(len(str(msg)) for msg in messages) / 4. With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like 45 tokens to that estimator is really ~10,550 tokens of real request pressure — a 234x gap.

User-facing symptoms this closes

#6217 (reported by @Jackten) — /compress banner shows compression triggering at a tiny number like ~4,462 tokens even though the real pressure is much higher, and can even report the post-compression count as larger than the pre-compression count because a dense handoff summary replaces many short turns. Also reported by @codecovenant on X (2026-04-30) as the trigger for this PR: 'tells you its happening at a number much lower than threshold.'

#14695 (reported by @devilardis) — last_prompt_tokens writeback after _compress_context() omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate. Compression triggers late and can exceed the model's context limit on small-context models.

Fix

Swap estimate_messages_tokens_rough()estimate_request_tokens_rough(messages, system_prompt=..., tools=...) everywhere a user-visible number is shown or the compressor's internal tracking is updated. The correct estimator already existed for exactly this purpose.

Changes

  • run_agent.py — post-compression last_prompt_tokens writeback (fixes BUG: Post-compression token estimate excludes tools schema, delaying next compression cycle #14695); post-tool-call should_compress() fallback when provider usage is missing
  • cli.py/compress banner + before/after summary
  • gateway/run.py — gateway /compress banner + summary
  • tui_gateway/server.py — TUI /compress status line + summary
  • acp_adapter/server.py — ACP /compact before/after
  • agent/manual_compression_feedback.py — relabel 'Rough transcript estimate' → 'Approx request size' (the metric changed)

Intentionally NOT changed

  • Session-hygiene fallback and the 'no agent' /status fallback in gateway/run.py — no agent is in scope to query for system prompt / tools, and the existing 30–50% overestimate wobble in hygiene is safety-accepted (see comment at gateway/run.py:5582).
  • Verbose-mode Request size logging — api_messages already contains the system prompt in index 0, so it's not user-visible-misleading.

Validation

E2E with realistic fixture (15KB system prompt, 30 tool schemas, 4 short messages):

Before fix After fix
/compress banner shows ~45 tokens ~10,552 tokens
Post-compression last_prompt_tokens 75,000 105,000
should_compress() at 100K threshold False (delayed) True (on time)

Targeted tests — all passing on this branch:

  • tests/cli/test_manual_compress.py — 4/4
  • tests/gateway/test_compress_command.py — 4/4
  • tests/test_cli_manual_compress.py — 1/1
  • tests/acp/test_server.py::test_compact_compresses_context — pass
  • tests/tui_gateway/ — 189/189
  • tests/agent/test_context_compressor.py + friends — 115/115

The 2 pre-existing failures in tests/acp/test_server.py::test_send_available_commands_update and tests/run_agent/test_concurrent_interrupt.py also fail on clean origin/main — unrelated.

Credits

Closes #14695
Closes #6217

…ates

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; #6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (#14695) and @Jackten (#6217);
user report @codecovenant on X (2026-04-30).

Closes #14695
Closes #6217
@teknium1 teknium1 merged commit f0dc919 into main May 1, 2026
10 of 11 checks passed
@teknium1 teknium1 deleted the hermes/hermes-c71b9b2e branch May 1, 2026 06:03
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/tui Terminal UI (ui-tui/ + tui_gateway/) labels May 1, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Closes #14695 and #6217. Supersedes #15433 (same fix, narrower scope).

1 similar comment
@alt-glitch

Copy link
Copy Markdown
Collaborator

Closes #14695 and #6217. Supersedes #15433 (same fix, narrower scope).

beamind added a commit to beamind/hermes-agent that referenced this pull request May 2, 2026
…ation

Cherry-picked from NousResearch/hermes-agent:

1. f0dc919 - fix(compression): include system prompt + tool schemas in
   token estimates (NousResearch#18265). Replaces estimate_messages_tokens_rough()
   with estimate_request_tokens_rough() so that tool schema tokens
   (20-30K with 50+ tools) are counted, preventing compression from
   being skipped past its threshold.

2. c5b4c48 - fix: lazy session creation — defer DB row until first
   message (NousResearch#18370). Prevents empty/ghost session rows from
   accumulating. Adds prune_empty_ghost_sessions() for one-time cleanup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
nickdlkk pushed a commit to nickdlkk/hermes-agent that referenced this pull request May 11, 2026
…ates (NousResearch#18265)

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; NousResearch#6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (NousResearch#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217);
user report @codecovenant on X (2026-04-30).

Closes NousResearch#14695
Closes NousResearch#6217
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
…ates (NousResearch#18265)

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; NousResearch#6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (NousResearch#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217);
user report @codecovenant on X (2026-04-30).

Closes NousResearch#14695
Closes NousResearch#6217
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
…ates (NousResearch#18265)

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; NousResearch#6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (NousResearch#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217);
user report @codecovenant on X (2026-04-30).

Closes NousResearch#14695
Closes NousResearch#6217
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…ates (NousResearch#18265)

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; NousResearch#6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (NousResearch#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217);
user report @codecovenant on X (2026-04-30).

Closes NousResearch#14695
Closes NousResearch#6217
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…ates (NousResearch#18265)

The user-visible /compress banner and the post-compression last_prompt_tokens
writeback both counted only the raw message transcript (chars/4). With a 15KB
system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks
like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of
request pressure — a 234x gap.

Two user-facing consequences:
- Banner shows 'Compressing … (~45 tokens)…' while compression is actually
  firing on 10K+ tokens of real pressure, confusing users about why
  compression triggered (reported by @codecovenant on X; NousResearch#6217).
- Post-compression last_prompt_tokens writeback omits tool schemas, so the
  next should_compress() check compares real usage against a stale
  underestimate — compression triggers late, potentially past the model's
  context limit on small-context models (NousResearch#14695).

Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough()
at every user-visible banner and at the post-compression writeback.
estimate_request_tokens_rough() already existed for exactly this purpose
and includes system prompt + tool schemas.

Touched call sites:
- run_agent.py: post-compression last_prompt_tokens writeback, post-tool
  call should_compress() fallback when provider usage is missing
- cli.py: /compress banner + summary
- gateway/run.py: gateway /compress banner + summary
- tui_gateway/server.py: TUI /compress status + summary
- acp_adapter/server.py: ACP /compact before/after

Left intentionally alone:
- Session-hygiene fallback and the 'no agent' /status path in gateway/run.py
  — no agent instance is in scope to query for system prompt/tools, and the
  existing 30-50% overestimate wobble on hygiene is safety-accepted.
- Verbose-mode 'Request size' logging — informational only, already counts
  system prompt via api_messages[0].

Also relabels the feedback line from 'Rough transcript estimate' to
'Approx request size' so the metric label matches what it actually measures.

Credits: diagnoses from @devilardis (NousResearch#14695) and @Jackten (NousResearch#6217);
user report @codecovenant on X (2026-04-30).

Closes NousResearch#14695
Closes NousResearch#6217
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/tui Terminal UI (ui-tui/ + tui_gateway/) P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

2 participants