fix: token accounting fallback + compression trigger for reasoning models by kshitijk4poor · Pull Request #12028 · NousResearch/hermes-agent

kshitijk4poor · 2026-04-18T07:20:34Z

Summary

Two fixes for MiniMax/GLM-5.1 provider issues discovered during live CLI sessions on April 17, 2026, where sessions on these models kept failing because of empty token accounting and premature compression.

Fix 1: Token estimation fallback when streaming returns no usage data

Closes #12023

Problem: Providers like MiniMax via OpenRouter silently ignore stream_options: {"include_usage": true}. Every streaming chunk has usage=None. The token accounting block in run_agent.py is gated by if response.usage: with no else branch — when falsy, the entire block is skipped: no normalize_usage(), no session token accumulation, no update_token_counts() DB persistence, no cost estimation, no API call logging. Sessions permanently record 0/0 tokens.

Production evidence: Session 20260417_141814_9519aa — 85 messages, 35 tool calls, input_tokens=0, output_tokens=0.

Fix: Added an else branch that falls back to estimate_messages_tokens_rough() / estimate_tokens_rough() for approximate token accounting, updates the context compressor, persists to session DB, and logs a WARNING so the degradation is visible in logs.
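
The shape of that fallback can be sketched as follows. This is a minimal illustration, not the code in run_agent.py: `rough_tokens`, `rough_message_tokens`, `TurnTokens`, and `account_turn` are hypothetical names, and the 4-characters-per-token heuristic is an assumption standing in for whatever estimate_tokens_rough() actually does.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("token_accounting_sketch")

# Assumption: a crude chars-per-token ratio, not Hermes' real estimator.
CHARS_PER_TOKEN = 4


def rough_tokens(text: str) -> int:
    """Hypothetical stand-in for estimate_tokens_rough()."""
    if not text:
        return 0
    return max(1, len(text) // CHARS_PER_TOKEN)


def rough_message_tokens(messages: list[dict]) -> int:
    """Hypothetical stand-in for estimate_messages_tokens_rough()."""
    return sum(rough_tokens(str(m.get("content") or "")) for m in messages)


@dataclass
class TurnTokens:
    prompt: int
    completion: int
    estimated: bool  # True when the numbers are estimates, not provider data


def account_turn(usage: Optional[dict], messages: list[dict], reply_text: str) -> TurnTokens:
    """Use provider usage when present; otherwise estimate and warn."""
    if usage:
        return TurnTokens(
            prompt=usage.get("prompt_tokens", 0),
            completion=usage.get("completion_tokens", 0),
            estimated=False,
        )
    # The else branch: without it, the turn would be recorded as 0/0 tokens.
    logger.warning("provider returned no usage data; falling back to rough estimates")
    return TurnTokens(
        prompt=rough_message_tokens(messages),
        completion=rough_tokens(reply_text),
        estimated=True,
    )
```

Carrying an `estimated` flag alongside the counts is one way to keep approximate numbers distinguishable from provider-reported ones further downstream; whether the PR does this is not stated.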

Fix 2: Subtract reasoning tokens from compression trigger

Closes #12026

Problem: The compression trigger fed raw completion_tokens (including internal reasoning tokens from thinking models) to the context compressor. For reasoning models (GLM-5.1, QwQ, DeepSeek-R1), completion_tokens includes hidden chain-of-thought reasoning that is NOT re-sent on subsequent turns and does NOT consume context window space. This caused premature compression: e.g. 85K prompt + 20K completion (15K reasoning) = 105K, exceeding the 101K threshold despite only 42% actual context usage.

Teknium's review feedback (addressed): The original PR dropped completion_tokens entirely from the formula. Teknium correctly noted that Hermes re-sends reasoning_content on subsequent turns (lines 9140-9146), so completion tokens should contribute when reasoning IS captured and replayed. However, session data analysis showed GLM-5.1 only captures reasoning in 3-15% of assistant messages (8 out of 99 in TD Promo #2), with total stored reasoning of ~1300 tokens — yet completion_tokens included 15-20K of phantom reasoning.

Improved fix: Instead of dropping all completion tokens, we now use the API-provided completion_tokens_details.reasoning_tokens (already extracted by normalize_usage() into canonical_usage.reasoning_tokens) to subtract only the reasoning portion before feeding the compressor:

_reasoning_toks = canonical_usage.reasoning_tokens
_content_completion = max(0, completion_tokens - _reasoning_toks)
# Feed _content_completion to compressor instead of raw completion_tokens

This is surgically correct:

Uses data the API already provides (OpenAI, OpenRouter, DeepSeek all return completion_tokens_details.reasoning_tokens)
Only subtracts tokens known to be internal reasoning
When reasoning_tokens is 0 (non-thinking models), the formula is identical to the old behavior
When reasoning IS captured and re-sent, it appears in prompt_tokens on the next call, self-correcting
Session-level billing counters still use the full completion_tokens (no impact on cost tracking)
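
To make the arithmetic concrete, here is a small sketch using the numbers from the problem statement (85K prompt, 20K completion, 15K reasoning, 101K threshold). It assumes a plain-dict usage payload with the OpenAI-style `completion_tokens_details.reasoning_tokens` field; the helper names `content_completion_tokens` and `should_compress`, and the `>=` comparison against the threshold, are illustrative assumptions rather than the compressor's real interface.

```python
def content_completion_tokens(usage: dict) -> int:
    """Completion tokens minus the API-reported reasoning tokens, floored at 0."""
    completion = usage.get("completion_tokens") or 0
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens") or 0
    return max(0, completion - reasoning)


def should_compress(usage: dict, threshold: int) -> bool:
    """Hypothetical trigger: prompt plus non-reasoning completion vs. threshold."""
    prompt = usage.get("prompt_tokens") or 0
    return prompt + content_completion_tokens(usage) >= threshold


THRESHOLD = 101_000

# Thinking model: 15K of the 20K completion tokens are hidden reasoning.
thinking = {
    "prompt_tokens": 85_000,
    "completion_tokens": 20_000,
    "completion_tokens_details": {"reasoning_tokens": 15_000},
}

# Non-thinking model: no reasoning breakdown, so nothing is subtracted.
plain = {"prompt_tokens": 85_000, "completion_tokens": 20_000}

print(content_completion_tokens(thinking))   # 5000
print(should_compress(thinking, THRESHOLD))  # False: 85K + 5K = 90K
print(should_compress(plain, THRESHOLD))     # True: 85K + 20K = 105K
```

With the subtraction, the thinking-model turn feeds 90K to the trigger instead of 105K, while the payload without a reasoning breakdown behaves exactly as before.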

Research: OpenCode (opencode-ai/opencode) has the identical bug — CompletionTokens + PromptTokens for auto-compact at 95% threshold, never subtracts reasoning_tokens, and doesn't replay reasoning on subsequent turns (tui.go:335-341, agent.go:506).

Production evidence: 6 consecutive compression-triggered session splits in a single GLM-5.1 workflow:

TD Promo #2 through #6 all ended with end_reason=compression
Each split destroyed conversation continuity and wasted tokens replaying compressed context (cumulative 9.5M input tokens)
Session analysis: only 3-15% of assistant messages had reasoning stored, zero had <think> blocks in content

Files changed

File	Change
`run_agent.py`	+60/-1 — reasoning token subtraction for compressor + token fallback else branch
`tests/run_agent/test_token_accounting_fallback.py`	+175 (new) — 6 regression tests

Test results

All 28 compression-related tests pass
All 6 usage_pricing tests pass
E2E verified: reasoning subtraction correctly prevents compression at 42% prompt usage with 20K reasoning tokens
E2E verified: non-thinking models (reasoning_tokens=0) see zero behavior change
E2E verified: token estimation fallback produces non-zero values when usage=None

teknium1 · 2026-04-18T21:45:42Z

Thanks for digging into this, and for the detailed production evidence.

Fix 1 (token fallback) looks good — confirmed the if response.usage: block on current main has no else, and a rough-estimate fallback is the right pragmatic response when a provider silently ignores stream_options.include_usage. Worth adding a regression test in tests/run_agent/test_413_compression.py (or nearby) that feeds a usage=None response and asserts session counters get non-zero deltas and the compressor sees non-zero last_prompt_tokens.

Fix 2 (drop completion_tokens from the compression trigger) I'd like to push back on before we take it. The premise in the PR body is that reasoning tokens are "ephemeral output — they don't consume context window space for the next API call." But hermes actually does the opposite by design: at run_agent.py ~9094-9100, we explicitly re-send reasoning back on every subsequent turn:

# For ALL assistant messages, pass reasoning back to the API
# This ensures multi-turn reasoning context is preserved
if msg.get("role") == "assistant":
    reasoning_text = msg.get("reasoning")
    if reasoning_text:
        api_msg["reasoning_content"] = reasoning_text

reasoning_details is also preserved intentionally (the comment at ~9117 spells out OpenRouter multi-turn reasoning continuity), and Codex keeps codex_reasoning_items for the same reason. So in our design, completion_tokens from turn N should show up inside prompt_tokens on turn N+1, and last_prompt + last_completion is a tight lower bound for the next prompt — not an overestimate. Compression firing at 85K + 20K = 105K isn't premature under that model, it's a one-turn-early preflight of what the next call will actually report.

The "42% actual context usage" reading only holds if, on your GLM-5.1 / OpenRouter setup, one of these is true:

1. OpenRouter / GLM-5.1 accepts reasoning_content + reasoning_details on input but doesn't bill them as input tokens (provider-side).
2. Our extraction misses GLM's reasoning format so msg["reasoning"] is empty and nothing goes back.
3. Something is stripping think-tag content between turns before the next request.

Any of those is a real bug worth fixing — but the fix lives in _extract_reasoning / _build_assistant_message / the outgoing api_msg construction, not in the compression formula. The compression formula only becomes wrong if we've already lost the reasoning upstream.

A quick empirical check from your existing session logs: in your 6 compressed GLM-5.1 sessions, does last_prompt_tokens (or the provider-reported prompt_tokens on each API call) grow turn-over-turn by roughly last_completion_tokens? If yes, reasoning is being preserved and compression at 105K is actually correct. If it stays flat across turns while completion keeps producing 15K+ reasoning, that's the bug and it's upstream of the compression trigger.
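
One way to run that check over per-call usage numbers is sketched below. It assumes the (prompt_tokens, completion_tokens) pairs have already been pulled from a session DB in call order; `replay_ratio` is a hypothetical helper and the sample numbers are made up for illustration, not taken from the sessions discussed here.

```python
def replay_ratio(turns: list[tuple[int, int]]) -> float:
    """Share of each turn's completion that reappears as prompt growth.

    turns: (prompt_tokens, completion_tokens) per API call, in order.
    A ratio near or above 1.0 suggests completions (including reasoning)
    are replayed on the next call; a ratio far below 1.0 suggests most of
    the completion never comes back as input.
    """
    growth = 0
    produced = 0
    for (prev_prompt, prev_completion), (next_prompt, _) in zip(turns, turns[1:]):
        growth += max(0, next_prompt - prev_prompt)
        produced += prev_completion
    return growth / produced if produced else 0.0


# Made-up example: the prompt grows by only ~3K per turn while each
# completion is 16K to 18K, so most completion tokens are not replayed.
mostly_flat = [(50_000, 18_000), (53_000, 17_000), (56_000, 16_000)]

# Made-up example: the prompt grows by roughly the previous completion.
replayed = [(50_000, 18_000), (68_500, 17_000), (86_000, 16_000)]

print(round(replay_ratio(mostly_flat), 2))  # 0.17
print(round(replay_ratio(replayed), 2))     # 1.03
```

Prompt growth also includes new user and tool messages, so the ratio is only a rough signal: it can exceed 1.0 when reasoning is replayed, and a low value is the more informative result.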

Happy to help dig into that side of it if you want to pull a few turns of usage numbers from one of those session DBs. For now I'd suggest splitting this PR into the token-fallback fix alone (plus a test) and parking the compression change pending the investigation above.

Fix 1 (token estimation fallback, closes #12023): When providers like MiniMax via OpenRouter silently ignore stream_options.include_usage, response.usage is None and the token accounting block is skipped entirely. Added an else branch that falls back to estimate_messages_tokens_rough() / estimate_tokens_rough() so sessions don't permanently record 0/0 tokens.

Fix 2 (subtract reasoning tokens from compression trigger, closes #12026): The compression trigger fed raw completion_tokens (including internal reasoning tokens) to the context compressor. For thinking models (GLM-5.1, QwQ, DeepSeek-R1), completion_tokens includes reasoning that is NOT re-sent on subsequent turns and doesn't consume context window space. Now subtracts canonical_usage.reasoning_tokens (from completion_tokens_details.reasoning_tokens) before feeding the compressor, so only content tokens count toward the threshold.

This addresses Teknium's review feedback on #12028: rather than dropping all completion_tokens (which would be wrong when reasoning IS re-sent), we use the API-provided reasoning_tokens breakdown to subtract only the phantom tokens. Non-thinking models (reasoning_tokens=0) see zero behavior change.

Production evidence: 6 consecutive GLM-5.1 sessions ended with premature compression (TD Promo #2-#6, April 17 2026). Only 3-15% of assistant messages had reasoning captured; total stored reasoning was ~150-2500 tokens per session, yet completion_tokens included 15-20K of hidden reasoning that inflated the trigger past the 101K threshold.

Research: OpenCode has the identical bug (tui.go:335-341, completion + prompt without reasoning subtraction). The OpenAI/OpenRouter APIs provide completion_tokens_details.reasoning_tokens for exactly this purpose; Hermes already extracts it via normalize_usage() but never used it in compression.

Tests: 6 new regression tests covering reasoning subtraction, premature compression prevention, threshold still firing when truly full, zero reasoning passthrough, and fallback estimation.

kshitijk4poor · 2026-04-19T05:56:45Z

You were right to push back — dropping completion_tokens entirely was wrong. I dug into the actual session data and found the upstream issue you predicted.

Session analysis confirms your scenario #2: _extract_reasoning() mostly misses GLM-5.1's reasoning output. Across all 6 compression-death sessions:

Session	Assistant msgs	With reasoning	Total reasoning tokens
TD Promo #2	99	8 (8%)	~1,300
TD Promo #3	67	5 (7%)	~2,000
TD Promo #4	76	12 (15%)	~2,455
TD Promo #5	91	5 (5%)	~297
TD Promo #6	104	6 (5%)	~2,161

Zero messages had reasoning_details. Zero had <think> tags in content. GLM-5.1 via OpenRouter either returns reasoning in a format we don't capture, or the reasoning is provider-internal and never surfaces.

The improved fix uses the API-provided completion_tokens_details.reasoning_tokens (which we already extract in normalize_usage() into canonical_usage.reasoning_tokens) to subtract only the phantom reasoning portion before feeding the compressor:

_reasoning_toks = canonical_usage.reasoning_tokens
_content_completion = max(0, completion_tokens - _reasoning_toks)

This way:

Non-thinking models (reasoning_tokens=0) → zero behavior change
Thinking models with captured reasoning → reasoning shows up in prompt_tokens on the next call (self-correcting)
Thinking models with uncaptured reasoning → phantom tokens no longer inflate the trigger

I also confirmed OpenCode has the identical bug (same formula, no reasoning subtraction, same premature auto-compact).

The token fallback fix (Fix 1) is unchanged — added a regression test as you suggested. Force-pushed with both fixes + 6 new tests.

teknium1 · 2026-04-20T12:12:23Z

Closed in favor of PR #13006, which fixes the same issue. Your reasoning-subtraction approach was more precise but bundled an unrelated fix (#12023). Thanks @kshitijk4poor!

kshitijk4poor force-pushed the fix/minimax-glm-token-compression branch from efd4524 to f41fb32 on April 18, 2026 07:26

WwNeXst mentioned this pull request Apr 18, 2026

[Bug]: Context compression failure uses static placeholder instead of preserved message tail — context permanently lost #12131

Open

kshitijk4poor force-pushed the fix/minimax-glm-token-compression branch from f41fb32 to eba720f on April 19, 2026 05:55

teknium1 mentioned this pull request Apr 20, 2026

fix(compression): exclude completion tokens from compression trigger (#12026) #13006

Merged

teknium1 closed this Apr 20, 2026

kshitijk4poor wants to merge 1 commit into main from fix/minimax-glm-token-compression
