Skip to content

fix: token accounting fallback + compression trigger for reasoning models#12028

Closed
kshitijk4poor wants to merge 1 commit into
mainfrom
fix/minimax-glm-token-compression
Closed

fix: token accounting fallback + compression trigger for reasoning models#12028
kshitijk4poor wants to merge 1 commit into
mainfrom
fix/minimax-glm-token-compression

Conversation

@kshitijk4poor

@kshitijk4poor kshitijk4poor commented Apr 18, 2026

Copy link
Copy Markdown
Collaborator

Summary

Two fixes for MiniMax/GLM-5.1 provider issues discovered during live CLI sessions on April 17, 2026 where these models kept failing due to empty token accounting and premature compression deaths.

Fix 1: Token estimation fallback when streaming returns no usage data

Closes #12023

Problem: Providers like MiniMax via OpenRouter silently ignore stream_options: {"include_usage": true}. Every streaming chunk has usage=None. The token accounting block in run_agent.py is gated by if response.usage: with no else branch — when falsy, the entire block is skipped: no normalize_usage(), no session token accumulation, no update_token_counts() DB persistence, no cost estimation, no API call logging. Sessions permanently record 0/0 tokens.

Production evidence: Session 20260417_141814_9519aa — 85 messages, 35 tool calls, input_tokens=0, output_tokens=0.

Fix: Added an else branch that falls back to estimate_messages_tokens_rough() / estimate_tokens_rough() for approximate token accounting, updates the context compressor, persists to session DB, and logs a WARNING so the degradation is visible in logs.

Fix 2: Subtract reasoning tokens from compression trigger

Closes #12026

Problem: The compression trigger fed raw completion_tokens (including internal reasoning tokens from thinking models) to the context compressor. For reasoning models (GLM-5.1, QwQ, DeepSeek-R1), completion_tokens includes hidden chain-of-thought reasoning that is NOT re-sent on subsequent turns and does NOT consume context window space. This caused premature compression: e.g. 85K prompt + 20K completion (15K reasoning) = 105K, exceeding the 101K threshold despite only 42% actual context usage.

Teknium's review feedback (addressed): The original PR dropped completion_tokens entirely from the formula. Teknium correctly noted that Hermes re-sends reasoning_content on subsequent turns (lines 9140-9146), so completion tokens should contribute when reasoning IS captured and replayed. However, session data analysis showed GLM-5.1 only captures reasoning in 3-15% of assistant messages (8 out of 99 in TD Promo #2), with total stored reasoning of ~1300 tokens — yet completion_tokens included 15-20K of phantom reasoning.

Improved fix: Instead of dropping all completion tokens, we now use the API-provided completion_tokens_details.reasoning_tokens (already extracted by normalize_usage() into canonical_usage.reasoning_tokens) to subtract only the reasoning portion before feeding the compressor:

_reasoning_toks = canonical_usage.reasoning_tokens
_content_completion = max(0, completion_tokens - _reasoning_toks)
# Feed _content_completion to compressor instead of raw completion_tokens

This is surgically correct:

  • Uses data the API already provides (OpenAI, OpenRouter, DeepSeek all return completion_tokens_details.reasoning_tokens)
  • Only subtracts tokens known to be internal reasoning
  • When reasoning_tokens is 0 (non-thinking models), the formula is identical to the old behavior
  • When reasoning IS captured and re-sent, it appears in prompt_tokens on the next call, self-correcting
  • Session-level billing counters still use the full completion_tokens (no impact on cost tracking)

Research: OpenCode (opencode-ai/opencode) has the identical bug — CompletionTokens + PromptTokens for auto-compact at 95% threshold, never subtracts reasoning_tokens, and doesn't replay reasoning on subsequent turns (tui.go:335-341, agent.go:506).

Production evidence: 6 consecutive compression-triggered session splits in a single GLM-5.1 workflow:

Files changed

File Change
run_agent.py +60/-1 — reasoning token subtraction for compressor + token fallback else branch
tests/run_agent/test_token_accounting_fallback.py +175 (new) — 6 regression tests

Test results

  • All 28 compression-related tests pass
  • All 6 usage_pricing tests pass
  • E2E verified: reasoning subtraction correctly prevents compression at 42% prompt usage with 20K reasoning tokens
  • E2E verified: non-thinking models (reasoning_tokens=0) see zero behavior change
  • E2E verified: token estimation fallback produces non-zero values when usage=None

@teknium1

Copy link
Copy Markdown
Contributor

Thanks for digging into this, and for the detailed production evidence.

Fix 1 (token fallback) looks good — confirmed the if response.usage: block on current main has no else, and a rough-estimate fallback is the right pragmatic response when a provider silently ignores stream_options.include_usage. Worth adding a regression test in tests/run_agent/test_413_compression.py (or nearby) that feeds a usage=None response and asserts session counters get non-zero deltas and the compressor sees non-zero last_prompt_tokens.

Fix 2 (drop completion_tokens from the compression trigger) I'd like to push back on before we take it. The premise in the PR body is that reasoning tokens are "ephemeral output — they don't consume context window space for the next API call." But hermes actually does the opposite by design: at run_agent.py ~9094-9100, we explicitly re-send reasoning back on every subsequent turn:

# For ALL assistant messages, pass reasoning back to the API
# This ensures multi-turn reasoning context is preserved
if msg.get("role") == "assistant":
    reasoning_text = msg.get("reasoning")
    if reasoning_text:
        api_msg["reasoning_content"] = reasoning_text

reasoning_details is also preserved intentionally (the comment at ~9117 spells out OpenRouter multi-turn reasoning continuity), and Codex keeps codex_reasoning_items for the same reason. So in our design, completion_tokens from turn N should show up inside prompt_tokens on turn N+1, and last_prompt + last_completion is a tight lower bound for the next prompt — not an overestimate. Compression firing at 85K + 20K = 105K isn't premature under that model, it's a one-turn-early preflight of what the next call will actually report.

The "42% actual context usage" reading only holds if, on your GLM-5.1 / OpenRouter setup, one of these is true:

  1. OpenRouter / GLM-5.1 accepts reasoning_content + reasoning_details on input but doesn't bill them as input tokens (provider-side).
  2. Our extraction misses GLM's reasoning format so msg["reasoning"] is empty and nothing goes back.
  3. Something is stripping think-tag content between turns before the next request.

Any of those is a real bug worth fixing — but the fix lives in _extract_reasoning / _build_assistant_message / the outgoing api_msg construction, not in the compression formula. The compression formula only becomes wrong if we've already lost the reasoning upstream.

A quick empirical check from your existing session logs: in your 6 compressed GLM-5.1 sessions, does last_prompt_tokens (or the provider-reported prompt_tokens on each API call) grow turn-over-turn by roughly last_completion_tokens? If yes, reasoning is being preserved and compression at 105K is actually correct. If it stays flat across turns while completion keeps producing 15K+ reasoning, that's the bug and it's upstream of the compression trigger.

Happy to help dig into that side of it if you want to pull a few turns of usage numbers from one of those session DBs. For now I'd suggest splitting this PR into the token-fallback fix alone (plus a test) and parking the compression change pending the investigation above.

Fix 1 — Token estimation fallback (closes #12023):
When providers like MiniMax via OpenRouter silently ignore
stream_options.include_usage, response.usage is None and the token
accounting block is skipped entirely.  Added an else branch that falls
back to estimate_messages_tokens_rough() / estimate_tokens_rough() so
sessions don't permanently record 0/0 tokens.

Fix 2 — Subtract reasoning tokens from compression trigger (closes #12026):
The compression trigger fed raw completion_tokens (including internal
reasoning tokens) to the context compressor.  For thinking models
(GLM-5.1, QwQ, DeepSeek-R1), completion_tokens includes reasoning
that is NOT re-sent on subsequent turns and doesn't consume context
window space.  Now subtracts canonical_usage.reasoning_tokens (from
completion_tokens_details.reasoning_tokens) before feeding the
compressor, so only content tokens count toward the threshold.

This addresses Teknium's review feedback on #12028: rather than
dropping all completion_tokens (which would be wrong when reasoning IS
re-sent), we use the API-provided reasoning_tokens breakdown to
subtract only the phantom tokens.  Non-thinking models (reasoning_tokens=0)
see zero behavior change.

Production evidence: 6 consecutive GLM-5.1 sessions ended with
premature compression (TD Promo #2-#6, April 17 2026).  Only 3-15% of
assistant messages had reasoning captured; total stored reasoning was
~150-2500 tokens per session — yet completion_tokens included 15-20K
of hidden reasoning that inflated the trigger past the 101K threshold.

Research: OpenCode has the identical bug (tui.go:335-341, completion +
prompt without reasoning subtraction).  The OpenAI/OpenRouter APIs
provide completion_tokens_details.reasoning_tokens for exactly this
purpose; Hermes already extracts it via normalize_usage() but never
used it in compression.

Tests: 6 new regression tests covering reasoning subtraction, premature
compression prevention, threshold still firing when truly full, zero
reasoning passthrough, and fallback estimation.
@kshitijk4poor kshitijk4poor force-pushed the fix/minimax-glm-token-compression branch from f41fb32 to eba720f Compare April 19, 2026 05:55
@kshitijk4poor

Copy link
Copy Markdown
Collaborator Author

You were right to push back — dropping completion_tokens entirely was wrong. I dug into the actual session data and found the upstream issue you predicted.

Session analysis confirms your scenario #2: _extract_reasoning() mostly misses GLM-5.1's reasoning output. Across all 6 compression-death sessions:

Session Assistant msgs With reasoning Total reasoning tokens
TD Promo #2 99 8 (8%) ~1,300
TD Promo #3 67 5 (7%) ~2,000
TD Promo #4 76 12 (15%) ~2,455
TD Promo #5 91 5 (5%) ~297
TD Promo #6 104 6 (5%) ~2,161

Zero messages had reasoning_details. Zero had <think> tags in content. GLM-5.1 via OpenRouter either returns reasoning in a format we don't capture, or the reasoning is provider-internal and never surfaces.

The improved fix uses the API-provided completion_tokens_details.reasoning_tokens (which we already extract in normalize_usage() into canonical_usage.reasoning_tokens) to subtract only the phantom reasoning portion before feeding the compressor:

_reasoning_toks = canonical_usage.reasoning_tokens
_content_completion = max(0, completion_tokens - _reasoning_toks)

This way:

  • Non-thinking models (reasoning_tokens=0) → zero behavior change
  • Thinking models with captured reasoning → reasoning shows up in prompt_tokens on the next call (self-correcting)
  • Thinking models with uncaptured reasoning → phantom tokens no longer inflate the trigger

I also confirmed OpenCode has the identical bug (same formula, no reasoning subtraction, same premature auto-compact).

The token fallback fix (Fix 1) is unchanged — added a regression test as you suggested. Force-pushed with both fixes + 6 new tests.

@teknium1

Copy link
Copy Markdown
Contributor

Closed in favor of PR #13006 #13006 which fixes the same issue. Your reasoning-subtraction approach was more precise but bundled an unrelated fix (#12023). Thanks @kshitijk4poor!

@teknium1 teknium1 closed this Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants