fix: token accounting fallback + compression trigger for reasoning models#12028
fix: token accounting fallback + compression trigger for reasoning models#12028kshitijk4poor wants to merge 1 commit into
Conversation
efd4524 to
f41fb32
Compare
|
Thanks for digging into this, and for the detailed production evidence. Fix 1 (token fallback) looks good — confirmed the Fix 2 (drop completion_tokens from the compression trigger) I'd like to push back on before we take it. The premise in the PR body is that reasoning tokens are "ephemeral output — they don't consume context window space for the next API call." But hermes actually does the opposite by design: at # For ALL assistant messages, pass reasoning back to the API
# This ensures multi-turn reasoning context is preserved
if msg.get("role") == "assistant":
reasoning_text = msg.get("reasoning")
if reasoning_text:
api_msg["reasoning_content"] = reasoning_text
The "42% actual context usage" reading only holds if, on your GLM-5.1 / OpenRouter setup, one of these is true:
Any of those is a real bug worth fixing — but the fix lives in A quick empirical check from your existing session logs: in your 6 compressed GLM-5.1 sessions, does Happy to help dig into that side of it if you want to pull a few turns of usage numbers from one of those session DBs. For now I'd suggest splitting this PR into the token-fallback fix alone (plus a test) and parking the compression change pending the investigation above. |
Fix 1 — Token estimation fallback (closes #12023): When providers like MiniMax via OpenRouter silently ignore stream_options.include_usage, response.usage is None and the token accounting block is skipped entirely. Added an else branch that falls back to estimate_messages_tokens_rough() / estimate_tokens_rough() so sessions don't permanently record 0/0 tokens. Fix 2 — Subtract reasoning tokens from compression trigger (closes #12026): The compression trigger fed raw completion_tokens (including internal reasoning tokens) to the context compressor. For thinking models (GLM-5.1, QwQ, DeepSeek-R1), completion_tokens includes reasoning that is NOT re-sent on subsequent turns and doesn't consume context window space. Now subtracts canonical_usage.reasoning_tokens (from completion_tokens_details.reasoning_tokens) before feeding the compressor, so only content tokens count toward the threshold. This addresses Teknium's review feedback on #12028: rather than dropping all completion_tokens (which would be wrong when reasoning IS re-sent), we use the API-provided reasoning_tokens breakdown to subtract only the phantom tokens. Non-thinking models (reasoning_tokens=0) see zero behavior change. Production evidence: 6 consecutive GLM-5.1 sessions ended with premature compression (TD Promo #2-#6, April 17 2026). Only 3-15% of assistant messages had reasoning captured; total stored reasoning was ~150-2500 tokens per session — yet completion_tokens included 15-20K of hidden reasoning that inflated the trigger past the 101K threshold. Research: OpenCode has the identical bug (tui.go:335-341, completion + prompt without reasoning subtraction). The OpenAI/OpenRouter APIs provide completion_tokens_details.reasoning_tokens for exactly this purpose; Hermes already extracts it via normalize_usage() but never used it in compression. Tests: 6 new regression tests covering reasoning subtraction, premature compression prevention, threshold still firing when truly full, zero reasoning passthrough, and fallback estimation.
f41fb32 to
eba720f
Compare
|
You were right to push back — dropping completion_tokens entirely was wrong. I dug into the actual session data and found the upstream issue you predicted. Session analysis confirms your scenario #2:
Zero messages had The improved fix uses the API-provided _reasoning_toks = canonical_usage.reasoning_tokens
_content_completion = max(0, completion_tokens - _reasoning_toks)This way:
I also confirmed OpenCode has the identical bug (same formula, no reasoning subtraction, same premature auto-compact). The token fallback fix (Fix 1) is unchanged — added a regression test as you suggested. Force-pushed with both fixes + 6 new tests. |
|
Closed in favor of PR #13006 #13006 which fixes the same issue. Your reasoning-subtraction approach was more precise but bundled an unrelated fix (#12023). Thanks @kshitijk4poor! |
Summary
Two fixes for MiniMax/GLM-5.1 provider issues discovered during live CLI sessions on April 17, 2026 where these models kept failing due to empty token accounting and premature compression deaths.
Fix 1: Token estimation fallback when streaming returns no usage data
Closes #12023
Problem: Providers like MiniMax via OpenRouter silently ignore
stream_options: {"include_usage": true}. Every streaming chunk hasusage=None. The token accounting block inrun_agent.pyis gated byif response.usage:with no else branch — when falsy, the entire block is skipped: nonormalize_usage(), no session token accumulation, noupdate_token_counts()DB persistence, no cost estimation, no API call logging. Sessions permanently record 0/0 tokens.Production evidence: Session
20260417_141814_9519aa— 85 messages, 35 tool calls, input_tokens=0, output_tokens=0.Fix: Added an
elsebranch that falls back toestimate_messages_tokens_rough()/estimate_tokens_rough()for approximate token accounting, updates the context compressor, persists to session DB, and logs aWARNINGso the degradation is visible in logs.Fix 2: Subtract reasoning tokens from compression trigger
Closes #12026
Problem: The compression trigger fed raw
completion_tokens(including internal reasoning tokens from thinking models) to the context compressor. For reasoning models (GLM-5.1, QwQ, DeepSeek-R1),completion_tokensincludes hidden chain-of-thought reasoning that is NOT re-sent on subsequent turns and does NOT consume context window space. This caused premature compression: e.g. 85K prompt + 20K completion (15K reasoning) = 105K, exceeding the 101K threshold despite only 42% actual context usage.Teknium's review feedback (addressed): The original PR dropped completion_tokens entirely from the formula. Teknium correctly noted that Hermes re-sends reasoning_content on subsequent turns (lines 9140-9146), so completion tokens should contribute when reasoning IS captured and replayed. However, session data analysis showed GLM-5.1 only captures reasoning in 3-15% of assistant messages (8 out of 99 in TD Promo #2), with total stored reasoning of ~1300 tokens — yet completion_tokens included 15-20K of phantom reasoning.
Improved fix: Instead of dropping all completion tokens, we now use the API-provided
completion_tokens_details.reasoning_tokens(already extracted bynormalize_usage()intocanonical_usage.reasoning_tokens) to subtract only the reasoning portion before feeding the compressor:This is surgically correct:
completion_tokens_details.reasoning_tokens)reasoning_tokensis 0 (non-thinking models), the formula is identical to the old behaviorprompt_tokenson the next call, self-correctingcompletion_tokens(no impact on cost tracking)Research: OpenCode (opencode-ai/opencode) has the identical bug —
CompletionTokens + PromptTokensfor auto-compact at 95% threshold, never subtracts reasoning_tokens, and doesn't replay reasoning on subsequent turns (tui.go:335-341, agent.go:506).Production evidence: 6 consecutive compression-triggered session splits in a single GLM-5.1 workflow:
end_reason=compression<think>blocks in contentFiles changed
run_agent.pytests/run_agent/test_token_accounting_fallback.pyTest results
usage=None