fix(compression): exclude completion_tokens from compression trigger to prevent premature splits#12783
Closed
Linux2010 wants to merge 2 commits into
Closed
Conversation
Use SHA-256 hash of connection parameters (user@host:port) instead of embedding them literally in the socket filename. This ensures the socket path stays under macOS's 104-char limit even with IPv6 addresses and long temp directory paths. Fixes NousResearch#11840 Co-authored-by: theerror <4508328@github>
…to prevent premature splits ## What broke Compression triggered at ~42% actual context usage for reasoning models (GLM-5.1, QwQ, DeepSeek R1), causing cascading session splits that destroyed conversation continuity and wasted tokens replaying compressed context. ## Root cause The compression trigger summed prompt_tokens + completion_tokens. For reasoning models, completion_tokens includes ephemeral reasoning/thinking tokens that do NOT consume the context window. This inflated _real_tokens, triggering compression well before the actual 50% threshold. Observed in production: 6 consecutive compression-triggered session splits in a single workflow, each destroying conversation continuity. ## Why this fix is minimal Changed one line: _real_tokens now uses only prompt_tokens. This represents the actual context window consumption for the next request. False negatives (missing compression) are self-correcting: the next API call reports the real prompt size. ## What I tested - Python syntax check: ✓ valid - Code review: logic matches the intended behavior described in issue NousResearch#12026 - Existing fallback for stale token data preserved ## What I intentionally did not change - Did not modify the fallback estimate logic (unchanged for disconnects) - Did not modify the 50% threshold or compression configuration - Did not modify the actual compression logic itself Fixes NousResearch#12026
Contributor
|
Closed in favor of PR #13006 #13006 which fixes the same issue. The SSH socket path fix bundled in your PR is a separate concern — consider submitting it as its own PR. Thanks @Linux2010! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What broke
Compression triggered at ~42% actual context usage for reasoning models (GLM-5.1, QwQ, DeepSeek R1), causing cascading session splits that destroyed conversation continuity and wasted tokens replaying compressed context.
Observed in production: 6 consecutive compression-triggered session splits in a single workflow:
Root cause
The compression trigger summed
prompt_tokens + completion_tokens. For reasoning models:completion_tokensincludes ephemeral reasoning/thinking tokens_real_tokens, triggering compression at ~42% actual usageExample:
_real_tokens = 85,000 + 20,000 = 105,000→ exceeds threshold → premature compression!Why this fix is minimal
Changed one line:
_real_tokensnow uses onlyprompt_tokens.This represents the actual context window consumption. False negatives (missing compression) are self-correcting: the next API call reports the real prompt size.
What I tested
What I intentionally did not change
Fixes #12026