fix(compression): exclude completion_tokens from compression trigger to prevent premature splits #12783

Closed
Linux2010 wants to merge 2 commits into NousResearch:main from Linux2010:fix-issue-12026-premature-compression

Conversation

@Linux2010

Contributor

What broke

Compression triggered at ~42% actual context usage for reasoning models (GLM-5.1, QwQ, DeepSeek R1), causing cascading session splits that destroyed conversation continuity and wasted tokens replaying compressed context.

Observed in production: 6 consecutive compression-triggered session splits in a single workflow:

| Session | Messages | Tools | Input Tokens | End Reason |
|---|---|---|---|---|
| TD Promo | 510 | 223 | 9,588,636 | (none) |
| TD Promo #2 | 200 | 96 | 1,130,897 | compression |
| TD Promo #3 | 137 | 64 | 752,245 | compression |
| TD Promo #4 | 157 | 76 | 1,286,818 | compression |
| TD Promo #5 | 189 | 92 | 565,148 | compression |
| TD Promo #6 | 161 | 77 | 582,556 | compression |

Root cause

The compression trigger summed prompt_tokens + completion_tokens. For reasoning models:

  • completion_tokens includes ephemeral reasoning/thinking tokens
  • These tokens do NOT consume the context window
  • Adding them inflated _real_tokens, triggering compression at ~42% actual usage

Example:

  • Actual prompt: 85,000 tokens (42% of 202K GLM-5.1 context)
  • Completion: 20,000 tokens (15K reasoning + 5K visible output)
  • _real_tokens = 85,000 + 20,000 = 105,000 → exceeds threshold → premature compression!
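The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: it assumes "202K" means exactly 202,000 tokens and uses the 50% threshold mentioned later in this description; the variable names are illustrative and not taken from the codebase.

```python
# Worked version of the example above.
# Assumptions: "202K" is taken as exactly 202,000 tokens, and the
# compression threshold is 50% of the context window.
CONTEXT_WINDOW = 202_000
THRESHOLD_RATIO = 0.50

prompt_tokens = 85_000
completion_tokens = 20_000  # 15K reasoning + 5K visible output

threshold = int(CONTEXT_WINDOW * THRESHOLD_RATIO)

# Old behaviour: reasoning tokens are counted against the window.
old_real_tokens = prompt_tokens + completion_tokens
# New behaviour: only the prompt is counted.
new_real_tokens = prompt_tokens

old_triggers = old_real_tokens >= threshold
new_triggers = new_real_tokens >= threshold
actual_usage_pct = round(100 * prompt_tokens / CONTEXT_WINDOW)

print(f"threshold={threshold}, old={old_real_tokens} ({old_triggers}), "
      f"new={new_real_tokens} ({new_triggers}), usage={actual_usage_pct}%")
```

Under these assumptions the old sum crosses the threshold while the prompt alone does not, which matches the ~42% figure reported above.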

Why this fix is minimal

Changed one line: _real_tokens now uses only prompt_tokens.

This represents the actual context window consumption. False negatives (missing compression) are self-correcting: the next API call reports the real prompt size.
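The self-correcting behaviour can be illustrated with a small sketch. The `Usage` class, the `should_compress` helper, and the numbers for the second call are hypothetical and exist only for illustration; the real trigger lives in the project's compression code and may be structured differently. The sketch also assumes that reasoning tokens are not carried forward into the next prompt, which depends on the provider and client.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical stand-in for the usage block returned by an API call."""
    prompt_tokens: int
    completion_tokens: int


def should_compress(real_tokens: int, context_window: int,
                    threshold_ratio: float = 0.5) -> bool:
    """Return True once the tracked token count reaches the threshold."""
    return real_tokens >= context_window * threshold_ratio


CONTEXT_WINDOW = 202_000  # assumed exact value for "202K"

# First call: the prompt is below the threshold, so no compression yet.
first = Usage(prompt_tokens=85_000, completion_tokens=20_000)
first_decision = should_compress(first.prompt_tokens, CONTEXT_WINDOW)

# Second call (illustrative numbers): the prompt now contains the earlier
# history, the 5K visible output, and 12K of new input. The API reports
# the true prompt size, so the trigger catches up on its own.
second = Usage(prompt_tokens=85_000 + 5_000 + 12_000, completion_tokens=3_000)
second_decision = should_compress(second.prompt_tokens, CONTEXT_WINDOW)

print(first_decision, second_decision)
```

The point of the sketch is that any under-count lasts at most one turn, because each response reports the prompt size that was actually sent.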

What I tested

  • Python syntax check: ✓ valid
  • Code review: logic matches intended behavior
  • Existing fallback for stale token data preserved

What I intentionally did not change

  • Fallback estimate logic (unchanged for disconnects)
  • 50% threshold or compression configuration
  • Actual compression logic itself

Fixes #12026

Linux2010 and others added 2 commits April 19, 2026 22:35
Use SHA-256 hash of connection parameters (user@host:port) instead of
embedding them literally in the socket filename. This ensures the
socket path stays under macOS's 104-char limit even with IPv6 addresses
and long temp directory paths.

Fixes NousResearch#11840

Co-authored-by: theerror <4508328@github>
…to prevent premature splits

## What broke
Compression triggered at ~42% actual context usage for reasoning models
(GLM-5.1, QwQ, DeepSeek R1), causing cascading session splits that destroyed
conversation continuity and wasted tokens replaying compressed context.

## Root cause
The compression trigger summed prompt_tokens + completion_tokens. For reasoning
models, completion_tokens includes ephemeral reasoning/thinking tokens that do
NOT consume the context window. This inflated _real_tokens, triggering
compression well before the actual 50% threshold.

Observed in production: 6 consecutive compression-triggered session splits in
a single workflow, each destroying conversation continuity.

## Why this fix is minimal
Changed one line: _real_tokens now uses only prompt_tokens.
This represents the actual context window consumption for the next request.
False negatives (missing compression) are self-correcting: the next API call
reports the real prompt size.

## What I tested
- Python syntax check: ✓ valid
- Code review: logic matches the intended behavior described in issue NousResearch#12026
- Existing fallback for stale token data preserved

## What I intentionally did not change
- Did not modify the fallback estimate logic (unchanged for disconnects)
- Did not modify the 50% threshold or compression configuration
- Did not modify the actual compression logic itself

Fixes NousResearch#12026
@teknium1

Contributor

Closed in favor of PR #13006, which fixes the same issue. The SSH socket path fix bundled in your PR is a separate concern; consider submitting it as its own PR. Thanks @Linux2010!

@teknium1 teknium1 closed this Apr 20, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Compression trigger includes reasoning tokens, causing premature session splits for thinking models (GLM-5.1, QwQ, etc.)

2 participants