Bug Description
Context auto-compression never triggers when the model's context_length equals MINIMUM_CONTEXT_LENGTH (64000 tokens). This causes conversations to grow until they hit the model's context limit and get forcefully degraded, instead of being automatically compressed at the configured threshold.
Root Cause
In agent/context_compressor.py, the threshold_tokens calculation uses MINIMUM_CONTEXT_LENGTH (64000) as an absolute floor:
# Lines 356-359 (__init__) and 316-319 (update_model)
self.threshold_tokens = max(
int(self.context_length * threshold_percent), # e.g., int(64000 * 0.7) = 44800
MINIMUM_CONTEXT_LENGTH, # 64000
)
When context_length == 64000 (e.g., a local model with 192K context split across 3 parallel slots = 64K per slot), the floor value dominates:
max(44800, 64000) = 64000 → threshold = 100% of context window
should_compress() checks prompt_tokens >= threshold_tokens, but the API errors out before prompt_tokens can reach 64000
- Compression never fires, regardless of the configured threshold percentage
This affects any configuration where context_length <= MINIMUM_CONTEXT_LENGTH / threshold_percent:
context_length=64000, threshold=0.7 → threshold_tokens = 64000 (100%) ❌
context_length=64000, threshold=0.5 → threshold_tokens = 64000 (100%) ❌
context_length=64000, threshold=0.85 → threshold_tokens = 64000 (100%) ❌
context_length=80000, threshold=0.7 → threshold_tokens = 64000 (80%) — works but threshold is higher than configured
context_length=128000, threshold=0.7 → threshold_tokens = 89600 (70%) ✅ correct
Reproduction
- Configure a local model with
context_length: 64000 in config.yaml
- Set
compression.threshold: 0.7
- Start a long conversation and observe that context grows past 70% without triggering compression
- Conversation eventually hits the context limit and gets forcefully degraded
Config:
model:
context_length: 64000
compression:
enabled: true
threshold: 0.7
Fix
Add a safety check after the max() calculation — if the floor value pushes the threshold to 100% or beyond, fall back to the percentage-based value:
self.threshold_tokens = max(
int(self.context_length * threshold_percent),
MINIMUM_CONTEXT_LENGTH,
)
if self.threshold_tokens >= self.context_length:
self.threshold_tokens = int(self.context_length * threshold_percent)
This preserves the original intent of the floor (preventing premature compression on large-context models) while ensuring compression can actually trigger when context_length is at or near the minimum.
Related Design Issues (not bugs, but worth noting)
-
Anti-thrashing has no auto-recovery: _ineffective_compression_count >= 2 causes should_compress() to permanently return False until /new resets the session. No decay or timeout mechanism exists.
-
Post-compression token estimate excludes tools schema: After compression, last_prompt_tokens is set to a rough estimate (len(str)//4) that does not include tools schema tokens (potentially 20-30K), causing the next compression cycle to trigger later than configured.
Environment
- Hermes Agent version: latest main (ce08916)
- Model: Qwen3.6-35B-A3B (local llama.cpp, 192K context / 3 parallel slots = 64K per slot)
- OS: Linux (ROCm)
Bug Description
Context auto-compression never triggers when the model's
context_lengthequalsMINIMUM_CONTEXT_LENGTH(64000 tokens). This causes conversations to grow until they hit the model's context limit and get forcefully degraded, instead of being automatically compressed at the configured threshold.Root Cause
In
agent/context_compressor.py, thethreshold_tokenscalculation usesMINIMUM_CONTEXT_LENGTH(64000) as an absolute floor:When
context_length == 64000(e.g., a local model with 192K context split across 3 parallel slots = 64K per slot), the floor value dominates:max(44800, 64000) = 64000→ threshold = 100% of context windowshould_compress()checksprompt_tokens >= threshold_tokens, but the API errors out beforeprompt_tokenscan reach 64000This affects any configuration where
context_length <= MINIMUM_CONTEXT_LENGTH / threshold_percent:context_length=64000, threshold=0.7→ threshold_tokens = 64000 (100%) ❌context_length=64000, threshold=0.5→ threshold_tokens = 64000 (100%) ❌context_length=64000, threshold=0.85→ threshold_tokens = 64000 (100%) ❌context_length=80000, threshold=0.7→ threshold_tokens = 64000 (80%) — works but threshold is higher than configuredcontext_length=128000, threshold=0.7→ threshold_tokens = 89600 (70%) ✅ correctReproduction
context_length: 64000inconfig.yamlcompression.threshold: 0.7Config:
Fix
Add a safety check after the
max()calculation — if the floor value pushes the threshold to 100% or beyond, fall back to the percentage-based value:This preserves the original intent of the floor (preventing premature compression on large-context models) while ensuring compression can actually trigger when
context_lengthis at or near the minimum.Related Design Issues (not bugs, but worth noting)
Anti-thrashing has no auto-recovery:
_ineffective_compression_count >= 2causesshould_compress()to permanently returnFalseuntil/newresets the session. No decay or timeout mechanism exists.Post-compression token estimate excludes tools schema: After compression,
last_prompt_tokensis set to a rough estimate (len(str)//4) that does not include tools schema tokens (potentially 20-30K), causing the next compression cycle to trigger later than configured.Environment