Bug
The gateway's pre-agent hygiene compressor uses get_model_context_length(model) without passing base_url, which causes it to miss the persistent context length cache and fall through to the 2M default for unrecognized model names (e.g. qwen3.5:27b-q4_K_M, any Ollama model tag).
This sets the compression threshold at 85% of 2M = 1.7M tokens, making compression effectively unreachable. Sessions grow unbounded until they hit the actual context limit, at which point the model starts failing or degrading.
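The arithmetic can be checked with a minimal sketch. The 85% ratio and the 2M default are the figures reported in this issue; the function and constant names are hypothetical and do not come from the codebase.

```python
# Hypothetical names; only the 85% ratio and the 2M default come from this issue.
DEFAULT_CONTEXT_LENGTH = 2_000_000
HYGIENE_THRESHOLD_PERCENT = 85


def compression_threshold(context_length: int,
                          percent: int = HYGIENE_THRESHOLD_PERCENT) -> int:
    """Token count at which hygiene compression would trigger."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return context_length * percent // 100


fallback_threshold = compression_threshold(DEFAULT_CONTEXT_LENGTH)
configured_threshold = compression_threshold(49_152)

print(fallback_threshold)    # threshold when the 2M default is used
print(configured_threshold)  # threshold for a 49152-token window
```

With the fallback, the threshold sits far above the 49152-token window used in the reproduction, so the model's real limit is reached long before compression triggers.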
Root Cause
Three issues in context length resolution:
- context_length_cache.yaml format mismatch: _load_context_cache() expects data nested under a context_lengths: key, but the file can end up as a flat dict (possibly from an older writer or a manual edit). When this happens, every cache lookup comes back empty.
- Gateway doesn't pass base_url: gateway/run.py:1548 calls get_model_context_length(_hyg_model) with no base_url. The persistent cache keys have the form model@base_url, and the lookup sits behind an if base_url: guard in get_model_context_length, so without base_url the cache is skipped entirely.
- Config model.context_length is ignored: the gateway already reads config.yaml and resolves model.default, but never reads the user-configured model.context_length. This is the most reliable source for local models, where the user explicitly sets their context window.
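The first two failure modes can be reproduced in isolation with a simplified stand-in. This is a sketch under assumptions: strict_load and cached_length are hypothetical helpers that mimic the behavior described above, not the actual code in agent.model_metadata.

```python
from typing import Any, Dict, Optional


def strict_load(data: Any) -> Dict[str, int]:
    """Hypothetical reader that only accepts the nested layout."""
    if isinstance(data, dict) and isinstance(data.get("context_lengths"), dict):
        return dict(data["context_lengths"])
    return {}  # a flat file is treated as an empty cache


def cached_length(cache: Dict[str, int], model: str,
                  base_url: str = "") -> Optional[int]:
    """Look up model@base_url; with no base_url the cache is never consulted."""
    if base_url:
        return cache.get(f"{model}@{base_url}")
    return None


MODEL = "qwen3.5:27b-q4_K_M"
BASE_URL = "http://localhost:11434/v1"
nested_file = {"context_lengths": {f"{MODEL}@{BASE_URL}": 49152}}
flat_file = {f"{MODEL}@{BASE_URL}": 49152}

print(strict_load(flat_file))                                      # flat layout is lost
print(cached_length(strict_load(nested_file), MODEL))              # no base_url: skipped
print(cached_length(strict_load(nested_file), MODEL, BASE_URL))    # full key: found
```

Either failure alone is enough to send the caller to the default value.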
Impact
- Any user running local models (Ollama, llama.cpp, vLLM, etc.) with custom model tags will never get gateway hygiene compression
- Sessions can grow to hundreds of messages / 100K+ tokens without compression
- The agent's internal compressor (at 50% threshold) may still work if it gets the right context length, but the gateway safety net is broken
Reproduction
from agent.model_metadata import get_model_context_length
# Returns 2M instead of configured 49152
print(get_model_context_length("qwen3.5:27b-q4_K_M"))
# Returns correct value with base_url (if cache format is fixed)
print(get_model_context_length("qwen3.5:27b-q4_K_M", base_url="http://localhost:11434/v1"))
Fix
Patch gateway/run.py to:
- Read model.context_length from the already-loaded config.yaml
- Pass OPENAI_BASE_URL to get_model_context_length as fallback
- _hyg_context_length = get_model_context_length(_hyg_model)
+ _hyg_base_url = os.environ.get("OPENAI_BASE_URL", "")
+ _cfg_ctx = 0
+ try:
+     _m = _hyg_data.get("model", {})
+     if isinstance(_m, dict):
+         _cfg_ctx = _m.get("context_length", 0)
+ except NameError:
+     pass
+ _hyg_context_length = _cfg_ctx or get_model_context_length(_hyg_model, base_url=_hyg_base_url)
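The resolution order in the patch (configured value first, then the lookup with base_url) can also be written as a standalone function, which makes it easier to unit test. This is only a sketch: resolve_hygiene_context_length is a hypothetical name, the lookup parameter stands in for get_model_context_length, and the type check on the configured value is stricter than the `_cfg_ctx or ...` expression in the diff.

```python
from typing import Any, Callable, Mapping


def resolve_hygiene_context_length(
    config: Any,
    model: str,
    base_url: str,
    lookup: Callable[[str, str], int],
) -> int:
    """Prefer model.context_length from config, else defer to the lookup."""
    model_cfg = config.get("model") if isinstance(config, Mapping) else None
    configured = (
        model_cfg.get("context_length") if isinstance(model_cfg, Mapping) else None
    )
    # Accept only a positive integer; anything else falls through to the lookup.
    if type(configured) is int and configured > 0:
        return configured
    return lookup(model, base_url)


def fake_lookup(model: str, base_url: str) -> int:
    """Stand-in for get_model_context_length: 2M unless a base_url is given."""
    return 32768 if base_url else 2_000_000


cfg = {"model": {"default": "qwen3.5:27b-q4_K_M", "context_length": 49152}}
print(resolve_hygiene_context_length(cfg, "qwen3.5:27b-q4_K_M", "", fake_lookup))
print(resolve_hygiene_context_length({}, "qwen3.5:27b-q4_K_M",
                                     "http://localhost:11434/v1", fake_lookup))
```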
Also consider:
- Making _load_context_cache() handle both flat and nested formats
- Removing the if base_url: guard in get_model_context_length so bare model names can still match cache entries
- Querying Ollama's /api/show endpoint for actual context length when the provider is a local Ollama instance
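A rough sketch of the first and third suggestions follows. Both helpers are hypothetical. The Ollama part only parses an already-fetched /api/show response and assumes the payload carries a model_info object with an architecture-prefixed key such as llama.context_length; that shape should be verified against the Ollama version in use. That value may also differ from the num_ctx the server actually runs with.

```python
from typing import Any, Dict, Optional


def normalize_context_cache(data: Any) -> Dict[str, int]:
    """Accept both {'context_lengths': {...}} and a flat {key: length} dict."""
    if not isinstance(data, dict):
        return {}
    nested = data.get("context_lengths")
    entries = nested if isinstance(nested, dict) else data
    # Keep only positive integer lengths; drop anything malformed.
    return {str(k): v for k, v in entries.items() if type(v) is int and v > 0}


def context_length_from_show(payload: Any) -> Optional[int]:
    """Extract '<arch>.context_length' from a parsed /api/show response."""
    info = payload.get("model_info") if isinstance(payload, dict) else None
    if not isinstance(info, dict):
        return None
    for key, value in info.items():
        if (isinstance(key, str) and key.endswith(".context_length")
                and type(value) is int and value > 0):
            return value
    return None


# Illustrative payload only; the values are made up for this example.
sample_show = {
    "model_info": {
        "general.architecture": "llama",
        "llama.context_length": 8192,
    }
}
print(normalize_context_cache({"m@http://localhost:11434/v1": 49152}))
print(context_length_from_show(sample_show))
```

Keeping the parsing separate from the HTTP call means the gateway can fall back cleanly when the endpoint is unreachable or the field is missing.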