Gateway hygiene compression never triggers for local/Ollama models #2708

@amiller

Description

Bug

The gateway's pre-agent hygiene compressor uses get_model_context_length(model) without passing base_url, which causes it to miss the persistent context length cache and fall through to the 2M default for unrecognized model names (e.g. qwen3.5:27b-q4_K_M, any Ollama model tag).

This sets the compression threshold at 85% of 2M = 1.7M tokens, making compression effectively unreachable. Sessions grow unbounded until they hit the actual context limit, at which point the model starts failing or degrading.
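
To make the arithmetic concrete, here is a minimal sketch. The 85% ratio and the 2M fallback are the figures quoted above; reading "2M" as exactly 2,000,000 and the helper name hygiene_threshold are assumptions for illustration, not the gateway's actual code.

```python
HYGIENE_PERCENT = 85          # threshold percentage quoted in this issue
FALLBACK_CONTEXT = 2_000_000  # assumed exact value of the "2M" default


def hygiene_threshold(context_length: int, percent: int = HYGIENE_PERCENT) -> int:
    """Token count at which hygiene compression would trigger (hypothetical helper)."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return context_length * percent // 100


print(hygiene_threshold(FALLBACK_CONTEXT))  # threshold under the fallback
print(hygiene_threshold(49152))             # threshold under the configured window
```

With a 49152-token window the threshold lands around 41.8K tokens, so a session would be compressed long before it overflows. With the fallback, the threshold is roughly 35 times larger than the whole window.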

Root Cause

Three issues in get_model_context_length() resolution:

  1. context_length_cache.yaml format mismatch: _load_context_cache() expects data nested under a context_lengths: key, but the file can end up as a flat dict (possibly from an older writer or a manual edit). When that happens, the loader finds no entries and every cache lookup misses.

  2. Gateway doesn't pass base_url: gateway/run.py:1548 calls get_model_context_length(_hyg_model) with no base_url. The persistent cache keys are model@base_url, so without base_url the cache is skipped entirely (the if base_url: guard in get_model_context_length).

  3. Config model.context_length is ignored: the gateway already reads config.yaml and resolves model.default, but never reads the user-configured model.context_length. This is the most reliable source for local models where the user explicitly sets their context window.
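
The first two failure modes can be reproduced with a simplified model of the lookup. This is a sketch of the behavior described above, not the real agent.model_metadata code: the function names, the in-memory dicts standing in for context_length_cache.yaml, and the 2,000,000 fallback value are all assumptions.

```python
from typing import Optional

FALLBACK = 2_000_000  # assumed exact value of the "2M" default


def load_cache_strict(data: dict) -> dict:
    """Read only the nested context_lengths: key, as the issue describes."""
    nested = data.get("context_lengths")
    return nested if isinstance(nested, dict) else {}


def lookup(model: str, data: dict, base_url: Optional[str] = None) -> int:
    """Simplified stand-in for the cache part of the resolution."""
    if base_url:  # the guard: without base_url the cache is never consulted
        cached = load_cache_strict(data).get(f"{model}@{base_url}")
        if cached:
            return cached
    return FALLBACK


model = "qwen3.5:27b-q4_K_M"
url = "http://localhost:11434/v1"
nested = {"context_lengths": {f"{model}@{url}": 49152}}
flat = {f"{model}@{url}": 49152}

print(lookup(model, nested, url))  # nested file plus base_url: cache hit
print(lookup(model, nested))       # no base_url: guard skips the cache
print(lookup(model, flat, url))    # flat file: strict loader sees no entries
```

Only the first call returns the cached 49152. The other two fall through to the fallback, which is the behavior the gateway currently gets.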

Impact

  • Any user running local models (Ollama, llama.cpp, vLLM, etc.) with custom model tags will never get gateway hygiene compression
  • Sessions can grow to hundreds of messages / 100K+ tokens without compression
  • The agent's internal compressor (at 50% threshold) may still work if it gets the right context length, but the gateway safety net is broken

Reproduction

from agent.model_metadata import get_model_context_length
# Returns 2M instead of configured 49152
print(get_model_context_length("qwen3.5:27b-q4_K_M"))
# Returns correct value with base_url (if cache format is fixed)
print(get_model_context_length("qwen3.5:27b-q4_K_M", base_url="http://localhost:11434/v1"))

Fix

Patch gateway/run.py to:

  1. Read model.context_length from the already-loaded config.yaml
  2. Otherwise, fall back to get_model_context_length with OPENAI_BASE_URL passed as base_url

-                _hyg_context_length = get_model_context_length(_hyg_model)
+                _hyg_base_url = os.environ.get("OPENAI_BASE_URL", "")
+                _cfg_ctx = 0
+                try:
+                    _m = _hyg_data.get("model", {})
+                    if isinstance(_m, dict):
+                        _cfg_ctx = _m.get("context_length", 0)
+                except NameError:
+                    pass
+                _hyg_context_length = _cfg_ctx or get_model_context_length(_hyg_model, base_url=_hyg_base_url)
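
The same priority order can be written as a small standalone function, which makes it easier to test outside the gateway. This is a sketch only: resolve_context_length and fake_lookup are hypothetical names, and the extra check that the configured value is a positive integer goes beyond what the patch above does.

```python
from typing import Any, Callable


def resolve_context_length(
    config: Any, model: str, base_url: str, lookup: Callable[..., int]
) -> int:
    """Prefer a positive integer model.context_length, else defer to lookup."""
    model_cfg = config.get("model") if isinstance(config, dict) else None
    if isinstance(model_cfg, dict):
        value = model_cfg.get("context_length")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool) and value > 0:
            return value
    return lookup(model, base_url=base_url)


def fake_lookup(model: str, base_url: str = "") -> int:
    """Stand-in for get_model_context_length; always returns the assumed 2M default."""
    return 2_000_000


cfg = {"model": {"default": "qwen3.5:27b-q4_K_M", "context_length": 49152}}
print(resolve_context_length(cfg, "qwen3.5:27b-q4_K_M", "", fake_lookup))
print(resolve_context_length({"model": "some-name"}, "some-name", "", fake_lookup))
```

The type check guards against a config where context_length is present but null or a string, and against model being a plain string rather than a mapping.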

Also consider:

  • Making _load_context_cache() handle both flat and nested formats
  • Removing the if base_url: guard in get_model_context_length so bare model names can still match cache entries
  • Querying Ollama's /api/show endpoint for actual context length when the provider is a local Ollama instance
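
Two of these follow-ups can be sketched without touching the network. normalize_cache and context_length_from_show are hypothetical helpers, not existing functions in this codebase. The second assumes that the /api/show response carries a model_info mapping with an architecture-prefixed key ending in .context_length, which should be verified against the Ollama version in use. The sample payload is hand-written for illustration, not captured output.

```python
from typing import Any, Dict, Optional


def _is_int(value: Any) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def normalize_cache(data: Any) -> Dict[str, int]:
    """Accept both the nested (context_lengths:) and the flat cache layout."""
    if not isinstance(data, dict):
        return {}
    nested = data.get("context_lengths")
    source = nested if isinstance(nested, dict) else data
    return {str(k): v for k, v in source.items() if _is_int(v)}


def context_length_from_show(payload: Any) -> Optional[int]:
    """Extract a context length from an already-parsed /api/show response."""
    info = payload.get("model_info") if isinstance(payload, dict) else None
    if not isinstance(info, dict):
        return None
    for key, value in info.items():
        if str(key).endswith(".context_length") and _is_int(value):
            return value
    return None


key = "qwen3.5:27b-q4_K_M@http://localhost:11434/v1"
print(normalize_cache({"context_lengths": {key: 49152}}))
print(normalize_cache({key: 49152}))

# Illustrative payload only; real responses contain many more fields.
sample = {"model_info": {"general.architecture": "example", "example.context_length": 49152}}
print(context_length_from_show(sample))
```

Note that a value read from model metadata may be the model's declared maximum rather than the window the server is actually configured to run with, so the user-configured model.context_length should probably still take precedence.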
