Gateway hygiene compression never triggers for local/Ollama models #2708

## Bug

The gateway's pre-agent hygiene compressor uses `get_model_context_length(model)` without passing `base_url`, which causes it to miss the persistent context length cache and fall through to the 2M default for unrecognized model names (e.g. `qwen3.5:27b-q4_K_M`, any Ollama model tag).

This sets the compression threshold at 85% of 2M = **1.7M tokens**, making compression effectively unreachable. Sessions grow unbounded until they hit the actual context limit, at which point the model starts failing or degrading.
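The arithmetic can be checked with a small sketch. The 2M fallback, the 85% ratio, and the 49152 window come from this report; the constant and function names are hypothetical and do not mirror the gateway code.

```python
# Values taken from this issue; the names are illustrative only.
DEFAULT_CONTEXT_LENGTH = 2_000_000  # fallback for unrecognized model names
HYGIENE_THRESHOLD_PERCENT = 85      # gateway hygiene threshold


def compression_threshold(context_length: int,
                          percent: int = HYGIENE_THRESHOLD_PERCENT) -> int:
    """Token count at which hygiene compression would start (integer math)."""
    return context_length * percent // 100


fallback_threshold = compression_threshold(DEFAULT_CONTEXT_LENGTH)
configured_threshold = compression_threshold(49_152)

print(fallback_threshold)    # threshold when the 2M default is used
print(configured_threshold)  # threshold for a 49152-token local model
```

With the fallback, the threshold is roughly 40 times larger than the one a 49152-token model actually needs, so the session overflows the real window long before compression is considered.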

## Root Cause

Three issues in `get_model_context_length()` resolution:

1. **`context_length_cache.yaml` format mismatch** — `_load_context_cache()` expects data nested under a `context_lengths:` key, but the file can end up as a flat dict (possibly from an older writer or manual edit). When this happens, cache lookups always return empty.

2. **Gateway doesn't pass `base_url`** — `gateway/run.py:1548` calls `get_model_context_length(_hyg_model)` with no `base_url`. The persistent cache keys are `model@base_url`, so without `base_url` the cache is skipped entirely (line `if base_url:` guard in `get_model_context_length`).

3. **Config `model.context_length` is ignored** — The gateway already reads `config.yaml` and resolves `model.default`, but never reads the user-configured `model.context_length`. This is the most reliable source for local models where the user explicitly sets their context window.
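The first two failure modes can be reproduced with a toy model of the lookup. This is a simplified sketch based only on the behavior described above, not the actual `agent.model_metadata` code: `load_cache` and `lookup` are hypothetical stand-ins, and the cache file is represented as an already-parsed dict.

```python
from typing import Dict, Optional

DEFAULT_CONTEXT_LENGTH = 2_000_000  # fallback described in this issue


def load_cache(raw: dict) -> Dict[str, int]:
    """Stand-in loader that only reads the nested `context_lengths:` key."""
    nested = raw.get("context_lengths")
    return nested if isinstance(nested, dict) else {}


def lookup(model: str, raw_cache: dict, base_url: Optional[str] = None) -> int:
    """Toy resolution: persistent cache first, then the default."""
    if base_url:  # the guard: without base_url the cache is never consulted
        cached = load_cache(raw_cache).get(f"{model}@{base_url}")
        if cached:
            return cached
    return DEFAULT_CONTEXT_LENGTH


MODEL = "qwen3.5:27b-q4_K_M"
URL = "http://localhost:11434/v1"
nested_file = {"context_lengths": {f"{MODEL}@{URL}": 49152}}
flat_file = {f"{MODEL}@{URL}": 49152}

print(lookup(MODEL, nested_file))                # no base_url: default
print(lookup(MODEL, nested_file, base_url=URL))  # nested file + base_url: hit
print(lookup(MODEL, flat_file, base_url=URL))    # flat file: loader sees nothing
```

Only the combination of a nested file and an explicit `base_url` reaches the cached value; either problem alone is enough to fall through to the default.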

## Impact

- Any user running local models (Ollama, llama.cpp, vLLM, etc.) with custom model tags will never get gateway hygiene compression
- Sessions can grow to hundreds of messages / 100K+ tokens without compression
- The agent's internal compressor (at 50% threshold) may still work if it gets the right context length, but the gateway safety net is broken

## Reproduction

```python
from agent.model_metadata import get_model_context_length
# Returns 2M instead of configured 49152
print(get_model_context_length("qwen3.5:27b-q4_K_M"))
# Returns correct value with base_url (if cache format is fixed)
print(get_model_context_length("qwen3.5:27b-q4_K_M", base_url="http://localhost:11434/v1"))
```

## Fix

Patch `gateway/run.py` to:
1. Read `model.context_length` from the already-loaded `config.yaml` and use it when set
2. Otherwise fall back to `get_model_context_length`, passing `OPENAI_BASE_URL` as `base_url`

```diff
-                _hyg_context_length = get_model_context_length(_hyg_model)
+                _hyg_base_url = os.environ.get("OPENAI_BASE_URL", "")
+                _cfg_ctx = 0
+                try:
+                    _m = _hyg_data.get("model", {})
+                    if isinstance(_m, dict):
+                        _cfg_ctx = _m.get("context_length", 0)
+                except NameError:
+                    pass
+                _hyg_context_length = _cfg_ctx or get_model_context_length(_hyg_model, base_url=_hyg_base_url)
```
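The same resolution order could also be factored into a small helper that validates the configured value before trusting it. This is a sketch, not the patch itself: `configured_context_length` and `resolve_hygiene_context_length` are hypothetical names, and `lookup` stands in for `get_model_context_length`, assumed here to accept a `base_url` keyword as in the reproduction above.

```python
import os
from collections.abc import Mapping
from typing import Any, Callable, Optional


def configured_context_length(config: Any) -> int:
    """Return model.context_length from parsed config, or 0 if absent/invalid."""
    model_section = config.get("model") if isinstance(config, Mapping) else None
    if not isinstance(model_section, Mapping):
        return 0  # e.g. `model: some-name` instead of a mapping
    value = model_section.get("context_length")
    if isinstance(value, bool):
        return 0  # YAML booleans are not context lengths
    try:
        parsed = int(value)
    except (TypeError, ValueError, OverflowError):
        return 0
    return parsed if parsed > 0 else 0


def resolve_hygiene_context_length(
    config: Any,
    model: str,
    lookup: Callable[..., int],
    environ: Optional[Mapping] = None,
) -> int:
    """Prefer the user-configured value, then the metadata lookup with base_url."""
    env = os.environ if environ is None else environ
    configured = configured_context_length(config)
    if configured:
        return configured
    base_url = env.get("OPENAI_BASE_URL", "")
    return lookup(model, base_url=base_url or None)
```

Passing the lookup function and environment in as arguments keeps the helper testable without a running gateway. Coercing with `int()` also tolerates a context length that was written as a quoted string in `config.yaml`.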

Also consider:
- Making `_load_context_cache()` handle both flat and nested formats
- Removing the `if base_url:` guard in `get_model_context_length` so bare model names can still match cache entries
- Querying Ollama's `/api/show` endpoint for actual context length when the provider is a local Ollama instance
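A rough sketch of how these follow-ups might look, operating on already-parsed data so that no file or network access is needed. All three function names are hypothetical. The `/api/show` parsing assumes the response carries a `model_info` mapping with an architecture-prefixed `*.context_length` key, which should be verified against the Ollama version in use; that value may also describe the model's maximum window and not necessarily the `num_ctx` the server is actually running with.

```python
from typing import Any, Dict, Optional


def normalize_cache(raw: Any) -> Dict[str, int]:
    """Accept both the nested (`context_lengths:`) and the flat cache layout."""
    if not isinstance(raw, dict):
        return {}
    entries = raw.get("context_lengths")
    if not isinstance(entries, dict):
        entries = raw  # flat layout: keys sit at the top level
    return {
        str(key): value
        for key, value in entries.items()
        if isinstance(value, int) and not isinstance(value, bool) and value > 0
    }


def match_bare_model(cache: Dict[str, int], model: str) -> Optional[int]:
    """Match a bare model name against `model@base_url` keys.

    Assumes model names contain no "@". Returns None when endpoints disagree.
    """
    hits = {v for k, v in cache.items() if k.partition("@")[0] == model}
    return hits.pop() if len(hits) == 1 else None


def context_length_from_show(payload: Any) -> Optional[int]:
    """Pull `<arch>.context_length` out of a parsed /api/show response body."""
    info = payload.get("model_info") if isinstance(payload, dict) else None
    if not isinstance(info, dict):
        return None
    for key, value in info.items():
        if key.endswith(".context_length") and isinstance(value, int):
            return value
    return None
```

Returning `None` from `match_bare_model` when two endpoints report different lengths for the same tag avoids silently picking the wrong one, at the cost of falling back to another source in that case.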
