fix(ollama): pass num_ctx to override 2048 default context window by kshitijk4poor · Pull Request #5929 · NousResearch/hermes-agent

kshitijk4poor · 2026-04-07T19:53:34Z

Problem

Ollama defaults to a 2048-token context window regardless of the model's capabilities. Hermes already queries /api/show to detect the model's max context (e.g., 131072 for Llama 3.1), but never tells Ollama to actually use it. Result: the agent reports 128K context while Ollama only serves 2K, causing silent truncation and eventual context overflow errors.

This is the root cause of #2708 (gateway compression never triggers for Ollama models — the compressor thinks it has 128K tokens of headroom when only 2K are available).
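To make the mismatch concrete, here is a rough worked example. It assumes the compressor fires once the prompt reaches some fraction of the reported window; `should_compress` and the 0.5 threshold are made up for illustration and are not Hermes's actual logic or values.

```python
REPORTED_CTX = 131072  # what Hermes detects from /api/show
SERVED_CTX = 2048      # what Ollama actually allocates by default
THRESHOLD = 0.5        # hypothetical compression trigger, not Hermes's real value


def should_compress(prompt_tokens: int, context_window: int,
                    threshold: float = THRESHOLD) -> bool:
    """Return True once the prompt fills the given fraction of the window."""
    return prompt_tokens >= context_window * threshold


prompt_tokens = 4000  # a modest conversation

# Against the reported window the compressor sees plenty of headroom.
print(should_compress(prompt_tokens, REPORTED_CTX))  # False
# Against the window Ollama really serves, it should have fired already.
print(should_compress(prompt_tokens, SERVED_CTX))    # True
# The prompt is also already larger than the served window, so Ollama truncates it.
print(prompt_tokens > SERVED_CTX)                    # True
```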

Research

Tested against Ollama v0.6.2 source code + live instance. Compared approaches across 3 projects:

| Project | Detects context? | Sends num_ctx? | Handles 2048 default? |
|---|---|---|---|
| Open WebUI | Reads model_info from `/api/show` (but not wired in normal flow) | Only if user manually sets per-model | No, users must configure manually |
| Continue.dev | Static lookup (hardcoded 4096 default) | Never | No |
| Aider | Via litellm static DB + bundled metadata | For 3 pre-configured models only | Partial: documents the problem, user must configure |
| Hermes (this PR) | `/api/show` GGUF metadata + Modelfile params | Auto on every request | Yes, fully automatic |

Ollama's context flow: DefaultOptions() sets num_ctx=2048 → the model's Modelfile overrides it → request options override that → the runner starts with --ctx-size={num_ctx}. If the num_ctx in a request differs from the value the model is currently loaded with, Ollama reloads the model with an expanded KV cache.

Fix

After detecting the model's context length from /api/show, inject num_ctx into every Ollama chat request via extra_body["options"]["num_ctx"].
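A minimal sketch of the injection step, for reviewers who want to see the shape of the request. `inject_num_ctx` is a hypothetical standalone helper; in the PR the equivalent logic lives inside `_build_api_kwargs`, and the real code may differ in detail.

```python
from typing import Any, Optional


def inject_num_ctx(api_kwargs: dict[str, Any],
                   num_ctx: Optional[int]) -> dict[str, Any]:
    """Return request kwargs with extra_body["options"]["num_ctx"] set.

    Hypothetical helper for illustration. Existing options (temperature,
    etc.) are preserved, and the input dict is not mutated.
    """
    if not num_ctx:
        # Nothing detected (e.g. not an Ollama server): leave kwargs untouched.
        return api_kwargs
    merged = dict(api_kwargs)
    extra_body = dict(merged.get("extra_body") or {})
    options = dict(extra_body.get("options") or {})
    options["num_ctx"] = num_ctx
    extra_body["options"] = options
    merged["extra_body"] = extra_body
    return merged


kwargs = {"model": "llama3.1",
          "extra_body": {"options": {"temperature": 0.2}}}
print(inject_num_ctx(kwargs, 131072)["extra_body"])
# {'options': {'temperature': 0.2, 'num_ctx': 131072}}
```

Copying the nested dicts keeps a shared kwargs template from accumulating options across requests.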

Resolution order for num_ctx

1. Config override: model.ollama_num_ctx in config.yaml, for users who want to cap VRAM usage below the model's maximum
2. Modelfile PARAMETER num_ctx: set by the user in the Ollama Modelfile (detected from the /api/show parameters)
3. GGUF model_info.{arch}.context_length: the training maximum from /api/show metadata
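The resolution order can be sketched roughly as follows. This assumes an already-parsed `/api/show` response in which `parameters` is a newline-separated string of `name value` pairs and `model_info` is a flat dict keyed like `llama.context_length`. `resolve_num_ctx` is a hypothetical name; the PR's `query_ollama_num_ctx` also performs the HTTP call and error handling, which are omitted here.

```python
from typing import Any, Optional


def resolve_num_ctx(show_payload: dict[str, Any],
                    config_override: Optional[int] = None) -> Optional[int]:
    """Pick num_ctx from config, then Modelfile, then GGUF metadata."""
    # 1. Explicit config override (model.ollama_num_ctx) wins outright.
    if config_override:
        return int(config_override)

    # 2. Modelfile PARAMETER num_ctx, as reported in the "parameters" string.
    for line in (show_payload.get("parameters") or "").splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "num_ctx" and parts[1].isdigit():
            return int(parts[1])

    # 3. GGUF training context: any "<arch>.context_length" key in model_info.
    for key, value in (show_payload.get("model_info") or {}).items():
        if key.endswith(".context_length") and isinstance(value, int) and value > 0:
            return value

    # Nothing usable found: the caller should not send num_ctx at all.
    return None


payload = {
    "parameters": "temperature 0.7\nnum_ctx 8192",
    "model_info": {"general.architecture": "llama",
                   "llama.context_length": 131072},
}
print(resolve_num_ctx(payload, config_override=16384))  # 16384
print(resolve_num_ctx(payload))                         # 8192
print(resolve_num_ctx({"model_info": payload["model_info"]}))  # 131072
```

Matching on the `.context_length` suffix rather than a fixed architecture name is one way to cover the multi-architecture case mentioned in the tests.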

Config example

model:
  ollama_num_ctx: 16384  # cap context to 16K (saves VRAM)

If not set, the model's full training context is used automatically.

Changes

| File | Change |
|---|---|
| `agent/model_metadata.py` | New `query_ollama_num_ctx(model, base_url)`: queries `/api/show`, returns the appropriate num_ctx |
| `run_agent.py` | Detect Ollama at init, store `_ollama_num_ctx`. Inject into `extra_body["options"]["num_ctx"]` in `_build_api_kwargs` |
| `tests/test_ollama_num_ctx.py` | 8 new tests covering all detection paths |

Testing

python -m pytest tests/test_ollama_num_ctx.py tests/agent/test_model_metadata.py tests/test_model_metadata_local_ctx.py -q
# 102 passed (94 existing + 8 new)

Notes

- The first chat request after a model change may take 5-15 seconds as Ollama reallocates the KV cache. This is expected; subsequent requests reuse the loaded context.
- Users with limited VRAM can set model.ollama_num_ctx in config.yaml to a lower value (e.g., 8192 or 16384).
- Non-Ollama local servers (LM Studio, vLLM, llama.cpp) are unaffected: the injection only fires when query_ollama_num_ctx detects an Ollama server.

Ollama defaults to 2048 context tokens regardless of the model's capabilities. Hermes already queries /api/show to detect the model's max context (from GGUF metadata), but never told Ollama to actually use it. Result: the agent thinks it has 128K context while Ollama only serves 2K, causing silent truncation and context overflow.

Fix: after detecting the model's context length from /api/show, inject num_ctx into every Ollama chat request via extra_body.options. This tells Ollama to allocate the full KV cache for the model's training context window.

Resolution order for num_ctx:
1. Explicit config override: model.ollama_num_ctx in config.yaml (for users who want to cap VRAM usage below model max)
2. Modelfile PARAMETER num_ctx (user-set in the Ollama Modelfile)
3. GGUF model_info.{arch}.context_length (training max from /api/show)

New public API: query_ollama_num_ctx(model, base_url) in model_metadata.py returns the appropriate num_ctx value for an Ollama model, or None if the server is not Ollama.

Research notes: tested against Ollama v0.6.2 source code + live instance. Confirmed that passing num_ctx in chat options triggers Ollama to reload the model with expanded KV cache (--ctx-size flag). Neither Open WebUI, Continue.dev, nor aider auto-detect and set num_ctx; this is a unique improvement.

8 new tests covering: GGUF context extraction, Modelfile num_ctx priority, non-Ollama server handling, connection errors, 404, provider prefix stripping, multi-architecture keys, empty model_info.

Salvaged fixes from community PRs:

- fix(model_switch): _read_auth_store → _load_auth_store + fix auth store key lookup (was checking top-level dict instead of store['providers']). OAuth providers now correctly detected in /model picker. Cherry-picked from PR #5911 by Xule Lin (linxule).
- fix(ollama): pass num_ctx to override 2048 default context window. Ollama defaults to 2048 context regardless of model capabilities. Now auto-detects from /api/show metadata and injects num_ctx into every request. Config override via model.ollama_num_ctx. Fixes #2708. Cherry-picked from PR #5929 by kshitij (kshitijk4poor).
- fix(aux): normalize provider aliases for vision/auxiliary routing. Adds _normalize_aux_provider() with 17 aliases (google→gemini, claude→anthropic, glm→zai, etc). Fixes vision routing failure when provider is set to 'google' instead of 'gemini'. Cherry-picked from PR #5793 by e11i (Elizabeth1979).
- fix(aux): rewrite MiniMax /anthropic base URLs to /v1 for OpenAI SDK. MiniMax's inference_base_url ends in /anthropic (Anthropic Messages API), but auxiliary client uses OpenAI SDK which appends /chat/completions → 404 at /anthropic/chat/completions. Generic _to_openai_base_url() helper rewrites terminal /anthropic to /v1 for OpenAI-compatible endpoint. Inspired by PR #5786 by Lempkey.

Added debug logging to silent exception blocks across all fixes.

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
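For readers unfamiliar with the two aux fixes in the salvage commit, a rough sketch of both ideas follows. The function names mirror the commit message, but the bodies are assumptions rather than the merged code: only three of the 17 aliases are listed in the message, the lowercasing and whitespace stripping are guesses, and the example.com host is a placeholder.

```python
# Only the aliases named in the commit message; the real table has 17 entries.
AUX_PROVIDER_ALIASES = {
    "google": "gemini",
    "claude": "anthropic",
    "glm": "zai",
}


def normalize_aux_provider(name: str) -> str:
    """Map a provider alias to its canonical name (case-insensitive here,
    which is an assumption)."""
    key = name.strip().lower()
    return AUX_PROVIDER_ALIASES.get(key, key)


def to_openai_base_url(base_url: str) -> str:
    """Rewrite a terminal /anthropic path segment to /v1.

    Other URLs are returned unchanged, so the helper is safe to apply
    to every provider.
    """
    trimmed = base_url.rstrip("/")
    suffix = "/anthropic"
    if trimmed.endswith(suffix):
        return trimmed[: -len(suffix)] + "/v1"
    return base_url


print(normalize_aux_provider("google"))                        # gemini
print(to_openai_base_url("https://api.example.com/anthropic")) # https://api.example.com/v1
```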

teknium1 · 2026-04-08T05:23:46Z

Merged via PR #5983. Your Ollama num_ctx fix was salvaged onto current main. Thanks @kshitijk4poor!

teknium1 mentioned this pull request Apr 8, 2026

fix: provider/model resolution — salvage 4 community PRs + MiniMax aux URL fix #5983

Merged

teknium1 closed this Apr 8, 2026

kshitijk4poor wants to merge 1 commit into NousResearch:main from kshitijk4poor:fix/ollama-context-length
