Problem or Use Case
When using local LLM providers (Ollama, oMLX, llama.cpp) on consumer hardware, the hardcoded 30s timeout in auxiliary_client.py and 45s timeout in context_compressor.py are too short. Local models need time for prefill, especially when the main model is already generating and auxiliary requests queue behind it.
This was partially addressed for the main client (#1010 → HERMES_API_TIMEOUT) and for vision (#2107 → auxiliary.vision.timeout in config.yaml), but the pattern wasn't extended to:
- auxiliary_client.py: call_llm(), async_call_llm(), and _build_call_kwargs() all default to timeout: float = 30.0
- context_compressor.py: hardcoded "timeout": 45.0 at line 350
- title_generator.py: hardcoded timeout: float = 15.0
Impact
On a local setup running a single model for both main inference and auxiliary tasks (compression, session search, skills_hub, flush_memories, title generation), requests queue behind the main generation. A 30s timeout fires before prefill even completes, causing:
- Context compression failures → context grows until it exceeds the context window
- Title generation failures (15s is particularly tight)
- Session search timeout loops (auxiliary request queues, times out, retries, times out again)
Proposed Solution
Follow the existing pattern from HERMES_API_TIMEOUT and auxiliary.vision.timeout:
Option A (env var): HERMES_AUX_TIMEOUT for auxiliary calls, HERMES_COMPRESSION_TIMEOUT for compression — consistent with HERMES_API_TIMEOUT and HERMES_STREAM_STALE_TIMEOUT.
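A minimal sketch of what Option A could look like at a call site, assuming the timeout is read once from the environment and falls back to the current hardcoded value. The helper name timeout_from_env is hypothetical and is not part of Hermes; only the variable names HERMES_AUX_TIMEOUT and HERMES_COMPRESSION_TIMEOUT come from this proposal.

```python
import math
import os


def timeout_from_env(var_name: str, default: float) -> float:
    """Return a timeout in seconds from an env var, or the default.

    Hypothetical helper for illustration. Unset, empty, non-numeric,
    non-finite, or non-positive values all fall back to the default
    so a typo in the environment cannot disable the timeout.
    """
    raw = os.environ.get(var_name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    if not math.isfinite(value) or value <= 0:
        return default
    return value


# Assumed usage: keep today's hardcoded values as the fallbacks.
AUX_TIMEOUT = timeout_from_env("HERMES_AUX_TIMEOUT", 30.0)
COMPRESSION_TIMEOUT = timeout_from_env("HERMES_COMPRESSION_TIMEOUT", 45.0)
```

Keeping the existing 30s and 45s values as fallbacks means behavior is unchanged for anyone who does not set the variables.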
Option B (config.yaml): Add timeout fields under existing config sections:
  compression:
    timeout: 120          # was hardcoded 45
  auxiliary:
    default_timeout: 90   # was hardcoded 30 in call_llm/async_call_llm
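A rough sketch of how the config lookup for Option B might work, assuming config.yaml has already been parsed into a plain dict. How Hermes actually loads its config is not shown in this issue, and the helper name timeout_from_config is hypothetical.

```python
def timeout_from_config(config: dict, section: str, key: str, default: float) -> float:
    """Look up a timeout (seconds) under config[section][key].

    Hypothetical helper for illustration. Falls back to the default
    when the section or key is missing, or when the value is not a
    positive number.
    """
    section_cfg = config.get(section)
    if not isinstance(section_cfg, dict):
        return default
    value = section_cfg.get(key)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return default
    return float(value) if value > 0 else default


# Assumed shape of the parsed config, mirroring the YAML proposed above.
config = {"compression": {"timeout": 120}, "auxiliary": {"default_timeout": 90}}

compression_timeout = timeout_from_config(config, "compression", "timeout", 45.0)
aux_timeout = timeout_from_config(config, "auxiliary", "default_timeout", 30.0)
```

With the fallbacks set to the current hardcoded values, an existing config.yaml without these fields would keep today's behavior.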
Option B is cleaner long-term. Option A is a one-line patch per call site.
Workaround
I currently patch the defaults in auxiliary_client.py and context_compressor.py to read from env vars. These patches are lost on every hermes update.
Environment
- macOS, M1 Max 32GB
- oMLX serving Qwen3.5-35B-A3B-4bit locally
- Single model handling both main and auxiliary tasks
- Hermes v0.3.0 (latest as of 2026-03-27)