[Feature]: Configurable timeouts for auxiliary call_llm and context compression #3404

@alanfwilliams

Problem or Use Case

When using local LLM providers (Ollama, oMLX, llama.cpp) on consumer hardware, the hardcoded 30s timeout in auxiliary_client.py and 45s timeout in context_compressor.py are too short. Local models need time for prefill, especially when the main model is already generating and auxiliary requests queue behind it.

This was partially addressed for the main client (#1010, HERMES_API_TIMEOUT) and for vision (#2107, auxiliary.vision.timeout in config.yaml), but the pattern wasn't extended to:

  1. auxiliary_client.py: call_llm(), async_call_llm(), and _build_call_kwargs() all default to timeout: float = 30.0
  2. context_compressor.py: hardcoded "timeout": 45.0 at line 350
  3. title_generator.py: hardcoded timeout: float = 15.0

Impact

On a local setup running a single model for both main inference and auxiliary tasks (compression, session search, skills_hub, flush_memories, title generation), requests queue behind the main generation. A 30s timeout fires before prefill even completes, causing:

  • Context compression failures → context grows until it exceeds the context window
  • Title generation failures (15s is particularly tight)
  • Session search timeout loops (auxiliary request queues, times out, retries, times out again)

Proposed Solution

Follow the existing pattern from HERMES_API_TIMEOUT and auxiliary.vision.timeout:

Option A (env var): HERMES_AUX_TIMEOUT for auxiliary calls, HERMES_COMPRESSION_TIMEOUT for compression — consistent with HERMES_API_TIMEOUT and HERMES_STREAM_STALE_TIMEOUT.
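A minimal sketch of what Option A could look like at a call site. The env var names are the ones proposed above. The helper `timeout_from_env` is hypothetical and not part of Hermes. The fallbacks are the hardcoded values this issue reports.

```python
import math
import os


def timeout_from_env(var_name: str, default: float) -> float:
    """Return a timeout in seconds read from an env var, else the default.

    Hypothetical helper for illustration; not an existing Hermes function.
    """
    raw = os.environ.get(var_name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        # Unparseable values fall back instead of crashing the call site.
        return default
    # Reject zero, negative, NaN, and infinite values.
    if not math.isfinite(value) or value <= 0:
        return default
    return value


# Fallbacks mirror the hardcoded values reported in this issue.
AUX_TIMEOUT = timeout_from_env("HERMES_AUX_TIMEOUT", 30.0)
COMPRESSION_TIMEOUT = timeout_from_env("HERMES_COMPRESSION_TIMEOUT", 45.0)
```

With this shape, each hardcoded default becomes a single lookup, and unset or malformed values keep the current behavior.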

Option B (config.yaml): Add timeout fields under existing config sections:

compression:
  timeout: 120      # was hardcoded 45

auxiliary:
  default_timeout: 90  # was hardcoded 30 in call_llm/async_call_llm
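A rough sketch of how those fields could be resolved. It assumes config.yaml has already been parsed into a plain dict; how Hermes actually loads its config is not shown here. The helper `timeout_from_config` is hypothetical. The fallbacks are the current hardcoded values.

```python
from collections.abc import Mapping
from typing import Any


def timeout_from_config(
    config: Mapping[str, Any], section: str, key: str, default: float
) -> float:
    """Look up config[section][key] as a positive timeout in seconds.

    Hypothetical helper for illustration; not an existing Hermes function.
    """
    section_cfg = config.get(section)
    if not isinstance(section_cfg, Mapping):
        return default
    value = section_cfg.get(key)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return default
    return float(value) if value > 0 else default


# Stand-in for the parsed config.yaml shown above.
config = {
    "compression": {"timeout": 120},
    "auxiliary": {"default_timeout": 90},
}

compression_timeout = timeout_from_config(config, "compression", "timeout", 45.0)
aux_timeout = timeout_from_config(config, "auxiliary", "default_timeout", 30.0)
```

A missing section or key falls back to the existing default, so configs without the new fields would keep working unchanged.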

Option B is cleaner long-term. Option A is a one-line patch per call site.

Workaround

Currently patching the defaults in auxiliary_client.py and context_compressor.py to read from env vars. These patches are lost on hermes update.

Environment

  • macOS, M1 Max 32GB
  • oMLX serving Qwen3.5-35B-A3B-4bit locally
  • Single model handling both main and auxiliary tasks
  • Hermes v0.3.0 (latest as of 2026-03-27)
