fix(agent): disable stale stream timeout for local providers #6368

Merged
teknium1 merged 1 commit into main from hermes/hermes-b0a4b31e on Apr 9, 2026

Conversation

teknium1 (Contributor) commented on Apr 9, 2026

Summary

Local inference providers (Ollama, oMLX, llama-cpp) can take 300+ seconds for prefill on large contexts. The 180s stale stream detector was killing these connections while the provider was still actively processing, causing spurious reconnects and abandoned requests.
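To make the failure mode concrete, here is a minimal sketch of a silence-based watchdog. `StaleStreamDetector` and its methods are hypothetical names chosen for illustration, not the code in run_agent.py; only the 180s timeout and the 300s prefill figure come from the description above.

```python
from dataclasses import dataclass


@dataclass
class StaleStreamDetector:
    """Hypothetical watchdog: flags a stream that has been silent too long."""

    timeout_s: float
    last_activity_s: float = 0.0

    def on_chunk(self, now_s: float) -> None:
        # Any received chunk counts as activity and resets the silence window.
        self.last_activity_s = now_s

    def is_stale(self, now_s: float) -> bool:
        # Stale means: no chunk has arrived for longer than the timeout.
        return (now_s - self.last_activity_s) > self.timeout_s


detector = StaleStreamDetector(timeout_s=180.0)

# A local provider spends 300 s in prefill and emits nothing in that time.
print(detector.is_stale(now_s=179.0))  # False
print(detector.is_stale(now_s=300.0))  # True: flagged while still processing
```

The watchdog cannot tell a hung connection from a provider that is busy with prefill, which is why a fixed 180s limit produces spurious reconnects for slow local backends.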

Inspired by PR #6123 by @Archerouyang (who identified the issue in #5889). This implementation uses the existing is_local_endpoint() from agent/model_metadata.py instead of creating a new detection method — it does proper URL parsing with localhost, RFC-1918, IPv6, and WSL support without the false positives from substring matching.
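The difference between substring matching and URL parsing can be sketched as follows. Both functions are hypothetical illustrations, not the actual `is_local_endpoint()` from agent/model_metadata.py: the sketch omits WSL handling, and `ipaddress`'s `is_private` covers more ranges than RFC-1918 alone.

```python
import ipaddress
from urllib.parse import urlparse


def looks_local_by_substring(base_url: str) -> bool:
    # Naive approach: matches anywhere in the URL, including paths and queries.
    return any(s in base_url for s in ("localhost", "127.0.0.1", "192.168."))


def looks_local_by_parsing(base_url: str) -> bool:
    # Inspect only the host component of the URL.
    host = urlparse(base_url).hostname
    if not host:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name other than localhost
    return addr.is_loopback or addr.is_private


for url in (
    "http://localhost:11434/v1",
    "http://[::1]:11434/v1",
    "http://192.168.1.20:8000/v1",
    "https://localhost.example.com/v1",         # substring false positive
    "https://api.example.com/v1?via=127.0.0.1",  # substring false positive
):
    print(url, looks_local_by_substring(url), looks_local_by_parsing(url))
```

The last two URLs are public hosts that merely contain a local-looking string; only the parsing variant rejects them.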

Changes

run_agent.py — 1 file, +18/-11 lines

Before the existing token-scaled stale timeout logic, check if:

  1. The timeout is at the default 180s (user hasn't explicitly set HERMES_STREAM_STALE_TIMEOUT)
  2. A base_url is set (not the SDK default)
  3. is_local_endpoint() identifies it as local

If all three hold, disable the stale detector by setting the timeout to float('inf'). Otherwise, fall through to the existing token-scaled logic unchanged.
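The three checks can be sketched as a small function. This is an approximation under stated assumptions, not the diff in run_agent.py: `resolve_stale_timeout` and its parameters are hypothetical, the environment variable is assumed to parse as a float number of seconds, `is_local_endpoint` is passed in as a callable, and the token-scaling step is left out.

```python
import os
from typing import Callable, Mapping, Optional

DEFAULT_STALE_TIMEOUT_S = 180.0


def resolve_stale_timeout(
    base_url: Optional[str],
    is_local_endpoint: Callable[[str], bool],
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Return the stale timeout in seconds, or inf to disable the detector."""
    env = os.environ if env is None else env
    configured = float(env.get("HERMES_STREAM_STALE_TIMEOUT", DEFAULT_STALE_TIMEOUT_S))

    at_default = configured == DEFAULT_STALE_TIMEOUT_S  # condition 1
    has_base_url = bool(base_url)                       # condition 2
    if at_default and has_base_url and is_local_endpoint(base_url):  # condition 3
        return float("inf")

    # Otherwise the caller would continue with the token-scaled logic (omitted).
    return configured


# Stand-in detector for the demo only; the real check does proper URL parsing.
def fake_is_local(url: str) -> bool:
    return url.startswith("http://localhost")


print(resolve_stale_timeout("http://localhost:11434/v1", fake_is_local, env={}))
print(resolve_stale_timeout(
    "http://localhost:11434/v1", fake_is_local,
    env={"HERMES_STREAM_STALE_TIMEOUT": "300"},
))
print(resolve_stale_timeout(None, fake_is_local, env={}))
```

Because any elapsed time compares as less than `float('inf')`, the detector needs no separate "disabled" flag. One consequence of checking the value rather than the presence of the variable is that an explicit setting of 180 is indistinguishable from the default.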

Behavior matrix

| Scenario | Stale timeout |
| --- | --- |
| Ollama on localhost, default config | Disabled (inf) |
| Ollama on localhost, explicit HERMES_STREAM_STALE_TIMEOUT=300 | 300s (user setting honored) |
| LAN server (192.168.x.x) | Disabled (inf) |
| OpenRouter / cloud provider | 180s (with token scaling) |
| Cloud Ollama proxy (api.ollama.com) | 180s (correctly NOT detected as local) |
| Empty/None URL (SDK default) | 180s (correctly NOT detected as local) |

Test plan

  • 671 run_agent tests pass, 5 skipped
  • E2E verified: is_local_endpoint() correctly identifies all local patterns and rejects cloud URLs
  • py_compile clean

Fixes #5889

Local inference providers (Ollama, oMLX, llama-cpp) can take 300+ seconds
for prefill on large contexts. The 180s stale stream detector was killing
these connections while the provider was still processing.

Uses the existing is_local_endpoint() (proper URL parsing with RFC-1918,
localhost, WSL detection) instead of ad-hoc substring matching. The stale
timeout is only disabled when the user hasn't explicitly set
HERMES_STREAM_STALE_TIMEOUT — explicit user config is always honored.

Fixes #5889
teknium1 merged commit ae4a884 into main on Apr 9, 2026
5 of 6 checks passed
saxster pushed a commit to saxster/hermes-agent that referenced this pull request Apr 9, 2026
malaiwah pushed a commit to malaiwah/hermes-agent that referenced this pull request Apr 11, 2026

Extends is_local_endpoint() to detect local LLM proxies accessed via
container DNS names (e.g. hermes-litellm, ollama), fixing the stale
stream timeout (180s) firing on local providers during prefill.

Three additions:

1. Unqualified hostnames (no dots) → always local. Docker/Podman DNS,
   mDNS, and /etc/hosts entries are always on the local network.

2. DNS resolution fallback — resolve hostname to IP with
   socket.gethostbyname(), check if the resolved address is private.

3. Configurable model.local_endpoints in config.yaml — explicit list
   of hostnames to treat as local for edge cases where DNS resolution
   isn't available.

Fixes NousResearch#7905
Related: NousResearch#7069, NousResearch#6368

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
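
The three additions described in this commit message could look roughly like the sketch below. It is a guess at the shape of the logic, not the code from the referenced commit: `is_local_host_extended` is a hypothetical name, `local_endpoints` stands in for the `model.local_endpoints` list from config.yaml, and the resolver is injectable so the example runs without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Optional
from urllib.parse import urlparse


def _private_or_none(text: str) -> Optional[bool]:
    """True/False for an IP literal, None if text is not an IP address."""
    try:
        addr = ipaddress.ip_address(text)
    except ValueError:
        return None
    return addr.is_private or addr.is_loopback


def is_local_host_extended(
    base_url: str,
    local_endpoints: Iterable[str] = (),
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    host = urlparse(base_url).hostname
    if not host:
        return False
    # 3. Explicit allow-list wins (hostnames assumed to be stored lowercase).
    if host in set(local_endpoints):
        return True
    # IP literals are classified directly, without DNS.
    literal = _private_or_none(host)
    if literal is not None:
        return literal
    # 1. Unqualified hostname (no dots), e.g. a container DNS name.
    if "." not in host:
        return True
    # 2. DNS fallback: resolve, then check whether the address is private.
    try:
        resolved = resolve(host)
    except OSError:
        return False  # resolution failed; treat as not local
    return bool(_private_or_none(resolved))


print(is_local_host_extended("http://ollama:11434/v1"))
print(is_local_host_extended("http://hermes-litellm:4000/v1"))
```

Note that `socket.gethostbyname()` resolves IPv4 only and may block, so a real implementation would likely want a timeout or caching around the DNS step.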
Tommyeds pushed a commit to Tommyeds/hermes-agent that referenced this pull request Apr 12, 2026
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: Hermes reconnects after 180s of provider silence even though oMLX is still processing