[macOS][Policy&Network] inference.local routed through host HTTP_PROXY on Colima — large model inference fails with 120s timeout #4846

@hulynn

Description

On macOS with Colima, when a host-level HTTP proxy is active (HTTP_PROXY=http://127.0.0.1:8118), NemoClaw does not add inference.local to NO_PROXY during onboard. All inference requests are routed through the host proxy, which does not support long-lived streaming connections. Every chat message to a large model (Ultra 550B, Super 120B) times out after exactly 120s with "LLM idle timeout (120s): no response from model" and "Broken pipe (os error 32)". Ubuntu bare-metal is unaffected because no host proxy intercepts inference.local traffic.
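A quick way to sanity-check the routing described above is to ask whether a given NO_PROXY value would exempt inference.local at all. The sketch below uses Python's `urllib.request.proxy_bypass_environment`, so it only models Python's own matching rules. The helper name `would_bypass_proxy` and the sample NO_PROXY values are illustrative assumptions, and the HTTP client NemoClaw actually uses may match entries differently.

```python
from urllib.request import proxy_bypass_environment


def would_bypass_proxy(host: str, no_proxy: str) -> bool:
    """Return True if Python's urllib would skip the proxy for `host`.

    `no_proxy` is a comma-separated NO_PROXY value. It is passed in
    explicitly so the check does not depend on the current environment.
    """
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))


if __name__ == "__main__":
    # Hypothetical NO_PROXY values; the real host value was not captured in this report.
    for value in ("localhost,127.0.0.1", "localhost,127.0.0.1,inference.local"):
        verdict = would_bypass_proxy("inference.local", value)
        print(f"NO_PROXY={value!r} -> bypass proxy: {verdict}")
```

If the first form is what the host exports, requests to inference.local would be sent to the proxy at 127.0.0.1:8118, which is consistent with the "Proxy connection error" seen in the gateway log.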

Environment

Device:        MacBook (arm64, Colima Docker runtime)
OS:            macOS 15.x (Darwin 25.1.0, arm64)
Architecture:  arm64
Node.js:       not captured
npm:           not captured
Docker:        Colima
OpenShell CLI: 0.0.44
NemoClaw:      v0.0.59
OpenClaw:      2026.5.27 (27ae826)
Host proxy:    HTTP_PROXY=http://127.0.0.1:8118 (detected and warned during onboard)

Steps to Reproduce

  1. On macOS with Colima and a host HTTP proxy active (HTTP_PROXY=http://127.0.0.1:8118)
  2. Install NemoClaw v0.0.59
  3. nemoclaw onboard — select NVIDIA Endpoints, model nvidia/nemotron-3-ultra-550b-a55b (onboard warns: "HTTP_PROXY detected" but does not add inference.local to NO_PROXY)
  4. nemoclaw my-assistant connect && openclaw tui
  5. Send any chat message (e.g. "hello")
  6. Wait — observe timeout after 120s

Expected Result

inference.local is a NemoClaw-managed virtual hostname that should resolve locally within the OpenShell network stack, not via the host proxy. Onboard should add inference.local (and *.local) to NO_PROXY so proxy bypass is automatic.
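The expected onboard behaviour could look roughly like the sketch below: merge the required hosts into whatever NO_PROXY value already exists, without duplicating entries. This is not NemoClaw's code; `merge_no_proxy` is a hypothetical helper, and the starting value is assumed. The sketch uses the `.local` suffix form rather than `*.local` because support for the wildcard form varies between HTTP clients.

```python
def merge_no_proxy(existing: str, required: list[str]) -> str:
    """Append missing hosts to a comma-separated NO_PROXY value.

    Existing entries keep their order and spelling; comparison is
    case-insensitive so "Inference.Local" is not added twice.
    """
    entries = [e.strip() for e in existing.split(",") if e.strip()]
    seen = {e.lower() for e in entries}
    for host in required:
        if host.lower() not in seen:
            entries.append(host)
            seen.add(host.lower())
    return ",".join(entries)


if __name__ == "__main__":
    # Assumed starting value; the report does not record the host's NO_PROXY.
    current = "localhost,127.0.0.1"
    print(merge_no_proxy(current, ["inference.local", ".local"]))
    # prints: localhost,127.0.0.1,inference.local,.local
```

The merged value would need to be exported for both `NO_PROXY` and `no_proxy`, since clients differ in which spelling they read.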

Model responds normally. (Confirmed: Ultra 550B TTFB = 31s on Ubuntu bare-metal, well within the 120s timeout when no proxy is involved.)

Actual Result

Gateway log (repeated for every chat message):

fetch timeout after 120000ms (elapsed 119350ms) operation=fetchWithSsrFGuard
  url=https://inference.local/v1/chat/completions
[provider-transport-fetch] error provider=inference api=openai-completions
  model=nvidia/nemotron-3-ultra-550b-a55b elapsedMs=119364 name=TimeoutError
Embedded agent failed before reply: LLM idle timeout (120s): no response from model
NET:FAIL [LOW] [msg:Proxy connection error: Broken pipe (os error 32)]

Pattern repeats on every message. Session is completely unusable.
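For anyone triaging similar logs, the timeout line can be parsed to compare the configured budget with the time actually spent. The following is a minimal sketch; the regular expression is written against the single log format shown above and `parse_fetch_timeout` is a hypothetical helper, not part of OpenClaw.

```python
import re
from typing import Optional, Tuple

# Matches the "[fetch-timeout]" message format shown in the gateway log above.
TIMEOUT_RE = re.compile(r"fetch timeout after (\d+)ms \(elapsed (\d+)ms\)")


def parse_fetch_timeout(line: str) -> Optional[Tuple[int, int]]:
    """Return (timeout_ms, elapsed_ms) from a fetch-timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Line copied from the gateway log in this report.
    sample = (
        "fetch timeout after 120000ms (elapsed 119350ms) "
        "operation=fetchWithSsrFGuard"
    )
    timeout_ms, elapsed_ms = parse_fetch_timeout(sample)
    print(f"timeout={timeout_ms}ms elapsed={elapsed_ms}ms")
    # prints: timeout=120000ms elapsed=119350ms
```

In this report the elapsed time is within a second of the 120000ms budget on every message, i.e. the request waits for the full timeout rather than failing fast.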

Logs

Gateway log excerpts (openclaw gateway-persistent.log):
  2026-06-05T09:37:52.815+00:00 [fetch-timeout] fetch timeout after 120000ms (elapsed 119350ms)
    operation=fetchWithSsrFGuard url=https://inference.local/v1/chat/completions
  2026-06-05T09:37:52.826+00:00 [provider-transport-fetch] [model-fetch] error
    provider=inference api=openai-completions model=nvidia/nemotron-3-ultra-550b-a55b
    elapsedMs=119364 name=TimeoutError
  2026-06-05T09:37:52.998+00:00 Embedded agent failed before reply:
    LLM idle timeout (120s): no response from model
  [1780652455.685] NET:FAIL [LOW] [msg:Proxy connection error: Broken pipe (os error 32)]

Onboard warning:
  HTTP_PROXY=http://127.0.0.1:8118 detected on host

Comparison (Ubuntu bare-metal, same API key, same model):
  curl TTFB for nvidia/nemotron-3-ultra-550b-a55b = 31s → HTTP 200 (no proxy)
  curl TTFB for nvidia/nemotron-3-nano-omni-30b-a3b-reasoning = 300ms → HTTP 200

NVB#6272789

Labels

  NV QA:            Bugs found by the NVIDIA QA Team
  area: inference:  Inference routing, serving, model selection, or outputs
  area: networking: DNS, proxy, TLS, ports, host aliases, or connectivity
  platform: macos:  Affects macOS, including Apple Silicon
