compatible-endpoint provider does not honour NEMOCLAW_LOCAL_INFERENCE_TIMEOUT (vllm-local and ollama-local do); 60s default leaks through to reasoning-model streams #2403

@davidglogan

Summary

NemoClaw supports three local-inference provider paths: ollama-local, vllm-local, and compatible-endpoint. compatible-endpoint is the path an operator picks when inference is served by an OpenAI-compatible endpoint running outside the sandbox (for example a user-owned Ollama, LM Studio, or vLLM server reachable on the LAN).

Observed behaviour: setting NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 before nemoclaw onboard propagates the value into the gateway config when the chosen provider is ollama-local or vllm-local, but not when the chosen provider is compatible-endpoint. openshell inference get after onboard reports Timeout: 60s (default) for compatible-endpoint regardless of the exported env var.

Impact: reasoning / thinking models commonly pause longer than 60 seconds before first token (Qwen 3.6-35B during its reasoning phase, DeepSeek-R1, and similar). With the 60-second timeout in effect, those streams are cut mid-output. The client-side symptom matches the pattern reported in openclaw/openclaw#64432 and the TUI-hang pattern in NemoClaw #2099.

Reproduction

  1. Point NemoClaw at an external OpenAI-compatible endpoint (provider compatible-endpoint, baseUrl: http://<host>:11434/v1).
  2. Export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 in the shell that will run onboard.
  3. Run nemoclaw onboard.
  4. After the sandbox is up, run openshell inference get.

Expected: Timeout: 600s.
Actual: Timeout: 60s (default).

For contrast, repeating the same sequence with provider: ollama-local (Ollama running inside the sandbox container) produces Timeout: 600s as expected.

Observed evidence

The finding is observational at the runtime layer. What I can attest to:

  • openshell inference get output after NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 nemoclaw onboard with provider compatible-endpoint: timeout shows 60s (default).
  • Same environment variable, same sequence, provider ollama-local: timeout shows 600s.
  • The difference is consistent across repeated onboards on v0.0.23 and was also present on v0.0.20 before upgrade.

Live repro, 2026-04-23

Host: Intel NUC (i7-10710U, 64 GB) running DietPi on Debian 13 (Trixie). External Ollama served from a separate LAN host (AMD Ryzen AI 9 HX 370, Radeon 890M iGPU via ROCm, Qwen 3.6-35B quantised Q4_K_M). Provider: compatible-endpoint.

Pre-onboard:

export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600
nemoclaw onboard   # choose compatible-endpoint, base URL above, model qwen3.6:35b

Post-onboard:

$ openshell inference get
Gateway inference:
  Provider:  compatible-endpoint
  Model:     qwen3.6:35b
  Timeout:   60s (default)     <-- expected 600s
  Version:   1
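
The comparison above can be scripted so the check is repeatable across onboards. This is a minimal sketch, assuming the `Timeout:   60s (default)` line format shown in the output above; the function names `inference_timeout_seconds` and `check_inference_timeout` are hypothetical helpers, not part of openshell.

```shell
# Extract the numeric timeout (in seconds) from `openshell inference get`
# output supplied on stdin. Assumes a line shaped like "Timeout:   60s (default)".
inference_timeout_seconds() {
  sed -n 's/^[[:space:]]*Timeout:[[:space:]]*\([0-9][0-9]*\)s.*$/\1/p' | head -n 1
}

# Compare the reported timeout with an expected value.
# Returns 0 on match, 1 on mismatch, 2 if no Timeout line was found.
check_inference_timeout() {
  expected="$1"
  actual="$(inference_timeout_seconds)"
  if [ -z "$actual" ]; then
    echo "error: no Timeout line found"
    return 2
  fi
  if [ "$actual" = "$expected" ]; then
    echo "ok: timeout is ${actual}s"
  else
    echo "mismatch: expected ${expected}s, got ${actual}s"
    return 1
  fi
}
```

Possible usage after onboard: `openshell inference get | check_inference_timeout "${NEMOCLAW_LOCAL_INFERENCE_TIMEOUT:-60}"`.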

Manual overrides

openshell inference set --timeout 600 after onboard does write Timeout: 600s into the gateway config successfully:

$ openshell inference set --timeout 600
Gateway inference configured:
  ...
  Timeout:   600s

However, the override is not durable across sessions: the value reverts to 60s (default) on a subsequent nemoclaw <sandbox> connect (the connect subcommand re-applies blueprint defaults). I will report that reversion behaviour in a separate filing; for this issue the relevant consequence is that the post-onboard manual override is single-session only.

The other in-sandbox path, openclaw config set provider.compatible-endpoint.timeoutSeconds 600, is currently blocked by the validator behaviour reported in NVIDIA/NemoClaw#2400: the validator rejects the path on a sandbox where the key has not yet been written.

Net effect for compatible-endpoint + reasoning-model operators on v0.0.23:

  1. Run openshell inference set --timeout 600 after every connect, or
  2. Use ollama-local instead (which changes the operational topology: Ollama then runs inside the sandbox rather than as a pre-existing service), or
  3. Edit /sandbox/.openclaw/openclaw.json directly via kubectl exec (works, unsupported, bypasses both the validator reported in NVIDIA/NemoClaw#2400 and the gateway reconcile), or
  4. Accept the 60-second ceiling.
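
Option 1 can be wrapped in a small helper so the override is only rewritten when it has actually reverted. This is a sketch under assumptions: it uses only the two openshell commands shown in this report, it parses the `Timeout:   <n>s` line format shown above, and `ensure_inference_timeout` is a hypothetical name.

```shell
# Re-apply the desired gateway timeout if the current value differs.
# Argument 1 (optional): desired timeout in seconds; otherwise falls back to
# NEMOCLAW_LOCAL_INFERENCE_TIMEOUT, then to 60.
ensure_inference_timeout() {
  want="${1:-${NEMOCLAW_LOCAL_INFERENCE_TIMEOUT:-60}}"
  # Read the current value from the "Timeout:" line of `openshell inference get`.
  have="$(openshell inference get \
    | sed -n 's/^[[:space:]]*Timeout:[[:space:]]*\([0-9][0-9]*\)s.*$/\1/p' \
    | head -n 1)"
  if [ "$have" = "$want" ]; then
    echo "timeout already ${want}s"
  else
    echo "timeout reads '${have}', setting ${want}s"
    openshell inference set --timeout "$want" >/dev/null
  fi
}
```

Running `ensure_inference_timeout 600` after each connect would restore the value for that session. It does not address the reversion itself.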

What would fix this

  • Plumb NEMOCLAW_LOCAL_INFERENCE_TIMEOUT into the compatible-endpoint provider so the env var's effect matches ollama-local and vllm-local.
  • Document the three providers' env-var contracts side by side in the NemoClaw blueprint reference so future providers are added with consistent env handling.
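
The contract requested in the first bullet can be stated as a small function. This is an illustration only, written in shell for consistency with the commands in this report. It is not NemoClaw's implementation: the function names are hypothetical, and falling back to the default on a non-numeric value is an assumption about the desired behaviour.

```shell
# Hypothetical resolver: one timeout rule shared by every local provider.
# The default of 60 matches the "60s (default)" reported by openshell inference get.
resolve_inference_timeout() {
  value="${NEMOCLAW_LOCAL_INFERENCE_TIMEOUT:-}"
  case "$value" in
    # Unset, empty, or not a whole number of seconds: use the default (assumed rule).
    ''|*[!0-9]*) echo 60 ;;
    *) echo "$value" ;;
  esac
}

# The provider name is accepted but deliberately ignored, so that
# compatible-endpoint resolves exactly like ollama-local and vllm-local.
timeout_for_provider() {
  provider="$1"   # unused on purpose
  resolve_inference_timeout
}
```

With NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 exported, `timeout_for_provider compatible-endpoint` prints the same value as it does for the other two providers, which is the behaviour this issue asks for.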

Environment

  • NemoClaw host: Intel NUC i7-10710U, 64 GB RAM, DietPi on Debian 13 (Trixie), kernel 6.12.74
  • External Ollama host: AMD Ryzen AI 9 HX 370, Radeon 890M iGPU via ROCm, Ubuntu 25.10, kernel 6.17
  • Model: qwen3.6:35b quantised Q4_K_M
  • NemoClaw: 0.0.23 (also seen on 0.0.20)
  • OpenShell: 0.0.32
  • OpenClaw: 2026.4.2 (sandbox image digest b3d832b596...)
  • Provider: compatible-endpoint

Supporting artifacts available on request

  • Full nemoclaw onboard transcript with NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 exported
  • openshell inference get output after onboard (60s (default)) and after manual openshell inference set --timeout 600 (600s)
  • Comparative ollama-local onboard transcript showing 600s correctly propagated

Why this matters

Operators who configure NemoClaw with a pre-existing external inference service (LAN Ollama, internal vLLM, LM Studio) use the compatible-endpoint provider to reach it. That path is also the one where the model on the far end is often larger or more reasoning-heavy than the sandbox-embedded runtime would run locally. Reasoning models in that class can pause longer than 60 seconds before first token. With the 60-second timeout in effect, those streams are cut before the first content chunk reaches the client.

The fix is local (one provider's config-construction path); the cost of leaving it as-is is that compatible-endpoint silently applies a 60-second cap that the NEMOCLAW_LOCAL_INFERENCE_TIMEOUT documentation does not mention as provider-conditional.

Cross-reference to adjacent issues

Labels

  • area: docs (Documentation, examples, guides, or docs build)
  • area: local-models (Local model providers, downloads, launch, or connectivity)
  • area: providers (Inference provider integrations and provider behavior)
