compatible-endpoint provider does not honour NEMOCLAW_LOCAL_INFERENCE_TIMEOUT (vllm-local and ollama-local do); 60s default leaks through to reasoning-model streams

## Summary

NemoClaw supports three local-inference provider paths: `ollama-local`, `vllm-local`, and `compatible-endpoint`. `compatible-endpoint` is the path an operator picks when inference is served by an OpenAI-compatible endpoint running outside the sandbox (for example a user-owned Ollama, LM Studio, or vLLM server reachable on the LAN).

Observed behaviour: setting `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600` before `nemoclaw onboard` propagates the value into the gateway config when the chosen provider is `ollama-local` or `vllm-local`, but not when the chosen provider is `compatible-endpoint`. `openshell inference get` after onboard reports `Timeout: 60s (default)` for `compatible-endpoint` regardless of the exported env var.

Impact: reasoning / thinking models commonly pause longer than 60 seconds before first token (Qwen 3.6-35B during its reasoning phase, DeepSeek-R1, and similar). With the 60-second timeout in effect, those streams are cut off before the response completes. The client-side symptom matches the pattern reported in `openclaw/openclaw#64432` and the TUI-hang pattern in NemoClaw #2099.

## Reproduction

1. Point NemoClaw at an external OpenAI-compatible endpoint (provider `compatible-endpoint`, `baseUrl: http://<host>:11434/v1`).
2. Export `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600` in the shell that will run onboard.
3. Run `nemoclaw onboard`.
4. After the sandbox is up, run `openshell inference get`.

Expected: `Timeout: 600s`.
Actual: `Timeout: 60s (default)`.

For contrast, repeating the same sequence with `provider: ollama-local` (Ollama running inside the sandbox container) produces `Timeout: 600s` as expected.
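The expected/actual comparison can be scripted for repeated onboards. This is a minimal sketch that assumes the `Timeout:` line format shown in this issue (for example `Timeout:   60s (default)`); `parse_timeout` and `check_timeout` are hypothetical helper names, not part of OpenShell.

```shell
#!/bin/sh
# Extract the numeric timeout (seconds) from `openshell inference get` output on stdin.
# Assumes a line shaped like "  Timeout:   60s (default)", as seen in this issue.
parse_timeout() {
  sed -n 's/^[[:space:]]*Timeout:[[:space:]]*\([0-9][0-9]*\)s.*$/\1/p' | head -n 1
}

# Compare the reported timeout (read from stdin) with the expected value in $1.
# Prints a one-line verdict; returns 0 on match, 1 on mismatch.
check_timeout() {
  expected="$1"
  actual="$(parse_timeout)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: timeout is ${actual}s"
  else
    echo "mismatch: expected ${expected}, reported ${actual:-none}"
    return 1
  fi
}
```

Usage after onboard: `openshell inference get | check_timeout "$NEMOCLAW_LOCAL_INFERENCE_TIMEOUT"`.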

## Observed evidence

The finding is observational at the runtime layer. What I can attest to:

- `openshell inference get` output after `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 nemoclaw onboard` with provider `compatible-endpoint`: timeout shows `60s (default)`.
- Same environment variable, same sequence, provider `ollama-local`: timeout shows `600s`.
- The difference is consistent across repeated onboards on v0.0.23 and was also present on v0.0.20 before upgrade.

## Live repro, 2026-04-23

Host: Intel NUC (i7-10710U, 64 GB) running DietPi on Debian 13 (Trixie). External Ollama served from a separate LAN host (AMD Ryzen AI 9 HX 370, Radeon 890M iGPU via ROCm, Qwen 3.6-35B quantised Q4_K_M). Provider: `compatible-endpoint`.

Pre-onboard:

```
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600
nemoclaw onboard   # choose compatible-endpoint, base URL above, model qwen3.6:35b
```

Post-onboard:

```
$ openshell inference get
Gateway inference:
  Provider:  compatible-endpoint
  Model:     qwen3.6:35b
  Timeout:   60s (default)     <-- expected 600s
  Version:   1
```

### Manual overrides

`openshell inference set --timeout 600` after onboard does write `Timeout: 600s` into the gateway config successfully:

```
$ openshell inference set --timeout 600
Gateway inference configured:
  ...
  Timeout:   600s
```

However, the override is not durable across sessions: the value reverts to `60s (default)` on a subsequent `nemoclaw <sandbox> connect` (the connect subcommand re-applies blueprint defaults). I will report that reversion in a separate filing; for this issue, the relevant consequence is that the post-onboard manual override lasts for a single session only.

The other in-sandbox path, `openclaw config set provider.compatible-endpoint.timeoutSeconds 600`, is currently blocked by the validator behaviour reported in [NVIDIA/NemoClaw#2400](https://github.com/NVIDIA/NemoClaw/issues/2400): the validator rejects the path on a sandbox where the key has not yet been written.

Net effect: operators running reasoning models through `compatible-endpoint` on v0.0.23 are left with four options:

1. Re-run `openshell inference set --timeout 600` after every `connect`, or
2. Use `ollama-local` instead (which changes the operational topology: Ollama then runs inside the sandbox rather than as a pre-existing service), or
3. Edit `/sandbox/.openclaw/openclaw.json` directly via `kubectl exec` (works, unsupported, bypasses both the validator in #2400 and the gateway reconcile), or
4. Accept the 60-second ceiling.
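Option 1 can be wrapped so the override is only re-applied when the gateway has actually reverted. This is a sketch, not a supported tool: `ensure_timeout` is a hypothetical helper, it relies only on the two `openshell` commands shown in this issue, and it assumes the `Timeout:` line format from the output above. When to run it relative to `connect` is left to the operator.

```shell
#!/bin/sh
# Re-apply the desired inference timeout if the gateway reports a different value.
# $1 is the desired timeout in seconds (defaults to 600).
ensure_timeout() {
  want="${1:-600}"
  # Read the current value from the "Timeout:" line of `openshell inference get`.
  have="$(openshell inference get \
    | sed -n 's/^[[:space:]]*Timeout:[[:space:]]*\([0-9][0-9]*\)s.*$/\1/p' \
    | head -n 1)"
  # Only write when the reported value differs, so repeated calls are harmless.
  if [ "$have" != "$want" ]; then
    openshell inference set --timeout "$want"
  fi
}
```

Usage: `ensure_timeout 600`.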

## What would fix this

- Plumb `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` into the `compatible-endpoint` provider so the env var's effect matches `ollama-local` and `vllm-local`.
- Document the three providers' env-var contracts side by side in the NemoClaw blueprint reference so future providers are added with consistent env handling.
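To make the requested contract concrete, here is a hedged sketch of provider-independent resolution. It is not NemoClaw's implementation: `resolve_timeout` and `DEFAULT_TIMEOUT` are hypothetical names, and falling back to the default on a non-numeric value is an assumption rather than observed behaviour. The only facts taken from this issue are the three provider names, the env var, and the 60-second default.

```shell
#!/bin/sh
# Hypothetical sketch of the desired contract: every local-inference provider
# resolves its timeout from the same env var, with the same default.
DEFAULT_TIMEOUT=60  # matches the "60s (default)" reported by `openshell inference get`

# Print the timeout (seconds) that onboard would be expected to write for provider $1.
resolve_timeout() {
  provider="$1"
  case "$provider" in
    ollama-local|vllm-local|compatible-endpoint)
      value="${NEMOCLAW_LOCAL_INFERENCE_TIMEOUT:-$DEFAULT_TIMEOUT}"
      ;;
    *)
      echo "unknown provider: $provider" >&2
      return 2
      ;;
  esac
  # Assumption: an empty, non-numeric, or zero value falls back to the default.
  case "$value" in
    ''|*[!0-9]*|0) value="$DEFAULT_TIMEOUT" ;;
  esac
  echo "$value"
}
```

Under this contract, `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600 resolve_timeout compatible-endpoint` and the same call with `ollama-local` print the same value.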

## Environment

- NemoClaw host: Intel NUC i7-10710U, 64 GB RAM, DietPi on Debian 13 (Trixie), kernel 6.12.74
- External Ollama host: AMD Ryzen AI 9 HX 370, Radeon 890M iGPU via ROCm, Ubuntu 25.10, kernel 6.17
- Model: `qwen3.6:35b` quantised Q4_K_M
- NemoClaw: 0.0.23 (also seen on 0.0.20)
- OpenShell: 0.0.32
- OpenClaw: 2026.4.2 (sandbox image digest `b3d832b596...`)
- Provider: `compatible-endpoint`

## Supporting artifacts available on request

- Full `nemoclaw onboard` transcript with `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=600` exported
- `openshell inference get` output after onboard (`60s (default)`) and after manual `openshell inference set --timeout 600` (`600s`)
- Comparative `ollama-local` onboard transcript showing `600s` correctly propagated

## Why this matters

Operators who configure NemoClaw with a pre-existing external inference service (LAN Ollama, internal vLLM, LM Studio) use the `compatible-endpoint` provider to reach it. That path is also the one where the model on the far end is often larger or more reasoning-heavy than the sandbox-embedded runtime would run locally. Reasoning models in that class can pause longer than 60 seconds before first token. With the 60-second timeout in effect, those streams are cut before the first content chunk reaches the client.

The fix is local (one provider's config-construction path); the cost of leaving it as-is is that `compatible-endpoint` silently applies a 60-second cap that the `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` documentation does not mention as provider-conditional.

## Cross-reference to adjacent issues

- `openclaw/openclaw#64432` (LLM idle timeout kills Ollama reasoning streams): the upstream symptom. If #64432 closes at the OpenClaw layer (for example by resetting the idle timer on thinking chunks), this timeout gap on `compatible-endpoint` still applies to other scenarios where a 60-second ceiling is too low.
- NemoClaw #2099 (TUI hangs 8+ min): the downstream user-visible symptom when this timeout fires without a clean error surfacing to the UI. A separate comment on #2099 will reference this filing once it is posted.
- [NVIDIA/NemoClaw#2400](https://github.com/NVIDIA/NemoClaw/issues/2400) (`openclaw config set` validator rejects unset keys): blocks the in-sandbox fix path for this timeout.

