
[DGX Spark][Inference] Express setup sandbox cannot reach local Ollama — host.openshell.internal unresolvable + policy port mismatch (11434 vs 11435) #3562

@wangericnv

Description


On DGX Spark with NemoClaw v0.0.43 + OpenShell 0.0.39, the express setup completes successfully (installs Ollama, pulls qwen3.6:35b, creates sandbox) but inference from the sandbox fails with "LLM request failed: network connection error". Ollama works correctly on the host (direct curl returns valid response), but the sandbox cannot reach it.

Root cause analysis identified three layered issues:

1. DNS: host.openshell.internal does not resolve inside the sandbox
   - getent hosts host.openshell.internal → CANNOT RESOLVE
   - This hostname is used by the inference route to reach the host Ollama proxy
   - Likely cause: the OpenShell 0.0.39 Docker-driver gateway does not set up host-gateway DNS (the k3s gateway in 0.0.36 used CoreDNS + NodeHosts, which worked)

2. Policy port mismatch: local_inference preset allows port 11434, but auth proxy listens on 11435
   - Policy: host.openshell.internal:11434 (allowed)
   - Actual proxy: 0.0.0.0:11435 (not in policy)
   - Even if DNS resolved, requests to :11435 would be blocked by policy

3. SSRF check: OpenShell proxy rejects requests because DNS resolution fails
   - Sandbox curl goes through http_proxy=10.200.0.1:3128
   - Proxy cannot resolve host.openshell.internal → returns 403 ssrf_denied
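
The way these three layers stack can be sketched as a small model. This is only an illustration, not OpenShell's actual logic: `blockers`, `POLICY_PORTS`, `PROXY_PORT`, and the `resolve` callback are hypothetical names, and only the hostname, ports, and bridge address are taken from this report. It shows that restoring DNS alone would still leave the port mismatch in place.

```python
from typing import Callable, List, Optional, Set

HOST = "host.openshell.internal"
POLICY_PORTS = {11434}  # port allowed by the local_inference preset (from the report)
PROXY_PORT = 11435      # port the auth proxy actually listens on (from the report)


def blockers(port: int,
             resolve: Callable[[str], Optional[str]],
             policy_ports: Set[int] = POLICY_PORTS) -> List[str]:
    """List every layer that would block a sandbox request to HOST:port."""
    found = []
    if resolve(HOST) is None:
        # Issues 1 and 3: with no address, the proxy's SSRF check cannot pass.
        found += ["dns", "ssrf_denied"]
    if port not in policy_ports:
        # Issue 2: the destination port is not listed in the policy.
        found.append("policy_port")
    return found


# Observed state: DNS broken, policy allows 11434 only.
print(blockers(PROXY_PORT, lambda h: None))
# DNS fixed (resolves to the Docker bridge gateway), policy unchanged.
print(blockers(PROXY_PORT, lambda h: "172.17.0.1"))
# DNS fixed and 11435 added to the policy.
print(blockers(PROXY_PORT, lambda h: "172.17.0.1", {11434, 11435}))
```

Only the last case returns an empty list, which suggests the DNS fix and the policy fix need to land together.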

This is a regression between NemoClaw v0.0.38 (OpenShell 0.0.36, k3s gateway, inference worked) and v0.0.43 (OpenShell 0.0.39, Docker-driver gateway, inference broken).

Environment

Device:        DGX Spark (spark-dadc / dgx-spark-cr03, 10.173.104.110)
OS:            DGX Spark FastOS 1.135.33 (customer build)
Architecture:  aarch64
NemoClaw:      v0.0.43
OpenShell CLI: openshell 0.0.39
Ollama:        qwen3.6:35b (23 GB, 100% GPU, responds correctly on host)
Docker bridge: 172.17.0.0/16, gateway 172.17.0.1

Steps to Reproduce

1. Fresh DGX Spark with FastOS 1.135.33
2. Run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
3. Select "Express install" when prompted
4. Wait for Ollama install, model pull, sandbox creation to complete
5. Test inference:
   openshell sandbox exec --name my-assistant -- openclaw agent --session-id test -m "hello"
6. Test from sandbox:
   openshell sandbox exec --name my-assistant -- getent hosts host.openshell.internal
   openshell sandbox exec --name my-assistant -- curl -v http://host.openshell.internal:11435/api/tags
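
The `getent` check in step 6 can also be done from Python, which may be convenient if the sandbox image ships `python3` (an assumption; the report does not say). A minimal standard-library sketch, where `resolve_host` is a hypothetical helper name:

```python
import socket
from typing import List


def resolve_host(name: str) -> List[str]:
    """Return the unique addresses for name, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same situation as `getent hosts` printing nothing.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})
```

Inside the affected sandbox, `resolve_host("host.openshell.internal")` would be expected to return an empty list; with working host-gateway DNS it should contain the bridge gateway address.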

Expected Result

1. host.openshell.internal resolves to Docker bridge gateway (172.17.0.1)
2. Sandbox can reach Ollama auth proxy on port 11435
3. Agent inference returns a valid response via local Ollama

Actual Result

1. host.openshell.internal → CANNOT RESOLVE (getent returns nothing)
2. curl to host.openshell.internal:11435 → 403 ssrf_denied (allowed_ips check failed)
3. Agent inference → "FailoverError: LLM request failed: network connection error"

Direct host test (bypassing sandbox) works fine:
  curl http://localhost:11434/api/generate -d '{"model":"qwen3.6:35b","prompt":"hi"}' → valid response

Policy shows local_inference preset with port 11434 only:
  local_inference:
    endpoints:
    - host: host.openshell.internal
      port: 11434          ← should include 11435 (auth proxy port)
      allowed_ips: 10.0.0.0/8, 172.16.0.0/12
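
As a sanity check on the `allowed_ips` ranges above, the expected bridge gateway address falls inside them, so the `ssrf_denied` result is consistent with the failed DNS lookup rather than with an out-of-range address. A small sketch using Python's `ipaddress` module; `ip_allowed` is a hypothetical helper and the CIDR list is copied from the policy excerpt:

```python
import ipaddress
from typing import Sequence

ALLOWED_IPS = ["10.0.0.0/8", "172.16.0.0/12"]  # from the local_inference preset


def ip_allowed(ip: str, allowed: Sequence[str] = ALLOWED_IPS) -> bool:
    """True if ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed)


print(ip_allowed("172.17.0.1"))    # Docker bridge gateway from the report -> True
print(ip_allowed("10.200.0.1"))    # sandbox proxy address from the report -> True
print(ip_allowed("192.168.1.10"))  # outside both ranges -> False
```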

Logs

Gateway agent failed; falling back to embedded: GatewayClientRequestError:
  FailoverError: LLM request failed: network connection error.
[agent/embedded] embedded run agent end: runId=test-debug isError=true
  model=qwen3.6:35b provider=inference error=LLM request failed:
  network connection error. rawError=Connection error.
[model-fallback/decision] model fallback decision: decision=candidate_failed
  requested=inference/qwen3.6:35b reason=timeout next=none
  detail=Connection error.
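
For triage, the `key=value` fields in the fallback line can be pulled out with a short parser. This is a sketch that assumes the format seen in the single excerpt above (wrapped lines joined into one); `parse_fields` is a hypothetical helper. Note that the line reports `reason=timeout` while `detail` says `Connection error.`

```python
import re

LINE = ("model fallback decision: decision=candidate_failed "
        "requested=inference/qwen3.6:35b reason=timeout next=none "
        "detail=Connection error.")

# A value runs until the next " key=" token or the end of the line,
# so values containing spaces (such as detail) stay intact.
PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|$)")


def parse_fields(line: str) -> dict:
    """Extract key=value pairs from a single, unwrapped log line."""
    return {key: value for key, value in PAIR.findall(line)}


fields = parse_fields(LINE)
print(fields["reason"], "|", fields["detail"])  # timeout | Connection error.
```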

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      DGX_Spark_OTA_Computex, NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Install, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6179603]

Labels

NV QA: Bugs found by the NVIDIA QA Team
UAT: Issues flagged for User Acceptance Testing
security: Potential vulnerability, unsafe behavior, or access risk
