[DGX Spark][Inference][GitHub Issue #3562] Express setup sandbox cannot reach local Ollama — host.openshell.internal unresolvable + policy port mismatch (11434 vs 11435) #3568

@cr7258

Description

On DGX Spark with NemoClaw v0.0.43 + OpenShell 0.0.39, the express setup completes successfully (it installs Ollama, pulls qwen3.6:35b, and creates the sandbox), but inference from the sandbox fails with "LLM request failed: network connection error". Ollama works correctly on the host (a direct curl returns a valid response), but the sandbox cannot reach it.

Root cause analysis identified three layered issues:

1. DNS: host.openshell.internal does not resolve inside the sandbox
   - getent hosts host.openshell.internal → CANNOT RESOLVE
   - This hostname is used by the inference route to reach the host Ollama proxy
   - Likely caused by OpenShell 0.0.39 Docker-driver gateway not setting up host-gateway DNS (k3s gateway in 0.0.36 used CoreDNS + NodeHosts which worked)

2. Policy port mismatch: local_inference preset allows port 11434, but auth proxy listens on 11435
   - Policy: host.openshell.internal:11434 (allowed)
   - Actual proxy: 0.0.0.0:11435 (not in policy)
   - Even if DNS resolved, requests to :11435 would be blocked by policy

3. SSRF check: OpenShell proxy rejects requests because DNS resolution fails
   - Sandbox curl goes through http_proxy=10.200.0.1:3128
   - Proxy cannot resolve host.openshell.internal → returns 403 ssrf_denied

This broke between v0.0.38 (OpenShell 0.0.36, k3s gateway, inference worked) and v0.0.43 (OpenShell 0.0.39, Docker-driver gateway, inference broken).
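
The layering above can be sketched as a small model. This is a simplified illustration, not OpenShell's actual proxy logic: the function name blocking_layers and the layer labels are hypothetical, and the only facts taken from this report are the two port numbers and the observation that the SSRF check fails when DNS fails.

```python
from typing import List, Optional, Set

POLICY_PORTS = {11434}  # ports allowed by the local_inference preset (from this report)
PROXY_PORT = 11435      # port the Ollama auth proxy listens on (from this report)


def blocking_layers(resolved_ip: Optional[str],
                    target_port: int,
                    policy_ports: Set[int]) -> List[str]:
    """Return the layers that would block a sandbox request (hypothetical model)."""
    problems: List[str] = []
    if resolved_ip is None:
        # Issue 1: the hostname does not resolve inside the sandbox.
        problems.append("dns")
        # Issue 3: the proxy cannot resolve the name either, so it returns ssrf_denied.
        problems.append("ssrf")
    if target_port not in policy_ports:
        # Issue 2: the policy does not list the port the auth proxy listens on.
        problems.append("policy_port")
    return problems


# Current state: no DNS answer, request aimed at the auth proxy port.
print(blocking_layers(None, PROXY_PORT, POLICY_PORTS))
# DNS fixed only: the port mismatch still blocks the request.
print(blocking_layers("172.17.0.1", PROXY_PORT, POLICY_PORTS))
# DNS fixed and 11435 added to the policy: nothing left in this model.
print(blocking_layers("172.17.0.1", PROXY_PORT, POLICY_PORTS | {PROXY_PORT}))
```

The second call is the reason the issues are described as layered: fixing name resolution alone would not be enough under this model.
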

Environment

Device:        DGX Spark (spark-dadc / dgx-spark-cr03, 10.173.104.110)
OS:            DGX Spark FastOS 1.135.33 (customer build)
Architecture:  aarch64
NemoClaw:      v0.0.43
OpenShell CLI: openshell 0.0.39
Ollama:        qwen3.6:35b (23 GB, 100% GPU, responds correctly on host)
Docker bridge: 172.17.0.0/16, gateway 172.17.0.1

Steps to Reproduce

1. Fresh DGX Spark with FastOS 1.135.33
2. Run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
3. Select "Express install" when prompted
4. Wait for Ollama install, model pull, sandbox creation to complete
5. Test inference:
   openshell sandbox exec --name my-assistant -- openclaw agent --session-id test -m "hello"
6. Test from sandbox:
   openshell sandbox exec --name my-assistant -- getent hosts host.openshell.internal
   openshell sandbox exec --name my-assistant -- curl -v http://host.openshell.internal:11435/api/tags
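
The two sandbox checks in step 6 can be wrapped in a small script so they are easy to re-run after each attempted fix. This is a minimal sketch: the helper names (sandbox_cmd, step6_commands, run_step6) are hypothetical, and the only CLI invocation used is the `openshell sandbox exec --name <name> -- <command>` form shown in the steps above.

```python
import subprocess
from typing import List

SANDBOX = "my-assistant"          # sandbox name used in this report
HOST = "host.openshell.internal"  # hostname used by the inference route
PROXY_PORT = 11435                # Ollama auth proxy port from this report


def sandbox_cmd(*args: str, name: str = SANDBOX) -> List[str]:
    """Build the argv for running a command inside the sandbox."""
    return ["openshell", "sandbox", "exec", "--name", name, "--", *args]


def step6_commands() -> List[List[str]]:
    """The two checks from step 6: DNS lookup, then an HTTP request to the proxy."""
    return [
        sandbox_cmd("getent", "hosts", HOST),
        sandbox_cmd("curl", "-v", f"http://{HOST}:{PROXY_PORT}/api/tags"),
    ]


def run_step6() -> List[int]:
    """Run both checks and return their exit codes.

    Requires the openshell CLI on PATH; subprocess.run raises
    FileNotFoundError if it is missing.
    """
    codes = []
    for cmd in step6_commands():
        result = subprocess.run(cmd, capture_output=True, text=True)
        print("$", " ".join(cmd))
        print(result.stdout or result.stderr)
        codes.append(result.returncode)
    return codes
```

Calling run_step6() on the affected host should print both commands with their output. A non-zero exit code from the getent check corresponds to the unresolved hostname described under Actual Result.
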

Expected Result

1. host.openshell.internal resolves to Docker bridge gateway (172.17.0.1)
2. Sandbox can reach Ollama auth proxy on port 11435
3. Agent inference returns a valid response via local Ollama

Actual Result

1. host.openshell.internal → CANNOT RESOLVE (getent returns nothing)
2. curl to host.openshell.internal:11435 → 403 ssrf_denied (allowed_ips check failed)
3. Agent inference → "FailoverError: LLM request failed: network connection error"

Direct host test (bypassing sandbox) works fine:
  curl http://localhost:11434/api/generate -d '{"model":"qwen3.6:35b","prompt":"hi"}' → valid response

Policy shows local_inference preset with port 11434 only:
  local_inference:
    endpoints:
    - host: host.openshell.internal
      port: 11434          ← should include 11435 (auth proxy port)
      allowed_ips: 10.0.0.0/8, 172.16.0.0/12
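
As a sanity check on the allowed_ips ranges in this policy, the sketch below tests whether the expected Docker bridge gateway address would fall inside them. It assumes allowed_ips is evaluated as plain CIDR membership against the resolved address; how OpenShell actually evaluates it is not confirmed here, and ip_allowed is a hypothetical helper.

```python
import ipaddress
from typing import Iterable

# Ranges copied from the local_inference policy shown above.
ALLOWED_IPS = ["10.0.0.0/8", "172.16.0.0/12"]


def ip_allowed(ip: str, allowed: Iterable[str] = ALLOWED_IPS) -> bool:
    """Return True if ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed)


print(ip_allowed("172.17.0.1"))      # Docker bridge gateway from the Environment section
print(ip_allowed("10.173.104.110"))  # device address from the Environment section
print(ip_allowed("192.168.1.1"))     # an address outside both ranges, for contrast
```

Under that assumption, 172.17.0.1 sits inside 172.16.0.0/12, so the allowed_ips ranges themselves would not need to change once the hostname resolves to the bridge gateway. The port entry is the part of the policy that does not match.
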

Logs

Gateway agent failed; falling back to embedded: GatewayClientRequestError:
  FailoverError: LLM request failed: network connection error.
[agent/embedded] embedded run agent end: runId=test-debug isError=true
  model=qwen3.6:35b provider=inference error=LLM request failed:
  network connection error. rawError=Connection error.
[model-fallback/decision] model fallback decision: decision=candidate_failed
  requested=inference/qwen3.6:35b reason=timeout next=none
  detail=Connection error.
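
For triage across many runs, the key=value fields in these log lines can be pulled out with a short parser. This is a sketch only: parse_fields is a hypothetical helper, and the sample line is the model-fallback entry above joined onto one line, without the trailing detail field.

```python
import re
from typing import Dict

# The model-fallback entry from the log above, joined onto a single line.
LOG_LINE = (
    "[model-fallback/decision] model fallback decision: "
    "decision=candidate_failed requested=inference/qwen3.6:35b "
    "reason=timeout next=none"
)


def parse_fields(line: str) -> Dict[str, str]:
    """Extract key=value pairs whose values contain no whitespace."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


fields = parse_fields(LOG_LINE)
print(fields["decision"], fields["reason"], fields["next"])
```

Values that contain spaces, such as detail=Connection error., would be cut at the first space by this pattern, so it suits only the single-token fields. Note that the entry records reason=timeout even though the raw error is a connection error.
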

Bug Details

Field:         Value
Priority:      Unprioritized
Action:        Dev - Open - To fix
Disposition:   Open issue
Module:        Machine Learning - NemoClaw
Keyword:       DGX_Spark_OTA_Computex, NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Install, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6179603]

Labels

NV QA: Bugs found by the NVIDIA QA Team
Recommended Blocker: Recommended release blocker for maintainer review
UAT: Issues flagged for User Acceptance Testing.
area: local-models: Local model providers, downloads, launch, or connectivity
area: providers: Inference provider integrations and provider behavior
platform: dgx-spark: Affects DGX Spark hardware or workflows
