[DGX Spark][Sandbox] Sandbox DNS completely broken under Docker-driver gateway (OpenShell 0.0.39) — all domain resolution fails with EAI_AGAIN #3579

@zNeill

Description


On DGX Spark with NemoClaw v0.0.43 and the OpenShell 0.0.39 Docker-driver gateway, all DNS resolution inside the sandbox fails. No domain resolves: not the local alias (host.openshell.internal), not cloud inference (integrate.api.nvidia.com), not external services (gateway.discord.gg, google.com). This breaks every network-dependent feature: inference, messaging channels, policy presets, and npm/pip installs.

Root cause: The Docker-driver gateway runs sandbox containers with NetworkMode=host, so the sandbox inherits the host's /etc/resolv.conf, which points to nameserver 127.0.0.53 (the systemd-resolved stub listener). Inside the sandbox namespace, however, 127.0.0.53 is the sandbox's own loopback address, and systemd-resolved runs only in the host namespace. Every DNS query therefore goes to an address where no resolver is listening.
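The condition described above can be checked mechanically. The following is a minimal sketch, not part of NemoClaw or OpenShell: the function name `loopback_nameservers` is hypothetical, and it only inspects a resolv.conf-style file for nameserver entries on a loopback address (127.0.0.0/8 or ::1).

```shell
# Hypothetical helper: print every loopback nameserver in a resolv.conf-style file.
# Exit status 0 means at least one loopback nameserver was found, 1 means none.
loopback_nameservers() {
  local file="${1:-/etc/resolv.conf}"
  awk '
    # Match "nameserver <addr>" lines whose address is in 127.0.0.0/8 or is ::1.
    $1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2; found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

One way to use it is to save the sandbox file with `openshell sandbox exec --name test11 -- cat /etc/resolv.conf > sandbox-resolv.conf` and then run `loopback_nameservers sandbox-resolv.conf`. Any address it prints is only usable if a resolver is actually listening on loopback inside the sandbox.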

This is a regression from OpenShell 0.0.36 (k3s gateway), where CoreDNS and the DNS proxy (10.43.0.10 / 10.200.0.1) handled sandbox DNS correctly.

This report supersedes the narrower description in NVBug 6179603, which identified only host.openshell.internal and a port mismatch. The actual scope is all DNS: every domain and every sandbox network operation.

Environment

Device:        DGX Spark (spark-dadc, 10.173.104.110)
OS:            DGX Spark FastOS 1.135.33 (customer build)
Architecture:  aarch64
NemoClaw:      v0.0.43
OpenShell CLI: openshell 0.0.39 (Docker-driver gateway)
Docker:        29.2.1
Sandbox:       test11, provider=nvidia-prod, model=deepseek-v4-pro

Steps to Reproduce

1. DGX Spark with NemoClaw v0.0.43 installed via express setup
2. Sandbox running and healthy per nemoclaw status
3. Test DNS from inside sandbox:
   openshell sandbox exec --name test11 -- getent hosts google.com
   openshell sandbox exec --name test11 -- getent hosts gateway.discord.gg
   openshell sandbox exec --name test11 -- getent hosts integrate.api.nvidia.com
   openshell sandbox exec --name test11 -- getent hosts host.openshell.internal
4. Check sandbox resolv.conf:
   openshell sandbox exec --name test11 -- cat /etc/resolv.conf
5. Check Docker networking mode:
   docker inspect $(docker ps -q | head -1) | grep NetworkMode
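The four lookups in step 3 can be wrapped in a small loop that prints a pass/fail tally. This is a sketch under assumptions: `check_dns` and `RESOLVE_CMD` are hypothetical names, the default command is the `openshell sandbox exec` call from step 3, and a lookup counts as failed when it prints nothing, since it is not assumed here that `openshell sandbox exec` propagates the exit status of `getent`.

```shell
# Resolver command; defaults to the sandbox exec call from step 3.
# Override RESOLVE_CMD to target a different sandbox.
RESOLVE_CMD="${RESOLVE_CMD:-openshell sandbox exec --name test11 -- getent hosts}"

# Hypothetical helper: resolve each hostname given as an argument and tally results.
check_dns() {
  local passed=0 failed=0 host out
  for host in "$@"; do
    # Intentional word splitting: RESOLVE_CMD holds a command plus its arguments.
    out=$($RESOLVE_CMD "$host" 2>/dev/null) || true
    if [ -n "$out" ]; then
      passed=$((passed + 1))
    else
      failed=$((failed + 1))
      echo "FAIL: $host"
    fi
  done
  echo "DNS verification: $passed passed, $failed failed"
  # Succeed only when every hostname resolved.
  [ "$failed" -eq 0 ]
}
```

A typical invocation would be `check_dns google.com gateway.discord.gg integrate.api.nvidia.com host.openshell.internal`. Judging by emptiness of output rather than exit status is a deliberate choice here, because empty output is what the failing lookups are reported to produce.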

Expected Result

All DNS queries resolve successfully from inside the sandbox.
Sandbox DNS should use a resolver that is reachable from within the sandbox
namespace (e.g., CoreDNS on a bridge IP, or the host's real DNS forwarder
on a non-loopback address).
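As an illustration of the "non-loopback address" idea, the sketch below picks the first resolv.conf candidate that lists at least one non-loopback nameserver. It assumes a host running systemd-resolved, which maintains the upstream server list in /run/systemd/resolve/resolv.conf. The function name `pick_resolv_conf` is hypothetical, and this is not a description of how OpenShell selects its resolver. Whether the chosen upstream is reachable from the sandbox namespace still depends on sandbox routing and policy.

```shell
# Hypothetical helper: print the first candidate file that lists at least one
# nameserver outside 127.0.0.0/8 and ::1. Returns 1 if no candidate qualifies.
pick_resolv_conf() {
  local candidate
  for candidate in "$@"; do
    # Skip files that are missing or unreadable.
    [ -r "$candidate" ] || continue
    if awk '
      $1 == "nameserver" && $2 !~ /^127\./ && $2 != "::1" { ok = 1 }
      END { exit (ok ? 0 : 1) }
    ' "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

On the host, `pick_resolv_conf /etc/resolv.conf /run/systemd/resolve/resolv.conf` would skip a stub-only /etc/resolv.conf and fall through to the systemd-resolved upstream list if that file names a non-loopback server.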

Actual Result

ALL DNS queries fail:
  getent hosts google.com              → (empty, no result)
  getent hosts gateway.discord.gg      → (empty, no result)
  getent hosts integrate.api.nvidia.com → (empty, no result)
  getent hosts host.openshell.internal → (empty, no result)

Sandbox resolv.conf:
  nameserver 127.0.0.53    ← host's systemd-resolved, unreachable from sandbox
  search nvidia.com nvprod.nvidia.com

Docker container config:
  NetworkMode: host
  DNS: None
  ExtraHosts: None

Impact on features:
  - Cloud inference (NVIDIA Endpoints): timeout — cannot resolve API hostname
  - Local inference (Ollama): connection error — cannot resolve host.openshell.internal
  - Discord channel: EAI_AGAIN — cannot resolve gateway.discord.gg
  - npm/pip/brew installs: would fail — cannot resolve registries
  - Any outbound HTTPS from sandbox: fails

Contrast with working config (OpenShell 0.0.36 / k3s gateway):
  nameserver 10.200.0.1    ← DNS proxy, reachable from sandbox
  DNS verification: 4 passed, 0 failed

Logs

Discord bridge log (symptom):
  [discord] gateway error: Error: getaddrinfo EAI_AGAIN gateway.discord.gg
  [discord] gateway was not ready after 15000ms; restarting gateway
  [discord] [default] auto-restart attempt 9/10 in 300s

Inference (symptom):
  FailoverError: LLM request failed: network connection error.
  [agent/embedded] error=LLM request failed: network connection error.
  rawError=Connection error.

OpenClaw agent timeout:
  The model did not produce a response before the LLM idle timeout.

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Inference, NemoClaw_Policy&Network, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6180113]

Metadata

Labels

  - NV QA (Bugs found by the NVIDIA QA Team)
  - UAT (Issues flagged for User Acceptance Testing)
  - area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  - platform: dgx-spark (Affects DGX Spark hardware or workflows)
