fix(onboard): inject DNS fallback when host has systemd-resolved loopback (#3579) #3628

Merged
cv merged 2 commits into main from fix/3579-sandbox-dns-fallback
May 15, 2026

Conversation

@cjagwani

@cjagwani cjagwani commented May 15, 2026

Contributor

Summary

nemoclaw onboard --gpu on Linux hosts running systemd-resolved (Ubuntu 22+/24, DGX Spark FastOS) creates GPU sandboxes that inherit nameserver 127.0.0.53 in /etc/resolv.conf when running on --network=host. #3623 already sidestepped the default-path symptom of #3579 by switching the GPU patch's default network mode away from host; Docker's embedded resolver at 127.0.0.11 then forwards correctly via the daemon (which runs in the host namespace and can reach 127.0.0.53).

This PR closes #3579 against the manager-provided spec, which calls for explicit detection of the 127.0.0.53 trap and explicit --dns injection rather than relying on Docker's implicit embedded-DNS forwarding. Belt-and-suspenders defense-in-depth: detectSandboxFallbackDns() reads /etc/resolv.conf, detects loopback-only state, pulls the real upstream from /run/systemd/resolve/resolv.conf, and the GPU-recreate path injects it as --dns <upstream>.
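The detection flow described above can be sketched roughly as follows. This is a minimal illustration, not the merged implementation: `parseNameservers`, `pickFallbackDns`, `detectFallbackDnsSketch`, and the injected `readFile` parameter are hypothetical names, and treating any `127.`-prefixed address as loopback is an assumption.

```typescript
// Collect the addresses from `nameserver` lines; comments and options are ignored.
function parseNameservers(contents: string): string[] {
  const servers: string[] = [];
  for (const raw of contents.split("\n")) {
    const match = /^nameserver\s+(\S+)/.exec(raw.trim());
    if (match && match[1]) servers.push(match[1]);
  }
  return servers;
}

// Assumption: an address counts as loopback when it starts with "127.".
function isLoopback(addr: string): boolean {
  return addr.startsWith("127.");
}

// Hypothetical pure core: returns an upstream only when every host
// nameserver is loopback and the upstream file lists a non-loopback one.
function pickFallbackDns(
  hostResolv: string | null,
  upstreamResolv: string | null,
): string | null {
  if (hostResolv === null || upstreamResolv === null) return null;
  const hostServers = parseNameservers(hostResolv);
  if (hostServers.length === 0) return null; // empty file
  if (!hostServers.every(isLoopback)) return null; // a real resolver exists
  return parseNameservers(upstreamResolv).find((a) => !isLoopback(a)) ?? null;
}

// `readFile` is injected so the sketch needs no filesystem access;
// it should return null when the file is missing or unreadable.
function detectFallbackDnsSketch(
  readFile: (path: string) => string | null,
): string | null {
  return pickFallbackDns(
    readFile("/etc/resolv.conf"),
    readFile("/run/systemd/resolve/resolv.conf"),
  );
}
```

A real caller would presumably back `readFile` with a `try`/`catch` around `fs.readFileSync`; keeping the parsing pure makes the null cases easy to unit-test.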

Acceptance criteria mapping (issue #3579 + manager's spec)

| Clause | Evidence | Status |
| --- | --- | --- |
| All DNS queries resolve from sandbox | non-host network (#3623) + explicit `--dns` injection here | MET |
| Resolver reachable from sandbox namespace | `--dns <real upstream>` forwarded to container | MET |
| Detect 127.0.0.53 / systemd-resolved | `detectSandboxFallbackDns` reads `/etc/resolv.conf`, falls through to `/run/systemd/resolve/resolv.conf` for the real upstream | MET |
| Inject reachable resolver explicitly via `--dns` | `buildDockerGpuCloneRunArgs` DNS block | MET |
| host.openshell.internal resolvable | #3623 `--add-host` preservation (asserted by regression test) | MET |
| Regression check naming google.com / gateway.discord.gg / integrate.api.nvidia.com / host.openshell.internal | New test regression manifest: … names all four | MET |

Test plan

npm run build:cli
npx vitest run src/lib/onboard/docker-gpu-patch.test.ts

24/24 pass (16 existing + 8 new for #3579).

Notes for reviewers

  • Detection is deliberately narrow: only fires when all /etc/resolv.conf nameservers are 127.0.0.x. Single non-loopback resolver → null. Empty file → null. Missing systemd-resolved upstream file → null.
  • Injection respects OpenShell's existing --dns config: if host.Dns is non-empty, we don't override.
  • --dns is skipped entirely on --network=host because Docker ignores --dns flags in host networking mode. The opt-in host-networking case (via NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host) is therefore not mitigated by this PR: the detection helper is now exported and available, but auto-injection would be a no-op. A user who explicitly opts into host networking on a systemd-resolved host still hits the original #3579 trap. Only the default (non-host) path is covered by this fix.
  • The regression test for the 4 hostnames is unit-level (asserts the wiring: --add-host for host.openshell.internal, non-host network mode for public hosts, --dns injection when fallback is set). True E2E that actually runs getent hosts for the three public hostnames would need real Docker + outbound network and belongs in test/e2e/ if QA wants it later.
  • DGX Spark validation still required to confirm fix in production.
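The injection rules in the notes above could look roughly like this. It is a sketch under stated assumptions: `DnsArgInputs` and `buildDnsArgsSketch` are hypothetical names, `existingDns` stands in for the inspected container's `host.Dns`, and forwarding existing entries regardless of network mode is an assumption about behavior the PR does not spell out.

```typescript
// Hypothetical input shape; the real options type is not shown in this PR page.
interface DnsArgInputs {
  networkMode: string; // e.g. "bridge" or "host"
  existingDns: string[]; // DNS servers already set on the inspected container
  sandboxFallbackDns: string | null; // result of the detection step
}

function buildDnsArgsSketch(inputs: DnsArgInputs): string[] {
  const args: string[] = [];

  // Existing configuration wins: forward it unchanged and never override it.
  for (const server of inputs.existingDns) {
    args.push("--dns", server);
  }
  if (inputs.existingDns.length > 0) return args;

  // On host networking the fallback would have no effect, so skip it.
  if (inputs.networkMode === "host") return args;

  // Non-host network, nothing configured: inject the detected upstream.
  if (inputs.sandboxFallbackDns) {
    args.push("--dns", inputs.sandboxFallbackDns);
  }
  return args;
}
```

Ordering the checks this way keeps the fallback strictly additive: it only ever appears when nothing else has claimed `--dns`.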

Closes #3579

Summary by CodeRabbit

  • New Features
    • Added automatic DNS fallback detection for GPU-enabled Docker sandboxes to handle systemd-resolved configurations.
    • Improves DNS resolution reliability when containers encounter loopback-only nameserver configurations.
    • Ensures proper DNS injection while respecting existing container DNS settings.

Review Change Stack

…back (#3579)

Adds `detectSandboxFallbackDns()` to read `/etc/resolv.conf`, detect when
all nameservers are loopback (e.g. 127.0.0.53 from systemd-resolved), and
return the real upstream from `/run/systemd/resolve/resolv.conf`. The
recreate path now injects this as `--dns <upstream>` so sandboxes on a
non-host network don't inherit an unreachable resolver.

Closes #3579. Pairs with #3623 (which switched the default network mode
to non-host so Docker's embedded resolver works for most users).

Acceptance clauses from the issue (manager spec) all met:
- Detect /etc/resolv.conf → 127.0.0.53
- Inject reachable resolver via --dns
- Preserve host.openshell.internal (#3623 already did this)
- Regression test naming all 4 hostnames from the spec

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@cjagwani cjagwani added bug (Something fails against expected or documented behavior), Platform: DGX Spark, NV QA (Bugs found by the NVIDIA QA Team), UAT (Issues flagged for User Acceptance Testing.) labels May 15, 2026
@coderabbitai

coderabbitai Bot commented May 15, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 07fd385e-456b-4dee-9871-da5ebcb3280f

📥 Commits

Reviewing files that changed from the base of the PR and between 8803093 and 1060db9.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts

📝 Walkthrough

Walkthrough

Detects when the host's /etc/resolv.conf contains only loopback nameservers, optionally reads systemd's upstream resolv.conf to pick a non-loopback DNS, exposes this via a new dependency and option, injects --dns <fallback> into docker run args for non-host networks when needed, and adds tests covering detection and injection behavior.

Changes

Docker GPU Sandbox DNS Fallback

| Layer / File(s) | Summary |
| --- | --- |
| Type additions and deps wiring (src/lib/onboard/docker-gpu-patch.ts) | Adds sandboxFallbackDns to DockerGpuCloneRunOptions and an optional `detectSandboxFallbackDns?: () => string |
| DNS fallback detection implementation (src/lib/onboard/docker-gpu-patch.ts) | New exported detectSandboxFallbackDns() parses /etc/resolv.conf, treats all-127.* nameservers as loopback-only, and reads /run/systemd/resolve/resolv.conf to return the first non-loopback nameserver or null. |
| Docker clone args DNS injection (src/lib/onboard/docker-gpu-patch.ts) | buildDockerGpuCloneRunArgs continues forwarding host.Dns/host.DnsSearch, and injects --dns <sandboxFallbackDns> for non-host network mode only when the inspected host has no host.Dns and a fallback is provided. |
| GPU sandbox recreation integration (src/lib/onboard/docker-gpu-patch.ts) | During GPU sandbox recreation, calls d.detectSandboxFallbackDns() and assigns the result to cloneOptions.sandboxFallbackDns for downstream arg injection. |
| DNS fallback detection and injection tests (src/lib/onboard/docker-gpu-patch.test.ts) | Imports detectSandboxFallbackDns and adds tests validating upstream selection from systemd-resolved, null cases when resolv.conf is missing or non-loopback, conditional --dns injection, skipping injection when OpenShell already supplies DNS or when networkMode is host, and a regression combining host.openshell.internal with public DNS hostnames. |

Sequence Diagram

sequenceDiagram
  participant Recreate as recreateOpenShellDockerSandboxWithGpu
  participant Detect as detectSandboxFallbackDns
  participant Builder as buildDockerGpuCloneRunArgs
  participant Docker as docker run
  Recreate->>Detect: d.detectSandboxFallbackDns()
  Detect-->>Recreate: returns fallback or null
  Recreate->>Builder: pass cloneOptions (sandboxFallbackDns set if present)
  Builder->>Builder: forward host.Dns/host.DnsSearch
  Builder->>Builder: if host.Dns empty && sandboxFallbackDns -> add --dns <fallback>
  Builder->>Docker: run args (may include --dns)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/NemoClaw#3623: Also modifies docker argument construction in GPU sandbox recreation; related to preserving --add-host while this PR adds --dns fallback injection.

Suggested labels

Docker, fix, Platform: Ubuntu, Local Models, NemoClaw CLI

Suggested reviewers

  • cv
  • ericksoa

Poem

🐰 In loopback's hush, resolv.conf hides,
I hop to systemd where upstream resides.
I fetch a nameserver, steady and bright,
Inject it in Docker — DNS works tonight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding DNS fallback injection for systemd-resolved loopback scenarios in GPU sandbox creation, which directly addresses the PR's core objective (issue #3579). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@github-actions

Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25942252020
Target ref: 8803093d7564c55472020a5441265f61a79ebe5e
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@github-actions

github-actions Bot commented May 15, 2026

Contributor

E2E Advisor Recommendation

Required E2E: gpu-e2e
Optional E2E: gpu-double-onboard-e2e, gpu-repo-local-ollama-openclaw

Dispatch hint: gpu-e2e

Auto-dispatched E2E: gpu-e2e via nightly-e2e.yaml at 1060db9762f12cc038f54c3c70891282eab655d7 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Covers the real Linux GPU onboarding flow that triggers the Docker GPU patch: install with NEMOCLAW_PROVIDER=ollama, create/recreate the sandbox with GPU access, verify host.openshell.internal/container reachability, and prove local inference from inside the sandbox.

Optional E2E

  • gpu-double-onboard-e2e (high): Useful adjacent confidence because re-onboard recreates/repairs the GPU Ollama sandbox and then verifies sandbox inference still works. It is less directly targeted than gpu-e2e because the PR changes DNS fallback rather than proxy token consistency.
  • gpu-repo-local-ollama-openclaw (high): Scenario-runner equivalent for the repo-current GPU local Ollama OpenClaw setup, covering local-ollama-inference and ollama-proxy suites if maintainers prefer scenario-based validation.

New E2E recommendations

Dispatch hint

  • Workflow: E2E / Nightly
  • jobs input: gpu-e2e

@cjagwani cjagwani self-assigned this May 15, 2026
…-through (#3579)

Hooks `detectSandboxFallbackDns` into `DockerGpuPatchDeps` so the
production callsite in `recreateOpenShellDockerSandboxWithGpu` can be
stubbed in unit tests. Adds two integration tests:

- Stubs the hook to return "9.9.9.9" and asserts `--dns 9.9.9.9` lands
  in the final dockerRunDetached call args (wire-through verified).
- Stubs the hook to return null and asserts no --dns is injected.

Addresses the audit gap on PR #3628 where `detectSandboxFallbackDns()`
was unit-tested in isolation and the injection path in
`buildDockerGpuCloneRunArgs` was unit-tested in isolation, but the
wire-through from production callsite into clone args was not.

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
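
The wire-through check described in this commit relies on dependency injection: the detection hook is a field on the deps object, so a test can stub it and inspect what reaches the Docker call. A minimal sketch of that pattern follows. `PatchDepsSketch` and `recreateSketch` are hypothetical stand-ins for `DockerGpuPatchDeps` and `recreateOpenShellDockerSandboxWithGpu`, and the `string | null` return type and the `dockerRunDetached` signature are assumptions.

```typescript
// Hypothetical, trimmed-down deps shape for illustration only.
interface PatchDepsSketch {
  detectSandboxFallbackDns?: () => string | null;
  dockerRunDetached: (args: string[]) => void;
}

// Stand-in for the production callsite: ask the hook, then pass the
// result through to the final docker run arguments.
function recreateSketch(deps: PatchDepsSketch, baseArgs: string[]): void {
  const fallback = deps.detectSandboxFallbackDns?.() ?? null;
  const args = fallback ? [...baseArgs, "--dns", fallback] : [...baseArgs];
  deps.dockerRunDetached(args);
}

// Stub both dependencies and record what the "docker" call receives.
const recordedCalls: string[][] = [];
const recordCall = (args: string[]): void => {
  recordedCalls.push(args);
};

recreateSketch(
  { detectSandboxFallbackDns: () => "9.9.9.9", dockerRunDetached: recordCall },
  ["--network", "bridge"],
);
recreateSketch(
  { detectSandboxFallbackDns: () => null, dockerRunDetached: recordCall },
  ["--network", "bridge"],
);
```

After these two calls, `recordedCalls[0]` ends with `--dns 9.9.9.9` and `recordedCalls[1]` contains no `--dns` entry, which mirrors the two assertions the commit message lists.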
@github-actions

Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25942766352
Target ref: 1060db9762f12cc038f54c3c70891282eab655d7
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@cjagwani cjagwani requested a review from cv May 15, 2026 21:41
@cv cv merged commit 84b2663 into main May 15, 2026
27 checks passed
@miyoungc miyoungc mentioned this pull request May 16, 2026
12 tasks
@wscurran wscurran added area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery), bug-fix (PR fixes a bug or regression), platform: dgx-spark (Affects DGX Spark hardware or workflows) and removed priority: high, bug (Something fails against expected or documented behavior) labels Jun 3, 2026

Labels

  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • bug-fix (PR fixes a bug or regression)
  • NV QA (Bugs found by the NVIDIA QA Team)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
  • UAT (Issues flagged for User Acceptance Testing.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Sandbox] Sandbox DNS completely broken under Docker-driver gateway (OpenShell 0.0.39) — all domain resolution fails with EAI_AGAIN

3 participants