fix(onboard): preserve --add-host on GPU sandbox recreate (#3562) by cjagwani · Pull Request #3623 · NVIDIA/NemoClaw

cjagwani · 2026-05-15T19:27:01Z

Summary

Fixes host.openshell.internal DNS resolution inside GPU sandboxes. Closes #3562, #3568.

Root cause

PR #3434 added the Docker GPU patch, which recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without --gpu. The recreation gated --add-host injection on networkMode !== "host". But getDockerGpuPatchNetworkMode defaults to "host" (and NEMOCLAW_DOCKER_GPU_PATCH_NETWORK is never set), so every GPU recreate silently dropped the OpenShell-injected host.openshell.internal entry.

Symptom on DGX Spark: LLM request failed: network connection error — inference can't reach the Ollama auth proxy because the hostname doesn't resolve in the recreated container.

The fix

--add-host writes to /etc/hosts in the mount namespace, not the network stack — safe to preserve with --network=host. Removed the conditional. --dns / --dns-search guard kept (those genuinely conflict with host networking).

Also flipped local.test.ts:618 to match the current validator output (pre-existing test drift).

Tests

All 16/16 docker-gpu-patch.test.ts pass.
Existing test 'can switch the recreated sandbox to host networking for OpenShell callbacks' was asserting the bug; now asserts --add-host is preserved.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Docker GPU container networking now defaults to "preserve" unless explicitly set to host, improving predictable network behavior.
- Host-to-container callback mapping is always preserved, ensuring external callback endpoints remain reachable after container recreation and improving endpoint resolution.

The Docker GPU patch (#3434) recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without --gpu. The recreation gated --add-host injection on networkMode !== host, but the patch's default network mode IS host, so the OpenShell-injected host.openshell.internal entry was silently dropped on every GPU sandbox. Result on DGX Spark and other GPU hosts: the recreated sandbox cannot resolve host.openshell.internal, so OpenClaw inference cannot reach the Ollama auth proxy and falls back to embedded which hits the same broken path. Surfaces as 'LLM request failed: network connection error' (#3562, #3568). --add-host writes to the container's /etc/hosts in the mount namespace, not the network stack, so it is safe to preserve even with --network=host. Also flips local.test.ts:618 to match the current validator output ('failed the local probe' instead of the legacy 'did not answer the local probe in time'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

coderabbitai · 2026-05-15T19:27:12Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e55925f1-544a-41ce-96f3-09c36d2bfa95

📥 Commits

Reviewing files that changed from the base of the PR and between 2067acd and 505d6c4.

📒 Files selected for processing (2)

src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

Defaults Docker GPU patch network mode to "preserve" unless explicitly set to "host", and always emits --add-host entries for host.ExtraHosts so host.openshell.internal is added and used (OPENSHELL_ENDPOINT updated) when recreating GPU sandbox containers.

Changes

GPU sandbox network mode and host DNS resolution

Layer / File(s)	Summary
Network mode default to preserve `src/lib/onboard/docker-gpu-patch.ts`, `src/lib/onboard/docker-gpu-patch.test.ts`	`getDockerGpuPatchNetworkMode` returns `"host"` only when `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK` is exactly `"host"`. `"preserve"`, `"bridge"`, and unspecified values now map to `"preserve"`. Test expectation updated for empty config.
Always translate host.ExtraHosts into --add-host `src/lib/onboard/docker-gpu-patch.ts`, `src/lib/onboard/docker-gpu-patch.test.ts`	`buildDockerGpuCloneRunArgs` now always iterates host.ExtraHosts and injects `--add-host` entries. Tests updated to expect `--add-host host.openshell.internal:172.17.0.1`, `--network openshell-docker`, and `OPENSHELL_ENDPOINT=http://host.openshell.internal:8080/` for recreated sandbox runs.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Docker, fix, OpenShell, Sandbox

Suggested reviewers

jyaunches
cv

"I'm a rabbit in the build and test,
hopping through network mode and host DNS best,
preserve the mode, map openshell just right,
now sandboxes can reach the host in flight!
🐇✨"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(onboard): preserve --add-host on GPU sandbox recreate' directly describes the main code change—preserving --add-host arguments when GPU sandboxes are recreated.
Linked Issues check	✅ Passed	Changes implement required fixes from issue `#3562`: preserve --add-host to restore host.openshell.internal DNS resolution in recreated GPU sandboxes and maintain connectivity to the host Ollama proxy.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing DNS resolution for host.openshell.internal in GPU sandbox recreation; no unrelated modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/3562-docker-gpu-patch-preserves-add-host

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-15T19:28:32Z

E2E Advisor Recommendation

Required E2E: gpu-e2e, gpu-double-onboard-e2e
Optional E2E: sandbox-operations-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e

Auto-dispatched E2E: gpu-e2e, gpu-double-onboard-e2e via nightly-e2e.yaml at 674a42299646e287d6d85d0b10df5a72aee665f2 — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high): Directly exercises the real GPU/Ollama user flow: install, onboard, Docker sandbox creation with GPU access, and inference through the sandbox. This is the primary E2E coverage for the modified Docker GPU patch path.
gpu-double-onboard-e2e (high): Re-onboards with the Ollama GPU provider and verifies the resulting sandbox/proxy remains usable. This gives merge-blocking coverage for repeated GPU sandbox recreation after changes to Docker GPU clone networking and add-host behavior.

Optional E2E

sandbox-operations-e2e (medium): Useful adjacent confidence for general sandbox lifecycle and recovery, but it does not specifically exercise the Docker GPU patch path or GPU runner behavior changed here.

New E2E recommendations

gpu-network-override (high): Existing GPU E2E appears to cover the default Docker GPU patch network behavior, but this PR also changes explicit NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host handling by preserving --add-host. Add an E2E variant that runs GPU/Ollama onboarding with NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host and verifies OpenShell callbacks plus inference still work.
- Suggested test: Add a GPU host-network override E2E job or scenario using NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host against the local Ollama OpenClaw flow.

Dispatch hint

Workflow: nightly-e2e.yaml
jobs input: gpu-e2e,gpu-double-onboard-e2e

github-actions · 2026-05-15T19:28:33Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937196542
Target ref: 05d7f39720f1d7e9b81b574927bb519805f76b32
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/local.test.ts`:
- Line 618: The test assertion in local.test.ts is using an outdated regex
(/failed the local probe/) that doesn't match the actual message returned by
validateOllamaModel when the mocked probe returns empty; update the expectation
for expect(result.message) to match the real implementation message (e.g. assert
it matches /did not answer the local probe in time/ or broaden the regex to
accept both variants) so the test aligns with validateOllamaModel's returned
string.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 51ba21b0-061f-4508-8752-32d6aba148e7

📥 Commits

Reviewing files that changed from the base of the PR and between 0964a7e and 05d7f39.

📒 Files selected for processing (3)

src/lib/inference/local.test.ts
src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts

@zifu

Replace the targeted ExtraHosts unconditional with the upstream root-cause fix verified on Spark by @zifu Yang: flip the default of getDockerGpuPatchNetworkMode from 'host' to 'preserve' so the GPU recreate keeps the original openshell-docker bridge network (and its --add-host injection) instead of switching to host networking. NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host still opts into host networking for callers that explicitly want it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

github-actions · 2026-05-15T19:33:58Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937436958
Target ref: 2067acdabd8d598a9b2047f9b3bf22b3493ff96c
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 361-363: The current check returns "host" for networkOverride ===
"host", which causes callers like buildDockerGpuCloneRunArgs() to take the
networkMode === "host" branch and skip HostConfig.ExtraHosts; change the mapping
so an explicit "host" override is treated as "preserve" (i.e., return "preserve"
for networkOverride === "host" as well) so buildDockerGpuCloneRunArgs() and any
logic depending on networkMode will keep ExtraHosts intact; update the function
containing networkOverride (and its return) accordingly and keep references to
buildDockerGpuCloneRunArgs(), HostConfig.ExtraHosts, and networkMode in your
change notes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ac38133-5d9f-43b0-a3ff-bad93a127055

📥 Commits

Reviewing files that changed from the base of the PR and between 05d7f39 and 2067acd.

📒 Files selected for processing (2)

src/lib/onboard/docker-gpu-patch.test.ts
src/lib/onboard/docker-gpu-patch.ts

CodeRabbit follow-up on #3623: when a caller explicitly sets NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host, the recreation still drops --add-host because the buildDockerGpuCloneRunArgs guard was gated on networkMode !== host. --add-host modifies /etc/hosts in the mount namespace, not the network stack, so it is safe to preserve even with --network=host. Drop the guard; keep the --dns guard since --dns genuinely conflicts with --network=host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

github-actions · 2026-05-15T19:42:02Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937800787
Target ref: 505d6c42f33c1ab69696d6c10a26875472b67726
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job	Result
gpu-e2e	⏭️ skipped

github-actions · 2026-05-15T20:00:24Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25938601978
Target ref: 674a42299646e287d6d85d0b10df5a72aee665f2
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job	Result
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

…back (#3579) (#3628) ## Summary `nemoclaw onboard --gpu` on Linux hosts running systemd-resolved (Ubuntu 22+/24, DGX Spark FastOS) creates GPU sandboxes that inherit `nameserver 127.0.0.53` in `/etc/resolv.conf` when running on `--network=host`. #3623 already sidestepped the default-path symptom of #3579 by switching the GPU patch's default network mode away from host; Docker's embedded resolver at 127.0.0.11 then forwards correctly via the daemon (which runs in the host namespace and can reach 127.0.0.53). This PR closes #3579 against the manager-provided spec, which calls for **explicit** detection of the `127.0.0.53` trap and **explicit** `--dns` injection rather than relying on Docker's implicit embedded-DNS forwarding. Belt-and-suspenders defense-in-depth: `detectSandboxFallbackDns()` reads `/etc/resolv.conf`, detects loopback-only state, pulls the real upstream from `/run/systemd/resolve/resolv.conf`, and the GPU-recreate path injects it as `--dns <upstream>`. ## Acceptance criteria mapping (issue #3579 + manager's spec) | Clause | Evidence | Status | |---|---|---| | All DNS queries resolve from sandbox | non-host network (#3623) + explicit `--dns` injection here | MET | | Resolver reachable from sandbox namespace | `--dns <real upstream>` forwarded to container | MET | | Detect `127.0.0.53` / systemd-resolved | `detectSandboxFallbackDns` reads `/etc/resolv.conf`, falls through to `/run/systemd/resolve/resolv.conf` for the real upstream | MET | | Inject reachable resolver explicitly via `--dns` | `buildDockerGpuCloneRunArgs` DNS block | MET | | `host.openshell.internal` resolvable | #3623 `--add-host` preservation (asserted by regression test) | MET | | Regression check naming google.com / gateway.discord.gg / integrate.api.nvidia.com / host.openshell.internal | New test `regression manifest: …` names all four | MET | ## Test plan ``` npm run build:cli npx vitest run src/lib/onboard/docker-gpu-patch.test.ts ``` 24/24 pass (16 existing + 8 new for #3579). ## Notes for reviewers - Detection is deliberately narrow: only fires when **all** `/etc/resolv.conf` nameservers are 127.0.0.x. Single non-loopback resolver → null. Empty file → null. Missing systemd-resolved upstream file → null. - Injection respects OpenShell's existing `--dns` config: if `host.Dns` is non-empty, we don't override. - **`--dns` is skipped entirely on `--network=host`** because Docker ignores `--dns` flags in host networking mode. The opt-in host-networking case (via `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host`) is therefore **not** mitigated by this PR — the detection helper is now exported and available, but auto-injection would be a no-op. If a user explicitly opts into host networking on a systemd-resolved host, they still hit the original #3579 trap. The default path is bulletproof. - The regression test for the 4 hostnames is unit-level (asserts the wiring: `--add-host` for `host.openshell.internal`, non-host network mode for public hosts, `--dns` injection when fallback is set). True E2E that actually runs `getent hosts` for the three public hostnames would need real Docker + outbound network and belongs in `test/e2e/` if QA wants it later. - DGX Spark validation still required to confirm fix in production. Closes #3579  ## Summary by CodeRabbit * **New Features** * Added automatic DNS fallback detection for GPU-enabled Docker sandboxes to handle systemd-resolved configurations. * Improves DNS resolution reliability when containers encounter loopback-only nameserver configurations. * Ensures proper DNS injection while respecting existing container DNS settings.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3628?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)   --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread src/lib/inference/local.test.ts Outdated

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread src/lib/onboard/docker-gpu-patch.ts

cjagwani self-assigned this May 15, 2026

cjagwani added bug Something fails against expected or documented behavior priority: high Recommended Blocker Recommended release blocker for maintainer review NV QA Bugs found by the NVIDIA QA Team and removed Recommended Blocker Recommended release blocker for maintainer review labels May 15, 2026

cjagwani requested a review from cv May 15, 2026 19:38

cv approved these changes May 15, 2026

View reviewed changes

Merge branch 'main' into fix/3562-docker-gpu-patch-preserves-add-host

674a422

cv mentioned this pull request May 15, 2026

[DGX Spark][Sandbox] Sandbox DNS completely broken under Docker-driver gateway (OpenShell 0.0.39) — all domain resolution fails with EAI_AGAIN #3579

Closed

cv merged commit c643154 into main May 15, 2026
24 checks passed

cjagwani mentioned this pull request May 15, 2026

fix(onboard): inject DNS fallback when host has systemd-resolved loopback (#3579) #3628

Merged

miyoungc mentioned this pull request May 16, 2026

docs: update for 0.0.44 #3637

Merged

12 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

fix(onboard): reject host.docker.internal inference URLs #3804

Merged

ericksoa mentioned this pull request May 21, 2026

Brev GPU E2E fails at OpenShell Docker bridge gateway reachability #3959

Open

wangericnv mentioned this pull request May 28, 2026

[WSL2 x86_64][Inference] Onboard with 'Start Ollama on Windows host' fails at [4/8] — auth proxy at host.openshell.internal:11435 not started #4421

Closed

Conversation

cjagwani commented May 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

The fix

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented May 15, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

github-actions Bot commented May 15, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cjagwani commented May 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading