fix(onboard): preserve --add-host on GPU sandbox recreate (#3562)#3623
Conversation
The Docker GPU patch (#3434) recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without --gpu. The recreation gated --add-host injection on networkMode !== host, but the patch's default network mode IS host, so the OpenShell-injected host.openshell.internal entry was silently dropped on every GPU sandbox. Result on DGX Spark and other GPU hosts: the recreated sandbox cannot resolve host.openshell.internal, so OpenClaw inference cannot reach the Ollama auth proxy and falls back to embedded which hits the same broken path. Surfaces as 'LLM request failed: network connection error' (#3562, #3568). --add-host writes to the container's /etc/hosts in the mount namespace, not the network stack, so it is safe to preserve even with --network=host. Also flips local.test.ts:618 to match the current validator output ('failed the local probe' instead of the legacy 'did not answer the local probe in time'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughDefaults Docker GPU patch network mode to "preserve" unless explicitly set to "host", and always emits ChangesGPU sandbox network mode and host DNS resolution
🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/local.test.ts`:
- Line 618: The test assertion in local.test.ts is using an outdated regex
(/failed the local probe/) that doesn't match the actual message returned by
validateOllamaModel when the mocked probe returns empty; update the expectation
for expect(result.message) to match the real implementation message (e.g. assert
it matches /did not answer the local probe in time/ or broaden the regex to
accept both variants) so the test aligns with validateOllamaModel's returned
string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 51ba21b0-061f-4508-8752-32d6aba148e7
📒 Files selected for processing (3)
src/lib/inference/local.test.tssrc/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.ts
Replace the targeted ExtraHosts unconditional with the upstream root-cause fix verified on Spark by @zifu Yang: flip the default of getDockerGpuPatchNetworkMode from 'host' to 'preserve' so the GPU recreate keeps the original openshell-docker bridge network (and its --add-host injection) instead of switching to host networking. NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host still opts into host networking for callers that explicitly want it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 361-363: The current check returns "host" for networkOverride ===
"host", which causes callers like buildDockerGpuCloneRunArgs() to take the
networkMode === "host" branch and skip HostConfig.ExtraHosts; change the mapping
so an explicit "host" override is treated as "preserve" (i.e., return "preserve"
for networkOverride === "host" as well) so buildDockerGpuCloneRunArgs() and any
logic depending on networkMode will keep ExtraHosts intact; update the function
containing networkOverride (and its return) accordingly and keep references to
buildDockerGpuCloneRunArgs(), HostConfig.ExtraHosts, and networkMode in your
change notes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7ac38133-5d9f-43b0-a3ff-bad93a127055
📒 Files selected for processing (2)
src/lib/onboard/docker-gpu-patch.test.tssrc/lib/onboard/docker-gpu-patch.ts
CodeRabbit follow-up on #3623: when a caller explicitly sets NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host, the recreation still drops --add-host because the buildDockerGpuCloneRunArgs guard was gated on networkMode !== host. --add-host modifies /etc/hosts in the mount namespace, not the network stack, so it is safe to preserve even with --network=host. Drop the guard; keep the --dns guard since --dns genuinely conflicts with --network=host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
…back (#3579) (#3628) ## Summary `nemoclaw onboard --gpu` on Linux hosts running systemd-resolved (Ubuntu 22+/24, DGX Spark FastOS) creates GPU sandboxes that inherit `nameserver 127.0.0.53` in `/etc/resolv.conf` when running on `--network=host`. #3623 already sidestepped the default-path symptom of #3579 by switching the GPU patch's default network mode away from host; Docker's embedded resolver at 127.0.0.11 then forwards correctly via the daemon (which runs in the host namespace and can reach 127.0.0.53). This PR closes #3579 against the manager-provided spec, which calls for **explicit** detection of the `127.0.0.53` trap and **explicit** `--dns` injection rather than relying on Docker's implicit embedded-DNS forwarding. Belt-and-suspenders defense-in-depth: `detectSandboxFallbackDns()` reads `/etc/resolv.conf`, detects loopback-only state, pulls the real upstream from `/run/systemd/resolve/resolv.conf`, and the GPU-recreate path injects it as `--dns <upstream>`. ## Acceptance criteria mapping (issue #3579 + manager's spec) | Clause | Evidence | Status | |---|---|---| | All DNS queries resolve from sandbox | non-host network (#3623) + explicit `--dns` injection here | MET | | Resolver reachable from sandbox namespace | `--dns <real upstream>` forwarded to container | MET | | Detect `127.0.0.53` / systemd-resolved | `detectSandboxFallbackDns` reads `/etc/resolv.conf`, falls through to `/run/systemd/resolve/resolv.conf` for the real upstream | MET | | Inject reachable resolver explicitly via `--dns` | `buildDockerGpuCloneRunArgs` DNS block | MET | | `host.openshell.internal` resolvable | #3623 `--add-host` preservation (asserted by regression test) | MET | | Regression check naming google.com / gateway.discord.gg / integrate.api.nvidia.com / host.openshell.internal | New test `regression manifest: …` names all four | MET | ## Test plan ``` npm run build:cli npx vitest run src/lib/onboard/docker-gpu-patch.test.ts ``` 24/24 pass (16 existing + 8 new for #3579). ## Notes for reviewers - Detection is deliberately narrow: only fires when **all** `/etc/resolv.conf` nameservers are 127.0.0.x. Single non-loopback resolver → null. Empty file → null. Missing systemd-resolved upstream file → null. - Injection respects OpenShell's existing `--dns` config: if `host.Dns` is non-empty, we don't override. - **`--dns` is skipped entirely on `--network=host`** because Docker ignores `--dns` flags in host networking mode. The opt-in host-networking case (via `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host`) is therefore **not** mitigated by this PR — the detection helper is now exported and available, but auto-injection would be a no-op. If a user explicitly opts into host networking on a systemd-resolved host, they still hit the original #3579 trap. The default path is bulletproof. - The regression test for the 4 hostnames is unit-level (asserts the wiring: `--add-host` for `host.openshell.internal`, non-host network mode for public hosts, `--dns` injection when fallback is set). True E2E that actually runs `getent hosts` for the three public hostnames would need real Docker + outbound network and belongs in `test/e2e/` if QA wants it later. - DGX Spark validation still required to confirm fix in production. Closes #3579 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added automatic DNS fallback detection for GPU-enabled Docker sandboxes to handle systemd-resolved configurations. * Improves DNS resolution reliability when containers encounter loopback-only nameserver configurations. * Ensures proper DNS injection while respecting existing container DNS settings. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3628?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Summary
Fixes
host.openshell.internalDNS resolution inside GPU sandboxes. Closes #3562, #3568.Root cause
PR #3434 added the Docker GPU patch, which recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without
--gpu. The recreation gated--add-hostinjection onnetworkMode !== "host". ButgetDockerGpuPatchNetworkModedefaults to"host"(andNEMOCLAW_DOCKER_GPU_PATCH_NETWORKis never set), so every GPU recreate silently dropped the OpenShell-injectedhost.openshell.internalentry.Symptom on DGX Spark:
LLM request failed: network connection error— inference can't reach the Ollama auth proxy because the hostname doesn't resolve in the recreated container.The fix
--add-hostwrites to/etc/hostsin the mount namespace, not the network stack — safe to preserve with--network=host. Removed the conditional.--dns/--dns-searchguard kept (those genuinely conflict with host networking).Also flipped
local.test.ts:618to match the current validator output (pre-existing test drift).Tests
docker-gpu-patch.test.tspass.--add-hostis preserved.🤖 Generated with Claude Code
Summary by CodeRabbit