Skip to content

fix(onboard): preserve --add-host on GPU sandbox recreate (#3562)#3623

Merged
cv merged 4 commits into
mainfrom
fix/3562-docker-gpu-patch-preserves-add-host
May 15, 2026
Merged

fix(onboard): preserve --add-host on GPU sandbox recreate (#3562)#3623
cv merged 4 commits into
mainfrom
fix/3562-docker-gpu-patch-preserves-add-host

Conversation

@cjagwani

@cjagwani cjagwani commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes host.openshell.internal DNS resolution inside GPU sandboxes. Closes #3562, #3568.

Root cause

PR #3434 added the Docker GPU patch, which recreates the OpenShell-managed sandbox container with GPU flags after OpenShell creates it without --gpu. The recreation gated --add-host injection on networkMode !== "host". But getDockerGpuPatchNetworkMode defaults to "host" (and NEMOCLAW_DOCKER_GPU_PATCH_NETWORK is never set), so every GPU recreate silently dropped the OpenShell-injected host.openshell.internal entry.

Symptom on DGX Spark: LLM request failed: network connection error — inference can't reach the Ollama auth proxy because the hostname doesn't resolve in the recreated container.

The fix

--add-host writes to /etc/hosts in the mount namespace, not the network stack — safe to preserve with --network=host. Removed the conditional. --dns / --dns-search guard kept (those genuinely conflict with host networking).

Also flipped local.test.ts:618 to match the current validator output (pre-existing test drift).

Tests

  • All 16/16 docker-gpu-patch.test.ts pass.
  • Existing test 'can switch the recreated sandbox to host networking for OpenShell callbacks' was asserting the bug; now asserts --add-host is preserved.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Docker GPU container networking now defaults to "preserve" unless explicitly set to host, improving predictable network behavior.
    • Host-to-container callback mapping is always preserved, ensuring external callback endpoints remain reachable after container recreation and improving endpoint resolution.

Review Change Stack

The Docker GPU patch (#3434) recreates the OpenShell-managed sandbox
container with GPU flags after OpenShell creates it without --gpu. The
recreation gated --add-host injection on networkMode !== host, but the
patch's default network mode IS host, so the OpenShell-injected
host.openshell.internal entry was silently dropped on every GPU
sandbox.

Result on DGX Spark and other GPU hosts: the recreated sandbox cannot
resolve host.openshell.internal, so OpenClaw inference cannot reach the
Ollama auth proxy and falls back to embedded which hits the same broken
path. Surfaces as 'LLM request failed: network connection error'
(#3562, #3568).

--add-host writes to the container's /etc/hosts in the mount namespace,
not the network stack, so it is safe to preserve even with
--network=host.

Also flips local.test.ts:618 to match the current validator output
('failed the local probe' instead of the legacy 'did not answer the
local probe in time').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e55925f1-544a-41ce-96f3-09c36d2bfa95

📥 Commits

Reviewing files that changed from the base of the PR and between 2067acd and 505d6c4.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-gpu-patch.test.ts

📝 Walkthrough

Walkthrough

Defaults Docker GPU patch network mode to "preserve" unless explicitly set to "host", and always emits --add-host entries for host.ExtraHosts so host.openshell.internal is added and used (OPENSHELL_ENDPOINT updated) when recreating GPU sandbox containers.

Changes

GPU sandbox network mode and host DNS resolution

Layer / File(s) Summary
Network mode default to preserve
src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-patch.test.ts
getDockerGpuPatchNetworkMode returns "host" only when NEMOCLAW_DOCKER_GPU_PATCH_NETWORK is exactly "host". "preserve", "bridge", and unspecified values now map to "preserve". Test expectation updated for empty config.
Always translate host.ExtraHosts into --add-host
src/lib/onboard/docker-gpu-patch.ts, src/lib/onboard/docker-gpu-patch.test.ts
buildDockerGpuCloneRunArgs now always iterates host.ExtraHosts and injects --add-host entries. Tests updated to expect --add-host host.openshell.internal:172.17.0.1, --network openshell-docker, and OPENSHELL_ENDPOINT=http://host.openshell.internal:8080/ for recreated sandbox runs.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Docker, fix, OpenShell, Sandbox

Suggested reviewers

  • jyaunches
  • cv

"I'm a rabbit in the build and test,
hopping through network mode and host DNS best,
preserve the mode, map openshell just right,
now sandboxes can reach the host in flight!
🐇✨"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(onboard): preserve --add-host on GPU sandbox recreate' directly describes the main code change—preserving --add-host arguments when GPU sandboxes are recreated.
Linked Issues check ✅ Passed Changes implement required fixes from issue #3562: preserve --add-host to restore host.openshell.internal DNS resolution in recreated GPU sandboxes and maintain connectivity to the host Ollama proxy.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing DNS resolution for host.openshell.internal in GPU sandbox recreation; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3562-docker-gpu-patch-preserves-add-host

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: gpu-e2e, gpu-double-onboard-e2e
Optional E2E: sandbox-operations-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e

Auto-dispatched E2E: gpu-e2e, gpu-double-onboard-e2e via nightly-e2e.yaml at 674a42299646e287d6d85d0b10df5a72aee665f2nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Directly exercises the real GPU/Ollama user flow: install, onboard, Docker sandbox creation with GPU access, and inference through the sandbox. This is the primary E2E coverage for the modified Docker GPU patch path.
  • gpu-double-onboard-e2e (high): Re-onboards with the Ollama GPU provider and verifies the resulting sandbox/proxy remains usable. This gives merge-blocking coverage for repeated GPU sandbox recreation after changes to Docker GPU clone networking and add-host behavior.

Optional E2E

  • sandbox-operations-e2e (medium): Useful adjacent confidence for general sandbox lifecycle and recovery, but it does not specifically exercise the Docker GPU patch path or GPU runner behavior changed here.

New E2E recommendations

  • gpu-network-override (high): Existing GPU E2E appears to cover the default Docker GPU patch network behavior, but this PR also changes explicit NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host handling by preserving --add-host. Add an E2E variant that runs GPU/Ollama onboarding with NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host and verifies OpenShell callbacks plus inference still work.
    • Suggested test: Add a GPU host-network override E2E job or scenario using NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host against the local Ollama OpenClaw flow.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e,gpu-double-onboard-e2e

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937196542
Target ref: 05d7f39720f1d7e9b81b574927bb519805f76b32
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/local.test.ts`:
- Line 618: The test assertion in local.test.ts is using an outdated regex
(/failed the local probe/) that doesn't match the actual message returned by
validateOllamaModel when the mocked probe returns empty; update the expectation
for expect(result.message) to match the real implementation message (e.g. assert
it matches /did not answer the local probe in time/ or broaden the regex to
accept both variants) so the test aligns with validateOllamaModel's returned
string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 51ba21b0-061f-4508-8752-32d6aba148e7

📥 Commits

Reviewing files that changed from the base of the PR and between 0964a7e and 05d7f39.

📒 Files selected for processing (3)
  • src/lib/inference/local.test.ts
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts

Comment thread src/lib/inference/local.test.ts Outdated
Replace the targeted ExtraHosts unconditional with the upstream
root-cause fix verified on Spark by @zifu Yang: flip the default of
getDockerGpuPatchNetworkMode from 'host' to 'preserve' so the GPU
recreate keeps the original openshell-docker bridge network (and its
--add-host injection) instead of switching to host networking.

NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host still opts into host
networking for callers that explicitly want it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937436958
Target ref: 2067acdabd8d598a9b2047f9b3bf22b3493ff96c
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/docker-gpu-patch.ts`:
- Around line 361-363: The current check returns "host" for networkOverride ===
"host", which causes callers like buildDockerGpuCloneRunArgs() to take the
networkMode === "host" branch and skip HostConfig.ExtraHosts; change the mapping
so an explicit "host" override is treated as "preserve" (i.e., return "preserve"
for networkOverride === "host" as well) so buildDockerGpuCloneRunArgs() and any
logic depending on networkMode will keep ExtraHosts intact; update the function
containing networkOverride (and its return) accordingly and keep references to
buildDockerGpuCloneRunArgs(), HostConfig.ExtraHosts, and networkMode in your
change notes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ac38133-5d9f-43b0-a3ff-bad93a127055

📥 Commits

Reviewing files that changed from the base of the PR and between 05d7f39 and 2067acd.

📒 Files selected for processing (2)
  • src/lib/onboard/docker-gpu-patch.test.ts
  • src/lib/onboard/docker-gpu-patch.ts

Comment thread src/lib/onboard/docker-gpu-patch.ts
@cjagwani cjagwani self-assigned this May 15, 2026
@cjagwani cjagwani added bug Something fails against expected or documented behavior priority: high Recommended Blocker Recommended release blocker for maintainer review NV QA Bugs found by the NVIDIA QA Team and removed Recommended Blocker Recommended release blocker for maintainer review labels May 15, 2026
@cjagwani cjagwani requested a review from cv May 15, 2026 19:38
CodeRabbit follow-up on #3623: when a caller explicitly sets
NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host, the recreation still drops
--add-host because the buildDockerGpuCloneRunArgs guard was gated on
networkMode !== host. --add-host modifies /etc/hosts in the mount
namespace, not the network stack, so it is safe to preserve even
with --network=host. Drop the guard; keep the --dns guard since
--dns genuinely conflicts with --network=host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25937800787
Target ref: 505d6c42f33c1ab69696d6c10a26875472b67726
Workflow ref: main
Requested jobs: gpu-e2e
Summary: 0 passed, 0 failed, 1 skipped

Job Result
gpu-e2e ⏭️ skipped

@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ⚠️ No requested jobs ran

Run: 25938601978
Target ref: 674a42299646e287d6d85d0b10df5a72aee665f2
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job Result
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped

@cv cv merged commit c643154 into main May 15, 2026
24 checks passed
cv pushed a commit that referenced this pull request May 15, 2026
…back (#3579) (#3628)

## Summary

`nemoclaw onboard --gpu` on Linux hosts running systemd-resolved (Ubuntu
22+/24, DGX Spark FastOS) creates GPU sandboxes that inherit `nameserver
127.0.0.53` in `/etc/resolv.conf` when running on `--network=host`.
#3623 already sidestepped the default-path symptom of #3579 by switching
the GPU patch's default network mode away from host; Docker's embedded
resolver at 127.0.0.11 then forwards correctly via the daemon (which
runs in the host namespace and can reach 127.0.0.53).

This PR closes #3579 against the manager-provided spec, which calls for
**explicit** detection of the `127.0.0.53` trap and **explicit** `--dns`
injection rather than relying on Docker's implicit embedded-DNS
forwarding. Belt-and-suspenders defense-in-depth:
`detectSandboxFallbackDns()` reads `/etc/resolv.conf`, detects
loopback-only state, pulls the real upstream from
`/run/systemd/resolve/resolv.conf`, and the GPU-recreate path injects it
as `--dns <upstream>`.

## Acceptance criteria mapping (issue #3579 + manager's spec)

| Clause | Evidence | Status |
|---|---|---|
| All DNS queries resolve from sandbox | non-host network (#3623) +
explicit `--dns` injection here | MET |
| Resolver reachable from sandbox namespace | `--dns <real upstream>`
forwarded to container | MET |
| Detect `127.0.0.53` / systemd-resolved | `detectSandboxFallbackDns`
reads `/etc/resolv.conf`, falls through to
`/run/systemd/resolve/resolv.conf` for the real upstream | MET |
| Inject reachable resolver explicitly via `--dns` |
`buildDockerGpuCloneRunArgs` DNS block | MET |
| `host.openshell.internal` resolvable | #3623 `--add-host` preservation
(asserted by regression test) | MET |
| Regression check naming google.com / gateway.discord.gg /
integrate.api.nvidia.com / host.openshell.internal | New test
`regression manifest: …` names all four | MET |

## Test plan

```
npm run build:cli
npx vitest run src/lib/onboard/docker-gpu-patch.test.ts
```
24/24 pass (16 existing + 8 new for #3579).

## Notes for reviewers

- Detection is deliberately narrow: only fires when **all**
`/etc/resolv.conf` nameservers are 127.0.0.x. Single non-loopback
resolver → null. Empty file → null. Missing systemd-resolved upstream
file → null.
- Injection respects OpenShell's existing `--dns` config: if `host.Dns`
is non-empty, we don't override.
- **`--dns` is skipped entirely on `--network=host`** because Docker
ignores `--dns` flags in host networking mode. The opt-in
host-networking case (via `NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host`) is
therefore **not** mitigated by this PR — the detection helper is now
exported and available, but auto-injection would be a no-op. If a user
explicitly opts into host networking on a systemd-resolved host, they
still hit the original #3579 trap. The default path is bulletproof.
- The regression test for the 4 hostnames is unit-level (asserts the
wiring: `--add-host` for `host.openshell.internal`, non-host network
mode for public hosts, `--dns` injection when fallback is set). True E2E
that actually runs `getent hosts` for the three public hostnames would
need real Docker + outbound network and belongs in `test/e2e/` if QA
wants it later.
- DGX Spark validation still required to confirm fix in production.

Closes #3579

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added automatic DNS fallback detection for GPU-enabled Docker
sandboxes to handle systemd-resolved configurations.
* Improves DNS resolution reliability when containers encounter
loopback-only nameserver configurations.
* Ensures proper DNS injection while respecting existing container DNS
settings.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3628?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
@miyoungc miyoungc mentioned this pull request May 16, 2026
12 tasks
@wscurran wscurran added area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression and removed priority: high bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression NV QA Bugs found by the NVIDIA QA Team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Inference] Express setup sandbox cannot reach local Ollama — host.openshell.internal unresolvable + policy port mismatch (11434 vs 11435)

3 participants