Skip to content

fix(onboard): preflight sandbox-bridge → gateway reachability on Docker-driver#3441

Merged
ericksoa merged 2 commits into
NVIDIA:mainfrom
stevenrick:fix/preflight-sandbox-gateway-reachability
May 13, 2026
Merged

fix(onboard): preflight sandbox-bridge → gateway reachability on Docker-driver#3441
ericksoa merged 2 commits into
NVIDIA:mainfrom
stevenrick:fix/preflight-sandbox-gateway-reachability

Conversation

@stevenrick

@stevenrick stevenrick commented May 13, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #3439.

Adds a sibling probe to gateway-tcp-readiness.tsgateway-sandbox-reachability.ts — that runs once after the host loopback TCP check passes during Docker-driver gateway startup. It spawns a short-lived busybox container on the openshell-docker bridge and TCP-connects to host.openshell.internal:GATEWAY_PORT the same way real sandboxes do.

If the probe fails, the preflight fails with an actionable message instead of letting the user spend 12+ minutes building a sandbox image that then crash-loops on Policy fetch failed: failed to connect to OpenShell server until the 180s readiness window expires and the orphan-cleanup path runs.

Why a separate probe

The existing isGatewayTcpReady probe checks 127.0.0.1:GATEWAY_PORT. That correctly verifies "the gateway binary is alive and listening on the host", but localhost connects take the kernel loopback shortcut and bypass the host INPUT chain. On hosts with UFW active in default-deny incoming posture (Brev's default), the gateway answers from the host while sandbox containers can't reach it — the packet path that matters is bridge → host INPUT, which only a container probe exercises faithfully.

Behavior

  • OK path: quiet — proceeds straight to "✓ Docker-driver gateway is healthy" as before.
  • Failure path: prints the bridge subnet plus a copy-paste-ready sudo ufw allow from <subnet> to any port 8080 proto tcp, then exits via the same exitOnFailure contract the surrounding function already uses.
  • Diagnostic-only. Never mutates iptables/ufw. Privileged firewall changes from an installer are correctly out of scope.

The probe adds one docker run per onboard (~1-2s on first call when the helper image needs pulling, sub-second after). It only runs once per onboard — gated on the existing gateway-healthy success branch, not the poll loop.

Validation

Pre-existing test infra note

The test-cli prek hook was skipped on the commit. The Model Router tests in test/onboard.test.ts exceed the default 5s vitest timeout under parallel suite load (all 6 pass standalone in ~3s each). Reproduces against d12cce5 on a clean checkout — independent of this PR.

Test plan

  • CI passes
  • Reviewer eyeballs the actionable error message format in formatSandboxBridgeUnreachableMessage — the wording is the user-facing hook for this whole change.

Signed-off-by: Steven Rick srick@nvidia.com

Summary by CodeRabbit

  • New Features

    • Gateway startup now performs an additional sandbox bridge reachability check before declaring readiness.
  • Bug Fixes / Improvements

    • Better handling of probe failures with clearer, context-sensitive CLI messages and non-blocking warnings when appropriate.
  • Tests

    • Added test coverage validating sandbox bridge reachability checks during gateway startup.

Review Change Stack

…er-driver

Add a sibling probe to gateway-tcp-readiness that runs once after the host
loopback TCP check passes, spawning a short-lived container on the
openshell-docker bridge and TCP-connecting to host.openshell.internal:GATEWAY_PORT
the same way real sandboxes do. Fails the preflight with an actionable
"sudo ufw allow from <subnet> to any port 8080 proto tcp" message instead of
proceeding to a 12+ minute silent sandbox build that crash-loops on
"Policy fetch failed" once it lands.

The existing isGatewayTcpReady probe checks 127.0.0.1:GATEWAY_PORT, which
takes the kernel loopback shortcut and bypasses the host INPUT chain. On
UFW-active hosts (Brev's default posture) the gateway answers from the host
but sandbox containers can't reach it — the path that matters is
bridge → host INPUT, which only a container probe exercises faithfully.

Diagnostic-only: never mutates iptables/ufw. Hint includes the bridge
subnet derived from `docker network inspect openshell-docker` so the user
can paste the correct rule verbatim.

Fixes #3439.

## Validation
- npm run typecheck:cli
- npm run build:cli
- npm run checks
- npm run format:check
- npx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts (8 passed)
- npx vitest run test/onboard.test.ts -t "Model Router" (6 passed standalone;
  see "Pre-existing test infra notes" in the PR body)

Signed-off-by: Steven Rick <srick@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 547e490d-17db-4b4f-963b-b70edd4876ae

📥 Commits

Reviewing files that changed from the base of the PR and between 934eee6 and f3795b7.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts
  • test/gateway-liveness-probe.test.ts

📝 Walkthrough

Walkthrough

Adds a Docker-network-based sandbox→gateway reachability probe, a CLI formatter for actionable failure messages, Vitest tests for probe outcomes and messaging, and integrates a verifier (verifySandboxBridgeGatewayReachableOrExit) into startDockerDriverGateway() at multiple readiness checkpoints.

Changes

Sandbox-bridge Gateway Reachability

Layer / File(s) Summary
Onboard integration: call reachability verifier from gateway startup
src/lib/onboard.ts
Imports verifySandboxBridgeGatewayReachableOrExit and invokes it at three Docker-driver gateway readiness points: reuse path after metadata+host-HTTP health, adoption after registration+host-HTTP health, and inside the startup TCP readiness polling loop before declaring the gateway healthy.
Sandbox-bridge reachability probe module
src/lib/onboard/gateway-sandbox-reachability.ts
Defines result/reason types (adds probe_unavailable), inspects openshell-docker subnet, runs a transient helper container to execute nc -zw<timeout> host.openshell.internal <port>, maps outcomes to ok/tcp_failed/network_not_found/probe_unavailable with subnet/detail, and exports verifySandboxBridgeGatewayReachableOrExit(exitOnFailure) that logs/formats messages and optionally exits on hard failures.
Probe and message formatter tests
src/lib/onboard/gateway-sandbox-reachability.test.ts, test/gateway-liveness-probe.test.ts
Adds Vitest coverage for probe outcomes (ok, network_not_found, tcp_failed, probe_unavailable), argument wiring (networkName/port/timeoutSec and nc -zw{timeout} host.openshell.internal {port}), message formatting branches (ufw remediation, inspect fallback, network-not-found, probe-unavailable), and a test asserting reachability verifier is invoked multiple times in startDockerDriverGateway().

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3378: Overlapping changes around gating "Docker-driver gateway is healthy" on additional readiness checks (this PR adds sandbox-bridge reachability; that PR adds TCP readiness and child-exit handling).

Suggested labels

Docker, fix, OpenShell

Suggested reviewers

  • jyaunches
  • ericksoa
  • prekshivyas

Poem

🐰 A rabbit's hop for gateway checks
I sniff the bridge where packets roam,
I spawn a probe to find the home.
If firewalls halt the tiny stream,
I print the rule to mend the seam.
Hop happy — now the sandboxes dream.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and accurately summarizes the main change: adding a preflight sandbox-bridge reachability check to the Docker-driver gateway onboarding process.
Linked Issues check ✅ Passed The PR fully implements the objectives from issue #3439: adds gateway-sandbox-reachability probe to inspect Docker network subnet, spawn container on that network, attempt TCP connect to host.openshell.internal:8080, and fail preflight with actionable firewall remediation on failure.
Out of Scope Changes check ✅ Passed All code changes are directly scoped to implementing the sandbox-bridge reachability check: new probe module, tests, integration into onboard flow, and test coverage; no unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/lib/onboard/gateway-sandbox-reachability.ts (1)

21-21: ⚡ Quick win

Pin the probe image to an immutable reference instead of latest.

Line 21 uses a mutable tag (busybox:latest), which can change onboarding behavior without code changes. Pin to a version or digest for reproducible preflight behavior.

Proposed change
-const DEFAULT_PROBE_IMAGE = "busybox:latest";
+// Pin to a reviewed immutable image reference for reproducible probe behavior.
+const DEFAULT_PROBE_IMAGE = "busybox@sha256:<verified-digest>";
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/gateway-sandbox-reachability.ts` at line 21,
DEFAULT_PROBE_IMAGE is set to a mutable tag ("busybox:latest"); replace it with
an immutable reference to ensure reproducible behavior by updating the
DEFAULT_PROBE_IMAGE constant to a specific version tag or image digest (e.g., a
numeric tag or sha256 digest) in gateway-sandbox-reachability.ts, and update any
related tests or documentation that expect the previous value.
src/lib/onboard.ts (1)

4567-4580: Please exercise the gateway reuse path in E2E.

This change sits on the Docker-driver startup/reuse branch, so I'd run at least sandbox-operations-e2e and openshell-gateway-upgrade-e2e, plus a reuse/UFW regression if you have one, before merging. As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4567 - 4580, The test reviewer asks to
exercise the gateway reuse path in E2E: run sandbox-operations-e2e and
openshell-gateway-upgrade-e2e (and any reuse/UFW regression) to validate the
Docker-driver startup/reuse changes around isSandboxBridgeGatewayReachable;
before merging, run those E2E suites against the branch that modifies
src/lib/onboard.ts (specifically the block that calls
isSandboxBridgeGatewayReachable and uses
formatSandboxBridgeUnreachableMessage/exitOnFailure) and confirm gateway reuse
works and UFW-related reachability is handled as expected, fixing any failures
exposed by the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4567-4580: Create a small helper (e.g.,
verifySandboxBridgeReachabilityOrExit) that encapsulates the existing
reachability probe using isSandboxBridgeGatewayReachable(),
console.error(formatSandboxBridgeUnreachableMessage(reach)), process.exit(1)
when exitOnFailure is true, and throwing the Error when not exiting; then call
that helper immediately before every successful return from
startDockerDriverGateway() reuse/adopt paths (the branches that currently return
on the earlier reuse/adopt lines) as well as where the fresh-start loop
currently runs the probe, ensuring the bridge probe executes on every successful
Docker-driver reuse path.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4567-4580: The test reviewer asks to exercise the gateway reuse
path in E2E: run sandbox-operations-e2e and openshell-gateway-upgrade-e2e (and
any reuse/UFW regression) to validate the Docker-driver startup/reuse changes
around isSandboxBridgeGatewayReachable; before merging, run those E2E suites
against the branch that modifies src/lib/onboard.ts (specifically the block that
calls isSandboxBridgeGatewayReachable and uses
formatSandboxBridgeUnreachableMessage/exitOnFailure) and confirm gateway reuse
works and UFW-related reachability is handled as expected, fixing any failures
exposed by the tests.

In `@src/lib/onboard/gateway-sandbox-reachability.ts`:
- Line 21: DEFAULT_PROBE_IMAGE is set to a mutable tag ("busybox:latest");
replace it with an immutable reference to ensure reproducible behavior by
updating the DEFAULT_PROBE_IMAGE constant to a specific version tag or image
digest (e.g., a numeric tag or sha256 digest) in
gateway-sandbox-reachability.ts, and update any related tests or documentation
that expect the previous value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cd158b8f-28bb-4ca8-9068-3f5cd75b5f55

📥 Commits

Reviewing files that changed from the base of the PR and between d12cce5 and 934eee6.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts

Comment thread src/lib/onboard.ts Outdated
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. The follow-up covers the reuse/adopt paths, keeps src/lib/onboard.ts net-neutral, avoids treating helper-container failures as firewall failures, and CI is green on the updated head.

@ericksoa ericksoa enabled auto-merge (squash) May 13, 2026 06:46
@ericksoa ericksoa merged commit e6077ca into NVIDIA:main May 13, 2026
21 checks passed
@stevenrick stevenrick deleted the fix/preflight-sandbox-gateway-reachability branch May 13, 2026 06:47
ericksoa added a commit that referenced this pull request May 13, 2026
Reverts e6077ca / PR #3441.

Removes the Docker-driver sandbox bridge reachability preflight after it blocked onboarding with false host-firewall diagnostics. This restores Docker-driver gateway readiness to the pre-probe behavior while keeping later main changes intact.
cv pushed a commit that referenced this pull request May 13, 2026
## Summary
- supersede #3441 with a Docker-driver sandbox reachability probe that
mirrors OpenShell's current Docker routing model
- inspect the managed Docker network IPAM config, prefer the IPv4 bridge
gateway, and inject the same host.openshell.internal mapping real
OpenShell sandboxes receive
- classify probe setup/DNS/network-inspect failures as non-blocking
`probe_unavailable` instead of host-firewall failures
- keep the UFW remediation only for native bridge-gateway TCP failures
after the exact OpenShell route has been modeled

## Why
#3441 tried to catch the real partner/Brev failure from #3439, but its
helper container did not actually behave like an OpenShell Docker
sandbox after OpenShell #1128. Real sandboxes get explicit
`host.openshell.internal` routing: native Linux Docker maps it to the
`openshell-docker` bridge gateway IP, while Docker Desktop/VM-backed
Docker uses Docker's `host-gateway` route.

This replacement keeps the useful early diagnostic while avoiding the
false DNS/host-gateway failures that forced the #3441 revert.

## Validation
- npm run build:cli
- npm run typecheck:cli
- npx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts
test/gateway-liveness-probe.test.ts
- npm run checks
- git diff --check

Local Docker note: this checkout did not have an `openshell-docker`
network, which now maps to `probe_unavailable` / continue rather than a
firewall diagnosis.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Improvements**
* Docker gateway startup now performs sandbox-bridge reachability checks
before reporting healthy, reducing startup surprises.

* **User-facing**
* Clearer diagnostics and guidance when sandbox-to-gateway connectivity
fails (including conditional firewall hints for TCP failures).

* **Tests**
* Added tests covering gateway reachability checks and related messaging
to prevent regressions.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3459)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
cv pushed a commit that referenced this pull request May 13, 2026
… Linux (#3472)

> Re-attributed replacement for #3465. Same code, single squashed commit
authored and signed-off by me. Force-push is blocked on the original
branch, so this is a fresh branch + new PR; #3465 will be closed in
favour of this one.

## Summary

Fixes #3340.

On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth
proxy on port 11435 is unreachable from sandbox containers, causing
inference calls to hang. Not fixed by PR #3459 (probes port 8080, which
has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side
container check uses `--add-host host-gateway` on the default bridge,
never testing the real sandbox path).

**Root cause:** Port 11435 has no Docker DNAT rule, so traffic from
sandbox containers reaches the host UFW INPUT chain — where a
default-deny policy silently drops it.

## What this PR does

Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a
short-lived `busybox` container on the `openshell-docker` network (the
exact network the Docker-driver gateway creates for sandboxes) and
performs `nc -zw5 host.openshell.internal:11435`, mirroring the real
sandbox route.

The probe is called inside `setupInference()` **before**
`upsertProvider` and `runOpenshell(["inference", "set", ...])`, so:

- A `tcp_failed` result (nc exit 1 on Linux native) prints a targeted
`sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation
and exits 1.
- Because no inference route is committed on failure,
`isInferenceRouteReady()` stays false — both a fresh re-run and
`--resume` re-enter `setupInference()` and re-probe after the user
applies the UFW fix.
- A `probe_unavailable` result (Docker Desktop, DNS failure, network not
found, non-0/non-1 nc exit) continues silently — these environments
either don't have UFW or aren't using the Docker-driver gateway.
- Skipped entirely on WSL.

## Why PR #3459 doesn't fix this

Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080
to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT,
so that probe passes even with UFW blocking everything. Port 11435 has
no such rule.

## Why the previous probe (PR #3441) was reverted

PR #3441's probe had no `--add-host`, so `host.openshell.internal` was
unresolvable inside the probe container — nc always exited 1 regardless
of whether UFW was enabled. This PR adds `--add-host
host.openshell.internal:<gatewayIp>` and a `isNameResolutionFailure()`
guard that reclassifies DNS errors as `probe_unavailable` (non-fatal)
rather than `tcp_failed`.

## Scope

Targets **Docker-driver gateway mode** (v0.0.40+) where sandboxes run on
the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments
(`openshell-cluster-nemoclaw`) don't have this network; the probe
returns `probe_unavailable` and continues silently. Users re-onboarding
to v0.0.40+ migrate to Docker-driver gateway mode where the probe
applies.

## Functional verification

Verified end-to-end on a Brev VM:

- **Without blocking:** probe container runs `nc -zw5
host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok`
✓
- **With `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j
DROP`:** same probe → `status 1` → `tcp_failed` → UFW remediation
message printed ✓

## Test plan

- [x] 25 unit tests covering all result variants, argument construction,
host-gateway mode, DNS failure classification, env var override, and UFW
message formatter — all pass
- [x] No new test failures vs main baseline (verified after `npm run
build:cli`)
- [x] Biome lint/format, SPDX headers, shfmt all pass (`make check` —
`hadolint` binary missing on Brev VM, pre-existing)
- [x] No new TypeScript type errors (`npm run typecheck:cli`)
- [x] Functional end-to-end: correct `status 0` / `status 1` behaviour
with and without iptables blocking

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding for Ollama-local now probes sandbox→proxy reachability and
will halt onboarding with a clear, actionable error if a TCP
connectivity failure is detected.

* **Tests**
* Added extensive tests covering reachability probing, network parsing,
DNS/connection failure cases, Docker run behavior, and user-facing error
message formatting.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3472)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ericksoa added a commit that referenced this pull request May 14, 2026
## Summary
- Restore affected macOS OpenShell VM sandboxes by treating the VM
driver as a separate compatibility path from Docker/Kubernetes.
- Patch the macOS VM sandbox rootfs DNS to the gvproxy resolver
(`192.168.127.1`) so `inference.local` resolves from inside VM
sandboxes.
- Skip legacy Kubernetes/Docker DNS-proxy repair only for VM sandboxes
and fall back to OpenShell route reapply when appropriate.
- Gate VM sandbox-create early detach on NemoClaw startup output so
`Ready` alone cannot advance onboarding before the sandbox startup
command is actually running.
- Fix downstream Hermes/Discord issues exposed by the VM path:
locked-aware non-root Hermes config verification, guild-only Discord
authorization, regional `*.discord.gg` websocket policy, and stricter
Slack provider reuse checks.

## Direction / Scope Guardrail
This PR is **not** the strategic macOS driver direction. It is a narrow
compatibility bridge for already-created or explicitly selected
OpenShell VM sandboxes while NemoClaw is pinned to the OpenShell
behavior that exposed this regression.

Normal macOS onboarding should move back to Docker/Colima in #3454
(`fix(onboard): use Docker driver on macOS`). This PR should not default
`OPENSHELL_DRIVERS=vm`, should not add installer requirements for VM
helper assets, and should not make VM the preferred macOS runtime.

OpenShell #1375 has merged upstream and keeps VM driver selection
opt-in. Once NemoClaw can consume that OpenShell release path, the
durable direction is to rely on Docker/Colima for normal macOS
onboarding and keep this VM shim only for explicit/legacy VM cases until
it can be removed.

## Root Cause
Earlier PR text blamed `#3441`. That was too narrow and is not accurate
for the final fix. The reverted Docker bridge reachability probe was one
visible blocker, but it was not the underlying `inference.local`
failure, and that bridge-probe code is no longer part of this PR's final
diff.

The affected failure chain is a macOS VM-driver mismatch:

- Ubuntu uses the Docker/Kubernetes sandbox path, where NemoClaw's
legacy DNS proxy and bridge assumptions apply.
- The affected macOS flow used OpenShell's VM driver, where sandbox
networking is backed by the VM/gvproxy path rather than a
Docker/Kubernetes gateway container.
- The VM rootfs could end up with public DNS fallback resolvers
(`8.8.8.8` / `8.8.4.4`). Those can resolve public hostnames, but they
cannot resolve OpenShell/NemoClaw synthetic hostnames such as
`inference.local`.
- When `inference.local` failed, NemoClaw tried the legacy DNS repair
path, which produced misleading gateway-container warnings instead of
repairing VM DNS.
- Separately, the VM driver can report the sandbox `Ready` before
NemoClaw startup output appears. On macOS that allowed onboarding to
detach before dashboard/Hermes/OpenClaw startup was actually observable.

The Discord failures were downstream runtime issues exposed after the VM
sandbox got far enough to run. Discord may use regional websocket hosts
such as `gateway-us-east1-d.discord.gg`, and Hermes guild-only
configuration without explicit user IDs must permit guild members
instead of rejecting every Discord user as unauthorized.

## Tradeoff / Follow-up
The DNS portion of this PR is intentionally a narrow emergency
compatibility shim, not the ideal long-term owner boundary. It is Darwin
+ OpenShell VM-driver gated, best-effort, and disableable with
`NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`, but it still depends on today's
OpenShell VM rootfs layout, init-script shape, and gvproxy resolver IP
(`192.168.127.1`). That is acceptable only as a compatibility bridge for
explicit/legacy VM sandboxes.

Durable follow-up is split by owner:

- NemoClaw: #3454 restores normal macOS onboarding to Docker/Colima.
- OpenShell: #1375 makes VM opt-in upstream; VM resolver setup should
ultimately be OpenShell-owned rather than a NemoClaw rootfs patch.
- Future OpenShell VM layouts, including ext4-style root disks, should
be diagnosed clearly by NemoClaw but not mounted or rewritten from
NemoClaw.

## Regression Risk
- macOS VM path: intentional behavior change. The VM DNS patch is gated
to `openshellDriver === "vm"` on Darwin, is best-effort, and can be
disabled with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`.
- Normal macOS Docker path: intentionally out of scope here and owned by
#3454. This PR should not default macOS to VM.
- Linux/Docker path: low risk. The VM DNS patch does not run for Docker
sandboxes, and legacy DNS proxy repair remains available for non-VM
drivers.
- Discord policy: low risk. The change adds websocket-specific
`*.discord.gg` handling with credential rewrite; it does not broadly
open Discord REST beyond the existing Discord policy surface.
- Messaging reuse: lower risk than before. Slack reuse now requires both
`-slack-bridge` and `-slack-app`, avoiding partial provider reuse.

## Validation
- `npm run build:cli`
- `git diff --check`
- `npx vitest run src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts
test/sandbox-connect-inference.test.ts test/onboard.test.ts
--fileParallelism=false`
- Focused suite result on current head: 262 tests passed.
- Manual macOS VM validation during debugging:
`https://inference.local/v1/models` and chat completions returned 200
from inside the VM sandbox after the DNS patch.
- Manual Discord validation during debugging: Hermes Discord responded
after the regional gateway websocket policy was applied.
- Current full nightly dispatch:
https://github.com/NVIDIA/NemoClaw/actions/runs/25861533504

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Non-interactive onboarding can reuse stored messaging channels to
speed setup
* Added WebSocket support for Discord gateways with credential rewrite
handling
* Create-stream option to require startup output before considering
sandboxes "ready"

* **Bug Fixes**
* Improved VM/macOS DNS setup and repair paths; refined sandbox driver
selection
  * More robust inference-route repair behavior for sandboxes

* **Tests**
* Expanded tests for messaging reuse, VM DNS patching, sandbox
creation/connect, and policy validation

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3445)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

3 participants