fix(onboard): preflight sandbox-bridge → gateway reachability on Docker-driver#3441
Conversation
…er-driver Add a sibling probe to gateway-tcp-readiness that runs once after the host loopback TCP check passes, spawning a short-lived container on the openshell-docker bridge and TCP-connecting to host.openshell.internal:GATEWAY_PORT the same way real sandboxes do. Fails the preflight with an actionable "sudo ufw allow from <subnet> to any port 8080 proto tcp" message instead of proceeding to a 12+ minute silent sandbox build that crash-loops on "Policy fetch failed" once it lands. The existing isGatewayTcpReady probe checks 127.0.0.1:GATEWAY_PORT, which takes the kernel loopback shortcut and bypasses the host INPUT chain. On UFW-active hosts (Brev's default posture) the gateway answers from the host but sandbox containers can't reach it — the path that matters is bridge → host INPUT, which only a container probe exercises faithfully. Diagnostic-only: never mutates iptables/ufw. Hint includes the bridge subnet derived from `docker network inspect openshell-docker` so the user can paste the correct rule verbatim. Fixes #3439. ## Validation - npm run typecheck:cli - npm run build:cli - npm run checks - npm run format:check - npx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts (8 passed) - npx vitest run test/onboard.test.ts -t "Model Router" (6 passed standalone; see "Pre-existing test infra notes" in the PR body) Signed-off-by: Steven Rick <srick@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds a Docker-network-based sandbox→gateway reachability probe, a CLI formatter for actionable failure messages, Vitest tests for probe outcomes and messaging, and integrates a verifier (verifySandboxBridgeGatewayReachableOrExit) into startDockerDriverGateway() at multiple readiness checkpoints. ChangesSandbox-bridge Gateway Reachability
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/lib/onboard/gateway-sandbox-reachability.ts (1)
21-21: ⚡ Quick winPin the probe image to an immutable reference instead of
latest.Line 21 uses a mutable tag (
busybox:latest), which can change onboarding behavior without code changes. Pin to a version or digest for reproducible preflight behavior.Proposed change
-const DEFAULT_PROBE_IMAGE = "busybox:latest"; +// Pin to a reviewed immutable image reference for reproducible probe behavior. +const DEFAULT_PROBE_IMAGE = "busybox@sha256:<verified-digest>";🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/gateway-sandbox-reachability.ts` at line 21, DEFAULT_PROBE_IMAGE is set to a mutable tag ("busybox:latest"); replace it with an immutable reference to ensure reproducible behavior by updating the DEFAULT_PROBE_IMAGE constant to a specific version tag or image digest (e.g., a numeric tag or sha256 digest) in gateway-sandbox-reachability.ts, and update any related tests or documentation that expect the previous value.src/lib/onboard.ts (1)
4567-4580: Please exercise the gateway reuse path in E2E.This change sits on the Docker-driver startup/reuse branch, so I'd run at least
sandbox-operations-e2eandopenshell-gateway-upgrade-e2e, plus a reuse/UFW regression if you have one, before merging. As per coding guidelines,src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 4567 - 4580, The test reviewer asks to exercise the gateway reuse path in E2E: run sandbox-operations-e2e and openshell-gateway-upgrade-e2e (and any reuse/UFW regression) to validate the Docker-driver startup/reuse changes around isSandboxBridgeGatewayReachable; before merging, run those E2E suites against the branch that modifies src/lib/onboard.ts (specifically the block that calls isSandboxBridgeGatewayReachable and uses formatSandboxBridgeUnreachableMessage/exitOnFailure) and confirm gateway reuse works and UFW-related reachability is handled as expected, fixing any failures exposed by the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 4567-4580: Create a small helper (e.g.,
verifySandboxBridgeReachabilityOrExit) that encapsulates the existing
reachability probe using isSandboxBridgeGatewayReachable(),
console.error(formatSandboxBridgeUnreachableMessage(reach)), process.exit(1)
when exitOnFailure is true, and throwing the Error when not exiting; then call
that helper immediately before every successful return from
startDockerDriverGateway() reuse/adopt paths (the branches that currently return
on the earlier reuse/adopt lines) as well as where the fresh-start loop
currently runs the probe, ensuring the bridge probe executes on every successful
Docker-driver reuse path.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4567-4580: The test reviewer asks to exercise the gateway reuse
path in E2E: run sandbox-operations-e2e and openshell-gateway-upgrade-e2e (and
any reuse/UFW regression) to validate the Docker-driver startup/reuse changes
around isSandboxBridgeGatewayReachable; before merging, run those E2E suites
against the branch that modifies src/lib/onboard.ts (specifically the block that
calls isSandboxBridgeGatewayReachable and uses
formatSandboxBridgeUnreachableMessage/exitOnFailure) and confirm gateway reuse
works and UFW-related reachability is handled as expected, fixing any failures
exposed by the tests.
In `@src/lib/onboard/gateway-sandbox-reachability.ts`:
- Line 21: DEFAULT_PROBE_IMAGE is set to a mutable tag ("busybox:latest");
replace it with an immutable reference to ensure reproducible behavior by
updating the DEFAULT_PROBE_IMAGE constant to a specific version tag or image
digest (e.g., a numeric tag or sha256 digest) in
gateway-sandbox-reachability.ts, and update any related tests or documentation
that expect the previous value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cd158b8f-28bb-4ca8-9068-3f5cd75b5f55
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/gateway-sandbox-reachability.test.tssrc/lib/onboard/gateway-sandbox-reachability.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Approved. The follow-up covers the reuse/adopt paths, keeps src/lib/onboard.ts net-neutral, avoids treating helper-container failures as firewall failures, and CI is green on the updated head.
## Summary - supersede #3441 with a Docker-driver sandbox reachability probe that mirrors OpenShell's current Docker routing model - inspect the managed Docker network IPAM config, prefer the IPv4 bridge gateway, and inject the same host.openshell.internal mapping real OpenShell sandboxes receive - classify probe setup/DNS/network-inspect failures as non-blocking `probe_unavailable` instead of host-firewall failures - keep the UFW remediation only for native bridge-gateway TCP failures after the exact OpenShell route has been modeled ## Why #3441 tried to catch the real partner/Brev failure from #3439, but its helper container did not actually behave like an OpenShell Docker sandbox after OpenShell #1128. Real sandboxes get explicit `host.openshell.internal` routing: native Linux Docker maps it to the `openshell-docker` bridge gateway IP, while Docker Desktop/VM-backed Docker uses Docker's `host-gateway` route. This replacement keeps the useful early diagnostic while avoiding the false DNS/host-gateway failures that forced the #3441 revert. ## Validation - npm run build:cli - npm run typecheck:cli - npx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts test/gateway-liveness-probe.test.ts - npm run checks - git diff --check Local Docker note: this checkout did not have an `openshell-docker` network, which now maps to `probe_unavailable` / continue rather than a firewall diagnosis. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Improvements** * Docker gateway startup now performs sandbox-bridge reachability checks before reporting healthy, reducing startup surprises. * **User-facing** * Clearer diagnostics and guidance when sandbox-to-gateway connectivity fails (including conditional firewall hints for TCP failures). * **Tests** * Added tests covering gateway reachability checks and related messaging to prevent regressions. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3459) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… Linux (#3472) > Re-attributed replacement for #3465. Same code, single squashed commit authored and signed-off by me. Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3465 will be closed in favour of this one. ## Summary Fixes #3340. On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. Not fixed by PR #3459 (probes port 8080, which has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side container check uses `--add-host host-gateway` on the default bridge, never testing the real sandbox path). **Root cause:** Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it. ## What this PR does Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a short-lived `busybox` container on the `openshell-docker` network (the exact network the Docker-driver gateway creates for sandboxes) and performs `nc -zw5 host.openshell.internal:11435`, mirroring the real sandbox route. The probe is called inside `setupInference()` **before** `upsertProvider` and `runOpenshell(["inference", "set", ...])`, so: - A `tcp_failed` result (nc exit 1 on Linux native) prints a targeted `sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation and exits 1. - Because no inference route is committed on failure, `isInferenceRouteReady()` stays false — both a fresh re-run and `--resume` re-enter `setupInference()` and re-probe after the user applies the UFW fix. - A `probe_unavailable` result (Docker Desktop, DNS failure, network not found, non-0/non-1 nc exit) continues silently — these environments either don't have UFW or aren't using the Docker-driver gateway. - Skipped entirely on WSL. ## Why PR #3459 doesn't fix this Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080 to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule. ## Why the previous probe (PR #3441) was reverted PR #3441's probe had no `--add-host`, so `host.openshell.internal` was unresolvable inside the probe container — nc always exited 1 regardless of whether UFW was enabled. This PR adds `--add-host host.openshell.internal:<gatewayIp>` and a `isNameResolutionFailure()` guard that reclassifies DNS errors as `probe_unavailable` (non-fatal) rather than `tcp_failed`. ## Scope Targets **Docker-driver gateway mode** (v0.0.40+) where sandboxes run on the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments (`openshell-cluster-nemoclaw`) don't have this network; the probe returns `probe_unavailable` and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies. ## Functional verification Verified end-to-end on a Brev VM: - **Without blocking:** probe container runs `nc -zw5 host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok` ✓ - **With `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j DROP`:** same probe → `status 1` → `tcp_failed` → UFW remediation message printed ✓ ## Test plan - [x] 25 unit tests covering all result variants, argument construction, host-gateway mode, DNS failure classification, env var override, and UFW message formatter — all pass - [x] No new test failures vs main baseline (verified after `npm run build:cli`) - [x] Biome lint/format, SPDX headers, shfmt all pass (`make check` — `hadolint` binary missing on Brev VM, pre-existing) - [x] No new TypeScript type errors (`npm run typecheck:cli`) - [x] Functional end-to-end: correct `status 0` / `status 1` behaviour with and without iptables blocking Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding for Ollama-local now probes sandbox→proxy reachability and will halt onboarding with a clear, actionable error if a TCP connectivity failure is detected. * **Tests** * Added extensive tests covering reachability probing, network parsing, DNS/connection failure cases, Docker run behavior, and user-facing error message formatting. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3472) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary - Restore affected macOS OpenShell VM sandboxes by treating the VM driver as a separate compatibility path from Docker/Kubernetes. - Patch the macOS VM sandbox rootfs DNS to the gvproxy resolver (`192.168.127.1`) so `inference.local` resolves from inside VM sandboxes. - Skip legacy Kubernetes/Docker DNS-proxy repair only for VM sandboxes and fall back to OpenShell route reapply when appropriate. - Gate VM sandbox-create early detach on NemoClaw startup output so `Ready` alone cannot advance onboarding before the sandbox startup command is actually running. - Fix downstream Hermes/Discord issues exposed by the VM path: locked-aware non-root Hermes config verification, guild-only Discord authorization, regional `*.discord.gg` websocket policy, and stricter Slack provider reuse checks. ## Direction / Scope Guardrail This PR is **not** the strategic macOS driver direction. It is a narrow compatibility bridge for already-created or explicitly selected OpenShell VM sandboxes while NemoClaw is pinned to the OpenShell behavior that exposed this regression. Normal macOS onboarding should move back to Docker/Colima in #3454 (`fix(onboard): use Docker driver on macOS`). This PR should not default `OPENSHELL_DRIVERS=vm`, should not add installer requirements for VM helper assets, and should not make VM the preferred macOS runtime. OpenShell #1375 has merged upstream and keeps VM driver selection opt-in. Once NemoClaw can consume that OpenShell release path, the durable direction is to rely on Docker/Colima for normal macOS onboarding and keep this VM shim only for explicit/legacy VM cases until it can be removed. ## Root Cause Earlier PR text blamed `#3441`. That was too narrow and is not accurate for the final fix. The reverted Docker bridge reachability probe was one visible blocker, but it was not the underlying `inference.local` failure, and that bridge-probe code is no longer part of this PR's final diff. The affected failure chain is a macOS VM-driver mismatch: - Ubuntu uses the Docker/Kubernetes sandbox path, where NemoClaw's legacy DNS proxy and bridge assumptions apply. - The affected macOS flow used OpenShell's VM driver, where sandbox networking is backed by the VM/gvproxy path rather than a Docker/Kubernetes gateway container. - The VM rootfs could end up with public DNS fallback resolvers (`8.8.8.8` / `8.8.4.4`). Those can resolve public hostnames, but they cannot resolve OpenShell/NemoClaw synthetic hostnames such as `inference.local`. - When `inference.local` failed, NemoClaw tried the legacy DNS repair path, which produced misleading gateway-container warnings instead of repairing VM DNS. - Separately, the VM driver can report the sandbox `Ready` before NemoClaw startup output appears. On macOS that allowed onboarding to detach before dashboard/Hermes/OpenClaw startup was actually observable. The Discord failures were downstream runtime issues exposed after the VM sandbox got far enough to run. Discord may use regional websocket hosts such as `gateway-us-east1-d.discord.gg`, and Hermes guild-only configuration without explicit user IDs must permit guild members instead of rejecting every Discord user as unauthorized. ## Tradeoff / Follow-up The DNS portion of this PR is intentionally a narrow emergency compatibility shim, not the ideal long-term owner boundary. It is Darwin + OpenShell VM-driver gated, best-effort, and disableable with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`, but it still depends on today's OpenShell VM rootfs layout, init-script shape, and gvproxy resolver IP (`192.168.127.1`). That is acceptable only as a compatibility bridge for explicit/legacy VM sandboxes. Durable follow-up is split by owner: - NemoClaw: #3454 restores normal macOS onboarding to Docker/Colima. - OpenShell: #1375 makes VM opt-in upstream; VM resolver setup should ultimately be OpenShell-owned rather than a NemoClaw rootfs patch. - Future OpenShell VM layouts, including ext4-style root disks, should be diagnosed clearly by NemoClaw but not mounted or rewritten from NemoClaw. ## Regression Risk - macOS VM path: intentional behavior change. The VM DNS patch is gated to `openshellDriver === "vm"` on Darwin, is best-effort, and can be disabled with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`. - Normal macOS Docker path: intentionally out of scope here and owned by #3454. This PR should not default macOS to VM. - Linux/Docker path: low risk. The VM DNS patch does not run for Docker sandboxes, and legacy DNS proxy repair remains available for non-VM drivers. - Discord policy: low risk. The change adds websocket-specific `*.discord.gg` handling with credential rewrite; it does not broadly open Discord REST beyond the existing Discord policy surface. - Messaging reuse: lower risk than before. Slack reuse now requires both `-slack-bridge` and `-slack-app`, avoiding partial provider reuse. ## Validation - `npm run build:cli` - `git diff --check` - `npx vitest run src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts test/sandbox-connect-inference.test.ts test/onboard.test.ts --fileParallelism=false` - Focused suite result on current head: 262 tests passed. - Manual macOS VM validation during debugging: `https://inference.local/v1/models` and chat completions returned 200 from inside the VM sandbox after the DNS patch. - Manual Discord validation during debugging: Hermes Discord responded after the regional gateway websocket policy was applied. - Current full nightly dispatch: https://github.com/NVIDIA/NemoClaw/actions/runs/25861533504 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Non-interactive onboarding can reuse stored messaging channels to speed setup * Added WebSocket support for Discord gateways with credential rewrite handling * Create-stream option to require startup output before considering sandboxes "ready" * **Bug Fixes** * Improved VM/macOS DNS setup and repair paths; refined sandbox driver selection * More robust inference-route repair behavior for sandboxes * **Tests** * Expanded tests for messaging reuse, VM DNS patching, sandbox creation/connect, and policy validation <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3445) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Fixes #3439.
Adds a sibling probe to
gateway-tcp-readiness.ts—gateway-sandbox-reachability.ts— that runs once after the host loopback TCP check passes during Docker-driver gateway startup. It spawns a short-livedbusyboxcontainer on theopenshell-dockerbridge and TCP-connects tohost.openshell.internal:GATEWAY_PORTthe same way real sandboxes do.If the probe fails, the preflight fails with an actionable message instead of letting the user spend 12+ minutes building a sandbox image that then crash-loops on
Policy fetch failed: failed to connect to OpenShell serveruntil the 180s readiness window expires and the orphan-cleanup path runs.Why a separate probe
The existing
isGatewayTcpReadyprobe checks127.0.0.1:GATEWAY_PORT. That correctly verifies "the gateway binary is alive and listening on the host", but localhost connects take the kernel loopback shortcut and bypass the host INPUT chain. On hosts with UFW active in default-deny incoming posture (Brev's default), the gateway answers from the host while sandbox containers can't reach it — the packet path that matters isbridge → host INPUT, which only a container probe exercises faithfully.Behavior
sudo ufw allow from <subnet> to any port 8080 proto tcp, then exits via the sameexitOnFailurecontract the surrounding function already uses.The probe adds one
docker runper onboard (~1-2s on first call when the helper image needs pulling, sub-second after). It only runs once per onboard — gated on the existing gateway-healthy success branch, not the poll loop.Validation
npm run typecheck:clinpm run build:clinpm run checksnpm run format:checknpx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts— 8 passedPre-existing test infra note
The
test-cliprek hook was skipped on the commit. The Model Router tests intest/onboard.test.tsexceed the default 5s vitest timeout under parallel suite load (all 6 pass standalone in ~3s each). Reproduces againstd12cce5on a clean checkout — independent of this PR.Test plan
formatSandboxBridgeUnreachableMessage— the wording is the user-facing hook for this whole change.Signed-off-by: Steven Rick srick@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes / Improvements
Tests