fix(onboard): use Docker driver on macOS#3454
Conversation
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughConsolidates macOS onboarding to prefer the Docker driver: removes VM-driver runtime dependency, updates gateway env construction and drift checks, narrows installer/upgrade validation to gateway-only artifacts, adds ps-based VM-child-process detection, and updates unit and E2E tests. ChangesDocker-driver gateway transition for macOS
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/onboard.test.ts (1)
454-457: ⚡ Quick winAdd an explicit guard for
OPENSHELL_DRIVER_DIRremoval in the macOS Docker-driver env test.This block already checks stale VM env cleanup; asserting
OPENSHELL_DRIVER_DIRtoo would lock the intended contract more completely.Suggested assertion
expect(darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_IMAGE).toContain(":0.0.37"); expect(darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_BIN).toBeUndefined(); expect(darwinEnv.OPENSHELL_VM_DRIVER_STATE_DIR).toBeUndefined(); + expect(darwinEnv.OPENSHELL_DRIVER_DIR).toBeUndefined();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard.test.ts` around lines 454 - 457, Add an explicit assertion to the macOS Docker-driver env test to guard that OPENSHELL_DRIVER_DIR is removed: in the same block that asserts darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_IMAGE, darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_BIN, and darwinEnv.OPENSHELL_VM_DRIVER_STATE_DIR, also assert darwinEnv.OPENSHELL_DRIVER_DIR is undefined so the test enforces removal of the driver directory environment variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/onboard.test.ts`:
- Around line 454-457: Add an explicit assertion to the macOS Docker-driver env
test to guard that OPENSHELL_DRIVER_DIR is removed: in the same block that
asserts darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_IMAGE,
darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_BIN, and
darwinEnv.OPENSHELL_VM_DRIVER_STATE_DIR, also assert
darwinEnv.OPENSHELL_DRIVER_DIR is undefined so the test enforces removal of the
driver directory environment variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 682c64e0-4981-4269-9953-47f40d533fcc
📒 Files selected for processing (8)
scripts/install-openshell.shsrc/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-env.test.tssrc/lib/onboard/docker-driver-gateway-env.tssrc/lib/onboard/openshell-install.tstest/e2e/test-openshell-gateway-upgrade.shtest/install-openshell-version-check.test.tstest/onboard.test.ts
…nboard # Conflicts: # src/lib/onboard/gateway-sandbox-reachability.test.ts # src/lib/onboard/gateway-sandbox-reachability.ts
…nboard # Conflicts: # src/lib/onboard.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3796-3800: The current macOS/darwin/arm64 branches still branch on
isLinuxDockerDriverGatewayEnabled() or hardcode Docker, causing Podman/VM
selections (via OPENSHELL_DRIVERS) to be ignored; update the checks around the
host.runtime === "podman" early-exit and the other noted sites (around lines
~4825 and ~5793 and ~6120) to branch on the resolved driver variable (e.g.,
resolvedDriver or the function that returns the final driver selection) instead
of isLinuxDockerDriverGatewayEnabled()/false or implicit Docker assumptions, so
that the code uses the actual selected driver for preflight, registry metadata,
and sandbox creation paths (VM DNS/chmod) across the onboarding flow.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 08c71fdc-8175-48af-9d74-c2b9709535d7
📒 Files selected for processing (6)
scripts/install-openshell.shsrc/lib/onboard.tssrc/lib/onboard/openshell-install.tstest/install-openshell-version-check.test.tstest/onboard.test.tstest/shellquote-sandbox.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/install-openshell-version-check.test.ts
- src/lib/onboard/openshell-install.ts
- test/onboard.test.ts
| if (isLinuxDockerDriverGatewayEnabled() && host.runtime === "podman") { | ||
| console.error(" ✗ NemoClaw onboarding now uses OpenShell's Docker driver."); | ||
| console.error(" Podman is not supported for this NemoClaw integration path."); | ||
| console.error(" Switch to Docker Engine and rerun onboarding."); | ||
| process.exit(1); |
There was a problem hiding this comment.
Use the resolved driver for these macOS branches.
These branches still equate darwin/arm64 with Docker. If a user exports OPENSHELL_DRIVERS=vm, preflight still rejects Podman, registry metadata still says "docker", and sandbox creation still disables the VM DNS/chmod paths. Please branch on the resolved driver selection here instead of isLinuxDockerDriverGatewayEnabled()/false. As per coding guidelines: src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.
Also applies to: 4825-4825, 5793-5795, 6120-6125
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 3796 - 3800, The current macOS/darwin/arm64
branches still branch on isLinuxDockerDriverGatewayEnabled() or hardcode Docker,
causing Podman/VM selections (via OPENSHELL_DRIVERS) to be ignored; update the
checks around the host.runtime === "podman" early-exit and the other noted sites
(around lines ~4825 and ~5793 and ~6120) to branch on the resolved driver
variable (e.g., resolvedDriver or the function that returns the final driver
selection) instead of isLinuxDockerDriverGatewayEnabled()/false or implicit
Docker assumptions, so that the code uses the actual selected driver for
preflight, registry metadata, and sandbox creation paths (VM DNS/chmod) across
the onboarding flow.
Selective E2E Results — ✅ All requested jobs passedRun: 25859343406
|
Selective E2E Results — ✅ All requested jobs passedRun: 25859629562
|
Manual macOS Colima/Docker validationValidated PR head Results:
Observed during the focused Hermes check:
Current PR checks are green, including |
Selective E2E Results — ✅ All requested jobs passedRun: 25876302080
|
## Summary - Restore affected macOS OpenShell VM sandboxes by treating the VM driver as a separate compatibility path from Docker/Kubernetes. - Patch the macOS VM sandbox rootfs DNS to the gvproxy resolver (`192.168.127.1`) so `inference.local` resolves from inside VM sandboxes. - Skip legacy Kubernetes/Docker DNS-proxy repair only for VM sandboxes and fall back to OpenShell route reapply when appropriate. - Gate VM sandbox-create early detach on NemoClaw startup output so `Ready` alone cannot advance onboarding before the sandbox startup command is actually running. - Fix downstream Hermes/Discord issues exposed by the VM path: locked-aware non-root Hermes config verification, guild-only Discord authorization, regional `*.discord.gg` websocket policy, and stricter Slack provider reuse checks. ## Direction / Scope Guardrail This PR is **not** the strategic macOS driver direction. It is a narrow compatibility bridge for already-created or explicitly selected OpenShell VM sandboxes while NemoClaw is pinned to the OpenShell behavior that exposed this regression. Normal macOS onboarding should move back to Docker/Colima in #3454 (`fix(onboard): use Docker driver on macOS`). This PR should not default `OPENSHELL_DRIVERS=vm`, should not add installer requirements for VM helper assets, and should not make VM the preferred macOS runtime. OpenShell #1375 has merged upstream and keeps VM driver selection opt-in. Once NemoClaw can consume that OpenShell release path, the durable direction is to rely on Docker/Colima for normal macOS onboarding and keep this VM shim only for explicit/legacy VM cases until it can be removed. ## Root Cause Earlier PR text blamed `#3441`. That was too narrow and is not accurate for the final fix. The reverted Docker bridge reachability probe was one visible blocker, but it was not the underlying `inference.local` failure, and that bridge-probe code is no longer part of this PR's final diff. The affected failure chain is a macOS VM-driver mismatch: - Ubuntu uses the Docker/Kubernetes sandbox path, where NemoClaw's legacy DNS proxy and bridge assumptions apply. - The affected macOS flow used OpenShell's VM driver, where sandbox networking is backed by the VM/gvproxy path rather than a Docker/Kubernetes gateway container. - The VM rootfs could end up with public DNS fallback resolvers (`8.8.8.8` / `8.8.4.4`). Those can resolve public hostnames, but they cannot resolve OpenShell/NemoClaw synthetic hostnames such as `inference.local`. - When `inference.local` failed, NemoClaw tried the legacy DNS repair path, which produced misleading gateway-container warnings instead of repairing VM DNS. - Separately, the VM driver can report the sandbox `Ready` before NemoClaw startup output appears. On macOS that allowed onboarding to detach before dashboard/Hermes/OpenClaw startup was actually observable. The Discord failures were downstream runtime issues exposed after the VM sandbox got far enough to run. Discord may use regional websocket hosts such as `gateway-us-east1-d.discord.gg`, and Hermes guild-only configuration without explicit user IDs must permit guild members instead of rejecting every Discord user as unauthorized. ## Tradeoff / Follow-up The DNS portion of this PR is intentionally a narrow emergency compatibility shim, not the ideal long-term owner boundary. It is Darwin + OpenShell VM-driver gated, best-effort, and disableable with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`, but it still depends on today's OpenShell VM rootfs layout, init-script shape, and gvproxy resolver IP (`192.168.127.1`). That is acceptable only as a compatibility bridge for explicit/legacy VM sandboxes. Durable follow-up is split by owner: - NemoClaw: #3454 restores normal macOS onboarding to Docker/Colima. - OpenShell: #1375 makes VM opt-in upstream; VM resolver setup should ultimately be OpenShell-owned rather than a NemoClaw rootfs patch. - Future OpenShell VM layouts, including ext4-style root disks, should be diagnosed clearly by NemoClaw but not mounted or rewritten from NemoClaw. ## Regression Risk - macOS VM path: intentional behavior change. The VM DNS patch is gated to `openshellDriver === "vm"` on Darwin, is best-effort, and can be disabled with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`. - Normal macOS Docker path: intentionally out of scope here and owned by #3454. This PR should not default macOS to VM. - Linux/Docker path: low risk. The VM DNS patch does not run for Docker sandboxes, and legacy DNS proxy repair remains available for non-VM drivers. - Discord policy: low risk. The change adds websocket-specific `*.discord.gg` handling with credential rewrite; it does not broadly open Discord REST beyond the existing Discord policy surface. - Messaging reuse: lower risk than before. Slack reuse now requires both `-slack-bridge` and `-slack-app`, avoiding partial provider reuse. ## Validation - `npm run build:cli` - `git diff --check` - `npx vitest run src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts test/sandbox-connect-inference.test.ts test/onboard.test.ts --fileParallelism=false` - Focused suite result on current head: 262 tests passed. - Manual macOS VM validation during debugging: `https://inference.local/v1/models` and chat completions returned 200 from inside the VM sandbox after the DNS patch. - Manual Discord validation during debugging: Hermes Discord responded after the regional gateway websocket policy was applied. - Current full nightly dispatch: https://github.com/NVIDIA/NemoClaw/actions/runs/25861533504 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Non-interactive onboarding can reuse stored messaging channels to speed setup * Added WebSocket support for Discord gateways with credential rewrite handling * Create-stream option to require startup output before considering sandboxes "ready" * **Bug Fixes** * Improved VM/macOS DNS setup and repair paths; refined sandbox driver selection * More robust inference-route repair behavior for sandboxes * **Tests** * Expanded tests for messaging reuse, VM DNS patching, sandbox creation/connect, and policy validation <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3445) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
## Summary Refreshes the NemoClaw documentation for the local `main` changes included in the 0.0.42 release. The update adds release notes, updates the affected user-facing setup and troubleshooting pages, bumps docs metadata to 0.0.42, and regenerates the matching user skills. ## Changes - #3537 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: Documented host-level status fields, cloudflared state-specific recovery hints, and Local Ollama auth proxy status diagnostics. - #3454 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`: Documented macOS Docker-driver onboarding and removed the expectation that standard macOS setup needs a VM driver helper. - #3514 -> `docs/inference/use-local-inference.md`: Documented compatible-endpoint retry behavior for reasoning-only smoke responses. - #3448 -> `docs/reference/commands.md`, `docs/manage-sandboxes/messaging-channels.md`: Documented canonical channel names and policy preset hints after `channels add`. - #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery and uninstall wording in the 0.0.42 release notes. - #3313 -> `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documented stronger dashboard port detection and rollback when a forward cannot start. - #3502 -> `docs/about/release-notes.md`: Captured batched onboarding policy preset application in the 0.0.42 release notes. - #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level Colima socket path. - #3421 -> `docs/about/release-notes.md`: Captured idempotent installer shim logging in the 0.0.42 release notes. - Updated `docs/project.json`, `docs/versions1.json`, and regenerated `.agents/skills/nemoclaw-user-*` outputs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - v0.0.42 * **Documentation** * Enhanced macOS onboarding guidance for Docker gateway setup * Improved dashboard port conflict handling with automatic rollback * Better local Ollama inference diagnostics and authentication proxy checks * Clarified status command output and recovery procedures * Refined messaging channel setup documentation * **Chores** * Version bump to 0.0.42 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3540) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
…3728) (#4221) ## Summary Fixes #3728. On macOS arm64 (and any other host where the Docker-driver gateway path is enabled), `getSandboxRuntimeRegistryFields` recorded `openshellDriver: "vm"` based purely on `process.platform === "darwin"`. That mismatched the runtime — OpenShell's Docker-driver gateway always starts with `OPENSHELL_DRIVERS=docker` (#3454) — and downstream code keyed off `openshellDriver === "vm"` to run the VM-only DNS monkeypatch and emit the misleading VM-driver warnings reported in the issue. This change records `"docker"` on every Docker-driver host. The VM-only log/warning paths are gated on `openshellDriver === "vm"`, so they now stay silent for macOS Docker-driver sandboxes. Legacy/opt-in sandboxes that were already written to disk with `openshellDriver: "vm"` still trigger the existing VM-only compatibility shim. ## Changes - `src/lib/onboard/sandbox-registry-metadata.ts` — drop the `process.platform === "darwin" ? "vm" : "docker"` branch; record `"docker"` whenever `isLinuxDockerDriverGatewayEnabled()` is true. - `src/lib/onboard/sandbox-registry-metadata.test.ts` (new) — unit tests asserting macOS Docker-driver → `"docker"`, Linux Docker-driver → `"docker"`, and legacy Linux → `"kubernetes"`. - `src/lib/onboard/vm-dns-monkeypatch.test.ts` — regression test that exercises the real `applyOpenShellVmDnsMonkeypatch` with `openshellDriver: "docker"` on a mocked darwin platform and verifies the onboard wrapper emits no logs or warnings. ## Test plan - [x] `npm run typecheck:cli` - [x] `npm run build:cli` - [x] `npx vitest run src/lib/onboard/sandbox-registry-metadata.test.ts src/lib/onboard/vm-dns-monkeypatch.test.ts src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts` — 18/18 pass - [x] `cd nemoclaw && npm run build && npm test` — 457/457 pass - [x] `npx @biomejs/biome check` clean on touched files - [x] Linux host can't reproduce the macOS-specific behavior directly, so the regression is covered by mocking `process.platform` (allowed by the issue brief). Signed-off-by: Yimo Jiang <yimoj@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved Docker and Kubernetes driver selection for sandbox runtime configuration. * Fixed DNS monkeypatch handling on macOS Docker-driver sandboxes. * Corrected platform-specific driver assignment logic for Linux and macOS environments. * **Tests** * Added comprehensive test coverage for driver selection across different platforms and configurations. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4221?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
openshell-driver-vmfor macOS onboarding; macOS only requiresopenshell-gatewayValidation
npm run build:clinpx vitest run src/lib/onboard/docker-driver-gateway-runtime-marker.test.ts src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-launch.test.ts test/install-openshell-version-check.test.tsnpx vitest run test/onboard.test.ts -t "models the OpenShell standalone gateway environment|requires platform-specific standalone gateway binaries|detects VM-driver children attached to a macOS standalone gateway|detects stale Docker-driver gateway runtime state before reuse"2ba746a0925905d509f07b9b015fe1469744ac75:test/e2e/test-full-e2e.sh, 18 passed / 0 failed / 0 skippedopenclaw agentanswered6×7=42throughopenclaw -> inference.localinference.local, logs, and manifest regression checks passed/v1/chat/completionsreturned content42, modelnvidia/nemotron-3-super-120b-a12b,finish_reason=stopopenshell-docker-gateway/runtime.jsonproducedExisting OpenShell Docker-driver gateway is stale (missing Docker-driver runtime marker); it will be recreated.and then✓ Docker-driver gateway is healthychecks,macos-e2e,wsl-e2e,onboard-entrypoint-budget, CodeQL, ShellCheck, DCO, commit-lint, installer hash, legacy-path guard, CLI parity, and Pi semantic E2E recommendation.Notes
NEMOCLAW_E2E_KEEP_SANDBOX=1to run an extra through-agent inference probe before cleanup, so the script's cleanup assertion reported the kept sandbox as expected. The product checks and through-agent inference passed, then the sandbox/gateway runtime state was cleaned up.--no-verifyafter a local linked-worktree hook path left Git metadata pointed at a temporary fixture. The explicit local validation above and PR CI both passed on the pushed head.Fixes #3467.
Summary by CodeRabbit