
fix(onboard): use Docker driver on macOS #3454

Merged
cv merged 9 commits into main from fix/macos-colima-docker-driver
May 14, 2026


@ericksoa

@ericksoa ericksoa commented May 13, 2026

Contributor

Summary

  • switch macOS standalone onboarding from the OpenShell VM driver to the Docker driver so Colima/Docker Desktop follow the same compute path as Linux
  • stop requiring/downloading/signing openshell-driver-vm for macOS onboarding; macOS only requires openshell-gateway
  • avoid reusing an already-running VM-backed or unmarked standalone gateway when macOS onboarding now expects the Docker driver
  • keep stale VM gateway env keys out of regenerated Docker-driver env files and leave VM chmod/DNS-proxy compatibility paths disabled for Docker-driver macOS
  • add a Docker-driver gateway runtime marker so stale Colima/Docker state is detected and recreated instead of silently reused
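
The runtime-marker check described in the last bullet could look roughly like the sketch below. The marker shape (`driver`, `dockerHost`) and the function name `describeStaleGateway` are assumptions made for illustration; the PR text only confirms that a missing `runtime.json` is reported as a stale gateway and triggers recreation.

```typescript
// Hypothetical marker shape; the real runtime.json fields are not shown in this PR.
interface GatewayRuntimeMarker {
  driver: string;
  dockerHost: string;
}

// Returns a reason when the existing gateway should be recreated,
// or null when it looks safe to reuse.
function describeStaleGateway(
  markerJson: string | null,
  expected: GatewayRuntimeMarker,
): string | null {
  if (markerJson === null) {
    // Wording taken from the remediation probe output quoted in this PR.
    return "missing Docker-driver runtime marker";
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(markerJson);
  } catch {
    return "unreadable Docker-driver runtime marker";
  }
  if (typeof parsed !== "object" || parsed === null) {
    return "unreadable Docker-driver runtime marker";
  }
  const marker = parsed as Partial<GatewayRuntimeMarker>;
  if (marker.driver !== expected.driver) {
    return "gateway was started with a different driver";
  }
  if (marker.dockerHost !== expected.dockerHost) {
    return "DOCKER_HOST changed since the gateway started";
  }
  return null;
}
```

A caller would pass the file contents (or `null` when the file does not exist) and recreate the gateway whenever the result is non-null.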

Validation

  • Local focused validation:
    • npm run build:cli
    • npx vitest run src/lib/onboard/docker-driver-gateway-runtime-marker.test.ts src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-launch.test.ts test/install-openshell-version-check.test.ts
    • npx vitest run test/onboard.test.ts -t "models the OpenShell standalone gateway environment|requires platform-specific standalone gateway binaries|detects VM-driver children attached to a macOS standalone gateway|detects stale Docker-driver gateway runtime state before reuse"
  • Independent Colima/Docker validation on macOS at 2ba746a0925905d509f07b9b015fe1469744ac75:
    • OpenClaw full E2E: test/e2e/test-full-e2e.sh, 18 passed / 0 failed / 0 skipped
    • OpenClaw live inference: openclaw agent answered 6×7=42 through openclaw -> inference.local
    • Hermes full E2E product path: Hermes health, binary/config/state, direct live API, sandbox routing through inference.local, logs, and manifest regression checks passed
    • Hermes through-agent live inference: /v1/chat/completions returned content 42, model nvidia/nemotron-3-super-120b-a12b, finish_reason=stop
    • Runtime-marker remediation probe: removing openshell-docker-gateway/runtime.json produced the message "Existing OpenShell Docker-driver gateway is stale (missing Docker-driver runtime marker); it will be recreated.", followed by "✓ Docker-driver gateway is healthy".
  • PR CI at current head is green, including checks, macos-e2e, wsl-e2e, onboard-entrypoint-budget, CodeQL, ShellCheck, DCO, commit-lint, installer hash, legacy-path guard, CLI parity, and Pi semantic E2E recommendation.

Notes

  • The independent Hermes E2E command intentionally used NEMOCLAW_E2E_KEEP_SANDBOX=1 to run an extra through-agent inference probe before cleanup, so the script's cleanup assertion reported the kept sandbox as expected. The product checks and through-agent inference passed, then the sandbox/gateway runtime state was cleaned up.
  • The local commit and push used --no-verify because a local linked-worktree hook path had left Git metadata pointing at a temporary fixture. The explicit local validation above and PR CI both passed on the pushed head.

Fixes #3467.

Summary by CodeRabbit

  • Bug Fixes
    • Installation/upgrade validation tightened: upgrades now fail if required gateway binaries or messaging support are missing; Linux still requires gateway + sandbox. macOS installer paths no longer request VM helper assets for aarch64.
  • Refactor
    • Unified Docker-driver gateway behavior across platforms; macOS no longer requires VM driver assets or repair steps and gains improved detection of legacy VM helper processes.
  • Tests
    • Updated regression and unit tests to reflect gateway-only macOS flows and Docker-rootfs expectations.
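
The tightened binary validation summarized above (gateway required everywhere, sandbox only on Linux) can be pictured with a small sketch. This is not the repository's `areRequiredDockerDriverBinariesPresent()`; the helper names are hypothetical, and `openshell-sandbox` is an assumed binary name (only `openshell-gateway` and `openshell-driver-vm` are named in this PR).

```typescript
const GATEWAY_BIN = "openshell-gateway"; // named in the PR
const SANDBOX_BIN = "openshell-sandbox"; // assumed name for the Linux sandbox binary

// The gateway is required on every platform; the sandbox binary only on Linux.
function requiredBinaries(platform: string): string[] {
  return platform === "linux" ? [GATEWAY_BIN, SANDBOX_BIN] : [GATEWAY_BIN];
}

// Lists required binaries that are absent from the installed set.
function missingBinaries(platform: string, installed: string[]): string[] {
  return requiredBinaries(platform).filter((name) => installed.indexOf(name) === -1);
}
```

With this shape, a macOS host that only has the gateway installed reports nothing missing, and the VM driver never appears in the required set.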


@ericksoa self-assigned this on May 13, 2026
@github-actions

github-actions Bot commented May 13, 2026

Contributor

E2E Advisor Recommendation

Required E2E: openshell-gateway-upgrade-e2e, macos-e2e, e2e-branch-validation:full, sandbox-survival-e2e, double-onboard-e2e
Optional E2E: gateway-drift-preflight-e2e, onboard-resume-e2e, onboard-repair-e2e, cloud-onboard-e2e, wsl-e2e

Dispatch hint: openshell-gateway-upgrade-e2e,sandbox-survival-e2e,double-onboard-e2e

Auto-dispatched E2E: openshell-gateway-upgrade-e2e, sandbox-survival-e2e, double-onboard-e2e via nightly-e2e.yaml at e2227182ab3052bdd739915be2bc5c01d948f85d

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • openshell-gateway-upgrade-e2e (high): Directly touched and highly relevant: validates old OpenShell/NemoClaw gateway upgrade, backup/restore survival, and includes installer regressions for macOS Darwin gateway asset selection and VM-driver removal.
  • macos-e2e (medium-high): Required because the PR changes macOS OpenShell install/onboard behavior from VM-driver-specific paths to Docker-driver gateway behavior. This is the only existing platform E2E that can exercise the macOS user flow end to end when Docker is available.
  • e2e-branch-validation:full (high): Onboard and gateway lifecycle code changed substantially. The full branch-validation suite installs from source on a clean Linux environment, onboards a sandbox, and verifies the real assistant flow before merge.
  • sandbox-survival-e2e (medium-high): Gateway process reuse, drift detection, runtime marker writing/clearing, and stop/start behavior changed. This job validates sandbox survival across gateway restart/recovery with a live sandbox.
  • double-onboard-e2e (high): The PR changes second-run/reuse behavior for existing Docker-driver gateways and clears runtime files on stale PID/process detection. Double onboard is the closest existing real-flow coverage for repeated onboarding/lifecycle recovery.

Optional E2E

  • gateway-drift-preflight-e2e (low-medium): Adjacent regression coverage for stale gateway/image drift and fail-closed behavior. Useful because this PR adds Docker-driver runtime marker drift detection, but the existing test focuses on schema/image drift rather than the new marker file.
  • onboard-resume-e2e (medium-high): Useful adjacent confidence for onboarding state/session recovery after changes to sandbox metadata and gateway reuse logic, but less directly targeted than double-onboard and sandbox-survival.
  • onboard-repair-e2e (medium-high): Useful adjacent coverage for recovery from broken onboard/gateway state after lifecycle changes, but not strictly merge-blocking if the required lifecycle tests pass.
  • cloud-onboard-e2e (medium-high): Public install/onboard path with policy and inference.local checks is useful extra confidence for installer/onboard changes, especially if branch-validation full is unavailable.
  • wsl-e2e (high): The Linux Docker-driver path also affects WSL-style Linux environments. Run if risk tolerance requires additional platform coverage beyond Ubuntu and macOS.

New E2E recommendations

  • macOS Docker-driver gateway runtime marker and stale reuse (high): Existing macOS E2E may skip the full test when Docker is unavailable, and the touched unit tests do not prove real macOS Docker/Colima behavior. Add an E2E that starts the Docker-driver gateway on macOS arm64, verifies runtime.json and PID files are written with owner-only permissions, verifies no openshell-driver-vm child is required, reruns onboarding to reuse the gateway, then changes DOCKER_HOST or gateway env and confirms stale marker detection recreates the gateway.
    • Suggested test: Add a macOS arm64 Docker-driver gateway marker/reuse E2E scenario or job using Docker/Colima with assertions for runtime marker write, reuse, drift, cleanup, and absence of VM-driver dependency.
  • Darwin installer asset matrix (medium): test-openshell-gateway-upgrade.sh includes mocked shell checks for Darwin gateway asset selection, but there is no dedicated matrix job that validates real release download/install behavior on macOS arm64 with current OpenShell artifacts.
    • Suggested test: Add a macOS installer asset E2E that runs scripts/install-openshell.sh against the real release channel on a clean macOS arm64 runner and asserts openshell plus openshell-gateway are installed while openshell-driver-vm is not required.
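
The "owner-only permissions" assertion suggested for `runtime.json` and the PID files comes down to a mode-bit check. The sketch below is a minimal illustration with hypothetical helper names; it takes the numeric mode as input so it stays independent of how the file is inspected.

```typescript
// True when no group or other permission bits are set (for example 0o600 or 0o700).
function isOwnerOnly(mode: number): boolean {
  return (mode & 0o077) === 0;
}

// Returns the paths whose modes grant any access beyond the owner.
function findOverlyPermissive(files: { path: string; mode: number }[]): string[] {
  return files.filter((file) => !isOwnerOnly(file.mode)).map((file) => file.path);
}
```

In a Node.js test, the mode could come from `fs.statSync(path).mode`; the file-type bits in that value do not affect the mask.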

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: openshell-gateway-upgrade-e2e,sandbox-survival-e2e,double-onboard-e2e

@coderabbitai

coderabbitai Bot commented May 13, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Consolidates macOS onboarding to prefer the Docker driver: removes VM-driver runtime dependency, updates gateway env construction and drift checks, narrows installer/upgrade validation to gateway-only artifacts, adds ps-based VM-child-process detection, and updates unit and E2E tests.
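
As an illustration of the env-construction change, the sketch below regenerates a `KEY=value` env file while dropping the two VM-driver keys named in this PR. The function name `regenerateEnvFile` and the file format are assumptions; the actual env-file writer is not shown here.

```typescript
// VM-driver keys that this PR says must not survive into Docker-driver env files.
const STALE_VM_ENV_KEYS = ["OPENSHELL_VM_DRIVER_STATE_DIR", "OPENSHELL_DRIVER_DIR"];

// Keeps existing KEY=value entries except stale VM keys, then applies Docker-driver values.
function regenerateEnvFile(existing: string, dockerEnv: { [key: string]: string }): string {
  const merged: { [key: string]: string } = {};
  for (const line of existing.split("\n")) {
    const eq = line.indexOf("=");
    // Skip blank lines, malformed lines, and comments.
    if (eq <= 0 || line.trim().charAt(0) === "#") continue;
    const key = line.slice(0, eq).trim();
    if (STALE_VM_ENV_KEYS.indexOf(key) === -1) merged[key] = line.slice(eq + 1);
  }
  for (const key of Object.keys(dockerEnv)) merged[key] = dockerEnv[key];
  return Object.keys(merged).map((key) => key + "=" + merged[key]).join("\n") + "\n";
}
```

Comments in the original file are dropped in this sketch, which may or may not match the real writer's behavior.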

Changes

Docker-driver gateway transition for macOS

| Layer | File(s) | Summary |
|---|---|---|
| Docker-driver gateway environment contract and options | src/lib/onboard/docker-driver-gateway-env.ts | DOCKER_DRIVER_GATEWAY_RUNTIME_ENV_KEYS adds OPENSHELL_VM_DRIVER_STATE_DIR and OPENSHELL_DRIVER_DIR. BuildDockerDriverGatewayEnvOptions removes resolveVmDriverBin. buildDockerDriverGatewayEnv always selects Docker driver, sets OPENSHELL_GRPC_ENDPOINT to the Docker endpoint, and initializes Docker supervisor/network vars unconditionally. |
| Onboard integration, runtime drift, and preflight updates | src/lib/onboard.ts | getDockerDriverGatewayEnv() passes gateway state dir, docker network, and supervisor-image resolver; VM-driver resolver removed. Adds VM-child-process detection usage and reports dedicated drift when a VM child is attached on darwin. Preflight and DNS/proxy gating use Docker-driver-enabled predicate; sandbox runtime records openshellDriver: "docker" when enabled; patchStagedDockerfile chmod-compat forced off. |
| VM-driver process utilities | src/lib/onboard/vm-driver-process.ts | Adds PROCESS_LIST_ARGS, ProcessListCapture type, hasOpenShellVmDriverChildProcessFromPsOutput(...), and hasOpenShellVmDriverChildProcess(...) to parse ps output and detect openshell-driver-vm child processes. |
| Required-binaries validation simplification | src/lib/onboard/openshell-install.ts | areRequiredDockerDriverBinariesPresent() no longer requires resolveOpenShellVmDriverBinary; validation now requires gateway universally and sandbox only on Linux; missing-binaries message for non-Linux reports gateway-only. |
| Installer/upgrade script and macOS asset changes | scripts/install-openshell.sh | macOS/installer logic now requires only openshell-gateway on Darwin; stable-channel upgrade flow removes macOS VM-driver repair path, gates on required binaries and messaging-credential-rewrite support (missing rewrite support now fails), and omits macOS aarch64 openshell-driver-vm artifact from download/checksum expectations. |
| Unit tests for gateway env and onboard integration | src/lib/onboard/docker-driver-gateway-env.test.ts, test/onboard.test.ts, test/shellquote-sandbox.test.ts | Tests updated to pass resolveSandboxBin and assert OPENSHELL_DOCKER_SUPERVISOR_BIN; added macOS test that validates Docker-driver env without VM helper state; env-file writer tests ensure stale VM-driver keys removed; onboard tests import process-list helper and assert Docker-driver darwin expectations (localhost gRPC, supervisor image tag, no VM state dir); sandbox arch override added to a test helper. |
| Install-version-check tests for macOS gateway-only flows | test/install-openshell-version-check.test.ts | macOS tests updated to accept gateway-only installed scenario; entitlement/codesign expectations assert VM-driver entitlements/signing are not required; reinstall asset test narrowed to gateway-only; VM driver tarball asserted not downloaded. |
| E2E gateway upgrade regression test updates | test/e2e/test-openshell-gateway-upgrade.sh | macOS installer regression asserts gateway-asset-only fetch; VM-driver entitlement regression changed to "not required"; rootfs-permission regression renamed/retooled to Docker-focused checks; invocations updated to run the new tests. |
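
The ps-based detection in the "VM-driver process utilities" row can be sketched as follows. This is an illustrative stand-in, not the repository's `hasOpenShellVmDriverChildProcessFromPsOutput(...)`: the names `parsePsOutput` and `hasVmDriverChild` are hypothetical, and the input is assumed to have `pid ppid command` columns (for example from `ps -axo pid=,ppid=,command=`), which may differ from the real PROCESS_LIST_ARGS.

```typescript
interface ProcessRow {
  pid: number;
  ppid: number;
  command: string;
}

// Parses lines shaped like "<pid> <ppid> <command...>"; other lines are ignored.
function parsePsOutput(psOutput: string): ProcessRow[] {
  const rows: ProcessRow[] = [];
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(\d+)\s+(.+)$/.exec(line);
    if (match) {
      rows.push({ pid: Number(match[1]), ppid: Number(match[2]), command: match[3] });
    }
  }
  return rows;
}

// True when a direct child of the gateway runs an executable named openshell-driver-vm.
function hasVmDriverChild(psOutput: string, gatewayPid: number): boolean {
  return parsePsOutput(psOutput).some((row) => {
    const executable = row.command.split(/\s+/)[0];
    const baseName = executable.slice(executable.lastIndexOf("/") + 1);
    return row.ppid === gatewayPid && baseName === "openshell-driver-vm";
  });
}
```

Matching on the executable's base name rather than a substring avoids false positives from unrelated commands (such as a `grep`) that merely mention the driver name. Only direct children are considered in this sketch.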

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3323: Tightens OpenShell messaging-credential-rewrite validation used to gate reinstall flows.
  • NVIDIA/NemoClaw#3429: Updates e2e fake OpenShell stub and credential-rewrite markers that align with installer gating changes.
  • NVIDIA/NemoClaw#3383: Overlapping macOS Docker-driver onboarding and installer asset selection changes.

Suggested labels

OpenShell, Sandbox

Suggested reviewers

  • jyaunches
  • cv
  • prekshivyas

"🐰
A gateway wakes on macOS, docker first, not VM,
I hop through ps lines to see what's attached to thee.
Tests now check only gateway tiles, installers fetch the right art,
Env keys pruned and drift alarms call out a clinging VM part.
Hooray — onboarding runs with Docker at the heart!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: switching macOS onboarding from the OpenShell VM driver to the Docker driver. |
| Linked Issues check | ✅ Passed | The PR addresses issue #3467 by stopping the hard-coding of OPENSHELL_DRIVERS=vm on macOS and ensuring Docker driver is used. Changes reflect all coding-level objectives. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to switching macOS onboarding from VM to Docker driver. No unrelated modifications detected. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
test/onboard.test.ts (1)

454-457: ⚡ Quick win

Add an explicit guard for OPENSHELL_DRIVER_DIR removal in the macOS Docker-driver env test.

This block already checks stale VM env cleanup; asserting OPENSHELL_DRIVER_DIR too would lock the intended contract more completely.

Suggested assertion
     expect(darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_IMAGE).toContain(":0.0.37");
     expect(darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_BIN).toBeUndefined();
     expect(darwinEnv.OPENSHELL_VM_DRIVER_STATE_DIR).toBeUndefined();
+    expect(darwinEnv.OPENSHELL_DRIVER_DIR).toBeUndefined();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/onboard.test.ts` around lines 454 - 457, Add an explicit assertion to
the macOS Docker-driver env test to guard that OPENSHELL_DRIVER_DIR is removed:
in the same block that asserts darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_IMAGE,
darwinEnv.OPENSHELL_DOCKER_SUPERVISOR_BIN, and
darwinEnv.OPENSHELL_VM_DRIVER_STATE_DIR, also assert
darwinEnv.OPENSHELL_DRIVER_DIR is undefined so the test enforces removal of the
driver directory environment variable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 682c64e0-4981-4269-9953-47f40d533fcc

📥 Commits

Reviewing files that changed from the base of the PR and between c517d62 and c0ec69d.

📒 Files selected for processing (8)
  • scripts/install-openshell.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.test.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/openshell-install.ts
  • test/e2e/test-openshell-gateway-upgrade.sh
  • test/install-openshell-version-check.test.ts
  • test/onboard.test.ts

@wscurran added labels Docker and platform: macos (Affects macOS, including Apple Silicon) and removed NemoClaw CLI on May 13, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3796-3800: The current macOS/darwin/arm64 branches still branch on
isLinuxDockerDriverGatewayEnabled() or hardcode Docker, causing Podman/VM
selections (via OPENSHELL_DRIVERS) to be ignored; update the checks around the
host.runtime === "podman" early-exit and the other noted sites (around lines
~4825 and ~5793 and ~6120) to branch on the resolved driver variable (e.g.,
resolvedDriver or the function that returns the final driver selection) instead
of isLinuxDockerDriverGatewayEnabled()/false or implicit Docker assumptions, so
that the code uses the actual selected driver for preflight, registry metadata,
and sandbox creation paths (VM DNS/chmod) across the onboarding flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08c71fdc-8175-48af-9d74-c2b9709535d7

📥 Commits

Reviewing files that changed from the base of the PR and between ffc8ff4 and 9a73c1e.

📒 Files selected for processing (6)
  • scripts/install-openshell.sh
  • src/lib/onboard.ts
  • src/lib/onboard/openshell-install.ts
  • test/install-openshell-version-check.test.ts
  • test/onboard.test.ts
  • test/shellquote-sandbox.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/install-openshell-version-check.test.ts
  • src/lib/onboard/openshell-install.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts
Comment on lines +3796 to 3800
if (isLinuxDockerDriverGatewayEnabled() && host.runtime === "podman") {
  console.error(" ✗ NemoClaw onboarding now uses OpenShell's Docker driver.");
  console.error(" Podman is not supported for this NemoClaw integration path.");
  console.error(" Switch to Docker Engine and rerun onboarding.");
  process.exit(1);

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the resolved driver for these macOS branches.

These branches still equate darwin/arm64 with Docker. If a user exports OPENSHELL_DRIVERS=vm, preflight still rejects Podman, registry metadata still says "docker", and sandbox creation still disables the VM DNS/chmod paths. Please branch on the resolved driver selection here instead of isLinuxDockerDriverGatewayEnabled()/false. As per coding guidelines: src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.

Also applies to: 4825-4825, 5793-5795, 6120-6125
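
The reviewer's suggestion to branch on the resolved driver might be sketched like this. Everything here is hypothetical: `resolveDriver` and `shouldRejectPodman` are illustrative names, OPENSHELL_DRIVERS is assumed to hold a single value, and Docker is assumed to be the default. It shows the shape of the suggestion, not the code that was merged.

```typescript
type OpenShellDriver = "docker" | "vm";

// Honors an explicit OPENSHELL_DRIVERS=vm; anything else falls back to Docker (assumed default).
function resolveDriver(env: { [key: string]: string | undefined }): OpenShellDriver {
  const requested = (env["OPENSHELL_DRIVERS"] || "").trim().toLowerCase();
  return requested === "vm" ? "vm" : "docker";
}

// Podman is rejected only when the resolved driver is Docker.
function shouldRejectPodman(
  env: { [key: string]: string | undefined },
  hostRuntime: string,
): boolean {
  return resolveDriver(env) === "docker" && hostRuntime === "podman";
}
```

The same resolved value could then feed the registry metadata and the VM DNS/chmod gating, so all four call sites agree on one driver.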


@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25859343406
Target ref: 9a73c1eeb69f6ef2bdd53aef488782f6c58f4c90
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
openshell-gateway-upgrade-e2e ⚠️ cancelled

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25859629562
Target ref: 0c4140dbff34999774ba11c0c6ac1e8b1254c61c
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
openshell-gateway-upgrade-e2e ✅ success

@ericksoa

Contributor Author

Manual macOS Colima/Docker validation

Validated PR head 0c4140dbff34999774ba11c0c6ac1e8b1254c61c on macOS with Colima using the Docker-driver path (OPENSHELL_DRIVERS=docker, DOCKER_HOST=unix://$HOME/.colima/default/docker.sock) after cleaning the previous OpenShell/NemoClaw install state.

Results:

  • OpenClaw full E2E: test/e2e/test-full-e2e.sh passed, including sandbox creation, Docker-driver gateway startup, and live openclaw agent inference through inference.local (18 passed, 0 failed).
  • Hermes full E2E: test/e2e/test-hermes-e2e.sh passed with fresh install, Docker-driver gateway, sandbox Ready, Hermes health OK, and live inference.local routing (26 passed, 0 failed).
  • Hermes through-agent inference: additional focused check executed inside the Hermes sandbox against http://localhost:8642/v1/chat/completions; model nvidia/nemotron-3-super-120b-a12b returned PONG with finish_reason=stop.

Observed during the focused Hermes check:

  • OpenShell: 0.0.39 (docker)
  • Agent: Hermes Agent v2026.4.23
  • The successful Colima runs used the OpenShell Docker-driver gateway, not the VM driver path.

Current PR checks are green, including macos-e2e and the required openshell-gateway-upgrade-e2e selective E2E.

@ericksoa added labels v0.0.42, integration: openclaw (OpenClaw integration behavior), integration: hermes (Hermes integration behavior), and platform: arm64 (Affects ARM64 or aarch64 architecture) on May 14, 2026
@cv cv merged commit e4a2f93 into main May 14, 2026
25 checks passed
@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25876302080
Target ref: e2227182ab3052bdd739915be2bc5c01d948f85d
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e,sandbox-survival-e2e,double-onboard-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
double-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ✅ success

ericksoa added a commit that referenced this pull request May 14, 2026
## Summary
- Restore affected macOS OpenShell VM sandboxes by treating the VM
driver as a separate compatibility path from Docker/Kubernetes.
- Patch the macOS VM sandbox rootfs DNS to the gvproxy resolver
(`192.168.127.1`) so `inference.local` resolves from inside VM
sandboxes.
- Skip legacy Kubernetes/Docker DNS-proxy repair only for VM sandboxes
and fall back to OpenShell route reapply when appropriate.
- Gate VM sandbox-create early detach on NemoClaw startup output so
`Ready` alone cannot advance onboarding before the sandbox startup
command is actually running.
- Fix downstream Hermes/Discord issues exposed by the VM path:
locked-aware non-root Hermes config verification, guild-only Discord
authorization, regional `*.discord.gg` websocket policy, and stricter
Slack provider reuse checks.
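
The early-detach gate in the fourth bullet reduces to a two-condition check. A minimal sketch follows; the state shape and the `canDetach` name are hypothetical, not taken from the NemoClaw source.

```typescript
// Hypothetical view of what the sandbox-create stream has reported so far.
interface CreateStreamState {
  ready: boolean; // the driver reported the sandbox as Ready
  startupOutputSeen: boolean; // NemoClaw startup output has appeared
}

// When startup output is required, Ready alone is not enough to detach.
function canDetach(state: CreateStreamState, requireStartupOutput: boolean): boolean {
  return state.ready && (!requireStartupOutput || state.startupOutputSeen);
}
```

For VM sandboxes the caller would pass `requireStartupOutput = true`, so a premature `Ready` keeps onboarding attached until the startup command is observably running.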

## Direction / Scope Guardrail
This PR is **not** the strategic macOS driver direction. It is a narrow
compatibility bridge for already-created or explicitly selected
OpenShell VM sandboxes while NemoClaw is pinned to the OpenShell
behavior that exposed this regression.

Normal macOS onboarding should move back to Docker/Colima in #3454
(`fix(onboard): use Docker driver on macOS`). This PR should not default
`OPENSHELL_DRIVERS=vm`, should not add installer requirements for VM
helper assets, and should not make VM the preferred macOS runtime.

OpenShell #1375 has merged upstream and keeps VM driver selection
opt-in. Once NemoClaw can consume that OpenShell release path, the
durable direction is to rely on Docker/Colima for normal macOS
onboarding and keep this VM shim only for explicit/legacy VM cases until
it can be removed.

## Root Cause
Earlier PR text blamed `#3441`. That was too narrow and is not accurate
for the final fix. The reverted Docker bridge reachability probe was one
visible blocker, but it was not the underlying `inference.local`
failure, and that bridge-probe code is no longer part of this PR's final
diff.

The affected failure chain is a macOS VM-driver mismatch:

- Ubuntu uses the Docker/Kubernetes sandbox path, where NemoClaw's
legacy DNS proxy and bridge assumptions apply.
- The affected macOS flow used OpenShell's VM driver, where sandbox
networking is backed by the VM/gvproxy path rather than a
Docker/Kubernetes gateway container.
- The VM rootfs could end up with public DNS fallback resolvers
(`8.8.8.8` / `8.8.4.4`). Those can resolve public hostnames, but they
cannot resolve OpenShell/NemoClaw synthetic hostnames such as
`inference.local`.
- When `inference.local` failed, NemoClaw tried the legacy DNS repair
path, which produced misleading gateway-container warnings instead of
repairing VM DNS.
- Separately, the VM driver can report the sandbox `Ready` before
NemoClaw startup output appears. On macOS that allowed onboarding to
detach before dashboard/Hermes/OpenClaw startup was actually observable.
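
To make the DNS part of this chain concrete, here is a minimal sketch of swapping public fallback resolvers for the gvproxy resolver. It assumes the rootfs uses a conventional `resolv.conf`-style file, and `patchResolvConf` is a hypothetical name; the actual patch in this PR may work differently.

```typescript
const GVPROXY_RESOLVER = "192.168.127.1"; // resolver IP stated in this PR

// Replaces all nameserver entries with the given resolver and keeps other directives.
function patchResolvConf(content: string, resolver: string = GVPROXY_RESOLVER): string {
  const otherDirectives = content
    .split("\n")
    .filter((line) => line.trim() !== "" && !/^\s*nameserver\s/.test(line));
  return ["nameserver " + resolver].concat(otherDirectives).join("\n") + "\n";
}
```

With only the gvproxy resolver listed, lookups for synthetic names such as `inference.local` are no longer sent to `8.8.8.8` / `8.8.4.4`.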

The Discord failures were downstream runtime issues exposed after the VM
sandbox got far enough to run. Discord may use regional websocket hosts
such as `gateway-us-east1-d.discord.gg`, and Hermes guild-only
configuration without explicit user IDs must permit guild members
instead of rejecting every Discord user as unauthorized.

## Tradeoff / Follow-up
The DNS portion of this PR is intentionally a narrow emergency
compatibility shim, not the ideal long-term owner boundary. It is Darwin
+ OpenShell VM-driver gated, best-effort, and disableable with
`NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`, but it still depends on today's
OpenShell VM rootfs layout, init-script shape, and gvproxy resolver IP
(`192.168.127.1`). That is acceptable only as a compatibility bridge for
explicit/legacy VM sandboxes.

Durable follow-up is split by owner:

- NemoClaw: #3454 restores normal macOS onboarding to Docker/Colima.
- OpenShell: #1375 makes VM opt-in upstream; VM resolver setup should
ultimately be OpenShell-owned rather than a NemoClaw rootfs patch.
- Future OpenShell VM layouts, including ext4-style root disks, should
be diagnosed clearly by NemoClaw but not mounted or rewritten from
NemoClaw.

## Regression Risk
- macOS VM path: intentional behavior change. The VM DNS patch is gated
to `openshellDriver === "vm"` on Darwin, is best-effort, and can be
disabled with `NEMOCLAW_DISABLE_VM_DNS_MONKEYPATCH=1`.
- Normal macOS Docker path: intentionally out of scope here and owned by
#3454. This PR should not default macOS to VM.
- Linux/Docker path: low risk. The VM DNS patch does not run for Docker
sandboxes, and legacy DNS proxy repair remains available for non-VM
drivers.
- Discord policy: low risk. The change adds websocket-specific
`*.discord.gg` handling with credential rewrite; it does not broadly
open Discord REST beyond the existing Discord policy surface.
- Messaging reuse: lower risk than before. Slack reuse now requires both
`-slack-bridge` and `-slack-app`, avoiding partial provider reuse.

## Validation
- `npm run build:cli`
- `git diff --check`
- `npx vitest run src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts
test/sandbox-connect-inference.test.ts test/onboard.test.ts
--fileParallelism=false`
- Focused suite result on current head: 262 tests passed.
- Manual macOS VM validation during debugging:
`https://inference.local/v1/models` and chat completions returned 200
from inside the VM sandbox after the DNS patch.
- Manual Discord validation during debugging: Hermes Discord responded
after the regional gateway websocket policy was applied.
- Current full nightly dispatch:
https://github.com/NVIDIA/NemoClaw/actions/runs/25861533504

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Non-interactive onboarding can reuse stored messaging channels to speed setup
  * Added WebSocket support for Discord gateways with credential rewrite handling
  * Create-stream option to require startup output before considering sandboxes "ready"

* **Bug Fixes**
  * Improved VM/macOS DNS setup and repair paths; refined sandbox driver selection
  * More robust inference-route repair behavior for sandboxes

* **Tests**
  * Expanded tests for messaging reuse, VM DNS patching, sandbox creation/connect, and policy validation

<!-- review_stack_entry_start -->


<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@miyoungc miyoungc mentioned this pull request May 14, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 14, 2026
## Summary
Refreshes the NemoClaw documentation for the local `main` changes
included in the 0.0.42 release. The update adds release notes, updates
the affected user-facing setup and troubleshooting pages, bumps docs
metadata to 0.0.42, and regenerates the matching user skills.

## Changes
- #3537 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`: Documented host-level status
fields, cloudflared state-specific recovery hints, and Local Ollama auth
proxy status diagnostics.
- #3454 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`: Documented macOS Docker-driver
onboarding and removed the expectation that standard macOS setup needs a
VM driver helper.
- #3514 -> `docs/inference/use-local-inference.md`: Documented
compatible-endpoint retry behavior for reasoning-only smoke responses.
- #3448 -> `docs/reference/commands.md`,
`docs/manage-sandboxes/messaging-channels.md`: Documented canonical
channel names and policy preset hints after `channels add`.
- #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery
and uninstall wording in the 0.0.42 release notes.
- #3313 -> `docs/get-started/quickstart.md`,
`docs/reference/troubleshooting.md`: Documented stronger dashboard port
detection and rollback when a forward cannot start.
- #3502 -> `docs/about/release-notes.md`: Captured batched onboarding
policy preset application in the 0.0.42 release notes.
- #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level
Colima socket path.
- #3421 -> `docs/about/release-notes.md`: Captured idempotent installer
shim logging in the 0.0.42 release notes.
- Updated `docs/project.json`, `docs/versions1.json`, and regenerated
`.agents/skills/nemoclaw-user-*` outputs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes - v0.0.42

* **Documentation**
  * Enhanced macOS onboarding guidance for Docker gateway setup
  * Improved dashboard port conflict handling with automatic rollback
  * Better local Ollama inference diagnostics and authentication proxy checks
  * Clarified status command output and recovery procedures
  * Refined messaging channel setup documentation

* **Chores**
  * Version bump to 0.0.42


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
cv added a commit that referenced this pull request May 27, 2026
…3728) (#4221)

## Summary

Fixes #3728. On macOS arm64 (and any other host where the Docker-driver
gateway path is enabled), `getSandboxRuntimeRegistryFields` recorded
`openshellDriver: "vm"` based purely on `process.platform === "darwin"`.
That mismatched the runtime, because OpenShell's Docker-driver gateway always
starts with `OPENSHELL_DRIVERS=docker` (#3454), and downstream code keyed off
`openshellDriver === "vm"` to run the VM-only DNS monkeypatch and emit the
misleading VM-driver warnings reported in the issue.

This change records `"docker"` on every Docker-driver host. The VM-only
log/warning paths are gated on `openshellDriver === "vm"`, so they now
stay silent for macOS Docker-driver sandboxes. Legacy/opt-in sandboxes
that were already written to disk with `openshellDriver: "vm"` still
trigger the existing VM-only compatibility shim.
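
The gating described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the NemoClaw source: `SandboxRegistryEntry` and `shouldRunVmOnlyPaths` are hypothetical names, and only the `openshellDriver` field and its values come from this change.

```typescript
// Hypothetical shape of a sandbox registry entry; only `openshellDriver`
// and its three values are taken from the change description.
type OpenShellDriver = "docker" | "vm" | "kubernetes";

interface SandboxRegistryEntry {
  name: string;
  openshellDriver: OpenShellDriver;
}

// VM-only paths (DNS monkeypatch, VM-driver warnings) run only when the
// recorded driver is "vm", regardless of the host platform.
function shouldRunVmOnlyPaths(entry: SandboxRegistryEntry): boolean {
  return entry.openshellDriver === "vm";
}

// A sandbox recorded after this fix on a macOS Docker-driver host.
const macDockerSandbox: SandboxRegistryEntry = {
  name: "new-macos-sandbox",
  openshellDriver: "docker",
};

// A legacy or opt-in sandbox already written to disk with the VM driver.
const legacyVmSandbox: SandboxRegistryEntry = {
  name: "legacy-sandbox",
  openshellDriver: "vm",
};

console.log(shouldRunVmOnlyPaths(macDockerSandbox)); // false
console.log(shouldRunVmOnlyPaths(legacyVmSandbox)); // true
```

Keying the check on the recorded driver rather than on `process.platform` is what lets legacy VM sandboxes keep their compatibility shim while new macOS sandboxes stay silent.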

## Changes

- `src/lib/onboard/sandbox-registry-metadata.ts` — drop the
  `process.platform === "darwin" ? "vm" : "docker"` branch; record
  `"docker"` whenever `isLinuxDockerDriverGatewayEnabled()` is true.
- `src/lib/onboard/sandbox-registry-metadata.test.ts` (new) — unit tests
  asserting macOS Docker-driver → `"docker"`, Linux Docker-driver →
  `"docker"`, and legacy Linux → `"kubernetes"`.
- `src/lib/onboard/vm-dns-monkeypatch.test.ts` — regression test that
  exercises the real `applyOpenShellVmDnsMonkeypatch` with
  `openshellDriver: "docker"` on a mocked darwin platform and verifies
  the onboard wrapper emits no logs or warnings.
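
A minimal sketch of the before and after driver selection, assuming the real code reads the gateway flag from `isLinuxDockerDriverGatewayEnabled()`. Here the flag is passed in as a plain boolean so the example is self-contained. The function names `recordDriverBeforeFix` and `recordDriverAfterFix` are hypothetical, and the `"kubernetes"` fallback when the flag is off is an assumption inferred from the "legacy Linux" test case.

```typescript
type Driver = "docker" | "vm" | "kubernetes";

// Pre-fix behavior as described in the summary: a macOS host was recorded
// as "vm" purely because of the platform check.
function recordDriverBeforeFix(dockerDriverGatewayEnabled: boolean, platform: string): Driver {
  if (!dockerDriverGatewayEnabled) return "kubernetes"; // assumed legacy fallback
  return platform === "darwin" ? "vm" : "docker";
}

// Post-fix behavior: the platform no longer matters once the Docker-driver
// gateway path is enabled.
function recordDriverAfterFix(dockerDriverGatewayEnabled: boolean): Driver {
  return dockerDriverGatewayEnabled ? "docker" : "kubernetes";
}

for (const platform of ["darwin", "linux"]) {
  console.log(platform, recordDriverBeforeFix(true, platform), recordDriverAfterFix(true));
}
// prints "darwin vm docker", then "linux docker docker"
```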

## Test plan

- [x] `npm run typecheck:cli`
- [x] `npm run build:cli`
- [x] `npx vitest run src/lib/onboard/sandbox-registry-metadata.test.ts
src/lib/onboard/vm-dns-monkeypatch.test.ts
src/lib/actions/sandbox/vm-dns-monkeypatch.test.ts` — 18/18 pass
- [x] `cd nemoclaw && npm run build && npm test` — 457/457 pass
- [x] `npx @biomejs/biome check` clean on touched files
- [x] Linux host can't reproduce the macOS-specific behavior directly, so
      the regression is covered by mocking `process.platform` (allowed
      by the issue brief).
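
One common way to mock a platform value in a test is to temporarily redefine the property and restore it afterwards. The helper below is a hedged sketch of that technique, not the code used in this PR: `withOverriddenProperty` and `fakeProcess` are hypothetical names, and a plain object stands in for `process` so the example runs without Node type definitions.

```typescript
// Temporarily replace a property on `target`, run `fn`, then restore the
// original descriptor even if `fn` throws.
function withOverriddenProperty<T extends object, R>(
  target: T,
  key: string,
  value: unknown,
  fn: () => R,
): R {
  const original = Object.getOwnPropertyDescriptor(target, key);
  Object.defineProperty(target, key, {
    value,
    configurable: true,
    enumerable: true,
    writable: false,
  });
  try {
    return fn();
  } finally {
    if (original) {
      // Put back exactly what was there before the override.
      Object.defineProperty(target, key, original);
    } else {
      // The property did not exist as an own property; remove the override.
      Reflect.deleteProperty(target, key);
    }
  }
}

// Stand-in for `process`; a real test would pass `process` itself.
const fakeProcess = { platform: "linux" };

const seen = withOverriddenProperty(fakeProcess, "platform", "darwin", () => fakeProcess.platform);
console.log(seen); // "darwin"
console.log(fakeProcess.platform); // "linux"
```

In a real test the same call would presumably target `process` with the key `"platform"`, which assumes that property is configurable in the Node version in use.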

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved Docker and Kubernetes driver selection for sandbox runtime configuration.
  * Fixed DNS monkeypatch handling on macOS Docker-driver sandboxes.
  * Corrected platform-specific driver assignment logic for Linux and macOS environments.

* **Tests**
  * Added comprehensive test coverage for driver selection across different platforms and configurations.


<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran added the following labels on Jun 3, 2026:

- area: cli (Command line interface, flags, terminal UX, or output)
- area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
- area: install (Install, setup, prerequisites, or uninstall flow)
- area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow)
- area: packaging (Packages, images, registries, installers, or distribution)
- area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
- bug-fix (PR fixes a bug or regression)
- platform: container (Affects Docker, containerd, Podman, or images)
- needs: review (PR is conflict-free and awaiting maintainer review)

and removed the following labels:

- area: packaging (Packages, images, registries, installers, or distribution)
- Getting Started
- bug (Something fails against expected or documented behavior)
- needs: review (PR is conflict-free and awaiting maintainer review)

Labels

- area: cli (Command line interface, flags, terminal UX, or output)
- area: e2e (End-to-end tests, nightly failures, or validation infrastructure)
- area: install (Install, setup, prerequisites, or uninstall flow)
- area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow)
- area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
- bug-fix (PR fixes a bug or regression)
- integration: hermes (Hermes integration behavior)
- integration: openclaw (OpenClaw integration behavior)
- platform: arm64 (Affects ARM64 or aarch64 architecture)
- platform: container (Affects Docker, containerd, Podman, or images)
- platform: macos (Affects macOS, including Apple Silicon)


Development

Successfully merging this pull request may close these issues.

NemoClaw hard-codes OPENSHELL_DRIVERS=vm on macOS, causing sandbox startup failure with Colima while Docker driver works

3 participants