fix(onboard): use Docker-driver gateway on macOS OpenShell 0.0.37#3383
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds platform-aware OpenShell installer checks and macOS aarch64 gateway asset selection, refactors onboard install/upgrade into helpers that consider platform-specific Docker-driver binary requirements (gateway vs gateway+sandbox), enables the gateway path on Darwin, and expands unit and E2E tests to validate installer behavior and survivor sandbox persistence across gateway restarts. ChangesmacOS Docker-driver gateway support and test coverage
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
E2E Advisor RecommendationRequired E2E: None Full advisor summaryPi Semantic E2E AdvisorFailed: pi exited with status 1; see /home/runner/work/NemoClaw/NemoClaw/artifacts/e2e-advisor/e2e-advisor-pi-raw-output.txt |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/test-openshell-gateway-upgrade.sh (1)
84-93: ⚡ Quick winRestore
sandboxes.jsonafter test to avoid persistent local side effects.At Line 213-237 the script mutates
$HOME/.nemoclaw/sandboxes.json, butcleanup(Line 84-93) doesn’t revert it. This can leave stale local state after manual runs.♻️ Proposed patch
@@ REGISTRY_FILE="$HOME/.nemoclaw/sandboxes.json" +REGISTRY_BACKUP="" @@ cleanup() { set +e cleanup_pid "$OLD_PID" cleanup_pid "$NEW_PID" @@ rm -f "$PID_FILE" + if [ "${REGISTRY_BACKUP:-}" = "__ABSENT__" ]; then + rm -f "$REGISTRY_FILE" + elif [ -n "${REGISTRY_BACKUP:-}" ] && [ -f "$REGISTRY_BACKUP" ]; then + mkdir -p "$(dirname "$REGISTRY_FILE")" + cp "$REGISTRY_BACKUP" "$REGISTRY_FILE" + rm -f "$REGISTRY_BACKUP" + fi } @@ create_survivor_sandbox() { @@ mkdir -p "$(dirname "$REGISTRY_FILE")" + if [ -z "${REGISTRY_BACKUP:-}" ]; then + if [ -f "$REGISTRY_FILE" ]; then + REGISTRY_BACKUP="$(mktemp)" + cp "$REGISTRY_FILE" "$REGISTRY_BACKUP" + else + REGISTRY_BACKUP="__ABSENT__" + fi + fi python3 - <<PYAlso applies to: 212-237
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-openshell-gateway-upgrade.sh` around lines 84 - 93, The test mutates the local sandboxes JSON but never restores it; before any mutation (in the setup section around where sandboxes.json is modified) save the original contents to a temp backup (e.g., BACKUP_SANDBOXES or a tmp file) and after the test ensure restoration in the cleanup() function; update cleanup() (which already calls cleanup_pid, handles OLD_PID/NEW_PID, openshell sandbox delete/SURVIVOR_SANDBOX and removes PID_FILE) to detect the backup and restore the original sandboxes JSON (or remove the temporary copy) so the local state is reverted even on early exits.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 3790: The change widening Docker-driver mode to include darwin causes
downstream procfs-based checks to misclassify macOS gateway processes; update
the logic to skip /proc inspections on macOS by guarding calls to readProcessEnv
and readProcessExe (and any callers that use them, e.g., the gateway process
reuse/drift classification) when platform === "darwin" instead of assuming Linux
semantics; add a short-circuit that treats darwin as “cannot inspect via /proc”
and falls back to a safe non-stale/unowned assumption or an alternative check so
healthy macOS gateway processes are not marked stale.
---
Nitpick comments:
In `@test/e2e/test-openshell-gateway-upgrade.sh`:
- Around line 84-93: The test mutates the local sandboxes JSON but never
restores it; before any mutation (in the setup section around where
sandboxes.json is modified) save the original contents to a temp backup (e.g.,
BACKUP_SANDBOXES or a tmp file) and after the test ensure restoration in the
cleanup() function; update cleanup() (which already calls cleanup_pid, handles
OLD_PID/NEW_PID, openshell sandbox delete/SURVIVOR_SANDBOX and removes PID_FILE)
to detect the backup and restore the original sandboxes JSON (or remove the
temporary copy) so the local state is reverted even on early exits.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f0f0af19-d406-4f47-80f9-e6bf0244d7c3
📒 Files selected for processing (5)
scripts/install-openshell.shsrc/lib/onboard.tstest/e2e/test-openshell-gateway-upgrade.shtest/install-openshell-version-check.test.tstest/onboard.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25711792996
|
Selective E2E Results — ❌ Some jobs failedRun: 25712253610
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/test-openshell-gateway-upgrade.sh (1)
406-410: 💤 Low valueDuplicate
localdeclaration formarker.Line 406 declares
local marker pid counter, then line 410 declareslocal markeragain. The second declaration is redundant.♻️ Proposed fix
assert_survivor_sandbox_after_upgrade() { - local marker pid counter + local pid counter marker info "Verifying survivor sandbox after OpenShell gateway upgrade" wait_for_survivor_ready || fail "survivor sandbox is not Ready after gateway upgrade" - local marker marker="$(🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-openshell-gateway-upgrade.sh` around lines 406 - 410, Remove the redundant local declaration for marker: keep the original single declaration "local marker pid counter" and delete the subsequent "local marker" line so the variables are declared only once (look for the duplicate declaration of marker in the test-openshell-gateway-upgrade.sh script near the survivor readiness check).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/test-openshell-gateway-upgrade.sh`:
- Around line 406-410: Remove the redundant local declaration for marker: keep
the original single declaration "local marker pid counter" and delete the
subsequent "local marker" line so the variables are declared only once (look for
the duplicate declaration of marker in the test-openshell-gateway-upgrade.sh
script near the survivor readiness check).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cf0ab207-6a05-4851-a121-d5097970d217
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yamltest/e2e/test-openshell-gateway-upgrade.sh
✅ Files skipped from review due to trivial changes (1)
- .github/workflows/nightly-e2e.yaml
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
3927-3931:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDarwin enablement still routes through Linux-only
/procchecks.With this gate returning
trueon macOS, the reuse/drift path now reachesreadProcessEnv/readProcessExeand therequireDockerDriverEnv: truechecks below. Those are Linux-only, so healthyopenshell-gatewayprocesses on Darwin will be classified as stale/unowned and get restarted or fail adoption on later onboard/resume runs. This still needs the non-Linux guard/fallback that was already called out earlier.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3927 - 3931, The function isLinuxDockerDriverGatewayEnabled currently returns true for darwin which lets macOS code flow into Linux-only checks (readProcessEnv, readProcessExe and places using requireDockerDriverEnv: true) causing healthy openshell-gateway processes to be misclassified; change the gate to only enable on Linux (return platform === "linux") or, alternatively, keep the function but add an explicit platform guard where callers invoke readProcessEnv/readProcessExe and check requireDockerDriverEnv so those Linux-specific checks are skipped on darwin, ensuring macOS follows the non-Linux fallback path.
🧹 Nitpick comments (1)
test/e2e/test-openshell-gateway-upgrade.sh (1)
358-401: 💤 Low valueRedundant
local markerdeclaration.Line 359 already declares
markeras local, so thelocal markeron line 363 is redundant.♻️ Suggested cleanup
assert_survivor_sandbox_after_upgrade() { local marker pid counter info "Verifying survivor sandbox after OpenShell gateway upgrade" wait_for_survivor_ready || fail "survivor sandbox is not Ready after gateway upgrade" - local marker marker="$( openshell sandbox exec --name "$SURVIVOR_SANDBOX" -- \ cat /tmp/nemoclaw-gateway-upgrade-marker 2>/dev/null || true )"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-openshell-gateway-upgrade.sh` around lines 358 - 401, In function assert_survivor_sandbox_after_upgrade remove the duplicate local declaration for marker (the second "local marker" before assigning marker="$(openshell sandbox exec ... )"); keep the initial local marker,pid,counter declaration at the top and delete the redundant declaration so marker is only declared once.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3516-3647: ensureOpenshellForOnboard currently assumes
installOpenshell() success implies required docker-driver binaries are present;
after any installOpenshell() call (the ones inside ensureOpenshellForOnboard
where openshellInstall is assigned), re-run the docker-driver check
(areRequiredDockerDriverBinariesPresent or equivalent) and, if still missing
when isLinuxDockerDriverGatewayEnabled() is true and docker-driver binaries were
the original reason for reinstall (needsDockerDriverBinaries was true), treat
the install as failed: log a clear error and exit (mirror existing failure
handling for installs/upgrades) or attempt one more install+check loop before
giving up so ensureOpenshellForOnboard never returns success while required
openshell-gateway/openshell-sandbox binaries are absent.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3927-3931: The function isLinuxDockerDriverGatewayEnabled
currently returns true for darwin which lets macOS code flow into Linux-only
checks (readProcessEnv, readProcessExe and places using requireDockerDriverEnv:
true) causing healthy openshell-gateway processes to be misclassified; change
the gate to only enable on Linux (return platform === "linux") or,
alternatively, keep the function but add an explicit platform guard where
callers invoke readProcessEnv/readProcessExe and check requireDockerDriverEnv so
those Linux-specific checks are skipped on darwin, ensuring macOS follows the
non-Linux fallback path.
---
Nitpick comments:
In `@test/e2e/test-openshell-gateway-upgrade.sh`:
- Around line 358-401: In function assert_survivor_sandbox_after_upgrade remove
the duplicate local declaration for marker (the second "local marker" before
assigning marker="$(openshell sandbox exec ... )"); keep the initial local
marker,pid,counter declaration at the top and delete the redundant declaration
so marker is only declared once.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2247a77e-5bff-420e-8279-ae1cbc71f98b
📒 Files selected for processing (3)
src/lib/onboard.tstest/e2e/test-openshell-gateway-upgrade.shtest/onboard.test.ts
Selective E2E Results — ❌ Some jobs failedRun: 25712441875
|
Selective E2E Results — ❌ Some jobs failedRun: 25712785974
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/test-openshell-gateway-upgrade.sh (1)
419-427: 💤 Low valueRemove duplicate
local markerdeclaration.
markeris declared twice: at line 419 and again at line 423. This is redundant and likely a copy-paste artifact.Proposed fix
assert_survivor_sandbox_after_upgrade() { local marker pid counter info "Verifying survivor sandbox after OpenShell gateway upgrade" wait_for_survivor_ready || fail "survivor sandbox is not Ready after gateway upgrade" - local marker marker="$( openshell sandbox exec --name "$SURVIVOR_SANDBOX" -- \ cat /tmp/nemoclaw-gateway-upgrade-marker 2>/dev/null || true🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/test-openshell-gateway-upgrade.sh` around lines 419 - 427, The file declares the variable "marker" twice; remove the redundant declaration so "marker" is declared only once before it's assigned. Specifically, delete the earlier standalone "local marker" and keep the single "local marker" that directly precedes the assignment using openshell sandbox exec (referencing SURVIVOR_SANDBOX and the cat /tmp/nemoclaw-gateway-upgrade-marker command); no other logic changes needed and wait_for_survivor_ready() and subsequent uses of marker should remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-openshell-gateway-upgrade.sh`:
- Around line 368-372: The sandbox creation fails because the test uses a
standalone gateway binary (exec "$STALE_GATEWAY_BIN") and registers it locally
(openshell gateway add --local --name nemoclaw ...) but still passes --gateway
nemoclaw to openshell sandbox create, which expects a Docker cluster container;
fix by either removing the --gateway nemoclaw flag from the openshell sandbox
create call (so it uses the default/selected gateway) or ensure Docker-based
gateway infrastructure is created before sandbox creation by invoking the
install script used by other tests (run the same install.sh/setup sequence that
creates the openshell-cluster-nemoclaw container) so the openshell sandbox
create --gateway nemoclaw command can find the expected Docker cluster.
---
Nitpick comments:
In `@test/e2e/test-openshell-gateway-upgrade.sh`:
- Around line 419-427: The file declares the variable "marker" twice; remove the
redundant declaration so "marker" is declared only once before it's assigned.
Specifically, delete the earlier standalone "local marker" and keep the single
"local marker" that directly precedes the assignment using openshell sandbox
exec (referencing SURVIVOR_SANDBOX and the cat
/tmp/nemoclaw-gateway-upgrade-marker command); no other logic changes needed and
wait_for_survivor_ready() and subsequent uses of marker should remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 88287808-a6d4-47e2-afd1-a01d1e92aa81
📒 Files selected for processing (1)
test/e2e/test-openshell-gateway-upgrade.sh
Selective E2E Results — ✅ All requested jobs passedRun: 25712886046
|
Selective E2E Results — ❌ Some jobs failedRun: 25713136197
|
| local old_head expected_head | ||
| old_head="$(git -C "$HOME/.nemoclaw/source" rev-parse HEAD 2>/dev/null || true)" | ||
| expected_head="$(git ls-remote https://github.com/NVIDIA/NemoClaw.git "refs/tags/${OLD_NEMOCLAW_REF}" | awk '{print $1}')" | ||
| [ -n "$old_head" ] && [ "$old_head" = "$expected_head" ] \ |
Selective E2E Results — ❌ Some jobs failedRun: 25713253315
|
Selective E2E Results — ❌ Some jobs failedRun: 25716192263
|
| destroyGateway(); | ||
| registry.clearAll(); | ||
| retireLegacyGatewayForDockerDriverUpgrade(); | ||
| gatewayReuseState = "missing"; |
Selective E2E Results — ❌ Some jobs failedRun: 25716841021
|
Selective E2E Results — ✅ All requested jobs passedRun: 25766991915
|
…037-gateway # Conflicts: # src/lib/onboard.ts
## Summary - Add v0.0.40 release notes and update docs version metadata. - Document release-prep behavior changes around onboarding, local inference, policy preset filtering, and config recovery. - Refresh generated `nemoclaw-user-*` skills from the source docs. ## Source summary - #3383 -> `docs/about/release-notes.md`, `docs/reference/commands.md`, `docs/manage-sandboxes/lifecycle.md`: Reflect macOS Docker-driver OpenShell gateway onboarding and upgrade behavior. - #3378 -> `docs/about/release-notes.md`: Capture the Docker-driver gateway TCP readiness fix and clearer startup failures. - #3338 -> `docs/about/release-notes.md`, `docs/inference/use-local-inference.md`: Reflect the Ollama auth proxy token requirement on native API routes. - #3420 -> `docs/about/release-notes.md`, `docs/get-started/prerequisites.md`, `docs/inference/use-local-inference.md`: Document the Linux Ollama `zstd` preflight and sudo messaging. - #3417 -> `docs/about/release-notes.md`, `docs/inference/inference-options.md`, `docs/inference/use-local-inference.md`: Reflect detected running vLLM provider selection. - #3223 -> `docs/about/release-notes.md`, `docs/reference/commands.md`, `docs/reference/network-policies.md`, `docs/get-started/quickstart.md`: Document agent-aware policy preset filtering. - #3385 -> `docs/about/release-notes.md`: Capture the dashboard forward TCP reachability check. - #3160 -> `docs/about/release-notes.md`, `docs/reference/troubleshooting.md`: Document empty `openclaw.json` baseline recovery. - #3367 -> `docs/about/release-notes.md`: Capture OpenClaw plugin compatibility metadata. ## Test plan - [x] `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - [x] `make docs` - [x] `git diff --check` - [x] Skip-term scan for `docs/.docs-skip` terms <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit # Release Notes v0.0.40 * **New Features** * Sandbox configuration recovery when inference changes cause data loss * Policy presets now intelligently filter based on agent capabilities * Enhanced gateway health checks and upgrade reliability * **Documentation** * Improved local inference setup instructions with clearer dependency requirements * Clarified vLLM experimental feature availability and prerequisites * Reorganized architecture documentation for enhanced clarity * Refined security and hardening guidance [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3427) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
… k3s Clarifies that the OpenShell sandbox runs as a sibling Docker container on Linux and macOS Apple Silicon (Docker-driver path, default since 0.0.39 via NVIDIA#3001 and NVIDIA#3383) or as a pod in the embedded k3s cluster on macOS Intel, Windows, and WSL2 (legacy path). architecture.md gets a new platform/compute-path table at the top of Deployment Topology, the existing Mermaid diagram is identified as the legacy k3s path, and the layering table now lists both Sandbox container and Sandbox pod variants. prerequisites.md drops k3s from the RAM-pressure sentence (k3s is an internal detail of the gateway container, not a separate process the user budgets for) and rewords the fuse-overlayfs note to refer to the OpenShell gateway image rather than k3s directly. Refs NVIDIA#3432. Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
… k3s Clarifies that the OpenShell sandbox runs as a sibling Docker container on Linux and macOS Apple Silicon (Docker-driver path, default since 0.0.39 via NVIDIA#3001 and NVIDIA#3383) or as a pod in the embedded k3s cluster on macOS Intel, Windows, and WSL2 (legacy path). architecture.md gets a new platform/compute-path table at the top of Deployment Topology, the existing Mermaid diagram is identified as the legacy k3s path, and the layering table now lists both Sandbox container and Sandbox pod variants. prerequisites.md drops k3s from the RAM-pressure sentence (k3s is an internal detail of the gateway container, not a separate process the user budgets for) and rewords the fuse-overlayfs note to refer to the OpenShell gateway image rather than k3s directly. Refs NVIDIA#3432. Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Summary
Fixes the macOS install/onboard regression when tip-of-main installs OpenShell 0.0.37 over an older or incomplete OpenShell. OpenShell 0.0.37 removed the legacy
openshell gateway start/destroylifecycle verbs; NemoClaw had switched to the standalone Docker-driver gateway path on Linux only, so macOS still called the removed verbs during onboarding.Changes
openshell-gatewayprocess instead of calling removedopenshell gateway start/destroyverbs.openshell-gateway-aarch64-apple-darwin.tar.gzasset when OpenShell 0.0.37 is present.ensureOpenshellForOnboard(), sonemoclaw onboarditself upgrades stale OpenShell and repairs incomplete Docker-driver installs before gateway start.openshell-gatewayandopenshell-sandbox; macOS requiresopenshell-gateway.openshell-gateway-upgrade-e2eto model the real upgrade path: install NemoClawv0.0.36with the curl-style installer, create a working registered claw on OpenShell 0.0.36, start a running in-sandbox agent process, run the current curl-style installer/onboard path from the PR SHA, and assert the same sandbox and same agent PID remain healthy with OpenShell 0.0.37.Verification
bash -n test/e2e/test-openshell-gateway-upgrade.sh scripts/install-openshell.shshellcheck scripts/install-openshell.sh test/e2e/test-openshell-gateway-upgrade.shgit diff --checknpm run build:clinpx vitest run test/install-openshell-version-check.test.ts test/onboard.test.ts --testTimeout 60000bash test/e2e/test-openshell-gateway-upgrade.shon macOS: passes the synthetic macOS installer regression and skips the live Linux gateway/agent upgrade section as intended.Notes
The live old-install upgrade section of
test/e2e/test-openshell-gateway-upgrade.shremains Linux-only and is wired intoopenshell-gateway-upgrade-e2ein nightly. It now starts from the real NemoClawv0.0.36installer, not a synthetic OpenShell-only setup.Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores