fix(onboard): recover GPU gateway reuse by ericksoa · Pull Request #3670 · NVIDIA/NemoClaw

ericksoa · 2026-05-17T22:49:52Z

Summary

This makes GPU-enabled onboarding recover from a healthy reusable gateway that was previously started without GPU passthrough, instead of aborting with a full nemoclaw uninstall && nemoclaw onboard --gpu recovery path.

The recovery is intentionally conservative:

Reuse is unchanged when GPU is not requested, when the reusable gateway is already a confirmed Docker-driver gateway, or when a legacy gateway already has Docker GPU DeviceRequests.
A CPU-only legacy gateway is automatically retired only when no local sandboxes are registered, or when NEMOCLAW_RECREATE_SANDBOX=1 is recreating the one registered sandbox with the same name.
If other registered sandboxes depend on the gateway, onboarding still aborts non-destructively and prints targeted destroy --cleanup-gateway guidance.
Unknown Docker/container state remains non-destructive.
--no-gpu / NEMOCLAW_SANDBOX_GPU=0 remain explicit opt-outs; this does not silently downshift GPU-selected onboarding to CPU.

I also tightened the manual recovery message so the empty-registry fallback no longer recommends full uninstall first.

Testing for reviewers

The focused unit coverage exercises the issue decision matrix:

GPU requested + healthy CPU-only legacy gateway + empty registry => restart/recreate path.
GPU requested + healthy CPU-only legacy gateway + the one registered sandbox being recreated => restart/recreate path.
GPU requested + healthy CPU-only legacy gateway + shared/different registered sandboxes => abort with targeted recovery.
GPU not requested => reuse path unchanged.
Confirmed Docker-driver gateway => reuse path unchanged.
Unknown legacy Docker inspection state => non-destructive abort.
Empty-registry recovery wording recommends targeted gateway cleanup, not full uninstall.

Local validation run:

npm run build:cli
npm test -- src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-create.test.ts src/lib/onboard/docker-gpu-patch.test.ts
npm run typecheck
npm run source-shape:check
git diff --check
npx prek run --all-files

All of the above passed locally.

I did not run live GPU UAT on Ubuntu GPU, DGX Spark, or DGX Station hardware. That is still the end-to-end validation needed for the exact reporter scenario: first create a CPU/no-GPU gateway, leave the stale gateway behind, then rerun GPU-default onboarding with NEMOCLAW_RECREATE_SANDBOX=1 and verify the new sandbox gets GPU access.

Summary by CodeRabbit

Bug Fixes
- Improved GPU passthrough detection and reuse logic to avoid unsafe reuse and to perform safer restart or abort-with-recovery when needed
- Updated recovery guidance for GPU passthrough issues with clearer cleanup and re-onboarding instructions
New Features
- Added explicit GPU inspection and decision flow to better handle legacy gateway GPU states and restart safety checks
Tests
- Expanded and hardened tests around GPU inspection, reuse decisions, recovery messaging, and CLI timeouts/robustness
Chores
- Hardened pre-commit test hook execution (serialized runs and cleanup before build)

coderabbitai · 2026-05-17T22:50:02Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f7dfa0b0-5334-460e-9fd7-2e78535be2ba

📥 Commits

Reviewing files that changed from the base of the PR and between 9797282 and e485c27.

📒 Files selected for processing (1)

src/lib/onboard.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Adds exported decision types and helpers to inspect legacy gateway GPU passthrough, implements a reconcile flow that may inspect Docker DeviceRequests and read sandbox registry names, wires this into onboarding (refactors destroyGateway for GPU-reuse), updates recovery messaging, expands tests, and tweaks pre-commit and test-runner behavior.

Changes

GPU Passthrough Gateway Reuse Decision Flow

Layer / File(s)	Summary
Gateway GPU passthrough decision types and logic `src/lib/onboard/gateway-gpu-passthrough.ts`	Adds `LegacyGatewayGpuInspection` and `GatewayGpuReuseDecision` types; rewrites `shouldInspectLegacyGatewayGpuPassthrough`; adds `inspectLegacyGatewayGpuPassthroughResult`, `decideGatewayGpuReuseForGpuIntent`, and `canRestartCpuOnlyGatewayForGpuIntent`.
Reconcile flow and registry helpers `src/lib/onboard/gateway-gpu-passthrough.ts`	Implements `reconcileGatewayGpuReuseForGpuIntent`, registry-reading helpers, guarded abort-with-recovery on unknown inspection, and CPU-only restart safety computation.
Onboard integration and destroyGateway refactor `src/lib/onboard.ts`	Replaces inline legacy-gateway GPU inspection with `reconcileGatewayGpuReuseForGpuIntent`, refactors `destroyGateway` to accept registry-clear and docker-driver-enabled callbacks, and invokes `destroyGateway(() => undefined, () => false)` for GPU-reuse cleanup.
Expanded decision/restart tests `src/lib/onboard/gateway-gpu-passthrough.test.ts`	Mocks docker inspect and adds tests for inspection parsing, decision outcomes (reuse/restart/abort), unknown-state handling, and `canRestartCpuOnlyGatewayForGpuIntent` behavior.
Recovery guidance and tests `src/lib/onboard/gpu-recovery.ts`, `src/lib/onboard/gpu-recovery.test.ts`	When no sandboxes are registered, recovery output now recommends `openshell gateway remove nemoclaw` (and optional `openshell gateway destroy -g nemoclaw`) followed by `nemoclaw onboard --gpu`; tests updated and forbidden uninstall suggestions added.
Pre-commit and test runner updates `.pre-commit-config.yaml`, `test/runner.test.ts`, `test/cli.test.ts`	`test-cli` hook now deletes `dist` with `rm -rf dist` and sets `require_serial: true`; test runner skips ENOENT files during reads; two tests receive explicit Vitest timeouts.

Sequence Diagram(s)

sequenceDiagram
  participant Onboard
  participant DockerInspect
  participant Registry
  participant GatewayOps
  participant Recovery
  Onboard->>DockerInspect: inspect legacy gateway DeviceRequests
  DockerInspect-->>Onboard: inspection outcome (gpu-enabled|cpu-only|not-found|unknown)
  alt inspection == "unknown"
    Onboard->>Recovery: abort-with-recovery (report unknown state)
  else inspection == "cpu-only"
    Onboard->>Registry: read registered sandbox names
    Registry-->>Onboard: sandbox list
    Onboard->>Onboard: decideGatewayGpuReuseForGpuIntent(..., cpuOnlyGatewayRestartSafe)
    alt decision == "restart-gateway"
      Onboard->>GatewayOps: stop forwards, run destroyGateway variant, recreate gateway with GPU
    else decision == "abort-with-recovery"
      Onboard->>Recovery: emit recovery instructions (openshell gateway remove/destroy)
    end
  else decision == "reuse"
    Onboard-->>GatewayOps: proceed without restart
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/NemoClaw#3383: Related changes to the isLinuxDockerDriverGatewayEnabled predicate used by the destroyGateway refactor.
NVIDIA/NemoClaw#3640: Overlapping test coverage updates for gateway GPU passthrough inspection helpers.
NVIDIA/NemoClaw#3479: Prior refactor introducing gated legacy gateway GPU inspection used in onboarding.

Suggested labels

NemoClaw CLI, Docker, OpenShell, fix, Sandbox

Suggested reviewers

cjagwani
jyaunches

Poem

🐰 I hopped through code to check the gateway state,
Parsed DeviceRequests, decided reuse or fate.
If CPU-only stops onboarding's new delight,
I whisper openshell commands to set it right.
Then I bound away, happy: rebuild and re-ignite.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(onboard): recover GPU gateway reuse' accurately and specifically summarizes the main change: enabling GPU-enabled onboarding to recover from a healthy reusable gateway previously started without GPU passthrough, avoiding enforced full reinstall.
Linked Issues check	✅ Passed	The code changes comprehensively address issue `#3578` by implementing safe GPU gateway recovery paths, decision logic for reuse vs restart scenarios, non-destructive abort handling, and updated recovery messaging for empty registries.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to GPU gateway recovery: pre-commit hook serialization, onboard entrypoint refactoring, GPU passthrough inspection/decision primitives, recovery messaging updates, and test timeout adjustments are all necessary for the fix.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/3578-gpu-gateway-reuse

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-17T22:51:37Z

E2E Advisor Recommendation

Required E2E: gpu-e2e, gpu-double-onboard-e2e, double-onboard-e2e
Optional E2E: onboard-resume-e2e, onboard-repair-e2e, gateway-drift-preflight-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e,double-onboard-e2e

Auto-dispatched E2E: gpu-e2e, gpu-double-onboard-e2e, double-onboard-e2e via nightly-e2e.yaml at e485c274de5d26814c3cf8f81209ff71acd4a198 — nightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

gpu-e2e (high): Required because the changed code is in the GPU onboarding path. This job runs a real local Ollama GPU onboarding flow, creates a sandbox with GPU passthrough, verifies inference through the sandbox, and exercises gateway creation/cleanup on a GPU runner.
gpu-double-onboard-e2e (high): Required because this PR specifically changes healthy gateway reuse/reconciliation when GPU intent is present. The double-onboard GPU job verifies repeated GPU onboarding against an existing gateway and catches regressions in GPU gateway reuse, Ollama proxy continuity, and live sandbox inference after re-onboard.
double-onboard-e2e (high): Required as a lifecycle safety check because the PR changes gateway reuse and cleanup decisions based on registry state. This job validates repeated onboarding, shared gateway reuse, multiple sandbox preservation, stale registry reconciliation, and gateway rebuild guidance in a real OpenShell environment.

Optional E2E

onboard-resume-e2e (medium): Useful adjacent coverage because the changed callsite passes recorded/requested sandbox names and recreate intent into gateway GPU reconciliation near the resume/reuse gateway branch. It validates interrupted onboard state and resume behavior, but does not directly exercise the GPU mismatch branch.
onboard-repair-e2e (medium): Optional confidence for repair and invalidation paths around resumable onboarding and missing sandboxes, adjacent to the registry/sandbox-name safety checks introduced for GPU gateway replacement.
gateway-drift-preflight-e2e (medium): Optional fail-closed gateway confidence check. The PR preserves non-destructive behavior when legacy gateway GPU state cannot be verified; this regression job validates a related stale/unsafe gateway preflight failure mode.

New E2E recommendations

gpu-gateway-reuse-recovery (high): Existing GPU E2E jobs cover clean GPU onboard and GPU re-onboard, but they do not appear to create a CPU-only legacy gateway first and then rerun onboard with GPU intent. That is the exact recovery path changed here, including empty-registry safe gateway replacement and abort behavior when other sandboxes are registered.
- Suggested test: Add a GPU E2E that first onboards without GPU/local Ollama passthrough or otherwise creates a CPU-only legacy gateway, then runs nemoclaw onboard --gpu with (a) no registered sandboxes, (b) --recreate for the sole current sandbox, and (c) a second registered sandbox, asserting safe replacement only in the first two cases and actionable recovery output in the third.
gateway-gpu-inspection-fail-closed (medium): The new implementation refuses automatic cleanup when Docker inspection returns an unknown result. Unit tests cover the pure decision, but an E2E with a real docker inspect failure mode would guard the user-facing no-destruction behavior.
- Suggested test: Add a regression E2E that sabotages legacy gateway docker inspect for DeviceRequests during GPU onboard and asserts NemoClaw exits non-zero with the 'could not be verified' guidance without destroying the existing gateway or registered sandboxes.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: gpu-e2e,gpu-double-onboard-e2e,double-onboard-e2e

github-actions · 2026-05-17T22:51:37Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26005053028
Target ref: c9ec1fe81adf219eb8d9489dcfefd78fecd9ba13
Workflow ref: main
Requested jobs: gpu-double-onboard-e2e,gpu-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job	Result
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)
2755-2805: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Please extract this GPU-reuse helper block out of src/lib/onboard.ts.

The onboard-entrypoint-budget job is already failing on this PR, and these helpers are self-contained enough to live beside gateway-gpu-passthrough.ts without changing behavior. That should get this file back under budget without mixing in extra cleanup.

Based on learnings: In NVIDIA/NemoClaw PRs that extract code from src/lib/onboard.ts into new behavior-preserving modules (e.g., src/lib/onboard-providers.ts, src/lib/onboard-ollama-proxy.ts, src/lib/onboard-inference-probes.ts), reviewers should avoid requesting refactors/complexity/style improvements in the extracted modules as part of the same PR. Only check for correctness and behavior preservation; defer non-behavior-preserving cleanup to separate follow-up PRs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2755 - 2805, Extract the four GPU-reuse
helper functions readRegisteredSandboxNamesForGatewayGpuReuse,
reportUnreadableSandboxRegistryForGpuReuse, destroyGateway, and
destroyGatewayRuntimeForGpuReuse into a new module colocated with
gateway-gpu-passthrough.ts, export them from that module, and update any callers
in src/lib/onboard.ts to import these functions instead of keeping them inline;
preserve all internal behavior and references to registry, GATEWAY_NAME,
dockerRemoveVolumesByPrefix, gatewayCliSupportsLifecycleCommands,
runCaptureOpenshell, isLinuxDockerDriverGatewayEnabled,
removeDockerDriverGatewayRegistration, runOpenshell, and
stopDockerDriverGatewayProcess so the logic and side effects remain identical.
Ensure the new file has the same runtime dependencies imported and no other
behavioral changes.

🧹 Nitpick comments (1)

src/lib/onboard.ts (1)
9733-9809: Run the gateway lifecycle E2Es for this path.

This branch now decides whether to preserve, restart, or abort reuse of a shared gateway, so sandbox-operations-e2e and openshell-gateway-upgrade-e2e would give much better confidence than the unit coverage alone.

As per coding guidelines: src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow. The recommended E2Es include sandbox-operations-e2e and openshell-gateway-upgrade-e2e.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9733 - 9809, Run the gateway lifecycle
end-to-end tests that exercise the new reuse/restart/abort logic in
src/lib/onboard.ts: specifically execute sandbox-operations-e2e and
openshell-gateway-upgrade-e2e against scenarios that trigger
isConfirmedDockerDriverGatewayForGpuReuse,
shouldInspectLegacyGatewayGpuPassthrough, and the code paths through
inspectLegacyGatewayGpuPassthroughResult → decideGatewayGpuReuseForGpuIntent
(including branches that call retireLegacyGatewayForDockerDriverUpgrade and
destroyGatewayForReuse); verify behavior for "restart-gateway",
"abort-with-recovery", and "cpu-only" flows and confirm registry/readable
sandbox names (readRegisteredSandboxNamesForGatewayGpuReuse) and recovery
reporting (reportGpuPassthroughRecovery) behave as expected.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2750-2753: The helper isConfirmedDockerDriverGatewayForGpuReuse
drops the upstream "healthy but PID-unattributed" signal from
refreshDockerDriverGatewayReuseState by only treating the exact "healthy" value
as eligible; change the check so any healthy variant is preserved (e.g., accept
values that include or start with "healthy" or the specific enum variant that
indicates "healthy but PID-unattributed") before falling back to
isDockerDriverGatewayProcessAlive(), so healthy-but-unattributed gateways are
routed into the GPU-reuse path instead of the legacy openshell-cluster recovery.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 2755-2805: Extract the four GPU-reuse helper functions
readRegisteredSandboxNamesForGatewayGpuReuse,
reportUnreadableSandboxRegistryForGpuReuse, destroyGateway, and
destroyGatewayRuntimeForGpuReuse into a new module colocated with
gateway-gpu-passthrough.ts, export them from that module, and update any callers
in src/lib/onboard.ts to import these functions instead of keeping them inline;
preserve all internal behavior and references to registry, GATEWAY_NAME,
dockerRemoveVolumesByPrefix, gatewayCliSupportsLifecycleCommands,
runCaptureOpenshell, isLinuxDockerDriverGatewayEnabled,
removeDockerDriverGatewayRegistration, runOpenshell, and
stopDockerDriverGatewayProcess so the logic and side effects remain identical.
Ensure the new file has the same runtime dependencies imported and no other
behavioral changes.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9733-9809: Run the gateway lifecycle end-to-end tests that
exercise the new reuse/restart/abort logic in src/lib/onboard.ts: specifically
execute sandbox-operations-e2e and openshell-gateway-upgrade-e2e against
scenarios that trigger isConfirmedDockerDriverGatewayForGpuReuse,
shouldInspectLegacyGatewayGpuPassthrough, and the code paths through
inspectLegacyGatewayGpuPassthroughResult → decideGatewayGpuReuseForGpuIntent
(including branches that call retireLegacyGatewayForDockerDriverUpgrade and
destroyGatewayForReuse); verify behavior for "restart-gateway",
"abort-with-recovery", and "cpu-only" flows and confirm registry/readable
sandbox names (readRegisteredSandboxNamesForGatewayGpuReuse) and recovery
reporting (reportGpuPassthroughRecovery) behave as expected.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7806dbf5-17f7-4ef5-8967-1599df402142

📥 Commits

Reviewing files that changed from the base of the PR and between 647566a and c9ec1fe.

📒 Files selected for processing (7)

.pre-commit-config.yaml
src/lib/onboard.ts
src/lib/onboard/gateway-gpu-passthrough.test.ts
src/lib/onboard/gateway-gpu-passthrough.ts
src/lib/onboard/gpu-recovery.test.ts
src/lib/onboard/gpu-recovery.ts
test/runner.test.ts

ericksoa · 2026-05-17T23:05:28Z

Follow-up pushed in 2e683e149 for the current red checks and CodeRabbit feedback.

What changed after review:

Preserved the Docker-driver healthy reuse signal from refreshDockerDriverGatewayReuseState(). A healthy Docker-driver gateway, including the PID-unattributed-but-port-owned case, now bypasses the legacy openshell-cluster-* GPU inspection path instead of being misclassified as a legacy CPU-only gateway.
Moved the GPU reuse reconciliation out of the onboarding entrypoint and into src/lib/onboard/gateway-gpu-passthrough.ts. Locally, src/lib/onboard.ts is now +21/-23 versus origin/main, so the onboard-entrypoint-budget check should clear on the new head.

Local validation run on this head:

npm test -- src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-create.test.ts src/lib/onboard/docker-gpu-patch.test.ts -> 44 passing tests
npm run build:cli
npm run typecheck
npm run source-shape:check
git diff --check
npm run check / npx prek run --all-files

Issue-specific coverage in the focused tests exercises the decision matrix for GPU requested with a CPU-only legacy gateway, empty registry restart, registered-sandbox abort, GPU opt-out reuse, confirmed Docker-driver reuse, and unknown Docker inspection staying non-destructive. The remaining runtime proof for reviewers is the GPU-host UAT path from #3578: start with a CPU/no-GPU gateway, leave the stale gateway behind, rerun GPU onboarding/recreate, and verify we recreate a GPU-capable sandbox instead of aborting with the old uninstall guidance.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 9689-9700: The code is wrongly treating any "healthy" gateway as a
confirmed Docker-driver; update the value passed for
confirmedDockerDriverGateway in the reconcileGatewayGpuReuseForGpuIntent call so
it only becomes true when the gateway is explicitly confirmed to be the Linux
Docker-driver implementation (not merely gatewayReuseState === "healthy").
Locate the call to reconcileGatewayGpuReuseForGpuIntent and replace the
expression isLinuxDockerDriverGatewayEnabled() && gatewayReuseState ===
"healthy" with a check that inspects whatever explicit confirmation flag or
runtime/type field you have (for example an explicit confirmed-docker-driver
state on gatewayReuseState or a recorded gateway metadata field like
recordedGateway?.dockerDriverConfirmed or an isGatewayConfirmedDockerDriver(...)
helper); ensure the new condition only returns true when that explicit
confirmation is present.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa204a70-c5a3-4377-a2bd-14261037e1bf

📥 Commits

Reviewing files that changed from the base of the PR and between c9ec1fe and 2e683e1.

📒 Files selected for processing (3)

src/lib/onboard.ts
src/lib/onboard/gateway-gpu-passthrough.test.ts
src/lib/onboard/gateway-gpu-passthrough.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/onboard/gateway-gpu-passthrough.ts

cv · 2026-05-17T23:11:33Z

        name: Test (CLI)
        entry: >-
-          bash -c 'node -e "require(\"node:fs\").rmSync(\"dist\", { recursive: true, force: true })" && npm run build:cli && npx tsx scripts/check-dist-sourcemaps.ts dist && npx vitest run --project cli --coverage --coverage.reporter=text-summary --coverage.reporter=json-summary --coverage.reportsDirectory=coverage/cli --coverage.include="bin/**/*.js" --coverage.include="dist/lib/**/*.js" --coverage.exclude="test/**/*.js" --coverage.exclude="test/**/*.ts" && npx tsx scripts/check-coverage-ratchet.ts coverage/cli/coverage-summary.json ci/coverage-threshold-cli.json "CLI coverage"'
+          bash -c 'rm -rf dist && npm run build:cli && npx tsx scripts/check-dist-sourcemaps.ts dist && NEMOCLAW_TEST_TIMEOUT=20000 npx vitest run --project cli --coverage --coverage.reporter=text-summary --coverage.reporter=json-summary --coverage.reportsDirectory=coverage/cli --coverage.include="bin/**/*.js" --coverage.include="dist/lib/**/*.js" --coverage.exclude="test/**/*.js" --coverage.exclude="test/**/*.ts" && npx tsx scripts/check-coverage-ratchet.ts coverage/cli/coverage-summary.json ci/coverage-threshold-cli.json "CLI coverage"'


let's not set NEMOCLAW_TEST_TIMEOUT here, and rely on the default

Done in 97972825a: removed NEMOCLAW_TEST_TIMEOUT=20000 from the pre-commit hook. The full npm run check now passes locally using the default hook timeout; the two existing CLI tests that exceeded the default suite timeout now have targeted testTimeoutOptions(...) budgets instead of a global override.

cv

See comments

github-actions · 2026-05-17T23:31:32Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26005369843
Target ref: 2e683e1492042f34af2d14918f319b32cc239d0b
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e,double-onboard-e2e,cloud-onboard-e2e
Summary: 2 passed, 0 failed, 2 skipped

Job	Result
cloud-onboard-e2e	✅ success
double-onboard-e2e	✅ success
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

ericksoa · 2026-05-18T00:08:58Z

Pushed 97972825a to address the new review items.

Changes in this follow-up:

CodeRabbit: confirmedDockerDriverGateway now requires the Docker-driver platform, healthy gateway state, and absence of legacy lifecycle commands. That preserves the package-managed/PID-unattributed Docker-driver reuse path without skipping legacy gateway GPU inspection.
cv: removed NEMOCLAW_TEST_TIMEOUT=20000 from the test-cli pre-commit hook.
Since removing the global hook timeout exposed two existing CLI tests that legitimately run just over the default 5s budget on this machine, I gave those two tests targeted testTimeoutOptions(...) values instead of restoring a suite-wide override.

Local validation on the pushed head:

focused GPU tests: 44 passed
npx vitest run --project cli test/cli.test.ts -t "doctor accepts a live cloudflared PID"
npx vitest run --project cli test/cli.test.ts -t "connect --probe-only falls back to SSH when sandbox exec never starts"
npm run typecheck
npm run build:cli
npm run source-shape:check
git diff --check
npm run check / npx prek run --all-files

github-actions · 2026-05-18T00:10:18Z

Selective E2E Results — ⚠️ No requested jobs ran

Run: 26006750384
Target ref: 97972825ada94d3a3e3b48f56a8764d23766a4a9
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e
Summary: 0 passed, 0 failed, 2 skipped

Job	Result
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

github-actions · 2026-05-18T01:16:06Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26007749323
Target ref: e485c274de5d26814c3cf8f81209ff71acd4a198
Workflow ref: main
Requested jobs: gpu-e2e,gpu-double-onboard-e2e,double-onboard-e2e
Summary: 1 passed, 0 failed, 2 skipped

Job	Result
double-onboard-e2e	✅ success
gpu-double-onboard-e2e	⏭️ skipped
gpu-e2e	⏭️ skipped

## Summary Fixes #3578. This makes GPU-enabled onboarding recover from a healthy reusable gateway that was previously started without GPU passthrough, instead of aborting with a full `nemoclaw uninstall && nemoclaw onboard --gpu` recovery path. The recovery is intentionally conservative: - Reuse is unchanged when GPU is not requested, when the reusable gateway is already a confirmed Docker-driver gateway, or when a legacy gateway already has Docker GPU DeviceRequests. - A CPU-only legacy gateway is automatically retired only when no local sandboxes are registered, or when `NEMOCLAW_RECREATE_SANDBOX=1` is recreating the one registered sandbox with the same name. - If other registered sandboxes depend on the gateway, onboarding still aborts non-destructively and prints targeted `destroy --cleanup-gateway` guidance. - Unknown Docker/container state remains non-destructive. - `--no-gpu` / `NEMOCLAW_SANDBOX_GPU=0` remain explicit opt-outs; this does not silently downshift GPU-selected onboarding to CPU. I also tightened the manual recovery message so the empty-registry fallback no longer recommends full uninstall first. ## Testing for reviewers The focused unit coverage exercises the issue decision matrix: - GPU requested + healthy CPU-only legacy gateway + empty registry => restart/recreate path. - GPU requested + healthy CPU-only legacy gateway + the one registered sandbox being recreated => restart/recreate path. - GPU requested + healthy CPU-only legacy gateway + shared/different registered sandboxes => abort with targeted recovery. - GPU not requested => reuse path unchanged. - Confirmed Docker-driver gateway => reuse path unchanged. - Unknown legacy Docker inspection state => non-destructive abort. - Empty-registry recovery wording recommends targeted gateway cleanup, not full uninstall. Local validation run: ```text npm run build:cli npm test -- src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-create.test.ts src/lib/onboard/docker-gpu-patch.test.ts npm run typecheck npm run source-shape:check git diff --check npx prek run --all-files ``` All of the above passed locally. I did not run live GPU UAT on Ubuntu GPU, DGX Spark, or DGX Station hardware. That is still the end-to-end validation needed for the exact reporter scenario: first create a CPU/no-GPU gateway, leave the stale gateway behind, then rerun GPU-default onboarding with `NEMOCLAW_RECREATE_SANDBOX=1` and verify the new sandbox gets GPU access.  ## Summary by CodeRabbit * **Bug Fixes** * Improved GPU passthrough detection and reuse logic to avoid unsafe reuse and to perform safer restart or abort-with-recovery when needed * Updated recovery guidance for GPU passthrough issues with clearer cleanup and re-onboarding instructions * **New Features** * Added explicit GPU inspection and decision flow to better handle legacy gateway GPU states and restart safety checks * **Tests** * Expanded and hardened tests around GPU inspection, reuse decisions, recovery messaging, and CLI timeouts/robustness * **Chores** * Hardened pre-commit test hook execution (serialized runs and cleanup before build)  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3670?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

## Summary Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs. ## Changes - Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates. - Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation. - Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs. - Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets. - Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes. - Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior. - Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes. - Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled. - Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall. - Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe. - Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs. - Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`. - Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs. - Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks. - Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked. ---  Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **Documentation** * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts. * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps. * Added documentation for overriding managed vLLM model selection. * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements. * **Version Bump** * Released v0.0.45.  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

Fix GPU gateway reuse recovery

c9ec1fe

ericksoa changed the title ~~Fix GPU gateway reuse recovery~~ fix(onboard): recover GPU gateway reuse May 17, 2026

ericksoa self-assigned this May 17, 2026

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

fix(onboard): address GPU gateway reuse feedback

2e683e1

coderabbitai Bot reviewed May 17, 2026

View reviewed changes

Comment thread src/lib/onboard.ts

cv reviewed May 17, 2026

View reviewed changes

cv requested changes May 17, 2026

View reviewed changes

fix(onboard): confirm Docker-driver gateway reuse

9797282

Merge branch 'main' into fix/3578-gpu-gateway-reuse

e485c27

ericksoa added fix labels May 18, 2026

cv approved these changes May 18, 2026

View reviewed changes

cv merged commit 32dbdc5 into main May 18, 2026
27 checks passed

miyoungc mentioned this pull request May 18, 2026

docs: update release notes for v0.0.45 #3755

Merged

12 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

fix(onboard): reject Jetson sandbox GPU passthrough #3965

Merged

12 tasks

wscurran added area: sandbox OpenShell sandbox lifecycle, runtime, config, or recovery bug-fix PR fixes a bug or regression labels Jun 3, 2026

wscurran removed fix labels Jun 3, 2026

Conversation

ericksoa commented May 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing for reviewers

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Advisor Recommendation

E2E Recommendation Advisor

Required E2E

Optional E2E

New E2E recommendations

Dispatch hint

Uh oh!

github-actions Bot commented May 17, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ericksoa commented May 17, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cv May 17, 2026

Choose a reason for hiding this comment

Uh oh!

ericksoa May 18, 2026

Choose a reason for hiding this comment

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 17, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

ericksoa commented May 18, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Selective E2E Results — ⚠️ No requested jobs ran

Uh oh!

github-actions Bot commented May 18, 2026

Selective E2E Results — ✅ All requested jobs passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ericksoa commented May 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 17, 2026 •

edited

Loading

github-actions Bot commented May 17, 2026 •

edited

Loading