fix(onboard): recover GPU gateway reuse#3670
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds exported decision types and helpers to inspect legacy gateway GPU passthrough, implements a reconcile flow that may inspect Docker DeviceRequests and read sandbox registry names, wires this into onboarding (refactors destroyGateway for GPU-reuse), updates recovery messaging, expands tests, and tweaks pre-commit and test-runner behavior. ChangesGPU Passthrough Gateway Reuse Decision Flow
Sequence Diagram(s)sequenceDiagram
participant Onboard
participant DockerInspect
participant Registry
participant GatewayOps
participant Recovery
Onboard->>DockerInspect: inspect legacy gateway DeviceRequests
DockerInspect-->>Onboard: inspection outcome (gpu-enabled|cpu-only|not-found|unknown)
alt inspection == "unknown"
Onboard->>Recovery: abort-with-recovery (report unknown state)
else inspection == "cpu-only"
Onboard->>Registry: read registered sandbox names
Registry-->>Onboard: sandbox list
Onboard->>Onboard: decideGatewayGpuReuseForGpuIntent(..., cpuOnlyGatewayRestartSafe)
alt decision == "restart-gateway"
Onboard->>GatewayOps: stop forwards, run destroyGateway variant, recreate gateway with GPU
else decision == "abort-with-recovery"
Onboard->>Recovery: emit recovery instructions (openshell gateway remove/destroy)
end
else decision == "reuse"
Onboard-->>GatewayOps: proceed without restart
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
2755-2805: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winPlease extract this GPU-reuse helper block out of
src/lib/onboard.ts.The onboard-entrypoint-budget job is already failing on this PR, and these helpers are self-contained enough to live beside
gateway-gpu-passthrough.tswithout changing behavior. That should get this file back under budget without mixing in extra cleanup.Based on learnings: In NVIDIA/NemoClaw PRs that extract code from
src/lib/onboard.tsinto new behavior-preserving modules (e.g.,src/lib/onboard-providers.ts,src/lib/onboard-ollama-proxy.ts,src/lib/onboard-inference-probes.ts), reviewers should avoid requesting refactors/complexity/style improvements in the extracted modules as part of the same PR. Only check for correctness and behavior preservation; defer non-behavior-preserving cleanup to separate follow-up PRs.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2755 - 2805, Extract the four GPU-reuse helper functions readRegisteredSandboxNamesForGatewayGpuReuse, reportUnreadableSandboxRegistryForGpuReuse, destroyGateway, and destroyGatewayRuntimeForGpuReuse into a new module colocated with gateway-gpu-passthrough.ts, export them from that module, and update any callers in src/lib/onboard.ts to import these functions instead of keeping them inline; preserve all internal behavior and references to registry, GATEWAY_NAME, dockerRemoveVolumesByPrefix, gatewayCliSupportsLifecycleCommands, runCaptureOpenshell, isLinuxDockerDriverGatewayEnabled, removeDockerDriverGatewayRegistration, runOpenshell, and stopDockerDriverGatewayProcess so the logic and side effects remain identical. Ensure the new file has the same runtime dependencies imported and no other behavioral changes.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
9733-9809: Run the gateway lifecycle E2Es for this path.This branch now decides whether to preserve, restart, or abort reuse of a shared gateway, so
sandbox-operations-e2eandopenshell-gateway-upgrade-e2ewould give much better confidence than the unit coverage alone.As per coding guidelines:
src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow. The recommended E2Es includesandbox-operations-e2eandopenshell-gateway-upgrade-e2e.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9733 - 9809, Run the gateway lifecycle end-to-end tests that exercise the new reuse/restart/abort logic in src/lib/onboard.ts: specifically execute sandbox-operations-e2e and openshell-gateway-upgrade-e2e against scenarios that trigger isConfirmedDockerDriverGatewayForGpuReuse, shouldInspectLegacyGatewayGpuPassthrough, and the code paths through inspectLegacyGatewayGpuPassthroughResult → decideGatewayGpuReuseForGpuIntent (including branches that call retireLegacyGatewayForDockerDriverUpgrade and destroyGatewayForReuse); verify behavior for "restart-gateway", "abort-with-recovery", and "cpu-only" flows and confirm registry/readable sandbox names (readRegisteredSandboxNamesForGatewayGpuReuse) and recovery reporting (reportGpuPassthroughRecovery) behave as expected.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2750-2753: The helper isConfirmedDockerDriverGatewayForGpuReuse
drops the upstream "healthy but PID-unattributed" signal from
refreshDockerDriverGatewayReuseState by only treating the exact "healthy" value
as eligible; change the check so any healthy variant is preserved (e.g., accept
values that include or start with "healthy" or the specific enum variant that
indicates "healthy but PID-unattributed") before falling back to
isDockerDriverGatewayProcessAlive(), so healthy-but-unattributed gateways are
routed into the GPU-reuse path instead of the legacy openshell-cluster recovery.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 2755-2805: Extract the four GPU-reuse helper functions
readRegisteredSandboxNamesForGatewayGpuReuse,
reportUnreadableSandboxRegistryForGpuReuse, destroyGateway, and
destroyGatewayRuntimeForGpuReuse into a new module colocated with
gateway-gpu-passthrough.ts, export them from that module, and update any callers
in src/lib/onboard.ts to import these functions instead of keeping them inline;
preserve all internal behavior and references to registry, GATEWAY_NAME,
dockerRemoveVolumesByPrefix, gatewayCliSupportsLifecycleCommands,
runCaptureOpenshell, isLinuxDockerDriverGatewayEnabled,
removeDockerDriverGatewayRegistration, runOpenshell, and
stopDockerDriverGatewayProcess so the logic and side effects remain identical.
Ensure the new file has the same runtime dependencies imported and no other
behavioral changes.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9733-9809: Run the gateway lifecycle end-to-end tests that
exercise the new reuse/restart/abort logic in src/lib/onboard.ts: specifically
execute sandbox-operations-e2e and openshell-gateway-upgrade-e2e against
scenarios that trigger isConfirmedDockerDriverGatewayForGpuReuse,
shouldInspectLegacyGatewayGpuPassthrough, and the code paths through
inspectLegacyGatewayGpuPassthroughResult → decideGatewayGpuReuseForGpuIntent
(including branches that call retireLegacyGatewayForDockerDriverUpgrade and
destroyGatewayForReuse); verify behavior for "restart-gateway",
"abort-with-recovery", and "cpu-only" flows and confirm registry/readable
sandbox names (readRegisteredSandboxNamesForGatewayGpuReuse) and recovery
reporting (reportGpuPassthroughRecovery) behave as expected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7806dbf5-17f7-4ef5-8967-1599df402142
📒 Files selected for processing (7)
.pre-commit-config.yamlsrc/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tstest/runner.test.ts
|
Follow-up pushed in What changed after review:
Local validation run on this head:
Issue-specific coverage in the focused tests exercises the decision matrix for GPU requested with a CPU-only legacy gateway, empty registry restart, registered-sandbox abort, GPU opt-out reuse, confirmed Docker-driver reuse, and unknown Docker inspection staying non-destructive. The remaining runtime proof for reviewers is the GPU-host UAT path from #3578: start with a CPU/no-GPU gateway, leave the stale gateway behind, rerun GPU onboarding/recreate, and verify we recreate a GPU-capable sandbox instead of aborting with the old uninstall guidance. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 9689-9700: The code is wrongly treating any "healthy" gateway as a
confirmed Docker-driver; update the value passed for
confirmedDockerDriverGateway in the reconcileGatewayGpuReuseForGpuIntent call so
it only becomes true when the gateway is explicitly confirmed to be the Linux
Docker-driver implementation (not merely gatewayReuseState === "healthy").
Locate the call to reconcileGatewayGpuReuseForGpuIntent and replace the
expression isLinuxDockerDriverGatewayEnabled() && gatewayReuseState ===
"healthy" with a check that inspects whatever explicit confirmation flag or
runtime/type field you have (for example an explicit confirmed-docker-driver
state on gatewayReuseState or a recorded gateway metadata field like
recordedGateway?.dockerDriverConfirmed or an isGatewayConfirmedDockerDriver(...)
helper); ensure the new condition only returns true when that explicit
confirmation is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: aa204a70-c5a3-4377-a2bd-14261037e1bf
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/gateway-gpu-passthrough.ts
| name: Test (CLI) | ||
| entry: >- | ||
| bash -c 'node -e "require(\"node:fs\").rmSync(\"dist\", { recursive: true, force: true })" && npm run build:cli && npx tsx scripts/check-dist-sourcemaps.ts dist && npx vitest run --project cli --coverage --coverage.reporter=text-summary --coverage.reporter=json-summary --coverage.reportsDirectory=coverage/cli --coverage.include="bin/**/*.js" --coverage.include="dist/lib/**/*.js" --coverage.exclude="test/**/*.js" --coverage.exclude="test/**/*.ts" && npx tsx scripts/check-coverage-ratchet.ts coverage/cli/coverage-summary.json ci/coverage-threshold-cli.json "CLI coverage"' | ||
| bash -c 'rm -rf dist && npm run build:cli && npx tsx scripts/check-dist-sourcemaps.ts dist && NEMOCLAW_TEST_TIMEOUT=20000 npx vitest run --project cli --coverage --coverage.reporter=text-summary --coverage.reporter=json-summary --coverage.reportsDirectory=coverage/cli --coverage.include="bin/**/*.js" --coverage.include="dist/lib/**/*.js" --coverage.exclude="test/**/*.js" --coverage.exclude="test/**/*.ts" && npx tsx scripts/check-coverage-ratchet.ts coverage/cli/coverage-summary.json ci/coverage-threshold-cli.json "CLI coverage"' |
There was a problem hiding this comment.
let's not set NEMOCLAW_TEST_TIMEOUT here, and rely on the default
There was a problem hiding this comment.
Done in 97972825a: removed NEMOCLAW_TEST_TIMEOUT=20000 from the pre-commit hook. The full npm run check now passes locally using the default hook timeout; the two existing CLI tests that exceeded the default suite timeout now have targeted testTimeoutOptions(...) budgets instead of a global override.
Selective E2E Results — ✅ All requested jobs passedRun: 26005369843
|
|
Pushed Changes in this follow-up:
Local validation on the pushed head:
|
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-double-onboard-e2e | ⏭️ skipped |
| gpu-e2e | ⏭️ skipped |
Selective E2E Results — ✅ All requested jobs passedRun: 26007749323
|
## Summary Fixes #3578. This makes GPU-enabled onboarding recover from a healthy reusable gateway that was previously started without GPU passthrough, instead of aborting with a full `nemoclaw uninstall && nemoclaw onboard --gpu` recovery path. The recovery is intentionally conservative: - Reuse is unchanged when GPU is not requested, when the reusable gateway is already a confirmed Docker-driver gateway, or when a legacy gateway already has Docker GPU DeviceRequests. - A CPU-only legacy gateway is automatically retired only when no local sandboxes are registered, or when `NEMOCLAW_RECREATE_SANDBOX=1` is recreating the one registered sandbox with the same name. - If other registered sandboxes depend on the gateway, onboarding still aborts non-destructively and prints targeted `destroy --cleanup-gateway` guidance. - Unknown Docker/container state remains non-destructive. - `--no-gpu` / `NEMOCLAW_SANDBOX_GPU=0` remain explicit opt-outs; this does not silently downshift GPU-selected onboarding to CPU. I also tightened the manual recovery message so the empty-registry fallback no longer recommends full uninstall first. ## Testing for reviewers The focused unit coverage exercises the issue decision matrix: - GPU requested + healthy CPU-only legacy gateway + empty registry => restart/recreate path. - GPU requested + healthy CPU-only legacy gateway + the one registered sandbox being recreated => restart/recreate path. - GPU requested + healthy CPU-only legacy gateway + shared/different registered sandboxes => abort with targeted recovery. - GPU not requested => reuse path unchanged. - Confirmed Docker-driver gateway => reuse path unchanged. - Unknown legacy Docker inspection state => non-destructive abort. - Empty-registry recovery wording recommends targeted gateway cleanup, not full uninstall. Local validation run: ```text npm run build:cli npm test -- src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-create.test.ts src/lib/onboard/docker-gpu-patch.test.ts npm run typecheck npm run source-shape:check git diff --check npx prek run --all-files ``` All of the above passed locally. I did not run live GPU UAT on Ubuntu GPU, DGX Spark, or DGX Station hardware. That is still the end-to-end validation needed for the exact reporter scenario: first create a CPU/no-GPU gateway, leave the stale gateway behind, then rerun GPU-default onboarding with `NEMOCLAW_RECREATE_SANDBOX=1` and verify the new sandbox gets GPU access. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved GPU passthrough detection and reuse logic to avoid unsafe reuse and to perform safer restart or abort-with-recovery when needed * Updated recovery guidance for GPU passthrough issues with clearer cleanup and re-onboarding instructions * **New Features** * Added explicit GPU inspection and decision flow to better handle legacy gateway GPU states and restart safety checks * **Tests** * Expanded and hardened tests around GPU inspection, reuse decisions, recovery messaging, and CLI timeouts/robustness * **Chores** * Hardened pre-commit test hook execution (serialized runs and cleanup before build) <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3670?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs. ## Changes - Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates. - Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation. - Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs. - Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets. - Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes. - Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior. - Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes. - Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled. - Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall. - Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe. - Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs. - Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`. - Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs. - Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks. - Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked. --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts. * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps. * Added documentation for overriding managed vLLM model selection. * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements. * **Version Bump** * Released v0.0.45. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Fixes #3578.
This makes GPU-enabled onboarding recover from a healthy reusable gateway that was previously started without GPU passthrough, instead of aborting with a full
nemoclaw uninstall && nemoclaw onboard --gpurecovery path.The recovery is intentionally conservative:
NEMOCLAW_RECREATE_SANDBOX=1is recreating the one registered sandbox with the same name.destroy --cleanup-gatewayguidance.--no-gpu/NEMOCLAW_SANDBOX_GPU=0remain explicit opt-outs; this does not silently downshift GPU-selected onboarding to CPU.I also tightened the manual recovery message so the empty-registry fallback no longer recommends full uninstall first.
Testing for reviewers
The focused unit coverage exercises the issue decision matrix:
Local validation run:
All of the above passed locally.
I did not run live GPU UAT on Ubuntu GPU, DGX Spark, or DGX Station hardware. That is still the end-to-end validation needed for the exact reporter scenario: first create a CPU/no-GPU gateway, leave the stale gateway behind, then rerun GPU-default onboarding with
NEMOCLAW_RECREATE_SANDBOX=1and verify the new sandbox gets GPU access.Summary by CodeRabbit
Bug Fixes
New Features
Tests
Chores