fix(onboard): prefer CDI GPU mode over --gpus on CDI hosts#4956
Conversation
On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g. /etc/cdi/nvidia.yaml on Ubuntu 24.04/26.04), `nemoclaw onboard` selected the `--gpus all` mode for the GPU patch recreate. `docker create --gpus all` is accepted on these hosts so the create-only probe passed, but OpenShell's `gateway start --gpu` injects devices from the CDI spec, so a container recreated via the legacy --gpus path diverges from how the supervisor expects the GPU container to be wired up. The supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1. Reorder GPU mode candidates so the CDI mode (`--device nvidia.com/gpu=all`) is preferred ahead of --gpus whenever a CDI spec is detected; --gpus and the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (candidate order unchanged). Note: the supervisor-reconnect failure is a runtime symptom on real GPU hardware; final validation requires the GPU E2E path (see "verify" below). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughReorders GPU patch mode probing to prefer CDI when Docker reports a readable NVIDIA CDI spec, reorganizes related re-exports, and adds tests validating CDI-first selection, fallback to --gpus and nvidia-runtime, and propagation of the CDI device flag into recreate flows. ChangesCDI-first GPU patch mode candidate selection
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
PR Review AdvisorFindings: 1 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…4948) Address PR review-advisor findings on the CDI-first GPU mode change: - Move the CDI mode-selection tests out of the docker-gpu-patch.test.ts monolith into a focused docker-gpu-patch-mode-selection.test.ts spec, offsetting the flagged monolith growth (back to ~baseline line count). - Pin the fallback chain on a CDI host: CDI probe fails -> --gpus selected; CDI and --gpus probes fail -> NVIDIA runtime selected (attempt order starts with cdi in both cases). - Add a recreate-boundary assertion: recreateOpenShellDockerSandboxWithGpu passes --device nvidia.com/gpu=all to dockerRunDetached on a CDI host and never emits --gpus, proving the selected CDI mode reaches the real recreate command (the patched_create_option the issue logs). All four new tests fail under the previous --gpus-first ordering, confirming they pin the fix rather than restating it. No production code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Selective E2E Results —
|
| Job | Result |
|---|---|
| gpu-e2e | ⏭️ skipped |
|
Runtime validation update for #4948 on PR head
Relevant acceptance evidence from the second run artifacts:
The manual scenario run is still red, but only after the linked issue acceptance point: two later runtime provider assertions failed because the host Ollama socket disappeared between probes ( |
cv
left a comment
There was a problem hiding this comment.
Standard PR CI is green. I resolved the main conflict, added CDI source-boundary coverage, and followed up on the GPU runtime failure. The manual Docker-CDI scenario now passes the linked issue acceptance point: CDI recreate, supervisor reconnect/onboard, Ready/dashboard, GPU/CUDA proof, and the local inference route. The remaining red in that manual run is later host-Ollama runtime probe churn, not the onboard fix.
|
Current-head runtime validation update for #4948 on PR head
Acceptance evidence from the current-head artifacts:
The typed scenario still exits red after the linked issue acceptance point, with the same provider-runner signature as the previous run: |
## Summary - Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages. - Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references. - Refresh generated NemoClaw user skills for the release-prep docs set. ## Source Summary - #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes. - #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route. - #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization. - #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard. - #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage. - #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability. - #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source. ## Skipped - #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning) - Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `nemoclaw <name> agents list` command. * v0.0.62 release notes added summarizing onboarding and recovery improvements. * **Bug Fixes** * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling). * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes. * Quieter/earlier handling of onboarding drift and port collisions. * **Documentation** * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Summary
On Docker-driver GPU hosts that advertise an NVIDIA CDI spec (e.g.
/etc/cdi/nvidia.yamlon Ubuntu 24.04/26.04),nemoclaw onboardselected--gpus allfor the GPU patch recreate, the OpenShell supervisor never reconnected, the sandbox entered Error phase before the GPU proof, and onboard aborted with exit 1. This reorders GPU mode selection to prefer the CDI mode (--device nvidia.com/gpu=all) ahead of--gpuswhenever a CDI spec is detected, matching how OpenShell'sgateway start --gpuinjects GPU devices.Related Issue
Fixes #4948
Changes
src/lib/onboard/docker-gpu-patch.ts:buildDockerGpuModeCandidatesnow puts thecdicandidate first whencdiAvailableis true;--gpusand the NVIDIA runtime remain as fallbacks if the CDI probe fails. Non-CDI hosts are unaffected (order unchanged:gpus,nvidia-runtime). Jetson path unchanged.src/lib/onboard/docker-gpu-patch.test.ts: adds a repro test asserting that on a CDI host where the--gpusprobe would pass, thecdimode is selected; updates the existing candidate-order assertion and stale comments to the corrected ordering.Why this is the right layer
The create-only probe (
docker create --gpus all) is accepted on these hosts, so--gpus alllooked viable but diverges at runtime from OpenShell's CDI-based injection — the supervisor then never reconnects to the recreated container. Preferring CDI when a spec is present removes that divergence.Validation caveat
The supervisor-reconnect failure is a runtime symptom that only manifests on real GPU + Docker-CDI hardware. Unit tests pin the deterministic mode-selection decision; final confirmation requires the GPU E2E path (
e2e-branch-validation:gpu) on an affected host. The existingNEMOCLAW_DOCKER_GPU_PATCH=0escape hatch is unchanged.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesnpm run docsbuilds without warnings (doc changes only)Verification notes: the full onboard suite passes (
npx vitest run src/lib/onboard/— 1189 tests, including the new #4948 repro),npm run typecheck:clipasses, and Biome is clean on the changed files. The fullnpm testrun has 16 pre-existing, environment-dependent failures (network/port-binding e2e-framework fixtures, aMODULE_NOT_FOUNDinssrf-parity, and missing Docker/OpenShell fixtures infetch-guard-patch-regression) that are unrelated to this change and do not reference the modified module — hencenpm testis left unchecked.Signed-off-by: Jason Ma jama@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests