fix(onboard): accurate preflight message when nvidia-smi works but GPU passthrough disabled #3181
📝 Walkthrough
This PR refines the Linux GPU hardware detection logic during onboarding preflight checks. When a user has not opted out of GPU passthrough (
Changes: GPU Detection Hint Refinement
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
✨ Thanks for submitting this PR that fixes the inaccurate preflight message when nvidia-smi works but GPU passthrough is disabled. This change involves adding a guard to skip the hint when --no-gpu is used and actually running nvidia-smi before claiming it is unavailable. Related open issues:
Hi @wscurran 👋 friendly ping: any additional changes needed here? Let me know if there's anything I should adjust.
@kagura-agent there are a few failing checks. Mind taking a look? |
d8f1e09 to c234775
Rebased onto current Changes in this push:
c234775 to 0a26893
…U passthrough disabled (NVIDIA#3174)

Two changes to the GPU hint in preflight:

1. Skip the hint entirely when the user explicitly opted out with --no-gpu. They chose not to use GPU, so the message is noise.
2. When the hint does show, verify nvidia-smi is actually unavailable before claiming it is. If nvidia-smi works (drivers installed) but GPU passthrough was not enabled (e.g. container toolkit missing), print an actionable message about nvidia-container-toolkit instead of the misleading 'nvidia-smi is not available' text.

Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
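The two changes in this commit message amount to a small decision table. A minimal sketch of that table follows, assuming a pure function with hypothetical names (`GpuHintInput`, `selectGpuHint`, and the two hint labels are illustrative and not taken from `gpu-hint.ts`; only `noGpu` and `gpuPassthrough` appear in the PR description):

```typescript
// Hypothetical input shape; only `noGpu` and `gpuPassthrough` are names from the PR.
interface GpuHintInput {
  noGpu: boolean; // user passed --no-gpu
  gpuPassthrough: boolean; // GPU passthrough ended up enabled
  nvidiaHardwareDetected: boolean; // e.g. lspci lists an NVIDIA device
  nvidiaSmiWorks: boolean; // result of actually running nvidia-smi
}

// Hypothetical labels standing in for the two user-facing messages.
type GpuHint = "drivers-missing" | "container-toolkit-missing" | null;

function selectGpuHint(input: GpuHintInput): GpuHint {
  // Change 1: an explicit opt-out silences the hint entirely.
  if (input.noGpu) return null;
  // No hint is needed when passthrough works or no NVIDIA hardware is present.
  if (input.gpuPassthrough || !input.nvidiaHardwareDetected) return null;
  // Change 2: only blame nvidia-smi when it really is unavailable.
  return input.nvidiaSmiWorks ? "container-toolkit-missing" : "drivers-missing";
}
```

Keeping the decision separate from the probing makes each scenario checkable without a GPU on the test machine.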
0a26893 to f49dec5
## Summary

Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs.

## Changes

- Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts.
  * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
  * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements.
* **Version Bump**
  * Released v0.0.45.
Issue
Closes #3174
Type of change
What does this PR do?
The preflight GPU hint in `onboard` had two problems:

1. Showed when user used `--no-gpu`: The hint "NVIDIA GPU hardware detected but nvidia-smi is not available" appeared even when the user explicitly opted out of GPU passthrough. This is noise: they chose `--no-gpu`, so telling them about missing GPU support is unhelpful.
2. Never verified nvidia-smi availability: The hint assumed that if `gpuPassthrough` was false and `lspci` detected NVIDIA hardware, then drivers must be missing. But the actual issue could be that nvidia-smi works fine (drivers installed) while `nvidia-container-toolkit` is missing (Docker can't use the GPU). The message sent users to debug drivers instead of the container toolkit.

Changes

- Add `!opts.noGpu` guard so the hint is skipped when `--no-gpu` was used
- Move the hint logic to `src/lib/onboard/gpu-hint.ts` to satisfy the `onboard-entrypoint-budget` CI check (onboard.ts is net -5 lines)
- Run `nvidia-smi` before claiming it is unavailable
- When `nvidia-smi` works, point to `nvidia-container-toolkit` installation instead of the misleading driver message
- Add tests in `src/lib/onboard/gpu-hint.test.ts`

How did you verify your code works?

- `npx tsc -p tsconfig.src.json --noEmit`: clean compilation
- `npx vitest run src/lib/onboard/gpu-hint.test.ts`: 5/5 tests pass
- `--no-gpu` → `opts.noGpu = true` → hint block is skipped entirely ✅
- Without `--no-gpu`, nvidia-smi unavailable → original message preserved ✅
- Without `--no-gpu`, nvidia-smi works but container toolkit missing → new actionable message ✅

Checklist
Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
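The description says the fix runs `nvidia-smi` instead of assuming it is missing. One way such a probe could look is sketched below, using Node's `child_process.spawnSync`. The helper name `commandSucceeds`, the `-L` argument, and the 5-second timeout are assumptions for illustration, not the code in `src/lib/onboard/gpu-hint.ts`:

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical helper: true only if the command starts and exits with status 0.
// The command is a parameter so the probe can be exercised on machines without a GPU;
// the defaults target `nvidia-smi -L`, which lists the GPUs the driver can see.
function commandSucceeds(command: string = "nvidia-smi", args: string[] = ["-L"]): boolean {
  const result = spawnSync(command, args, { stdio: "ignore", timeout: 5000 });
  // result.error is set when the binary is missing (ENOENT) or the timeout fires.
  return result.error === undefined && result.status === 0;
}
```

A missing binary and a failing driver both yield `false` here, which matches the description's "nvidia-smi unavailable" case; a successful run corresponds to the "drivers installed, container toolkit missing" case.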