fix(onboard): accurate preflight message when nvidia-smi works but GPU passthrough disabled by kagura-agent · Pull Request #3181 · NVIDIA/NemoClaw

kagura-agent · 2026-05-07T12:25:46Z

Issue

Closes #3174

Type of change

Bug fix

What does this PR do?

The preflight GPU hint in onboard had two problems:

Shown even with --no-gpu: The hint "NVIDIA GPU hardware detected but nvidia-smi is not available" appeared even when the user explicitly opted out of GPU passthrough. That is noise: the user chose --no-gpu, so a message about missing GPU support is unhelpful.
Never verified nvidia-smi availability: The hint assumed that if gpuPassthrough was false and lspci detected NVIDIA hardware, the drivers must be missing. In practice nvidia-smi can work fine (drivers installed) while nvidia-container-toolkit is missing (Docker can't use the GPU), so the message sent users to debug drivers instead of the container toolkit.

Changes

Add !opts.noGpu guard so the hint is skipped when --no-gpu was used
Extract GPU driver hint logic into src/lib/onboard/gpu-hint.ts to satisfy the onboard-entrypoint-budget CI check (onboard.ts is net -5 lines)
Actually run nvidia-smi before claiming it is unavailable
When nvidia-smi works but GPU passthrough was not enabled, print an actionable message about nvidia-container-toolkit installation instead of the misleading driver message
Add unit tests for the extracted helper in src/lib/onboard/gpu-hint.test.ts
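
The changes listed above amount to a small decision over four facts. The following is a minimal sketch of that decision, not the contents of src/lib/onboard/gpu-hint.ts: the names `GpuHintInput`, `selectGpuHint`, and `renderGpuHint` are hypothetical, and the wording of the toolkit message is an assumption (the PR only says it is an actionable message about nvidia-container-toolkit installation).

```typescript
// Illustrative only: these names and the toolkit message wording are
// assumptions, not the actual helper extracted in this PR.
type GpuHint = "none" | "driver" | "container-toolkit";

interface GpuHintInput {
  noGpu: boolean;          // user passed --no-gpu
  gpuPassthrough: boolean; // GPU passthrough was already enabled
  nvidiaHardware: boolean; // lspci listed an NVIDIA device
  nvidiaSmiWorks: boolean; // nvidia-smi ran successfully
}

function selectGpuHint(input: GpuHintInput): GpuHint {
  // Explicit opt-out: print nothing at all.
  if (input.noGpu) return "none";
  // Passthrough is on, or there is no NVIDIA hardware: nothing to hint about.
  if (input.gpuPassthrough || !input.nvidiaHardware) return "none";
  // If the drivers respond, the likely gap is the container toolkit.
  return input.nvidiaSmiWorks ? "container-toolkit" : "driver";
}

function renderGpuHint(hint: GpuHint): string | null {
  switch (hint) {
    case "driver":
      // Original message, as quoted in the PR description.
      return "NVIDIA GPU hardware detected but nvidia-smi is not available";
    case "container-toolkit":
      // Assumed wording for the new message.
      return "nvidia-smi works but GPU passthrough is not enabled; install nvidia-container-toolkit";
    case "none":
      return null;
  }
}
```

Keeping the decision separate from running `lspci` and `nvidia-smi` is one way to make each branch testable without GPU hardware; whether the real helper is split this way is not stated in the PR.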

How did you verify your code works?

npx tsc -p tsconfig.src.json --noEmit — clean compilation
npx vitest run src/lib/onboard/gpu-hint.test.ts — 5/5 tests pass
Code review of the logic flow:
- --no-gpu → opts.noGpu = true → hint block is skipped entirely ✅
- No --no-gpu, nvidia-smi unavailable → original message preserved ✅
- No --no-gpu, nvidia-smi works but container toolkit missing → new actionable message ✅

Checklist

Signed off (DCO)
Minimal diff — only the hint block changed
Tests added for new behavior
onboard-entrypoint-budget satisfied (onboard.ts net -5 lines)

Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

copy-pr-bot · 2026-05-07T12:25:50Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-07T12:25:59Z

Walkthrough

This PR refines the Linux GPU hardware detection logic during onboarding preflight checks. When a user has not opted out of GPU passthrough (--no-gpu), the code now verifies that nvidia-smi actually works before reporting "drivers missing", distinguishing between missing NVIDIA drivers versus missing container toolkit components.

Changes

GPU Detection Hint Refinement

| Layer / File(s) | Summary |
| --- | --- |
| GPU Hardware Detection Logic `src/lib/onboard.ts` | Linux GPU detection now runs `lspci` and verifies `nvidia-smi --query-gpu` availability before emitting hardware-present hints. Conditional messages distinguish "nvidia-smi unavailable" from "drivers available but passthrough not enabled" based on actual `nvidia-smi` exit status and output. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A cunning check for GPUs found,
With lspci scanning all around,
We ask nvidia-smi true,
Does the toolkit work for you?
Now the message fits the clue! 🚀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding logic to detect when nvidia-smi works but GPU passthrough is disabled, and showing a better preflight message. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#3174` by adding nvidia-smi verification, skipping hints when --no-gpu is used, and providing actionable messages for missing nvidia-container-toolkit. |
| Out of Scope Changes check | ✅ Passed | All changes in src/lib/onboard.ts are scoped to fixing the GPU detection hint logic as specified in issue `#3174`; no unrelated modifications detected. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

wscurran · 2026-05-07T14:41:54Z

✨ Thanks for submitting this PR that fixes the inaccurate preflight message when nvidia-smi works but GPU passthrough is disabled. This change involves adding a guard to skip the hint when --no-gpu is used and actually running nvidia-smi before claiming it is unavailable.

Related open issues:

#3174 [Ubuntu 24.04][Onboard] preflight reports "nvidia-smi is not available" when nvidia-smi works but nvidia-container-toolkit is missing

kagura-agent · 2026-05-15T00:09:16Z

Hi @wscurran 👋 friendly ping — any additional changes needed here? Let me know if there's anything I should adjust.

cv · 2026-05-15T00:18:43Z

@kagura-agent there are a few failing checks. Mind taking a look?

kagura-agent · 2026-05-15T03:19:54Z

Rebased onto current main and fixed all CI failures:

Changes in this push:

DCO check — Added Signed-off-by trailer to PR body and commit message (git commit --signoff)
TypeScript compilation (checks + cli-parity) — The merge commit introduced a scope error: optedOutGpuPassthrough was referenced but not defined at that point in onboard(). Fixed by using opts.noGpu directly, which is the opts parameter already in scope
onboard-entrypoint-budget — Extracted GPU driver hint logic into src/lib/onboard/gpu-hint.ts so onboard.ts is net -5 lines (budget requires net-neutral or smaller). The new module lives under src/lib/onboard/ which the budget check explicitly allows to grow
Tests — Added src/lib/onboard/gpu-hint.test.ts with 5 tests covering: lspci failure, no NVIDIA hardware, nvidia-smi failure (driver hint), nvidia-smi success (container toolkit hint), and lspci exception
Removed the merge commit — clean single commit on top of main
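
The five test cases named above can be illustrated with stubbed command output. This is a sketch in plain TypeScript rather than the actual vitest file: `probeGpuHint`, `stubRunner`, the `CommandResult` shape, and the sample `lspci` and `nvidia-smi` output are all hypothetical, chosen only to show how each scenario maps to an outcome.

```typescript
// Hypothetical names and sample output; not copied from gpu-hint.test.ts.
interface CommandResult {
  status: number; // process exit code
  stdout: string;
}
type Runner = (command: string, args: string[]) => CommandResult;
type HintKind = "no-hint" | "driver-hint" | "container-toolkit-hint";

// Run a command, treating a thrown error (e.g. binary not found) as "no result".
function tryRun(run: Runner, command: string, args: string[]): CommandResult | null {
  try {
    return run(command, args);
  } catch {
    return null;
  }
}

function probeGpuHint(run: Runner): HintKind {
  const lspci = tryRun(run, "lspci", []);
  // lspci threw, failed, or listed no NVIDIA device: stay quiet.
  if (lspci === null || lspci.status !== 0 || !/nvidia/i.test(lspci.stdout)) {
    return "no-hint";
  }
  const smi = tryRun(run, "nvidia-smi", ["--query-gpu=name", "--format=csv,noheader"]);
  const smiWorks = smi !== null && smi.status === 0 && smi.stdout.trim() !== "";
  return smiWorks ? "container-toolkit-hint" : "driver-hint";
}

// Build a Runner from a table of canned results; an Error entry is thrown.
function stubRunner(table: Record<string, CommandResult | Error>): Runner {
  return (command: string, _args: string[]): CommandResult => {
    const entry = table[command];
    if (entry === undefined) throw new Error(`unexpected command: ${command}`);
    if (entry instanceof Error) throw entry;
    return entry;
  };
}

const nvidiaLspci: CommandResult = {
  status: 0,
  stdout: "01:00.0 3D controller: NVIDIA Corporation Device\n",
};

const scenarios: Array<{ name: string; run: Runner; expected: HintKind }> = [
  { name: "lspci failure", expected: "no-hint",
    run: stubRunner({ lspci: { status: 1, stdout: "" } }) },
  { name: "no NVIDIA hardware", expected: "no-hint",
    run: stubRunner({ lspci: { status: 0, stdout: "00:02.0 VGA compatible controller: Intel Corporation Device\n" } }) },
  { name: "nvidia-smi failure", expected: "driver-hint",
    run: stubRunner({ lspci: nvidiaLspci, "nvidia-smi": { status: 1, stdout: "" } }) },
  { name: "nvidia-smi success", expected: "container-toolkit-hint",
    run: stubRunner({ lspci: nvidiaLspci, "nvidia-smi": { status: 0, stdout: "Example GPU\n" } }) },
  { name: "lspci exception", expected: "no-hint",
    run: stubRunner({ lspci: new Error("spawn lspci ENOENT") }) },
];

// Each entry pairs the expected outcome with what the probe actually returned.
const results = scenarios.map((s) => ({
  name: s.name,
  expected: s.expected,
  actual: probeGpuHint(s.run),
}));
```

Injecting the command runner keeps the scenarios independent of the host, so the same cases run on machines without a GPU or without `lspci` installed.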

…U passthrough disabled (NVIDIA#3174)

Two changes to the GPU hint in preflight:

1. Skip the hint entirely when the user explicitly opted out with --no-gpu; they chose not to use GPU, so the message is noise.
2. When the hint does show, verify nvidia-smi is actually unavailable before claiming it is. If nvidia-smi works (drivers installed) but GPU passthrough was not enabled (e.g. container toolkit missing), print an actionable message about nvidia-container-toolkit instead of the misleading 'nvidia-smi is not available' text.

Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

## Summary

Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs.

## Changes

- Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts.
  * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
  * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements.
* **Version Bump**
  * Released v0.0.45.

wscurran added NemoClaw CLI labels May 7, 2026

jyaunches mentioned this pull request May 7, 2026

refactor(cli): group inference and onboard support modules #3195

Merged

12 tasks

cv closed this May 12, 2026

cv reopened this May 12, 2026

kagura-agent force-pushed the fix/preflight-nvidia-smi-message-3174 branch from d8f1e09 to c234775 Compare May 15, 2026 03:19

kagura-agent force-pushed the fix/preflight-nvidia-smi-message-3174 branch from c234775 to 0a26893 Compare May 15, 2026 23:07

kagura-agent force-pushed the fix/preflight-nvidia-smi-message-3174 branch from 0a26893 to f49dec5 Compare May 16, 2026 06:18

Merge branch 'main' into fix/preflight-nvidia-smi-message-3174

a605292

cv added the v0.0.45 label May 17, 2026

cv approved these changes May 17, 2026

cv enabled auto-merge (squash) May 17, 2026 19:08

cv merged commit ae4229d into NVIDIA:main May 17, 2026
16 checks passed

miyoungc mentioned this pull request May 18, 2026

docs: update release notes for v0.0.45 #3755

Merged

12 tasks

coderabbitai Bot mentioned this pull request May 21, 2026

fix(onboard): reject Jetson sandbox GPU passthrough #3965

Merged

12 tasks

prekshivyas mentioned this pull request May 28, 2026

[Ubuntu 24.04] NO GPU Detected when onboarding but GPU is present #1182

Closed

wscurran added area: cli (Command line interface, flags, terminal UX, or output) and bug-fix (PR fixes a bug or regression) and removed NemoClaw CLI labels Jun 3, 2026

kagura-agent mentioned this pull request Jun 5, 2026

[All Platforms][Install] Host DNS blocking via iptables OUTPUT is not caught in preflight; failures only surface during provider validation #4784

Closed

cv merged 2 commits into NVIDIA:main from kagura-agent:fix/preflight-nvidia-smi-message-3174

3 participants
