fix(onboard): surface per-GPU breakdown on mixed-model hosts by laitingsheng · Pull Request #3346 · NVIDIA/NemoClaw

laitingsheng · 2026-05-11T11:42:30Z

Summary

The original #2669 fix added the GPU model to the preflight line only when every GPU on the host shares the same name. QA verification on a 2-GPU mixed-model host (RTX PRO 6000 Blackwell Max-Q + GB300) showed the regression: with the name deliberately dropped to avoid misattribution, the preflight fell back to "2 GPU(s), 354590 MB VRAM", with no model info at all. This PR adds a per-GPU breakdown for that case while keeping the compact single-line format for the common homogeneous path.

Related Issue

Fixes #2669

Changes

- detectGpu now also populates a gpus: { name, memoryMB }[] field alongside the existing name field, on both the primary --query-gpu=name,memory.total path and the unified-memory fallback (GB10, Orin, Xavier).
- New groupGpusByName helper: groups by normalized name, preserves first-appearance order, sums memory within each group, drops blank-name rows. Memory is deliberately not part of the group key: nvidia-smi already disambiguates memory variants in the name string.
- New formatNvidiaGpuPreflightLines renderer with three branches:
  - Homogeneous (1 GPU or N of the same model) → compact NVIDIA GPU detected (<model>, <vram> MB) / (Nx <model>, <vram> MB) (unchanged).
  - Mixed-model → aggregate header + indented breakdown, with Nx prefix applied across the whole block when any group has count > 1 (drops the prefix when every group is a singleton).
  - All-blank names → existing count-only fallback.
- onboard.ts preflight call site reduced to one helper call.
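
The grouping rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/lib/inference/nim.ts: the parseGpuRows helper, the GpuGroup shape, and the trim-and-collapse-whitespace normalization are hypothetical, and the input is assumed to be nvidia-smi output produced with --query-gpu=name,memory.total --format=csv,noheader,nounits. Only the { name, memoryMB } entry shape and the grouping rules themselves come from the description.

```typescript
// Entry shape named in the PR description.
interface GpuInfo {
  name: string;
  memoryMB: number;
}

// Hypothetical group shape, used for illustration only.
interface GpuGroup {
  name: string;
  count: number;
  totalMemoryMB: number;
}

// Parse "name, memory" rows (assumed csv,noheader,nounits output).
// Rows whose memory column is not a number are skipped.
function parseGpuRows(csv: string): GpuInfo[] {
  const gpus: GpuInfo[] = [];
  for (const line of csv.split("\n")) {
    const comma = line.lastIndexOf(",");
    if (comma === -1) continue;
    const memoryMB = parseInt(line.slice(comma + 1).trim(), 10);
    if (isNaN(memoryMB)) continue;
    gpus.push({ name: line.slice(0, comma).trim(), memoryMB });
  }
  return gpus;
}

// Group by normalized name, keep first-appearance order, sum memory per
// group, and drop blank-name rows. Memory is not part of the key.
function groupGpusByName(gpus: GpuInfo[]): GpuGroup[] {
  const groups = new Map<string, GpuGroup>();
  for (const gpu of gpus) {
    // Assumed normalization: trim and collapse internal whitespace.
    const name = gpu.name.trim().replace(/\s+/g, " ");
    if (name === "") continue;
    const existing = groups.get(name);
    if (existing) {
      existing.count += 1;
      existing.totalMemoryMB += gpu.memoryMB;
    } else {
      groups.set(name, { name, count: 1, totalMemoryMB: gpu.memoryMB });
    }
  }
  // A Map iterates in insertion order, which is first-appearance order.
  return Array.from(groups.values());
}
```

Keying on the name alone keeps the helper small; a Map is used here because its insertion-order iteration gives the first-appearance ordering without a separate index.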

Example output on a mixed host:

  ✓ NVIDIA GPU detected: 2 GPUs, 354590 MB VRAM
      - NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB)
      - NVIDIA GB300 (256703 MB)
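
A rough sketch of the three rendering branches, reproducing the example output above, might look like this. Everything named here is an assumption made for illustration: formatPreflightLinesSketch and indentForConsole are hypothetical names, the signature is not the one used by formatNvidiaGpuPreflightLines, and the exact wording of the count-only fallback line and the placement of the Nx prefix in breakdown rows are guesses based on the description.

```typescript
// Hypothetical group shape: one entry per distinct GPU model.
interface GpuGroup {
  name: string;
  count: number;
  totalMemoryMB: number;
}

// Returns the header line first, followed by any breakdown rows.
function formatPreflightLinesSketch(
  groups: GpuGroup[],
  gpuCount: number,
  totalMemoryMB: number,
): string[] {
  // All-blank names: nothing to attribute, so use a count-only line
  // (the exact wording of this fallback is an assumption).
  if (groups.length === 0) {
    return [`NVIDIA GPU detected: ${gpuCount} GPU(s), ${totalMemoryMB} MB VRAM`];
  }
  // Homogeneous: one compact line, with an Nx prefix when N > 1.
  if (groups.length === 1) {
    const only = groups[0];
    const prefix = only.count > 1 ? `${only.count}x ` : "";
    return [`NVIDIA GPU detected (${prefix}${only.name}, ${totalMemoryMB} MB)`];
  }
  // Mixed-model: aggregate header plus one row per group. The Nx prefix
  // is shown on every row as soon as any group has more than one GPU.
  const showCounts = groups.some((g) => g.count > 1);
  const rows = groups.map((g) => {
    const prefix = showCounts ? `${g.count}x ` : "";
    return `- ${prefix}${g.name} (${g.totalMemoryMB} MB)`;
  });
  return [`NVIDIA GPU detected: ${gpuCount} GPUs, ${totalMemoryMB} MB VRAM`, ...rows];
}

// Hypothetical call-site helper: check mark on the header, deeper indent
// on the breakdown rows, matching the spacing of the example output.
function indentForConsole(lines: string[]): string[] {
  return lines.map((line, i) => (i === 0 ? `  ✓ ${line}` : `      ${line}`));
}
```

Returning an array of lines instead of a pre-joined string lets the caller decide how to decorate the header and indent the rows, which fits the description of a call site reduced to a single helper call.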

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

Release Notes

New Features
- Enhanced GPU detection now provides per-device memory and model information for multi-GPU systems
- Improved NVIDIA preflight output with better formatting for homogeneous, mixed-model, and unified-memory GPU configurations

Fixes #2669 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

copy-pr-bot · 2026-05-11T11:42:34Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-11T11:43:45Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e16b8d49-0eab-499b-b92d-cf10d5af5963

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/lib/onboard.ts (1)
4121-4124: Please run the onboarding-focused E2E set before merge.

This touches preflight output in core onboarding flow, so validating the recommended E2E jobs is worthwhile before release.

As per coding guidelines: src/lib/onboard.ts “contains core onboarding logic” and changes “affect the full sandbox creation and configuration flow,” with recommended E2E jobs (cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, messaging-compatible-endpoint-e2e, hermes-discord-e2e, hermes-slack-e2e).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4121 - 4124, This change touches preflight
output in the core onboarding flow (see nim.formatNvidiaGpuPreflightLines usage
in src/lib/onboard.ts), so before merging run the onboarding-focused end-to-end
test suite: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
messaging-compatible-endpoint-e2e, hermes-discord-e2e and hermes-slack-e2e to
validate full sandbox creation/configuration and the updated GPU preflight
output; if any test fails, revert or adjust the preflight formatting in the
block that logs lines[0] and lines.slice(1) and re-run the E2Es until all pass.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 298-302: The returned Nvidia unified-memory object is forcing a
single-model label by setting name to unifiedGpuNames[0]; update the return in
the function that constructs this object so it does not populate a global name
(e.g., remove the name property or set it to undefined/null) and keep the gpus
array with per-gpu name entries and memoryMB; this ensures
formatNvidiaGpuPreflightLines will inspect each gpu.name and render a per-model
breakdown instead of collapsing to "Nx <first model>".



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: efc21865-80a1-4777-b9de-cf9212e63a46

📥 Commits

Reviewing files that changed from the base of the PR and between 118541a and c39addc.

📒 Files selected for processing (3)

src/lib/inference/nim.test.ts
src/lib/inference/nim.ts
src/lib/onboard.ts

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

ericksoa

Reviewed the current head against main. This keeps the existing compact GPU preflight output for homogeneous hosts while adding a per-GPU breakdown for mixed-model hosts without attributing one model name to the whole machine. The unified-memory fallback now follows the same invariant, and the focused build/test plus visible CI/E2E checks are green.
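
The invariant described in this review (never attribute one model name to the whole machine unless every GPU shares it) could be expressed along these lines. This is a hedged sketch only: sharedGpuName is a hypothetical helper, not a function confirmed to exist in nim.ts, and trimming is the only normalization assumed.

```typescript
// Entry shape named in the PR description.
interface GpuInfo {
  name: string;
  memoryMB: number;
}

// Returns a host-wide model name only when every GPU reports the same
// non-blank name; otherwise returns undefined so that callers fall back
// to the per-GPU entries instead of labelling the host by its first GPU.
function sharedGpuName(gpus: GpuInfo[]): string | undefined {
  const names = new Set(gpus.map((gpu) => gpu.name.trim()));
  if (names.size !== 1) return undefined;
  const only = Array.from(names)[0];
  return only === "" ? undefined : only;
}
```

Applying the same check on both the primary path and the unified-memory fallback would address the inline comment about the fallback setting name to unifiedGpuNames[0].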


fix(onboard): surface per-GPU breakdown on mixed-model hosts

c39addc

Fixes #2669 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot reviewed May 11, 2026

Comment thread src/lib/inference/nim.ts

test(nim): fold redundant cases; fix unified-memory misattribution

07db5a2

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng marked this pull request as ready for review May 11, 2026 12:07

laitingsheng added the v0.0.39 label May 11, 2026

ericksoa approved these changes May 11, 2026

ericksoa merged commit 2e53919 into main May 11, 2026
22 checks passed

ericksoa mentioned this pull request May 11, 2026

fix(nim): restore NGC_API_KEY env passthrough and fast-fail health check #3335

Merged

12 tasks

miyoungc mentioned this pull request May 12, 2026

docs: refresh 0.0.39 release prep #3375

Merged

12 tasks

wscurran added the bug-fix label (PR fixes a bug or regression) Jun 8, 2026

ericksoa merged 2 commits into main from fix/2669-multi-gpu-mixed-model
