fix(onboard): surface per-GPU breakdown on mixed-model hosts #3346

Merged
ericksoa merged 2 commits into main from fix/2669-multi-gpu-mixed-model
May 11, 2026
@laitingsheng

@laitingsheng laitingsheng commented May 11, 2026

Contributor

Summary

The original #2669 fix added the GPU model to the preflight line only when every GPU on the host shares the same name. QA verification on a 2-GPU mixed-model host (RTX PRO 6000 Blackwell Max-Q + GB300) showed the regression: with name deliberately dropped to avoid misattribution, the preflight fell back to "2 GPU(s), 354590 MB VRAM" with no model info. This PR adds a per-GPU breakdown for that case while keeping the compact single-line format for the common homogeneous path.

Related Issue

Fixes #2669

Changes

  • detectGpu now also populates a gpus: { name, memoryMB }[] field alongside the existing name field, on both the primary --query-gpu=name,memory.total path and the unified-memory fallback (GB10, Orin, Xavier).
  • New groupGpusByName helper: groups by normalized name, preserves first-appearance order, sums memory within each group, drops blank-name rows. Memory is deliberately not part of the group key — nvidia-smi already disambiguates memory variants in the name string.
  • New formatNvidiaGpuPreflightLines renderer with three branches:
    • Homogeneous (1 GPU or N of the same model) → compact NVIDIA GPU detected (<model>, <vram> MB) / (Nx <model>, <vram> MB) (unchanged).
    • Mixed-model → aggregate header + indented breakdown, with Nx prefix applied across the whole block when any group has count > 1 (drops the prefix when every group is a singleton).
    • All-blank names → existing count-only fallback.
  • onboard.ts preflight call site reduced to one helper call.
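The grouping rules listed above can be illustrated with a minimal sketch. This is not the code from the PR: the GpuInfo and GpuGroup shapes, the normalizeGpuName helper, and the choice of "trim and collapse whitespace" as the normalization are all assumptions made for illustration. Only the behaviour (group by normalized name, keep first-appearance order, sum memory per group, drop blank names, keep memory out of the key) comes from the description.

```typescript
// Hypothetical shapes: the PR only states that each entry carries `name` and `memoryMB`.
interface GpuInfo {
  name: string;
  memoryMB: number;
}

interface GpuGroup {
  name: string;
  count: number;
  totalMemoryMB: number;
}

// Assumption: "normalized" means trimmed, with runs of whitespace collapsed.
function normalizeGpuName(name: string): string {
  return name.trim().replace(/\s+/g, " ");
}

function groupGpusByName(gpus: GpuInfo[]): GpuGroup[] {
  // A Map iterates in insertion order, which gives first-appearance ordering.
  const groups = new Map<string, GpuGroup>();
  for (const gpu of gpus) {
    const name = normalizeGpuName(gpu.name);
    if (name === "") continue; // drop blank-name rows
    const existing = groups.get(name);
    if (existing) {
      existing.count += 1;
      existing.totalMemoryMB += gpu.memoryMB; // memory is summed, never part of the key
    } else {
      groups.set(name, { name, count: 1, totalMemoryMB: gpu.memoryMB });
    }
  }
  return Array.from(groups.values());
}
```

Keeping memory out of the key means two rows with the same name always land in one group, which matches the stated reasoning that nvidia-smi already encodes memory variants in the name string.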

Example output on a mixed host:

  ✓ NVIDIA GPU detected: 2 GPUs, 354590 MB VRAM
      - NVIDIA RTX PRO 6000 Blackwell Max-Q (97887 MB)
      - NVIDIA GB300 (256703 MB)
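A rough sketch of how the three renderer branches could produce the output above. The function signature (a plain array of per-GPU entries), the exact wording of the count-only fallback line, the four-space indent, and the use of the summed total in the compact line are assumptions; the real formatNvidiaGpuPreflightLines in src/lib/inference/nim.ts may differ. The grouping is inlined so that the block stands alone.

```typescript
// Hypothetical per-GPU shape, matching the `gpus: { name, memoryMB }[]` field described in the PR.
interface GpuInfo {
  name: string;
  memoryMB: number;
}

function formatNvidiaGpuPreflightLines(gpus: GpuInfo[]): string[] {
  const totalMB = gpus.reduce((sum, g) => sum + g.memoryMB, 0);

  // Group by trimmed name in first-appearance order, skipping blank names.
  const groups = new Map<string, { count: number; memoryMB: number }>();
  for (const gpu of gpus) {
    const name = gpu.name.trim();
    if (name === "") continue;
    const group = groups.get(name) ?? { count: 0, memoryMB: 0 };
    group.count += 1;
    group.memoryMB += gpu.memoryMB;
    groups.set(name, group);
  }
  const entries = Array.from(groups.entries());

  // Branch 3: all-blank names, count-only fallback (wording is an assumption).
  if (entries.length === 0) {
    return [`NVIDIA GPU detected: ${gpus.length} GPU(s), ${totalMB} MB VRAM`];
  }

  // Branch 1: homogeneous host, compact single line.
  if (entries.length === 1 && entries[0][1].count === gpus.length) {
    const [name, group] = entries[0];
    const prefix = group.count > 1 ? `${group.count}x ` : "";
    return [`NVIDIA GPU detected (${prefix}${name}, ${totalMB} MB)`];
  }

  // Branch 2: mixed models, aggregate header plus indented breakdown.
  // The Nx prefix is shown on every row as soon as any group has count > 1.
  const showCounts = entries.some(([, group]) => group.count > 1);
  const header = `NVIDIA GPU detected: ${gpus.length} GPUs, ${totalMB} MB VRAM`;
  const breakdown = entries.map(([name, group]) => {
    const prefix = showCounts ? `${group.count}x ` : "";
    return `    - ${prefix}${name} (${group.memoryMB} MB)`;
  });
  return [header, ...breakdown];
}

// Hypothetical call site: first line gets the check mark, the rest are printed as-is.
const lines = formatNvidiaGpuPreflightLines([
  { name: "NVIDIA RTX PRO 6000 Blackwell Max-Q", memoryMB: 97887 },
  { name: "NVIDIA GB300", memoryMB: 256703 },
]);
console.log(`  ✓ ${lines[0]}`);
for (const line of lines.slice(1)) console.log(`  ${line}`);
```

Returning an array of lines rather than a single string lets the caller style the header and the breakdown differently, which is consistent with the review note about a call site that logs lines[0] and lines.slice(1).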

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced GPU detection now provides per-device memory and model information for multi-GPU systems
    • Improved NVIDIA preflight output with better formatting for homogeneous, mixed-model, and unified-memory GPU configurations

Fixes #2669

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 11, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e16b8d49-0eab-499b-b92d-cf10d5af5963

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

4121-4124: Please run the onboarding-focused E2E set before merge.

This touches preflight output in core onboarding flow, so validating the recommended E2E jobs is worthwhile before release.

As per coding guidelines: src/lib/onboard.ts “contains core onboarding logic” and changes “affect the full sandbox creation and configuration flow,” with recommended E2E jobs (cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, messaging-compatible-endpoint-e2e, hermes-discord-e2e, hermes-slack-e2e).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 4121 - 4124, This change touches preflight
output in the core onboarding flow (see nim.formatNvidiaGpuPreflightLines usage
in src/lib/onboard.ts), so before merging run the onboarding-focused end-to-end
test suite: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
messaging-compatible-endpoint-e2e, hermes-discord-e2e and hermes-slack-e2e to
validate full sandbox creation/configuration and the updated GPU preflight
output; if any test fails, revert or adjust the preflight formatting in the
block that logs lines[0] and lines.slice(1) and re-run the E2Es until all pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 298-302: The returned Nvidia unified-memory object is forcing a
single-model label by setting name to unifiedGpuNames[0]; update the return in
the function that constructs this object so it does not populate a global name
(e.g., remove the name property or set it to undefined/null) and keep the gpus
array with per-gpu name entries and memoryMB; this ensures
formatNvidiaGpuPreflightLines will inspect each gpu.name and render a per-model
breakdown instead of collapsing to "Nx <first model>".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: efc21865-80a1-4777-b9de-cf9212e63a46

📥 Commits

Reviewing files that changed from the base of the PR and between 118541a and c39addc.

📒 Files selected for processing (3)
  • src/lib/inference/nim.test.ts
  • src/lib/inference/nim.ts
  • src/lib/onboard.ts

Comment thread src/lib/inference/nim.ts
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 11, 2026 12:07

@ericksoa ericksoa left a comment

Contributor

Reviewed the current head against main. This keeps the existing compact GPU preflight output for homogeneous hosts while adding a per-GPU breakdown for mixed-model hosts without attributing one model name to the whole machine. The unified-memory fallback now follows the same invariant, and the focused build/test plus visible CI/E2E checks are green.
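The invariant mentioned in this review (never attribute one model name to a whole machine that mixes models) can be sketched as below. The helper names sharedGpuName and buildGpuSummary, and the summary object's shape, are hypothetical and only illustrate the idea; they are not taken from nim.ts.

```typescript
// Hypothetical per-GPU shape, matching the `gpus: { name, memoryMB }[]` field described in the PR.
interface GpuInfo {
  name: string;
  memoryMB: number;
}

// Returns the shared model name only when every GPU reports the same
// non-blank name; otherwise returns undefined so no single model is
// attributed to the whole host.
function sharedGpuName(gpus: GpuInfo[]): string | undefined {
  const names = new Set(gpus.map((g) => g.name.trim()));
  if (names.size !== 1) return undefined;
  const [only] = Array.from(names);
  return only === "" ? undefined : only;
}

// Assumed summary shape: a host-level `name` plus the per-GPU list.
// Both the discrete-GPU path and a unified-memory fallback could build
// their result through the same helper so the invariant holds in both.
function buildGpuSummary(gpus: GpuInfo[]) {
  return {
    name: sharedGpuName(gpus),
    count: gpus.length,
    totalMemoryMB: gpus.reduce((sum, g) => sum + g.memoryMB, 0),
    gpus,
  };
}
```

With this approach the per-GPU names stay available in gpus even when name is undefined, so a renderer can still print a per-model breakdown instead of collapsing to the first model seen.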

@ericksoa ericksoa merged commit 2e53919 into main May 11, 2026
22 checks passed
@miyoungc miyoungc mentioned this pull request May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
  * Enhanced Linux installer guidance with Docker and group membership
    handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DGX Spark][Onboard] preflight GPU detection prints "1 GPU(s), 284208 MB VRAM" without GPU model name (GB300)

3 participants