fix(onboard): surface per-GPU breakdown on mixed-model hosts#3346
Conversation
Fixes #2669 Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
4121-4124: Please run the onboarding-focused E2E set before merge.This touches preflight output in core onboarding flow, so validating the recommended E2E jobs is worthwhile before release.
As per coding guidelines:
src/lib/onboard.ts“contains core onboarding logic” and changes “affect the full sandbox creation and configuration flow,” with recommended E2E jobs (cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,messaging-compatible-endpoint-e2e,hermes-discord-e2e,hermes-slack-e2e).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 4121 - 4124, This change touches preflight output in the core onboarding flow (see nim.formatNvidiaGpuPreflightLines usage in src/lib/onboard.ts), so before merging run the onboarding-focused end-to-end test suite: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, messaging-compatible-endpoint-e2e, hermes-discord-e2e and hermes-slack-e2e to validate full sandbox creation/configuration and the updated GPU preflight output; if any test fails, revert or adjust the preflight formatting in the block that logs lines[0] and lines.slice(1) and re-run the E2Es until all pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/inference/nim.ts`:
- Around line 298-302: The returned Nvidia unified-memory object is forcing a
single-model label by setting name to unifiedGpuNames[0]; update the return in
the function that constructs this object so it does not populate a global name
(e.g., remove the name property or set it to undefined/null) and keep the gpus
array with per-gpu name entries and memoryMB; this ensures
formatNvidiaGpuPreflightLines will inspect each gpu.name and render a per-model
breakdown instead of collapsing to "Nx <first model>".
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 4121-4124: This change touches preflight output in the core
onboarding flow (see nim.formatNvidiaGpuPreflightLines usage in
src/lib/onboard.ts), so before merging run the onboarding-focused end-to-end
test suite: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e,
messaging-compatible-endpoint-e2e, hermes-discord-e2e and hermes-slack-e2e to
validate full sandbox creation/configuration and the updated GPU preflight
output; if any test fails, revert or adjust the preflight formatting in the
block that logs lines[0] and lines.slice(1) and re-run the E2Es until all pass.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: efc21865-80a1-4777-b9de-cf9212e63a46
📒 Files selected for processing (3)
src/lib/inference/nim.test.tssrc/lib/inference/nim.tssrc/lib/onboard.ts
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Reviewed the current head against main. This keeps the existing compact GPU preflight output for homogeneous hosts while adding a per-GPU breakdown for mixed-model hosts without attributing one model name to the whole machine. The unified-memory fallback now follows the same invariant, and the focused build/test plus visible CI/E2E checks are green.
## Summary Refreshes the release-prep docs for v0.0.39 based on changes merged since the Friday 4pm doc refresh. Updates the source docs, bumps the docs version metadata, and regenerates the NemoClaw user skills from the refreshed docs. ## Changes - #3314 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documents installer Docker setup, Docker group activation, and retry guidance. - #3317 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Documents the DGX Spark and DGX Station express install prompt and `NEMOCLAW_NO_EXPRESS`. - #3328 and #3329 -> `docs/security/best-practices.md`, `docs/deployment/sandbox-hardening.md`: Updates sandbox capability hardening docs for the stricter bounding-set and `setpriv` step-down behavior. - #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`: Documents Windows-host Ollama relaunch behavior, NIM key passthrough, early health-fail diagnostics, and mixed-GPU preflight detail. - #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 -> `docs/about/release-notes.md`: Adds the v0.0.39 release-prep section while keeping the v0.0.38 release notes intact. - Advances the release-prep docs metadata from v0.0.38 to v0.0.39. - Regenerates `.agents/skills/nemoclaw-user-*` from the updated source docs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes v0.0.39 * **New Features** * Host alias management commands for easier configuration * Sandbox GPU control options during onboarding * Update command with check and confirmation modes * **Documentation** * Enhanced Linux installer guidance with Docker and group membership handling * Expanded troubleshooting for permission and connectivity issues * Improved capability-dropping security documentation * Updated inference model switching commands * Brev environment-specific troubleshooting * **Improvements** * DGX Spark/Station express install flow * Windows Ollama relay and health-check enhancements * NVIDIA NIM preflight GPU reporting [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3375) <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
The original #2669 fix added the GPU model to the preflight line only when every GPU on the host shares the same name. The QA verification on a 2-GPU mixed-model host (RTX PRO 6000 Blackwell Max-Q + GB300) showed the regression: with
namedeliberately dropped to avoid misattribution, the preflight fell back to2 GPU(s), 354590 MB VRAM— no model info. This PR adds a per-GPU breakdown for that case while keeping the compact single-line format for the common homogeneous path.Related Issue
Fixes #2669
Changes
detectGpunow also populates agpus: { name, memoryMB }[]field alongside the existingnamefield, on both the primary--query-gpu=name,memory.totalpath and the unified-memory fallback (GB10, Orin, Xavier).groupGpusByNamehelper: groups by normalized name, preserves first-appearance order, sums memory within each group, drops blank-name rows. Memory is deliberately not part of the group key — nvidia-smi already disambiguates memory variants in the name string.formatNvidiaGpuPreflightLinesrenderer with three branches:NVIDIA GPU detected (<model>, <vram> MB)/(Nx <model>, <vram> MB)(unchanged).Nxprefix applied across the whole block when any group has count > 1 (drops the prefix when every group is a singleton).onboard.tspreflight call site reduced to one helper call.Example output on a mixed host:
Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Release Notes