feat: add DGX Spark (GB10) unified memory architecture support by inureyes · Pull Request #146 · lablup/all-smi

inureyes · 2026-04-08T02:17:56Z

Summary

- Detect UMA (Unified Memory Architecture) on NVIDIA GB10/Blackwell GPUs, where nvmlDeviceGetMemoryInfo() reports memory as unavailable
- Fall back to system memory from /proc/meminfo when NVML memory reporting is unavailable on UMA devices
- Suppress misleading PCIe generation/width metrics for UMA devices (they use an internal interconnect, not PCIe)
- Add "Memory Type: Unified" and "Interconnect: Integrated" annotations in device details
- Handle UMA detection in both the NVML and nvidia-smi fallback code paths
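As an illustration of the PCIe suppression and the two annotations listed above, here is a minimal sketch. The function name `build_device_detail`, the `HashMap<String, String>` detail shape, and the "PCIe Generation"/"PCIe Width" keys are assumptions made for this example; only the "Memory Type: Unified" and "Interconnect: Integrated" strings come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build the device detail map for one GPU.
/// For UMA devices the PCIe entries are omitted and the two
/// annotations from this PR are inserted instead.
pub fn build_device_detail(
    is_uma: bool,
    pcie_gen: Option<u32>,
    pcie_width: Option<u32>,
) -> HashMap<String, String> {
    let mut detail = HashMap::new();
    if is_uma {
        // UMA devices use an internal interconnect, so PCIe values would mislead.
        detail.insert("Memory Type".to_string(), "Unified".to_string());
        detail.insert("Interconnect".to_string(), "Integrated".to_string());
    } else {
        // Key names below are assumed, not taken from the codebase.
        if let Some(generation) = pcie_gen {
            detail.insert("PCIe Generation".to_string(), generation.to_string());
        }
        if let Some(width) = pcie_width {
            detail.insert("PCIe Width".to_string(), format!("x{width}"));
        }
    }
    detail
}
```

With `is_uma` set, the PCIe arguments are ignored entirely, so a stale or placeholder link reading cannot leak into the output.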

Closes #80

Test plan

- Build passes on GB10 (aarch64)
- GPU memory reports system memory total (~128GB) instead of 0
- Device detail shows "Memory Type: Unified"
- PCIe metrics suppressed for UMA devices
- No regression on discrete GPU systems
- Unit tests pass for meminfo parsing and memory value parsing

Detect UMA on Blackwell-class GPUs where NVML memory_info() returns unavailable, and fall back to system memory from /proc/meminfo. Suppress misleading PCIe metrics for UMA devices and annotate device details with memory type and interconnect information.
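The /proc/meminfo fallback can be sketched roughly as follows. The name `parse_meminfo_content` appears in this PR, but the signature, the `Option<(u64, u64)>` return shape, and the choice of `MemTotal` minus `MemAvailable` as "used" are assumptions for illustration, not the merged implementation.

```rust
/// Parse the text of /proc/meminfo and return (total_bytes, used_bytes).
/// Returns None when MemTotal or MemAvailable is missing or unparsable.
/// "used" is computed as MemTotal - MemAvailable (an assumption).
pub fn parse_meminfo_content(content: &str) -> Option<(u64, u64)> {
    let mut total_kb: Option<u64> = None;
    let mut available_kb: Option<u64> = None;

    for line in content.lines() {
        // Lines look like "MemTotal:       131072000 kB".
        let Some((key, rest)) = line.split_once(':') else {
            continue; // skip lines without a "key: value" shape
        };
        let value = rest
            .split_whitespace()
            .next()
            .and_then(|v| v.parse::<u64>().ok());
        match key.trim() {
            "MemTotal" => total_kb = value,
            "MemAvailable" => available_kb = value,
            _ => {}
        }
    }

    // /proc/meminfo reports these fields in kB; convert to bytes.
    let total = total_kb?.checked_mul(1024)?;
    let available = available_kb?.checked_mul(1024)?;
    // saturating_sub avoids underflow if available exceeds total.
    Some((total, total.saturating_sub(available)))
}
```

Taking the file content as a `&str` rather than reading the file inside the function keeps the parser testable without touching the real /proc filesystem.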

get_system_memory_for_uma() returns (total, used) but the NVML code path destructured it as (used, total), causing memory values to be swapped on GB10 systems.

The previous fix (af86da0) swapped the destructuring order to fix UMA devices but inadvertently broke discrete GPUs by binding m.used to total_memory and m.total to used_memory. Align the tuple elements with the (total_memory, used_memory) destructuring.
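The ordering problem described in these two fixes can be illustrated with a small sketch. `MemoryInfo` and `resolve_memory` are hypothetical stand-ins invented for this example; only the `(total, used)` tuple order and the `total_memory`/`used_memory` binding names come from the commit messages.

```rust
/// Hypothetical stand-in for the memory struct returned by NVML.
pub struct MemoryInfo {
    pub total: u64,
    pub used: u64,
}

/// Resolve (total_memory, used_memory) for one device.
/// `uma_fallback` is expected in (total, used) order, matching what
/// get_system_memory_for_uma() is described as returning.
pub fn resolve_memory(nvml: Option<&MemoryInfo>, uma_fallback: (u64, u64)) -> (u64, u64) {
    // Both arms must yield (total, used) so that one destructuring
    // pattern is correct for discrete and UMA devices alike.
    let (total_memory, used_memory) = match nvml {
        Some(m) => (m.total, m.used),
        None => uma_fallback,
    };
    (total_memory, used_memory)
}
```

Because both tuple elements are `u64`, the compiler cannot catch a swap. Returning a small struct with named fields instead of a tuple would rule out this class of bug, at the cost of a slightly larger change.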

inureyes · 2026-04-08T02:25:55Z

Call memory_info() once per device per tick instead of 3 times. Cache the result and pass memory_total to is_uma_device_with_mem(). Remove redundant Memory Type insertion from hot path (already cached in device detail).
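A rough sketch of the "query once, reuse the result" pattern described above. `FakeDevice`, `Sample`, and `sample_device` are hypothetical, and the body of `is_uma_device_with_mem` shown here (name match plus a zero total) is an assumption about how the cached total might be used, not the actual predicate.

```rust
use std::cell::Cell;

/// Hypothetical device handle that counts memory_info() calls.
pub struct FakeDevice {
    pub name: String,
    pub total: u64,
    pub used: u64,
    pub calls: Cell<u32>,
}

impl FakeDevice {
    /// Returns (total, used), or None when the device reports no memory.
    pub fn memory_info(&self) -> Option<(u64, u64)> {
        self.calls.set(self.calls.get() + 1);
        if self.total == 0 { None } else { Some((self.total, self.used)) }
    }
}

/// Assumed predicate: a UMA-named device whose NVML total is zero.
pub fn is_uma_device_with_mem(name: &str, memory_total: u64) -> bool {
    memory_total == 0 && name.to_lowercase().contains("gb10")
}

pub struct Sample {
    pub total: u64,
    pub used: u64,
    pub is_uma: bool,
}

/// One sampling tick: memory_info() is queried exactly once and the
/// cached result feeds both UMA detection and the reported values.
pub fn sample_device(dev: &FakeDevice, system_memory: (u64, u64)) -> Sample {
    let mem = dev.memory_info(); // single query per tick
    let nvml_total = mem.map(|(total, _)| total).unwrap_or(0);
    let is_uma = is_uma_device_with_mem(&dev.name, nvml_total);
    let (total, used) = if is_uma { system_memory } else { mem.unwrap_or((0, 0)) };
    Sample { total, used, is_uma }
}
```

The call counter exists only to make the single-query property observable in a test.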

Extract a pure `is_uma_device_name` function from the duplicated name-matching logic in `is_uma_device_with_mem` and the nvidia-smi fallback path, eliminating duplication and making the predicate directly testable.

Add 10 new unit tests covering:
- is_uma_device_name: gb10, dgx spark (case-insensitive), non-UMA GPUs, empty string
- read_meminfo_memory: successful parse from a temp file, missing file
- parse_meminfo_content: available > total saturating_sub, malformed lines
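A minimal sketch of what such a pure name predicate and its tests might look like. The signature `is_uma_device_name(name: &str) -> bool` is stated in this PR; the exact substrings matched ("gb10", "dgx spark") are inferred from the test descriptions and should be treated as assumptions.

```rust
/// Return true when the device name indicates a UMA (unified memory) GPU.
/// Matching is case-insensitive; the substrings are assumed from the
/// test descriptions in this PR.
pub fn is_uma_device_name(name: &str) -> bool {
    let lower = name.to_lowercase();
    lower.contains("gb10") || lower.contains("dgx spark")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_uma_names_case_insensitively() {
        assert!(is_uma_device_name("NVIDIA GB10"));
        assert!(is_uma_device_name("nvidia dgx SPARK"));
    }

    #[test]
    fn rejects_discrete_gpus_and_empty_names() {
        assert!(!is_uma_device_name("NVIDIA H100 80GB HBM3"));
        assert!(!is_uma_device_name(""));
    }
}
```

Because the function takes only a `&str` and touches no NVML state, it can be exercised on any machine, which is what makes the predicate directly testable.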

inureyes · 2026-04-08T02:35:42Z

PR Finalization Complete

Summary

Tests: Added 10 new unit tests (nvidia module now has 17 tests, up from 7). Coverage additions:
- is_uma_device_name: gb10/DGX Spark matching (case-insensitive), non-UMA GPU rejection, empty string
- read_meminfo_memory: successful parse from a temp file, graceful handling of a missing file path
- parse_meminfo_content: saturating_sub correctness when available > total, malformed line handling
Refactor: Extracted is_uma_device_name(name: &str) -> bool pure helper, eliminating the duplicated name-check logic between is_uma_device_with_mem and the nvidia-smi fallback path
Lint/Format: Collapsed nested if let + inner if into a single if let ... && ... per clippy collapsible_if rule; all other checks were already passing

All checks passing (245 tests, cargo clippy clean, cargo fmt clean). Ready for merge.

inureyes added the type:enhancement (New feature or request) and status:review (Under review) labels Apr 8, 2026

fix: correct swapped total/used memory destructuring for UMA devices

af86da0


inureyes added the priority:medium (Medium priority issue) and device:nvidia-gpu (NVIDIA GPU related) labels Apr 8, 2026

inureyes added 2 commits April 8, 2026 11:32

refactor: reduce redundant NVML calls for UMA detection

7bba117


inureyes added the status:done (Completed) label and removed the status:review (Under review) label Apr 8, 2026

inureyes merged commit 976ac04 into main Apr 8, 2026
2 checks passed

inureyes deleted the feature/issue-80-gb10-uma-support branch April 8, 2026 02:36

inureyes self-assigned this Apr 17, 2026

inureyes merged 5 commits into main from feature/issue-80-gb10-uma-support
