
feat: add DGX Spark (GB10) unified memory architecture support #146

Merged: inureyes merged 5 commits into main from feature/issue-80-gb10-uma-support on Apr 8, 2026
inureyes commented Apr 8, 2026

Summary

  • Detect UMA (Unified Memory Architecture) on NVIDIA GB10/Blackwell GPUs, where nvmlDeviceGetMemoryInfo() reports memory as unavailable
  • Fall back to system memory from /proc/meminfo when NVML memory reporting is unavailable on UMA devices
  • Suppress misleading PCIe generation/width metrics for UMA devices, which use an internal interconnect rather than PCIe
  • Add "Memory Type: Unified" and "Interconnect: Integrated" annotations in device details
  • Handle UMA detection in both NVML and nvidia-smi fallback code paths
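
As a rough illustration of the PCIe suppression and the two annotations listed above, here is a minimal sketch. The `GpuFacts` struct, the `build_device_detail` function, and the "PCIe Generation" / "PCIe Width" keys are hypothetical names chosen for this example; only the "Memory Type: Unified" and "Interconnect: Integrated" entries come from this PR description.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified facts gathered for one GPU.
struct GpuFacts {
    is_uma: bool,
    pcie_gen: Option<u32>,
    pcie_width: Option<u32>,
}

/// Builds the per-device detail map.
/// UMA devices get the two annotations and no PCIe entries;
/// discrete devices keep whatever PCIe data is available.
fn build_device_detail(facts: &GpuFacts) -> BTreeMap<String, String> {
    let mut detail = BTreeMap::new();
    if facts.is_uma {
        detail.insert("Memory Type".to_string(), "Unified".to_string());
        detail.insert("Interconnect".to_string(), "Integrated".to_string());
    } else {
        // Key names below are assumptions made for this sketch.
        if let Some(generation) = facts.pcie_gen {
            detail.insert("PCIe Generation".to_string(), generation.to_string());
        }
        if let Some(width) = facts.pcie_width {
            detail.insert("PCIe Width".to_string(), format!("x{}", width));
        }
    }
    detail
}
```

Keeping the UMA check in a single branch means any PCIe values the driver might still report for a UMA device never reach the detail map.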

Closes #80

Test plan

  • Build passes on GB10 (aarch64)
  • GPU memory reports system memory total (~128GB) instead of 0
  • Device detail shows "Memory Type: Unified"
  • PCIe metrics suppressed for UMA devices
  • No regression on discrete GPU systems
  • Unit tests pass for meminfo parsing and memory value parsing

Detect UMA on Blackwell-class GPUs where NVML memory_info() returns
unavailable, and fall back to system memory from /proc/meminfo.
Suppress misleading PCIe metrics for UMA devices and annotate device
details with memory type and interconnect information.

inureyes added the type:enhancement and status:review labels on Apr 8, 2026

get_system_memory_for_uma() returns (total, used) but the NVML code
path destructured it as (used, total), causing memory values to be
swapped on GB10 systems.

inureyes added the priority:medium and device:nvidia-gpu labels on Apr 8, 2026

The previous fix (af86da0) swapped the destructuring order to fix UMA
devices but inadvertently broke discrete GPUs by binding m.used to
total_memory and m.total to used_memory. Align the tuple elements with
the (total_memory, used_memory) destructuring.
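
The ordering issue described in the two commit messages above can be sketched as follows. `NvmlMemory`, `system_memory_for_uma`, and `memory_for_device` are hypothetical stand-ins written for this example, not the project's actual code; the point is only that both branches must yield `(total, used)` in the same order as the destructuring pattern.

```rust
/// Stand-in for an NVML memory reading (assumed shape for this sketch).
struct NvmlMemory {
    total: u64,
    used: u64,
}

/// Returns (total, used), the order the commit messages describe
/// for the system-memory fallback.
fn system_memory_for_uma(sys_total: u64, sys_available: u64) -> (u64, u64) {
    (sys_total, sys_total.saturating_sub(sys_available))
}

/// Both branches produce (total, used) so the destructuring is
/// correct for UMA and discrete devices alike.
fn memory_for_device(
    is_uma: bool,
    nvml: &NvmlMemory,
    sys_total: u64,
    sys_available: u64,
) -> (u64, u64) {
    let (total_memory, used_memory) = if is_uma {
        system_memory_for_uma(sys_total, sys_available)
    } else {
        (nvml.total, nvml.used)
    };
    (total_memory, used_memory)
}
```

A struct with named `total` and `used` fields in place of the tuple would make this class of swap harder to write, at the cost of a slightly larger change.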

inureyes commented Apr 8, 2026


Implementation Review Summary

Intent

Add UMA (Unified Memory Architecture) support for NVIDIA DGX Spark (GB10) / Blackwell GPUs, falling back to system memory when NVML memory reporting is unavailable and suppressing misleading PCIe metrics.

Findings Addressed

  • CRITICAL: Swapped total_memory/used_memory for discrete GPUs in NVML path -- The fix commit (af86da0) corrected UMA destructuring order (total_memory, used_memory) but left the else branch mapping m.used to total_memory and m.total to used_memory. This would have caused all discrete NVIDIA GPUs to display used memory as total and vice versa. Fixed in e511090.

Remaining Items

  • MEDIUM: Redundant NVML memory_info() calls -- is_uma_device() calls device.memory_info() once, then the non-UMA branch calls it two more times (one for total, one for used). A single call with destructuring would reduce 3 NVML round-trips to 1. Not blocking since NVML calls are fast, but worth a follow-up.
  • LOW: Redundant "Memory Type: Unified" insertion -- The detail map gets "Memory Type: Unified" inserted in both create_device_detail() (cached static info) and again in get_gpu_info_nvml() (per-refresh). The second insertion is harmless but unnecessary since the detail is cloned from the cached version which already contains it.
  • LOW: No architecture check in nvidia-smi fallback -- The nvidia-smi fallback path only matches by device name ("gb10", "dgx spark") and cannot check DeviceArchitecture::Blackwell. This is acceptable since the nvidia-smi fallback is inherently limited, but future Blackwell UMA products with different names would need name additions here.
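
For the MEDIUM item, one possible shape of the single-call version is sketched below. `FakeDevice` is a hypothetical stand-in that counts queries so the round-trip count can be checked; the body of `is_uma_device_with_mem` is an assumption (name match plus a missing or zero total), since the real detection logic is not shown in this conversation.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq)]
struct MemoryInfo {
    total: u64,
    used: u64,
}

/// Hypothetical device handle that counts memory_info() queries.
struct FakeDevice {
    name: String,
    info: Option<MemoryInfo>,
    calls: Cell<u32>,
}

impl FakeDevice {
    fn new(name: &str, info: Option<MemoryInfo>) -> Self {
        FakeDevice { name: name.to_string(), info, calls: Cell::new(0) }
    }

    /// Simulates one driver round-trip per call.
    fn memory_info(&self) -> Option<MemoryInfo> {
        self.calls.set(self.calls.get() + 1);
        self.info
    }

    fn calls(&self) -> u32 {
        self.calls.get()
    }
}

/// Assumed predicate: a UMA-style name and no usable memory total.
fn is_uma_device_with_mem(name: &str, memory_total: Option<u64>) -> bool {
    let lower = name.to_ascii_lowercase();
    let name_matches = lower.contains("gb10") || lower.contains("dgx spark");
    name_matches && memory_total.map_or(true, |total| total == 0)
}

/// Queries the device once and reuses the reading for both the
/// UMA check and the (total, used) result.
fn sample_memory(device: &FakeDevice, system: (u64, u64)) -> (u64, u64) {
    let reading = device.memory_info();
    if is_uma_device_with_mem(&device.name, reading.map(|m| m.total)) {
        system
    } else {
        match reading {
            Some(m) => (m.total, m.used),
            None => (0, 0),
        }
    }
}
```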

Verification

  • All stated requirements implemented
  • No placeholder/mock code remaining
  • Integrated into project code flow (both NVML and nvidia-smi fallback paths)
  • Project conventions followed
  • Existing modules reused where applicable (follows Jetson's /proc/meminfo pattern)
  • No unintended structural changes
  • Tests pass (cargo test: 19 passed, cargo clippy: clean)

inureyes added 2 commits April 8, 2026 11:32
Call memory_info() once per device per tick instead of 3 times.
Cache the result and pass memory_total to is_uma_device_with_mem().
Remove redundant Memory Type insertion from hot path (already cached
in device detail).

Extract a pure `is_uma_device_name` function from the duplicated
name-matching logic in `is_uma_device_with_mem` and the nvidia-smi
fallback path, eliminating duplication and making the predicate
directly testable.

Add 10 new unit tests covering:
- is_uma_device_name: gb10, dgx spark (case-insensitive), non-UMA GPUs,
  empty string
- read_meminfo_memory: successful parse from a temp file, missing file
- parse_meminfo_content: available > total saturating_sub, malformed lines
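
A pure name predicate of the kind this commit describes might look like the sketch below. The signature matches the one quoted later in this conversation and the two substrings are the ones named in the review; the body itself is an assumption.

```rust
/// Returns true when the device name looks like a known UMA product.
/// Matching is case-insensitive and substring-based; the substrings
/// "gb10" and "dgx spark" are the ones mentioned in the review.
fn is_uma_device_name(name: &str) -> bool {
    let lower = name.to_ascii_lowercase();
    lower.contains("gb10") || lower.contains("dgx spark")
}
```

Because the function takes only a `&str`, it can be exercised without an NVML handle or an nvidia-smi process, which is what makes the listed test cases cheap to write.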

inureyes commented Apr 8, 2026


PR Finalization Complete

Summary

  • Tests: Added 10 new unit tests (nvidia module now has 17 tests, up from 7). Coverage additions:
    • is_uma_device_name: gb10/DGX Spark matching (case-insensitive), non-UMA GPU rejection, empty string
    • read_meminfo_memory: successful parse from a temp file, graceful handling of a missing file path
    • parse_meminfo_content: saturating_sub correctness when available > total, malformed line handling
  • Refactor: Extracted is_uma_device_name(name: &str) -> bool pure helper, eliminating the duplicated name-check logic between is_uma_device_with_mem and the nvidia-smi fallback path
  • Lint/Format: Collapsed nested if let + inner if into a single if let ... && ... per clippy collapsible_if rule; all other checks were already passing
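
To make the `parse_meminfo_content` cases above concrete, here is a minimal sketch of such a parser. The return type (`Option<(u64, u64)>` as total and used bytes) and the use of `MemTotal` and `MemAvailable` are assumptions for this example; the project's real signature is not shown in this conversation.

```rust
/// Parses /proc/meminfo-style text and returns (total_bytes, used_bytes).
/// Used is computed as total minus available with saturating_sub, so an
/// inconsistent reading where available > total yields 0 rather than
/// underflowing. Malformed lines are skipped.
fn parse_meminfo_content(content: &str) -> Option<(u64, u64)> {
    let mut total_kb: Option<u64> = None;
    let mut available_kb: Option<u64> = None;

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        let key = match parts.next() {
            Some(key) => key,
            None => continue, // blank line
        };
        let value = match parts.next().and_then(|v| v.parse::<u64>().ok()) {
            Some(value) => value,
            None => continue, // missing or non-numeric value
        };
        match key {
            "MemTotal:" => total_kb = Some(value),
            "MemAvailable:" => available_kb = Some(value),
            _ => {}
        }
    }

    // /proc/meminfo reports these fields in kB.
    let total = total_kb?.saturating_mul(1024);
    let available = available_kb?.saturating_mul(1024);
    Some((total, total.saturating_sub(available)))
}
```

Splitting the parsing from the file read keeps the edge cases testable with plain strings, while a thin `read_meminfo_memory` wrapper handles the missing-file case.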

All checks passing (245 tests, cargo clippy clean, cargo fmt clean). Ready for merge.

inureyes added the status:done label and removed the status:review label on Apr 8, 2026
inureyes merged commit 976ac04 into main on Apr 8, 2026 (2 checks passed)
inureyes deleted the feature/issue-80-gb10-uma-support branch on April 8, 2026 02:36
inureyes self-assigned this on Apr 17, 2026

Labels: device:nvidia-gpu, priority:medium, status:done, type:enhancement

Linked issue: feat: Add support for DGX Spark (GB10) and Unified Memory Architecture NVIDIA GPUs