feat: Add support for DGX Spark (GB10) and Unified Memory Architecture NVIDIA GPUs #80

Description

@inureyes

Problem / Background

NVIDIA has released DGX Spark, a desktop AI system based on the GB10 Grace Blackwell chip. This system uses Unified Memory Architecture (UMA) where CPU and GPU share the same physical memory, which is fundamentally different from traditional discrete GPUs with dedicated VRAM.

Current all-smi NVIDIA GPU monitoring assumes discrete GPUs with:

  • Dedicated GPU memory (VRAM) separate from system RAM
  • device.memory_info() returning GPU-specific memory metrics
  • Clear distinction between used_memory and total_memory for the GPU

On UMA systems like DGX Spark:

  • CPU and GPU share the same physical memory pool
  • Traditional memory reporting concepts may not apply directly
  • NVML may report memory differently or require different API calls
  • Memory usage attribution between CPU and GPU workloads may differ

Affected Products

  • NVIDIA DGX Spark (GB10 Grace Blackwell)
  • Future Grace-based products with unified memory
  • Similar architectures that NVIDIA may release

Verified on Real Hardware (2026-04-08)

Tested on DGX Spark (GB10), Driver 580.126.09, CUDA 13.0, Linux 6.17.0-1008-nvidia (aarch64).

What Works

| Metric | Value | Status |
|---|---|---|
| GPU Detection | NVIDIA GB10 | OK |
| Architecture | Blackwell | OK |
| GPU Utilization | 0% (idle) | OK |
| Temperature | 56°C | OK |
| Power Draw | ~5W | OK |
| GPU Frequency | 208 MHz | OK |
| System Memory | 121.7 GB / 130.7 GB | OK (via system memory reader) |

What's Broken

| Metric | Reported | Expected | Root Cause |
|---|---|---|---|
| GPU Memory Used | 0 bytes | Should reflect shared memory usage | NVML memory_info() returns [N/A] |
| GPU Memory Total | 0 bytes | ~128 GB (unified with system) | NVML memory_info() returns [N/A] |
| Power Limit | N/A | N/A | Not available on GB10 |
| PCIe Gen/Width | Gen 1 x1 (max x16) | N/A | Misleading: GB10 uses an internal interconnect, not PCIe |
| Brand | NvidiaRTX | DGX Spark or similar | Does not distinguish UMA architecture |

Key Findings

  1. Aggregate GPU memory is unavailable from NVML on GB10: nvidia-smi --query-gpu=memory.total prints [N/A], and device.memory_info() in nvml-wrapper returns an error. The code at src/device/readers/nvidia.rs:184-185 then falls back to 0 via unwrap_or(0).

  2. nvidia-smi shows per-process GPU memory (e.g., Xorg 18MiB, gnome-shell 6MiB) even though aggregate memory queries fail — suggesting process-level memory attribution may still work via NVML.

  3. System memory reader works correctly — reports 130.7 GB total, which IS the unified memory pool. The information exists but is not associated with the GPU.

  4. PCIe reporting is misleading — GB10 uses an internal interconnect, so reporting "Gen 1 x1 (max x16)" is confusing and inaccurate.

Proposed Solution

Priority 1: UMA Memory Fallback

  1. Detect UMA architecture: when memory_info() yields no usable total (the current code falls back to 0) AND the device is GB10/Blackwell, flag it as UMA
  2. Use system memory as GPU memory: fall back to the /proc/meminfo total as total_memory, similar to the Jetson approach in nvidia_jetson.rs
  3. Reference: src/device/readers/nvidia_jetson.rs:145 already does this for Jetson
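
The detection and fallback steps above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `GpuMemory`, `is_uma_device`, and `resolve_memory` are hypothetical names, the "GB10" name check is an assumed heuristic, and the caller is assumed to pass NVML's result as `None` when `memory_info()` fails.

```rust
/// Memory totals for one GPU, in bytes (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub struct GpuMemory {
    pub total: u64,
    pub used: u64,
    /// True when the values were taken from the system memory pool.
    pub from_system: bool,
}

/// Assumed heuristic: treat a device as UMA when NVML reported no usable
/// memory total (None or zero) and the device name contains "GB10".
pub fn is_uma_device(name: &str, nvml_memory: Option<(u64, u64)>) -> bool {
    let nvml_has_memory = matches!(nvml_memory, Some((total, _)) if total > 0);
    !nvml_has_memory && name.contains("GB10")
}

/// Prefer NVML's (total, used) pair; fall back to the system (total, used)
/// pair only for devices flagged as UMA, so discrete GPUs are unaffected.
pub fn resolve_memory(
    name: &str,
    nvml_memory: Option<(u64, u64)>,
    system: (u64, u64),
) -> GpuMemory {
    if is_uma_device(name, nvml_memory) {
        GpuMemory { total: system.0, used: system.1, from_system: true }
    } else {
        // Keeps today's behavior for non-UMA devices: a failed query reads as 0.
        let (total, used) = nvml_memory.unwrap_or((0, 0));
        GpuMemory { total, used, from_system: false }
    }
}
```

Keeping the decision in a pure function like this would make the "no regression for discrete GPUs" criterion easy to cover with unit tests, since no NVML handle is needed.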

Priority 2: Suppress Misleading Metrics

  1. PCIe metrics — suppress or annotate PCIe Gen/Width for UMA devices (not meaningful)
  2. Power Limit: handle N/A gracefully (the current unwrap_or fallback works, but the missing value should be represented explicitly)
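
One possible shape for these two changes, as a sketch under stated assumptions: `pcie_label` and `power_limit_label` are hypothetical helper names, the UMA flag is assumed to be computed elsewhere, and the power limit is assumed to arrive in milliwatts as an `Option`.

```rust
/// Format the PCIe link for display, or return None when the device is UMA
/// and the link information is not meaningful.
pub fn pcie_label(is_uma: bool, generation: u32, width: u32, max_width: u32) -> Option<String> {
    if is_uma {
        None
    } else {
        Some(format!("Gen {generation} x{width} (max x{max_width})"))
    }
}

/// Format an optional power limit given in milliwatts.
/// A missing limit is shown as "N/A" instead of a misleading 0.
pub fn power_limit_label(limit_mw: Option<u32>) -> String {
    match limit_mw {
        Some(mw) => format!("{:.0} W", f64::from(mw) / 1000.0),
        None => "N/A".to_string(),
    }
}
```

Returning `Option<String>` for the PCIe label leaves the choice between hiding the field and annotating it to the display layer.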

Priority 3: UMA Identification

  1. Add memory type indicator — "Unified" vs "Discrete" in device detail
  2. Fix brand detection — identify DGX Spark properly instead of NvidiaRTX

Acceptance Criteria

  • Document NVML behavior on DGX Spark / GB10 systems (see "Verified on Real Hardware" above)
  • Implement detection for UMA-based NVIDIA GPUs
  • Memory metrics are reported accurately and meaningfully for UMA systems
  • Device details include memory architecture type (Discrete/Unified)
  • Suppress or annotate misleading PCIe metrics for UMA devices
  • No regression in existing discrete GPU support
  • Unit tests cover UMA detection and reporting logic

Technical Considerations

Confirmed NVML Behavior on GB10

$ nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
memory.total [MiB], memory.used [MiB], memory.free [MiB]
[N/A], [N/A], [N/A]

$ nvidia-smi --query-gpu=name,compute_mode,temperature.gpu,power.draw,utilization.gpu --format=csv
NVIDIA GB10, Default, 56, 5.05 W, 0 %
  • nvmlDeviceGetMemoryInfo() → not available; nvidia-smi prints [N/A] (nvml-wrapper returns an error, which the code turns into 0 via unwrap_or(0))
  • nvmlDeviceGetArchitecture() → Blackwell
  • nvmlDeviceGetBrand() → NvidiaRTX (incorrect for DGX Spark)
  • nvmlDeviceGetPciInfo() → Gen 1 x1 (not meaningful for UMA)
  • Temperature, power, utilization, frequency → all work correctly
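
To show why `[N/A]` should not silently become 0, here is a small sketch that parses the memory columns of the CSV output above into `Option` values. `parse_mib_field` and `parse_memory_row` are hypothetical helpers; all-smi reads NVML through nvml-wrapper and does not parse nvidia-smi output.

```rust
/// Parse one nvidia-smi CSV memory field such as "18 MiB" into a MiB count.
/// Returns None for "[N/A]" or any other non-numeric value.
pub fn parse_mib_field(field: &str) -> Option<u64> {
    let value = field.trim();
    // Drop the unit suffix if present, then parse what is left.
    let number = value.strip_suffix("MiB").unwrap_or(value).trim();
    number.parse().ok()
}

/// Parse one CSV data row into one Option per column.
pub fn parse_memory_row(row: &str) -> Vec<Option<u64>> {
    row.split(',').map(parse_mib_field).collect()
}
```

With this representation the GB10 row yields three `None` values, which a caller can tell apart from a GPU that really reports 0 MiB.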

Architecture Reference

Current relevant implementations:

  • src/device/readers/nvidia.rs:184-185: where the failed memory_info() call falls back to 0 on UMA
  • src/device/readers/nvidia_jetson.rs: Jetson reader handling integrated GPU with shared memory (uses /proc/meminfo fallback)
  • src/device/memory_linux.rs: System memory reader (already reports the unified pool correctly, 130.7 GB total on the test system)

Potential New Fields in GpuInfo

// Consider adding to device detail or as new fields
memory_type: Option<String>,  // "Discrete", "Unified", "Shared"
unified_memory_total: Option<u64>,  // Total unified memory pool
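
As an alternative to a free-form string, the memory type could be modeled as an enum. The sketch below is only a suggestion: `MemoryType` and `GpuMemoryDetail` are hypothetical names, and the existing `GpuInfo` struct is not reproduced here.

```rust
use std::fmt;

/// Memory architecture of a GPU (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryType {
    Discrete,
    Unified,
    Shared,
}

impl fmt::Display for MemoryType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let label = match self {
            MemoryType::Discrete => "Discrete",
            MemoryType::Unified => "Unified",
            MemoryType::Shared => "Shared",
        };
        f.write_str(label)
    }
}

/// Stand-in for the proposed new fields; both stay None when unknown.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct GpuMemoryDetail {
    pub memory_type: Option<MemoryType>,
    /// Total unified memory pool in bytes.
    pub unified_memory_total: Option<u64>,
}
```

An enum keeps the set of values closed and lets the compiler check every match on it, while `Display` still produces the same "Discrete" / "Unified" / "Shared" labels for the device detail view.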

Graceful Degradation

If NVML on UMA systems doesn't provide expected metrics:

  1. Fall back to system memory reporting (similar to Jetson approach)
  2. Use /proc/meminfo for unified memory systems
  3. Log warnings for unsupported queries
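
A minimal sketch of the /proc/meminfo step, assuming the usual `Key:   value kB` line format and taking the file contents as a string so it can be tested without the file. `meminfo_bytes` and `unified_memory` are hypothetical names, and computing used memory as MemTotal minus MemAvailable is an assumption about what "used" should mean on a unified pool.

```rust
/// Look up one field of /proc/meminfo (for example "MemTotal") and return
/// its value in bytes. The "kB" unit in that file means 1024 bytes.
pub fn meminfo_bytes(contents: &str, key: &str) -> Option<u64> {
    contents.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != key {
            return None;
        }
        let kib: u64 = rest.trim().strip_suffix("kB")?.trim().parse().ok()?;
        Some(kib * 1024)
    })
}

/// Derive (total, used) in bytes for the unified pool.
/// Returns None when either field is missing, so the caller can log a
/// warning and keep the metric unset instead of reporting 0.
pub fn unified_memory(contents: &str) -> Option<(u64, u64)> {
    let total = meminfo_bytes(contents, "MemTotal")?;
    let available = meminfo_bytes(contents, "MemAvailable")?;
    Some((total, total.saturating_sub(available)))
}
```

In practice the contents would come from something like `std::fs::read_to_string("/proc/meminfo")`, with a read error handled the same way as a missing field.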

Additional Context

Related Implementations

  • NVIDIA Jetson (nvidia_jetson.rs): Uses tegrastats and system memory fallback for integrated GPU
  • Apple Silicon (apple.rs): Unified memory architecture with shared CPU/GPU memory pool
