feat: Add support for DGX Spark (GB10) and Unified Memory Architecture NVIDIA GPUs

## Problem / Background

NVIDIA has released DGX Spark, a desktop AI system based on the GB10 Grace Blackwell chip. This system uses **Unified Memory Architecture (UMA)** where CPU and GPU share the same physical memory, which is fundamentally different from traditional discrete GPUs with dedicated VRAM.

Current `all-smi` NVIDIA GPU monitoring assumes discrete GPUs with:
- Dedicated GPU memory (VRAM) separate from system RAM
- `device.memory_info()` returning GPU-specific memory metrics
- Clear distinction between `used_memory` and `total_memory` for the GPU

On UMA systems like DGX Spark:
- CPU and GPU share the same physical memory pool
- Dedicated-VRAM metrics such as `used_memory` and `total_memory` have no direct equivalent
- NVML does not report aggregate GPU memory on GB10 (see "Verified on Real Hardware" below)
- Memory usage attribution between CPU and GPU workloads may differ

### Affected Products
- **NVIDIA DGX Spark** (GB10 Grace Blackwell)
- **Future Grace-based products** with unified memory
- Similar architectures that NVIDIA may release

## Verified on Real Hardware (2026-04-08)

Tested on DGX Spark (GB10), Driver 580.126.09, CUDA 13.0, Linux 6.17.0-1008-nvidia (aarch64).

### What Works

| Metric | Value | Status |
|---|---|---|
| GPU Detection | `NVIDIA GB10` | OK |
| Architecture | `Blackwell` | OK |
| GPU Utilization | 0% (idle) | OK |
| Temperature | 56°C | OK |
| Power Draw | ~5W | OK |
| GPU Frequency | 208 MHz | OK |
| System Memory | 121.7 GB / 130.7 GB | OK (via system memory reader) |

### What's Broken

| Metric | Reported | Expected | Root Cause |
|---|---|---|---|
| GPU Memory Used | **0 bytes** | Should reflect shared memory usage | NVML `memory_info()` returns `[N/A]` |
| GPU Memory Total | **0 bytes** | ~128 GB (unified with system) | NVML `memory_info()` returns `[N/A]` |
| Power Limit | N/A | N/A | Not available on GB10 |
| PCIe Gen/Width | Gen 1 x1 (max x16) | N/A | Misleading — GB10 uses internal interconnect, not PCIe |
| Brand | `NvidiaRTX` | DGX Spark or similar | Does not distinguish UMA architecture |

### Key Findings

1. **NVML `nvmlDeviceGetMemoryInfo()` reports no data on GB10.** `nvidia-smi --query-gpu=memory.total` prints `[N/A]`, and `device.memory_info()` in nvml-wrapper returns an error. The code at `src/device/readers/nvidia.rs:184-185` then falls back to 0 via `unwrap_or(0)`, which is why both memory values show as 0 bytes.

2. **`nvidia-smi` shows per-process GPU memory** (e.g., Xorg 18MiB, gnome-shell 6MiB) even though aggregate memory queries fail — suggesting process-level memory attribution may still work via NVML.

3. **System memory reader works correctly** — reports 130.7 GB total, which IS the unified memory pool. The information exists but is not associated with the GPU.

4. **PCIe reporting is misleading** — GB10 uses an internal interconnect, so reporting "Gen 1 x1 (max x16)" is confusing and inaccurate.
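
Finding 2 suggests one possible source for a "GPU memory used" figure: summing the per-process values. The sketch below assumes the per-process numbers have already been collected; `ProcessMemory` and `aggregate_used_bytes` are hypothetical names, not existing `all-smi` or nvml-wrapper APIs. A sum of per-process usage is only a lower bound on what the GPU is using, so treat it as an approximation.

```rust
/// One entry of a per-process GPU memory listing.
/// Hypothetical type for illustration; `used_bytes` is `None` when the
/// driver does not report a value for that process.
struct ProcessMemory {
    pid: u32,
    used_bytes: Option<u64>,
}

/// Sums the per-process usage that is available.
/// Returns `None` when no process reported a value, so callers can tell
/// "nothing reported" apart from "0 bytes in use".
fn aggregate_used_bytes(processes: &[ProcessMemory]) -> Option<u64> {
    let mut any_reported = false;
    let mut sum = 0u64;
    for p in processes {
        if let Some(bytes) = p.used_bytes {
            any_reported = true;
            sum = sum.saturating_add(bytes);
        }
    }
    any_reported.then_some(sum)
}
```

With the two processes mentioned above (18 MiB and 6 MiB), this would yield 24 MiB.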

## Proposed Solution

### Priority 1: UMA Memory Fallback
1. **Detect UMA architecture**: when `memory_info()` fails (so the total falls back to 0) AND the device is GB10/Blackwell, flag it as UMA
2. **Use system memory as GPU memory**: fall back to the `/proc/meminfo` total as `total_memory`, similar to the Jetson approach in `nvidia_jetson.rs`
3. **Reference**: `src/device/readers/nvidia_jetson.rs:145` already does this for Jetson
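
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the actual reader code: `is_uma_candidate`, `meminfo_bytes`, and `total_memory_bytes` are hypothetical helpers, and the device-name list is an assumption that would need to grow as more UMA parts appear. The name check matters because discrete Blackwell GPUs also exist, so architecture alone is not enough to flag UMA.

```rust
/// Returns true when a device looks like a UMA part: NVML gave no usable
/// total (error or 0) and the name matches a known UMA chip.
/// The name list is an assumption for illustration, not an exhaustive table.
fn is_uma_candidate(device_name: &str, nvml_total_bytes: Option<u64>) -> bool {
    const KNOWN_UMA_NAMES: [&str; 1] = ["GB10"];
    let no_total = nvml_total_bytes.map_or(true, |t| t == 0);
    no_total && KNOWN_UMA_NAMES.iter().any(|n| device_name.contains(n))
}

/// Extracts a field such as "MemTotal" from /proc/meminfo text, in bytes.
/// /proc/meminfo lines look like "MemTotal:   16384 kB", where kB means KiB.
fn meminfo_bytes(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        let kib: u64 = rest.split_whitespace().next()?.parse().ok()?;
        Some(kib * 1024)
    })
}

/// Picks the total to display: the system total on UMA candidates,
/// otherwise whatever NVML reported (0 if it reported nothing).
fn total_memory_bytes(device_name: &str, nvml_total: Option<u64>, meminfo: &str) -> u64 {
    if is_uma_candidate(device_name, nvml_total) {
        meminfo_bytes(meminfo, "MemTotal").unwrap_or(0)
    } else {
        nvml_total.unwrap_or(0)
    }
}
```

A caller would read `/proc/meminfo` with `std::fs::read_to_string` and pass the contents in, which keeps the parsing testable without touching the filesystem.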

### Priority 2: Suppress Misleading Metrics
1. **PCIe metrics**: suppress or annotate PCIe Gen/Width for UMA devices, where they are not meaningful
2. **Power Limit**: handle N/A gracefully (already covered by `unwrap_or`, but the missing-value case should be handled explicitly)
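
One way to make both cases explicit at the display layer is sketched below. `PcieLink`, `format_pcie`, and `format_power_limit` are hypothetical names, and the label strings are placeholders; the point is only that "not applicable" and "not reported" become visible states instead of defaulted numbers.

```rust
/// PCIe link state as a reader might collect it (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct PcieLink {
    generation: u32,
    width: u32,
    max_width: u32,
}

/// Formats the PCIe column. UMA devices get an annotation instead of the
/// raw link values, since the link is not how the GPU reaches memory.
fn format_pcie(link: Option<PcieLink>, is_uma: bool) -> String {
    match (is_uma, link) {
        (true, _) => "N/A (internal interconnect)".to_string(),
        (false, Some(l)) => {
            format!("Gen {} x{} (max x{})", l.generation, l.width, l.max_width)
        }
        (false, None) => "N/A".to_string(),
    }
}

/// Formats the power limit, keeping a missing value as "N/A" rather than
/// letting it collapse to 0 W.
fn format_power_limit(limit_watts: Option<f64>) -> String {
    match limit_watts {
        Some(w) => format!("{w:.0} W"),
        None => "N/A".to_string(),
    }
}
```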

### Priority 3: UMA Identification
1. **Add memory type indicator** — "Unified" vs "Discrete" in device detail
2. **Fix brand detection** — identify DGX Spark properly instead of `NvidiaRTX`

## Acceptance Criteria

- [x] Document NVML behavior on DGX Spark / GB10 systems (see "Verified on Real Hardware" above)
- [ ] Implement detection for UMA-based NVIDIA GPUs
- [ ] Memory metrics are reported accurately and meaningfully for UMA systems
- [ ] Device details include memory architecture type (Discrete/Unified)
- [ ] Suppress or annotate misleading PCIe metrics for UMA devices
- [ ] No regression in existing discrete GPU support
- [ ] Unit tests cover UMA detection and reporting logic

## Technical Considerations

### Confirmed NVML Behavior on GB10

```
$ nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
memory.total [MiB], memory.used [MiB], memory.free [MiB]
[N/A], [N/A], [N/A]

$ nvidia-smi --query-gpu=name,compute_mode,temperature.gpu,power.draw,utilization.gpu --format=csv
NVIDIA GB10, Default, 56, 5.05 W, 0 %
```

- `nvmlDeviceGetMemoryInfo()` → no data (`nvidia-smi` prints `[N/A]`; nvml-wrapper returns an error, which the reader turns into 0 via `unwrap_or(0)`)
- `nvmlDeviceGetArchitecture()` → `Blackwell`
- `nvmlDeviceGetBrand()` → `NvidiaRTX` (incorrect for DGX Spark)
- `nvmlDeviceGetPciInfo()` → Gen 1 x1 (not meaningful for UMA)
- Temperature, power, utilization, frequency → all work correctly

### Architecture Reference
Current relevant implementations:
- `src/device/readers/nvidia.rs:184-185`: where a failed `memory_info()` call falls back to 0 via `unwrap_or(0)` on UMA devices
- `src/device/readers/nvidia_jetson.rs`: Jetson reader handling an integrated GPU with shared memory (uses a `/proc/meminfo` fallback)
- `src/device/memory_linux.rs`: system memory reader (already reports the correct ~128 GB unified pool)

### Potential New Fields in `GpuInfo`
```rust
// Consider adding to device detail or as new fields
memory_type: Option<String>,  // "Discrete", "Unified", "Shared"
unified_memory_total: Option<u64>,  // Total unified memory pool
```
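
As an alternative to a free-form `String`, the memory type could be modeled as an enum so that call sites cannot misspell a variant. This is only a sketch of that design option: `MemoryType`, `GpuMemoryFields`, and `memory_type_label` are hypothetical names and are not part of the current `GpuInfo`.

```rust
/// Memory architecture of a device (hypothetical enum mirroring the
/// "Discrete", "Unified", "Shared" strings proposed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryType {
    Discrete,
    Unified,
    Shared,
}

impl std::fmt::Display for MemoryType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let label = match self {
            MemoryType::Discrete => "Discrete",
            MemoryType::Unified => "Unified",
            MemoryType::Shared => "Shared",
        };
        f.write_str(label)
    }
}

/// Stand-in for the two proposed fields; the real `GpuInfo` has many more.
struct GpuMemoryFields {
    memory_type: Option<MemoryType>,
    unified_memory_total: Option<u64>, // bytes
}

/// Label for the device detail view; "Unknown" when detection did not run.
fn memory_type_label(fields: &GpuMemoryFields) -> String {
    fields
        .memory_type
        .map_or_else(|| "Unknown".to_string(), |t| t.to_string())
}
```

If the struct is serialized for the API, the `Display` strings can still be emitted, so consumers would see the same "Discrete"/"Unified"/"Shared" values.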

### Graceful Degradation
If NVML on UMA systems doesn't provide expected metrics:
1. Fall back to system memory reporting (similar to Jetson approach)
2. Use `/proc/meminfo` for unified memory systems
3. Log warnings for unsupported queries
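
The ordering above could look roughly like this. The sketch assumes the NVML result and the system reading are obtained elsewhere and passed in as `(total, used)` byte pairs; `MemorySource`, `MemoryReading`, and `resolve_memory` are hypothetical names, and warnings are collected into a `Vec` here only to keep the example free of any logging dependency.

```rust
/// Where a memory reading came from, so the UI can annotate it.
#[derive(Debug, PartialEq)]
enum MemorySource {
    Nvml,
    SystemFallback,
    Unavailable,
}

#[derive(Debug, PartialEq)]
struct MemoryReading {
    total: u64,
    used: u64,
    source: MemorySource,
}

/// Prefers NVML, then the system reading, then an explicit "unavailable".
/// `nvml` and `system` carry (total, used) in bytes.
fn resolve_memory(
    nvml: Result<(u64, u64), String>,
    system: Option<(u64, u64)>,
    warnings: &mut Vec<String>,
) -> MemoryReading {
    match nvml {
        // A non-zero total from NVML is trusted as-is.
        Ok((total, used)) if total > 0 => MemoryReading { total, used, source: MemorySource::Nvml },
        other => {
            if let Err(e) = &other {
                warnings.push(format!("NVML memory query failed: {e}"));
            }
            match system {
                Some((total, used)) => {
                    MemoryReading { total, used, source: MemorySource::SystemFallback }
                }
                None => {
                    warnings.push("no system memory reading available".to_string());
                    MemoryReading { total: 0, used: 0, source: MemorySource::Unavailable }
                }
            }
        }
    }
}
```

Carrying the source alongside the numbers lets the display layer mark fallback values instead of presenting them as NVML data.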

## Additional Context

### Related Implementations
- **NVIDIA Jetson** (`nvidia_jetson.rs`): Uses tegrastats and system memory fallback for integrated GPU
- **Apple Silicon** (`apple.rs`): Unified memory architecture with shared CPU/GPU memory pool

### References
- [NVIDIA DGX Spark Announcement](https://www.nvidia.com/en-us/autonomous-machines/dgx-spark/)
- [NVIDIA Grace Blackwell Architecture](https://www.nvidia.com/en-us/data-center/grace-blackwell/)
- [NVML Documentation](https://docs.nvidia.com/deploy/nvml-api/)
