feat: add DGX Spark (GB10) unified memory architecture support #146
Merged
Detect UMA on Blackwell-class GPUs where NVML memory_info() returns unavailable, and fall back to system memory from /proc/meminfo. Suppress misleading PCIe metrics for UMA devices and annotate device details with memory type and interconnect information.
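A minimal sketch of the `/proc/meminfo` fallback described above. The name `parse_meminfo_content` appears later in this PR, but the signature, the `Option` return type, and the "used = MemTotal - MemAvailable" rule are assumptions made for illustration, not the merged code. Values in `/proc/meminfo` are reported in kB, so the sketch converts them to bytes.

```rust
/// Parse the text of /proc/meminfo and return (total_bytes, used_bytes).
/// Assumed signature: the real function in this PR may differ.
fn parse_meminfo_content(content: &str) -> Option<(u64, u64)> {
    let mut total_kib: Option<u64> = None;
    let mut available_kib: Option<u64> = None;

    for line in content.lines() {
        let mut parts = line.split_whitespace();
        // Skip blank lines and lines without a "Key: value" shape.
        let (Some(key), Some(value)) = (parts.next(), parts.next()) else {
            continue;
        };
        // Skip lines whose value is not a number.
        let Ok(value) = value.parse::<u64>() else {
            continue;
        };
        match key {
            "MemTotal:" => total_kib = Some(value),
            "MemAvailable:" => available_kib = Some(value),
            _ => {}
        }
    }

    let total = total_kib?.saturating_mul(1024);
    let available = available_kib?.saturating_mul(1024);
    // saturating_sub keeps "used" at 0 if available ever exceeds total.
    Some((total, total.saturating_sub(available)))
}
```

Returning `(total, used)` in that order matches the order the follow-up commits say `get_system_memory_for_uma()` uses.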
get_system_memory_for_uma() returns (total, used) but the NVML code path destructured it as (used, total), causing memory values to be swapped on GB10 systems.
The previous fix (af86da0) swapped the destructuring order to fix UMA devices but inadvertently broke discrete GPUs by binding m.used to total_memory and m.total to used_memory. Align the tuple elements with the (total_memory, used_memory) destructuring.
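The ordering problem in these two commits can be sketched as follows. `NvmlMemory` and `resolve_memory` are hypothetical stand-ins, not names from the codebase; the point is only that both match arms must yield `(total, used)` in the same order as the destructuring on the left.

```rust
/// Hypothetical stand-in for the memory info returned by NVML.
struct NvmlMemory {
    total: u64,
    used: u64,
}

/// Pick NVML values for discrete GPUs, or the system values for UMA devices.
/// `system` is assumed to already be ordered as (total, used).
fn resolve_memory(nvml: Option<NvmlMemory>, system: (u64, u64)) -> (u64, u64) {
    let (total_memory, used_memory) = match nvml {
        // Discrete GPU: keep the tuple in (total, used) order.
        Some(m) => (m.total, m.used),
        // UMA device: the fallback is passed through unchanged.
        None => system,
    };
    (total_memory, used_memory)
}
```

A struct with named fields in place of the bare tuple would rule out this class of swap entirely, at the cost of a slightly larger change.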
Implementation Review Summary
Intent
Findings Addressed
Remaining Items
Verification
Call memory_info() once per device per tick instead of 3 times. Cache the result and pass memory_total to is_uma_device_with_mem(). Remove redundant Memory Type insertion from hot path (already cached in device detail).
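A rough sketch of the "query once, reuse the result" pattern this commit describes. `MemorySample`, `TickMetrics`, and `collect_tick` are hypothetical names, and the UMA rule used here (name matches and NVML reports no total) is a guess at what `is_uma_device_with_mem()` checks, not its actual logic.

```rust
/// Hypothetical cached result of one memory_info() call.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MemorySample {
    total: u64,
    used: u64,
}

#[derive(Debug, PartialEq)]
struct TickMetrics {
    total_memory: u64,
    used_memory: u64,
    is_uma: bool,
}

/// Assumed rule: a UMA-style name plus no usable total from NVML.
fn is_uma_device_with_mem(name: &str, memory_total: Option<u64>) -> bool {
    let lower = name.to_lowercase();
    let name_matches = lower.contains("gb10") || lower.contains("dgx spark");
    name_matches && memory_total.is_none()
}

/// Collect one tick of metrics. Taking the query as `FnOnce` means the
/// compiler rejects any second call within the tick.
fn collect_tick<F>(name: &str, query_memory: F, system_memory: (u64, u64)) -> TickMetrics
where
    F: FnOnce() -> Option<MemorySample>,
{
    let sample = query_memory(); // the only memory query this tick
    let is_uma = is_uma_device_with_mem(name, sample.map(|s| s.total));
    let (total_memory, used_memory) = match sample {
        Some(s) => (s.total, s.used),
        None if is_uma => system_memory, // (total, used) from /proc/meminfo
        None => (0, 0),
    };
    TickMetrics { total_memory, used_memory, is_uma }
}
```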
Extract a pure `is_uma_device_name` function from the duplicated name-matching logic in `is_uma_device_with_mem` and the nvidia-smi fallback path, eliminating duplication and making the predicate directly testable.

Add 10 new unit tests covering:
- is_uma_device_name: gb10, dgx spark (case-insensitive), non-UMA GPUs, empty string
- read_meminfo_memory: successful parse from a temp file, missing file
- parse_meminfo_content: available > total saturating_sub, malformed lines
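For illustration, a pure name predicate of the kind described above might look like this. The matched substrings come from the test list in this commit; the exact matching rule (lowercase the name, then substring match) is an assumption rather than the merged implementation.

```rust
/// Return true if the device name looks like a UMA device.
/// Assumed rule: case-insensitive substring match on "gb10" or "dgx spark".
fn is_uma_device_name(name: &str) -> bool {
    let lower = name.to_lowercase();
    lower.contains("gb10") || lower.contains("dgx spark")
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn matches_uma_names_case_insensitively() {
        assert!(is_uma_device_name("NVIDIA GB10"));
        assert!(is_uma_device_name("DGX Spark"));
        assert!(is_uma_device_name("dgx spark"));
    }

    #[test]
    fn rejects_other_names() {
        assert!(!is_uma_device_name("NVIDIA GeForce RTX 4090"));
        assert!(!is_uma_device_name(""));
    }
}
```

Because the function takes only a `&str` and touches no NVML or filesystem state, it can be tested without a GPU present.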
PR Finalization Complete
Summary
All checks passing (245 tests, cargo clippy clean, cargo fmt clean). Ready for merge.
Summary
- Detect UMA on Blackwell-class GPUs where `nvmlDeviceGetMemoryInfo()` returns unavailable
- Fall back to `/proc/meminfo` when NVML memory reporting is unavailable on UMA devices

Closes #80
Test plan