Problem / Background
The nvml-wrapper crate v0.12.0 introduced several new hardware detail APIs that are not yet exposed in all-smi:
NUMA node ID (numa_node_id) — identifies the NUMA node a GPU is attached to
GSP firmware mode and version (gsp_firmware_mode, gsp_firmware_version) — reports GPU System Processor firmware state
NvLink remote device type (remote_device_type) — identifies what is connected on the other end of each NvLink
GPU Performance Monitoring (GPM) — fine-grained SM occupancy and other performance counters
These details are valuable for topology-aware scheduling, firmware auditing, and interconnect diagnostics in multi-GPU / multi-node environments.
Goal
Expose detailed hardware topology and firmware information so operators can inspect NUMA placement, GSP firmware health, NvLink interconnect topology, and GPU performance counters from a single tool.
Scope
Read NUMA node ID per GPU for topology-aware monitoring
Read GSP firmware mode (enabled / disabled / default) and version string
Read NvLink remote device type for each active link (GPU, CPU/host bridge, NvSwitch, etc.)
Integrate GPU Performance Monitoring (GPM) metrics where available (e.g., SM occupancy, memory bandwidth utilization). Status: support detection and metric plumbing are complete; the two-sample handshake is deferred to a follow-up (see collect_gpm_metrics in src/device/readers/nvidia_hardware.rs)
Display hardware details in TUI info/detail view (e.g., a dedicated "Hardware Details" section or tab)
Export hardware detail metrics in Prometheus format (all_smi_numa_node_id, all_smi_gsp_firmware_mode, all_smi_nvlink_remote_device_type, GPM gauges, etc.)
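To make the export item concrete, here is a minimal sketch of how the new fields could be rendered in the Prometheus text format. The metric names come from the list above; everything else (the `HardwareDetails` struct, the `GspMode` enum, the `gpu`/`link`/`mode`/`version`/`type` label names, and the choice of a constant `1` value for label-carrying gauges) is an assumption for illustration, not all-smi's actual types or output.

```rust
use std::fmt::Write as _;

/// Hypothetical representation of the GSP firmware mode (enabled / disabled / default).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum GspMode {
    Enabled,
    Disabled,
    Default,
}

impl GspMode {
    fn as_label(self) -> &'static str {
        match self {
            GspMode::Enabled => "enabled",
            GspMode::Disabled => "disabled",
            GspMode::Default => "default",
        }
    }
}

/// Hypothetical per-GPU snapshot. `None` means the value was unavailable
/// on this hardware or driver, so the metric is omitted entirely.
pub struct HardwareDetails {
    pub gpu_index: u32,
    pub numa_node_id: Option<u32>,
    pub gsp_firmware: Option<(GspMode, String)>,
    /// (link index, remote device type) for active links only.
    pub nvlink_remote_types: Vec<(u32, String)>,
}

/// Render one GPU's hardware details as Prometheus text lines.
/// Label values are assumed not to need escaping in this sketch.
pub fn render_prometheus(d: &HardwareDetails) -> String {
    let mut out = String::new();
    let gpu = d.gpu_index;
    if let Some(node) = d.numa_node_id {
        let _ = writeln!(out, "all_smi_numa_node_id{{gpu=\"{gpu}\"}} {node}");
    }
    if let Some((mode, version)) = &d.gsp_firmware {
        let _ = writeln!(
            out,
            "all_smi_gsp_firmware_mode{{gpu=\"{gpu}\",mode=\"{}\",version=\"{}\"}} 1",
            mode.as_label(),
            version
        );
    }
    for (link, kind) in &d.nvlink_remote_types {
        let _ = writeln!(
            out,
            "all_smi_nvlink_remote_device_type{{gpu=\"{gpu}\",link=\"{link}\",type=\"{kind}\"}} 1"
        );
    }
    out
}
```

Modelling each field as an `Option` (or an empty list) keeps the "omit the metric when unsupported" behaviour in one place: the renderer never has to know why a value is missing.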
Technical Considerations
These APIs are NVIDIA-specific; guard behind the existing NVIDIA GPU reader path.
numa_node_id may return an error on platforms without NUMA support — handle gracefully.
GSP firmware APIs may not be available on older driver versions — degrade gracefully and omit the metric.
NvLink enumeration should iterate over all supported links and skip inactive ones.
GPM support depends on GPU architecture (Hopper+); detect availability before querying.
Ensure mock server generates plausible values for new fields so TUI and API tests work without real hardware.
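The graceful-degradation points above could be handled with one small helper, sketched below. It deliberately does not call nvml-wrapper: `QueryError`, `Degrader`, `LinkState`, and `active_remote_types` are hypothetical names standing in for whatever the real reader uses. The assumed policy is that "not supported" is silent, any other error is reported once per GPU and field, and inactive NvLinks are skipped.

```rust
use std::collections::HashSet;

/// Hypothetical error type standing in for the NVML binding's error.
#[derive(Debug, Clone, PartialEq)]
pub enum QueryError {
    NotSupported,
    Other(String),
}

/// Tracks which (gpu, field) pairs have already produced a warning,
/// so repeated polling does not spam the log.
#[derive(Default)]
pub struct Degrader {
    warned: HashSet<(u32, &'static str)>,
    /// Collected warnings; a real reader would likely send these to its logger.
    pub warnings: Vec<String>,
}

impl Degrader {
    /// Convert a query result into an optional value.
    /// Unsupported queries are dropped silently; other errors warn once.
    pub fn optional<T>(
        &mut self,
        gpu: u32,
        field: &'static str,
        result: Result<T, QueryError>,
    ) -> Option<T> {
        match result {
            Ok(value) => Some(value),
            Err(QueryError::NotSupported) => None,
            Err(err) => {
                if self.warned.insert((gpu, field)) {
                    self.warnings
                        .push(format!("gpu {gpu}: {field} unavailable: {err:?}"));
                }
                None
            }
        }
    }
}

/// Hypothetical result of probing one NvLink.
pub struct LinkState {
    pub active: bool,
    pub remote_type: Result<String, QueryError>,
}

/// Keep only active links whose remote device type could be read,
/// preserving the original link index.
pub fn active_remote_types(
    degrader: &mut Degrader,
    gpu: u32,
    links: Vec<LinkState>,
) -> Vec<(u32, String)> {
    links
        .into_iter()
        .enumerate()
        .filter(|(_, link)| link.active)
        .filter_map(|(index, link)| {
            degrader
                .optional(gpu, "nvlink_remote_device_type", link.remote_type)
                .map(|kind| (index as u32, kind))
        })
        .collect()
}
```

The same `optional` call would wrap the NUMA and GSP firmware queries, so every new field shares one degradation path and one place to tune log verbosity.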
Acceptance Criteria
NUMA node ID is collected and displayed per GPU in both TUI and Prometheus output
GSP firmware mode and version are collected and displayed per GPU
NvLink remote device type is collected and displayed per link per GPU
GPM metrics are collected and exported when the GPU supports them (support detection and the mock, parser, and exporter paths are complete; live two-sample collection is a follow-up tracked in the collect_gpm_metrics docstring)
All new fields degrade gracefully (no crash, no error log spam) on unsupported hardware/drivers
Mock server provides representative test data for all new fields