Add extended hardware details (NUMA, GSP firmware, NvLink remote device) #132

@inureyes

Problem / Background

The nvml-wrapper crate v0.12.0 introduced several new hardware detail APIs that are not yet exposed in all-smi:

  • NUMA node ID (numa_node_id) — identifies the NUMA node a GPU is attached to
  • GSP firmware mode and version (gsp_firmware_mode, gsp_firmware_version) — reports GPU System Processor firmware state
  • NvLink remote device type (remote_device_type) — identifies what is connected on the other end of each NvLink
  • GPU Performance Monitoring (GPM) — fine-grained SM occupancy and other performance counters

These details are valuable for topology-aware scheduling, firmware auditing, and interconnect diagnostics in multi-GPU / multi-node environments.

Goal

Expose detailed hardware topology and firmware information so operators can inspect NUMA placement, GSP firmware health, NvLink interconnect topology, and GPU performance counters from a single tool.

Scope

  • Read NUMA node ID per GPU for topology-aware monitoring
  • Read GSP firmware mode (enabled / disabled / default) and version string
  • Read NvLink remote device type for each active link (GPU, CPU/host bridge, NvSwitch, etc.)
  • Integrate GPU Performance Monitoring (GPM) metrics where available (e.g., SM occupancy, memory bandwidth utilization). Status: support detection and metric plumbing are complete; the two-sample handshake is deferred to a follow-up (see collect_gpm_metrics in src/device/readers/nvidia_hardware.rs)
  • Display hardware details in TUI info/detail view (e.g., a dedicated "Hardware Details" section or tab)
  • Export hardware detail metrics in Prometheus format (all_smi_numa_node_id, all_smi_gsp_firmware_mode, all_smi_nvlink_remote_device_type, GPM gauges, etc.)
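As a rough illustration of the export item above, the sketch below renders the three metric names listed in this issue in Prometheus text format. Everything except those metric names is an assumption made for the example: the GpuHardwareSample struct, the label names (gpu, link, type, version), and the numeric encoding of the GSP firmware mode are hypothetical and may not match how all-smi labels its existing metrics.

```rust
use std::fmt::Write;

/// Hypothetical per-GPU snapshot; not a type from all-smi.
pub struct GpuHardwareSample {
    pub index: u32,
    pub numa_node_id: Option<u32>,
    /// Assumed encoding for illustration: 0 = disabled, 1 = enabled, 2 = default.
    pub gsp_firmware_mode: Option<u32>,
    pub gsp_firmware_version: Option<String>,
    /// (link index, remote device type label), active links only.
    pub nvlink_remote: Vec<(u32, String)>,
}

/// Renders one GPU's hardware details; fields that are `None` emit no line.
pub fn render_prometheus(s: &GpuHardwareSample) -> String {
    let mut out = String::new();
    if let Some(node) = s.numa_node_id {
        let _ = writeln!(out, "all_smi_numa_node_id{{gpu=\"{}\"}} {}", s.index, node);
    }
    if let Some(mode) = s.gsp_firmware_mode {
        // The version string rides along as a label; "unknown" is a placeholder.
        let version = s.gsp_firmware_version.as_deref().unwrap_or("unknown");
        let _ = writeln!(
            out,
            "all_smi_gsp_firmware_mode{{gpu=\"{}\",version=\"{}\"}} {}",
            s.index, version, mode
        );
    }
    for (link, kind) in &s.nvlink_remote {
        // One info-style series per active link, with a constant value of 1.
        let _ = writeln!(
            out,
            "all_smi_nvlink_remote_device_type{{gpu=\"{}\",link=\"{}\",type=\"{}\"}} 1",
            s.index, link, kind
        );
    }
    out
}
```

Omitting a line when a value is unavailable keeps the output consistent with the "degrade gracefully and omit the metric" consideration. Label-value escaping is left out for brevity and would be needed if a version string could contain quotes or backslashes.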

Technical Considerations

  • These APIs are NVIDIA-specific; guard behind the existing NVIDIA GPU reader path.
  • numa_node_id may return an error on platforms without NUMA support — handle gracefully.
  • GSP firmware APIs may not be available on older driver versions — degrade gracefully and omit the metric.
  • NvLink enumeration should iterate over all supported links and skip inactive ones.
  • GPM support depends on GPU architecture (Hopper+); detect availability before querying.
  • Ensure mock server generates plausible values for new fields so TUI and API tests work without real hardware.
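The considerations above could be combined roughly as follows. This is a minimal sketch, not the nvml-wrapper API: the HardwareProbe trait, its method signatures, the String error type, and the FakeProbe stand-in are all hypothetical, chosen so the example compiles without a GPU. The point is only the shape of the logic: every query failure becomes a missing value rather than a propagated error, and inactive links are skipped.

```rust
/// Hypothetical abstraction over the NVML queries; not the nvml-wrapper API.
pub trait HardwareProbe {
    fn numa_node_id(&self) -> Result<u32, String>;
    fn gsp_firmware_version(&self) -> Result<String, String>;
    fn link_count(&self) -> u32;
    fn link_is_active(&self, link: u32) -> Result<bool, String>;
    fn remote_device_type(&self, link: u32) -> Result<String, String>;
}

/// Collected details; `None` means "not available on this platform or driver".
#[derive(Debug, Default, PartialEq)]
pub struct HardwareDetails {
    pub numa_node_id: Option<u32>,
    pub gsp_firmware_version: Option<String>,
    pub nvlink_remote: Vec<(u32, String)>,
}

pub fn collect_details<P: HardwareProbe>(probe: &P) -> HardwareDetails {
    let mut nvlink_remote = Vec::new();
    for link in 0..probe.link_count() {
        // A failed state query is treated the same as an inactive link.
        if !probe.link_is_active(link).unwrap_or(false) {
            continue;
        }
        if let Ok(kind) = probe.remote_device_type(link) {
            nvlink_remote.push((link, kind));
        }
    }
    HardwareDetails {
        // `.ok()` drops the error so unsupported platforms yield `None`.
        numa_node_id: probe.numa_node_id().ok(),
        gsp_firmware_version: probe.gsp_firmware_version().ok(),
        nvlink_remote,
    }
}

/// Stand-in probe for tests: no NUMA support, three links, link 1 inactive.
pub struct FakeProbe;

impl HardwareProbe for FakeProbe {
    fn numa_node_id(&self) -> Result<u32, String> {
        Err("not supported".to_string())
    }
    fn gsp_firmware_version(&self) -> Result<String, String> {
        Ok("test-version".to_string())
    }
    fn link_count(&self) -> u32 {
        3
    }
    fn link_is_active(&self, link: u32) -> Result<bool, String> {
        Ok(link != 1)
    }
    fn remote_device_type(&self, link: u32) -> Result<String, String> {
        Ok(if link == 0 { "gpu" } else { "switch" }.to_string())
    }
}
```

Calling collect_details(&FakeProbe) yields no NUMA node, the placeholder firmware version, and entries for links 0 and 2 only. Swallowing errors silently also avoids log spam on unsupported hardware; a real reader might log each distinct failure once at debug level instead.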

Acceptance Criteria

  • NUMA node ID is collected and displayed per GPU in both TUI and Prometheus output
  • GSP firmware mode and version are collected and displayed per GPU
  • NvLink remote device type is collected and displayed per link per GPU
  • GPM metrics are collected and exported when the GPU supports them (support detection and the mock, parser, and exporter paths are complete; live two-sample collection is a follow-up tracked in the collect_gpm_metrics docstring)
  • All new fields degrade gracefully (no crash, no error log spam) on unsupported hardware/drivers
  • Mock server provides representative test data for all new fields
  • Prometheus metric names follow existing naming conventions (all_smi_*)
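For the deferred two-sample GPM collection, one possible shape is a small window that remembers the previous sample and only produces a value once a second sample arrives. The sketch below is an assumption-heavy illustration: CounterSample, its two counter fields, and the percentage formula are hypothetical stand-ins, since in NVML the samples are opaque and the driver computes the metrics from a pair of them.

```rust
/// Hypothetical raw counter snapshot; real GPM samples are opaque to the caller.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CounterSample {
    pub active_cycles: u64,
    pub elapsed_cycles: u64,
}

/// Keeps the previous sample so each new one can be paired with it.
#[derive(Debug, Default)]
pub struct TwoSampleWindow {
    previous: Option<CounterSample>,
}

impl TwoSampleWindow {
    /// Stores `current` and returns a percentage once two samples exist.
    /// Returns `None` for the first sample, for a zero-length interval,
    /// or if a counter went backwards (e.g. after a reset).
    pub fn push(&mut self, current: CounterSample) -> Option<f64> {
        let previous = self.previous.replace(current)?;
        let elapsed = current.elapsed_cycles.checked_sub(previous.elapsed_cycles)?;
        let active = current.active_cycles.checked_sub(previous.active_cycles)?;
        if elapsed == 0 {
            return None;
        }
        Some(100.0 * active as f64 / elapsed as f64)
    }
}
```

Returning None on the first call lets the exporter simply omit GPM gauges until a full interval has elapsed, which fits the graceful-degradation criterion without special-casing startup.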
