feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM)#175
Merged
Conversation
Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`, `gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics` with stable lowercase label encodings for round-tripping through the Prometheus exposition format. Populate the new fields with safe "unavailable" defaults (`None` / empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi, Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test helpers in the UI / view / metrics layers so non-NVIDIA paths remain zero-behavioural-change.
Add `src/device/readers/nvidia_hardware.rs` with:
* `HardwareDetailCache` memoising NUMA node id, GSP firmware mode,
and GSP firmware version per device index (static values cached
for the process lifetime, matching the thermal-threshold pattern
introduced in PR #173).
* `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS`
and classifying the remote endpoint of every active link via raw FFI
to work around a latent bug in nvml-wrapper 0.12.1's wrapper that
writes to an immutable temporary.
* `collect_gpm_metrics` that detects GPM support (Hopper+) but defers
the two-sample handshake to a follow-up; the reader currently emits
`Some(GpmMetrics::default())` on supported hardware so the TUI can
show a "GPM-capable" hint without fabricating numeric readings.
Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call
populates the new fields on supported devices while silently degrading to
the "unavailable" defaults on older drivers, non-datacenter SKUs, and the
nvidia-smi CSV fallback path.
Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting
six metric families for every GPU that populated at least one
hardware-detail field:
* `all_smi_gpu_numa_node_id` (gauge; omitted when unknown)
* `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2)
* `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label)
* `all_smi_nvlink_remote_device_type` (one row per active link with
`link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels)
* `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled)
* `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled)
The exporter self-filters to GPUs that have any hardware-detail field
populated so non-NVIDIA paths and older-driver paths stay silent.
A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`)
produces no gauge lines to avoid publishing misleading zeros.
Row caching in `collect_rows` pre-computes the stringified `index` label
once per GPU so the six metric families iterate without re-allocating
labels, matching the vGPU / MIG exporter pattern.
Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132 metric families so remote-mode scrapes reconstruct the same `GpuInfo` shape the local reader produces. Route `nvlink_*` lines through the existing per-GPU accumulator via prefix-matching in the top-level dispatcher. Harden the parser against hostile remote input: * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector (current hardware tops out at 18 links; 32 leaves headroom). * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies. * `MAX_GSP_VERSION_LEN = 128` bounds the version string label. * GSP firmware mode accepts only 0..=2 (the values the exporter emits). * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop so dashboards can distinguish "unavailable" from a real zero. * Duplicate `link_index` emissions coalesce (last-wins) rather than inflating the vector length. Introduce `saturating_i32` as a counterpart to `saturating_u32` for the signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM payloads don't force-allocate an empty container on every GPU.
Add `render_hardware_details_row` producing a compact indented row under each GPU when NUMA placement, GSP firmware mode/version, or NvLink topology is reported. Format example: HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but a supported-but-unsampled GPM snapshot produces nothing, matching the exporter contract. Extend `gpu_render_line_count` to include the new optional row so the PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines on hosts that populate every extended field. A new `gpu_has_hardware_details_row` predicate drives both the renderer and the line-count calculation so the two paths cannot drift.
Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the mock server exposes representative values for every issue-#132 metric family: * NUMA node alternates between 0 and 1 across GPUs (replicates a dual-socket host with even-split GPU topology). * GSP firmware enabled, version `550.54.15` (matching a recent R555 datacenter driver). * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch — mirroring typical HGX 8-GPU board topology. * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42). Fixed rather than randomised values keep the mock output stable across scrapes; hardware details never change at runtime on real devices, and stable mock values simplify downstream test assertions.
Mirror the vGPU / MIG integration-test pattern: build Prometheus
exposition snippets that match the exporter output byte-for-byte, run
them through `MetricsParser::parse_metrics`, and assert every
hardware-detail field survives the round trip.
Coverage:
* Full happy-path round-trip preserves NUMA, GSP mode + version,
all 6 NvLinks with correct classification counts, and both GPM
gauges.
* Partial scrape (NUMA only) preserves absence of the other fields.
* Non-NVIDIA scrape leaves every new field at its "unavailable"
default.
* Unknown and IBM NPU NvLink remote types round-trip intact.
* Parser caps the per-GPU NvLink vector at 32 against a hostile
40-link emission.
* Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are
dropped so dashboards never render impossible values.
* Partial GPM emission populates only the reported field.
… permanently Replace the single HardwareDetails entry cache with three independent per-field caches (numa, gsp_mode, gsp_version). A transient NVML error (Unknown, GpuLost) on one field no longer prevents the other fields from being cached, and the next poll will retry the failed field. Only permanent errors (NotSupported, FunctionNotFound) are cached as None.
…tion Narrow the early-exit in collect_nvlink_remote_devices to only break on known past-the-end variants (InvalidArg, NotSupported). Transient errors (Unknown, GpuLost) now skip the current link index with continue so that a driver hiccup on link N does not silently hide links N+1 through MAX.
Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so fractional inputs like 1.5 (which would silently saturate to 1) are dropped instead of stored as a wrong mode code. Apply the same guard to gpu_performance_state for consistency.
A malicious remote Prometheus endpoint could embed ANSI escape sequences in the version label (e.g. ESC[2J ESC[H). Without a character filter the TUI would execute those sequences on render. Add a chars().all(|c| !c.is_control()) guard so any version string containing control characters is dropped and the field stays None.
Collapse nested if-let in get_or_fetch_field probe path (collapsible_if) and convert useless format! to .to_string() in control-char test.
Add NVIDIA Hardware Details Metrics section to API.md covering the six new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util) with label tables and absence semantics. Update README.md NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry to reflect the hardware details additions from issue #132.
Member
Author
PR Finalization CompleteSummaryLint/Format: Tests: Documentation: Added new NVIDIA Hardware Details Metrics section to Commit: All checks passing. Ready for merge. |
4 tasks
inureyes
added a commit
that referenced
this pull request
Apr 16, 2026
…#176) * fix(security): cap CPU core_id in remote parser to prevent OOM A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold. * fix(security): strip control characters from all metric label values A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings. * refactor: replace parse_metrics 6-tuple with ParsedMetrics struct The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests. * perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID. * fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively. * test: add missing vGPU memory_utilization round-trip assertion The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip. * fix: correct vgpu doc comment and sanitize GPU detail map values Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format. * perf: precompute GPU exporter row cache to reduce per-scrape allocations Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods. * chore: apply cargo fmt formatting * chore: fix clippy needless_lifetimes warnings * fix: update stale label names in hardware exporter doc and tests - Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests * docs: note breaking label rename in v0.21.0 changelog entry
10 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Expose the
nvml-wrapper0.12 hardware-detail APIs through the entire all-smi pipeline — NVML reader →GpuInfo→ Prometheus exporter + remote parser → TUI + mock templates + integration tests. Closes #132.Five new per-GPU fields land on
GpuInfo:numa_node_id: Option<i32>(NVMLnvmlDeviceGetNumaNodeId)gsp_firmware_mode: Option<u8>(0=disabled, 1=enabled, 2=default)gsp_firmware_version: Option<String>nvlink_remote_devices: Vec<NvLinkRemoteDevice>classifying every active link's remote endpoint (gpu / switch / ibmnpu / unknown)gpm_metrics: Option<GpmMetrics>for GPU Performance Monitoring (support-detection is live; the two-sample handshake is deferred to a follow-up)What's in the commits
feat: add hardware-detail data model— types + safe defaults in every non-NVIDIA reader and test helperfeat: populate hardware details from NVML in the NVIDIA reader— newnvidia_hardwaremodule withHardwareDetailCache, NvLink enumeration via raw FFI (working around a latent wrapper bug), and GPM detectionfeat: export NVIDIA hardware details as Prometheus metrics— newapi/metrics/hardware.rsexporter +handlers.rswiringfeat: parse hardware-detail metrics on the remote side— parser updates with defensive caps (MAX_NVLINK_PER_GPU=32, NUMA id ≤ 4096, GSP version length ≤ 128) and range checksfeat: surface hardware details on a compact TUI row per GPU— new indented rowHW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1)plusgpu_render_line_countupdatefeat: emit synthetic hardware details from the NVIDIA mock template— mock server emits 8 GPUs × 6 NvLinks + alternating NUMA nodes + GSP version + GPM gaugestest: add integration tests for the hardware-detail pipeline— mirrors the vGPU / MIG patternMetrics produced
all_smi_gpu_numa_node_idall_smi_gpu_gsp_firmware_modeall_smi_gpu_gsp_firmware_version_infoversionlabelall_smi_nvlink_remote_device_typelink_index+remote_typelabelsall_smi_gpu_sm_occupancyall_smi_gpu_memory_bandwidth_utilizationTest plan
cargo build --features mock— cleancargo test --features mock— 14 test suites, 0 failures (536 lib tests + 10 hardware unit tests + 10 nvidia_hardware unit tests + 10 TUI hw_row tests + 11 hardware exporter tests + 8 integration tests + …)cargo clippy --features mock --all-targets— zero new warningscargo fmt --check— cleanall-smi-mock-server --port-range 19190 --platform nvidiaemits all six metric families with NUMA alternation and 48 NvLink rows across 8 GPUsNone/ empty vectors for every new field — zero behavioural changeDesign decisions
gpm_sample→ wait N seconds →gpm_sample→gpm_metrics_get) which doesn't fit all-smi's single-poll reader contract. The issue's instruction was to degrade rather than over-engineer, so the reader currently detects GPM support (Some(GpmMetrics::default())on Hopper+) without publishing fabricated numbers. The exporter/parser/TUI all silently skip the gauges when no field is populated, and acollect_gpm_metricsdocstring marks the follow-up. Every piece of plumbing downstream of the reader is ready for the numeric values to light up.nvml_wrapper::nv_link::NvLink::remote_device_type()in 0.12.1 passes&mut device_type.as_c()— an immutable temporary — so NVML never writes back. We call the symbol directly vianvmlDeviceGetNvLinkRemoteDeviceTypeinstead. Vendoredmap_remote_device_typekeeps the workaround contained so it can be removed when the upstream bug is fixed.numa_node_idcross-platform:Device::numa_node_id()is available on all platforms in 0.12.1 (nocfg(target_os = "linux")gate), so no FFI stub was needed. The reader canonicalises theu32::MAXsentinel toNonerather than leaking it as a huge number.HardwareDetailCache(static per device, process lifetime), matching the thermal-threshold pattern from feat: add NVIDIA thermal thresholds and P-state monitoring #173. NvLink is NOT cached since links can go down at runtime. GPM support detection is cheap and uncached.HWrow is distinct from the thermal/P-state row introduced in feat: add NVIDIA thermal thresholds and P-state monitoring #173 so operators can see both simultaneously without vertical collisions.gpu_render_line_countnow correctly accounts for up to 4 rows per GPU (info + thermal + hw + gauges).Files touched
New:
src/device/readers/nvidia_hardware.rssrc/api/metrics/hardware.rstests/hardware_details_integration_test.rsModified:
src/device/types.rs,src/device/readers/{nvidia,amd,amd_windows,apple_silicon_native,furiosa,gaudi,google_tpu,nvidia_jetson,rebellions,tenstorrent,mod}.rs,src/api/{handlers.rs,metrics/{mod,gpu}.rs},src/network/metrics_parser.rs,src/ui/{renderers/gpu_renderer,dashboard,gpu_sparkline_panel,layout,led_grid}.rs,src/view/view_cache.rs,src/metrics/aggregator.rs,src/mock/templates/nvidia.rsCloses #132