Skip to content

feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM)#175

Merged
inureyes merged 13 commits into
mainfrom
feature/issue-132-hw-details
Apr 15, 2026
Merged

feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM)#175
inureyes merged 13 commits into
mainfrom
feature/issue-132-hw-details

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Expose the nvml-wrapper 0.12 hardware-detail APIs through the entire all-smi pipeline — NVML reader → GpuInfo → Prometheus exporter + remote parser → TUI + mock templates + integration tests. Closes #132.

Five new per-GPU fields land on GpuInfo:

  • numa_node_id: Option<i32> (NVML nvmlDeviceGetNumaNodeId)
  • gsp_firmware_mode: Option<u8> (0=disabled, 1=enabled, 2=default)
  • gsp_firmware_version: Option<String>
  • nvlink_remote_devices: Vec<NvLinkRemoteDevice> classifying every active link's remote endpoint (gpu / switch / ibmnpu / unknown)
  • gpm_metrics: Option<GpmMetrics> for GPU Performance Monitoring (support-detection is live; the two-sample handshake is deferred to a follow-up)

What's in the commits

  1. feat: add hardware-detail data model — types + safe defaults in every non-NVIDIA reader and test helper
  2. feat: populate hardware details from NVML in the NVIDIA reader — new nvidia_hardware module with HardwareDetailCache, NvLink enumeration via raw FFI (working around a latent wrapper bug), and GPM detection
  3. feat: export NVIDIA hardware details as Prometheus metrics — new api/metrics/hardware.rs exporter + handlers.rs wiring
  4. feat: parse hardware-detail metrics on the remote side — parser updates with defensive caps (MAX_NVLINK_PER_GPU=32, NUMA id ≤ 4096, GSP version length ≤ 128) and range checks
  5. feat: surface hardware details on a compact TUI row per GPU — new indented row HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) plus gpu_render_line_count update
  6. feat: emit synthetic hardware details from the NVIDIA mock template — mock server emits 8 GPUs × 6 NvLinks + alternating NUMA nodes + GSP version + GPM gauges
  7. test: add integration tests for the hardware-detail pipeline — mirrors the vGPU / MIG pattern

Metrics produced

Metric Type Notes
all_smi_gpu_numa_node_id gauge Omitted when NUMA topology unknown
all_smi_gpu_gsp_firmware_mode gauge (0/1/2) Omitted on pre-R525 drivers
all_smi_gpu_gsp_firmware_version_info info (value=1) Version carried in version label
all_smi_nvlink_remote_device_type info (value=1) Per active link; link_index + remote_type labels
all_smi_gpu_sm_occupancy gauge 0-1 GPM; omitted when unsampled
all_smi_gpu_memory_bandwidth_utilization gauge 0-1 GPM; omitted when unsampled

Test plan

  • cargo build --features mock — clean
  • cargo test --features mock — 14 test suites, 0 failures (536 lib tests + 10 hardware unit tests + 10 nvidia_hardware unit tests + 10 TUI hw_row tests + 11 hardware exporter tests + 8 integration tests + …)
  • cargo clippy --features mock --all-targets — zero new warnings
  • cargo fmt --check — clean
  • Mock server smoke test: all-smi-mock-server --port-range 19190 --platform nvidia emits all six metric families with NUMA alternation and 48 NvLink rows across 8 GPUs
  • Non-NVIDIA readers (AMD, Apple Silicon, Gaudi, TPU, Furiosa, Rebellions, Tenstorrent, Jetson) return None / empty vectors for every new field — zero behavioural change

Design decisions

  • GPM scope: live sampling requires a two-sample handshake (gpm_sample → wait N seconds → gpm_samplegpm_metrics_get) which doesn't fit all-smi's single-poll reader contract. The issue's instruction was to degrade rather than over-engineer, so the reader currently detects GPM support (Some(GpmMetrics::default()) on Hopper+) without publishing fabricated numbers. The exporter/parser/TUI all silently skip the gauges when no field is populated, and a collect_gpm_metrics docstring marks the follow-up. Every piece of plumbing downstream of the reader is ready for the numeric values to light up.
  • Raw FFI for NvLink remote type: nvml_wrapper::nv_link::NvLink::remote_device_type() in 0.12.1 passes &mut device_type.as_c() — an immutable temporary — so NVML never writes back. We call the symbol directly via nvmlDeviceGetNvLinkRemoteDeviceType instead. Vendored map_remote_device_type keeps the workaround contained so it can be removed when the upstream bug is fixed.
  • numa_node_id cross-platform: Device::numa_node_id() is available on all platforms in 0.12.1 (no cfg(target_os = "linux") gate), so no FFI stub was needed. The reader canonicalises the u32::MAX sentinel to None rather than leaking it as a huge number.
  • Cache shape: NUMA + GSP live in a new HardwareDetailCache (static per device, process lifetime), matching the thermal-threshold pattern from feat: add NVIDIA thermal thresholds and P-state monitoring #173. NvLink is NOT cached since links can go down at runtime. GPM support detection is cheap and uncached.
  • TUI row separation: the new HW row is distinct from the thermal/P-state row introduced in feat: add NVIDIA thermal thresholds and P-state monitoring #173 so operators can see both simultaneously without vertical collisions. gpu_render_line_count now correctly accounts for up to 4 rows per GPU (info + thermal + hw + gauges).

Files touched

New:

  • src/device/readers/nvidia_hardware.rs
  • src/api/metrics/hardware.rs
  • tests/hardware_details_integration_test.rs

Modified: src/device/types.rs, src/device/readers/{nvidia,amd,amd_windows,apple_silicon_native,furiosa,gaudi,google_tpu,nvidia_jetson,rebellions,tenstorrent,mod}.rs, src/api/{handlers.rs,metrics/{mod,gpu}.rs}, src/network/metrics_parser.rs, src/ui/{renderers/gpu_renderer,dashboard,gpu_sparkline_panel,layout,led_grid}.rs, src/view/view_cache.rs, src/metrics/aggregator.rs, src/mock/templates/nvidia.rs

Closes #132

Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA
hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`,
`gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add
supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics`
with stable lowercase label encodings for round-tripping through the
Prometheus exposition format.

Populate the new fields with safe "unavailable" defaults (`None` /
empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi,
Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test
helpers in the UI / view / metrics layers so non-NVIDIA paths remain
zero-behavioural-change.
Add `src/device/readers/nvidia_hardware.rs` with:
  * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode,
    and GSP firmware version per device index (static values cached
    for the process lifetime, matching the thermal-threshold pattern
    introduced in PR #173).
  * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS`
    and classifying the remote endpoint of every active link via raw FFI
    to work around a latent bug in nvml-wrapper 0.12.1's wrapper that
    writes to an immutable temporary.
  * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers
    the two-sample handshake to a follow-up; the reader currently emits
    `Some(GpmMetrics::default())` on supported hardware so the TUI can
    show a "GPM-capable" hint without fabricating numeric readings.

Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call
populates the new fields on supported devices while silently degrading to
the "unavailable" defaults on older drivers, non-datacenter SKUs, and the
nvidia-smi CSV fallback path.
Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting
six metric families for every GPU that populated at least one
hardware-detail field:
  * `all_smi_gpu_numa_node_id` (gauge; omitted when unknown)
  * `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2)
  * `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label)
  * `all_smi_nvlink_remote_device_type` (one row per active link with
    `link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels)
  * `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled)
  * `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled)

The exporter self-filters to GPUs that have any hardware-detail field
populated so non-NVIDIA paths and older-driver paths stay silent.
A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`)
produces no gauge lines to avoid publishing misleading zeros.

Row caching in `collect_rows` pre-computes the stringified `index` label
once per GPU so the six metric families iterate without re-allocating
labels, matching the vGPU / MIG exporter pattern.
Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132
metric families so remote-mode scrapes reconstruct the same `GpuInfo`
shape the local reader produces. Route `nvlink_*` lines through the
existing per-GPU accumulator via prefix-matching in the top-level
dispatcher.

Harden the parser against hostile remote input:
  * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector
    (current hardware tops out at 18 links; 32 leaves headroom).
  * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies.
  * `MAX_GSP_VERSION_LEN = 128` bounds the version string label.
  * GSP firmware mode accepts only 0..=2 (the values the exporter emits).
  * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop
    so dashboards can distinguish "unavailable" from a real zero.
  * Duplicate `link_index` emissions coalesce (last-wins) rather than
    inflating the vector length.

Introduce `saturating_i32` as a counterpart to `saturating_u32` for the
signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM
payloads don't force-allocate an empty container on every GPU.
Add `render_hardware_details_row` producing a compact indented row under
each GPU when NUMA placement, GSP firmware mode/version, or NvLink
topology is reported. Format example:

  HW  NUMA:0  GSP:enabled v550.54.15  NVLink:6x(gpu=5,sw=1)

GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but
a supported-but-unsampled GPM snapshot produces nothing, matching the
exporter contract.

Extend `gpu_render_line_count` to include the new optional row so the
PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines
on hosts that populate every extended field. A new
`gpu_has_hardware_details_row` predicate drives both the renderer and the
line-count calculation so the two paths cannot drift.
Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the
mock server exposes representative values for every issue-#132 metric
family:
  * NUMA node alternates between 0 and 1 across GPUs (replicates a
    dual-socket host with even-split GPU topology).
  * GSP firmware enabled, version `550.54.15` (matching a recent R555
    datacenter driver).
  * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch —
    mirroring typical HGX 8-GPU board topology.
  * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42).

Fixed rather than randomised values keep the mock output stable across
scrapes; hardware details never change at runtime on real devices, and
stable mock values simplify downstream test assertions.
Mirror the vGPU / MIG integration-test pattern: build Prometheus
exposition snippets that match the exporter output byte-for-byte, run
them through `MetricsParser::parse_metrics`, and assert every
hardware-detail field survives the round trip.

Coverage:
  * Full happy-path round-trip preserves NUMA, GSP mode + version,
    all 6 NvLinks with correct classification counts, and both GPM
    gauges.
  * Partial scrape (NUMA only) preserves absence of the other fields.
  * Non-NVIDIA scrape leaves every new field at its "unavailable"
    default.
  * Unknown and IBM NPU NvLink remote types round-trip intact.
  * Parser caps the per-GPU NvLink vector at 32 against a hostile
    40-link emission.
  * Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are
    dropped so dashboards never render impossible values.
  * Partial GPM emission populates only the reported field.
@inureyes inureyes added type:enhancement New feature or request status:review Under review priority:low Low priority issue device:nvidia-gpu NVIDIA GPU related labels Apr 15, 2026
… permanently

Replace the single HardwareDetails entry cache with three independent
per-field caches (numa, gsp_mode, gsp_version). A transient NVML error
(Unknown, GpuLost) on one field no longer prevents the other fields from
being cached, and the next poll will retry the failed field. Only permanent
errors (NotSupported, FunctionNotFound) are cached as None.
…tion

Narrow the early-exit in collect_nvlink_remote_devices to only break on
known past-the-end variants (InvalidArg, NotSupported). Transient errors
(Unknown, GpuLost) now skip the current link index with continue so that
a driver hiccup on link N does not silently hide links N+1 through MAX.
Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so
fractional inputs like 1.5 (which would silently saturate to 1) are dropped
instead of stored as a wrong mode code. Apply the same guard to
gpu_performance_state for consistency.
A malicious remote Prometheus endpoint could embed ANSI escape sequences
in the version label (e.g. ESC[2J ESC[H). Without a character filter the
TUI would execute those sequences on render. Add a chars().all(|c|
!c.is_control()) guard so any version string containing control characters
is dropped and the field stays None.
Collapse nested if-let in get_or_fetch_field probe path (collapsible_if)
and convert useless format! to .to_string() in control-char test.
Add NVIDIA Hardware Details Metrics section to API.md covering the six
new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version
info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util)
with label tables and absence semantics.

Update README.md NVIDIA feature bullet, API mode metrics list, mock
server note, and v0.21.0 changelog entry to reflect the hardware details
additions from issue #132.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Summary

Lint/Format: cargo fmt --all — no changes needed. cargo clippy --features mock --all-targets — 6 warnings exist in pre-existing files (src/device/readers/amd.rs, src/device/readers/tpu_grpc.rs), none in PR-touched files.

Tests: cargo test --features mock — all 21 unit/integration tests pass, 12 doc-test ignored-by-design.

Documentation: Added new NVIDIA Hardware Details Metrics section to API.md documenting all 6 new Prometheus series with label tables and absence semantics. Updated README.md: NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry.

Commit: 3ece20a pushed to feature/issue-132-hw-details.

All checks passing. Ready for merge.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 15, 2026
@inureyes inureyes merged commit e27c283 into main Apr 15, 2026
2 checks passed
@inureyes inureyes deleted the feature/issue-132-hw-details branch April 15, 2026 11:23
inureyes added a commit that referenced this pull request Apr 16, 2026
…#176)

* fix(security): cap CPU core_id in remote parser to prevent OOM

A malicious upstream sending core_id="4294967295" via the Prometheus
exposition format would cause the parser to grow the per_core_utilization
vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES
cap (1024) and reject any core_id at or above that threshold.

* fix(security): strip control characters from all metric label values

A compromised remote endpoint can inject ANSI escape sequences (e.g.
\x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid,
cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values
directly, causing the terminal to execute the injected sequences.

Add strip_control_chars() to parsing/common.rs and apply it inside
sanitize_label_value() so ALL label values flowing through the parser
are cleaned. Also apply stripping in unescape_label_value() for quoted
values, and in MetricBuilder::metric() on the exporter side to defend
against local NVML returning unexpected strings.

* refactor: replace parse_metrics 6-tuple with ParsedMetrics struct

The parse_metrics return type was an unreadable and unsustainable
6-element tuple. Replace it with a named ParsedMetrics struct with
clearly documented fields. Update all call sites in client.rs and
all integration/unit tests.

* perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching

In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in
find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M)
per frame, causing millions of comparisons.

Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and
use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as
a rare linear scan for entries not found by UUID.

* fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters

BREAKING CHANGE: The GPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid`
to match the more explicit naming already used by the MIG and vGPU
exporters. The remote parser accepts both old and new label names for
backward compatibility with nodes running older versions.

Existing Prometheus dashboards and alert rules that filter on
`index=` or `uuid=` labels will need to be updated to use
`gpu_index=` and `gpu_uuid=` respectively.

* test: add missing vGPU memory_utilization round-trip assertion

The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture
omitted all_smi_vgpu_memory_utilization from the test data. Add the
metric line to the fixture and assert that memory_utilization=45
survives the parse round-trip.

* fix: correct vgpu doc comment and sanitize GPU detail map values

Fix the vgpu.rs module doc that incorrectly referenced
all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode).

Apply sanitize_label_value() to detail HashMap values in gpu.rs
export_device_info to strip control characters from NVML-sourced
strings before they reach the Prometheus exposition format.

* perf: precompute GPU exporter row cache to reduce per-scrape allocations

Add a GpuRow struct and collect_rows function to the GPU exporter,
following the same pattern used by the MIG, vGPU, and hardware
exporters. The stringified gpu_index is computed once per GPU per
scrape instead of being re-created 5x across the five export methods.

* chore: apply cargo fmt formatting

* chore: fix clippy needless_lifetimes warnings

* fix: update stale label names in hardware exporter doc and tests

- Update module doc comment to use `gpu_uuid`/`gpu_index` matching the
  actual `base_labels` implementation
- Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix
  so substring matching cannot mask a regression to old label names
- Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since
  it was superseded by `gpu_render_line_count_with_lookup` for production
  use and is now only called from tests

* docs: note breaking label rename in v0.21.0 changelog entry
@inureyes inureyes self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

device:nvidia-gpu NVIDIA GPU related priority:low Low priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add extended hardware details (NUMA, GSP firmware, NvLink remote device)

1 participant