feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM) by inureyes · Pull Request #175 · lablup/all-smi

inureyes · 2026-04-15T10:53:29Z

Summary

Expose the nvml-wrapper 0.12 hardware-detail APIs through the entire all-smi pipeline — NVML reader → GpuInfo → Prometheus exporter + remote parser → TUI + mock templates + integration tests. Closes #132.

Five new per-GPU fields land on GpuInfo:

numa_node_id: Option<i32> (NVML nvmlDeviceGetNumaNodeId)
gsp_firmware_mode: Option<u8> (0=disabled, 1=enabled, 2=default)
gsp_firmware_version: Option<String>
nvlink_remote_devices: Vec<NvLinkRemoteDevice> classifying every active link's remote endpoint (gpu / switch / ibmnpu / unknown)
gpm_metrics: Option<GpmMetrics> for GPU Performance Monitoring (support-detection is live; the two-sample handshake is deferred to a follow-up)

What's in the commits

feat: add hardware-detail data model — types + safe defaults in every non-NVIDIA reader and test helper
feat: populate hardware details from NVML in the NVIDIA reader — new nvidia_hardware module with HardwareDetailCache, NvLink enumeration via raw FFI (working around a latent wrapper bug), and GPM detection
feat: export NVIDIA hardware details as Prometheus metrics — new api/metrics/hardware.rs exporter + handlers.rs wiring
feat: parse hardware-detail metrics on the remote side — parser updates with defensive caps (MAX_NVLINK_PER_GPU=32, NUMA id ≤ 4096, GSP version length ≤ 128) and range checks
feat: surface hardware details on a compact TUI row per GPU — new indented row HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) plus gpu_render_line_count update
feat: emit synthetic hardware details from the NVIDIA mock template — mock server emits 8 GPUs × 6 NvLinks + alternating NUMA nodes + GSP version + GPM gauges
test: add integration tests for the hardware-detail pipeline — mirrors the vGPU / MIG pattern

Metrics produced

Metric	Type	Notes
`all_smi_gpu_numa_node_id`	gauge	Omitted when NUMA topology unknown
`all_smi_gpu_gsp_firmware_mode`	gauge (0/1/2)	Omitted on pre-R525 drivers
`all_smi_gpu_gsp_firmware_version_info`	info (value=1)	Version carried in `version` label
`all_smi_nvlink_remote_device_type`	info (value=1)	Per active link; `link_index` + `remote_type` labels
`all_smi_gpu_sm_occupancy`	gauge 0-1	GPM; omitted when unsampled
`all_smi_gpu_memory_bandwidth_utilization`	gauge 0-1	GPM; omitted when unsampled

Test plan

cargo build --features mock — clean
cargo test --features mock — 14 test suites, 0 failures (536 lib tests + 10 hardware unit tests + 10 nvidia_hardware unit tests + 10 TUI hw_row tests + 11 hardware exporter tests + 8 integration tests + …)
cargo clippy --features mock --all-targets — zero new warnings
cargo fmt --check — clean
Mock server smoke test: all-smi-mock-server --port-range 19190 --platform nvidia emits all six metric families with NUMA alternation and 48 NvLink rows across 8 GPUs
Non-NVIDIA readers (AMD, Apple Silicon, Gaudi, TPU, Furiosa, Rebellions, Tenstorrent, Jetson) return None / empty vectors for every new field — zero behavioural change

Design decisions

GPM scope: live sampling requires a two-sample handshake (gpm_sample → wait N seconds → gpm_sample → gpm_metrics_get) which doesn't fit all-smi's single-poll reader contract. The issue's instruction was to degrade rather than over-engineer, so the reader currently detects GPM support (Some(GpmMetrics::default()) on Hopper+) without publishing fabricated numbers. The exporter/parser/TUI all silently skip the gauges when no field is populated, and a collect_gpm_metrics docstring marks the follow-up. Every piece of plumbing downstream of the reader is ready for the numeric values to light up.
Raw FFI for NvLink remote type: nvml_wrapper::nv_link::NvLink::remote_device_type() in 0.12.1 passes &mut device_type.as_c() — an immutable temporary — so NVML never writes back. We call the symbol directly via nvmlDeviceGetNvLinkRemoteDeviceType instead. Vendored map_remote_device_type keeps the workaround contained so it can be removed when the upstream bug is fixed.
numa_node_id cross-platform: Device::numa_node_id() is available on all platforms in 0.12.1 (no cfg(target_os = "linux") gate), so no FFI stub was needed. The reader canonicalises the u32::MAX sentinel to None rather than leaking it as a huge number.
Cache shape: NUMA + GSP live in a new HardwareDetailCache (static per device, process lifetime), matching the thermal-threshold pattern from feat: add NVIDIA thermal thresholds and P-state monitoring #173. NvLink is NOT cached since links can go down at runtime. GPM support detection is cheap and uncached.
TUI row separation: the new HW row is distinct from the thermal/P-state row introduced in feat: add NVIDIA thermal thresholds and P-state monitoring #173 so operators can see both simultaneously without vertical collisions. gpu_render_line_count now correctly accounts for up to 4 rows per GPU (info + thermal + hw + gauges).

Files touched

New:

src/device/readers/nvidia_hardware.rs
src/api/metrics/hardware.rs
tests/hardware_details_integration_test.rs

Modified: src/device/types.rs, src/device/readers/{nvidia,amd,amd_windows,apple_silicon_native,furiosa,gaudi,google_tpu,nvidia_jetson,rebellions,tenstorrent,mod}.rs, src/api/{handlers.rs,metrics/{mod,gpu}.rs}, src/network/metrics_parser.rs, src/ui/{renderers/gpu_renderer,dashboard,gpu_sparkline_panel,layout,led_grid}.rs, src/view/view_cache.rs, src/metrics/aggregator.rs, src/mock/templates/nvidia.rs

Closes #132

Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`, `gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics` with stable lowercase label encodings for round-tripping through the Prometheus exposition format. Populate the new fields with safe "unavailable" defaults (`None` / empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi, Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test helpers in the UI / view / metrics layers so non-NVIDIA paths remain zero-behavioural-change.

Add `src/device/readers/nvidia_hardware.rs` with: * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode, and GSP firmware version per device index (static values cached for the process lifetime, matching the thermal-threshold pattern introduced in PR #173). * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS` and classifying the remote endpoint of every active link via raw FFI to work around a latent bug in nvml-wrapper 0.12.1's wrapper that writes to an immutable temporary. * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers the two-sample handshake to a follow-up; the reader currently emits `Some(GpmMetrics::default())` on supported hardware so the TUI can show a "GPM-capable" hint without fabricating numeric readings. Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call populates the new fields on supported devices while silently degrading to the "unavailable" defaults on older drivers, non-datacenter SKUs, and the nvidia-smi CSV fallback path.

Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting six metric families for every GPU that populated at least one hardware-detail field: * `all_smi_gpu_numa_node_id` (gauge; omitted when unknown) * `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2) * `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label) * `all_smi_nvlink_remote_device_type` (one row per active link with `link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels) * `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled) * `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled) The exporter self-filters to GPUs that have any hardware-detail field populated so non-NVIDIA paths and older-driver paths stay silent. A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`) produces no gauge lines to avoid publishing misleading zeros. Row caching in `collect_rows` pre-computes the stringified `index` label once per GPU so the six metric families iterate without re-allocating labels, matching the vGPU / MIG exporter pattern.

Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132 metric families so remote-mode scrapes reconstruct the same `GpuInfo` shape the local reader produces. Route `nvlink_*` lines through the existing per-GPU accumulator via prefix-matching in the top-level dispatcher. Harden the parser against hostile remote input: * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector (current hardware tops out at 18 links; 32 leaves headroom). * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies. * `MAX_GSP_VERSION_LEN = 128` bounds the version string label. * GSP firmware mode accepts only 0..=2 (the values the exporter emits). * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop so dashboards can distinguish "unavailable" from a real zero. * Duplicate `link_index` emissions coalesce (last-wins) rather than inflating the vector length. Introduce `saturating_i32` as a counterpart to `saturating_u32` for the signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM payloads don't force-allocate an empty container on every GPU.

Add `render_hardware_details_row` producing a compact indented row under each GPU when NUMA placement, GSP firmware mode/version, or NvLink topology is reported. Format example: HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but a supported-but-unsampled GPM snapshot produces nothing, matching the exporter contract. Extend `gpu_render_line_count` to include the new optional row so the PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines on hosts that populate every extended field. A new `gpu_has_hardware_details_row` predicate drives both the renderer and the line-count calculation so the two paths cannot drift.

Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the mock server exposes representative values for every issue-#132 metric family: * NUMA node alternates between 0 and 1 across GPUs (replicates a dual-socket host with even-split GPU topology). * GSP firmware enabled, version `550.54.15` (matching a recent R555 datacenter driver). * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch — mirroring typical HGX 8-GPU board topology. * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42). Fixed rather than randomised values keep the mock output stable across scrapes; hardware details never change at runtime on real devices, and stable mock values simplify downstream test assertions.

Mirror the vGPU / MIG integration-test pattern: build Prometheus exposition snippets that match the exporter output byte-for-byte, run them through `MetricsParser::parse_metrics`, and assert every hardware-detail field survives the round trip. Coverage: * Full happy-path round-trip preserves NUMA, GSP mode + version, all 6 NvLinks with correct classification counts, and both GPM gauges. * Partial scrape (NUMA only) preserves absence of the other fields. * Non-NVIDIA scrape leaves every new field at its "unavailable" default. * Unknown and IBM NPU NvLink remote types round-trip intact. * Parser caps the per-GPU NvLink vector at 32 against a hostile 40-link emission. * Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are dropped so dashboards never render impossible values. * Partial GPM emission populates only the reported field.

… permanently Replace the single HardwareDetails entry cache with three independent per-field caches (numa, gsp_mode, gsp_version). A transient NVML error (Unknown, GpuLost) on one field no longer prevents the other fields from being cached, and the next poll will retry the failed field. Only permanent errors (NotSupported, FunctionNotFound) are cached as None.

…tion Narrow the early-exit in collect_nvlink_remote_devices to only break on known past-the-end variants (InvalidArg, NotSupported). Transient errors (Unknown, GpuLost) now skip the current link index with continue so that a driver hiccup on link N does not silently hide links N+1 through MAX.

Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so fractional inputs like 1.5 (which would silently saturate to 1) are dropped instead of stored as a wrong mode code. Apply the same guard to gpu_performance_state for consistency.

A malicious remote Prometheus endpoint could embed ANSI escape sequences in the version label (e.g. ESC[2J ESC[H). Without a character filter the TUI would execute those sequences on render. Add a chars().all(|c| !c.is_control()) guard so any version string containing control characters is dropped and the field stays None.

Collapse nested if-let in get_or_fetch_field probe path (collapsible_if) and convert useless format! to .to_string() in control-char test.

Add NVIDIA Hardware Details Metrics section to API.md covering the six new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util) with label tables and absence semantics. Update README.md NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry to reflect the hardware details additions from issue #132.

inureyes · 2026-04-15T11:23:16Z

PR Finalization Complete

Summary

Lint/Format: cargo fmt --all — no changes needed. cargo clippy --features mock --all-targets — 6 warnings exist in pre-existing files (src/device/readers/amd.rs, src/device/readers/tpu_grpc.rs), none in PR-touched files.

Tests: cargo test --features mock — all 21 unit/integration tests pass, 12 doc-test ignored-by-design.

Documentation: Added new NVIDIA Hardware Details Metrics section to API.md documenting all 6 new Prometheus series with label tables and absence semantics. Updated README.md: NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry.

Commit: 3ece20a pushed to feature/issue-132-hw-details.

All checks passing. Ready for merge.

…#176) * fix(security): cap CPU core_id in remote parser to prevent OOM A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold. * fix(security): strip control characters from all metric label values A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings. * refactor: replace parse_metrics 6-tuple with ParsedMetrics struct The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests. * perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID. * fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively. * test: add missing vGPU memory_utilization round-trip assertion The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip. * fix: correct vgpu doc comment and sanitize GPU detail map values Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format. * perf: precompute GPU exporter row cache to reduce per-scrape allocations Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods. * chore: apply cargo fmt formatting * chore: fix clippy needless_lifetimes warnings * fix: update stale label names in hardware exporter doc and tests - Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests * docs: note breaking label rename in v0.21.0 changelog entry

inureyes added 7 commits April 15, 2026 19:50

inureyes added type:enhancement New feature or request status:review Under review priority:low Low priority issue device:nvidia-gpu NVIDIA GPU related labels Apr 15, 2026

inureyes added 6 commits April 15, 2026 20:15

fix(nvidia): address clippy warnings from security fixes

8f4e968

Collapse nested if-let in get_or_fetch_field probe path (collapsible_if) and convert useless format! to .to_string() in control-char test.

inureyes added status:done Completed and removed status:review Under review labels Apr 15, 2026

inureyes merged commit e27c283 into main Apr 15, 2026
2 checks passed

inureyes deleted the feature/issue-132-hw-details branch April 15, 2026 11:23

inureyes mentioned this pull request Apr 16, 2026

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175) #176

Merged

4 tasks

inureyes mentioned this pull request Apr 17, 2026

enhancement: Add env-var gate for hardware detail mock data (ALL_SMI_MOCK_HARDWARE_DETAILS) #178

Closed

inureyes self-assigned this Apr 17, 2026

inureyes mentioned this pull request Apr 17, 2026

feat(tui): topology view tab ('T') — NvLink/NUMA/PCIe graph and matrix #190

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM)#175

feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM)#175
inureyes merged 13 commits into
mainfrom
feature/issue-132-hw-details

inureyes commented Apr 15, 2026

Uh oh!

inureyes commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Apr 15, 2026

Summary

What's in the commits

Metrics produced

Test plan

Design decisions

Files touched

Uh oh!

inureyes commented Apr 15, 2026

PR Finalization Complete

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant