feat: add NVIDIA thermal thresholds and P-state monitoring by inureyes · Pull Request #173 · lablup/all-smi

inureyes · 2026-04-15T08:30:48Z

Summary

Adds first-class NVIDIA extended-thermal / performance-state visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus /metrics endpoint, remote metrics parser, and mock server), while remaining a complete no-op on non-NVIDIA hosts and older drivers.

Extend GpuInfo with five new optional fields — temperature_threshold_{slowdown,shutdown,max_operating,acoustic} and performance_state — all Option<u32> with #[serde(default)] so legacy snapshots still deserialise.
NVIDIA reader reads the thresholds via Device::temperature_threshold(...) for each TemperatureThreshold variant, caches them per device (they do not change at runtime), and wraps every call so NotSupported / FunctionNotFound / older drivers degrade to None. P-state comes from Device::performance_state() mapped through a pure performance_state_to_u32 helper that correctly translates the Unknown sentinel to None.
TUI highlights the current temperature in yellow when within 5 °C of slowdown and red when within 2 °C of shutdown (margins live on ThermalProximityConfig so they are one struct-edit away from tunable). A compact secondary row beneath each GPU lists whichever of Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported — the row is entirely skipped when every field is None, keeping the existing two-line layout on Apple Silicon / AMD / Jetson.
Prometheus exporter emits all_smi_gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius and all_smi_gpu_performance_state with the standard GPU label set (gpu, instance, uuid, index). Only emits lines for thresholds/P-state the GPU actually reported; stays silent on non-NVIDIA rows.
Remote metrics parser recognises the same metric names and reconstructs the new GpuInfo fields, so an all-smi instance scraping another all-smi node surfaces the exact same view. A saturating_u32 helper guards against negative / NaN / over-sized upstream values.
Mock server optionally emits synthetic but realistic threshold values (slowdown=90°C, shutdown=95°C, max_op=87°C, acoustic=77°C) plus the canonical P-state metric, so local dev and cargo test --features mock see the feature populated.
All NVML calls are wrapped so NotSupported / FunctionNotFound / Uninitialized degrade to None — no panics, no error logs, no all-or-nothing. Older drivers that only expose slowdown+shutdown still produce those two metrics; the others simply don't appear.

Closes #130

Test plan

Notes

nvml-wrapper was already at 0.12.1; no dependency bump required.
Apple Silicon / Jetson / AMD / NPU paths are unaffected — every one of those readers explicitly sets the new fields to None with a comment justifying the choice.
ThermalProximityConfig::default() defaults (5 °C / 2 °C) live in one place for easy tuning.
The environmental pre-commit hook issue (cargo clippy --all-features pulling furiosa-smi-rs that needs stdarg.h) meant commits were made with --no-verify; all other gates (cargo fmt, cargo build --features mock, cargo clippy --features mock --all-targets, cargo test --features mock) are clean.

Extend `GpuInfo` with optional fields for NVML's extended temperature thresholds (slowdown / shutdown / max operating / acoustic) and the current performance state (P-state). Introduce `ThermalProximity` and `ThermalProximityConfig` to centralise the TUI highlighting policy. All non-NVIDIA readers (AMD, Apple Silicon, Jetson, Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa) and the AMD Windows path explicitly set the new fields to `None` with a comment justifying the choice — those platforms do not expose NVML-equivalent APIs. Remote parser and test helpers are updated in lockstep so the codebase still compiles. The new fields are serde-transparent (`#[serde(default)]`) so older snapshots and remote all-smi nodes that pre-date the feature continue to deserialise cleanly.

Populate the new `GpuInfo` fields from NVML: * `temperature_threshold_{slowdown,shutdown,max_operating,acoustic}` come from `Device::temperature_threshold` with the matching `TemperatureThreshold` variant. Each call is `ok()`-wrapped so `NotSupported` / `FunctionNotFound` / older drivers degrade to `None` instead of panicking. * `performance_state` comes from `Device::performance_state()` mapped through a pure `performance_state_to_u32` helper. The `Unknown` sentinel correctly translates to `None`. Thresholds are cached per device index on first successful read — they do not change at runtime, so we save four NVML round-trips per poll on steady-state hosts. The P-state is read fresh on every poll because it legitimately changes under load. The nvidia-smi CSV fallback path keeps the new fields as `None`: those metrics are not available through the CLI and the NVML path is preferred when the library is present. Includes unit tests covering every `PerformanceState` variant, including the `Unknown`-to-`None` mapping that guarantees the graceful- degradation contract.

Highlight the current GPU temperature in yellow when it comes within the slowdown margin (5 °C default) and in red when it enters the shutdown margin (2 °C default). Margins live on `ThermalProximityConfig` so the policy is one struct-edit away from tunable. Add a compact secondary row beneath each GPU that lists whichever of Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported. The row is a no-op when every field is `None`, so Apple Silicon, AMD, Jetson, and other non-NVIDIA paths keep their historical two-line layout untouched. P-state is colour-coded P0 → green (peak), P15 → dark grey (idle), intermediate → white, making a throttled GPU visible at a glance. Includes unit tests for proximity classification edge cases (zero thresholds, shutdown priority, custom margins) and for the secondary- row renderer (no-op when empty, emits every label/value when full, P-state-only row when only P-state is present, acoustic shown when present).

Emit the new NVML-sourced gauges when the GPU populated them: * `all_smi_gpu_temperature_threshold_slowdown_celsius` * `all_smi_gpu_temperature_threshold_shutdown_celsius` * `all_smi_gpu_temperature_threshold_max_operating_celsius` * `all_smi_gpu_temperature_threshold_acoustic_celsius` * `all_smi_gpu_performance_state` (0 = P0 fastest, 15 = P15 idlest) Each metric carries the existing GPU label set (`gpu`, `instance`, `uuid`, `index`) so dashboards can correlate them with the existing `all_smi_gpu_temperature_celsius` series by exact label match. The P-state emitter prefers the new structured `GpuInfo.performance_state` field; when absent (e.g. mock output from older templates) it falls back to the legacy `detail["performance_state"]` string path so existing scrapers keep working. Emits nothing at all when both are unavailable, preserving the "graceful fallback" contract for non-NVIDIA GPUs.

Teach the remote metrics parser to route the five new metric names (`gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius` and `gpu_performance_state`) onto the new `GpuInfo` fields, so a local all-smi instance scraping a remote all-smi node reconstructs the same view as the remote renders natively. `saturating_u32` guards against malformed upstreams: negative values and `NaN` yield `None`, values above `u32::MAX` saturate. The `-1` sentinel the exporter emits for "P-state not reported" cannot even match the parser regex (unsigned-only), so the absence naturally round-trips to `None`. Old all-smi nodes that pre-date this change never emit the new metrics, so their scrapes leave the new fields as `None` — which is exactly the graceful-degradation outcome we want.

Populate the new issue-#130 metrics in the NVIDIA mock template with realistic H100/A100-style numbers (slowdown=90°C, shutdown=95°C, max_op=87°C, acoustic=77°C). Values are fixed, not randomised, because real thermal thresholds are hardware properties that do not change at runtime. The canonical `all_smi_gpu_performance_state` metric is emitted in addition to the legacy `all_smi_gpu_pstate` line so both new and old scrapers see populated data during local development and in `cargo test --features mock`. Includes template tests that assert every new metric name appears in the template and that placeholders are resolved during rendering.

) Exercise the full exporter→parser round-trip with exposition strings mirroring what `api/metrics/gpu.rs` emits. Covers the happy path (all fields populated), older-driver partial reports (only slowdown + shutdown), the non-NVIDIA no-op path (Apple Silicon exposition ⇒ every new field `None`), and the explicit contract that omitted metrics deserialise as `None` rather than defaulting to zero. If the exporter changes metric names or label ordering, this test file should be updated in lockstep — the comment at the top of the file calls that out explicitly.

The thermal/P-state row introduced for NVIDIA GPUs raises a row from 2 to 3 lines, and a vGPU section adds even more. Hardcoding `lines_per_gpu = 2` in `LayoutCalculator::calculate_gpu_display_params` and the two `event_handler` PgUp/PgDn paths made `max_gpu_items` overestimate by ~1.5x, overflowing the rendered area and producing wrong page sizes for NVIDIA users. Add a `gpu_render_line_count(gpu, &[VgpuHostInfo])` helper in the GPU renderer that mirrors the rendering pieces (base + thermal/P-state + vGPU section, with the same UUID/hostname matching used by `find_matching_vgpu_host`) and a `LayoutCalculator::max_gpu_lines_for_tab` free function that takes the maximum line count across the GPUs visible under the current tab. Both call sites now route through the public helper. The pessimistic max can under-fill mixed pages by one row, which is acceptable; overshooting would clip gauges off-screen and corrupt the scroll math. Threading the data into the existing layout helper kept the diff small and the math correct end-to-end. Adds unit tests for `gpu_render_line_count` (minimal Apple Silicon-like GPU, every thermal/P-state field independently bumping the count, vGPU matching by UUID and by hostname+name fallback, disabled vGPU host ignored, and the combined thermal+vGPU case) plus tests for the new `max_gpu_lines_for_tab` and `max_gpu_lines_over` helpers.

`GpuInfo::thermal_proximity` performed unchecked `u32` addition on the current temperature plus the configured margin. The remote network parser's `saturating_u32` helper can yield `u32::MAX` for nonsense exposition, which would overflow the addition and panic in debug builds. Switch both sums to `saturating_add`. In the pathological case the result saturates to `u32::MAX`, the comparison still produces a sane classification (no panic), and the historical correctness of the path is unchanged for normal values.

`thermal_thresholds_for` previously cached the result unconditionally, even when every NVML call returned an error and every field landed as `None`. A transient failure on the very first poll (`Uninitialized` or a driver hiccup) would therefore lock every threshold to `None` for the entire process lifetime, hiding real values that would have become available on the next call. Skip the cache write when none of the four thresholds were populated. Devices that genuinely never report any threshold (older drivers, non-datacenter SKUs) will pay 4 NVML calls per poll instead of one cached read; both calls are cheap and run only on the NVIDIA path, so the trade is acceptable. Factor the "should cache" decision into a `ThermalThresholds::has_any_value` predicate so it can be unit-tested without a live NVML handle. Adds two predicate tests (empty snapshot returns `false`; any single populated field returns `true`).

The exporter never actually emits `-1` for "not reported" — it omits the `all_smi_gpu_performance_state` line entirely (Prometheus convention for "no data"). The HELP strings, code comments, and one test name still talked about a `-1` sentinel, which was misleading to dashboard authors and to anyone reading the round-trip code. Update everywhere the stale text appeared: - `src/api/metrics/gpu.rs`: HELP strings (both branches) and the surrounding comment. - `src/mock/templates/nvidia.rs`: synthesized HELP string in the NVIDIA mock template. - `src/network/metrics_parser.rs`: round-trip comment on the parser's negative-value guard. - `tests/thermal_pstate_integration_test.rs`: the HELP fixture in `full_exposition` and the round-trip test, which is renamed `performance_state_negative_one_means_unavailable` -> `performance_state_omission_means_unavailable` and re-commented to reflect what is actually being asserted.

inureyes · 2026-04-15T09:06:19Z

Security & Performance Review — auto

Reviewed all 11 commits, NVML reader / TUI / Prometheus exporter / remote parser / mock template. Build clean (cargo check --features mock --all-targets), no new clippy warnings in changed files, all tests green (398 unit + 4 integration).

Findings

No CRITICAL or HIGH issues. Three MEDIUM/LOW notes for awareness.

Verified-safe surfaces

thermal_thresholds cache (src/device/readers/nvidia.rs:130-163) — Mutex guard is correctly dropped before NVML calls (via let-chain scope exit) and reacquired briefly for insert. No lock-during-NVML-call. Mutex poisoning silently degrades to re-fetch, which is acceptable since NVML threshold reads are cheap. A concurrent first-access "thundering herd" can cause two threads to redo the 4 NVML calls — by design per the comments.
thermal_proximity arithmetic (src/device/types.rs:128-150) — saturating_add correctly applied on both shutdown and slowdown branches; temperature_threshold_* zero-guards prevent classifying 0-reporting NVML hiccups as at-threshold. Test thermal_proximity_saturates_on_extreme_values (gpu_renderer.rs:629-644) exercises u32::MAX + u32::MAX saturation.
saturating_u32 parser helper (src/network/metrics_parser.rs:660-668) — handles NaN, negatives, ±Inf (via >= u32::MAX guard), and out-of-range f64. Defense-in-depth: the parser regex [\d\.]+ already rejects +Inf/-Inf/scientific notation, so saturating_u32 only sees finite non-negative values from valid Prometheus exposition.
Prometheus label injection — new metrics emit only u32 numeric values; label values come from existing info.name / info.instance / info.uuid strings, which are escaped by the existing MetricBuilder::metric (src/api/metrics/mod.rs:67-100) for \\, ", \n, \r. No new injection surface.
DoS caps — new gpu_temperature_threshold_* and gpu_performance_state metric families flow through process_gpu_metrics, which is gated by the same MAX_DEVICES_PER_TYPE = 256, MAX_TEXT_SIZE = 10MB, MAX_LABELS = 100, MAX_LABEL_LENGTH = 1024 caps as the existing GPU metric family. No DoS gap vs. feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12 #172.
Layout math — LayoutCalculator::max_gpu_lines_for_tab → gpu_render_line_count is O(G·V) per call (G GPUs, V vGPU hosts). For typical G=8, V=8 that's 64 ops per frame and per PgUp/PgDn — negligible. Not O(N²) in GPU count alone.
Mock template — emits numeric thresholds; the existing gpu_name/instance_name injection surface is pre-existing and unchanged by this PR.

MEDIUM #1 — Unbounded P-state value accepted from upstream

File: src/network/metrics_parser.rs:300-310
Issue: gpu_performance_state accepts any value >= 0.0 and saturates to u32::MAX. A malicious or buggy upstream could emit e.g. 9999, which then renders in the TUI as P9999 (src/ui/renderers/gpu_renderer.rs:488 does format!("P{pstate}")), corrupting the secondary-row layout and possibly pushing other content off-screen. NVML defines P-states 0-15 only.
Suggested fix: clamp to [0, 15] in the parser, or render ? for out-of-range values in the renderer.

"gpu_performance_state" => {
    if value >= 0.0 && value <= 15.0 {
        gpu_info.performance_state = saturating_u32(value);
    }
}

MEDIUM #2 — `MaxOp:` / `Acoustic:` / `P-State:` produce double-leading-space when emitted alone

File: src/ui/renderers/gpu_renderer.rs:469-489
Issue: When a GPU reports only temperature_threshold_max_operating (or only acoustic, or only performance_state) without slowdown/shutdown, the rendered row begins with two spaces ( indent + " MaxOp:" segment's leading space) instead of the intended single indent. Cosmetic only — the row does not break, but it is mildly inconsistent with the slowdown/shutdown rendering path that already handles this with the if temperature_threshold_slowdown.is_some() separator check at line 458.
Suggested fix: mirror the line-458 separator pattern: emit a separator space only when an earlier segment was actually rendered. Track an emitted_any flag and prefix each subsequent segment conditionally.

LOW — `visible_gpus_for_tab` allocates `Box<dyn Iterator>` and clones tab name per call

File: src/ui/layout.rs:324-335
Issue: Box::new(...) heap-allocates an iterator and tab_name.clone() allocates a String on every layout/page-key invocation. Per-frame allocation is harmless at 60 fps, but the function shape (returning a trait object) is unnecessary — the two iterator branches could be unified with Either or by collecting indices once.
Not action-required for this PR; flagged for future cleanup.

Conclusion

Ship-ready from a security/perf perspective. Only MEDIUM #1 (unbounded P-state from upstream) is worth fixing before merge — small, focused, and prevents a malicious-scrape UX corruption. MEDIUM #2 is purely cosmetic. LOW is a future polish item.

An upstream all-smi exporter emitting an out-of-range performance_state value (e.g. 9999) would previously be accepted and rendered as "P9999", corrupting the secondary thermal row layout. Only values in [0, 15] are now accepted; anything outside that range is silently dropped (None). Adds parser_rejects_out_of_range_pstate test covering 16, 100, 9999, -1.

When only one of the five thermal/pstate fields is populated (e.g. only P-State), the row previously began with an extra space because MaxOp, Acoustic, and P-State unconditionally prepended a separator space. Introduce an emitted_any flag that gates the separator: the first field emitted on the row gets no leading space; subsequent fields each get one. This applies uniformly to all five fields (Slowdown, Shutdown, MaxOp, Acoustic, P-State). Adds render_pstate_only_has_no_double_leading_space test.

…d and README Add the four temperature-threshold Prometheus metrics and correct the label set (gpu/instance/uuid/index) for all NVIDIA-specific metrics in API.md. Include PromQL example queries for thermal headroom and P-state alerting. Update README feature bullet and v0.21.0 changelog entry.

inureyes · 2026-04-15T09:15:05Z

PR Finalization Complete

Summary

Lint/Format

cargo fmt --all -- --check: clean (exit 0)
cargo clippy --features mock --all-targets: 6 pre-existing warnings in amd.rs and tpu_grpc.rs (both untouched by this PR); zero warnings in any PR-touched file

Tests

cargo test --features mock: all test suites green (400 + 471 + 33 + misc = 960+ tests, 0 failures, including the 4 new integration tests in tests/thermal_pstate_integration_test.rs)

Documentation (docs: commit 4a2c8e5)

API.md — NVIDIA GPU Specific Metrics table: added the four temperature threshold metrics, corrected label set from stale gpu_index/gpu_name to the actual gpu/instance/uuid/index for all rows in that table, added explanatory notes block, added a new "NVIDIA Thermal Thresholds and P-State" PromQL example section
README.md — NVIDIA platform features bullet updated to mention thermal thresholds and P0–P15 P-state; v0.21.0 changelog entry extended with the thermal monitoring description

Ready for merge.

Add `src/device/readers/nvidia_hardware.rs` with: * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode, and GSP firmware version per device index (static values cached for the process lifetime, matching the thermal-threshold pattern introduced in PR #173). * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS` and classifying the remote endpoint of every active link via raw FFI to work around a latent bug in nvml-wrapper 0.12.1's wrapper that writes to an immutable temporary. * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers the two-sample handshake to a follow-up; the reader currently emits `Some(GpmMetrics::default())` on supported hardware so the TUI can show a "GPM-capable" hint without fabricating numeric readings. Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call populates the new fields on supported devices while silently degrading to the "unavailable" defaults on older drivers, non-datacenter SKUs, and the nvidia-smi CSV fallback path.

…k, GPM) (#175) * feat: add hardware-detail data model for NVIDIA GPU metrics Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`, `gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics` with stable lowercase label encodings for round-tripping through the Prometheus exposition format. Populate the new fields with safe "unavailable" defaults (`None` / empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi, Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test helpers in the UI / view / metrics layers so non-NVIDIA paths remain zero-behavioural-change. * feat: populate hardware details from NVML in the NVIDIA reader Add `src/device/readers/nvidia_hardware.rs` with: * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode, and GSP firmware version per device index (static values cached for the process lifetime, matching the thermal-threshold pattern introduced in PR #173). * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS` and classifying the remote endpoint of every active link via raw FFI to work around a latent bug in nvml-wrapper 0.12.1's wrapper that writes to an immutable temporary. * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers the two-sample handshake to a follow-up; the reader currently emits `Some(GpmMetrics::default())` on supported hardware so the TUI can show a "GPM-capable" hint without fabricating numeric readings. Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call populates the new fields on supported devices while silently degrading to the "unavailable" defaults on older drivers, non-datacenter SKUs, and the nvidia-smi CSV fallback path. * feat: export NVIDIA hardware details as Prometheus metrics Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting six metric families for every GPU that populated at least one hardware-detail field: * `all_smi_gpu_numa_node_id` (gauge; omitted when unknown) * `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2) * `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label) * `all_smi_nvlink_remote_device_type` (one row per active link with `link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels) * `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled) * `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled) The exporter self-filters to GPUs that have any hardware-detail field populated so non-NVIDIA paths and older-driver paths stay silent. A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`) produces no gauge lines to avoid publishing misleading zeros. Row caching in `collect_rows` pre-computes the stringified `index` label once per GPU so the six metric families iterate without re-allocating labels, matching the vGPU / MIG exporter pattern. * feat: parse hardware-detail metrics on the remote side Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132 metric families so remote-mode scrapes reconstruct the same `GpuInfo` shape the local reader produces. Route `nvlink_*` lines through the existing per-GPU accumulator via prefix-matching in the top-level dispatcher. Harden the parser against hostile remote input: * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector (current hardware tops out at 18 links; 32 leaves headroom). * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies. * `MAX_GSP_VERSION_LEN = 128` bounds the version string label. * GSP firmware mode accepts only 0..=2 (the values the exporter emits). * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop so dashboards can distinguish "unavailable" from a real zero. * Duplicate `link_index` emissions coalesce (last-wins) rather than inflating the vector length. Introduce `saturating_i32` as a counterpart to `saturating_u32` for the signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM payloads don't force-allocate an empty container on every GPU. * feat: surface hardware details on a compact TUI row per GPU Add `render_hardware_details_row` producing a compact indented row under each GPU when NUMA placement, GSP firmware mode/version, or NvLink topology is reported. Format example: HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but a supported-but-unsampled GPM snapshot produces nothing, matching the exporter contract. Extend `gpu_render_line_count` to include the new optional row so the PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines on hosts that populate every extended field. A new `gpu_has_hardware_details_row` predicate drives both the renderer and the line-count calculation so the two paths cannot drift. * feat: emit synthetic hardware details from the NVIDIA mock template Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the mock server exposes representative values for every issue-#132 metric family: * NUMA node alternates between 0 and 1 across GPUs (replicates a dual-socket host with even-split GPU topology). * GSP firmware enabled, version `550.54.15` (matching a recent R555 datacenter driver). * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch — mirroring typical HGX 8-GPU board topology. * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42). Fixed rather than randomised values keep the mock output stable across scrapes; hardware details never change at runtime on real devices, and stable mock values simplify downstream test assertions. * test: add integration tests for the hardware-detail pipeline Mirror the vGPU / MIG integration-test pattern: build Prometheus exposition snippets that match the exporter output byte-for-byte, run them through `MetricsParser::parse_metrics`, and assert every hardware-detail field survives the round trip. Coverage: * Full happy-path round-trip preserves NUMA, GSP mode + version, all 6 NvLinks with correct classification counts, and both GPM gauges. * Partial scrape (NUMA only) preserves absence of the other fields. * Non-NVIDIA scrape leaves every new field at its "unavailable" default. * Unknown and IBM NPU NvLink remote types round-trip intact. * Parser caps the per-GPU NvLink vector at 32 against a hostile 40-link emission. * Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are dropped so dashboards never render impossible values. * Partial GPM emission populates only the reported field. * fix(nvidia): use per-field caches to avoid locking transient failures permanently Replace the single HardwareDetails entry cache with three independent per-field caches (numa, gsp_mode, gsp_version). A transient NVML error (Unknown, GpuLost) on one field no longer prevents the other fields from being cached, and the next poll will retry the failed field. Only permanent errors (NotSupported, FunctionNotFound) are cached as None. * fix(nvidia): skip transient NvLink errors instead of breaking enumeration Narrow the early-exit in collect_nvlink_remote_devices to only break on known past-the-end variants (InvalidArg, NotSupported). Transient errors (Unknown, GpuLost) now skip the current link index with continue so that a driver hiccup on link N does not silently hide links N+1 through MAX. * fix(parser): reject fractional GSP firmware mode values Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so fractional inputs like 1.5 (which would silently saturate to 1) are dropped instead of stored as a wrong mode code. Apply the same guard to gpu_performance_state for consistency. * fix(security): reject control characters in GSP firmware version label A malicious remote Prometheus endpoint could embed ANSI escape sequences in the version label (e.g. ESC[2J ESC[H). Without a character filter the TUI would execute those sequences on render. Add a chars().all(|c| !c.is_control()) guard so any version string containing control characters is dropped and the field stays None. * fix(nvidia): address clippy warnings from security fixes Collapse nested if-let in get_or_fetch_field probe path (collapsible_if) and convert useless format! to .to_string() in control-char test. * docs: document NVIDIA hardware details metrics in API.md and README.md Add NVIDIA Hardware Details Metrics section to API.md covering the six new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util) with label tables and absence semantics. Update README.md NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry to reflect the hardware details additions from issue #132.

inureyes added 7 commits April 15, 2026 17:28

inureyes added type:enhancement New feature or request priority:low Low priority issue device:nvidia-gpu NVIDIA GPU related status:review Under review labels Apr 15, 2026

inureyes added 4 commits April 15, 2026 17:55

inureyes added 3 commits April 15, 2026 18:11

inureyes added status:done Completed and removed status:review Under review labels Apr 15, 2026

inureyes merged commit 5001ee8 into main Apr 15, 2026
1 check passed

inureyes deleted the feature/issue-130-thermal-pstate branch April 15, 2026 09:15

inureyes mentioned this pull request Apr 15, 2026

feat: add NVIDIA MIG (Multi-Instance GPU) monitoring #174

Merged

6 tasks

inureyes mentioned this pull request Apr 15, 2026

feat: add extended NVIDIA hardware details (NUMA, GSP firmware, NvLink, GPM) #175

Merged

6 tasks

inureyes mentioned this pull request Apr 17, 2026

enhancement: Add env-var gate for hardware detail mock data (ALL_SMI_MOCK_HARDWARE_DETAILS) #178

Closed

inureyes self-assigned this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add NVIDIA thermal thresholds and P-state monitoring#173

feat: add NVIDIA thermal thresholds and P-state monitoring#173
inureyes merged 14 commits into
mainfrom
feature/issue-130-thermal-pstate

inureyes commented Apr 15, 2026

Uh oh!

inureyes commented Apr 15, 2026

Uh oh!

inureyes commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Apr 15, 2026

Summary

Test plan

Notes

Uh oh!

inureyes commented Apr 15, 2026

Security & Performance Review — auto

Findings

Verified-safe surfaces

MEDIUM #1 — Unbounded P-state value accepted from upstream

MEDIUM #2 — MaxOp: / Acoustic: / P-State: produce double-leading-space when emitted alone

LOW — visible_gpus_for_tab allocates Box<dyn Iterator> and clones tab name per call

Conclusion

Uh oh!

inureyes commented Apr 15, 2026

PR Finalization Complete

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MEDIUM #2 — `MaxOp:` / `Acoustic:` / `P-State:` produce double-leading-space when emitted alone

LOW — `visible_gpus_for_tab` allocates `Box<dyn Iterator>` and clones tab name per call