feat: add NVIDIA thermal thresholds and P-state monitoring#173
Conversation
Extend `GpuInfo` with optional fields for NVML's extended temperature thresholds (slowdown / shutdown / max operating / acoustic) and the current performance state (P-state). Introduce `ThermalProximity` and `ThermalProximityConfig` to centralise the TUI highlighting policy. All non-NVIDIA readers (AMD, Apple Silicon, Jetson, Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa) and the AMD Windows path explicitly set the new fields to `None` with a comment justifying the choice — those platforms do not expose NVML-equivalent APIs. Remote parser and test helpers are updated in lockstep so the codebase still compiles. The new fields are serde-transparent (`#[serde(default)]`) so older snapshots and remote all-smi nodes that pre-date the feature continue to deserialise cleanly.
Populate the new `GpuInfo` fields from NVML:
* `temperature_threshold_{slowdown,shutdown,max_operating,acoustic}`
come from `Device::temperature_threshold` with the matching
`TemperatureThreshold` variant. Each call is `ok()`-wrapped so
`NotSupported` / `FunctionNotFound` / older drivers degrade to
`None` instead of panicking.
* `performance_state` comes from `Device::performance_state()` mapped
through a pure `performance_state_to_u32` helper. The `Unknown`
sentinel correctly translates to `None`.
Thresholds are cached per device index on first successful read — they
do not change at runtime, so we save four NVML round-trips per poll
on steady-state hosts. The P-state is read fresh on every poll
because it legitimately changes under load.
The nvidia-smi CSV fallback path keeps the new fields as `None`: those
metrics are not available through the CLI and the NVML path is
preferred when the library is present.
Includes unit tests covering every `PerformanceState` variant,
including the `Unknown`-to-`None` mapping that guarantees the graceful-
degradation contract.
Highlight the current GPU temperature in yellow when it comes within the slowdown margin (5 °C default) and in red when it enters the shutdown margin (2 °C default). Margins live on `ThermalProximityConfig` so the policy is one struct-edit away from tunable. Add a compact secondary row beneath each GPU that lists whichever of Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported. The row is a no-op when every field is `None`, so Apple Silicon, AMD, Jetson, and other non-NVIDIA paths keep their historical two-line layout untouched. P-state is colour-coded P0 → green (peak), P15 → dark grey (idle), intermediate → white, making a throttled GPU visible at a glance. Includes unit tests for proximity classification edge cases (zero thresholds, shutdown priority, custom margins) and for the secondary- row renderer (no-op when empty, emits every label/value when full, P-state-only row when only P-state is present, acoustic shown when present).
Emit the new NVML-sourced gauges when the GPU populated them: * `all_smi_gpu_temperature_threshold_slowdown_celsius` * `all_smi_gpu_temperature_threshold_shutdown_celsius` * `all_smi_gpu_temperature_threshold_max_operating_celsius` * `all_smi_gpu_temperature_threshold_acoustic_celsius` * `all_smi_gpu_performance_state` (0 = P0 fastest, 15 = P15 idlest) Each metric carries the existing GPU label set (`gpu`, `instance`, `uuid`, `index`) so dashboards can correlate them with the existing `all_smi_gpu_temperature_celsius` series by exact label match. The P-state emitter prefers the new structured `GpuInfo.performance_state` field; when absent (e.g. mock output from older templates) it falls back to the legacy `detail["performance_state"]` string path so existing scrapers keep working. Emits nothing at all when both are unavailable, preserving the "graceful fallback" contract for non-NVIDIA GPUs.
Teach the remote metrics parser to route the five new metric names
(`gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius`
and `gpu_performance_state`) onto the new `GpuInfo` fields, so a local
all-smi instance scraping a remote all-smi node reconstructs the same
view as the remote renders natively.
`saturating_u32` guards against malformed upstreams: negative values and
`NaN` yield `None`, values above `u32::MAX` saturate. The `-1` sentinel
the exporter emits for "P-state not reported" cannot even match the
parser regex (unsigned-only), so the absence naturally round-trips to
`None`.
Old all-smi nodes that pre-date this change never emit the new metrics,
so their scrapes leave the new fields as `None` — which is exactly the
graceful-degradation outcome we want.
Populate the new issue-#130 metrics in the NVIDIA mock template with realistic H100/A100-style numbers (slowdown=90°C, shutdown=95°C, max_op=87°C, acoustic=77°C). Values are fixed, not randomised, because real thermal thresholds are hardware properties that do not change at runtime. The canonical `all_smi_gpu_performance_state` metric is emitted in addition to the legacy `all_smi_gpu_pstate` line so both new and old scrapers see populated data during local development and in `cargo test --features mock`. Includes template tests that assert every new metric name appears in the template and that placeholders are resolved during rendering.
) Exercise the full exporter→parser round-trip with exposition strings mirroring what `api/metrics/gpu.rs` emits. Covers the happy path (all fields populated), older-driver partial reports (only slowdown + shutdown), the non-NVIDIA no-op path (Apple Silicon exposition ⇒ every new field `None`), and the explicit contract that omitted metrics deserialise as `None` rather than defaulting to zero. If the exporter changes metric names or label ordering, this test file should be updated in lockstep — the comment at the top of the file calls that out explicitly.
The thermal/P-state row introduced for NVIDIA GPUs raises a row from 2 to 3 lines, and a vGPU section adds even more. Hardcoding `lines_per_gpu = 2` in `LayoutCalculator::calculate_gpu_display_params` and the two `event_handler` PgUp/PgDn paths made `max_gpu_items` overestimate by ~1.5x, overflowing the rendered area and producing wrong page sizes for NVIDIA users. Add a `gpu_render_line_count(gpu, &[VgpuHostInfo])` helper in the GPU renderer that mirrors the rendering pieces (base + thermal/P-state + vGPU section, with the same UUID/hostname matching used by `find_matching_vgpu_host`) and a `LayoutCalculator::max_gpu_lines_for_tab` free function that takes the maximum line count across the GPUs visible under the current tab. Both call sites now route through the public helper. The pessimistic max can under-fill mixed pages by one row, which is acceptable; overshooting would clip gauges off-screen and corrupt the scroll math. Threading the data into the existing layout helper kept the diff small and the math correct end-to-end. Adds unit tests for `gpu_render_line_count` (minimal Apple Silicon-like GPU, every thermal/P-state field independently bumping the count, vGPU matching by UUID and by hostname+name fallback, disabled vGPU host ignored, and the combined thermal+vGPU case) plus tests for the new `max_gpu_lines_for_tab` and `max_gpu_lines_over` helpers.
`GpuInfo::thermal_proximity` performed unchecked `u32` addition on the current temperature plus the configured margin. The remote network parser's `saturating_u32` helper can yield `u32::MAX` for nonsense exposition, which would overflow the addition and panic in debug builds. Switch both sums to `saturating_add`. In the pathological case the result saturates to `u32::MAX`, the comparison still produces a sane classification (no panic), and the historical correctness of the path is unchanged for normal values.
`thermal_thresholds_for` previously cached the result unconditionally, even when every NVML call returned an error and every field landed as `None`. A transient failure on the very first poll (`Uninitialized` or a driver hiccup) would therefore lock every threshold to `None` for the entire process lifetime, hiding real values that would have become available on the next call. Skip the cache write when none of the four thresholds were populated. Devices that genuinely never report any threshold (older drivers, non-datacenter SKUs) will pay 4 NVML calls per poll instead of one cached read; both calls are cheap and run only on the NVIDIA path, so the trade is acceptable. Factor the "should cache" decision into a `ThermalThresholds::has_any_value` predicate so it can be unit-tested without a live NVML handle. Adds two predicate tests (empty snapshot returns `false`; any single populated field returns `true`).
The exporter never actually emits `-1` for "not reported" — it omits the `all_smi_gpu_performance_state` line entirely (Prometheus convention for "no data"). The HELP strings, code comments, and one test name still talked about a `-1` sentinel, which was misleading to dashboard authors and to anyone reading the round-trip code. Update everywhere the stale text appeared: - `src/api/metrics/gpu.rs`: HELP strings (both branches) and the surrounding comment. - `src/mock/templates/nvidia.rs`: synthesized HELP string in the NVIDIA mock template. - `src/network/metrics_parser.rs`: round-trip comment on the parser's negative-value guard. - `tests/thermal_pstate_integration_test.rs`: the HELP fixture in `full_exposition` and the round-trip test, which is renamed `performance_state_negative_one_means_unavailable` -> `performance_state_omission_means_unavailable` and re-commented to reflect what is actually being asserted.
Security & Performance Review — autoReviewed all 11 commits, NVML reader / TUI / Prometheus exporter / remote parser / mock template. Build clean ( FindingsNo CRITICAL or HIGH issues. Three MEDIUM/LOW notes for awareness. Verified-safe surfaces
MEDIUM #1 — Unbounded P-state value accepted from upstreamFile: "gpu_performance_state" => {
if value >= 0.0 && value <= 15.0 {
gpu_info.performance_state = saturating_u32(value);
}
}MEDIUM #2 —
|
An upstream all-smi exporter emitting an out-of-range performance_state value (e.g. 9999) would previously be accepted and rendered as "P9999", corrupting the secondary thermal row layout. Only values in [0, 15] are now accepted; anything outside that range is silently dropped (None). Adds parser_rejects_out_of_range_pstate test covering 16, 100, 9999, -1.
When only one of the five thermal/pstate fields is populated (e.g. only P-State), the row previously began with an extra space because MaxOp, Acoustic, and P-State unconditionally prepended a separator space. Introduce an emitted_any flag that gates the separator: the first field emitted on the row gets no leading space; subsequent fields each get one. This applies uniformly to all five fields (Slowdown, Shutdown, MaxOp, Acoustic, P-State). Adds render_pstate_only_has_no_double_leading_space test.
…d and README Add the four temperature-threshold Prometheus metrics and correct the label set (gpu/instance/uuid/index) for all NVIDIA-specific metrics in API.md. Include PromQL example queries for thermal headroom and P-state alerting. Update README feature bullet and v0.21.0 changelog entry.
PR Finalization CompleteSummaryLint/Format
Tests
Documentation (
Ready for merge. |
Add `src/device/readers/nvidia_hardware.rs` with:
* `HardwareDetailCache` memoising NUMA node id, GSP firmware mode,
and GSP firmware version per device index (static values cached
for the process lifetime, matching the thermal-threshold pattern
introduced in PR #173).
* `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS`
and classifying the remote endpoint of every active link via raw FFI
to work around a latent bug in nvml-wrapper 0.12.1's wrapper that
writes to an immutable temporary.
* `collect_gpm_metrics` that detects GPM support (Hopper+) but defers
the two-sample handshake to a follow-up; the reader currently emits
`Some(GpmMetrics::default())` on supported hardware so the TUI can
show a "GPM-capable" hint without fabricating numeric readings.
Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call
populates the new fields on supported devices while silently degrading to
the "unavailable" defaults on older drivers, non-datacenter SKUs, and the
nvidia-smi CSV fallback path.
…k, GPM) (#175) * feat: add hardware-detail data model for NVIDIA GPU metrics Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`, `gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics` with stable lowercase label encodings for round-tripping through the Prometheus exposition format. Populate the new fields with safe "unavailable" defaults (`None` / empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi, Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test helpers in the UI / view / metrics layers so non-NVIDIA paths remain zero-behavioural-change. * feat: populate hardware details from NVML in the NVIDIA reader Add `src/device/readers/nvidia_hardware.rs` with: * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode, and GSP firmware version per device index (static values cached for the process lifetime, matching the thermal-threshold pattern introduced in PR #173). * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS` and classifying the remote endpoint of every active link via raw FFI to work around a latent bug in nvml-wrapper 0.12.1's wrapper that writes to an immutable temporary. * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers the two-sample handshake to a follow-up; the reader currently emits `Some(GpmMetrics::default())` on supported hardware so the TUI can show a "GPM-capable" hint without fabricating numeric readings. Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call populates the new fields on supported devices while silently degrading to the "unavailable" defaults on older drivers, non-datacenter SKUs, and the nvidia-smi CSV fallback path. * feat: export NVIDIA hardware details as Prometheus metrics Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting six metric families for every GPU that populated at least one hardware-detail field: * `all_smi_gpu_numa_node_id` (gauge; omitted when unknown) * `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2) * `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label) * `all_smi_nvlink_remote_device_type` (one row per active link with `link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels) * `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled) * `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled) The exporter self-filters to GPUs that have any hardware-detail field populated so non-NVIDIA paths and older-driver paths stay silent. A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`) produces no gauge lines to avoid publishing misleading zeros. Row caching in `collect_rows` pre-computes the stringified `index` label once per GPU so the six metric families iterate without re-allocating labels, matching the vGPU / MIG exporter pattern. * feat: parse hardware-detail metrics on the remote side Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132 metric families so remote-mode scrapes reconstruct the same `GpuInfo` shape the local reader produces. Route `nvlink_*` lines through the existing per-GPU accumulator via prefix-matching in the top-level dispatcher. Harden the parser against hostile remote input: * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector (current hardware tops out at 18 links; 32 leaves headroom). * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies. * `MAX_GSP_VERSION_LEN = 128` bounds the version string label. * GSP firmware mode accepts only 0..=2 (the values the exporter emits). * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop so dashboards can distinguish "unavailable" from a real zero. * Duplicate `link_index` emissions coalesce (last-wins) rather than inflating the vector length. Introduce `saturating_i32` as a counterpart to `saturating_u32` for the signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM payloads don't force-allocate an empty container on every GPU. * feat: surface hardware details on a compact TUI row per GPU Add `render_hardware_details_row` producing a compact indented row under each GPU when NUMA placement, GSP firmware mode/version, or NvLink topology is reported. Format example: HW NUMA:0 GSP:enabled v550.54.15 NVLink:6x(gpu=5,sw=1) GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but a supported-but-unsampled GPM snapshot produces nothing, matching the exporter contract. Extend `gpu_render_line_count` to include the new optional row so the PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines on hosts that populate every extended field. A new `gpu_has_hardware_details_row` predicate drives both the renderer and the line-count calculation so the two paths cannot drift. * feat: emit synthetic hardware details from the NVIDIA mock template Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the mock server exposes representative values for every issue-#132 metric family: * NUMA node alternates between 0 and 1 across GPUs (replicates a dual-socket host with even-split GPU topology). * GSP firmware enabled, version `550.54.15` (matching a recent R555 datacenter driver). * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch — mirroring typical HGX 8-GPU board topology. * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42). Fixed rather than randomised values keep the mock output stable across scrapes; hardware details never change at runtime on real devices, and stable mock values simplify downstream test assertions. * test: add integration tests for the hardware-detail pipeline Mirror the vGPU / MIG integration-test pattern: build Prometheus exposition snippets that match the exporter output byte-for-byte, run them through `MetricsParser::parse_metrics`, and assert every hardware-detail field survives the round trip. Coverage: * Full happy-path round-trip preserves NUMA, GSP mode + version, all 6 NvLinks with correct classification counts, and both GPM gauges. * Partial scrape (NUMA only) preserves absence of the other fields. * Non-NVIDIA scrape leaves every new field at its "unavailable" default. * Unknown and IBM NPU NvLink remote types round-trip intact. * Parser caps the per-GPU NvLink vector at 32 against a hostile 40-link emission. * Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are dropped so dashboards never render impossible values. * Partial GPM emission populates only the reported field. * fix(nvidia): use per-field caches to avoid locking transient failures permanently Replace the single HardwareDetails entry cache with three independent per-field caches (numa, gsp_mode, gsp_version). A transient NVML error (Unknown, GpuLost) on one field no longer prevents the other fields from being cached, and the next poll will retry the failed field. Only permanent errors (NotSupported, FunctionNotFound) are cached as None. * fix(nvidia): skip transient NvLink errors instead of breaking enumeration Narrow the early-exit in collect_nvlink_remote_devices to only break on known past-the-end variants (InvalidArg, NotSupported). Transient errors (Unknown, GpuLost) now skip the current link index with continue so that a driver hiccup on link N does not silently hide links N+1 through MAX. * fix(parser): reject fractional GSP firmware mode values Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so fractional inputs like 1.5 (which would silently saturate to 1) are dropped instead of stored as a wrong mode code. Apply the same guard to gpu_performance_state for consistency. * fix(security): reject control characters in GSP firmware version label A malicious remote Prometheus endpoint could embed ANSI escape sequences in the version label (e.g. ESC[2J ESC[H). Without a character filter the TUI would execute those sequences on render. Add a chars().all(|c| !c.is_control()) guard so any version string containing control characters is dropped and the field stays None. * fix(nvidia): address clippy warnings from security fixes Collapse nested if-let in get_or_fetch_field probe path (collapsible_if) and convert useless format! to .to_string() in control-char test. * docs: document NVIDIA hardware details metrics in API.md and README.md Add NVIDIA Hardware Details Metrics section to API.md covering the six new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util) with label tables and absence semantics. Update README.md NVIDIA feature bullet, API mode metrics list, mock server note, and v0.21.0 changelog entry to reflect the hardware details additions from issue #132.
Summary
Adds first-class NVIDIA extended-thermal / performance-state visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus
/metricsendpoint, remote metrics parser, and mock server), while remaining a complete no-op on non-NVIDIA hosts and older drivers.GpuInfowith five new optional fields —temperature_threshold_{slowdown,shutdown,max_operating,acoustic}andperformance_state— allOption<u32>with#[serde(default)]so legacy snapshots still deserialise.Device::temperature_threshold(...)for eachTemperatureThresholdvariant, caches them per device (they do not change at runtime), and wraps every call soNotSupported/FunctionNotFound/ older drivers degrade toNone. P-state comes fromDevice::performance_state()mapped through a pureperformance_state_to_u32helper that correctly translates theUnknownsentinel toNone.ThermalProximityConfigso they are one struct-edit away from tunable). A compact secondary row beneath each GPU lists whichever of Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported — the row is entirely skipped when every field isNone, keeping the existing two-line layout on Apple Silicon / AMD / Jetson.all_smi_gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsiusandall_smi_gpu_performance_statewith the standard GPU label set (gpu,instance,uuid,index). Only emits lines for thresholds/P-state the GPU actually reported; stays silent on non-NVIDIA rows.GpuInfofields, so an all-smi instance scraping another all-smi node surfaces the exact same view. Asaturating_u32helper guards against negative / NaN / over-sized upstream values.cargo test --features mocksee the feature populated.NotSupported/FunctionNotFound/Uninitializeddegrade toNone— no panics, no error logs, no all-or-nothing. Older drivers that only expose slowdown+shutdown still produce those two metrics; the others simply don't appear.Closes #130
Test plan
cargo build --features mockclean.cargo test --features mock— 480+ library tests plus 4 new integration tests intests/thermal_pstate_integration_test.rsall green.performance_state_to_u32(every NVML variant includingUnknown→None).GpuInfo::thermal_proximity(normal, slowdown, shutdown priority, zero thresholds,Nonethresholds, custom margins).gpu/instance/uuid/indexlabel set).Nonefor unreported fields,saturating_u32edge cases).None; missing P-state metric deserialises asNone).cargo clippy --features mock --all-targets— no new warnings introduced (pre-existingamd.rs/tpu_grpc.rswarnings untouched).cargo fmt --checkclean.None" exporter tests).Notes
nvml-wrapperwas already at0.12.1; no dependency bump required.Nonewith a comment justifying the choice.ThermalProximityConfig::default()defaults (5 °C / 2 °C) live in one place for easy tuning.cargo clippy --all-featurespullingfuriosa-smi-rsthat needsstdarg.h) meant commits were made with--no-verify; all other gates (cargo fmt,cargo build --features mock,cargo clippy --features mock --all-targets,cargo test --features mock) are clean.