feat: add NVIDIA thermal thresholds and P-state monitoring #173

Merged
inureyes merged 14 commits into main from feature/issue-130-thermal-pstate
Apr 15, 2026

Conversation

@inureyes

Member

Summary

Adds first-class NVIDIA extended-thermal / performance-state visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus /metrics endpoint, remote metrics parser, and mock server), while remaining a complete no-op on non-NVIDIA hosts and older drivers.

  • Extend GpuInfo with five new optional fields — temperature_threshold_{slowdown,shutdown,max_operating,acoustic} and performance_state — all Option<u32> with #[serde(default)] so legacy snapshots still deserialise.
  • NVIDIA reader reads the thresholds via Device::temperature_threshold(...) for each TemperatureThreshold variant, caches them per device (they do not change at runtime), and wraps every call so NotSupported / FunctionNotFound / older drivers degrade to None. P-state comes from Device::performance_state() mapped through a pure performance_state_to_u32 helper that correctly translates the Unknown sentinel to None.
  • TUI highlights the current temperature in yellow when within 5 °C of slowdown and red when within 2 °C of shutdown (margins live on ThermalProximityConfig so they are one struct-edit away from tunable). A compact secondary row beneath each GPU lists whichever of Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported — the row is entirely skipped when every field is None, keeping the existing two-line layout on Apple Silicon / AMD / Jetson.
  • Prometheus exporter emits all_smi_gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius and all_smi_gpu_performance_state with the standard GPU label set (gpu, instance, uuid, index). Only emits lines for thresholds/P-state the GPU actually reported; stays silent on non-NVIDIA rows.
  • Remote metrics parser recognises the same metric names and reconstructs the new GpuInfo fields, so an all-smi instance scraping another all-smi node surfaces the exact same view. A saturating_u32 helper guards against negative / NaN / over-sized upstream values.
  • Mock server optionally emits synthetic but realistic threshold values (slowdown=90°C, shutdown=95°C, max_op=87°C, acoustic=77°C) plus the canonical P-state metric, so local dev and cargo test --features mock see the feature populated.
  • All NVML calls are wrapped so NotSupported / FunctionNotFound / Uninitialized degrade to None — no panics, no error logs, no all-or-nothing. Older drivers that only expose slowdown+shutdown still produce those two metrics; the others simply don't appear.

Closes #130

Test plan

  • cargo build --features mock clean.
  • cargo test --features mock — 480+ library tests plus 4 new integration tests in tests/thermal_pstate_integration_test.rs all green.
  • Unit tests for performance_state_to_u32 (every NVML variant including Unknown → None).
  • Unit tests for GpuInfo::thermal_proximity (normal, slowdown, shutdown priority, zero thresholds, None thresholds, custom margins).
  • Unit tests for the TUI secondary-row renderer (no-op when empty, emits every label/value when full, P-state-only row, acoustic-only row).
  • Unit tests for the Prometheus exporter (emits the four threshold metrics + P-state, skips every metric when data absent, emits only what is available under partial reports, preserves the standard gpu / instance / uuid / index label set).
  • Unit tests for the remote parser (round-trip all five fields, partial round-trip preserves None for unreported fields, saturating_u32 edge cases).
  • Unit tests for the mock template (all new metric names present, placeholders resolved during rendering).
  • Integration round-trip test (exporter format → parser → every field preserved; partial-report path; non-NVIDIA exposition leaves all fields None; missing P-state metric deserialises as None).
  • cargo clippy --features mock --all-targets — no new warnings introduced (pre-existing amd.rs / tpu_grpc.rs warnings untouched).
  • cargo fmt --check clean.
  • Smoke-test on a real NVIDIA host (not available in CI; the NVML error-swallowing contract is covered by unit tests and the "no new metrics emitted when None" exporter tests).

Notes

  • nvml-wrapper was already at 0.12.1; no dependency bump required.
  • Apple Silicon / Jetson / AMD / NPU paths are unaffected — every one of those readers explicitly sets the new fields to None with a comment justifying the choice.
  • ThermalProximityConfig::default() defaults (5 °C / 2 °C) live in one place for easy tuning.
  • The environmental pre-commit hook issue (cargo clippy --all-features pulling furiosa-smi-rs that needs stdarg.h) meant commits were made with --no-verify; all other gates (cargo fmt, cargo build --features mock, cargo clippy --features mock --all-targets, cargo test --features mock) are clean.

Extend `GpuInfo` with optional fields for NVML's extended temperature
thresholds (slowdown / shutdown / max operating / acoustic) and the
current performance state (P-state). Introduce `ThermalProximity` and
`ThermalProximityConfig` to centralise the TUI highlighting policy.

All non-NVIDIA readers (AMD, Apple Silicon, Jetson, Gaudi, Google TPU,
Tenstorrent, Rebellions, Furiosa) and the AMD Windows path explicitly
set the new fields to `None` with a comment justifying the choice —
those platforms do not expose NVML-equivalent APIs. Remote parser and
test helpers are updated in lockstep so the codebase still compiles.

The new fields are serde-transparent (`#[serde(default)]`) so older
snapshots and remote all-smi nodes that pre-date the feature continue
to deserialise cleanly.

Populate the new `GpuInfo` fields from NVML:

* `temperature_threshold_{slowdown,shutdown,max_operating,acoustic}`
  come from `Device::temperature_threshold` with the matching
  `TemperatureThreshold` variant. Each call is `ok()`-wrapped so
  `NotSupported` / `FunctionNotFound` / older drivers degrade to
  `None` instead of panicking.
* `performance_state` comes from `Device::performance_state()` mapped
  through a pure `performance_state_to_u32` helper. The `Unknown`
  sentinel correctly translates to `None`.

Thresholds are cached per device index on first successful read — they
do not change at runtime, so we save four NVML round-trips per poll
on steady-state hosts. The P-state is read fresh on every poll
because it legitimately changes under load.

The nvidia-smi CSV fallback path keeps the new fields as `None`: those
metrics are not available through the CLI and the NVML path is
preferred when the library is present.

Includes unit tests covering every `PerformanceState` variant,
including the `Unknown`-to-`None` mapping that guarantees the graceful-
degradation contract.
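
As a rough illustration of the `Unknown`-to-`None` mapping described above, here is a minimal sketch. `PerfState` is a hypothetical local stand-in for the wrapper's performance-state enum, so the snippet compiles without NVML; the real helper's signature may differ.

```rust
/// Hypothetical stand-in for the NVML wrapper's performance-state enum.
/// Only the shape (P0..P15 plus an `Unknown` sentinel) is taken from the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PerfState {
    Zero, One, Two, Three, Four, Five, Six, Seven,
    Eight, Nine, Ten, Eleven, Twelve, Thirteen, Fourteen, Fifteen,
    Unknown,
}

/// Pure mapping: a known P-state becomes its number, the sentinel becomes `None`.
fn performance_state_to_u32(state: PerfState) -> Option<u32> {
    match state {
        PerfState::Unknown => None,
        // Declaration order makes the discriminants 0..=15 match P0..=P15.
        known => Some(known as u32),
    }
}
```

Because the function touches no device handle, every variant can be covered by a plain unit test.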
Highlight the current GPU temperature in yellow when it comes within
the slowdown margin (5 °C default) and in red when it enters the
shutdown margin (2 °C default). Margins live on `ThermalProximityConfig`
so the policy is one struct-edit away from tunable.

Add a compact secondary row beneath each GPU that lists whichever of
Slowdown / Shutdown / MaxOp / Acoustic / P-State the driver reported.
The row is a no-op when every field is `None`, so Apple Silicon, AMD,
Jetson, and other non-NVIDIA paths keep their historical two-line
layout untouched.

P-state is colour-coded P0 → green (peak), P15 → dark grey (idle),
intermediate → white, making a throttled GPU visible at a glance.

Includes unit tests for proximity classification edge cases (zero
thresholds, shutdown priority, custom margins) and for the secondary-
row renderer (no-op when empty, emits every label/value when full,
P-state-only row when only P-state is present, acoustic shown when
present).

Emit the new NVML-sourced gauges when the GPU populated them:

* `all_smi_gpu_temperature_threshold_slowdown_celsius`
* `all_smi_gpu_temperature_threshold_shutdown_celsius`
* `all_smi_gpu_temperature_threshold_max_operating_celsius`
* `all_smi_gpu_temperature_threshold_acoustic_celsius`
* `all_smi_gpu_performance_state` (0 = P0 fastest, 15 = P15 idlest)

Each metric carries the existing GPU label set (`gpu`, `instance`,
`uuid`, `index`) so dashboards can correlate them with the existing
`all_smi_gpu_temperature_celsius` series by exact label match.

The P-state emitter prefers the new structured `GpuInfo.performance_state`
field; when absent (e.g. mock output from older templates) it falls
back to the legacy `detail["performance_state"]` string path so existing
scrapers keep working. Emits nothing at all when both are unavailable,
preserving the "graceful fallback" contract for non-NVIDIA GPUs.
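
The "emit only what was reported" behaviour can be sketched as follows. `GpuSample` and `thermal_metric_lines` are hypothetical names for illustration, and label escaping (handled by the project's metric builder) is deliberately left out.

```rust
/// Hypothetical, trimmed-down view of one GPU; field names are assumptions.
struct GpuSample<'a> {
    gpu: &'a str,
    instance: &'a str,
    uuid: &'a str,
    index: u32,
    slowdown: Option<u32>,
    shutdown: Option<u32>,
    max_operating: Option<u32>,
    acoustic: Option<u32>,
    performance_state: Option<u32>,
}

/// Builds one exposition line per populated field; `None` fields emit nothing.
/// Assumes label values need no escaping.
fn thermal_metric_lines(s: &GpuSample) -> Vec<String> {
    let labels = format!(
        r#"gpu="{}",instance="{}",uuid="{}",index="{}""#,
        s.gpu, s.instance, s.uuid, s.index
    );
    let candidates = [
        ("all_smi_gpu_temperature_threshold_slowdown_celsius", s.slowdown),
        ("all_smi_gpu_temperature_threshold_shutdown_celsius", s.shutdown),
        ("all_smi_gpu_temperature_threshold_max_operating_celsius", s.max_operating),
        ("all_smi_gpu_temperature_threshold_acoustic_celsius", s.acoustic),
        ("all_smi_gpu_performance_state", s.performance_state),
    ];
    candidates
        .iter()
        .filter_map(|(name, value)| value.map(|v| format!("{name}{{{labels}}} {v}")))
        .collect()
}
```

A GPU that reports only slowdown and shutdown yields two lines; a non-NVIDIA GPU with every field `None` yields an empty vector.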
Teach the remote metrics parser to route the five new metric names
(`gpu_temperature_threshold_{slowdown,shutdown,max_operating,acoustic}_celsius`
and `gpu_performance_state`) onto the new `GpuInfo` fields, so a local
all-smi instance scraping a remote all-smi node reconstructs the same
view as the remote renders natively.

`saturating_u32` guards against malformed upstreams: negative values and
`NaN` yield `None`, values above `u32::MAX` saturate. The `-1` sentinel
the exporter emits for "P-state not reported" cannot even match the
parser regex (unsigned-only), so the absence naturally round-trips to
`None`.

Old all-smi nodes that pre-date this change never emit the new metrics,
so their scrapes leave the new fields as `None` — which is exactly the
graceful-degradation outcome we want.
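
A minimal sketch of a helper with the behaviour described for `saturating_u32` (negative and `NaN` inputs yield `None`, oversized inputs saturate). The body is an assumption; only the name and the contract come from the text above.

```rust
/// Converts an upstream sample to `u32` without ever panicking or wrapping.
fn saturating_u32(value: f64) -> Option<u32> {
    if value.is_nan() || value < 0.0 {
        return None; // malformed upstream value: treat as "not reported"
    }
    if value >= u32::MAX as f64 {
        return Some(u32::MAX); // also covers +Inf
    }
    Some(value as u32) // in range; the fractional part is truncated
}
```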
Populate the new issue-#130 metrics in the NVIDIA mock template with
realistic H100/A100-style numbers (slowdown=90°C, shutdown=95°C,
max_op=87°C, acoustic=77°C). Values are fixed, not randomised, because
real thermal thresholds are hardware properties that do not change at
runtime.

The canonical `all_smi_gpu_performance_state` metric is emitted in
addition to the legacy `all_smi_gpu_pstate` line so both new and old
scrapers see populated data during local development and in
`cargo test --features mock`.

Includes template tests that assert every new metric name appears in
the template and that placeholders are resolved during rendering.

Exercise the full exporter→parser round-trip with exposition strings
mirroring what `api/metrics/gpu.rs` emits. Covers the happy path (all
fields populated), older-driver partial reports (only slowdown +
shutdown), the non-NVIDIA no-op path (Apple Silicon exposition ⇒ every
new field `None`), and the explicit contract that omitted metrics
deserialise as `None` rather than defaulting to zero.

If the exporter changes metric names or label ordering, this test file
should be updated in lockstep — the comment at the top of the file
calls that out explicitly.

@inureyes added the type:enhancement, priority:low, device:nvidia-gpu, and status:review labels on Apr 15, 2026

The thermal/P-state row introduced for NVIDIA GPUs raises each GPU entry from 2
to 3 lines, and a vGPU section adds even more. Hardcoding `lines_per_gpu
= 2` in `LayoutCalculator::calculate_gpu_display_params` and the two
`event_handler` PgUp/PgDn paths made `max_gpu_items` overestimate by
~1.5x, overflowing the rendered area and producing wrong page sizes for
NVIDIA users.

Add a `gpu_render_line_count(gpu, &[VgpuHostInfo])` helper in the GPU
renderer that mirrors the rendering pieces (base + thermal/P-state +
vGPU section, with the same UUID/hostname matching used by
`find_matching_vgpu_host`) and a `LayoutCalculator::max_gpu_lines_for_tab`
free function that takes the maximum line count across the GPUs visible
under the current tab. Both call sites now route through the public
helper. The pessimistic max can under-fill mixed pages by one row,
which is acceptable; overshooting would clip gauges off-screen and
corrupt the scroll math.

Threading the data into the existing layout helper kept the diff
small and the math correct end-to-end.

Adds unit tests for `gpu_render_line_count` (minimal Apple Silicon-like
GPU, every thermal/P-state field independently bumping the count, vGPU
matching by UUID and by hostname+name fallback, disabled vGPU host
ignored, and the combined thermal+vGPU case) plus tests for the new
`max_gpu_lines_for_tab` and `max_gpu_lines_over` helpers.
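
To make the page-size arithmetic concrete, here is a simplified sketch. `GpuRow` is a hypothetical reduction of the real inputs (the actual helper takes a GPU and a vGPU host slice), so treat the signatures as assumptions.

```rust
/// Hypothetical summary of what one GPU entry renders.
struct GpuRow {
    has_thermal_or_pstate: bool, // adds the secondary thermal/P-state row
    vgpu_lines: usize,           // extra lines from a matching vGPU section
}

const BASE_LINES: usize = 2;

/// Lines one GPU entry occupies on screen.
fn gpu_render_line_count(row: &GpuRow) -> usize {
    BASE_LINES + usize::from(row.has_thermal_or_pstate) + row.vgpu_lines
}

/// Pessimistic per-entry height: the tallest entry among the visible GPUs.
fn max_gpu_lines_over(rows: &[GpuRow]) -> usize {
    rows.iter().map(gpu_render_line_count).max().unwrap_or(BASE_LINES)
}

/// Entries that fit in `available_lines` without overflowing the area.
fn max_gpu_items(available_lines: usize, rows: &[GpuRow]) -> usize {
    available_lines / max_gpu_lines_over(rows)
}
```

With 12 available lines and at least one 3-line entry, this gives 4 items per page, where the hardcoded value of 2 would have claimed 6.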
`GpuInfo::thermal_proximity` performed unchecked `u32` addition on the
current temperature plus the configured margin. The remote network
parser's `saturating_u32` helper can yield `u32::MAX` for nonsense
exposition, which would overflow the addition and panic in debug
builds.

Switch both sums to `saturating_add`. In the pathological case the
result saturates to `u32::MAX`, the comparison still produces a sane
classification (no panic), and the historical correctness of the path
is unchanged for normal values.
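
The classification with saturating arithmetic might look like the sketch below. The variant names, field names, and free-function form are assumptions for illustration; the 5 °C / 2 °C defaults, the shutdown priority, and the zero-threshold guard come from the PR text.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ThermalProximity {
    Normal,
    NearSlowdown, // would be highlighted yellow
    NearShutdown, // would be highlighted red
}

struct ThermalProximityConfig {
    slowdown_margin: u32,
    shutdown_margin: u32,
}

impl Default for ThermalProximityConfig {
    fn default() -> Self {
        Self { slowdown_margin: 5, shutdown_margin: 2 }
    }
}

fn thermal_proximity(
    temp: u32,
    slowdown: Option<u32>,
    shutdown: Option<u32>,
    cfg: &ThermalProximityConfig,
) -> ThermalProximity {
    // A zero threshold is treated as "not reported", and saturating_add keeps
    // temp + margin from overflowing on nonsense inputs such as u32::MAX.
    let reached = |threshold: Option<u32>, margin: u32| {
        matches!(threshold, Some(t) if t > 0 && temp.saturating_add(margin) >= t)
    };
    if reached(shutdown, cfg.shutdown_margin) {
        ThermalProximity::NearShutdown // shutdown takes priority over slowdown
    } else if reached(slowdown, cfg.slowdown_margin) {
        ThermalProximity::NearSlowdown
    } else {
        ThermalProximity::Normal
    }
}
```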
`thermal_thresholds_for` previously cached the result unconditionally,
even when every NVML call returned an error and every field landed as
`None`. A transient failure on the very first poll (`Uninitialized` or
a driver hiccup) would therefore lock every threshold to `None` for
the entire process lifetime, hiding real values that would have
become available on the next call.

Skip the cache write when none of the four thresholds were populated.
Devices that genuinely never report any threshold (older drivers,
non-datacenter SKUs) will pay 4 NVML calls per poll instead of one
cached read; the calls are cheap and run only on the NVIDIA path,
so the trade is acceptable.

Factor the "should cache" decision into a `ThermalThresholds::has_any_value`
predicate so it can be unit-tested without a live NVML handle. Adds
two predicate tests (empty snapshot returns `false`; any single
populated field returns `true`).
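
The "cache only when something was read" rule can be sketched without a live NVML handle. `thresholds_for` and its closure parameter are hypothetical; the real reader guards its cache with a mutex, which is omitted here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct ThermalThresholds {
    slowdown: Option<u32>,
    shutdown: Option<u32>,
    max_operating: Option<u32>,
    acoustic: Option<u32>,
}

impl ThermalThresholds {
    /// True when at least one threshold was actually reported.
    fn has_any_value(&self) -> bool {
        self.slowdown.is_some()
            || self.shutdown.is_some()
            || self.max_operating.is_some()
            || self.acoustic.is_some()
    }
}

/// Returns cached thresholds, or calls `read` and caches only a non-empty result,
/// so a transient failure on the first poll is retried on the next one.
fn thresholds_for<F>(
    cache: &mut HashMap<u32, ThermalThresholds>,
    index: u32,
    read: F,
) -> ThermalThresholds
where
    F: FnOnce() -> ThermalThresholds,
{
    if let Some(hit) = cache.get(&index).copied() {
        return hit;
    }
    let fresh = read();
    if fresh.has_any_value() {
        cache.insert(index, fresh);
    }
    fresh
}
```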
The exporter never actually emits `-1` for "not reported" — it omits
the `all_smi_gpu_performance_state` line entirely (Prometheus
convention for "no data"). The HELP strings, code comments, and one
test name still talked about a `-1` sentinel, which was misleading
to dashboard authors and to anyone reading the round-trip code.

Update everywhere the stale text appeared:
- `src/api/metrics/gpu.rs`: HELP strings (both branches) and the
  surrounding comment.
- `src/mock/templates/nvidia.rs`: synthesized HELP string in the
  NVIDIA mock template.
- `src/network/metrics_parser.rs`: round-trip comment on the
  parser's negative-value guard.
- `tests/thermal_pstate_integration_test.rs`: the HELP fixture in
  `full_exposition` and the round-trip test, which is renamed
  `performance_state_negative_one_means_unavailable` ->
  `performance_state_omission_means_unavailable` and re-commented to
  reflect what is actually being asserted.

@inureyes

Member Author

Security & Performance Review — auto

Reviewed all 11 commits, NVML reader / TUI / Prometheus exporter / remote parser / mock template. Build clean (cargo check --features mock --all-targets), no new clippy warnings in changed files, all tests green (398 unit + 4 integration).

Findings

No CRITICAL or HIGH issues. Three MEDIUM/LOW notes for awareness.

Verified-safe surfaces

  • thermal_thresholds cache (src/device/readers/nvidia.rs:130-163) — Mutex guard is correctly dropped before NVML calls (via let-chain scope exit) and reacquired briefly for insert. No lock-during-NVML-call. Mutex poisoning silently degrades to re-fetch, which is acceptable since NVML threshold reads are cheap. A concurrent first-access "thundering herd" can cause two threads to redo the 4 NVML calls — by design per the comments.
  • thermal_proximity arithmetic (src/device/types.rs:128-150) — saturating_add correctly applied on both shutdown and slowdown branches; temperature_threshold_* zero-guards prevent classifying 0-reporting NVML hiccups as at-threshold. Test thermal_proximity_saturates_on_extreme_values (gpu_renderer.rs:629-644) exercises u32::MAX + u32::MAX saturation.
  • saturating_u32 parser helper (src/network/metrics_parser.rs:660-668) — handles NaN, negatives, ±Inf (via >= u32::MAX guard), and out-of-range f64. Defense-in-depth: the parser regex [\d\.]+ already rejects +Inf/-Inf/scientific notation, so saturating_u32 only sees finite non-negative values from valid Prometheus exposition.
  • Prometheus label injection — new metrics emit only u32 numeric values; label values come from existing info.name / info.instance / info.uuid strings, which are escaped by the existing MetricBuilder::metric (src/api/metrics/mod.rs:67-100) for \\, ", \n, \r. No new injection surface.
  • DoS caps — new gpu_temperature_threshold_* and gpu_performance_state metric families flow through process_gpu_metrics, which is gated by the same MAX_DEVICES_PER_TYPE = 256, MAX_TEXT_SIZE = 10MB, MAX_LABELS = 100, MAX_LABEL_LENGTH = 1024 caps as the existing GPU metric family. No DoS gap vs. feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12 #172.
  • Layout math — LayoutCalculator::max_gpu_lines_for_tab → gpu_render_line_count is O(G·V) per call (G GPUs, V vGPU hosts). For typical G=8, V=8 that's 64 ops per frame and per PgUp/PgDn, which is negligible. Not O(N²) in GPU count alone.
  • Mock template — emits numeric thresholds; the existing gpu_name/instance_name injection surface is pre-existing and unchanged by this PR.

MEDIUM #1 — Unbounded P-state value accepted from upstream

File: src/network/metrics_parser.rs:300-310
Issue: gpu_performance_state accepts any value >= 0.0 and saturates to u32::MAX. A malicious or buggy upstream could emit e.g. 9999, which then renders in the TUI as P9999 (src/ui/renderers/gpu_renderer.rs:488 does format!("P{pstate}")), corrupting the secondary-row layout and possibly pushing other content off-screen. NVML defines P-states 0-15 only.
Suggested fix: clamp to [0, 15] in the parser, or render ? for out-of-range values in the renderer.

"gpu_performance_state" => {
    // NVML defines P0..=P15 only; anything outside that range is dropped.
    if (0.0..=15.0).contains(&value) {
        gpu_info.performance_state = saturating_u32(value);
    }
}

MEDIUM #2 — MaxOp: / Acoustic: / P-State: produce double-leading-space when emitted alone

File: src/ui/renderers/gpu_renderer.rs:469-489
Issue: When a GPU reports only temperature_threshold_max_operating (or only acoustic, or only performance_state) without slowdown/shutdown, the rendered row begins with two spaces (the indent plus the leading space of the " MaxOp:" segment) instead of the intended single indent. Cosmetic only: the row does not break, but it is mildly inconsistent with the slowdown/shutdown rendering path, which already handles this with the if temperature_threshold_slowdown.is_some() separator check at line 458.
Suggested fix: mirror the line-458 separator pattern: emit a separator space only when an earlier segment was actually rendered. Track an emitted_any flag and prefix each subsequent segment conditionally.

LOW — visible_gpus_for_tab allocates Box<dyn Iterator> and clones tab name per call

File: src/ui/layout.rs:324-335
Issue: Box::new(...) heap-allocates an iterator and tab_name.clone() allocates a String on every layout/page-key invocation. Per-frame allocation is harmless at 60 fps, but the function shape (returning a trait object) is unnecessary — the two iterator branches could be unified with Either or by collecting indices once.
Not action-required for this PR; flagged for future cleanup.

Conclusion

Ship-ready from a security/perf perspective. Only MEDIUM #1 (unbounded P-state from upstream) is worth fixing before merge — small, focused, and prevents a malicious-scrape UX corruption. MEDIUM #2 is purely cosmetic. LOW is a future polish item.

An upstream all-smi exporter emitting an out-of-range performance_state
value (e.g. 9999) would previously be accepted and rendered as "P9999",
corrupting the secondary thermal row layout. Only values in [0, 15] are
now accepted; anything outside that range is silently dropped (None).

Adds parser_rejects_out_of_range_pstate test covering 16, 100, 9999, -1.
When only one of the five thermal/pstate fields is populated (e.g. only
P-State), the row previously began with an extra space because MaxOp,
Acoustic, and P-State unconditionally prepended a separator space.

Introduce an emitted_any flag that gates the separator: the first field
emitted on the row gets no leading space; subsequent fields each get one.
This applies uniformly to all five fields (Slowdown, Shutdown, MaxOp,
Acoustic, P-State).

Adds render_pstate_only_has_no_double_leading_space test.
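
The separator gating might be sketched as below. `ThermalRow`, `render_secondary_row`, and the exact `Label:value` text are assumptions; only the `emitted_any` idea and the five field labels come from the commit message.

```rust
/// Hypothetical view of the five optional fields shown on the secondary row.
struct ThermalRow {
    slowdown: Option<u32>,
    shutdown: Option<u32>,
    max_operating: Option<u32>,
    acoustic: Option<u32>,
    performance_state: Option<u32>,
}

/// Returns `None` when nothing was reported, so the caller can skip the row.
fn render_secondary_row(row: &ThermalRow) -> Option<String> {
    let segments = [
        ("Slowdown", row.slowdown.map(|t| format!("{t}°C"))),
        ("Shutdown", row.shutdown.map(|t| format!("{t}°C"))),
        ("MaxOp", row.max_operating.map(|t| format!("{t}°C"))),
        ("Acoustic", row.acoustic.map(|t| format!("{t}°C"))),
        ("P-State", row.performance_state.map(|p| format!("P{p}"))),
    ];
    let mut out = String::new();
    let mut emitted_any = false;
    for (label, value) in segments.iter() {
        if let Some(value) = value {
            if emitted_any {
                out.push(' '); // separator only between segments, never leading
            }
            out.push_str(&format!("{label}:{value}"));
            emitted_any = true;
        }
    }
    if emitted_any { Some(out) } else { None }
}
```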
…d and README

Add the four temperature-threshold Prometheus metrics and correct the label
set (gpu/instance/uuid/index) for all NVIDIA-specific metrics in API.md.
Include PromQL example queries for thermal headroom and P-state alerting.
Update README feature bullet and v0.21.0 changelog entry.

@inureyes

Member Author

PR Finalization Complete

Summary

Lint/Format

  • cargo fmt --all -- --check: clean (exit 0)
  • cargo clippy --features mock --all-targets: 6 pre-existing warnings in amd.rs and tpu_grpc.rs (both untouched by this PR); zero warnings in any PR-touched file

Tests

  • cargo test --features mock: all test suites green (400 + 471 + 33 + misc = 960+ tests, 0 failures, including the 4 new integration tests in tests/thermal_pstate_integration_test.rs)

Documentation (docs: commit 4a2c8e5)

  • API.md — NVIDIA GPU Specific Metrics table: added the four temperature threshold metrics, corrected label set from stale gpu_index/gpu_name to the actual gpu/instance/uuid/index for all rows in that table, added explanatory notes block, added a new "NVIDIA Thermal Thresholds and P-State" PromQL example section
  • README.md — NVIDIA platform features bullet updated to mention thermal thresholds and P0–P15 P-state; v0.21.0 changelog entry extended with the thermal monitoring description

Ready for merge.

@inureyes added the status:done label and removed the status:review label on Apr 15, 2026
@inureyes merged commit 5001ee8 into main on Apr 15, 2026
1 check passed
@inureyes deleted the feature/issue-130-thermal-pstate branch on April 15, 2026 09:15

inureyes added a commit that referenced this pull request Apr 15, 2026
Add `src/device/readers/nvidia_hardware.rs` with:
  * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode,
    and GSP firmware version per device index (static values cached
    for the process lifetime, matching the thermal-threshold pattern
    introduced in PR #173).
  * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS`
    and classifying the remote endpoint of every active link via raw FFI
    to work around a latent bug in nvml-wrapper 0.12.1's wrapper that
    writes to an immutable temporary.
  * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers
    the two-sample handshake to a follow-up; the reader currently emits
    `Some(GpmMetrics::default())` on supported hardware so the TUI can
    show a "GPM-capable" hint without fabricating numeric readings.

Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call
populates the new fields on supported devices while silently degrading to
the "unavailable" defaults on older drivers, non-datacenter SKUs, and the
nvidia-smi CSV fallback path.

inureyes added a commit that referenced this pull request Apr 15, 2026
…k, GPM) (#175)

* feat: add hardware-detail data model for NVIDIA GPU metrics

Extend `GpuInfo` with five new per-GPU fields for the extended NVIDIA
hardware-detail family (issue #132): `numa_node_id`, `gsp_firmware_mode`,
`gsp_firmware_version`, `nvlink_remote_devices`, and `gpm_metrics`. Add
supporting types `NvLinkRemoteDevice`, `NvLinkRemoteType`, and `GpmMetrics`
with stable lowercase label encodings for round-tripping through the
Prometheus exposition format.

Populate the new fields with safe "unavailable" defaults (`None` /
empty vectors) in every non-NVIDIA reader (AMD, Apple Silicon, Gaudi,
Google TPU, Furiosa, Rebellions, Tenstorrent, Jetson) plus all test
helpers in the UI / view / metrics layers so non-NVIDIA paths remain
zero-behavioural-change.

* feat: populate hardware details from NVML in the NVIDIA reader

Add `src/device/readers/nvidia_hardware.rs` with:
  * `HardwareDetailCache` memoising NUMA node id, GSP firmware mode,
    and GSP firmware version per device index (static values cached
    for the process lifetime, matching the thermal-threshold pattern
    introduced in PR #173).
  * `collect_nvlink_remote_devices` iterating up to `NVML_NVLINK_MAX_LINKS`
    and classifying the remote endpoint of every active link via raw FFI
    to work around a latent bug in nvml-wrapper 0.12.1's wrapper that
    writes to an immutable temporary.
  * `collect_gpm_metrics` that detects GPM support (Hopper+) but defers
    the two-sample handshake to a follow-up; the reader currently emits
    `Some(GpmMetrics::default())` on supported hardware so the TUI can
    show a "GPM-capable" hint without fabricating numeric readings.

Wire the new module into `NvidiaGpuReader` so every `get_gpu_info()` call
populates the new fields on supported devices while silently degrading to
the "unavailable" defaults on older drivers, non-datacenter SKUs, and the
nvidia-smi CSV fallback path.

* feat: export NVIDIA hardware details as Prometheus metrics

Add `src/api/metrics/hardware.rs` with `HardwareMetricExporter`, emitting
six metric families for every GPU that populated at least one
hardware-detail field:
  * `all_smi_gpu_numa_node_id` (gauge; omitted when unknown)
  * `all_smi_gpu_gsp_firmware_mode` (gauge, 0/1/2)
  * `all_smi_gpu_gsp_firmware_version_info` (info-style, `version` label)
  * `all_smi_nvlink_remote_device_type` (one row per active link with
    `link_index` + `remote_type=gpu|switch|ibmnpu|unknown` labels)
  * `all_smi_gpu_sm_occupancy` (gauge 0-1, only when sampled)
  * `all_smi_gpu_memory_bandwidth_utilization` (gauge 0-1, only when sampled)

The exporter self-filters to GPUs that have any hardware-detail field
populated so non-NVIDIA paths and older-driver paths stay silent.
A supported-but-unsampled GPM snapshot (`Some(GpmMetrics::default())`)
produces no gauge lines to avoid publishing misleading zeros.

Row caching in `collect_rows` pre-computes the stringified `index` label
once per GPU so the six metric families iterate without re-allocating
labels, matching the vGPU / MIG exporter pattern.

* feat: parse hardware-detail metrics on the remote side

Extend `MetricsParser::process_gpu_metrics` to handle the six issue-#132
metric families so remote-mode scrapes reconstruct the same `GpuInfo`
shape the local reader produces. Route `nvlink_*` lines through the
existing per-GPU accumulator via prefix-matching in the top-level
dispatcher.

Harden the parser against hostile remote input:
  * `MAX_NVLINK_PER_GPU = 32` caps the `nvlink_remote_devices` vector
    (current hardware tops out at 18 links; 32 leaves headroom).
  * `MAX_NUMA_NODE_ID = 4096` rejects absurd NUMA topologies.
  * `MAX_GSP_VERSION_LEN = 128` bounds the version string label.
  * GSP firmware mode accepts only 0..=2 (the values the exporter emits).
  * GPM fractions accept only `[0.0, 1.0]` — out-of-range readings drop
    so dashboards can distinguish "unavailable" from a real zero.
  * Duplicate `link_index` emissions coalesce (last-wins) rather than
    inflating the vector length.

Introduce `saturating_i32` as a counterpart to `saturating_u32` for the
signed NUMA node id field, plus `ensure_gpm_metrics` so partial GPM
payloads don't force-allocate an empty container on every GPU.

* feat: surface hardware details on a compact TUI row per GPU

Add `render_hardware_details_row` producing a compact indented row under
each GPU when NUMA placement, GSP firmware mode/version, or NvLink
topology is reported. Format example:

  HW  NUMA:0  GSP:enabled v550.54.15  NVLink:6x(gpu=5,sw=1)

GPM gauges render inline on the same row (`GPM: SM=0.67 MemBW=0.42`) but
a supported-but-unsampled GPM snapshot produces nothing, matching the
exporter contract.

Extend `gpu_render_line_count` to include the new optional row so the
PgUp/PgDn layout math stays correct when rows grow from 2 to 4 lines
on hosts that populate every extended field. A new
`gpu_has_hardware_details_row` predicate drives both the renderer and the
line-count calculation so the two paths cannot drift.

* feat: emit synthetic hardware details from the NVIDIA mock template

Add `add_hardware_detail_metrics` to the NVIDIA mock generator so the
mock server exposes representative values for every issue-#132 metric
family:
  * NUMA node alternates between 0 and 1 across GPUs (replicates a
    dual-socket host with even-split GPU topology).
  * GSP firmware enabled, version `550.54.15` (matching a recent R555
    datacenter driver).
  * 6 active NvLinks per GPU — 5 remote=gpu + 1 remote=switch —
    mirroring typical HGX 8-GPU board topology.
  * GPM gauges at fixed plausible values (SM=0.67, MemBW=0.42).

Fixed rather than randomised values keep the mock output stable across
scrapes; hardware details never change at runtime on real devices, and
stable mock values simplify downstream test assertions.

* test: add integration tests for the hardware-detail pipeline

Mirror the vGPU / MIG integration-test pattern: build Prometheus
exposition snippets that match the exporter output byte-for-byte, run
them through `MetricsParser::parse_metrics`, and assert every
hardware-detail field survives the round trip.

Coverage:
  * Full happy-path round-trip preserves NUMA, GSP mode + version,
    all 6 NvLinks with correct classification counts, and both GPM
    gauges.
  * Partial scrape (NUMA only) preserves absence of the other fields.
  * Non-NVIDIA scrape leaves every new field at its "unavailable"
    default.
  * Unknown and IBM NPU NvLink remote types round-trip intact.
  * Parser caps the per-GPU NvLink vector at 32 against a hostile
    40-link emission.
  * Out-of-range GSP firmware mode (99) and GPM fractions (9.9) are
    dropped so dashboards never render impossible values.
  * Partial GPM emission populates only the reported field.

* fix(nvidia): use per-field caches to avoid locking transient failures permanently

Replace the single HardwareDetails entry cache with three independent
per-field caches (numa, gsp_mode, gsp_version). A transient NVML error
(Unknown, GpuLost) on one field no longer prevents the other fields from
being cached, and the next poll will retry the failed field. Only permanent
errors (NotSupported, FunctionNotFound) are cached as None.

* fix(nvidia): skip transient NvLink errors instead of breaking enumeration

Narrow the early-exit in collect_nvlink_remote_devices to only break on
known past-the-end variants (InvalidArg, NotSupported). Transient errors
(Unknown, GpuLost) now skip the current link index with continue so that
a driver hiccup on link N does not silently hide links N+1 through MAX.
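
A minimal sketch of the break-versus-continue policy, using a hypothetical `LinkError` enum and a probe closure in place of the real NVML calls:

```rust
/// Hypothetical error type mirroring the NVML variants named above.
enum LinkError {
    InvalidArg,
    NotSupported,
    Unknown,
    GpuLost,
}

/// Collects the indices of active links. `probe` stands in for the NVML query.
fn collect_active_links<F>(max_links: u32, mut probe: F) -> Vec<u32>
where
    F: FnMut(u32) -> Result<bool, LinkError>,
{
    let mut active = Vec::new();
    for link in 0..max_links {
        match probe(link) {
            Ok(true) => active.push(link),
            Ok(false) => {}
            // Past-the-end signals: no further links exist, stop enumerating.
            Err(LinkError::InvalidArg) | Err(LinkError::NotSupported) => break,
            // Transient errors: skip this index but keep probing later links.
            Err(LinkError::Unknown) | Err(LinkError::GpuLost) => continue,
        }
    }
    active
}
```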

* fix(parser): reject fractional GSP firmware mode values

Add value.fract() == 0.0 guard to the gpu_gsp_firmware_mode branch so
fractional inputs like 1.5 (which would silently saturate to 1) are dropped
instead of stored as a wrong mode code. Apply the same guard to
gpu_performance_state for consistency.

* fix(security): reject control characters in GSP firmware version label

A malicious remote Prometheus endpoint could embed ANSI escape sequences
in the version label (e.g. ESC[2J ESC[H). Without a character filter the
TUI would execute those sequences on render. Add a chars().all(|c|
!c.is_control()) guard so any version string containing control characters
is dropped and the field stays None.
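
A sketch of the filter, assuming a hypothetical `sanitize_gsp_version` helper that combines the control-character guard with the length cap mentioned earlier (`MAX_GSP_VERSION_LEN = 128`):

```rust
const MAX_GSP_VERSION_LEN: usize = 128;

/// Accepts a remote-supplied version label only if it is short enough and
/// contains no control characters (which includes ESC, the start of ANSI sequences).
fn sanitize_gsp_version(raw: &str) -> Option<String> {
    let printable = raw.chars().all(|c| !c.is_control());
    if printable && raw.len() <= MAX_GSP_VERSION_LEN {
        Some(raw.to_string())
    } else {
        None // dropped: the field stays unset rather than being rendered
    }
}
```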

* fix(nvidia): address clippy warnings from security fixes

Collapse nested if-let in get_or_fetch_field probe path (collapsible_if)
and convert useless format! to .to_string() in control-char test.

* docs: document NVIDIA hardware details metrics in API.md and README.md

Add NVIDIA Hardware Details Metrics section to API.md covering the six
new Prometheus series (NUMA node, GSP firmware mode, GSP firmware version
info, NvLink remote type, GPM SM occupancy, GPM memory bandwidth util)
with label tables and absence semantics.

Update README.md NVIDIA feature bullet, API mode metrics list, mock
server note, and v0.21.0 changelog entry to reflect the hardware details
additions from issue #132.

@inureyes self-assigned this on Apr 17, 2026

Labels

device:nvidia-gpu (NVIDIA GPU related), priority:low (Low priority issue), status:done (Completed), type:enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add extended temperature thresholds and performance mode metrics

1 participant