Skip to content

fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges#237

Merged
inureyes merged 3 commits into
mainfrom
fix/issue-236-sparkline-fixed-ranges
May 25, 2026
Merged

fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges#237
inureyes merged 3 commits into
mainfrom
fix/issue-236-sparkline-fixed-ranges

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis to the per-frame data window's [min, max]. With only four braille vertical levels this made the absolute-magnitude graphs (temperature, power, ANE) meaningless: noise was amplified to full height, the baseline shifted every frame as the 100-sample window slid, and near-constant series collapsed to the bottom row — so a blazing 90°C looked identical to a cool 35°C. The accompanying min/max badge jittered every frame too.

This replaces per-window auto-ranging with fixed, domain-meaningful axes.

Changes

  • New src/ui/scale.rs — fixed-range helpers:
    • temp_range(gpu)(30, ceiling); ceiling = reported GPU thermal threshold (slowdown → max_operating → shutdown), else 100°C fallback (also used for CPU temp, which reports no threshold).
    • power_range(gpu, history)(0, ceiling); ceiling = enforced power limit from gpu.detail (NVIDIA/Gaudi), else nice_ceil(peak) floored at 10 W.
    • ane_range(history)(0, nice_ceil(max(peak, 8 W))).
    • nice_ceil() rounds up to 1/2/5 × 10ⁿ so the fallback axis doesn't drift with small peak changes.
    • scale_badge() formats the fixed axis for display.
  • gpu_sparkline_panel.rs — GPU Temp / ANE / Pkg Power now use fixed axes; the badge shows the fixed axis (e.g. 30-83) instead of the jittery observed window min/max. Removed the now-unused min_max_badge.
  • local_header.rs — top summary Tmp (system temperature) and Pwr now use fixed axes.
  • remote_sparkline_panel.rs — remote GPU/CPU Temp now use fixed axes.
  • ioreport.rs — collapsed a clippy::collapsible_match into a match guard (drive-by; it was the only remaining clippy warning in the lib).

CPU%, memory%, and per-core bars were already pinned to (0,100) and are unchanged.

Behavior

Before (auto-range, temp steady 45–47°C):  GPU Temp ⣀⣶⣀⣿⣶ 46°C  45-47   (2°C → full scale, badge drifts)
After  (fixed 30–83°C):                     GPU Temp ⣀⣀⣀⣀⣀ 46°C  30-83   (low & steady; rises under load)

Test plan

  • cargo fmt --check
  • cargo clippy --lib → 0 warnings ✅
  • cargo test --lib → 994 passed, 0 failed, including 13 new scale tests (threshold/limit priority, nice_ceil rounding, axis stability under window shift, fallbacks).

Closes #236

The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis
to the per-frame window [min,max]. With only four braille vertical
levels this amplified noise (a ±1C wiggle filled the full height),
shifted the baseline every frame as the window slid, and collapsed
near-constant series to the bottom row -- making a hot reading look
identical to a cool one.

Add src/ui/scale.rs with fixed, domain-meaningful ranges:
- temperature: 30C floor + thermal-threshold ceiling (100C fallback)
- power: 0 + enforced power limit, else nice_ceil over the peak
- ANE: 0 + nice_ceil(max(peak, 8W))

Wire the helpers into gpu_sparkline_panel, local_header, and
remote_sparkline_panel, and switch the GPU Metrics badge from the
jittery observed window min/max to the stable fixed axis. CPU%, memory%,
and per-core bars were already fixed at (0,100) and are unchanged.

Also collapse a clippy collapsible_match in ioreport.rs.

Closes #236
@inureyes inureyes added type:bug Something isn't working mode:view View mode related labels May 25, 2026
The Pkg Power / summary Pwr sparkline axis is anchored to the device's enforced power limit (`power_range`), and on remote hosts that limit originates from a scraped Prometheus value. Rust's `f64` parsing accepts `"inf"`, `"infinity"`, and `"NaN"`, so a malformed or hostile remote endpoint emitting `gpu_power_limit_max_watts inf` round-trips through the metrics parser into `gpu.detail`, and `power_range`'s `w > 0.0` filter let it through (`inf > 0.0` is true). The helper then returned `(0, inf)`, which `scale_badge` rendered as a `0-inf` axis legend. The downstream braille renderer's non-finite guard prevented a panic, but the corrupted badge still surfaced from untrusted input.

A second, narrower route reached the same `(0, inf)` result through the fallback path: a pathologically large but finite `power_consumption` scrape (near `f64::MAX`) survives the finite filter in `history_peak`, and `nice_ceil` then overflowed `nice * pow` to infinity.

Both routes are now closed in `src/ui/scale.rs`: the power-limit filter requires a positive *finite* value (a bogus limit falls back to the nice-rounded observed peak), and `nice_ceil` guards its result so it can never return a non-finite ceiling. Added regression tests for non-finite limit strings and `nice_ceil` overflow near `f64::MAX`.
@inureyes

Copy link
Copy Markdown
Member Author

Security & performance review

Reviewed the diff against main with focus on the float/numeric edge cases in scale.rs, the untrusted power-limit parse, the render hot path, and the ioreport.rs refactor.

Finding addressed (MEDIUM) — non-finite power limit reached the axis

power_range anchors the Pkg Power / summary Pwr axis to the device's enforced power limit, and in remote view mode that limit comes from a scraped Prometheus value (gpu_power_limit_max_wattsmetrics_parser.rsgpu.detail["power_limit_max"]). Rust's f64::from_str accepts "inf", "infinity", and "NaN", and the existing guard was only .filter(|&w| w > 0.0) — but inf > 0.0 is true. A malformed or hostile endpoint emitting ... inf therefore produced a (0, inf) range, which scale_badge rendered as a literal 0-inf axis legend.

A second, narrower route hit the same (0, inf) via the fallback path: a pathologically large but finite power_consumption scrape (near f64::MAX) passes the finite filter in history_peak, and nice_ceil then overflowed nice * pow to infinity.

Neither route can panic (the braille renderer at src/ui/braille.rs:77 treats non-finite range bounds as a degenerate constant range), so this is display corruption from untrusted input rather than a crash/DoS — hence MEDIUM, not CRITICAL.

Fix (4b68bba, src/ui/scale.rs only):

  • power_range limit filter now requires w.is_finite() && w > 0.0; a bogus limit falls back to the nice-rounded observed peak.
  • nice_ceil guards its result so it can never return a non-finite ceiling.
  • Added regression tests: power_range_ignores_non_finite_limit (covers inf/Inf/infinity/-inf/NaN/nan) and nice_ceil_result_is_always_finite (covers f64::MAX).

Reviewed and clear

  • Temperature math — thresholds are Option<u32>f64::from is exact and non-negative; filtered to > 30.0 with a 100°C fallback. No NaN, no inversion, no divide-by-zero.
  • nice_ceil / log10 / powf — zero/negative/NaN/inf inputs are short-circuited to 1.0; subnormal-to-0.0 and overflow-to-inf cases are unreachable for real inputs (callers floor at 8/10 W) and are now defended regardless.
  • history_peak — already filters non-finite samples, so NaN/inf in power_consumption/ane_utilization history never poisons the peak.
  • Render hot path — per-frame cost is O(1) range math plus a single O(n≤100) history_peak pass that replaces what the old auto-range scanned internally; one small scale_badge String per row. No new allocations or super-linear scans; no regression.
  • ioreport.rs — collapsing the nested if item.channel == "GPUPH" into a match guard is behavior-preserving: a non-GPUPH channel falls through to _ => {} either way.

Gate

cargo fmt --check, cargo clippy --lib (0 warnings), cargo test --lib (996 passed, 0 failed; +2 new) all pass.

Address pr-reviewer findings on #237:

- [MEDIUM] Package power is summed across all GPUs, but power_range
  anchored the ceiling to only the first GPU's limit. On a multi-GPU
  NVIDIA node (all-smi's common case) the summed power far exceeds one
  GPU's limit, pinning the Pkg Power sparkline flat at the top. power_range
  now takes &[GpuInfo] and sums every GPU's enforced limit, falling back to
  the nice-rounded peak when any GPU lacks one.
- [LOW] Extract gpu_power_limit() so each power-limit key is parsed and
  validated independently in current -> max -> default order; a
  present-but-invalid power_limit_current no longer masks a valid
  power_limit_max.

Adds 3 regression tests (multi-GPU sum, mixed-limit fallback, per-key fallback).
@inureyes

Copy link
Copy Markdown
Member Author

Addressed pr-reviewer findings (commit 01e7a02)

  • [MEDIUM] Multi-GPU aggregate power clippower_range now takes &[GpuInfo] and anchors the ceiling to the sum of every GPU's enforced limit (package power is itself summed across GPUs). When any GPU lacks a valid limit it falls back to the nice-rounded observed peak, so an N-GPU NVIDIA node no longer pins Pkg Power flat at the top.
  • [LOW] Power-limit key fallback short-circuit — extracted gpu_power_limit(); each key is now parsed + validated independently in current → max → default order, so a present-but-invalid power_limit_current no longer masks a valid power_limit_max.

Added 3 regression tests (multi-GPU sum, mixed-limit fallback, per-key fallback).

Gates: cargo fmt --check ✅ · cargo clippy --lib → 0 warnings ✅ · cargo test --lib → 999 passed, 0 failed ✅

@inureyes inureyes merged commit 76edd9f into main May 25, 2026
4 checks passed
@inureyes inureyes deleted the fix/issue-236-sparkline-fixed-ranges branch May 25, 2026 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mode:view View mode related type:bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: anchor temperature / power / ANE sparklines to fixed Y-axis ranges in the GPU Metrics box and summary bars

1 participant