fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges by inureyes · Pull Request #237 · lablup/all-smi

inureyes · 2026-05-25T04:50:50Z

Summary

The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis to the per-frame data window's [min, max]. With only four braille vertical levels this made the absolute-magnitude graphs (temperature, power, ANE) meaningless: noise was amplified to full height, the baseline shifted every frame as the 100-sample window slid, and near-constant series collapsed to the bottom row — so a blazing 90°C looked identical to a cool 35°C. The accompanying min/max badge jittered every frame too.

This replaces per-window auto-ranging with fixed, domain-meaningful axes.

Changes

New src/ui/scale.rs — fixed-range helpers:
- temp_range(gpu) → (30, ceiling); ceiling = reported GPU thermal threshold (slowdown → max_operating → shutdown), else 100°C fallback (also used for CPU temp, which reports no threshold).
- power_range(gpu, history) → (0, ceiling); ceiling = enforced power limit from gpu.detail (NVIDIA/Gaudi), else nice_ceil(peak) floored at 10 W.
- ane_range(history) → (0, nice_ceil(max(peak, 8 W))).
- nice_ceil() rounds up to 1/2/5 × 10ⁿ so the fallback axis doesn't drift with small peak changes.
- scale_badge() formats the fixed axis for display.
gpu_sparkline_panel.rs — GPU Temp / ANE / Pkg Power now use fixed axes; the badge shows the fixed axis (e.g. 30-83) instead of the jittery observed window min/max. Removed the now-unused min_max_badge.
local_header.rs — top summary Tmp (system temperature) and Pwr now use fixed axes.
remote_sparkline_panel.rs — remote GPU/CPU Temp now use fixed axes.
ioreport.rs — collapsed a clippy::collapsible_match into a match guard (drive-by; it was the only remaining clippy warning in the lib).

CPU%, memory%, and per-core bars were already pinned to (0,100) and are unchanged.

Behavior

Before (auto-range, temp steady 45–47°C):  GPU Temp ⣀⣶⣀⣿⣶ 46°C  45-47   (2°C → full scale, badge drifts)
After  (fixed 30–83°C):                     GPU Temp ⣀⣀⣀⣀⣀ 46°C  30-83   (low & steady; rises under load)

Test plan

cargo fmt --check ✅
cargo clippy --lib → 0 warnings ✅
cargo test --lib → 994 passed, 0 failed, including 13 new scale tests (threshold/limit priority, nice_ceil rounding, axis stability under window shift, fallbacks).

Closes #236

The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis to the per-frame window [min,max]. With only four braille vertical levels this amplified noise (a ±1C wiggle filled the full height), shifted the baseline every frame as the window slid, and collapsed near-constant series to the bottom row -- making a hot reading look identical to a cool one. Add src/ui/scale.rs with fixed, domain-meaningful ranges: - temperature: 30C floor + thermal-threshold ceiling (100C fallback) - power: 0 + enforced power limit, else nice_ceil over the peak - ANE: 0 + nice_ceil(max(peak, 8W)) Wire the helpers into gpu_sparkline_panel, local_header, and remote_sparkline_panel, and switch the GPU Metrics badge from the jittery observed window min/max to the stable fixed axis. CPU%, memory%, and per-core bars were already fixed at (0,100) and are unchanged. Also collapse a clippy collapsible_match in ioreport.rs. Closes #236

The Pkg Power / summary Pwr sparkline axis is anchored to the device's enforced power limit (`power_range`), and on remote hosts that limit originates from a scraped Prometheus value. Rust's `f64` parsing accepts `"inf"`, `"infinity"`, and `"NaN"`, so a malformed or hostile remote endpoint emitting `gpu_power_limit_max_watts inf` round-trips through the metrics parser into `gpu.detail`, and `power_range`'s `w > 0.0` filter let it through (`inf > 0.0` is true). The helper then returned `(0, inf)`, which `scale_badge` rendered as a `0-inf` axis legend. The downstream braille renderer's non-finite guard prevented a panic, but the corrupted badge still surfaced from untrusted input. A second, narrower route reached the same `(0, inf)` result through the fallback path: a pathologically large but finite `power_consumption` scrape (near `f64::MAX`) survives the finite filter in `history_peak`, and `nice_ceil` then overflowed `nice * pow` to infinity. Both routes are now closed in `src/ui/scale.rs`: the power-limit filter requires a positive *finite* value (a bogus limit falls back to the nice-rounded observed peak), and `nice_ceil` guards its result so it can never return a non-finite ceiling. Added regression tests for non-finite limit strings and `nice_ceil` overflow near `f64::MAX`.

inureyes · 2026-05-25T04:56:04Z

Security & performance review

Reviewed the diff against main with focus on the float/numeric edge cases in scale.rs, the untrusted power-limit parse, the render hot path, and the ioreport.rs refactor.

Finding addressed (MEDIUM) — non-finite power limit reached the axis

power_range anchors the Pkg Power / summary Pwr axis to the device's enforced power limit, and in remote view mode that limit comes from a scraped Prometheus value (gpu_power_limit_max_watts → metrics_parser.rs → gpu.detail["power_limit_max"]). Rust's f64::from_str accepts "inf", "infinity", and "NaN", and the existing guard was only .filter(|&w| w > 0.0) — but inf > 0.0 is true. A malformed or hostile endpoint emitting ... inf therefore produced a (0, inf) range, which scale_badge rendered as a literal 0-inf axis legend.

A second, narrower route hit the same (0, inf) via the fallback path: a pathologically large but finite power_consumption scrape (near f64::MAX) passes the finite filter in history_peak, and nice_ceil then overflowed nice * pow to infinity.

Neither route can panic (the braille renderer at src/ui/braille.rs:77 treats non-finite range bounds as a degenerate constant range), so this is display corruption from untrusted input rather than a crash/DoS — hence MEDIUM, not CRITICAL.

Fix (4b68bba, src/ui/scale.rs only):

power_range limit filter now requires w.is_finite() && w > 0.0; a bogus limit falls back to the nice-rounded observed peak.
nice_ceil guards its result so it can never return a non-finite ceiling.
Added regression tests: power_range_ignores_non_finite_limit (covers inf/Inf/infinity/-inf/NaN/nan) and nice_ceil_result_is_always_finite (covers f64::MAX).

Reviewed and clear

Temperature math — thresholds are Option<u32> → f64::from is exact and non-negative; filtered to > 30.0 with a 100°C fallback. No NaN, no inversion, no divide-by-zero.
nice_ceil / log10 / powf — zero/negative/NaN/inf inputs are short-circuited to 1.0; subnormal-to-0.0 and overflow-to-inf cases are unreachable for real inputs (callers floor at 8/10 W) and are now defended regardless.
history_peak — already filters non-finite samples, so NaN/inf in power_consumption/ane_utilization history never poisons the peak.
Render hot path — per-frame cost is O(1) range math plus a single O(n≤100) history_peak pass that replaces what the old auto-range scanned internally; one small scale_badge String per row. No new allocations or super-linear scans; no regression.
ioreport.rs — collapsing the nested if item.channel == "GPUPH" into a match guard is behavior-preserving: a non-GPUPH channel falls through to _ => {} either way.

Gate

cargo fmt --check, cargo clippy --lib (0 warnings), cargo test --lib (996 passed, 0 failed; +2 new) all pass.

Address pr-reviewer findings on #237: - [MEDIUM] Package power is summed across all GPUs, but power_range anchored the ceiling to only the first GPU's limit. On a multi-GPU NVIDIA node (all-smi's common case) the summed power far exceeds one GPU's limit, pinning the Pkg Power sparkline flat at the top. power_range now takes &[GpuInfo] and sums every GPU's enforced limit, falling back to the nice-rounded peak when any GPU lacks one. - [LOW] Extract gpu_power_limit() so each power-limit key is parsed and validated independently in current -> max -> default order; a present-but-invalid power_limit_current no longer masks a valid power_limit_max. Adds 3 regression tests (multi-GPU sum, mixed-limit fallback, per-key fallback).

inureyes · 2026-05-25T05:21:32Z

Addressed `pr-reviewer` findings (commit `01e7a02`)

[MEDIUM] Multi-GPU aggregate power clip — power_range now takes &[GpuInfo] and anchors the ceiling to the sum of every GPU's enforced limit (package power is itself summed across GPUs). When any GPU lacks a valid limit it falls back to the nice-rounded observed peak, so an N-GPU NVIDIA node no longer pins Pkg Power flat at the top.
[LOW] Power-limit key fallback short-circuit — extracted gpu_power_limit(); each key is now parsed + validated independently in current → max → default order, so a present-but-invalid power_limit_current no longer masks a valid power_limit_max.

Added 3 regression tests (multi-GPU sum, mixed-limit fallback, per-key fallback).

Gates: cargo fmt --check ✅ · cargo clippy --lib → 0 warnings ✅ · cargo test --lib → 999 passed, 0 failed ✅

inureyes added type:bug Something isn't working mode:view View mode related labels May 25, 2026

inureyes merged commit 76edd9f into main May 25, 2026
4 checks passed

inureyes deleted the fix/issue-236-sparkline-fixed-ranges branch May 25, 2026 05:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges#237

fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges#237
inureyes merged 3 commits into
mainfrom
fix/issue-236-sparkline-fixed-ranges

inureyes commented May 25, 2026

Uh oh!

inureyes commented May 25, 2026

Uh oh!

inureyes commented May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented May 25, 2026

Summary

Changes

Behavior

Test plan

Uh oh!

inureyes commented May 25, 2026

Security & performance review

Finding addressed (MEDIUM) — non-finite power limit reached the axis

Reviewed and clear

Gate

Uh oh!

inureyes commented May 25, 2026

Addressed pr-reviewer findings (commit 01e7a02)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Addressed `pr-reviewer` findings (commit `01e7a02`)