fix: anchor temperature/power/ANE sparklines to fixed Y-axis ranges#237
Conversation
The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis to the per-frame window [min,max]. With only four braille vertical levels this amplified noise (a ±1C wiggle filled the full height), shifted the baseline every frame as the window slid, and collapsed near-constant series to the bottom row -- making a hot reading look identical to a cool one. Add src/ui/scale.rs with fixed, domain-meaningful ranges: - temperature: 30C floor + thermal-threshold ceiling (100C fallback) - power: 0 + enforced power limit, else nice_ceil over the peak - ANE: 0 + nice_ceil(max(peak, 8W)) Wire the helpers into gpu_sparkline_panel, local_header, and remote_sparkline_panel, and switch the GPU Metrics badge from the jittery observed window min/max to the stable fixed axis. CPU%, memory%, and per-core bars were already fixed at (0,100) and are unchanged. Also collapse a clippy collapsible_match in ioreport.rs. Closes #236
The Pkg Power / summary Pwr sparkline axis is anchored to the device's enforced power limit (`power_range`), and on remote hosts that limit originates from a scraped Prometheus value. Rust's `f64` parsing accepts `"inf"`, `"infinity"`, and `"NaN"`, so a malformed or hostile remote endpoint emitting `gpu_power_limit_max_watts inf` round-trips through the metrics parser into `gpu.detail`, and `power_range`'s `w > 0.0` filter let it through (`inf > 0.0` is true). The helper then returned `(0, inf)`, which `scale_badge` rendered as a `0-inf` axis legend. The downstream braille renderer's non-finite guard prevented a panic, but the corrupted badge still surfaced from untrusted input. A second, narrower route reached the same `(0, inf)` result through the fallback path: a pathologically large but finite `power_consumption` scrape (near `f64::MAX`) survives the finite filter in `history_peak`, and `nice_ceil` then overflowed `nice * pow` to infinity. Both routes are now closed in `src/ui/scale.rs`: the power-limit filter requires a positive *finite* value (a bogus limit falls back to the nice-rounded observed peak), and `nice_ceil` guards its result so it can never return a non-finite ceiling. Added regression tests for non-finite limit strings and `nice_ceil` overflow near `f64::MAX`.
Security & performance reviewReviewed the diff against Finding addressed (MEDIUM) — non-finite power limit reached the axis
A second, narrower route hit the same Neither route can panic (the braille renderer at Fix (
Reviewed and clear
Gate
|
Address pr-reviewer findings on #237: - [MEDIUM] Package power is summed across all GPUs, but power_range anchored the ceiling to only the first GPU's limit. On a multi-GPU NVIDIA node (all-smi's common case) the summed power far exceeds one GPU's limit, pinning the Pkg Power sparkline flat at the top. power_range now takes &[GpuInfo] and sums every GPU's enforced limit, falling back to the nice-rounded peak when any GPU lacks one. - [LOW] Extract gpu_power_limit() so each power-limit key is parsed and validated independently in current -> max -> default order; a present-but-invalid power_limit_current no longer masks a valid power_limit_max. Adds 3 regression tests (multi-GPU sum, mixed-limit fallback, per-key fallback).
Addressed
|
Summary
The GPU Metrics box and summary-bar sparklines auto-scaled their Y-axis to the per-frame data window's
[min, max]. With only four braille vertical levels this made the absolute-magnitude graphs (temperature, power, ANE) meaningless: noise was amplified to full height, the baseline shifted every frame as the 100-sample window slid, and near-constant series collapsed to the bottom row — so a blazing 90°C looked identical to a cool 35°C. The accompanying min/max badge jittered every frame too.This replaces per-window auto-ranging with fixed, domain-meaningful axes.
Changes
src/ui/scale.rs— fixed-range helpers:temp_range(gpu)→(30, ceiling); ceiling = reported GPU thermal threshold (slowdown → max_operating → shutdown), else100°Cfallback (also used for CPU temp, which reports no threshold).power_range(gpu, history)→(0, ceiling); ceiling = enforced power limit fromgpu.detail(NVIDIA/Gaudi), elsenice_ceil(peak)floored at 10 W.ane_range(history)→(0, nice_ceil(max(peak, 8 W))).nice_ceil()rounds up to1/2/5 × 10ⁿso the fallback axis doesn't drift with small peak changes.scale_badge()formats the fixed axis for display.gpu_sparkline_panel.rs— GPU Temp / ANE / Pkg Power now use fixed axes; the badge shows the fixed axis (e.g.30-83) instead of the jittery observed window min/max. Removed the now-unusedmin_max_badge.local_header.rs— top summary Tmp (system temperature) and Pwr now use fixed axes.remote_sparkline_panel.rs— remote GPU/CPU Temp now use fixed axes.ioreport.rs— collapsed aclippy::collapsible_matchinto a match guard (drive-by; it was the only remaining clippy warning in the lib).CPU%, memory%, and per-core bars were already pinned to
(0,100)and are unchanged.Behavior
Test plan
cargo fmt --check✅cargo clippy --lib→ 0 warnings ✅cargo test --lib→ 994 passed, 0 failed, including 13 newscaletests (threshold/limit priority,nice_ceilrounding, axis stability under window shift, fallbacks).Closes #236