Add extended temperature thresholds and performance mode metrics #130

Description

@inureyes

Problem / Background

The nvml-wrapper 0.12.0 release added new temperature and performance APIs, including set_temperature_threshold, performance_modes, and profile_info. Currently, all-smi only reports basic GPU temperature without exposing threshold context (slowdown, shutdown, max operating) or performance state information. This limits the ability to proactively detect thermal issues or understand GPU performance profiles.

Proposed Solution

Leverage the new nvml-wrapper 0.12.0 APIs to surface richer thermal and performance monitoring data in both the TUI view and Prometheus API export.

Scope

  • Temperature thresholds: Read slowdown, shutdown, and max operating temperature thresholds per GPU via NVML
  • Performance modes: Read available performance modes and the current performance state (P-state) for each GPU
  • GPU profile info: Read GPU profile information for performance characterization
  • TUI integration: Display thermal thresholds alongside current temperature in the TUI view; highlight or warn when current temperature approaches slowdown/shutdown limits
  • Prometheus export: Export temperature threshold values and current performance mode/state as Prometheus metrics (e.g., gpu_temperature_threshold_slowdown_celsius, gpu_performance_state)
  • Acoustic threshold: Include acoustic threshold information where the hardware/driver exposes it
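As a rough illustration of the Prometheus export item, the sketch below renders threshold and P-state gauges in the text exposition format using only the standard library. The `GpuThermalSnapshot` struct, the `gpu` label, and the shutdown, max-operating, and acoustic metric names are assumptions made for this example (only `gpu_temperature_threshold_slowdown_celsius` and `gpu_performance_state` appear in this issue); the real field and label names in all-smi may differ.

```rust
use std::fmt::Write;

/// Hypothetical per-GPU snapshot (not an all-smi type).
/// `None` means the driver or hardware did not report the value.
pub struct GpuThermalSnapshot {
    pub index: u32,
    pub slowdown_c: Option<u32>,
    pub shutdown_c: Option<u32>,
    pub max_operating_c: Option<u32>,
    pub acoustic_c: Option<u32>,
    /// P-state number, if reported.
    pub p_state: Option<u32>,
}

/// Emit one gauge family; the whole family is skipped when no GPU reports it.
fn render_metric(
    out: &mut String,
    name: &str,
    help: &str,
    gpus: &[GpuThermalSnapshot],
    get: fn(&GpuThermalSnapshot) -> Option<u32>,
) {
    let samples: Vec<(u32, u32)> = gpus
        .iter()
        .filter_map(|g| get(g).map(|v| (g.index, v)))
        .collect();
    if samples.is_empty() {
        return;
    }
    // HELP and TYPE are written once per metric, not once per GPU.
    writeln!(out, "# HELP {name} {help}").unwrap();
    writeln!(out, "# TYPE {name} gauge").unwrap();
    for (index, value) in samples {
        writeln!(out, "{name}{{gpu=\"{index}\"}} {value}").unwrap();
    }
}

/// Render all threshold and performance-state metrics for the given GPUs.
pub fn render_metrics(gpus: &[GpuThermalSnapshot]) -> String {
    let mut out = String::new();
    render_metric(&mut out, "gpu_temperature_threshold_slowdown_celsius",
        "Slowdown temperature threshold", gpus, |g| g.slowdown_c);
    // The three names below are assumed, following the pattern above.
    render_metric(&mut out, "gpu_temperature_threshold_shutdown_celsius",
        "Shutdown temperature threshold", gpus, |g| g.shutdown_c);
    render_metric(&mut out, "gpu_temperature_threshold_max_operating_celsius",
        "Maximum operating temperature threshold", gpus, |g| g.max_operating_c);
    render_metric(&mut out, "gpu_temperature_threshold_acoustic_celsius",
        "Acoustic temperature threshold", gpus, |g| g.acoustic_c);
    render_metric(&mut out, "gpu_performance_state",
        "Current performance state (P-state)", gpus, |g| g.p_state);
    out
}
```

Modeling each value as `Option<u32>` lets an unsupported threshold drop out of the export entirely instead of being reported as a misleading zero.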

Acceptance Criteria

  • Temperature thresholds (slowdown, shutdown, max operating) are read per GPU on NVIDIA platforms
  • Current performance state (P-state) is read and exposed per GPU
  • Available performance modes and GPU profile info are retrieved where supported
  • TUI displays temperature thresholds alongside current temperature
  • TUI highlights or warns when temperature is within a configurable margin of slowdown/shutdown thresholds
  • Prometheus /metrics endpoint exports temperature threshold and performance state metrics
  • Acoustic threshold data is included when available from the hardware
  • Graceful fallback when APIs are unsupported (older drivers, non-NVIDIA hardware)
  • Feature is fully integrated into the existing data collection, TUI rendering, and API export code paths
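One possible shape for the "configurable margin" criterion is sketched below. The `ThermalStatus` enum and `classify` function are hypothetical names for this example, and the rule that shutdown proximity takes precedence over slowdown proximity is an assumption, not something this issue specifies.

```rust
/// Hypothetical status the TUI could map to a color or warning marker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThermalStatus {
    /// Thresholds are known and the temperature is outside the margin.
    Normal,
    /// Within `margin_c` degrees of (or above) the slowdown threshold.
    NearSlowdown,
    /// Within `margin_c` degrees of (or above) the shutdown threshold.
    NearShutdown,
    /// No thresholds available (older driver, unsupported hardware).
    Unknown,
}

/// Classify the current temperature against optional thresholds.
/// All values are in degrees Celsius.
pub fn classify(
    current_c: u32,
    slowdown_c: Option<u32>,
    shutdown_c: Option<u32>,
    margin_c: u32,
) -> ThermalStatus {
    // saturating_add avoids overflow for very large margins.
    let near = |limit: u32| current_c.saturating_add(margin_c) >= limit;
    match (slowdown_c, shutdown_c) {
        (None, None) => ThermalStatus::Unknown,
        // Assumed precedence: the more severe condition wins.
        (_, Some(limit)) if near(limit) => ThermalStatus::NearShutdown,
        (Some(limit), _) if near(limit) => ThermalStatus::NearSlowdown,
        _ => ThermalStatus::Normal,
    }
}
```

For example, with a slowdown threshold of 90, a shutdown threshold of 95, and a margin of 5, a reading of 86 classifies as `NearSlowdown` and a reading of 91 as `NearShutdown`. Returning `Unknown` when no threshold is available gives the TUI a way to fall back to the plain temperature display.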

Technical Considerations

  • Requires nvml-wrapper >= 0.12.0; verify the current dependency version and update if needed
  • Temperature threshold reads are typically non-privileged and low-overhead, safe for periodic polling
  • Performance state queries may not be available on all GPU SKUs; handle NotSupported errors gracefully
  • Consider caching threshold values (they rarely change) to avoid unnecessary NVML calls each interval
  • Apple Silicon and Jetson paths are unaffected; guard the new code behind the NVIDIA-specific reader
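The caching and NotSupported considerations could be combined roughly as follows. This is a standard-library-only sketch: `ReadError`, `Kind`, `Thresholds`, and `ThresholdCache` are hypothetical stand-ins, and the `read` closure represents whatever NVML call the NVIDIA reader ends up using. The choice not to cache after a transient error (so the read is retried on the next interval) is an assumption about desirable behavior.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for the NVML error enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReadError {
    NotSupported,
    Other,
}

/// Which threshold to read (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Slowdown,
    Shutdown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Thresholds {
    pub slowdown_c: Option<u32>,
    pub shutdown_c: Option<u32>,
}

/// Caches thresholds per GPU index so they are not re-read every interval.
#[derive(Default)]
pub struct ThresholdCache {
    by_gpu: HashMap<u32, Thresholds>,
}

impl ThresholdCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return cached thresholds, or read them once via `read`.
    pub fn get_or_read<F>(&mut self, gpu: u32, mut read: F) -> Thresholds
    where
        F: FnMut(Kind) -> Result<u32, ReadError>,
    {
        if let Some(hit) = self.by_gpu.get(&gpu) {
            return *hit;
        }
        let mut transient = false;
        let mut one = |kind: Kind| match read(kind) {
            Ok(value) => Some(value),
            // Unsupported on this SKU or driver: treat as permanently absent.
            Err(ReadError::NotSupported) => None,
            // Any other failure: report absent now, but retry later.
            Err(ReadError::Other) => {
                transient = true;
                None
            }
        };
        let thresholds = Thresholds {
            slowdown_c: one(Kind::Slowdown),
            shutdown_c: one(Kind::Shutdown),
        };
        if !transient {
            self.by_gpu.insert(gpu, thresholds);
        }
        thresholds
    }
}
```

If thresholds can change at runtime (for example after a driver reload), the cache would also need an invalidation path, which this sketch leaves out.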
