Problem / Background
The nvml-wrapper 0.12.0 release added new temperature and performance APIs, including set_temperature_threshold, performance_modes, and profile_info. Currently, all-smi only reports basic GPU temperature without exposing threshold context (slowdown, shutdown, max operating) or performance state information. This limits the ability to proactively detect thermal issues or understand GPU performance profiles.
Proposed Solution
Leverage the new nvml-wrapper 0.12.0 APIs to surface richer thermal and performance monitoring data in both the TUI view and Prometheus API export.
Scope
- Temperature thresholds: Read slowdown, shutdown, and max operating temperature thresholds per GPU via NVML
- Performance modes: Read available performance modes and the current performance state (P-state) for each GPU
- GPU profile info: Read GPU profile information for performance characterization
- TUI integration: Display thermal thresholds alongside current temperature in the TUI view; highlight or warn when current temperature approaches slowdown/shutdown limits
- Prometheus export: Export temperature threshold values and current performance mode/state as Prometheus metrics (e.g., gpu_temperature_threshold_slowdown_celsius, gpu_performance_state)
- Acoustic threshold: Include acoustic threshold information where the hardware/driver exposes it
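As a rough illustration of the TUI warning and Prometheus export items above, the sketch below classifies a temperature against the thresholds and renders the values in the Prometheus text format. It is not all-smi's actual code: ThermalLimits, ThermalStatus, thermal_status, render_metrics, the margin_c parameter, the gpu label, and the shutdown and max-operating metric names are all hypothetical. Only gpu_temperature_threshold_slowdown_celsius and gpu_performance_state come from this proposal, and the NVML reads themselves are left out.

```rust
/// Hypothetical per-GPU snapshot. A `None` field means the driver did not
/// report that threshold.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ThermalLimits {
    pub slowdown_c: Option<u32>,
    pub shutdown_c: Option<u32>,
    pub max_operating_c: Option<u32>,
}

/// Hypothetical severity levels the TUI could map to colors.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThermalStatus {
    Normal,
    NearSlowdown,
    Slowdown,
    Shutdown,
}

/// Classifies the current temperature against the known limits.
/// `margin_c` is an assumed warning margin below the slowdown threshold.
pub fn thermal_status(current_c: u32, limits: &ThermalLimits, margin_c: u32) -> ThermalStatus {
    if limits.shutdown_c.map_or(false, |t| current_c >= t) {
        return ThermalStatus::Shutdown;
    }
    match limits.slowdown_c {
        Some(t) if current_c >= t => ThermalStatus::Slowdown,
        Some(t) if current_c >= t.saturating_sub(margin_c) => ThermalStatus::NearSlowdown,
        _ => ThermalStatus::Normal,
    }
}

/// Renders one sample line per available value in the Prometheus text format.
/// Unavailable values are skipped rather than exported as zero.
pub fn render_metrics(gpu_index: u32, limits: &ThermalLimits, p_state: Option<u32>) -> String {
    // The shutdown and max-operating names and the `gpu` label are assumptions.
    let gauges = [
        ("gpu_temperature_threshold_slowdown_celsius", limits.slowdown_c),
        ("gpu_temperature_threshold_shutdown_celsius", limits.shutdown_c),
        ("gpu_temperature_threshold_max_operating_celsius", limits.max_operating_c),
        ("gpu_performance_state", p_state),
    ];
    let mut out = String::new();
    for &(name, value) in gauges.iter() {
        if let Some(v) = value {
            out.push_str(&format!("{}{{gpu=\"{}\"}} {}\n", name, gpu_index, v));
        }
    }
    out
}
```

Skipping missing values keeps a GPU that lacks a threshold from appearing to have a 0 °C limit. A real exporter would likely also emit HELP and TYPE lines and reuse whatever labels the existing metrics carry.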
Acceptance Criteria
- /metrics endpoint exports temperature threshold and performance state metrics
Technical Considerations
- Requires nvml-wrapper >= 0.12.0; verify the current dependency version and update if needed
- Temperature threshold reads are typically non-privileged and low-overhead, safe for periodic polling
- Performance state queries may not be available on all GPU SKUs; handle NotSupported errors gracefully
- Consider caching threshold values (they rarely change) to avoid unnecessary NVML calls each interval
- Apple Silicon and Jetson paths are unaffected; guard the new code behind the NVIDIA-specific reader
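The caching and NotSupported considerations above could be combined along these lines. This is a minimal sketch under stated assumptions, not the project's implementation: ReadError is a stand-in for the NVML error type, and ThresholdCache, get_or_read, and the TTL-based refresh are hypothetical. The actual NVML call is passed in as a closure so the sketch does not depend on any particular nvml-wrapper signature.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for the NVML error type (hypothetical); only the unsupported
/// case gets special handling here.
#[derive(Debug, Clone, PartialEq)]
pub enum ReadError {
    NotSupported,
    Other(String),
}

/// Caches one threshold per GPU index. A cached `None` records "not
/// supported", so the driver is not queried again on every interval.
pub struct ThresholdCache {
    ttl: Duration,
    entries: HashMap<u32, (Instant, Option<u32>)>,
}

impl ThresholdCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Returns the cached value, or calls `read` when the entry is missing
    /// or older than the TTL. Errors other than NotSupported are passed to
    /// the caller and are not cached.
    pub fn get_or_read<F>(&mut self, gpu_index: u32, read: F) -> Result<Option<u32>, ReadError>
    where
        F: FnOnce() -> Result<u32, ReadError>,
    {
        if let Some((stored_at, value)) = self.entries.get(&gpu_index) {
            if stored_at.elapsed() < self.ttl {
                return Ok(*value);
            }
        }
        let value = match read() {
            Ok(celsius) => Some(celsius),
            Err(ReadError::NotSupported) => None,
            Err(other) => return Err(other),
        };
        self.entries.insert(gpu_index, (Instant::now(), value));
        Ok(value)
    }
}
```

Caching the unsupported result as well as the successful one means a SKU without the feature costs one failed call per TTL window instead of one per polling interval. Transient errors are deliberately left uncached so the next interval retries them.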