Problem
Currently, all GPU/NPU readers fetch static values (driver versions, device information, etc.) on every get_gpu_info() call, even though these values never change during runtime. This causes unnecessary overhead and repeated system calls.
Call Frequency
- API mode: Every 3 seconds (default)
- Local mode: Every 1-2 seconds
- Remote view: Every 3-6 seconds
Impact:
- 1 hour: 1,200-3,600 redundant calls (at 3-second and 1-second intervals, respectively)
- 24 hours: 28,800-86,400 redundant calls
- Multiplied by the number of GPUs/NPUs per system
Static Values by Platform
NVIDIA GPUs (src/device/readers/nvidia.rs)
Currently fetched on every call:
- ✗ Driver version: nvml.sys_driver_version()
- ✗ CUDA version: nvml.sys_cuda_driver_version()
- ✗ Device details (per-device, but static):
  - Brand, Architecture, PCI info
  - Compute capability
  - Multi-GPU board info
  - ECC mode
  - Persistence mode
  - etc.
Location: Lines 206-215, 281-340
AMD GPUs (src/device/readers/amd.rs)
Currently fetched on every call (after #63 implementation):
- ✗ Driver version: device_handle.get_drm_version_struct()
- ✗ ROCm version: libamdgpu_top::get_rocm_version()
- ✗ Device details (per-device, but static):
  - Device Name, ASIC Name
  - Device ID, Revision ID
  - VBIOS version and date
  - PCI info (max GPU link, max system link)
  - Power cap limits (min, max, default)
  - etc.
Location: Lines 210-290 (approximate)
Apple Silicon (src/device/readers/apple_silicon.rs)
Potentially static values:
- ✗ GPU name/model
- ✗ GPU core count
- ✗ CPU model
- ✗ Total memory
NPU Platforms
Tenstorrent (src/device/readers/tenstorrent.rs):
- ✗ Board type
- ✗ Coordinate info (static per device)
- ✗ TDP values
Rebellions (src/device/readers/rebellions.rs):
- ✗ Driver version (KMD version)
- ✗ Device model
Furiosa (src/device/readers/furiosa.rs):
- ✗ Driver version
- ✗ Firmware version
- ✗ Device architecture
NVIDIA Jetson (src/device/readers/nvidia_jetson.rs):
- ✗ Device name
- ✗ CUDA version
- ✗ JetPack version
- ✗ L4T version
Proposed Solution
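Use std::sync::OnceLock (stable since Rust 1.70) to cache static values on first access. LazyLock (Rust 1.80+) is an option only where the initializer needs no runtime arguments such as the NVML handle. With OnceLock: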
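use std::sync::OnceLock;

// Assumes the nvml-wrapper crate, which provides Nvml and the version helpers.
use nvml_wrapper::{cuda_driver_version_major, cuda_driver_version_minor, Nvml};

pub struct NvidiaGpuReader {
    // Each cell is filled on first use and is read-only afterwards.
    driver_version: OnceLock<String>,
    cuda_version: OnceLock<String>,
    // DeviceStaticInfo holds the per-device static fields (type still to be defined).
    device_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl NvidiaGpuReader {
    pub fn new() -> Self {
        Self {
            driver_version: OnceLock::new(),
            cuda_version: OnceLock::new(),
            device_cache: OnceLock::new(),
        }
    }

    // Queries NVML on the first call only; later calls return the cached string.
    fn get_driver_version(&self, nvml: &Nvml) -> String {
        self.driver_version
            .get_or_init(|| {
                nvml.sys_driver_version().unwrap_or_else(|_| "Unknown".to_string())
            })
            .clone()
    }

    // Formats the CUDA driver version as "major.minor" once and caches the result.
    fn get_cuda_version(&self, nvml: &Nvml) -> String {
        self.cuda_version
            .get_or_init(|| {
                let version = nvml.sys_cuda_driver_version().unwrap_or(0);
                format!(
                    "{}.{}",
                    cuda_driver_version_major(version),
                    cuda_driver_version_minor(version)
                )
            })
            .clone()
    }
}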
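One caveat with get_or_init as used above: if the very first query fails, the fallback ("Unknown" or "0.0") is cached for the lifetime of the reader. The following is a minimal sketch of one way around that, caching only successful results. CachedStatic, get_or_fetch, and the fetch closure are hypothetical names introduced for illustration; they are not part of the existing readers.

```rust
use std::sync::OnceLock;

/// Hypothetical helper: caches a value that never changes once it is known.
pub struct CachedStatic {
    value: OnceLock<String>,
}

impl CachedStatic {
    pub const fn new() -> Self {
        Self { value: OnceLock::new() }
    }

    /// Returns the cached value if present. Otherwise runs `fetch` and caches
    /// the result only on success; on failure the fallback is returned without
    /// being cached, so the next call retries.
    pub fn get_or_fetch<E>(
        &self,
        fetch: impl FnOnce() -> Result<String, E>,
        fallback: &str,
    ) -> String {
        if let Some(cached) = self.value.get() {
            return cached.clone();
        }
        match fetch() {
            // If another thread stored a value in the meantime, that value wins.
            Ok(fresh) => self.value.get_or_init(|| fresh).clone(),
            Err(_) => fallback.to_string(),
        }
    }
}
```

The trade-off is that two threads racing on an empty cache may both run fetch once; for idempotent driver queries that is usually acceptable.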
Alternative: Lazy Static Pattern
For Rust versions before 1.70 (where OnceLock is unavailable) or simpler cases, a Mutex<Option<T>> field works as well:
use std::sync::Mutex;

pub struct NvidiaGpuReader {
    // None until the first successful fetch, Some(value) afterwards.
    driver_version: Mutex<Option<String>>,
    cuda_version: Mutex<Option<String>>,
}
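The struct above only declares the cache slots. A minimal sketch of how such a slot might be read and filled is shown below; MutexCachedVersion and the fetch closure are hypothetical stand-ins for the reader and its driver query.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a reader that caches one static value.
pub struct MutexCachedVersion {
    driver_version: Mutex<Option<String>>,
}

impl MutexCachedVersion {
    pub fn new() -> Self {
        Self { driver_version: Mutex::new(None) }
    }

    /// Returns the cached driver version, calling `fetch` only while the slot is empty.
    pub fn driver_version(&self, fetch: impl FnOnce() -> Option<String>) -> String {
        // Recover the guard even if another thread panicked while holding the lock.
        let mut slot = self
            .driver_version
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        if slot.is_none() {
            // A failed fetch leaves the slot empty, so a later call tries again.
            *slot = fetch();
        }
        slot.clone().unwrap_or_else(|| "Unknown".to_string())
    }
}
```

Unlike OnceLock, every read takes the lock, and the lock is held while fetch runs, so this variant fits best where reads are infrequent.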
Implementation Tasks
Phase 1: NVIDIA GPU Reader
File: src/device/readers/nvidia.rs (touches the NvidiaGpuReader struct; get_gpu_info_nvml() to use cached values)
Phase 2: AMD GPU Reader
File: src/device/readers/amd.rs (touches the AmdGpuReader struct; collect_gpu_info() to use cached values)
Phase 3: Apple Silicon Reader
File: src/device/readers/apple_silicon.rs
Phase 4: NPU Readers
Files: src/device/readers/{tenstorrent,rebellions,furiosa,nvidia_jetson}.rs
Phase 5: Testing & Validation
Expected Benefits
Performance:
- Reduce system calls by ~90% for static data
- Lower CPU overhead in monitoring loops
- Faster metrics collection cycles
Efficiency:
- Fewer NVML/driver API calls
- Reduced lock contention
- Better scalability for large clusters
Code Quality:
- Clear separation of static vs. dynamic data
- More maintainable reader implementations
- Better encapsulation
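To illustrate the separation of static and dynamic data, here is a rough sketch in which static fields are read once per reader and dynamic fields on every poll. The field selection, SplitReader, and the two closures are assumptions made for the example; the real readers would call their platform APIs instead.

```rust
use std::sync::OnceLock;

/// Values read once per device (hypothetical field selection).
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceStaticInfo {
    pub name: String,
    pub pci_bus_id: String,
}

/// Values re-read on every poll (hypothetical field selection).
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceDynamicInfo {
    pub utilization_percent: u32,
    pub temperature_celsius: u32,
}

/// Hypothetical reader that keeps the static part in a per-reader cache.
pub struct SplitReader {
    static_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl SplitReader {
    pub fn new() -> Self {
        Self { static_cache: OnceLock::new() }
    }

    /// Returns one (static, dynamic) pair per device. `read_static` runs only
    /// on the first poll; `read_dynamic` runs for every device on every poll.
    pub fn poll(
        &self,
        read_static: impl FnOnce() -> Vec<DeviceStaticInfo>,
        read_dynamic: impl Fn(usize) -> DeviceDynamicInfo,
    ) -> Vec<(DeviceStaticInfo, DeviceDynamicInfo)> {
        let statics = self.static_cache.get_or_init(read_static);
        statics
            .iter()
            .enumerate()
            .map(|(index, info)| (info.clone(), read_dynamic(index)))
            .collect()
    }
}
```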
Compatibility Notes
- Reader Factory: Reader instances are created once and reused
- API Mode: Background task creates readers once at startup
- Local/Remote Mode: Readers persist for the session
- Thread Safety: Use OnceLock (thread-safe) or Mutex for caching
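The thread-safety point can be checked with a small experiment: several threads read the same OnceLock, and a counter records how often the initializer actually runs. count_initializations and the version string are placeholders for this sketch only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Spawns `threads` workers that all read the same cached value and
/// returns how many times the initializer ran.
pub fn count_initializations(threads: usize) -> usize {
    let cache: Arc<OnceLock<String>> = Arc::new(OnceLock::new());
    let init_calls = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let init_calls = Arc::clone(&init_calls);
            thread::spawn(move || {
                let value = cache.get_or_init(|| {
                    init_calls.fetch_add(1, Ordering::SeqCst);
                    "550.54.14".to_string() // placeholder for a driver query
                });
                value.clone()
            })
        })
        .collect();

    for handle in handles {
        let value = handle.join().expect("worker thread panicked");
        assert_eq!(value, "550.54.14");
    }
    init_calls.load(Ordering::SeqCst)
}
```

OnceLock guarantees that only one initializer executes even when many threads call get_or_init concurrently, so the function returns 1 regardless of the thread count.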
Performance Impact Estimation
Before optimization (per GPU, 3-second interval):
- 20 calls/minute × 60 minutes = 1,200 calls/hour
- Each call: driver version + CUDA version + ~20 device details
- Total: ~26,400 redundant API calls per GPU per hour
After optimization (per GPU):
- First call: All static values fetched
- Subsequent calls: Only dynamic values (utilization, memory, temp, etc.)
- Reduction: static values are fetched once (~22 calls per GPU) instead of ~26,400 times per hour
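The "before" figures can be reproduced with a few lines of arithmetic. This sketch takes the polling interval and the number of static queries per poll (2 version queries plus ~20 device details, as estimated above) as inputs; the function names are illustrative.

```rust
/// Number of polling cycles per hour for a given interval in seconds.
pub fn polls_per_hour(interval_secs: u64) -> u64 {
    assert!(interval_secs > 0, "interval must be positive");
    3_600 / interval_secs
}

/// Static-value API calls per GPU per hour when nothing is cached.
pub fn uncached_static_calls_per_hour(interval_secs: u64, static_calls_per_poll: u64) -> u64 {
    polls_per_hour(interval_secs) * static_calls_per_poll
}
```

With a 3-second interval and 22 static queries per poll, this gives 1,200 polls and 26,400 static calls per GPU per hour; with caching, the same 22 queries run once per reader lifetime.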
Related Issues
Acceptance Criteria