Optimize GPU readers by caching static values to avoid redundant API calls #64

Description

@inureyes

Problem

Currently, all GPU/NPU readers fetch static values (driver versions, device information, etc.) on every get_gpu_info() call, even though these values never change during runtime. This causes unnecessary overhead and repeated system calls.

Call Frequency

  • API mode: Every 3 seconds (default)
  • Local mode: Every 1-2 seconds
  • Remote view: Every 3-6 seconds

Impact:

  • 1 hour: 1,200-3,600 redundant calls (at 3-second and 1-second intervals, respectively)
  • 24 hours: 28,800-86,400 redundant calls
  • Multiplied by number of GPUs/NPUs per system

Static Values by Platform

NVIDIA GPUs (src/device/readers/nvidia.rs)

Currently fetched on every call:

  • ✗ Driver version: nvml.sys_driver_version()
  • ✗ CUDA version: nvml.sys_cuda_driver_version()
  • ✗ Device details (per-device, but static):
    • Brand, Architecture, PCI info
    • Compute capability
    • Multi-GPU board info
    • ECC mode
    • Persistence mode (can be toggled at runtime with nvidia-smi -pm, so it may need to stay dynamic)
    • etc.

Location: Lines 206-215, 281-340

AMD GPUs (src/device/readers/amd.rs)

Currently fetched on every call (after #63 implementation):

  • ✗ Driver version: device_handle.get_drm_version_struct()
  • ✗ ROCm version: libamdgpu_top::get_rocm_version()
  • ✗ Device details (per-device, but static):
    • Device Name, ASIC Name
    • Device ID, Revision ID
    • VBIOS version and date
    • PCI info (max GPU link, max system link)
    • Power cap limits (min, max, default)
    • etc.

Location: Lines 210-290 (approximate)

Apple Silicon (src/device/readers/apple_silicon.rs)

Potentially static values:

  • ✗ GPU name/model
  • ✗ GPU core count
  • ✗ CPU model
  • ✗ Total memory

NPU Platforms

Tenstorrent (src/device/readers/tenstorrent.rs):

  • ✗ Board type
  • ✗ Coordinate info (static per device)
  • ✗ TDP values

Rebellions (src/device/readers/rebellions.rs):

  • ✗ Driver version (KMD version)
  • ✗ Device model

Furiosa (src/device/readers/furiosa.rs):

  • ✗ Driver version
  • ✗ Firmware version
  • ✗ Device architecture

NVIDIA Jetson (src/device/readers/nvidia_jetson.rs):

  • ✗ Device name
  • ✗ CUDA version
  • ✗ JetPack version
  • ✗ L4T version

Proposed Solution

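Use Rust's std::sync::OnceLock (stable since Rust 1.70) to cache static values on first access. std::sync::LazyLock (Rust 1.80+) is an option where the initializer needs no arguments: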

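use std::sync::OnceLock;

// Assumes the nvml_wrapper crate, which provides Nvml and the version helpers.
use nvml_wrapper::{cuda_driver_version_major, cuda_driver_version_minor, Nvml};

pub struct NvidiaGpuReader {
    // Each cell is filled on first access and never changes afterwards.
    driver_version: OnceLock<String>,
    cuda_version: OnceLock<String>,
    // DeviceStaticInfo (not shown) would hold the per-device static fields.
    device_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl NvidiaGpuReader {
    pub fn new() -> Self {
        Self {
            driver_version: OnceLock::new(),
            cuda_version: OnceLock::new(),
            device_cache: OnceLock::new(),
        }
    }

    // Queries NVML on the first call only; later calls clone the cached string.
    // Note: if the first query fails, "Unknown" is cached for the reader's lifetime.
    fn get_driver_version(&self, nvml: &Nvml) -> String {
        self.driver_version.get_or_init(|| {
            nvml.sys_driver_version().unwrap_or_else(|_| "Unknown".to_string())
        }).clone()
    }

    // Same pattern for the CUDA driver version, formatted as "major.minor".
    fn get_cuda_version(&self, nvml: &Nvml) -> String {
        self.cuda_version.get_or_init(|| {
            let version = nvml.sys_cuda_driver_version().unwrap_or(0);
            format!(
                "{}.{}",
                cuda_driver_version_major(version),
                cuda_driver_version_minor(version)
            )
        }).clone()
    }
}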
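
The device_cache field is declared above but not yet used. A minimal sketch of how per-device static info could be cached with the same OnceLock pattern follows. DeviceStaticInfo's fields, the StaticInfoSource trait, CachedReader, and CountingSource are all hypothetical names introduced for illustration; the trait stands in for the NVML handle so the sketch runs without a GPU.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical per-device fields that stay fixed while the process runs.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceStaticInfo {
    pub index: usize,
    pub name: String,
    pub pci_bus_id: String,
}

/// Hypothetical stand-in for the NVML handle, so the sketch runs without a GPU.
pub trait StaticInfoSource {
    fn device_count(&self) -> usize;
    fn query_static(&self, index: usize) -> DeviceStaticInfo;
}

pub struct CachedReader {
    device_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl CachedReader {
    pub fn new() -> Self {
        Self { device_cache: OnceLock::new() }
    }

    /// Queries every device on the first call, then serves the cached slice.
    pub fn static_info(&self, source: &impl StaticInfoSource) -> &[DeviceStaticInfo] {
        self.device_cache.get_or_init(|| {
            (0..source.device_count())
                .map(|i| source.query_static(i))
                .collect()
        })
    }
}

/// Fake backend that counts how often a device is queried.
pub struct CountingSource {
    devices: usize,
    queries: AtomicUsize,
}

impl CountingSource {
    pub fn new(devices: usize) -> Self {
        Self { devices, queries: AtomicUsize::new(0) }
    }

    pub fn queries(&self) -> usize {
        self.queries.load(Ordering::SeqCst)
    }
}

impl StaticInfoSource for CountingSource {
    fn device_count(&self) -> usize {
        self.devices
    }

    fn query_static(&self, index: usize) -> DeviceStaticInfo {
        self.queries.fetch_add(1, Ordering::SeqCst);
        DeviceStaticInfo {
            index,
            name: format!("GPU {index}"),
            pci_bus_id: format!("0000:{index:02x}:00.0"),
        }
    }
}
```

With a two-device CountingSource, calling static_info() repeatedly leaves queries() at 2, one query per device. One caveat of this design: the cache is never invalidated, so a change in device count (for example hot-plug or a GPU falling off the bus) would not be noticed until the reader is recreated.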

Alternative: Mutex<Option<T>> Pattern

For Rust versions before 1.70 (which lack std::sync::OnceLock) or simpler cases:

use std::sync::Mutex;

pub struct NvidiaGpuReader {
    driver_version: Mutex<Option<String>>,
    cuda_version: Mutex<Option<String>>,
}
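
The struct above only declares the fields. A minimal sketch of the matching get-or-fetch accessor is shown here, assuming a hypothetical VersionCache type and a caller-supplied fetch closure in place of the real NVML call. Unlike OnceLock::get_or_init, this variant can choose not to cache a failed lookup, so a transient error on the first poll is retried on the next one.

```rust
use std::sync::Mutex;

/// Hypothetical cache holder; in the reader this would be a field of the struct.
pub struct VersionCache {
    driver_version: Mutex<Option<String>>,
}

impl VersionCache {
    pub fn new() -> Self {
        Self { driver_version: Mutex::new(None) }
    }

    /// Returns the cached value, or runs `fetch` and stores a successful result.
    /// A failed fetch is not cached, so the next call tries again.
    pub fn driver_version<F>(&self, fetch: F) -> String
    where
        F: FnOnce() -> Result<String, String>,
    {
        // Recover the guard even if another thread panicked while holding it.
        let mut slot = self
            .driver_version
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());

        if let Some(cached) = slot.as_ref() {
            return cached.clone();
        }

        match fetch() {
            Ok(version) => {
                *slot = Some(version.clone());
                version
            }
            // Report a placeholder but leave the slot empty for a later retry.
            Err(_) => "Unknown".to_string(),
        }
    }
}
```

The trade-off is that every read takes the mutex, whereas OnceLock reads are lock-free once the value is set. At polling intervals of a second or more this cost is unlikely to matter.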

Implementation Tasks

Phase 1: NVIDIA GPU Reader

File: src/device/readers/nvidia.rs

  • Add caching fields to NvidiaGpuReader struct
  • Cache driver version (first call only)
  • Cache CUDA version (first call only)
  • Cache per-device static info (brand, arch, PCI, etc.)
  • Update get_gpu_info_nvml() to use cached values
  • Verify fallback to nvidia-smi still works

Phase 2: AMD GPU Reader

File: src/device/readers/amd.rs

  • Add caching fields to AmdGpuReader struct
  • Cache driver version (first call only)
  • Cache ROCm version (first call only)
  • Cache per-device static info
  • Update collect_gpu_info() to use cached values

Phase 3: Apple Silicon Reader

File: src/device/readers/apple_silicon.rs

  • Identify truly static values vs. dynamic values
  • Add caching for static system info
  • Cache GPU core count, CPU model, total memory

Phase 4: NPU Readers

Files: src/device/readers/{tenstorrent,rebellions,furiosa,nvidia_jetson}.rs

  • Tenstorrent: Cache board type, TDP, coordinate info
  • Rebellions: Cache driver version, device model
  • Furiosa: Cache driver version, firmware version, architecture (CLI and RS methods)
  • NVIDIA Jetson: Cache device name, CUDA version, JetPack version, L4T version

Phase 5: Testing & Validation

  • Unit tests for cached value initialization
  • Verify values are fetched only once per reader instance
  • Performance benchmarks showing reduced overhead
  • Integration tests for all platforms
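
A possible shape for the "fetched only once per reader instance" test, sketched with a hypothetical CountingReader whose initializer increments a counter instead of calling a real driver API. The version string is a placeholder, not output from any actual system.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical reader with one cached value and a counter of backend fetches.
pub struct CountingReader {
    driver_version: OnceLock<String>,
    fetches: AtomicUsize,
}

impl CountingReader {
    pub fn new() -> Self {
        Self {
            driver_version: OnceLock::new(),
            fetches: AtomicUsize::new(0),
        }
    }

    /// The closure runs only while the cell is still empty.
    pub fn driver_version(&self) -> &str {
        self.driver_version.get_or_init(|| {
            self.fetches.fetch_add(1, Ordering::SeqCst);
            "550.54.14".to_string() // placeholder for the real driver query
        })
    }

    pub fn fetch_count(&self) -> usize {
        self.fetches.load(Ordering::SeqCst)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn static_value_is_fetched_once_per_instance() {
        let reader = CountingReader::new();
        // Simulate one hour of polling at a 3-second interval.
        for _ in 0..1_200 {
            reader.driver_version();
        }
        assert_eq!(reader.fetch_count(), 1);
    }

    #[test]
    fn each_instance_has_its_own_cache() {
        let first = CountingReader::new();
        let second = CountingReader::new();
        first.driver_version();
        second.driver_version();
        assert_eq!(first.fetch_count(), 1);
        assert_eq!(second.fetch_count(), 1);
    }
}
```

For the real readers, the counter would need to live in a mock of the vendor API (NVML, libamdgpu_top, and so on), which assumes those calls can be placed behind a trait or similar seam.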

Expected Benefits

  1. Performance:

    • Fetch static data once per reader instance instead of on every poll (estimated ~95% fewer static-value calls)
    • Lower CPU overhead in monitoring loops
    • Faster metrics collection cycles
  2. Efficiency:

    • Fewer NVML/driver API calls
    • Reduced lock contention
    • Better scalability for large clusters
  3. Code Quality:

    • Clear separation of static vs. dynamic data
    • More maintainable reader implementations
    • Better encapsulation

Compatibility Notes

  • Reader Factory: Reader instances are created once and reused
  • API Mode: Background task creates readers once at startup
  • Local/Remote Mode: Readers persist for the session
  • Thread Safety: Use OnceLock (thread-safe) or Mutex for caching
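
To illustrate the thread-safety point, here is a small sketch showing that OnceLock runs its initializer once even when several threads race on the first access. The function name concurrent_init_count and the placeholder version string are assumptions made for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

/// Spawns `threads` workers that all read the same cached value and returns
/// how many times the initializer actually ran.
pub fn concurrent_init_count(threads: usize) -> usize {
    let cache: Arc<OnceLock<String>> = Arc::new(OnceLock::new());
    let init_runs = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let init_runs = Arc::clone(&init_runs);
            thread::spawn(move || {
                // Racing callers block until the winning initializer finishes.
                let value = cache.get_or_init(|| {
                    init_runs.fetch_add(1, Ordering::SeqCst);
                    "550.54.14".to_string() // placeholder for a driver query
                });
                value.len()
            })
        })
        .collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }

    init_runs.load(Ordering::SeqCst)
}
```

The standard library guarantees that only one initializer is executed when get_or_init is called concurrently, so concurrent_init_count(8) returns 1. This matters if a reader instance is shared between the background collection task and request handlers.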

Performance Impact Estimation

Before optimization (per GPU, 3-second interval):

  • 20 calls/minute × 60 minutes = 1,200 calls/hour
  • Each call: driver version + CUDA version + ~20 device details
  • Total: 1,200 polls × ~22 static lookups = ~26,400 redundant API calls per GPU per hour

After optimization (per GPU):

  • First call: All static values fetched
  • Subsequent calls: Only dynamic values (utilization, memory, temp, etc.)
  • Reduction: ~95% fewer static value calls
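
The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the figure of 22 static lookups per poll (driver version + CUDA version + ~20 device details) is taken from the estimate itself.

```rust
/// Number of get_gpu_info() polls in one hour for a given polling interval.
/// Returns 0 for a zero interval instead of dividing by zero.
pub fn polls_per_hour(interval_secs: u64) -> u64 {
    3_600u64.checked_div(interval_secs).unwrap_or(0)
}

/// Static-value API calls made in the first hour of a reader's lifetime.
/// Without caching every poll repeats all lookups; with caching they run once.
pub fn static_calls_first_hour(
    interval_secs: u64,
    static_lookups_per_poll: u64,
    cached: bool,
) -> u64 {
    if cached {
        static_lookups_per_poll
    } else {
        polls_per_hour(interval_secs) * static_lookups_per_poll
    }
}
```

With a 3-second interval and 22 lookups, static_calls_first_hour(3, 22, false) gives 26,400 and static_calls_first_hour(3, 22, true) gives 22. Under these assumptions the reduction in static-value calls is therefore at least as large as the ~95% quoted, and it grows the longer the reader lives.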

Related Issues

Acceptance Criteria

  • All static values cached on first access
  • No performance regression for dynamic values
  • Memory usage increase is minimal (< 1KB per GPU)
  • All existing tests pass
  • New tests verify caching behavior
  • Documentation updated with caching details
