Optimize GPU readers by caching static values to avoid redundant API calls

## Problem

Currently, all GPU/NPU readers fetch static values (driver versions, device information, etc.) on every `get_gpu_info()` call, even though these values never change during runtime. This causes unnecessary overhead and repeated system calls.

### Call Frequency
- **API mode**: Every 3 seconds (default)
- **Local mode**: Every 1-2 seconds
- **Remote view**: Every 3-6 seconds

**Impact** (per static value, at 1-3 second polling intervals):
- 1 hour: 1,200-3,600 redundant calls
- 24 hours: 28,800-86,400 redundant calls
- Multiplied by the number of GPUs/NPUs per system
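
The figures above follow from simple arithmetic, sketched here as a helper. The function name and parameters are hypothetical and are not part of the codebase; the sketch assumes one `get_gpu_info()` call per polling interval per device.

```rust
/// Number of times a single static value is re-fetched over a period,
/// assuming one `get_gpu_info()` call per polling interval per device.
/// Hypothetical helper for illustration only.
pub fn redundant_fetches(interval_secs: u64, hours: u64, devices: u64) -> u64 {
    assert!(interval_secs > 0, "polling interval must be non-zero");
    // 3,600 seconds per hour divided by the polling interval.
    let calls_per_hour = 3_600 / interval_secs;
    calls_per_hour * hours * devices
}
```

For example, `redundant_fetches(3, 1, 1)` gives the 1,200 calls per hour quoted for the default 3-second API interval, and `redundant_fetches(1, 24, 1)` gives the 86,400 upper bound for 24 hours.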

## Static Values by Platform

### NVIDIA GPUs (`src/device/readers/nvidia.rs`)
Currently fetched on every call:
- ✗ Driver version: `nvml.sys_driver_version()`
- ✗ CUDA version: `nvml.sys_cuda_driver_version()`
- ✗ Device details (per-device, but static):
  - Brand, Architecture, PCI info
  - Compute capability
  - Multi-GPU board info
  - ECC mode
  - Persistence mode (can be changed at runtime, e.g. via `nvidia-smi -pm`, so a cached value may need refreshing)
  - etc.

**Location**: Lines 206-215, 281-340

### AMD GPUs (`src/device/readers/amd.rs`)
Currently fetched on every call (after #63 implementation):
- ✗ Driver version: `device_handle.get_drm_version_struct()`
- ✗ ROCm version: `libamdgpu_top::get_rocm_version()`
- ✗ Device details (per-device, but static):
  - Device Name, ASIC Name
  - Device ID, Revision ID
  - VBIOS version and date
  - PCI info (max GPU link, max system link)
  - Power cap limits (min, max, default)
  - etc.

**Location**: Lines 210-290 (approximate)

### Apple Silicon (`src/device/readers/apple_silicon.rs`)
Potentially static values:
- ✗ GPU name/model
- ✗ GPU core count
- ✗ CPU model
- ✗ Total memory

### NPU Platforms
**Tenstorrent** (`src/device/readers/tenstorrent.rs`):
- ✗ Board type
- ✗ Coordinate info (static per device)
- ✗ TDP values

**Rebellions** (`src/device/readers/rebellions.rs`):
- ✗ Driver version (KMD version)
- ✗ Device model

**Furiosa** (`src/device/readers/furiosa.rs`):
- ✗ Driver version
- ✗ Firmware version
- ✗ Device architecture

**NVIDIA Jetson** (`src/device/readers/nvidia_jetson.rs`):
- ✗ Device name
- ✗ CUDA version
- ✗ JetPack version
- ✗ L4T version

## Proposed Solution

Use Rust's `OnceLock` (stable since Rust 1.70) to cache static values on first access; `LazyLock` (Rust 1.80+) also works when the initializer needs no arguments:

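```rust
use std::sync::OnceLock;

// Assumes the `nvml_wrapper` crate, which exports these helpers at the crate root.
use nvml_wrapper::{cuda_driver_version_major, cuda_driver_version_minor, Nvml};

pub struct NvidiaGpuReader {
    // Each cell is written at most once, on first access.
    driver_version: OnceLock<String>,
    cuda_version: OnceLock<String>,
    // `DeviceStaticInfo` is a placeholder for the per-device static fields.
    device_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl NvidiaGpuReader {
    pub fn new() -> Self {
        Self {
            driver_version: OnceLock::new(),
            cuda_version: OnceLock::new(),
            device_cache: OnceLock::new(),
        }
    }

    /// Queries NVML on the first call only; later calls return the cached string.
    /// Note: a failed first lookup caches "Unknown" for the reader's lifetime.
    fn get_driver_version(&self, nvml: &Nvml) -> String {
        self.driver_version
            .get_or_init(|| {
                nvml.sys_driver_version()
                    .unwrap_or_else(|_| "Unknown".to_string())
            })
            .clone()
    }

    /// Formats the CUDA driver version as "major.minor" and caches the result.
    fn get_cuda_version(&self, nvml: &Nvml) -> String {
        self.cuda_version
            .get_or_init(|| {
                let version = nvml.sys_cuda_driver_version().unwrap_or(0);
                format!(
                    "{}.{}",
                    cuda_driver_version_major(version),
                    cuda_driver_version_minor(version)
                )
            })
            .clone()
    }
}
```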
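
The struct above references `DeviceStaticInfo` without defining it. One possible shape is sketched below using only the standard library. The field names, the `CachedReader` type, and the `query_all` closure (standing in for the real NVML queries) are all assumptions for illustration, not the project's actual definitions.

```rust
use std::sync::OnceLock;

/// Hypothetical container for per-device values that do not change at runtime.
#[derive(Debug, Clone, PartialEq)]
pub struct DeviceStaticInfo {
    pub name: String,
    pub architecture: String,
    pub pci_bus_id: String,
}

/// Hypothetical reader that caches one `DeviceStaticInfo` per device index.
pub struct CachedReader {
    device_cache: OnceLock<Vec<DeviceStaticInfo>>,
}

impl CachedReader {
    pub fn new() -> Self {
        Self {
            device_cache: OnceLock::new(),
        }
    }

    /// Returns the cached static info, running `query_all` only on the first call.
    /// `query_all` stands in for the real per-device driver queries.
    pub fn static_info<F>(&self, query_all: F) -> &[DeviceStaticInfo]
    where
        F: FnOnce() -> Vec<DeviceStaticInfo>,
    {
        self.device_cache.get_or_init(query_all).as_slice()
    }
}
```

Dynamic values such as utilization, memory, and temperature would still be read on every call and combined with the cached entry for the same device index.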

### Alternative: Lazy Static Pattern
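For Rust versions before 1.70, where `OnceLock` is not available in the standard library, a `Mutex<Option<T>>` field gives the same fetch-once behavior: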
```rust
use std::sync::Mutex;

pub struct NvidiaGpuReader {
    driver_version: Mutex<Option<String>>,
    cuda_version: Mutex<Option<String>>,
}
```
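
A minimal sketch of how such a `Mutex<Option<String>>` field could be read follows. The `MutexCachedReader` type and the `fetch` closure are hypothetical; the closure stands in for the real driver query.

```rust
use std::sync::Mutex;

/// Hypothetical reader using a `Mutex<Option<String>>` cache field.
pub struct MutexCachedReader {
    driver_version: Mutex<Option<String>>,
}

impl MutexCachedReader {
    pub fn new() -> Self {
        Self {
            driver_version: Mutex::new(None),
        }
    }

    /// Returns the cached value, calling `fetch` only while the slot is empty.
    pub fn driver_version<F>(&self, fetch: F) -> String
    where
        F: FnOnce() -> String,
    {
        // The lock is held across `fetch`, so concurrent callers wait
        // instead of fetching twice. A poisoned lock is recovered as-is.
        let mut slot = self
            .driver_version
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        slot.get_or_insert_with(fetch).clone()
    }
}
```

Compared with `OnceLock`, every read takes the lock, which is usually acceptable at polling intervals of a second or more.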

## Implementation Tasks

### Phase 1: NVIDIA GPU Reader
**File**: `src/device/readers/nvidia.rs`

- [x] Add caching fields to `NvidiaGpuReader` struct
- [x] Cache driver version (first call only)
- [x] Cache CUDA version (first call only)
- [x] Cache per-device static info (brand, arch, PCI, etc.)
- [x] Update `get_gpu_info_nvml()` to use cached values
- [x] Verify fallback to nvidia-smi still works

### Phase 2: AMD GPU Reader
**File**: `src/device/readers/amd.rs`

- [x] Add caching fields to `AmdGpuReader` struct
- [x] Cache driver version (first call only)
- [x] Cache ROCm version (first call only)
- [x] Cache per-device static info
- [x] Update `collect_gpu_info()` to use cached values

### Phase 3: Apple Silicon Reader
**File**: `src/device/readers/apple_silicon.rs`

- [ ] Identify truly static values vs. dynamic values
- [ ] Add caching for static system info
- [ ] Cache GPU core count, CPU model, total memory

### Phase 4: NPU Readers
**Files**: `src/device/readers/{tenstorrent,rebellions,furiosa,nvidia_jetson}.rs`

- [ ] Tenstorrent: Cache board type, TDP, coordinate info
- [x] Rebellions: Cache driver version, device model
- [x] Furiosa: Cache driver version, firmware version, architecture (CLI and RS methods)
- [x] NVIDIA Jetson: Cache device name, CUDA version, JetPack version, L4T version
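
For readers that obtain versions through a CLI call, the first lookup might fail transiently; that is an assumption here, not something this issue states. In that case it may be preferable to cache only successful lookups so a later call can retry. A minimal sketch with hypothetical names (`FallibleCache`, `get_or_try`):

```rust
use std::sync::OnceLock;

/// Hypothetical cache slot that stores a value only after a successful lookup.
pub struct FallibleCache {
    value: OnceLock<String>,
}

impl FallibleCache {
    pub fn new() -> Self {
        Self {
            value: OnceLock::new(),
        }
    }

    /// Returns the cached value, or runs `fetch` and caches its result on success.
    /// A failed `fetch` leaves the slot empty so a later call can retry.
    pub fn get_or_try<F>(&self, fetch: F) -> Option<String>
    where
        F: FnOnce() -> Option<String>,
    {
        if let Some(cached) = self.value.get() {
            return Some(cached.clone());
        }
        let fresh = fetch()?;
        // Under contention `fetch` may run in more than one thread;
        // whichever value is stored first wins and is returned to everyone.
        Some(self.value.get_or_init(|| fresh).clone())
    }
}
```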

### Phase 5: Testing & Validation
- [ ] Unit tests for cached value initialization
- [ ] Verify values are fetched only once per reader instance
- [ ] Performance benchmarks showing reduced overhead
- [ ] Integration tests for all platforms
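
One way to verify the fetch-once behavior without real hardware is to count backend queries. The sketch below uses a hypothetical `CountingReader` whose "driver query" is a placeholder string; a real test would substitute a mock of the actual backend.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a reader: counts how often the backend is queried.
pub struct CountingReader {
    driver_version: OnceLock<String>,
    backend_calls: AtomicUsize,
}

impl CountingReader {
    pub fn new() -> Self {
        Self {
            driver_version: OnceLock::new(),
            backend_calls: AtomicUsize::new(0),
        }
    }

    /// Returns the cached driver version, querying the "backend" on first use.
    pub fn driver_version(&self) -> &str {
        self.driver_version
            .get_or_init(|| {
                self.backend_calls.fetch_add(1, Ordering::SeqCst);
                "driver-x".to_string() // placeholder for the real driver query
            })
            .as_str()
    }

    pub fn backend_calls(&self) -> usize {
        self.backend_calls.load(Ordering::SeqCst)
    }
}

/// Polls a fresh reader `polls` times and reports how many backend queries ran.
pub fn fetch_count_after(polls: usize) -> usize {
    let reader = CountingReader::new();
    for _ in 0..polls {
        let _ = reader.driver_version();
    }
    reader.backend_calls()
}
```

A unit test could then assert that `fetch_count_after(1200)`, one hour of polling at a 3-second interval, reports a single backend query.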

## Expected Benefits

1. **Performance**:
   - Avoid nearly all repeated system calls for static data (each value is fetched once per reader instance)
   - Lower CPU overhead in monitoring loops
   - Faster metrics collection cycles

2. **Efficiency**:
   - Fewer NVML/driver API calls
   - Reduced lock contention
   - Better scalability for large clusters

3. **Code Quality**:
   - Clear separation of static vs. dynamic data
   - More maintainable reader implementations
   - Better encapsulation

## Compatibility Notes

- **Reader Factory**: Reader instances are created once and reused
- **API Mode**: Background task creates readers once at startup
- **Local/Remote Mode**: Readers persist for the session
- **Thread Safety**: Use `OnceLock` (thread-safe) or `Mutex` for caching
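
The thread-safety note can be illustrated with a small standard-library sketch: several threads read the same `OnceLock`, and the initializer runs once. The function name and the placeholder value are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

/// Spawns `threads` workers that all read the same cell and returns
/// how many times the initializer actually ran.
pub fn concurrent_init_count(threads: usize) -> usize {
    let cell: OnceLock<String> = OnceLock::new();
    let init_calls = AtomicUsize::new(0);

    // Scoped threads may borrow `cell` and `init_calls` from this stack frame.
    thread::scope(|scope| {
        for _ in 0..threads {
            scope.spawn(|| {
                let value = cell.get_or_init(|| {
                    init_calls.fetch_add(1, Ordering::SeqCst);
                    "driver-x".to_string() // placeholder for a driver query
                });
                assert_eq!(value.as_str(), "driver-x");
            });
        }
    });

    init_calls.load(Ordering::SeqCst)
}
```

`OnceLock::get_or_init` guarantees that only one initializer executes even when called concurrently, so no extra locking is needed around the cached fields.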

## Performance Impact Estimation

**Before optimization** (per GPU, 3-second interval):
- 20 calls/minute × 60 minutes = 1,200 calls/hour
- Each call: driver version + CUDA version + ~20 device details
- Total: ~26,400 redundant API calls per GPU per hour

**After optimization** (per GPU):
- First call: All static values fetched
- Subsequent calls: Only dynamic values (utilization, memory, temp, etc.)
- Reduction: static value calls drop from ~26,400 per hour to ~22 one-time fetches per reader instance
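
The "before" estimate can be written out as a worked calculation. It assumes a 3-second interval and roughly 22 static values per GPU (driver version, CUDA version, and about 20 device details); the exact count varies by device.

$$
\frac{3600\ \text{s/hour}}{3\ \text{s/call}} = 1200\ \text{calls/hour}, \qquad 1200 \times (1 + 1 + 20) = 26{,}400\ \text{static fetches per GPU per hour}
$$

With caching, the same ~22 values are fetched once per reader instance, regardless of how long the reader runs.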

## Related Issues

- #63 - Add AMD GPU driver version extraction (will benefit from this optimization)

## Acceptance Criteria

- [ ] All static values cached on first access
- [ ] No performance regression for dynamic values
- [ ] Memory usage increase is minimal (< 1KB per GPU)
- [ ] All existing tests pass
- [ ] New tests verify caching behavior
- [ ] Documentation updated with caching details
