
feat: Add proper library API for external Rust projects #106

Description

@inureyes

Problem / Background

The all-smi crate is currently structured primarily as a CLI application, but it already has a library target defined in Cargo.toml (name = "all_smi"). While the core building blocks exist (traits, types, and factory functions), using all-smi as a library in external Rust projects requires understanding the internal module structure and manually managing platform-specific components.

Current State:

  • Cargo.toml defines [lib] with name = "all_smi"
  • src/lib.rs re-exports modules primarily for testing purposes
  • Core types exist: GpuInfo, CpuInfo, MemoryInfo, ProcessInfo, ChassisInfo
  • Core traits exist: GpuReader, CpuReader, MemoryReader, ChassisReader
  • Factory functions exist: get_gpu_readers(), get_cpu_readers(), get_memory_readers()

Pain Points for Library Users:

  1. No high-level, ergonomic API: users must understand the internal trait-based architecture
  2. Platform-specific managers (e.g., NativeMetricsManager for macOS, HlsmiManager for Gaudi) require manual lifecycle management
  3. No unified error type hierarchy for library usage
  4. Missing prelude module for convenient imports
  5. No documentation examples for library usage

Proposed Solution

Create a user-friendly library API that abstracts away internal complexity while maintaining full access to lower-level components when needed.

Phase 1: High-Level API (src/api.rs or src/client.rs)

Create an AllSmi struct with simple, ergonomic methods:

use all_smi::AllSmi;

fn main() -> Result<(), all_smi::Error> {
    // Initialize with auto-detection
    let smi = AllSmi::new()?;
    
    // Get all GPU/NPU information
    let gpus = smi.get_gpu_info();
    for gpu in &gpus {
        println!("{}: {}% utilization, {:.1}W", 
            gpu.name, gpu.utilization, gpu.power_consumption);
    }
    
    // Get CPU information
    let cpus = smi.get_cpu_info();
    for cpu in &cpus {
        println!("{}: {}% utilization", cpu.cpu_model, cpu.utilization);
    }
    
    // Get memory information  
    let memory = smi.get_memory_info();
    for mem in &memory {
        println!("Memory: {:.1}% used", mem.utilization);
    }
    
    // Get process information (GPU processes)
    let processes = smi.get_process_info();
    for proc in &processes {
        println!("PID {}: {} using {} MiB GPU memory",
            proc.pid, proc.process_name, proc.used_memory / 1024 / 1024);
    }
    
    // Get chassis/node-level information
    if let Some(chassis) = smi.get_chassis_info() {
        if let Some(power) = chassis.total_power_watts {
            println!("Total system power: {:.1}W", power);
        }
    }
    
    Ok(())
}
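One way to implement the facade is for AllSmi to own the boxed readers that the existing factory functions return and to flatten their results. The sketch below uses simplified, hypothetical stand-ins for GpuInfo and GpuReader (the real types have more fields), and a hypothetical with_gpu_readers constructor in place of auto-detection so it runs without hardware.

```rust
// Hypothetical, simplified stand-ins for the crate's GpuInfo and GpuReader.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuInfo {
    pub name: String,
    pub utilization: f64,
    pub power_consumption: f64,
}

pub trait GpuReader: Send + Sync {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;
}

/// Facade that owns every detected reader and flattens their results.
pub struct AllSmi {
    gpu_readers: Vec<Box<dyn GpuReader>>,
}

impl AllSmi {
    /// Hypothetical constructor: the real `new()` would run auto-detection
    /// instead of taking readers as an argument.
    pub fn with_gpu_readers(gpu_readers: Vec<Box<dyn GpuReader>>) -> Self {
        AllSmi { gpu_readers }
    }

    /// Collects GPU/NPU entries from every reader into a single list.
    pub fn get_gpu_info(&self) -> Vec<GpuInfo> {
        self.gpu_readers
            .iter()
            .flat_map(|reader| reader.get_gpu_info())
            .collect()
    }
}

// Fake reader so the sketch runs without any accelerator present.
struct FakeReader(&'static str);

impl GpuReader for FakeReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo {
            name: self.0.to_string(),
            utilization: 42.0,
            power_consumption: 150.5,
        }]
    }
}

fn main() {
    let readers: Vec<Box<dyn GpuReader>> =
        vec![Box::new(FakeReader("GPU 0")), Box::new(FakeReader("GPU 1"))];
    let smi = AllSmi::with_gpu_readers(readers);
    let gpus = smi.get_gpu_info();
    for gpu in &gpus {
        println!("{}: {}% utilization, {:.1}W", gpu.name, gpu.utilization, gpu.power_consumption);
    }
}
```

Returning an empty Vec when no reader is present keeps the simple API panic-free on machines without accelerators.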

Phase 2: Prelude Module (src/prelude.rs)

Create convenient re-exports for use all_smi::prelude::*:

// src/prelude.rs
pub use crate::AllSmi;
pub use crate::Error;

// Core types
pub use crate::device::{
    GpuInfo, CpuInfo, MemoryInfo, ProcessInfo, ChassisInfo,
    CpuPlatformType, CoreType, CoreUtilization,
    AppleSiliconCpuInfo, CpuSocketInfo,
    FanInfo, PsuInfo, PsuStatus,
};

// Traits (for advanced usage)
pub use crate::device::{GpuReader, CpuReader, MemoryReader, ChassisReader};

// Factory functions (for advanced usage)
pub use crate::device::{get_gpu_readers, get_cpu_readers, get_memory_readers, create_chassis_reader};
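The prelude pattern can be shown in a single file: items stay in their home module, a public prelude module re-exports them, and callers import everything with one glob. The module contents below are hypothetical placeholders, not the crate's real definitions.

```rust
// Hypothetical module layout mirroring the proposal; the real crate::device differs.
mod device {
    #[derive(Debug, Default)]
    pub struct GpuInfo {
        pub name: String,
    }

    #[derive(Debug, Default)]
    pub struct CpuInfo {
        pub cpu_model: String,
    }

    pub trait GpuReader {
        fn get_gpu_info(&self) -> Vec<GpuInfo>;
    }
}

/// Glob-importable re-exports, analogous to the proposed src/prelude.rs.
pub mod prelude {
    pub use crate::device::{CpuInfo, GpuInfo, GpuReader};
}

// Downstream code needs one import line instead of knowing the module tree.
use crate::prelude::*;

struct OneGpu;

impl GpuReader for OneGpu {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo { name: "GPU 0".into() }]
    }
}

fn main() {
    let cpu = CpuInfo { cpu_model: "example".into() };
    println!("{} / {:?}", cpu.cpu_model, OneGpu.get_gpu_info());
}
```

Because the glob also brings traits into scope, trait methods such as get_gpu_info resolve without a separate import.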

Phase 3: Unified Error Type (src/error.rs)

Create a proper error hierarchy:

use thiserror::Error;

#[derive(Debug, Error)]
pub enum Error {
    #[error("Platform initialization failed: {0}")]
    PlatformInit(String),
    
    #[error("No supported devices found")]
    NoDevicesFound,
    
    #[error("Device access error: {0}")]
    DeviceAccess(String),
    
    #[error("Permission denied: {0}")]
    PermissionDenied(String),
    
    #[error("Feature not supported on this platform: {0}")]
    NotSupported(String),
    
    #[error(transparent)]
    Io(#[from] std::io::Error),
}

pub type Result<T> = std::result::Result<T, Error>;
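To make clear what the thiserror attributes buy, here is a std-only, hand-written approximation of three of the proposed variants: Display from #[error(...)], forwarding of Display and source() from #[error(transparent)], and the From impl from #[from] that lets `?` convert an io::Error. The read_text helper is hypothetical.

```rust
use std::error::Error as StdError;
use std::fmt;
use std::io;

/// Std-only approximation of three variants of the proposed Error enum.
#[derive(Debug)]
pub enum Error {
    NoDevicesFound,
    PermissionDenied(String),
    Io(io::Error),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::NoDevicesFound => write!(f, "No supported devices found"),
            Error::PermissionDenied(msg) => write!(f, "Permission denied: {msg}"),
            // #[error(transparent)]: Display is forwarded to the wrapped error.
            Error::Io(err) => fmt::Display::fmt(err, f),
        }
    }
}

impl StdError for Error {
    fn source(&self) -> Option<&(dyn StdError + 'static)> {
        match self {
            // #[error(transparent)]: source() is forwarded as well.
            Error::Io(err) => err.source(),
            _ => None,
        }
    }
}

// #[from] generates this impl; it is what lets `?` turn an io::Error into Error.
impl From<io::Error> for Error {
    fn from(err: io::Error) -> Self {
        Error::Io(err)
    }
}

pub type Result<T> = std::result::Result<T, Error>;

// Hypothetical helper showing `?` doing the conversion.
fn read_text(path: &str) -> Result<String> {
    Ok(std::fs::read_to_string(path)?)
}

fn main() {
    match read_text("/nonexistent/all-smi-example") {
        Ok(text) => println!("{text}"),
        Err(err) => println!("error: {err}"),
    }
}
```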

Phase 4: Lifecycle Management

The AllSmi struct should handle:

  • Auto-initialize platform-specific managers (NativeMetricsManager, HlsmiManager)
  • Implement Drop for proper cleanup
  • Provide a refresh() method for updating cached data

impl AllSmi {
    pub fn new() -> Result<Self> { ... }
    
    /// Refresh all device information
    pub fn refresh(&mut self) -> Result<()> { ... }
    
    /// Get supported device types on this platform
    pub fn supported_devices(&self) -> Vec<DeviceType> { ... }
}

impl Drop for AllSmi {
    fn drop(&mut self) {
        // Clean up platform-specific resources
    }
}
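A minimal sketch of the lifecycle piece, assuming a hypothetical PlatformManager trait that both NativeMetricsManager and HlsmiManager could sit behind, plus a hypothetical DeviceType enum (the issue uses DeviceType without defining it). Drop shuts managers down in reverse start order; that ordering is a design choice, not something the issue specifies.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical abstraction over components such as NativeMetricsManager or HlsmiManager.
pub trait PlatformManager: Send + Sync {
    fn shutdown(&self);
}

/// Hypothetical device categories returned by supported_devices().
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DeviceType {
    Gpu,
    Npu,
    Cpu,
    Memory,
}

pub struct AllSmi {
    managers: Vec<Box<dyn PlatformManager>>,
    devices: Vec<DeviceType>,
}

impl AllSmi {
    /// Hypothetical constructor; the real new() would detect both arguments itself.
    pub fn from_parts(managers: Vec<Box<dyn PlatformManager>>, devices: Vec<DeviceType>) -> Self {
        AllSmi { managers, devices }
    }

    pub fn supported_devices(&self) -> Vec<DeviceType> {
        self.devices.clone()
    }
}

impl Drop for AllSmi {
    fn drop(&mut self) {
        // Stop managers in reverse start order, so later ones can rely on earlier ones.
        for manager in self.managers.iter().rev() {
            manager.shutdown();
        }
    }
}

// Test double that counts how many times shutdown() ran.
struct CountingManager {
    shutdowns: Arc<AtomicUsize>,
}

impl PlatformManager for CountingManager {
    fn shutdown(&self) {
        self.shutdowns.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let shutdowns = Arc::new(AtomicUsize::new(0));
    {
        let managers: Vec<Box<dyn PlatformManager>> =
            vec![Box::new(CountingManager { shutdowns: Arc::clone(&shutdowns) })];
        let smi = AllSmi::from_parts(managers, vec![DeviceType::Cpu, DeviceType::Memory]);
        println!("supported: {:?}", smi.supported_devices());
    } // smi is dropped here, which calls shutdown()
    println!("shutdown calls: {}", shutdowns.load(Ordering::SeqCst));
}
```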

Phase 5: Builder Pattern (Optional)

For advanced configuration:

use std::time::Duration;

let smi = AllSmi::builder()
    .with_gpu_monitoring(true)
    .with_cpu_monitoring(true)
    .with_memory_monitoring(true)
    .with_process_monitoring(false)  // Disable process monitoring
    .with_refresh_interval(Duration::from_secs(3))
    .build()?;
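A consuming builder is one straightforward way to back this API: each with_* method takes self by value and returns it, and build() validates once. The Config struct, its defaults (everything on, 1 s interval), and the two validation rules are assumptions for illustration only.

```rust
use std::time::Duration;

/// Hypothetical settings collected by the builder.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub gpu: bool,
    pub cpu: bool,
    pub memory: bool,
    pub processes: bool,
    pub refresh_interval: Duration,
}

impl Default for Config {
    // Assumed defaults: every collector enabled, 1 s refresh interval.
    fn default() -> Self {
        Config {
            gpu: true,
            cpu: true,
            memory: true,
            processes: true,
            refresh_interval: Duration::from_secs(1),
        }
    }
}

#[derive(Debug, Default)]
pub struct AllSmiBuilder {
    config: Config,
}

impl AllSmiBuilder {
    pub fn with_gpu_monitoring(mut self, enabled: bool) -> Self {
        self.config.gpu = enabled;
        self
    }
    pub fn with_cpu_monitoring(mut self, enabled: bool) -> Self {
        self.config.cpu = enabled;
        self
    }
    pub fn with_memory_monitoring(mut self, enabled: bool) -> Self {
        self.config.memory = enabled;
        self
    }
    pub fn with_process_monitoring(mut self, enabled: bool) -> Self {
        self.config.processes = enabled;
        self
    }
    pub fn with_refresh_interval(mut self, interval: Duration) -> Self {
        self.config.refresh_interval = interval;
        self
    }

    /// Validates once, so AllSmi itself never holds an invalid config.
    pub fn build(self) -> Result<AllSmi, String> {
        let c = &self.config;
        if c.refresh_interval.is_zero() {
            return Err("refresh interval must be non-zero".to_string());
        }
        if !(c.gpu || c.cpu || c.memory || c.processes) {
            return Err("at least one collector must be enabled".to_string());
        }
        Ok(AllSmi { config: self.config })
    }
}

pub struct AllSmi {
    config: Config,
}

impl AllSmi {
    pub fn builder() -> AllSmiBuilder {
        AllSmiBuilder::default()
    }
    pub fn config(&self) -> &Config {
        &self.config
    }
}

fn main() -> Result<(), String> {
    let smi = AllSmi::builder()
        .with_process_monitoring(false)
        .with_refresh_interval(Duration::from_secs(3))
        .build()?;
    println!("{:?}", smi.config());
    Ok(())
}
```

A real implementation would return the crate's Error type from build() rather than String.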

Acceptance Criteria

  • Create AllSmi struct with high-level API in src/client.rs or similar
  • Create prelude module with common re-exports
  • Create unified Error type using thiserror
  • Implement proper lifecycle management with Drop
  • Auto-initialize platform managers (NativeMetricsManager, HlsmiManager)
  • Add comprehensive doc comments with examples
  • Create examples/library_usage.rs demonstrating library usage
  • Update src/lib.rs to export new public API
  • Add integration tests for library API
  • Update README.md with library usage documentation

Technical Considerations

Platform-Specific Initialization

The library needs to handle platform-specific managers:

  • macOS (Apple Silicon): Initialize NativeMetricsManager for IOReport/SMC access
  • Linux (Gaudi): Initialize HlsmiManager for hl-smi process management
  • Cross-platform: Handle cases where no GPU/NPU is detected gracefully
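The selection logic can be kept as a pure function of OS, architecture, and probe results so it is testable on any host. In this sketch the Manager enum is hypothetical, and detecting Gaudi by looking for an hl-smi binary on PATH is an assumption, not the crate's actual probe. An empty result is a normal outcome rather than an error.

```rust
/// Hypothetical identifiers for the platform managers named in the issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Manager {
    NativeMetrics,
    Hlsmi,
}

/// Decides which managers to start from the host description.
pub fn managers_for(os: &str, arch: &str, hlsmi_available: bool) -> Vec<Manager> {
    let mut managers = Vec::new();
    if os == "macos" && arch == "aarch64" {
        managers.push(Manager::NativeMetrics);
    }
    if os == "linux" && hlsmi_available {
        managers.push(Manager::Hlsmi);
    }
    // No match is not an error: the caller simply gets no accelerator managers.
    managers
}

/// Hypothetical probe: assumes Gaudi support is signaled by an `hl-smi` binary on PATH.
fn hlsmi_on_path() -> bool {
    std::env::var_os("PATH")
        .map(|paths| std::env::split_paths(&paths).any(|dir| dir.join("hl-smi").is_file()))
        .unwrap_or(false)
}

fn main() {
    let managers = managers_for(std::env::consts::OS, std::env::consts::ARCH, hlsmi_on_path());
    println!("managers to start: {managers:?}");
}
```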

Thread Safety

  • AllSmi should be Send + Sync for use in async contexts
  • Consider using Arc<Mutex<...>> for shared state if needed
  • Existing readers already implement Send + Sync
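Two small techniques support these points: a compile-time assertion that fails the build if AllSmi stops being Send + Sync, and sharing one instance across threads through Arc<Mutex<...>> so that refresh(&mut self) can be called from any of them. The AllSmi struct here is a hypothetical stand-in with a refresh counter.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in; plain owned data like this is automatically Send + Sync.
#[derive(Debug, Default)]
pub struct AllSmi {
    refreshes: u64,
}

impl AllSmi {
    pub fn refresh(&mut self) {
        self.refreshes += 1;
    }
    pub fn refreshes(&self) -> u64 {
        self.refreshes
    }
}

// Compile-time check: this fails to build if AllSmi ever stops being Send + Sync.
const _: fn() = || {
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<AllSmi>();
};

/// Refreshes one shared instance from `n` threads and returns the final count.
fn refresh_from_threads(n: usize) -> u64 {
    let smi = Arc::new(Mutex::new(AllSmi::default()));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let smi = Arc::clone(&smi);
            thread::spawn(move || {
                smi.lock().unwrap().refresh();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let count = smi.lock().unwrap().refreshes();
    count
}

fn main() {
    println!("refreshes: {}", refresh_from_threads(4));
}
```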

Error Handling

  • Return Result<T> from all fallible operations
  • Provide informative error messages with context
  • Don't panic on missing hardware - return empty collections or errors
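A sketch of the "missing hardware is not a panic" rule, assuming a hypothetical reader that enumerates a device directory: a missing directory yields an empty list, a permission problem yields an error carrying the path, and nothing unwraps. The Error variants echo the proposed enum but are redefined locally, and the /dev/accel path is only an example.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum Error {
    PermissionDenied(String),
    DeviceAccess(String),
}

/// Hypothetical reader: lists the entries of a device directory, sorted by name.
pub fn list_devices(dir: &Path) -> Result<Vec<String>, Error> {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        // No such directory means no such hardware: report an empty list.
        Err(e) if e.kind() == ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) if e.kind() == ErrorKind::PermissionDenied => {
            return Err(Error::PermissionDenied(format!("{}: {e}", dir.display())))
        }
        Err(e) => return Err(Error::DeviceAccess(format!("{}: {e}", dir.display()))),
    };
    let mut names = Vec::new();
    for entry in entries {
        let entry = entry.map_err(|e| Error::DeviceAccess(e.to_string()))?;
        names.push(entry.file_name().to_string_lossy().into_owned());
    }
    names.sort();
    Ok(names)
}

fn main() {
    // Example path only; a host without it should get an empty list, not a panic.
    match list_devices(Path::new("/dev/accel")) {
        Ok(devices) => println!("found {} device node(s)", devices.len()),
        Err(err) => println!("error: {err:?}"),
    }
}
```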

Documentation

  • All public types and functions should have doc comments
  • Include # Examples sections in documentation
  • Provide both simple and advanced usage examples
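As an illustration of the requested doc style, the sketch below documents a hypothetical memory_utilization helper on a trimmed-down GpuInfo, with an # Examples section that rustdoc would run as a doctest. The used_memory/total_memory fields and the all_smi::prelude import path are assumptions; in this standalone file the doc example is not compiled, only the surrounding code is.

```rust
/// Hypothetical, trimmed-down GPU record used only to illustrate doc style.
pub struct GpuInfo {
    /// Memory currently in use, in bytes.
    pub used_memory: u64,
    /// Total device memory, in bytes.
    pub total_memory: u64,
}

impl GpuInfo {
    /// Returns GPU memory usage as a percentage of total memory.
    ///
    /// Returns `None` when the device reports zero total memory, rather than
    /// dividing by zero.
    ///
    /// # Examples
    ///
    /// ```
    /// use all_smi::prelude::*; // assumes the proposed prelude exports GpuInfo
    ///
    /// let gpu = GpuInfo { used_memory: 2 << 30, total_memory: 8 << 30 };
    /// assert_eq!(gpu.memory_utilization(), Some(25.0));
    /// ```
    pub fn memory_utilization(&self) -> Option<f64> {
        if self.total_memory == 0 {
            None
        } else {
            Some(self.used_memory as f64 / self.total_memory as f64 * 100.0)
        }
    }
}

fn main() {
    let gpu = GpuInfo { used_memory: 2 << 30, total_memory: 8 << 30 };
    println!("{:?}", gpu.memory_utilization());
}
```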

Additional Context

This enhancement will make all-smi more useful for:

  • Integration into larger monitoring systems
  • Building custom CLI tools on top of all-smi
  • Embedding hardware monitoring in Rust applications
  • Backend services that need programmatic access to GPU/NPU metrics

Related projects that could benefit:

  • Backend.AI agent implementations
  • Custom monitoring dashboards
  • Resource schedulers
