Problem / Background
The nvml-wrapper crate v0.12.0 introduced several vGPU-related APIs that are now available for integration:
- vgpu_scheduler_capabilities – query scheduler capabilities
- vgpu_host_mode – check vGPU host mode
- vgpu_accounting_pids – list PIDs with vGPU accounting enabled
- vgpu_accounting_stats – per-vGPU utilization and memory stats
- vgpu_scheduler_log – scheduler log entries
- vgpu_scheduler_state – current scheduler state
- set_vgpu_scheduler_state – configure scheduler state
Currently, all-smi has no awareness of vGPU environments. Hosts running NVIDIA vGPU (commonly used for virtualized GPU sharing in cloud and enterprise workloads) are treated as standard GPU hosts, so scheduling data and per-vGPU utilization data are never collected or shown.
Goal
Expose vGPU metrics in both the TUI view mode and the Prometheus API mode, giving operators visibility into vGPU scheduling, utilization, and memory allocation across virtualized GPU environments.
Scope
Detection
- Detect whether the host is vGPU-enabled by querying vgpu_host_mode or scheduler capabilities.
- Gracefully skip vGPU collection on non-vGPU hosts (no errors, no empty sections).
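The detection and graceful-skip behaviour above could be sketched roughly as follows. This is a minimal illustration, not the nvml-wrapper API: the `VgpuProbe` trait, the `HostMode` and `ProbeError` enums, and `detect_vgpu` are hypothetical stand-ins, and the real vgpu_host_mode call may have a different signature and return type.

```rust
/// Hypothetical stand-in for the host-mode value reported by the driver.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostMode {
    NonSriov,
    Sriov,
}

/// Hypothetical error type standing in for the library's error enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProbeError {
    NotSupported,
    Other(String),
}

/// Hypothetical abstraction over a GPU handle, so the detection logic
/// can be exercised without a real driver.
pub trait VgpuProbe {
    fn host_mode(&self) -> Result<HostMode, ProbeError>;
}

/// Returns `Some(mode)` on a vGPU-enabled host and `None` otherwise.
/// Any error (unsupported device, old driver, ...) is treated as
/// "not a vGPU host", so callers can skip collection silently.
pub fn detect_vgpu<P: VgpuProbe>(device: &P) -> Option<HostMode> {
    device.host_mode().ok()
}
```

Callers would then branch on the `Option` once per device and omit the vGPU section entirely when it is `None`, which keeps non-vGPU hosts free of errors and empty sections.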
Data Collection
- Read vGPU scheduler capabilities and current scheduler state.
- Collect per-vGPU accounting stats (utilization, memory usage) via vgpu_accounting_pids and vgpu_accounting_stats.
- Optionally collect scheduler log entries for diagnostic display.
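One possible shape for the accounting collection step is shown below. It assumes a two-step flow (list PIDs, then fetch stats per PID) behind a hypothetical `VgpuAccountingSource` trait. The trait, the `VgpuProcessStats` struct and its field names are illustrative only; the actual vgpu_accounting_pids and vgpu_accounting_stats signatures in nvml-wrapper may differ.

```rust
/// Hypothetical per-process accounting record; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct VgpuProcessStats {
    pub pid: u32,
    pub gpu_utilization_percent: u32,
    pub max_memory_bytes: u64,
}

/// Hypothetical abstraction over the two accounting calls.
pub trait VgpuAccountingSource {
    type Error;
    fn accounting_pids(&self) -> Result<Vec<u32>, Self::Error>;
    fn accounting_stats(&self, pid: u32) -> Result<VgpuProcessStats, Self::Error>;
}

/// Collects stats for every PID that can be read.
/// A failure to list PIDs yields an empty result, and a failure for a
/// single PID (for example, the process exited in between) drops only
/// that entry instead of aborting the whole collection.
pub fn collect_accounting<S: VgpuAccountingSource>(source: &S) -> Vec<VgpuProcessStats> {
    let pids = match source.accounting_pids() {
        Ok(pids) => pids,
        Err(_) => return Vec::new(),
    };
    pids.into_iter()
        .filter_map(|pid| source.accounting_stats(pid).ok())
        .collect()
}
```

Dropping individual failures rather than propagating them is a design choice for a monitoring loop, where partial data is usually more useful than none.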
TUI Display
- Display vGPU information in the TUI, either as:
  - A sub-tab under the existing GPU tab, or
  - An additional collapsible section within each GPU's detail view.
- Show per-vGPU utilization, memory, and scheduler state.
Prometheus API
- Export vGPU metrics at the /metrics endpoint in Prometheus format, including:
  - allsmi_vgpu_utilization (per vGPU instance)
  - allsmi_vgpu_memory_used_bytes / allsmi_vgpu_memory_total_bytes
  - allsmi_vgpu_scheduler_state
  - allsmi_vgpu_host_mode
- Use appropriate labels (gpu_index, vgpu_id, host, etc.).
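To make the proposed metric names and labels concrete, here is a small sketch that renders three of them in the Prometheus text exposition format. The `VgpuSample` struct, the helper names, the HELP strings, and the choice of gauge type are assumptions for illustration; they do not reflect existing all-smi code.

```rust
use std::fmt::Write;

/// Hypothetical flattened sample for one vGPU instance.
pub struct VgpuSample {
    pub gpu_index: u32,
    pub vgpu_id: u32,
    pub host: String,
    pub utilization_percent: f64,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// Escapes backslash, double quote and newline, as the text format
/// requires for label values.
pub fn escape_label(value: &str) -> String {
    let mut escaped = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => escaped.push_str("\\\\"),
            '"' => escaped.push_str("\\\""),
            '\n' => escaped.push_str("\\n"),
            other => escaped.push(other),
        }
    }
    escaped
}

/// Writes one metric family: HELP line, TYPE line, then one sample per vGPU.
fn write_family<F>(out: &mut String, name: &str, help: &str, samples: &[VgpuSample], value: F)
where
    F: Fn(&VgpuSample) -> String,
{
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {} {}", name, help);
    let _ = writeln!(out, "# TYPE {} gauge", name);
    for s in samples {
        let labels = format!(
            "gpu_index=\"{}\",vgpu_id=\"{}\",host=\"{}\"",
            s.gpu_index,
            s.vgpu_id,
            escape_label(&s.host)
        );
        let _ = writeln!(out, "{}{{{}}} {}", name, labels, value(s));
    }
}

/// Renders the utilization and memory families for all samples.
pub fn render_vgpu_metrics(samples: &[VgpuSample]) -> String {
    let mut out = String::new();
    write_family(&mut out, "allsmi_vgpu_utilization", "vGPU utilization in percent", samples, |s| {
        s.utilization_percent.to_string()
    });
    write_family(&mut out, "allsmi_vgpu_memory_used_bytes", "vGPU memory in use", samples, |s| {
        s.memory_used_bytes.to_string()
    });
    write_family(&mut out, "allsmi_vgpu_memory_total_bytes", "vGPU memory allocated", samples, |s| {
        s.memory_total_bytes.to_string()
    });
    out
}
```

When `samples` is empty, this sketch still emits the HELP and TYPE lines, so a real implementation would likely return early on non-vGPU hosts to avoid exporting empty families.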
Acceptance Criteria
- /metrics endpoint exports all vGPU metrics with correct labels
Technical Considerations
- Dependency: Requires upgrading nvml-wrapper to >= 0.12.0.
- Fallback: vGPU APIs may return errors on non-vGPU hosts or older drivers. All calls must be wrapped with proper error handling to avoid panics or degraded behavior on standard GPU hosts.
- Architecture: The vGPU reader logic should integrate into the existing GpuReader trait flow in src/gpu/nvidia.rs, extending GpuInfo or introducing a companion VgpuInfo struct.
- Mock server: The mock server should be extended to optionally simulate vGPU responses for testing.
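As a rough illustration of the companion-struct option from the Architecture item, the sketch below attaches an optional `VgpuInfo` to a simplified GPU record. `GpuInfoSketch` is a hypothetical placeholder and does not mirror the real GpuInfo fields. `VgpuInstance`, the string-typed state fields, and `vgpu_section_lines` are likewise assumptions.

```rust
/// Hypothetical per-instance data shown in the TUI and exported as metrics.
#[derive(Debug, Clone, PartialEq)]
pub struct VgpuInstance {
    pub vgpu_id: u32,
    pub utilization_percent: f64,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// Hypothetical companion struct carrying all vGPU data for one GPU.
#[derive(Debug, Clone, PartialEq)]
pub struct VgpuInfo {
    pub host_mode: String,
    pub scheduler_state: String,
    pub instances: Vec<VgpuInstance>,
}

/// Simplified placeholder for the existing GPU record.
/// `vgpu` stays `None` on standard GPU hosts.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuInfoSketch {
    pub index: u32,
    pub name: String,
    pub vgpu: Option<VgpuInfo>,
}

const MIB: u64 = 1024 * 1024;

/// Builds the text lines for a vGPU section, or no lines at all when the
/// GPU has no vGPU data, so the display never shows an empty section.
pub fn vgpu_section_lines(gpu: &GpuInfoSketch) -> Vec<String> {
    let vgpu = match gpu.vgpu.as_ref() {
        Some(vgpu) => vgpu,
        None => return Vec::new(),
    };
    let mut lines = vec![format!(
        "vGPU (host mode: {}, scheduler: {})",
        vgpu.host_mode, vgpu.scheduler_state
    )];
    for inst in &vgpu.instances {
        lines.push(format!(
            "  vGPU {}: {:.1}% util, {}/{} MiB",
            inst.vgpu_id,
            inst.utilization_percent,
            inst.memory_used_bytes / MIB,
            inst.memory_total_bytes / MIB
        ));
    }
    lines
}
```

Keeping the vGPU data behind an `Option` means existing consumers of the GPU record are unaffected on standard hosts, and a mock server could populate the same struct with canned values to simulate vGPU responses.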