Add NVIDIA vGPU monitoring support via nvml-wrapper 0.12 #129

Description

@inureyes

Problem / Background

The nvml-wrapper crate v0.12.0 introduced several vGPU-related APIs that are now available for integration:

  • vgpu_scheduler_capabilities – query scheduler capabilities
  • vgpu_host_mode – check vGPU host mode
  • vgpu_accounting_pids – list PIDs with vGPU accounting enabled
  • vgpu_accounting_stats – per-vGPU utilization and memory stats
  • vgpu_scheduler_log – scheduler log entries
  • vgpu_scheduler_state – current scheduler state
  • set_vgpu_scheduler_state – configure scheduler state

Currently, all-smi has no awareness of vGPU environments. Hosts running NVIDIA vGPU (commonly used for virtualized GPU sharing in cloud and enterprise workloads) are treated as standard GPU hosts, so scheduling data and per-vGPU utilization are never collected.

Goal

Expose vGPU metrics in both the TUI view mode and the Prometheus API mode, giving operators visibility into vGPU scheduling, utilization, and memory allocation across virtualized GPU environments.

Scope

Detection

  • Detect whether the host is vGPU-enabled by querying vgpu_host_mode or scheduler capabilities.
  • Gracefully skip vGPU collection on non-vGPU hosts (no errors, no empty sections).
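
A minimal sketch of the detection step, assuming that a failed host-mode query simply means "not a vGPU host". The `HostMode` and `ProbeError` types and the `probe` closure are hypothetical stand-ins, not the nvml-wrapper API; a real reader would pass a closure that calls the binding's host-mode query for each device.

```rust
/// Hypothetical stand-in for the host-mode value an NVML binding would return.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostMode {
    NonSriov,
    Sriov,
}

/// Hypothetical error type; a real reader would use the binding's own error enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProbeError {
    NotSupported,
    Other(String),
}

/// Probes every device index and keeps only those where the host-mode query
/// succeeds. Any error is treated as "not a vGPU host" and skipped, so a
/// standard GPU host yields an empty list rather than a failure.
pub fn vgpu_capable_devices<F>(device_count: u32, mut probe: F) -> Vec<(u32, HostMode)>
where
    F: FnMut(u32) -> Result<HostMode, ProbeError>,
{
    (0..device_count)
        .filter_map(|index| probe(index).ok().map(|mode| (index, mode)))
        .collect()
}
```

An empty result can then be used to suppress the vGPU section entirely in both the TUI and the exporter, which matches the "no errors, no empty sections" requirement.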

Data Collection

  • Read vGPU scheduler capabilities and current scheduler state.
  • Collect per-vGPU accounting stats (utilization, memory usage) via vgpu_accounting_pids and vgpu_accounting_stats.
  • Optionally collect scheduler log entries for diagnostic display.
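
The two-step accounting flow (list PIDs, then fetch stats per PID) could look roughly like the sketch below. The `AccountingSource` trait, the `VgpuProcessStats` fields, and `FakeSource` are all assumptions made for illustration; the actual nvml-wrapper signatures and stat fields may differ.

```rust
/// Hypothetical per-process accounting record for one vGPU instance.
#[derive(Debug, Clone, PartialEq)]
pub struct VgpuProcessStats {
    pub pid: u32,
    pub gpu_utilization_pct: u32,
    pub max_memory_bytes: u64,
}

/// Hypothetical abstraction over the two accounting queries.
pub trait AccountingSource {
    type Error;
    fn accounting_pids(&self, vgpu_id: u32) -> Result<Vec<u32>, Self::Error>;
    fn accounting_stats(&self, vgpu_id: u32, pid: u32) -> Result<VgpuProcessStats, Self::Error>;
}

/// Collects stats for every accounted PID of one vGPU instance.
/// A failing PID list yields an empty result; a failing single PID
/// (for example, a process that exited mid-scan) is skipped.
pub fn collect_vgpu_stats<S: AccountingSource>(source: &S, vgpu_id: u32) -> Vec<VgpuProcessStats> {
    let pids = match source.accounting_pids(vgpu_id) {
        Ok(pids) => pids,
        Err(_) => return Vec::new(),
    };
    pids.into_iter()
        .filter_map(|pid| source.accounting_stats(vgpu_id, pid).ok())
        .collect()
}

/// In-memory source for illustration: vGPU 0 has three PIDs, of which PID 42
/// fails; every other vGPU reports accounting as unavailable.
pub struct FakeSource;

impl AccountingSource for FakeSource {
    type Error = String;

    fn accounting_pids(&self, vgpu_id: u32) -> Result<Vec<u32>, String> {
        if vgpu_id == 0 {
            Ok(vec![10, 42, 77])
        } else {
            Err("accounting disabled".to_string())
        }
    }

    fn accounting_stats(&self, _vgpu_id: u32, pid: u32) -> Result<VgpuProcessStats, String> {
        if pid == 42 {
            return Err("process exited".to_string());
        }
        Ok(VgpuProcessStats {
            pid,
            gpu_utilization_pct: pid % 100,
            max_memory_bytes: u64::from(pid) * 1024,
        })
    }
}
```

Skipping individual failures keeps one stale PID from discarding the whole sample; whether partial results are acceptable here is a design choice the implementation would need to confirm.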

TUI Display

  • Display vGPU information in the TUI, either as:
    • A sub-tab under the existing GPU tab, or
    • An additional collapsible section within each GPU's detail view.
  • Show per-vGPU utilization, memory, and scheduler state.

Prometheus API

  • Export vGPU metrics at the /metrics endpoint in Prometheus format, including:
    • allsmi_vgpu_utilization (per vGPU instance)
    • allsmi_vgpu_memory_used_bytes / allsmi_vgpu_memory_total_bytes
    • allsmi_vgpu_scheduler_state
    • allsmi_vgpu_host_mode
  • Use appropriate labels (gpu_index, vgpu_id, host, etc.).
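
As an illustration of the export step, the sketch below renders three of the metrics listed above in the Prometheus text exposition format. The `VgpuSample` struct and the function names are hypothetical, and the project's existing exporter may already provide its own helpers for this.

```rust
use std::fmt::Write;

/// Hypothetical flattened sample for one vGPU instance.
pub struct VgpuSample {
    pub gpu_index: u32,
    pub vgpu_id: u32,
    pub host: String,
    pub utilization_pct: f64,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// Escapes a label value per the text exposition format
/// (backslash, double quote, and newline).
fn escape_label(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Writes one gauge family: a TYPE line followed by one line per sample.
fn write_gauge(
    out: &mut String,
    name: &str,
    samples: &[VgpuSample],
    value: impl Fn(&VgpuSample) -> f64,
) {
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# TYPE {name} gauge");
    for s in samples {
        let _ = writeln!(
            out,
            "{name}{{gpu_index=\"{}\",vgpu_id=\"{}\",host=\"{}\"}} {}",
            s.gpu_index,
            s.vgpu_id,
            escape_label(&s.host),
            value(s)
        );
    }
}

/// Renders the vGPU metric families; returns an empty string when there are
/// no samples so non-vGPU hosts export nothing extra.
pub fn render_vgpu_metrics(samples: &[VgpuSample]) -> String {
    let mut out = String::new();
    if samples.is_empty() {
        return out;
    }
    write_gauge(&mut out, "allsmi_vgpu_utilization", samples, |s| s.utilization_pct);
    write_gauge(&mut out, "allsmi_vgpu_memory_used_bytes", samples, |s| {
        s.memory_used_bytes as f64
    });
    write_gauge(&mut out, "allsmi_vgpu_memory_total_bytes", samples, |s| {
        s.memory_total_bytes as f64
    });
    out
}
```

Each metric family is written as one contiguous group, since the exposition format expects all lines of a metric to appear together after its TYPE line.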

Acceptance Criteria

  • vGPU-enabled hosts are correctly detected; non-vGPU hosts are unaffected
  • Per-vGPU utilization and memory stats are collected via nvml-wrapper 0.12 APIs
  • vGPU scheduler capabilities and state are readable
  • TUI view displays vGPU information in a clear, navigable layout
  • Prometheus /metrics endpoint exports all vGPU metrics with correct labels
  • Existing GPU monitoring functionality is not regressed
  • Feature is integration-tested with the mock server or a vGPU-capable environment

Technical Considerations

  • Dependency: Requires upgrading nvml-wrapper to >= 0.12.0.
  • Fallback: vGPU APIs may return errors on non-vGPU hosts or older drivers. Every call must handle its error result explicitly (no unwrapping) so that standard GPU hosts neither panic nor lose any existing metrics.
  • Architecture: The vGPU reader logic should integrate into the existing GpuReader trait flow in src/gpu/nvidia.rs, extending GpuInfo or introducing a companion VgpuInfo struct.
  • Mock server: The mock server should be extended to optionally simulate vGPU responses for testing.
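
One possible shape for the companion struct and the mock toggle is sketched below. `VgpuInfo` is named in the issue, but its fields here are guesses; `GpuRecord` is a deliberately simplified, hypothetical stand-in for the project's real `GpuInfo`, and `MockVgpuConfig` and `mock_gpu` are likewise invented for illustration.

```rust
/// Companion struct carrying vGPU data for one instance (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct VgpuInfo {
    pub vgpu_id: u32,
    pub utilization_pct: f64,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// Hypothetical, heavily simplified stand-in for the project's GPU record.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuRecord {
    pub index: u32,
    pub name: String,
    /// Empty on hosts without vGPU support, so existing consumers are unaffected.
    pub vgpus: Vec<VgpuInfo>,
}

/// Hypothetical mock-server switch for simulating vGPU responses.
#[derive(Debug, Clone, Copy)]
pub struct MockVgpuConfig {
    pub enabled: bool,
    pub instances_per_gpu: u32,
    pub memory_per_vgpu_bytes: u64,
}

/// Builds a deterministic mock GPU record; vGPU instances are generated only
/// when the config enables them.
pub fn mock_gpu(index: u32, cfg: MockVgpuConfig) -> GpuRecord {
    let vgpus = if cfg.enabled {
        (0..cfg.instances_per_gpu)
            .map(|i| VgpuInfo {
                // Globally unique ID derived from the GPU index and slot.
                vgpu_id: index * cfg.instances_per_gpu + i,
                utilization_pct: f64::from((i * 25) % 100),
                memory_used_bytes: cfg.memory_per_vgpu_bytes / 2,
                memory_total_bytes: cfg.memory_per_vgpu_bytes,
            })
            .collect()
    } else {
        Vec::new()
    };
    GpuRecord {
        index,
        name: format!("Mock GPU {index}"),
        vgpus,
    }
}
```

Keeping the vGPU data in a separate collection on the record means non-vGPU code paths can ignore it, and deterministic mock values make the integration tests in the acceptance criteria reproducible.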
