Problem / Background
NVIDIA Multi-Instance GPU (MIG) allows a single GPU to be partitioned into multiple isolated instances, each with its own compute, memory, and cache resources. This is widely used in data center and cloud environments (A100, A30, H100, H200) for workload isolation and resource sharing.
The nvml-wrapper crate v0.12.0 introduced MIG-related APIs:
- mig_device_by_index: enumerate MIG instances on a GPU
- mig_device_count: get the number of MIG instances
- mig_parent_device: resolve the parent GPU of a MIG instance
- mig_is_mig_device_handle: check if a device handle is a MIG instance
- set_mig_mode: enable or disable MIG mode on a GPU (a setter; querying the current status needs the read-side counterpart, which NVML exposes as nvmlDeviceGetMigMode)
Currently, all-smi has no awareness of MIG partitions — a MIG-enabled GPU appears as a single device, hiding the individual instances and their metrics.
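The enumeration flow implied by these calls can be sketched as follows. This is a minimal illustration, not the nvml-wrapper API: the `MigQuery` trait, `MigHandle`, and `FakeGpu` are hypothetical stand-ins, and the signatures are assumptions chosen so the example runs without NVML. It also assumes that some indices below the reported count may be unpopulated, so lookup errors are skipped rather than treated as fatal.

```rust
/// Hypothetical abstraction over the MIG calls listed above. It does not
/// reproduce the real nvml-wrapper signatures.
trait MigQuery {
    /// Whether MIG mode is currently enabled on this GPU.
    fn mig_enabled(&self) -> Result<bool, String>;
    /// Number of MIG slots to probe on this GPU.
    fn mig_device_count(&self) -> Result<u32, String>;
    /// Handle for the slot at `index`, or an error if the slot is empty.
    fn mig_device_by_index(&self, index: u32) -> Result<MigHandle, String>;
}

#[derive(Debug, Clone, PartialEq)]
struct MigHandle {
    slot: u32,
    name: String,
}

/// Collects every populated MIG slot. Returns an empty list when MIG is
/// disabled or any query fails, so callers can keep treating the GPU as a
/// single device.
fn enumerate_mig<Q: MigQuery>(gpu: &Q) -> Vec<MigHandle> {
    if !gpu.mig_enabled().unwrap_or(false) {
        return Vec::new();
    }
    let count = gpu.mig_device_count().unwrap_or(0);
    (0..count)
        .filter_map(|i| gpu.mig_device_by_index(i).ok())
        .collect()
}

/// In-memory stand-in for a GPU, used only for illustration.
struct FakeGpu {
    enabled: bool,
    slots: Vec<Option<String>>,
}

impl MigQuery for FakeGpu {
    fn mig_enabled(&self) -> Result<bool, String> {
        Ok(self.enabled)
    }
    fn mig_device_count(&self) -> Result<u32, String> {
        Ok(self.slots.len() as u32)
    }
    fn mig_device_by_index(&self, index: u32) -> Result<MigHandle, String> {
        match self.slots.get(index as usize) {
            Some(Some(name)) => Ok(MigHandle { slot: index, name: name.clone() }),
            _ => Err(format!("no MIG device at index {}", index)),
        }
    }
}
```

In a real reader, an adapter would implement the trait on top of the crate's device type; keeping the trait separate also makes the logic testable on machines without MIG hardware.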
Proposed Solution
Add MIG instance detection and per-instance monitoring to both TUI and API modes, treating each MIG instance as a sub-device of its parent GPU.
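One way to picture the "sub-device of its parent GPU" relationship is a flat list of instances regrouped by parent index before display or export. The sketch below assumes a hypothetical `MigInstance` type with only the two index fields; the real records would carry metrics as well.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal record: which GPU an instance belongs to and its
/// position under that GPU.
#[derive(Debug, Clone, PartialEq)]
struct MigInstance {
    parent_gpu: u32,
    mig_index: u32,
}

/// Groups a flat list of instances by parent GPU index. Parents come out in
/// ascending order (BTreeMap), and each parent's children are sorted by
/// `mig_index`.
fn group_by_parent(instances: Vec<MigInstance>) -> BTreeMap<u32, Vec<MigInstance>> {
    let mut tree: BTreeMap<u32, Vec<MigInstance>> = BTreeMap::new();
    for inst in instances {
        tree.entry(inst.parent_gpu).or_default().push(inst);
    }
    for children in tree.values_mut() {
        children.sort_by_key(|c| c.mig_index);
    }
    tree
}
```

A GPU without MIG instances simply has no entry in the map, so existing per-GPU rendering can stay as it is.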
Scope
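Detection & Enumeration
- Check MIG mode (set_mig_mode / NVML query)
- Enumerate instances with mig_device_count and mig_device_by_index
Per-Instance Metrics
- Add gpu_instance_id / compute_instance_id to the new MIG records
TUI Display
- Label instances under their parent GPU (GPU 0 / MIG 0, GPU 0 / MIG 1)
API / Prometheus Export
- Label per-instance metrics with the parent GPU and instance (gpu="0",mig_instance="0")
Edge Cases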
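As a rough illustration of the display and export items, the sketch below formats a TUI label of the form GPU 0 / MIG 1 and a Prometheus exposition line carrying gpu and mig_instance labels. The `MigId` type and both function names are hypothetical, and the metric name is left to the caller because the issue does not define one.

```rust
/// Hypothetical identity of one MIG instance under its parent GPU.
struct MigId {
    gpu_index: u32,
    mig_index: u32,
}

/// TUI label in the "GPU 0 / MIG 1" form.
fn tui_label(id: &MigId) -> String {
    format!("GPU {} / MIG {}", id.gpu_index, id.mig_index)
}

/// One Prometheus exposition line with `gpu` and `mig_instance` labels.
/// The metric name is supplied by the caller; it is not taken from all-smi.
fn prometheus_line(metric: &str, id: &MigId, value: f64) -> String {
    // `{{` and `}}` emit literal braces around the label set.
    format!(
        "{}{{gpu=\"{}\",mig_instance=\"{}\"}} {}",
        metric, id.gpu_index, id.mig_index, value
    )
}
```

Keeping the parent index as its own label lets existing per-GPU queries aggregate over instances without parsing a combined identifier.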
Technical Considerations
- nvml-wrapper version: Requires v0.12.0+ for MIG APIs. Check current pinned version and update if needed.
- Testing: MIG hardware is not commonly available in development environments. Consider extending the mock server to simulate MIG configurations for testing.
- GPU reader trait: May need to extend GpuInfo or introduce a new MigInstanceInfo struct to represent MIG partitions without breaking the existing GpuReader trait contract.
- Backward compatibility: Non-MIG systems must remain unaffected. MIG detection should fail gracefully on unsupported hardware.
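One possible shape for the trait and compatibility points above is a new trait method with a default body, so existing readers compile unchanged and report no MIG instances. The definitions below are simplified stand-ins, not the actual all-smi types: the fields of `GpuInfo` and `MigInstanceInfo`, and the method names `get_gpu_info` and `get_mig_info`, are assumptions made for the example.

```rust
/// Simplified stand-in for the existing per-GPU record.
#[derive(Debug, Clone, PartialEq)]
struct GpuInfo {
    index: u32,
    name: String,
}

/// Hypothetical per-instance record; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct MigInstanceInfo {
    parent_gpu_index: u32,
    mig_index: u32,
    memory_total_bytes: u64,
}

/// Simplified stand-in for the reader trait.
trait GpuReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;

    /// New method with a default body: readers that know nothing about MIG
    /// inherit it and report no instances.
    fn get_mig_info(&self) -> Vec<MigInstanceInfo> {
        Vec::new()
    }
}

/// A reader written before MIG support; it needs no changes.
struct LegacyReader;

impl GpuReader for LegacyReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo { index: 0, name: "GPU".to_string() }]
    }
}

/// A MIG-aware reader overrides the default.
struct MigAwareReader {
    instances: Vec<MigInstanceInfo>,
}

impl GpuReader for MigAwareReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo { index: 0, name: "GPU".to_string() }]
    }
    fn get_mig_info(&self) -> Vec<MigInstanceInfo> {
        self.instances.clone()
    }
}
```

The trade-off is that callers cannot distinguish "MIG unsupported" from "MIG enabled with zero instances"; returning an `Option` or `Result` instead would expose that difference if the TUI needs it.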
Acceptance Criteria
- MIG configurations can be simulated in the mock server (ALL_SMI_MOCK_MIG)