Add NVIDIA MIG (Multi-Instance GPU) monitoring support #131

Description

@inureyes

Problem / Background

NVIDIA Multi-Instance GPU (MIG) allows a single GPU to be partitioned into multiple isolated instances, each with its own compute, memory, and cache resources. This is widely used in data center and cloud environments (A100, A30, H100, H200) for workload isolation and resource sharing.

The nvml-wrapper crate v0.12.0 introduced MIG-related APIs:

  • mig_device_by_index — enumerate MIG instances on a GPU
  • mig_device_count — get the number of MIG instances
  • mig_parent_device — resolve the parent GPU of a MIG instance
  • mig_is_mig_device_handle — check if a device handle is a MIG instance
  • set_mig_mode — set MIG mode (enabled/disabled); reading the current mode is a separate NVML query (nvmlDeviceGetMigMode)

Currently, all-smi has no awareness of MIG partitions — a MIG-enabled GPU appears as a single device, hiding the individual instances and their metrics.

Proposed Solution

Add MIG instance detection and per-instance monitoring to both TUI and API modes, treating each MIG instance as a sub-device of its parent GPU.

Scope

Detection & Enumeration

  • Detect whether a GPU has MIG mode enabled via an NVML MIG mode query (nvmlDeviceGetMigMode); set_mig_mode changes the mode rather than reading it
  • Enumerate MIG instances per GPU using mig_device_count and mig_device_by_index
  • Map each MIG instance back to its parent GPU
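The enumeration steps above could look roughly like the sketch below. It does not call nvml-wrapper directly: the `MigSource` trait, `MigInstance`, `FakeGpu`, and `FakeSource` are hypothetical names introduced here so the loop can run without MIG hardware. It assumes that a slot index may be empty (NVML reports unpopulated MIG slots as not found), so empty slots are skipped rather than treated as errors.

```rust
/// Hypothetical abstraction over the NVML calls named in this issue.
/// A real implementation would delegate to nvml-wrapper; this is an assumption.
pub trait MigSource {
    /// Number of physical GPUs.
    fn gpu_count(&self) -> u32;
    /// Some(true/false) if the GPU supports MIG, None if it does not.
    fn mig_enabled(&self, gpu: u32) -> Option<bool>;
    /// Number of MIG slots to probe (stand-in for mig_device_count).
    fn mig_slot_count(&self, gpu: u32) -> u32;
    /// UUID of the instance in a slot, None if the slot is empty
    /// (stand-in for mig_device_by_index).
    fn mig_uuid_at(&self, gpu: u32, slot: u32) -> Option<String>;
}

/// One MIG instance mapped back to its parent GPU.
#[derive(Debug, Clone, PartialEq)]
pub struct MigInstance {
    pub parent_gpu: u32,
    pub slot: u32,
    pub uuid: String,
}

/// Walk every GPU, skip those without MIG enabled, and collect populated slots.
pub fn enumerate_mig<S: MigSource>(src: &S) -> Vec<MigInstance> {
    let mut out = Vec::new();
    for gpu in 0..src.gpu_count() {
        if src.mig_enabled(gpu) != Some(true) {
            continue; // unsupported or disabled: treat as a plain GPU
        }
        for slot in 0..src.mig_slot_count(gpu) {
            if let Some(uuid) = src.mig_uuid_at(gpu, slot) {
                out.push(MigInstance { parent_gpu: gpu, slot, uuid });
            }
        }
    }
    out
}

/// In-memory stand-in for hardware, useful in tests.
pub struct FakeGpu {
    pub mig_enabled: Option<bool>,
    pub slots: Vec<Option<String>>,
}

pub struct FakeSource(pub Vec<FakeGpu>);

impl MigSource for FakeSource {
    fn gpu_count(&self) -> u32 {
        self.0.len() as u32
    }
    fn mig_enabled(&self, gpu: u32) -> Option<bool> {
        self.0[gpu as usize].mig_enabled
    }
    fn mig_slot_count(&self, gpu: u32) -> u32 {
        self.0[gpu as usize].slots.len() as u32
    }
    fn mig_uuid_at(&self, gpu: u32, slot: u32) -> Option<String> {
        self.0[gpu as usize].slots.get(slot as usize).cloned().flatten()
    }
}
```

Keeping the NVML calls behind a trait is one possible design; it also gives the mock server a natural place to plug in simulated MIG layouts.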

Per-Instance Metrics

  • Read utilization (compute/memory) for each MIG instance
  • Read memory usage (used/total) for each MIG instance
  • Collect process information running on each MIG instance (deferred — out of scope for this PR; future work can join NVML process accounting gpu_instance_id/compute_instance_id to the new MIG records)

TUI Display

  • Show MIG mode status (enabled/disabled) per GPU in the device info area
  • Display MIG instances as sub-rows under their parent GPU (e.g., GPU 0 / MIG 0, GPU 0 / MIG 1)
  • Ensure parent GPU aggregate metrics and per-instance metrics are both visible
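As a rough illustration of the sub-row layout described above, the following sketch renders plain text lines rather than real TUI widgets. `GpuRow`, `MigRow`, and `render_rows` are hypothetical names, and the column layout is an assumption, not the existing all-smi layout.

```rust
/// Hypothetical per-instance display data.
#[derive(Debug, Clone)]
pub struct MigRow {
    pub index: u32,
    pub util_pct: u32,
    pub mem_used_mib: u64,
    pub mem_total_mib: u64,
}

/// Hypothetical per-GPU display data, with its MIG instances nested.
#[derive(Debug, Clone)]
pub struct GpuRow {
    pub index: u32,
    pub name: String,
    pub mig_enabled: bool,
    pub util_pct: u32,
    pub instances: Vec<MigRow>,
}

/// Emit one line per GPU, followed by one indented line per MIG instance,
/// so aggregate and per-instance metrics are visible together.
pub fn render_rows(gpus: &[GpuRow]) -> Vec<String> {
    let mut lines = Vec::new();
    for g in gpus {
        let mode = if g.mig_enabled { "enabled" } else { "disabled" };
        lines.push(format!(
            "GPU {} {} | util {}% | MIG {}",
            g.index, g.name, g.util_pct, mode
        ));
        for m in &g.instances {
            lines.push(format!(
                "  GPU {} / MIG {} | util {}% | mem {}/{} MiB",
                g.index, m.index, m.util_pct, m.mem_used_mib, m.mem_total_mib
            ));
        }
    }
    lines
}
```

A GPU with an empty `instances` list renders as a single line, so non-MIG devices keep a one-row presentation.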

API / Prometheus Export

  • Export per-MIG-instance metrics with labels distinguishing parent GPU and instance index (e.g., gpu="0", mig_instance="0")
  • Include MIG mode status as a metric or info label
  • Maintain backward compatibility — non-MIG GPUs should produce unchanged metric output
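One way to satisfy the labeling and backward-compatibility points above is to add the `mig_instance` label only when a sample belongs to a MIG instance. The sketch below assumes the Prometheus text exposition format; the function names and the `example_` metric names are hypothetical, not the names all-smi currently exports.

```rust
/// Format one sample line. The mig_instance label is emitted only for MIG
/// instances, so samples for non-MIG GPUs keep their existing label set.
pub fn metric_line(name: &str, gpu: u32, mig_instance: Option<u32>, value: f64) -> String {
    match mig_instance {
        Some(mig) => format!("{name}{{gpu=\"{gpu}\",mig_instance=\"{mig}\"}} {value}"),
        None => format!("{name}{{gpu=\"{gpu}\"}} {value}"),
    }
}

/// Expose MIG mode as a 0/1 gauge per GPU (hypothetical metric name).
pub fn mig_mode_line(gpu: u32, enabled: bool) -> String {
    let v = if enabled { 1 } else { 0 };
    format!("example_gpu_mig_mode{{gpu=\"{gpu}\"}} {v}")
}
```

Whether MIG mode is better exported as a gauge or as an info-style label is left open in the issue; the gauge here is only one of the two options.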

Edge Cases

  • Gracefully handle GPUs that do not support MIG (older architectures)
  • Handle partial MIG configurations (some GPUs MIG-enabled, others not)
  • Handle MIG mode transitions (enabled ↔ disabled) during runtime
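The edge cases above suggest re-querying MIG status on every polling cycle and degrading to plain-GPU behavior on any error. A minimal sketch under those assumptions follows; `MigStatus`, `QueryError`, `classify`, and `transitions` are hypothetical, and `QueryError` only stands in for the real NVML error type.

```rust
/// Per-GPU MIG status as seen in one polling cycle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MigStatus {
    Unsupported,
    Disabled,
    Enabled,
}

/// Hypothetical stand-in for the NVML error enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryError {
    NotSupported,
    Other,
}

/// Map a MIG mode query result to a status. Any error degrades to
/// Unsupported, so the GPU is simply shown as a single device.
pub fn classify(result: Result<bool, QueryError>) -> MigStatus {
    match result {
        Ok(true) => MigStatus::Enabled,
        Ok(false) => MigStatus::Disabled,
        Err(_) => MigStatus::Unsupported,
    }
}

/// Compare two polling cycles and report (gpu index, old, new) for every
/// GPU whose status changed, e.g. to rebuild its instance list.
pub fn transitions(
    prev: &[MigStatus],
    curr: &[MigStatus],
) -> Vec<(usize, MigStatus, MigStatus)> {
    prev.iter()
        .zip(curr.iter())
        .enumerate()
        .filter(|(_, (p, c))| p != c)
        .map(|(i, (p, c))| (i, *p, *c))
        .collect()
}
```

Because each GPU is classified independently, mixed systems (some GPUs MIG-enabled, others not) need no special handling.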

Technical Considerations

  • nvml-wrapper version: Requires v0.12.0+ for MIG APIs. Check current pinned version and update if needed.
  • Testing: MIG hardware is not commonly available in development environments. Consider extending the mock server to simulate MIG configurations for testing.
  • GPU reader trait: May need to extend GpuInfo or introduce a new MigInstanceInfo struct to represent MIG partitions without breaking the existing GpuReader trait contract.
  • Backward compatibility: Non-MIG systems must remain unaffected. MIG detection should fail gracefully on unsupported hardware.
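For the trait-contract concern above, one option is a new trait method with a default body, so existing readers compile unchanged. The sketch below is an assumption about shape only: the real `GpuReader` and `GpuInfo` definitions in all-smi are not shown in this issue, and the method names `get_gpu_info` and `get_mig_info`, the struct fields, and both reader types are hypothetical.

```rust
/// Reduced, hypothetical stand-in for the existing GPU record.
#[derive(Debug, Clone, PartialEq)]
pub struct GpuInfo {
    pub index: u32,
    pub name: String,
}

/// Hypothetical record for one MIG partition.
#[derive(Debug, Clone, PartialEq)]
pub struct MigInstanceInfo {
    pub parent_index: u32,
    pub instance_index: u32,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

pub trait GpuReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;

    /// New method with a default body: readers that know nothing about MIG
    /// inherit an empty list and need no code changes.
    fn get_mig_info(&self) -> Vec<MigInstanceInfo> {
        Vec::new()
    }
}

/// A reader written before MIG support existed; it only implements the old method.
pub struct LegacyReader;

impl GpuReader for LegacyReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo { index: 0, name: "legacy".to_string() }]
    }
}

/// A reader that opts in by overriding the default.
pub struct MigAwareReader;

impl GpuReader for MigAwareReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        vec![GpuInfo { index: 0, name: "mig-capable".to_string() }]
    }

    fn get_mig_info(&self) -> Vec<MigInstanceInfo> {
        (0..2)
            .map(|i| MigInstanceInfo {
                parent_index: 0,
                instance_index: i,
                memory_used_bytes: 0,
                memory_total_bytes: 5 * 1024 * 1024 * 1024,
            })
            .collect()
    }
}
```

The alternative mentioned in the issue, extending `GpuInfo` itself, would touch every construction site of that struct, which is the main trade-off against a separate `MigInstanceInfo`.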

Acceptance Criteria

  • MIG-enabled GPUs are detected and their instances are enumerated
  • Per-MIG-instance utilization and memory metrics are collected
  • TUI displays MIG instances as sub-entries under their parent GPU
  • Prometheus metrics include per-MIG-instance data with appropriate labels
  • Non-MIG GPUs and unsupported hardware are handled gracefully with no regressions
  • Mock server can optionally simulate MIG configurations for development/testing (env-gated via ALL_SMI_MOCK_MIG)
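The env gate in the last criterion could be read as in the sketch below. The issue names the variable ALL_SMI_MOCK_MIG but not its accepted values, so the truthy set ("1", "true", "yes", "on") is an assumption, as are the function names.

```rust
use std::env;

/// Interpret the raw variable value. Kept separate from the environment
/// lookup so it can be tested without mutating process state.
pub fn parse_mock_mig_flag(raw: Option<&str>) -> bool {
    match raw.map(|s| s.trim().to_ascii_lowercase()) {
        Some(v) => matches!(v.as_str(), "1" | "true" | "yes" | "on"),
        None => false, // unset: MIG simulation stays off
    }
}

/// Check the ALL_SMI_MOCK_MIG environment variable named in the issue.
pub fn mock_mig_enabled() -> bool {
    parse_mock_mig_flag(env::var("ALL_SMI_MOCK_MIG").ok().as_deref())
}
```

Defaulting to off when the variable is unset or unrecognized keeps existing mock-server runs unchanged.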
