Add NVIDIA MIG (Multi-Instance GPU) monitoring support

## Problem / Background

NVIDIA Multi-Instance GPU (MIG) allows a single GPU to be partitioned into multiple isolated instances, each with its own compute, memory, and cache resources. This is widely used in data center and cloud environments (A100, A30, H100, H200) for workload isolation and resource sharing.

The `nvml-wrapper` crate v0.12.0 introduced MIG-related APIs:
- `mig_device_by_index` — enumerate MIG instances on a GPU
- `mig_device_count` — get the number of MIG instances
- `mig_parent_device` — resolve the parent GPU of a MIG instance
- `mig_is_mig_device_handle` — check if a device handle is a MIG instance
- `mig_mode` — query MIG mode status (enabled/disabled); `set_mig_mode` is the corresponding setter and is not needed for read-only monitoring

Currently, all-smi has no awareness of MIG partitions — a MIG-enabled GPU appears as a single device, hiding the individual instances and their metrics.

## Proposed Solution

Add MIG instance detection and per-instance monitoring to both TUI and API modes, treating each MIG instance as a sub-device of its parent GPU.

## Scope

### Detection & Enumeration
- [x] Detect whether a GPU has MIG mode enabled via a read-only NVML MIG-mode query (`mig_mode`, not the `set_mig_mode` setter)
- [x] Enumerate MIG instances per GPU using `mig_device_count` and `mig_device_by_index`
- [x] Map each MIG instance back to its parent GPU
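
The enumeration flow above can be sketched without hardware. This is a minimal illustration, not the actual all-smi or `nvml-wrapper` code: the `MigSource` trait, `MigRef`, `FakeSource`, and `enumerate_mig` are hypothetical names standing in for the real NVML calls.

```rust
/// Hypothetical abstraction over the NVML queries (assumption, not the nvml-wrapper API).
pub trait MigSource {
    /// Number of physical GPUs visible to the process.
    fn gpu_count(&self) -> usize;
    /// `Some(true/false)` if MIG mode can be queried, `None` if the GPU does not support MIG.
    fn mig_enabled(&self, gpu: usize) -> Option<bool>;
    /// Number of MIG slots to probe on this GPU.
    fn mig_slot_count(&self, gpu: usize) -> usize;
    /// Whether a MIG instance actually exists in the given slot.
    fn mig_slot_present(&self, gpu: usize, slot: usize) -> bool;
}

/// A MIG instance together with the index of its parent GPU.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MigRef {
    pub parent_gpu: usize,
    pub instance: usize,
}

/// Walk every GPU, skip those without MIG enabled, and collect populated slots.
pub fn enumerate_mig<S: MigSource>(src: &S) -> Vec<MigRef> {
    let mut out = Vec::new();
    for gpu in 0..src.gpu_count() {
        // Unsupported (`None`) and disabled (`Some(false)`) GPUs contribute no instances.
        if src.mig_enabled(gpu) != Some(true) {
            continue;
        }
        for slot in 0..src.mig_slot_count(gpu) {
            if src.mig_slot_present(gpu, slot) {
                out.push(MigRef { parent_gpu: gpu, instance: slot });
            }
        }
    }
    out
}

/// In-memory stand-in for tests: one entry per GPU, (MIG mode, populated flag per slot).
pub struct FakeSource {
    pub gpus: Vec<(Option<bool>, Vec<bool>)>,
}

impl MigSource for FakeSource {
    fn gpu_count(&self) -> usize {
        self.gpus.len()
    }
    fn mig_enabled(&self, gpu: usize) -> Option<bool> {
        self.gpus[gpu].0
    }
    fn mig_slot_count(&self, gpu: usize) -> usize {
        self.gpus[gpu].1.len()
    }
    fn mig_slot_present(&self, gpu: usize, slot: usize) -> bool {
        self.gpus[gpu].1[slot]
    }
}
```

Keeping the parent index inside each record means the mapping back to the parent GPU is available to both the TUI and the exporter without another NVML call.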

### Per-Instance Metrics
- [x] Read utilization (compute/memory) for each MIG instance
- [x] Read memory usage (used/total) for each MIG instance
- [ ] Collect process information running on each MIG instance (deferred — out of scope for this PR; future work can join NVML process accounting `gpu_instance_id`/`compute_instance_id` to the new MIG records)

### TUI Display
- [x] Show MIG mode status (enabled/disabled) per GPU in the device info area
- [x] Display MIG instances as sub-rows under their parent GPU (e.g., `GPU 0 / MIG 0`, `GPU 0 / MIG 1`)
- [x] Ensure parent GPU aggregate metrics and per-instance metrics are both visible
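
As a rough illustration of the sub-row layout, the sketch below builds the text labels for one GPU. The `device_rows` function and the `[MIG: ...]` status suffix are assumptions made for this example; only the `GPU 0 / MIG 0` label style comes from the list above.

```rust
/// Build display labels for one GPU: a parent row followed by one indented
/// sub-row per MIG instance. Hypothetical helper, not the actual TUI code.
pub fn device_rows(gpu: u32, mig_enabled: bool, mig_instances: &[u32]) -> Vec<String> {
    let status = if mig_enabled { "enabled" } else { "disabled" };
    // The parent row stays visible so aggregate metrics keep a place to live.
    let mut rows = vec![format!("GPU {gpu} [MIG: {status}]")];
    if mig_enabled {
        for inst in mig_instances {
            rows.push(format!("  GPU {gpu} / MIG {inst}"));
        }
    }
    rows
}
```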

### API / Prometheus Export
- [x] Export per-MIG-instance metrics with labels distinguishing parent GPU and instance index (e.g., `gpu="0"`, `mig_instance="0"`)
- [x] Include MIG mode status as a metric or info label
- [x] Maintain backward compatibility — non-MIG GPUs should produce unchanged metric output
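
One way to keep non-MIG output unchanged is to add the `mig_instance` label only when an instance index is present. The sketch below assumes a hypothetical `render_metric` helper; the metric names used with it are placeholders, not the names all-smi exports.

```rust
/// Render one Prometheus text-format sample. The `mig_instance` label is
/// emitted only for MIG instances, so lines for plain GPUs are unaffected.
pub fn render_metric(name: &str, gpu: u32, mig_instance: Option<u32>, value: f64) -> String {
    match mig_instance {
        Some(i) => format!("{name}{{gpu=\"{gpu}\",mig_instance=\"{i}\"}} {value}"),
        None => format!("{name}{{gpu=\"{gpu}\"}} {value}"),
    }
}
```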

### Edge Cases
- [x] Gracefully handle GPUs that do not support MIG (older architectures)
- [x] Handle partial MIG configurations (some GPUs MIG-enabled, others not)
- [x] Handle MIG mode transitions (enabled ↔ disabled) during runtime
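
The edge cases above reduce to classifying the result of the MIG-mode query per GPU and noticing when that classification changes between polls. The following is a sketch under assumptions: `QueryError`, `MigState`, `classify`, and `needs_reenumeration` are hypothetical names, and the real NVML error type has more variants than the two modeled here.

```rust
/// Simplified stand-in for the error returned by a MIG-mode query (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryError {
    NotSupported,
    Other,
}

/// Per-GPU MIG state as seen by the monitor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MigState {
    Unsupported,
    Disabled,
    Enabled,
    Unknown,
}

/// Map a query result to a state without propagating the error, so one
/// unsupported or failing GPU does not affect the others.
pub fn classify(result: Result<bool, QueryError>) -> MigState {
    match result {
        Ok(true) => MigState::Enabled,
        Ok(false) => MigState::Disabled,
        Err(QueryError::NotSupported) => MigState::Unsupported,
        Err(QueryError::Other) => MigState::Unknown,
    }
}

/// A change of state between two polls means cached instance lists are stale.
pub fn needs_reenumeration(previous: MigState, current: MigState) -> bool {
    previous != current
}
```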

## Technical Considerations

- **nvml-wrapper version**: Requires v0.12.0+ for MIG APIs. Check current pinned version and update if needed.
- **Testing**: MIG hardware is not commonly available in development environments. Consider extending the mock server to simulate MIG configurations for testing.
- **GPU reader trait**: May need to extend `GpuInfo` or introduce a new `MigInstanceInfo` struct to represent MIG partitions without breaking the existing `GpuReader` trait contract.
- **Backward compatibility**: Non-MIG systems must remain unaffected. MIG detection should fail gracefully on unsupported hardware.
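
A possible shape for the data model, sketched under assumptions: `GpuInfoSketch` is a hypothetical stand-in for the real `GpuInfo` (whose fields are not shown in this issue), and the `MigInstanceInfo` fields are illustrative. An empty `mig_instances` vector represents a non-MIG GPU, so existing construction sites can rely on `Default`.

```rust
/// Metrics for a single MIG instance (illustrative fields).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MigInstanceInfo {
    pub instance_index: u32,
    pub utilization_percent: f64,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// Hypothetical stand-in for the existing per-GPU record.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GpuInfoSketch {
    pub index: u32,
    pub name: String,
    pub mig_enabled: bool,
    /// Empty for non-MIG GPUs, which keeps their representation unchanged.
    pub mig_instances: Vec<MigInstanceInfo>,
}

impl GpuInfoSketch {
    /// Sum of memory used across all MIG instances on this GPU.
    pub fn mig_memory_used_bytes(&self) -> u64 {
        self.mig_instances.iter().map(|m| m.memory_used_bytes).sum()
    }
}
```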

## Acceptance Criteria

- [x] MIG-enabled GPUs are detected and their instances are enumerated
- [x] Per-MIG-instance utilization and memory metrics are collected
- [x] TUI displays MIG instances as sub-entries under their parent GPU
- [x] Prometheus metrics include per-MIG-instance data with appropriate labels
- [x] Non-MIG GPUs and unsupported hardware are handled gracefully with no regressions
- [x] Mock server can optionally simulate MIG configurations for development/testing (env-gated via `ALL_SMI_MOCK_MIG`)
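
The env gate for the mock server could look roughly like this. Only the variable name `ALL_SMI_MOCK_MIG` comes from the criteria above; the accepted values (`1`, `true`, `yes`, `on`) and the helper names are assumptions for illustration.

```rust
/// Decide whether MIG simulation is requested from a raw variable value.
/// Kept separate from the environment lookup so it can be unit-tested.
pub fn mock_mig_requested(value: Option<&str>) -> bool {
    match value {
        Some(v) => matches!(
            v.trim().to_ascii_lowercase().as_str(),
            "1" | "true" | "yes" | "on"
        ),
        // Unset variable: MIG simulation stays off by default.
        None => false,
    }
}

/// Read the gate from the process environment.
pub fn mock_mig_from_env() -> bool {
    mock_mig_requested(std::env::var("ALL_SMI_MOCK_MIG").ok().as_deref())
}
```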
