feat: add NVIDIA MIG (Multi-Instance GPU) monitoring by inureyes · Pull Request #174 · lablup/all-smi

inureyes · 2026-04-15T09:43:40Z

Summary

Adds NVIDIA MIG (Multi-Instance GPU) detection and per-instance monitoring across the data model, NVML reader, TUI, Prometheus exporter, remote parser, and mock server. Mirrors the architectural pattern landed in #172 (feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12) and #173 (feat: add NVIDIA thermal thresholds and P-state monitoring).
Surfaces MIG instances as nested rows beneath the parent GPU in the TUI, exports all_smi_gpu_mig_mode plus four all_smi_mig_instance_* families with gpu_uuid/mig_instance/mig_uuid/mig_profile/gpu_instance_id labels, and round-trips through the remote scrape parser losslessly.
Non-MIG GPUs and pre-Ampere architectures stay completely silent — every NVML call is .ok()-wrapped and the feature degrades to an empty Vec on any error (driver too old, missing permissions, MIG mode disabled).
Mock server can simulate a MIG-partitioned A100/H100 (5 instances per GPU spanning 1g.5gb/2g.10gb/3g.20gb/7g.40gb) when ALL_SMI_MOCK_MIG is set; otherwise the NVIDIA mock output is byte-identical to today.
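
The "silent on any error" behavior described above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProbeError`, `Instance`, and `instances_or_empty` are hypothetical names, and the two arguments stand in for the real NVML calls (`mig_mode()` and the `mig_device_by_index` enumeration).

```rust
/// Hypothetical stand-in for an NVML error; not the real nvml-wrapper type.
#[derive(Debug)]
pub enum ProbeError {
    NotSupported,
    NoPermission,
}

/// Hypothetical per-instance record (the PR's real struct carries more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Instance {
    pub instance_id: u32,
}

/// Collapse every failure into an empty Vec so callers never see an error.
pub fn instances_or_empty(
    mig_mode: Result<bool, ProbeError>,
    enumerate: impl FnOnce() -> Result<Vec<Instance>, ProbeError>,
) -> Vec<Instance> {
    // `.ok()` turns Err(_) into None; `unwrap_or(false)` treats that as "MIG off".
    if !mig_mode.ok().unwrap_or(false) {
        return Vec::new();
    }
    // An enumeration failure degrades to an empty list as well.
    enumerate().ok().unwrap_or_default()
}
```

With this shape, an old driver, missing permissions, and disabled MIG mode all look identical to the caller: an empty list and no error to handle.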

Closes #131.

Files touched

src/device/types.rs — new MigInstanceInfo and MigGpuInfo structs.
src/device/traits.rs — GpuReader::get_mig_info(&self) with empty default.
src/device/readers/nvidia_mig.rs (new) + src/device/readers/nvidia.rs — wire NVML probe (mig_mode() + mig_device_by_index enumeration, raw FFI for nvmlDeviceGetGpuInstanceId / nvmlDeviceGetComputeInstanceId).
src/app_state.rs, src/view/data_collection/strategy.rs, src/view/render_snapshot.rs, src/view/data_collection/{local,remote}_collector.rs, src/network/client.rs — thread mig_info: Vec<MigGpuInfo> end-to-end (local + remote).
src/network/metrics_parser.rs — new MigParseState with MAX_MIG_GPUS=256 / MAX_MIG_INSTANCES=4096 / MAX_MIG_INSTANCE_INDEX=64 defensive caps; routed mig_instance_* and gpu_mig_mode lines.
src/api/metrics/mig.rs (new) + src/api/handlers.rs — Prometheus exporter with precomputed row cache (matches the perf pattern landed in feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12 #172).
src/ui/renderers/mig_renderer.rs (new), src/view/frame_renderer.rs, src/ui/renderers/gpu_renderer.rs, src/ui/layout.rs — TUI nested rows, find_matching_mig_gpu UUID-first matcher, layout math updated so PgUp/PgDn page sizes account for MIG rows.
src/mock/templates/mig.rs (new) + src/mock/templates/nvidia.rs — env-gated mock template.
src/client.rs, src/prelude.rs — AllSmi::get_mig_info() and MigGpuInfo / MigInstanceInfo exports.
Tests: tests/mig_integration_test.rs (4 tests, mirrors tests/vgpu_integration_test.rs); inline coverage added to metrics_parser.rs (parser caps, oversized indices, missing optional ids), mig.rs exporter (label set, silent-when-empty), mig_renderer.rs (ANSI-aware), mig.rs mock template (placeholder substitution, env gating). Existing tests/{cpu_model,thermal_pstate_integration,vgpu_integration}_test.rs updated for the new 6-tuple parse_metrics signature.

Test plan

cargo build --features mock — green
cargo test --features mock — 422 lib + 500 bin + 38 integration + 4 new MIG integration tests, 0 failures
cargo clippy --features mock --all-targets — no MIG-related warnings (8 pre-existing in amd.rs)
cargo fmt — clean
Mock smoke test: ALL_SMI_MOCK_MIG=1 all-smi-mock-server --port-range 19099 --platform nvidia emits 168 MIG metric lines for 8 GPUs × 5 profiles; without the env var, /metrics shows 0 MIG lines (silent no-op verified)
Hardware validation on a real MIG-enabled A100/H100 host: not yet run (no MIG hardware available in the dev environment)

Notes for reviewers

The MIG_PROFILES table in src/mock/templates/mig.rs documents the synthetic partitioning used in mock mode (1g.5gb x2, 2g.10gb, 3g.20gb, 7g.40gb). Easy to swap for a different mix without touching the parser.
compute_instance_id is plumbed end-to-end (FFI → exporter labels → parser) but not currently used as a TUI label — kept available for future PRs that want to surface compute slice IDs in the UI.
Per-MIG-instance process attribution is intentionally out of scope here; it can be layered in by joining the existing NVML process accounting (which already reports gpu_instance_id/compute_instance_id per PID) against the new MigInstanceInfo records in a follow-up.
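
The follow-up join mentioned in the last note might look something like the sketch below. Everything here is an assumption for illustration: `ProcSample`, `InstKey`, and `attribute` are hypothetical names, and the sketch joins only on `(gpu_uuid, gpu_instance_id)`, leaving `compute_instance_id` aside.

```rust
use std::collections::HashMap;

/// Hypothetical per-process sample as NVML process accounting might report it.
pub struct ProcSample {
    pub pid: u32,
    pub gpu_uuid: String,
    pub gpu_instance_id: Option<u32>,
}

/// Hypothetical projection of a MIG instance record down to its join keys.
pub struct InstKey {
    pub gpu_uuid: String,
    pub gpu_instance_id: u32,
    pub mig_instance: u32,
}

/// Group PIDs by (gpu_uuid, mig_instance); unmatched processes are skipped.
pub fn attribute(
    procs: &[ProcSample],
    insts: &[InstKey],
) -> HashMap<(String, u32), Vec<u32>> {
    // Lookup table: (gpu_uuid, gpu_instance_id) -> mig_instance slot.
    let index: HashMap<(&str, u32), u32> = insts
        .iter()
        .map(|i| ((i.gpu_uuid.as_str(), i.gpu_instance_id), i.mig_instance))
        .collect();

    let mut out: HashMap<(String, u32), Vec<u32>> = HashMap::new();
    for p in procs {
        if let Some(gi) = p.gpu_instance_id {
            if let Some(&slot) = index.get(&(p.gpu_uuid.as_str(), gi)) {
                out.entry((p.gpu_uuid.clone(), slot)).or_default().push(p.pid);
            }
        }
    }
    out
}
```

Processes without a `gpu_instance_id`, or on a GPU with no matching instance record, simply drop out of the result, which keeps the join silent on non-MIG hosts.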

Introduces MigInstanceInfo and MigGpuInfo to represent NVIDIA MIG (Multi-Instance GPU) partitions, with a get_mig_info() method on the GpuReader trait that defaults to returning an empty vector. The new nvidia_mig reader probes mig_mode() per device, enumerates instances via mig_device_by_index, and silently degrades to empty on any NVML error so non-MIG GPUs and pre-Ampere architectures remain unchanged.
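
A minimal sketch of the "trait method with an empty default" pattern this commit describes. The field names below are illustrative assumptions rather than the PR's exact struct layout, the real `GpuReader` trait has other methods that are omitted here, and `OtherVendorReader` / `FakeNvidiaReader` are hypothetical.

```rust
/// Field names are assumptions for illustration only.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct MigInstanceInfo {
    pub instance_id: u32,
    pub profile: String,
    pub gpu_instance_id: Option<u32>,
    pub compute_instance_id: Option<u32>,
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct MigGpuInfo {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<MigInstanceInfo>,
}

pub trait GpuReader {
    /// Default: readers with no MIG concept report nothing.
    fn get_mig_info(&self) -> Vec<MigGpuInfo> {
        Vec::new()
    }
}

/// Hypothetical non-NVIDIA reader: inherits the empty default untouched.
pub struct OtherVendorReader;
impl GpuReader for OtherVendorReader {}

/// Hypothetical NVIDIA-like reader that overrides the default.
pub struct FakeNvidiaReader;
impl GpuReader for FakeNvidiaReader {
    fn get_mig_info(&self) -> Vec<MigGpuInfo> {
        vec![MigGpuInfo {
            gpu_uuid: "GPU-0000".to_string(),
            mig_mode: true,
            instances: vec![MigInstanceInfo {
                instance_id: 0,
                profile: "1g.5gb".to_string(),
                ..Default::default()
            }],
        }]
    }
}
```

Because the default lives on the trait, existing readers need no changes to compile against the extended interface.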

Adds a mig_info: Vec<MigGpuInfo> field to AppState, CollectionData, and RenderSnapshot, and wires the local + remote data collectors to populate it from the new GpuReader::get_mig_info() method. The network client fetch path now returns MIG records alongside vGPU records so remote clusters surface MIG instances the same way local hosts do.

Adds parser-level unit tests for the new MigParseState (host enumeration, instance counting, defensive caps, oversized index rejection, missing optional ids), and a tests/mig_integration_test.rs that round-trips a fixture mirroring MigMetricExporter output through MetricsParser to guard against silent format drift between exporter and parser. Also updates the existing thermal/cpu-model integration tests to consume the new 6-tuple parse_metrics return.
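
The defensive caps these tests exercise could be enforced along the following lines. The three constants come from the PR description; `CapState`, its fields, and the exact rejection rules (for example, treating an index greater than 64 as oversized) are assumptions made for this sketch.

```rust
use std::collections::HashMap;

pub const MAX_MIG_GPUS: usize = 256;
pub const MAX_MIG_INSTANCES: usize = 4096;
pub const MAX_MIG_INSTANCE_INDEX: u32 = 64;

/// Hypothetical, simplified parse state: gpu_uuid -> instance indices seen so far.
#[derive(Default)]
pub struct CapState {
    hosts: HashMap<String, Vec<u32>>,
    total_instances: usize,
}

impl CapState {
    /// Returns true if the sample was accepted, false if a cap rejected it.
    pub fn record(&mut self, gpu_uuid: &str, instance_index: u32) -> bool {
        // Oversized indices are rejected before any allocation happens.
        if instance_index > MAX_MIG_INSTANCE_INDEX {
            return false;
        }
        let known_gpu = self.hosts.contains_key(gpu_uuid);
        if !known_gpu && self.hosts.len() >= MAX_MIG_GPUS {
            return false;
        }
        let already_seen = known_gpu && self.hosts[gpu_uuid].contains(&instance_index);
        if !already_seen {
            if self.total_instances >= MAX_MIG_INSTANCES {
                return false;
            }
            self.hosts
                .entry(gpu_uuid.to_string())
                .or_default()
                .push(instance_index);
            self.total_instances += 1;
        }
        true
    }

    pub fn gpu_count(&self) -> usize {
        self.hosts.len()
    }

    pub fn instance_count(&self) -> usize {
        self.total_instances
    }
}
```

The point of checking before inserting is that a hostile or corrupted remote feed cannot grow the parser's memory beyond the configured bounds.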

The MIG exporter computed `compute_instance_id_str` per instance but never actually emitted it as a Prometheus label, while the module docs, the mock template, and the integration test all assumed it was present. The omission only stayed invisible because the integration test used hand-crafted text that itself included the label, so the exporter's silent drop was never observed.

- Extend `instance_labels` from a 9- to a 10-element array with the `compute_instance_id` slot appended next to `gpu_instance_id`.
- Drop the `#[allow(dead_code)]` now that `compute_instance_id_str` is actually read by the exporter.
- Expose `api::metrics` from the library (under the `cli` feature) so integration tests can round-trip through the real exporter.
- Rewrite `tests/mig_integration_test.rs` to build its fixture via `MigMetricExporter::export_metrics` instead of hand-rolled text, so future label drops surface as failing tests.
- Add exporter-side coverage for the empty-string fallback when NVML does not report `compute_instance_id`.
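
To illustrate the empty-string fallback, here is a rough sketch of rendering one exposition line with both id labels present. It covers only a hypothetical subset of the labels (the real array has ten entries), skips label-value escaping, and `InstanceLabels`, `opt_id`, and `render_line` are names invented for this example.

```rust
/// Hypothetical subset of per-instance fields; the real exporter carries more labels.
pub struct InstanceLabels<'a> {
    pub gpu_uuid: &'a str,
    pub mig_instance: u32,
    pub mig_profile: &'a str,
    pub gpu_instance_id: Option<u32>,
    pub compute_instance_id: Option<u32>,
}

/// Render an optional id as a label value; a missing id becomes "".
fn opt_id(id: Option<u32>) -> String {
    id.map(|v| v.to_string()).unwrap_or_default()
}

/// Build one Prometheus text-format sample line (no escaping, sketch only).
pub fn render_line(metric: &str, l: &InstanceLabels, value: f64) -> String {
    let inst = l.mig_instance.to_string();
    let gi = opt_id(l.gpu_instance_id);
    let ci = opt_id(l.compute_instance_id);
    // Every label is always emitted so the label set stays stable across samples.
    let labels: [(&str, &str); 5] = [
        ("gpu_uuid", l.gpu_uuid),
        ("mig_instance", &inst),
        ("mig_profile", l.mig_profile),
        ("gpu_instance_id", &gi),
        ("compute_instance_id", &ci),
    ];
    let body: Vec<String> = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect();
    format!("{metric}{{{}}} {value}", body.join(","))
}
```

Emitting `compute_instance_id=""` rather than omitting the label keeps the label set identical for every series in a family, which is what lets a parser on the other side treat the field as optional without guessing.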

The TUI MIG section printed `[i]` using the `Vec<MigInstanceInfo>` enumeration index, not the instance's NVML-reported `instance_id`. When NVML enumerates instances at non-contiguous slots (e.g. 0, 1, 4 after tearing down the middle partitions), the TUI showed `[0], [1], [2]` while the Prometheus `mig_instance` label correctly reflected the real slot. UI and exporter then disagreed on which slot each row represented.

- Switch the renderer to `&inst.instance_id.to_string()`.
- Add a regression unit test with sparse instance_ids (0, 1, 4) that asserts `[4]` is rendered and `[2]` is not.
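
The difference between the two labeling strategies can be shown with a small sketch. `Inst`, `render_rows`, and `render_rows_by_position` are hypothetical names; the real renderer also handles colors and layout.

```rust
/// Hypothetical, trimmed-down instance record for illustration.
pub struct Inst {
    pub instance_id: u32,
    pub profile: &'static str,
}

/// Label each row with the NVML-reported slot, not the Vec position.
pub fn render_rows(instances: &[Inst]) -> Vec<String> {
    instances
        .iter()
        .map(|inst| format!("[{}] {}", inst.instance_id, inst.profile))
        .collect()
}

/// The pre-fix behavior, kept only for contrast: labels follow enumeration order.
pub fn render_rows_by_position(instances: &[Inst]) -> Vec<String> {
    instances
        .iter()
        .enumerate()
        .map(|(i, inst)| format!("[{}] {}", i, inst.profile))
        .collect()
}
```

With sparse slots 0, 1, 4, the first function labels the last row `[4]`, matching the exporter's `mig_instance` label, while the positional variant labels it `[2]`.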

Previously the NVML reader skipped every GPU whose MIG mode was not currently enabled. As a consequence `all_smi_gpu_mig_mode = 0` was never emitted on real hardware, runtime MIG-mode transitions were invisible to Prometheus consumers, and the existing tests that exercised the `mig_mode: false` branch were exercising an unreachable code path.

Reader:
- Emit a `MigGpuInfo` row for every MIG-capable GPU regardless of current state; leave `instances` empty and `mig_mode = false` for disabled parents.
- Skip `enumerate_mig_instances` when mode is disabled (nothing to enumerate) instead of unconditionally calling it.

Parser:
- Track UUIDs explicitly observed via `gpu_mig_mode` in a `HashSet` so `finish` retains disabled-MIG rows. The previous `is_mig_active()` retain filter dropped any host whose mode was zero and whose instance vec was empty, silently discarding the very state we now want to expose.

Tests:
- Exporter unit test asserting a disabled-and-empty parent still produces `gpu_mig_mode 0` and no per-instance data lines.
- Integration test round-tripping a disabled parent through the real exporter and parser, verifying the consumer side observes mode=0.
- Existing parser test flipped from "silent drop" to "retained row" to match the new contract.
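
The parser-side change can be approximated as follows. This is a sketch under assumptions: `ParsedHost` and this free-standing `finish` are hypothetical simplifications of the parser state, and the host that is neither active nor observed exists here only to show what the filter still drops.

```rust
use std::collections::HashSet;

/// Hypothetical simplified view of one parsed GPU.
#[derive(Debug, Clone, PartialEq)]
pub struct ParsedHost {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instance_count: usize,
}

/// Keep a host if it is MIG-active OR its UUID was explicitly seen on a
/// `gpu_mig_mode` line, so that mode=0 parents survive the retain pass.
pub fn finish(mut hosts: Vec<ParsedHost>, seen_mode_line: &HashSet<String>) -> Vec<ParsedHost> {
    hosts.retain(|h| {
        let active = h.mig_mode || h.instance_count > 0;
        active || seen_mode_line.contains(&h.gpu_uuid)
    });
    hosts
}
```

The old filter corresponds to the `active` term alone; the extra `HashSet` lookup is what lets a consumer distinguish "MIG-capable but disabled" from "no MIG information at all".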

`decode_cstr` had no callers — the two raw FFI helpers that needed it during an earlier draft (`mig_gpu_instance_id`, `mig_compute_instance_id`) were rewritten to read `u32` directly. The helper sat behind `#[allow(dead_code)]` pulling in `std::ffi::CStr` for no reason. Drop both; any future FFI that needs NUL-terminated buffers can re-add it with an actual caller.

…tate

After the retain pass, iterate hosts and flip mig_mode to true whenever instances are present. A remote feed may emit mig_instance_* lines without a gpu_mig_mode line; the retain keeps such hosts alive, but they would carry mig_mode=false, a contradictory state no real exporter produces. Add unit test mig_parser_infers_mig_mode_from_instance_presence that feeds only instance metrics with no gpu_mig_mode line and asserts mig_mode=true.
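
The inference step amounts to a single pass over the retained hosts. A minimal sketch, with `RemoteHost` and `infer_mig_mode` as hypothetical names:

```rust
/// Hypothetical simplified host record as assembled from a remote feed.
pub struct RemoteHost {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<u32>,
}

/// If any instance was parsed for a host, MIG must be on, whatever the
/// (possibly missing) `gpu_mig_mode` line said.
pub fn infer_mig_mode(hosts: &mut [RemoteHost]) {
    for h in hosts.iter_mut() {
        if !h.instances.is_empty() {
            h.mig_mode = true;
        }
    }
}
```

Hosts with no instances are left untouched, so an explicitly reported mode=0 parent keeps `mig_mode = false`.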

Document the five new MIG metric families (all_smi_gpu_mig_mode, all_smi_mig_instance_utilization_gpu/memory, all_smi_mig_instance_memory_used/total_bytes), their full label set, TUI behavior, mock env var (ALL_SMI_MOCK_MIG), and example PromQL queries. Update README feature bullet, API metrics list, library API table, mock server section, and v0.21.0 changelog entry.

inureyes · 2026-04-15T10:24:45Z

PR Finalization Complete

Summary

Lint/Format: cargo fmt --all produced no changes (code already formatted). cargo clippy --features mock --all-targets shows 6 warnings, all in amd.rs and tpu_grpc.rs — files not touched by this PR. No fixes needed for PR-touched files.

Tests: cargo test --features mock — all green (456+504+38+17+... test suites, 0 failures).

Documentation (docs: commit e25e4e7):

API.md: Added "NVIDIA MIG Metrics" section (after vGPU section, before Jetson) documenting all_smi_gpu_mig_mode, all_smi_mig_instance_utilization_gpu, all_smi_mig_instance_utilization_memory, all_smi_mig_instance_memory_used_bytes, all_smi_mig_instance_memory_total_bytes with full label table, per-label descriptions, behavioral notes, and 6 PromQL example queries in a new "NVIDIA MIG Specific" PromQL section. Added note 13 in the Notes section.
README.md: Updated NVIDIA feature bullet to mention MIG monitoring; added NVIDIA MIG line to the API metrics list; added ALL_SMI_MOCK_MIG=1 to the mock server section; added get_mig_info() row to the library API table; extended the v0.21.0 changelog entry with MIG details.

All checks passing. Ready for merge.

inureyes added 4 commits April 15, 2026 18:23

chore: apply cargo fmt to MIG additions

e2bcd52

inureyes added the type:enhancement, priority:medium, device:nvidia-gpu, and status:review labels Apr 15, 2026

inureyes added 6 commits April 15, 2026 19:01

inureyes added the status:done label and removed the status:review label Apr 15, 2026

inureyes merged commit fa1224d into main Apr 15, 2026
2 checks passed

inureyes deleted the feature/issue-131-mig-monitoring branch April 15, 2026 10:25

inureyes self-assigned this Apr 17, 2026

inureyes merged 10 commits into main from feature/issue-131-mig-monitoring
