
feat: add NVIDIA MIG (Multi-Instance GPU) monitoring #174

Merged
inureyes merged 10 commits into main from feature/issue-131-mig-monitoring on Apr 15, 2026
@inureyes

Member

Summary

  • Adds NVIDIA MIG (Multi-Instance GPU) detection and per-instance monitoring across the data model, NVML reader, TUI, Prometheus exporter, remote parser, and mock server. Mirrors the architectural pattern landed in #172 (feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12) and #173 (feat: add NVIDIA thermal thresholds and P-state monitoring).
  • Surfaces MIG instances as nested rows beneath the parent GPU in the TUI, exports all_smi_gpu_mig_mode plus four all_smi_mig_instance_* families with gpu_uuid/mig_instance/mig_uuid/mig_profile/gpu_instance_id labels, and round-trips through the remote scrape parser losslessly.
  • Non-MIG GPUs and pre-Ampere architectures stay completely silent — every NVML call is .ok()-wrapped and the feature degrades to an empty Vec on any error (driver too old, missing permissions, MIG mode disabled).
  • Mock server can simulate a MIG-partitioned A100/H100 (5 instances per GPU spanning 1g.5gb/2g.10gb/3g.20gb/7g.40gb) when ALL_SMI_MOCK_MIG is set; otherwise the NVIDIA mock output is byte-identical to today.
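
The "degrade to an empty Vec" behavior described above can be sketched as follows. This is a minimal illustration, not the PR's reader: `MigProbe`, `ProbeError`, `MigInstance`, and `FakeProbe` are hypothetical stand-ins for the nvml-wrapper calls, introduced only so the example is self-contained.

```rust
/// Hypothetical stand-in for an NVML error; not the real nvml-wrapper type.
#[derive(Debug)]
pub struct ProbeError;

#[derive(Debug, Clone, PartialEq)]
pub struct MigInstance {
    pub instance_id: u32,
    pub profile: String,
}

/// Hypothetical abstraction over the per-device queries a reader needs.
pub trait MigProbe {
    fn mig_mode_enabled(&self) -> Result<bool, ProbeError>;
    fn max_instances(&self) -> Result<u32, ProbeError>;
    fn instance_at(&self, index: u32) -> Result<MigInstance, ProbeError>;
}

/// Every fallible call is turned into an Option with `.ok()`, so an old
/// driver, missing permissions, or a non-MIG GPU all yield an empty Vec.
pub fn collect_instances<P: MigProbe>(probe: &P) -> Vec<MigInstance> {
    if !probe.mig_mode_enabled().ok().unwrap_or(false) {
        return Vec::new();
    }
    let max = probe.max_instances().ok().unwrap_or(0);
    // Slots that fail to resolve are skipped rather than aborting the scan.
    (0..max).filter_map(|i| probe.instance_at(i).ok()).collect()
}

/// In-memory probe used only to exercise the sketch.
pub struct FakeProbe {
    pub mode: Option<bool>,              // None simulates an NVML error
    pub slots: Vec<Option<MigInstance>>, // None simulates an unpopulated slot
}

impl MigProbe for FakeProbe {
    fn mig_mode_enabled(&self) -> Result<bool, ProbeError> {
        self.mode.ok_or(ProbeError)
    }
    fn max_instances(&self) -> Result<u32, ProbeError> {
        Ok(self.slots.len() as u32)
    }
    fn instance_at(&self, index: u32) -> Result<MigInstance, ProbeError> {
        self.slots
            .get(index as usize)
            .cloned()
            .flatten()
            .ok_or(ProbeError)
    }
}
```

With a `FakeProbe` whose `mode` is `None` or `Some(false)`, `collect_instances` returns an empty vector, which is the silent path non-MIG hosts are expected to take.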

Closes #131.

Files touched

  • src/device/types.rs — new MigInstanceInfo and MigGpuInfo structs.
  • src/device/traits.rs — GpuReader::get_mig_info(&self) with empty default.
  • src/device/readers/nvidia_mig.rs (new) + src/device/readers/nvidia.rs — wire NVML probe (mig_mode() + mig_device_by_index enumeration, raw FFI for nvmlDeviceGetGpuInstanceId / nvmlDeviceGetComputeInstanceId).
  • src/app_state.rs, src/view/data_collection/strategy.rs, src/view/render_snapshot.rs, src/view/data_collection/{local,remote}_collector.rs, src/network/client.rs — thread mig_info: Vec<MigGpuInfo> end-to-end (local + remote).
  • src/network/metrics_parser.rs — new MigParseState with MAX_MIG_GPUS=256 / MAX_MIG_INSTANCES=4096 / MAX_MIG_INSTANCE_INDEX=64 defensive caps; routed mig_instance_* and gpu_mig_mode lines.
  • src/api/metrics/mig.rs (new) + src/api/handlers.rs — Prometheus exporter with precomputed row cache (matches the perf pattern landed in #172).
  • src/ui/renderers/mig_renderer.rs (new), src/view/frame_renderer.rs, src/ui/renderers/gpu_renderer.rs, src/ui/layout.rs — TUI nested rows, find_matching_mig_gpu UUID-first matcher, layout math updated so PgUp/PgDn page sizes account for MIG rows.
  • src/mock/templates/mig.rs (new) + src/mock/templates/nvidia.rs — env-gated mock template.
  • src/client.rs, src/prelude.rs — AllSmi::get_mig_info() and MigGpuInfo / MigInstanceInfo exports.
  • Tests: tests/mig_integration_test.rs (4 tests, mirrors tests/vgpu_integration_test.rs); inline coverage added to metrics_parser.rs (parser caps, oversized indices, missing optional ids), mig.rs exporter (label set, silent-when-empty), mig_renderer.rs (ANSI-aware), mig.rs mock template (placeholder substitution, env gating). Existing tests/{cpu_model,thermal_pstate_integration,vgpu_integration}_test.rs updated for the new 6-tuple parse_metrics signature.
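
For orientation, the data model and the trait default listed above might look roughly like this. The struct and trait names come from the PR, but every field name and type here is an assumption made for illustration; the real definitions live in src/device/types.rs and src/device/traits.rs and may differ.

```rust
/// One MIG partition. Field names are assumed, not copied from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct MigInstanceInfo {
    pub instance_id: u32,
    pub mig_uuid: String,
    pub profile: String,
    pub gpu_instance_id: Option<u32>,
    pub compute_instance_id: Option<u32>,
    pub memory_used_bytes: u64,
    pub memory_total_bytes: u64,
}

/// A parent GPU together with its MIG state and partitions.
#[derive(Debug, Clone, PartialEq)]
pub struct MigGpuInfo {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<MigInstanceInfo>,
}

/// Simplified reader trait: only the MIG hook is shown.
pub trait GpuReader {
    /// Readers without MIG support inherit this empty default, so they
    /// need no code changes to stay silent.
    fn get_mig_info(&self) -> Vec<MigGpuInfo> {
        Vec::new()
    }
}

/// Hypothetical reader for a platform with no MIG concept.
pub struct NonMigReader;

impl GpuReader for NonMigReader {}
```

Because the default lives on the trait, `NonMigReader.get_mig_info()` returns an empty vector without overriding anything.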

Test plan

  • cargo build --features mock — green
  • cargo test --features mock — 422 lib + 500 bin + 38 integration + 4 new MIG integration tests, 0 failures
  • cargo clippy --features mock --all-targets — no MIG-related warnings (8 pre-existing in amd.rs)
  • cargo fmt — clean
  • Mock smoke test: ALL_SMI_MOCK_MIG=1 all-smi-mock-server --port-range 19099 --platform nvidia emits 168 MIG metric lines for 8 GPUs × 5 profiles; without the env var, /metrics shows 0 MIG lines (silent no-op verified)
  • Hardware validation on a real MIG-enabled A100/H100 host (no MIG hardware available in dev environment)
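
The 168-line figure from the mock smoke test can be sanity-checked with a small calculation. The sketch assumes one all_smi_gpu_mig_mode sample per GPU and one sample per instance for each of the four all_smi_mig_instance_* families, with HELP/TYPE comment lines not counted; `expected_mig_lines` is a hypothetical helper, not part of the PR.

```rust
/// Expected number of MIG sample lines under the stated assumptions:
/// one mode line per GPU plus one line per instance per instance family.
pub const fn expected_mig_lines(gpus: u32, instances_per_gpu: u32, instance_families: u32) -> u32 {
    gpus * (1 + instances_per_gpu * instance_families)
}
```

For 8 GPUs, 5 instances each, and 4 families this gives 8 × (1 + 5 × 4) = 168, matching the reported count.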

Notes for reviewers

  • The MIG_PROFILES table in src/mock/templates/mig.rs documents the synthetic partitioning used in mock mode (1g.5gb x2, 2g.10gb, 3g.20gb, 7g.40gb). Easy to swap for a different mix without touching the parser.
  • compute_instance_id is plumbed end-to-end (FFI → exporter labels → parser) but not currently used as a TUI label — kept available for future PRs that want to surface compute slice IDs in the UI.
  • Per-MIG-instance process attribution is intentionally out of scope here; it can be layered in by joining the existing NVML process accounting (which already reports gpu_instance_id/compute_instance_id per PID) against the new MigInstanceInfo records in a follow-up.
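
As a rough picture of the mock partitioning and its env gate, consider the sketch below. The profile mix is the one listed above; the table's element type, the `mock_mig_enabled` and `mock_instances` helpers, and the choice to treat any value of ALL_SMI_MOCK_MIG (including an empty one) as "set" are assumptions, not the PR's implementation.

```rust
/// Synthetic per-GPU partitioning used in this sketch (1g.5gb x2, 2g.10gb,
/// 3g.20gb, 7g.40gb). The real MIG_PROFILES table may carry more fields.
pub const MIG_PROFILES: [&str; 5] = ["1g.5gb", "1g.5gb", "2g.10gb", "3g.20gb", "7g.40gb"];

/// Assumed gate: the variable merely has to be present in the environment.
pub fn mock_mig_enabled() -> bool {
    std::env::var_os("ALL_SMI_MOCK_MIG").is_some()
}

/// Returns (instance index, profile) pairs, or nothing when the gate is off
/// so the non-MIG mock output stays unchanged.
pub fn mock_instances(enabled: bool) -> Vec<(usize, &'static str)> {
    if !enabled {
        return Vec::new();
    }
    MIG_PROFILES.iter().copied().enumerate().collect()
}
```

A caller would pass `mock_mig_enabled()` as the flag; taking the flag as a parameter keeps the function testable without touching the process environment.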

Introduces MigInstanceInfo and MigGpuInfo to represent NVIDIA MIG
(Multi-Instance GPU) partitions, with a get_mig_info() method on the
GpuReader trait that defaults to returning an empty vector. The new
nvidia_mig reader probes mig_mode() per device, enumerates instances
via mig_device_by_index, and silently degrades to empty on any NVML
error so non-MIG GPUs and pre-Ampere architectures remain unchanged.

Adds a mig_info: Vec<MigGpuInfo> field to AppState, CollectionData, and
RenderSnapshot, and wires the local + remote data collectors to populate
it from the new GpuReader::get_mig_info() method. The network client
fetch path now returns MIG records alongside vGPU records so remote
clusters surface MIG instances the same way local hosts do.

Adds parser-level unit tests for the new MigParseState (host enumeration,
instance counting, defensive caps, oversized index rejection, missing
optional ids), and a tests/mig_integration_test.rs that round-trips a
fixture mirroring MigMetricExporter output through MetricsParser to
guard against silent format drift between exporter and parser. Also
updates the existing thermal/cpu-model integration tests to consume the
new 6-tuple parse_metrics return.
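
The defensive caps exercised by those parser tests could be enforced along the following lines. The three constants and their values come from the PR; `CappedMigState`, its fields, and the decision to treat MAX_MIG_INSTANCE_INDEX as an inclusive upper bound are assumptions for this sketch, and the real MigParseState may behave differently at the boundary.

```rust
use std::collections::BTreeMap;

pub const MAX_MIG_GPUS: usize = 256;
pub const MAX_MIG_INSTANCES: usize = 4096;
pub const MAX_MIG_INSTANCE_INDEX: u32 = 64;

/// Hypothetical parse state: parent GPU UUID -> instance indices seen so far.
#[derive(Default)]
pub struct CappedMigState {
    gpus: BTreeMap<String, Vec<u32>>,
    total_instances: usize,
}

impl CappedMigState {
    /// Records one (gpu, instance) pair from a scraped line. Returns false
    /// when a cap rejects the line, so hostile or corrupt input cannot grow
    /// the state without bound.
    pub fn record(&mut self, gpu_uuid: &str, instance_index: u32) -> bool {
        if instance_index > MAX_MIG_INSTANCE_INDEX {
            return false; // oversized index (bound assumed inclusive)
        }
        let known_gpu = self.gpus.contains_key(gpu_uuid);
        if !known_gpu && self.gpus.len() >= MAX_MIG_GPUS {
            return false; // too many distinct parent GPUs
        }
        let known_instance = self
            .gpus
            .get(gpu_uuid)
            .map_or(false, |ids| ids.contains(&instance_index));
        if known_instance {
            return true; // another metric family for an instance already tracked
        }
        if self.total_instances >= MAX_MIG_INSTANCES {
            return false; // global instance budget exhausted
        }
        self.gpus
            .entry(gpu_uuid.to_string())
            .or_default()
            .push(instance_index);
        self.total_instances += 1;
        true
    }

    pub fn gpu_count(&self) -> usize {
        self.gpus.len()
    }

    pub fn instance_count(&self) -> usize {
        self.total_instances
    }
}
```

Rejected lines are dropped rather than treated as errors, which keeps a single malformed remote feed from failing the whole scrape.
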
@inureyes added the type:enhancement, priority:medium, device:nvidia-gpu, and status:review labels on Apr 15, 2026
The MIG exporter computed `compute_instance_id_str` per instance but
never actually emitted it as a Prometheus label, while the module docs,
the mock template, and the integration test all assumed it was present.
The omission only stayed invisible because the integration test used
hand-crafted text that itself included the label, so the exporter's
silent drop was never observed.

- Extend `instance_labels` from a 9- to a 10-element array with the
  `compute_instance_id` slot appended next to `gpu_instance_id`.
- Drop the `#[allow(dead_code)]` now that `compute_instance_id_str` is
  actually read by the exporter.
- Expose `api::metrics` from the library (under the `cli` feature) so
  integration tests can round-trip through the real exporter.
- Rewrite `tests/mig_integration_test.rs` to build its fixture via
  `MigMetricExporter::export_metrics` instead of hand-rolled text, so
  future label drops surface as failing tests.
- Add exporter-side coverage for the empty-string fallback when NVML
  does not report `compute_instance_id`.
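
A minimal sketch of an exposition line that carries `compute_instance_id`, including the empty-string fallback, is shown here. It covers only the six labels named in this PR (the real `instance_labels` array has ten slots), and `MigSample`, `escape`, `opt_id`, and `render_line` are hypothetical names rather than the exporter's actual API.

```rust
/// Hypothetical per-instance sample; only the labels named in the PR text.
pub struct MigSample<'a> {
    pub gpu_uuid: &'a str,
    pub mig_instance: u32,
    pub mig_uuid: &'a str,
    pub mig_profile: &'a str,
    pub gpu_instance_id: Option<u32>,
    pub compute_instance_id: Option<u32>,
}

/// Escapes a label value per the Prometheus text format
/// (backslash, double quote, newline).
fn escape(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Missing ids become an empty string instead of dropping the label,
/// so the label set stays identical across instances.
fn opt_id(id: Option<u32>) -> String {
    id.map(|v| v.to_string()).unwrap_or_default()
}

pub fn render_line(metric: &str, s: &MigSample, value: f64) -> String {
    let labels = [
        ("gpu_uuid", escape(s.gpu_uuid)),
        ("mig_instance", s.mig_instance.to_string()),
        ("mig_uuid", escape(s.mig_uuid)),
        ("mig_profile", escape(s.mig_profile)),
        ("gpu_instance_id", opt_id(s.gpu_instance_id)),
        ("compute_instance_id", opt_id(s.compute_instance_id)),
    ];
    let body = labels
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect::<Vec<_>>()
        .join(",");
    format!("{metric}{{{body}}} {value}")
}
```

Building the test fixture from a function like this, rather than from hand-written text, is what lets a dropped label show up as a failing round-trip.
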
The TUI MIG section printed `[i]` using the `Vec<MigInstanceInfo>`
enumeration index, not the instance's NVML-reported `instance_id`. When
NVML enumerates instances at non-contiguous slots — e.g. 0, 1, 4 after
tearing down the middle partitions — the TUI showed `[0], [1], [2]`
while the Prometheus `mig_instance` label correctly reflected the real
slot. UI and exporter then disagreed on which slot each row represented.

- Switch the renderer to `&inst.instance_id.to_string()`.
- Add a regression unit test with sparse instance_ids (0, 1, 4) that
  asserts `[4]` is rendered and `[2]` is not.
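
The difference between labeling by enumeration index and by NVML slot can be illustrated with a short sketch. `render_mig_rows` and the trimmed-down `MigInstanceInfo` below are hypothetical; the real renderer also handles ANSI styling and layout.

```rust
/// Trimmed-down instance record for this sketch only.
pub struct MigInstanceInfo {
    pub instance_id: u32,
    pub profile: String,
}

/// Labels each row with the NVML-reported slot rather than the position in
/// the Vec, so sparse slots (e.g. 0, 1, 4) match the `mig_instance` label.
pub fn render_mig_rows(instances: &[MigInstanceInfo]) -> Vec<String> {
    instances
        .iter()
        .map(|inst| format!("  [{}] {}", inst.instance_id, inst.profile))
        .collect()
}
```

With instances at slots 0, 1, and 4, the third row is labeled `[4]`; an `enumerate()`-based label would have printed `[2]` there.
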
Previously the NVML reader skipped every GPU whose MIG mode was not
currently enabled. As a consequence `all_smi_gpu_mig_mode = 0` was
never emitted on real hardware, runtime MIG-mode transitions were
invisible to Prometheus consumers, and the existing tests that
exercised the `mig_mode: false` branch were exercising an unreachable
code path.

Reader:
- Emit a `MigGpuInfo` row for every MIG-capable GPU regardless of
  current state; leave `instances` empty and `mig_mode = false` for
  disabled parents.
- Skip `enumerate_mig_instances` when mode is disabled (nothing to
  enumerate) instead of unconditionally calling it.

Parser:
- Track UUIDs explicitly observed via `gpu_mig_mode` in a `HashSet`
  so `finish` retains disabled-MIG rows. The previous
  `is_mig_active()` retain filter dropped any host whose mode was
  zero and whose instance vec was empty, silently discarding the
  very state we now want to expose.

Tests:
- Exporter unit test asserting a disabled-and-empty parent still
  produces `gpu_mig_mode 0` and no per-instance data lines.
- Integration test round-tripping a disabled parent through the real
  exporter and parser, verifying the consumer side observes mode=0.
- Existing parser test flipped from "silent drop" to "retained row"
  to match the new contract.
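
The change to the parser's retain filter can be sketched as a before/after pair. `MigGpuRow`, `retain_active_only`, and `retain_reported` are hypothetical names; the sketch assumes the parser keeps one row per parent GPU plus a set of UUIDs for which a `gpu_mig_mode` line was seen.

```rust
use std::collections::HashSet;

/// Simplified parsed row for one parent GPU.
#[derive(Debug, Clone, PartialEq)]
pub struct MigGpuRow {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instance_count: usize,
}

/// Old behavior: only "active" rows survive, so a parent that reported
/// mode 0 with no instances is dropped.
pub fn retain_active_only(rows: Vec<MigGpuRow>) -> Vec<MigGpuRow> {
    rows.into_iter()
        .filter(|r| r.mig_mode || r.instance_count > 0)
        .collect()
}

/// New behavior: a row also survives if its UUID was explicitly observed
/// on a gpu_mig_mode line, which keeps disabled-MIG parents visible.
pub fn retain_reported(rows: Vec<MigGpuRow>, observed: &HashSet<String>) -> Vec<MigGpuRow> {
    rows.into_iter()
        .filter(|r| r.instance_count > 0 || observed.contains(&r.gpu_uuid))
        .collect()
}
```

A disabled parent with no instances is removed by `retain_active_only` but kept by `retain_reported` as long as its UUID is in the observed set.
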
`decode_cstr` had no callers — the two raw FFI helpers that needed it
during an earlier draft (`mig_gpu_instance_id`, `mig_compute_instance_id`)
were rewritten to read `u32` directly. The helper sat behind
`#[allow(dead_code)]` pulling in `std::ffi::CStr` for no reason. Drop
both; any future FFI that needs NUL-terminated buffers can re-add it
with an actual caller.

…tate

After the retain pass, iterate hosts and flip mig_mode to true whenever
instances are present. A remote feed may emit mig_instance_* lines without
a gpu_mig_mode line; the retain keeps such hosts alive but they would carry
mig_mode=false — a contradictory state no real exporter produces.

Add unit test mig_parser_infers_mig_mode_from_instance_presence that feeds
only instance metrics with no gpu_mig_mode line and asserts mig_mode=true.
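
The post-retain fix-up described in this commit amounts to a single pass over the parsed hosts. The sketch below uses a hypothetical `ParsedMigGpu` type and `infer_mig_mode` function; only the rule itself (instances present implies `mig_mode = true`) is taken from the commit message.

```rust
/// Simplified parsed parent GPU for this sketch.
pub struct ParsedMigGpu {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<u32>,
}

/// Instances cannot exist while MIG mode is off, so a feed that carried
/// instance lines but no gpu_mig_mode line is normalized to mig_mode = true.
/// Rows without instances keep whatever mode was parsed.
pub fn infer_mig_mode(gpus: &mut [ParsedMigGpu]) {
    for gpu in gpus.iter_mut() {
        if !gpu.instances.is_empty() {
            gpu.mig_mode = true;
        }
    }
}
```

Running this after the retain pass means downstream consumers never see a parent with instances and `mig_mode = false`.
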
Document the five new MIG metric families (all_smi_gpu_mig_mode,
all_smi_mig_instance_utilization_gpu/memory,
all_smi_mig_instance_memory_used/total_bytes), their full label set,
TUI behavior, mock env var (ALL_SMI_MOCK_MIG), and example PromQL
queries. Update README feature bullet, API metrics list, library API
table, mock server section, and v0.21.0 changelog entry.
@inureyes

Member Author

PR Finalization Complete

Summary

Lint/Format: cargo fmt --all produced no changes (code already formatted). cargo clippy --features mock --all-targets shows 6 warnings, all in amd.rs and tpu_grpc.rs — files not touched by this PR. No fixes needed for PR-touched files.

Tests: cargo test --features mock — all green (456+504+38+17+... test suites, 0 failures).

Documentation (docs: commit e25e4e7):

  • API.md: Added "NVIDIA MIG Metrics" section (after vGPU section, before Jetson) documenting all_smi_gpu_mig_mode, all_smi_mig_instance_utilization_gpu, all_smi_mig_instance_utilization_memory, all_smi_mig_instance_memory_used_bytes, all_smi_mig_instance_memory_total_bytes with full label table, per-label descriptions, behavioral notes, and 6 PromQL example queries in a new "NVIDIA MIG Specific" PromQL section. Added note 13 in the Notes section.
  • README.md: Updated NVIDIA feature bullet to mention MIG monitoring; added NVIDIA MIG line to the API metrics list; added ALL_SMI_MOCK_MIG=1 to the mock server section; added get_mig_info() row to the library API table; extended the v0.21.0 changelog entry with MIG details.

All checks passing. Ready for merge.

@inureyes added the status:done label and removed the status:review label on Apr 15, 2026
@inureyes merged commit fa1224d into main on Apr 15, 2026
2 checks passed
@inureyes deleted the feature/issue-131-mig-monitoring branch on April 15, 2026 at 10:25
@inureyes self-assigned this on Apr 17, 2026
Labels

device:nvidia-gpu (NVIDIA GPU related), priority:medium (Medium priority issue), status:done (Completed), type:enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

Add NVIDIA MIG (Multi-Instance GPU) monitoring support

1 participant