feat: add NVIDIA MIG (Multi-Instance GPU) monitoring #174
Merged
Introduces MigInstanceInfo and MigGpuInfo to represent NVIDIA MIG (Multi-Instance GPU) partitions, with a get_mig_info() method on the GpuReader trait that defaults to returning an empty vector. The new nvidia_mig reader probes mig_mode() per device, enumerates instances via mig_device_by_index, and silently degrades to empty on any NVML error so non-MIG GPUs and pre-Ampere architectures remain unchanged.
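As a rough illustration of the trait default and the degrade-to-empty behavior described above, here is a minimal sketch. The struct fields, `OtherReader`, `NvidiaLikeReader`, and `probe_result` are hypothetical stand-ins and are not taken from the PR; only the names `MigInstanceInfo`, `MigGpuInfo`, `GpuReader`, and `get_mig_info` come from the description.

```rust
// Hypothetical field names: the PR text does not list the real struct fields.
#[derive(Debug, Clone, PartialEq)]
pub struct MigInstanceInfo {
    pub instance_id: u32,
    pub profile: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct MigGpuInfo {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<MigInstanceInfo>,
}

pub trait GpuReader {
    /// Readers without MIG support inherit this empty default.
    fn get_mig_info(&self) -> Vec<MigGpuInfo> {
        Vec::new()
    }
}

/// Stand-in for a reader that does not override the default.
pub struct OtherReader;
impl GpuReader for OtherReader {}

/// Stand-in for an NVML-backed reader. `probe_result` is a hypothetical
/// placeholder for the outcome of a fallible NVML query.
pub struct NvidiaLikeReader {
    pub probe_result: Result<Vec<MigGpuInfo>, String>,
}

impl GpuReader for NvidiaLikeReader {
    fn get_mig_info(&self) -> Vec<MigGpuInfo> {
        // Any error collapses to an empty vector instead of propagating.
        self.probe_result.clone().ok().unwrap_or_default()
    }
}
```

With this shape, callers never see an NVML error from the MIG path; they only see fewer rows, which is why non-MIG hosts behave as before.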
Adds a mig_info: Vec<MigGpuInfo> field to AppState, CollectionData, and RenderSnapshot, and wires the local + remote data collectors to populate it from the new GpuReader::get_mig_info() method. The network client fetch path now returns MIG records alongside vGPU records so remote clusters surface MIG instances the same way local hosts do.
Adds parser-level unit tests for the new MigParseState (host enumeration, instance counting, defensive caps, oversized index rejection, missing optional ids), and a tests/mig_integration_test.rs that round-trips a fixture mirroring MigMetricExporter output through MetricsParser to guard against silent format drift between exporter and parser. Also updates the existing thermal/cpu-model integration tests to consume the new 6-tuple parse_metrics return.
The MIG exporter computed `compute_instance_id_str` per instance but never actually emitted it as a Prometheus label, while the module docs, the mock template, and the integration test all assumed it was present. The omission only stayed invisible because the integration test used hand-crafted text that itself included the label, so the exporter's silent drop was never observed.

- Extend `instance_labels` from a 9- to a 10-element array with the `compute_instance_id` slot appended next to `gpu_instance_id`.
- Drop the `#[allow(dead_code)]` now that `compute_instance_id_str` is actually read by the exporter.
- Expose `api::metrics` from the library (under the `cli` feature) so integration tests can round-trip through the real exporter.
- Rewrite `tests/mig_integration_test.rs` to build its fixture via `MigMetricExporter::export_metrics` instead of hand-rolled text, so future label drops surface as failing tests.
- Add exporter-side coverage for the empty-string fallback when NVML does not report `compute_instance_id`.
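The empty-string fallback for a missing `compute_instance_id` could look roughly like the sketch below. `Instance`, `id_labels`, and `render_labels` are hypothetical names used for illustration; the real exporter builds a 10-element array, and only the two id slots are shown here. Label-value escaping is omitted.

```rust
/// Hypothetical, simplified instance record; only the two ids matter here.
pub struct Instance {
    pub gpu_instance_id: Option<u32>,
    pub compute_instance_id: Option<u32>,
}

/// Builds the two id labels, falling back to an empty string when the id
/// was not reported.
pub fn id_labels(inst: &Instance) -> [(&'static str, String); 2] {
    let to_label = |id: Option<u32>| id.map(|v| v.to_string()).unwrap_or_default();
    [
        ("gpu_instance_id", to_label(inst.gpu_instance_id)),
        ("compute_instance_id", to_label(inst.compute_instance_id)),
    ]
}

/// Renders labels as `key="value",key="value"` (no escaping in this sketch).
pub fn render_labels(labels: &[(&str, String)]) -> String {
    labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, v))
        .collect::<Vec<_>>()
        .join(",")
}
```

Keeping the label present with an empty value, rather than omitting it, means every series in a family carries the same label set.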
The TUI MIG section printed `[i]` using the `Vec<MigInstanceInfo>` enumeration index, not the instance's NVML-reported `instance_id`. When NVML enumerates instances at non-contiguous slots (e.g. 0, 1, 4 after tearing down the middle partitions), the TUI showed `[0], [1], [2]` while the Prometheus `mig_instance` label correctly reflected the real slot. UI and exporter then disagreed on which slot each row represented.

- Switch the renderer to `&inst.instance_id.to_string()`.
- Add a regression unit test with sparse instance_ids (0, 1, 4) that asserts `[4]` is rendered and `[2]` is not.
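The difference between the enumeration index and the reported `instance_id` can be seen in a small sketch. The `profile` field and both `render_rows_*` functions are hypothetical; the actual renderer emits styled TUI rows rather than plain strings.

```rust
/// Hypothetical, simplified record; the real struct has more fields.
pub struct MigInstanceInfo {
    pub instance_id: u32,
    pub profile: String,
}

/// Buggy variant: labels rows by their position in the vector.
pub fn render_rows_by_index(instances: &[MigInstanceInfo]) -> Vec<String> {
    instances
        .iter()
        .enumerate()
        .map(|(i, inst)| format!("[{}] {}", i, inst.profile))
        .collect()
}

/// Fixed variant: labels rows by the reported instance id.
pub fn render_rows_by_id(instances: &[MigInstanceInfo]) -> Vec<String> {
    instances
        .iter()
        .map(|inst| format!("[{}] {}", inst.instance_id, inst.profile))
        .collect()
}
```

For instances at slots 0, 1, and 4, the first function produces a `[2]` row that matches no real slot, while the second produces `[4]`.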
Previously the NVML reader skipped every GPU whose MIG mode was not currently enabled. As a consequence `all_smi_gpu_mig_mode = 0` was never emitted on real hardware, runtime MIG-mode transitions were invisible to Prometheus consumers, and the existing tests that exercised the `mig_mode: false` branch were exercising an unreachable code path.

Reader:
- Emit a `MigGpuInfo` row for every MIG-capable GPU regardless of current state; leave `instances` empty and `mig_mode = false` for disabled parents.
- Skip `enumerate_mig_instances` when mode is disabled (nothing to enumerate) instead of unconditionally calling it.

Parser:
- Track UUIDs explicitly observed via `gpu_mig_mode` in a `HashSet` so `finish` retains disabled-MIG rows. The previous `is_mig_active()` retain filter dropped any host whose mode was zero and whose instance vec was empty, silently discarding the very state we now want to expose.

Tests:
- Exporter unit test asserting a disabled-and-empty parent still produces `gpu_mig_mode 0` and no per-instance data lines.
- Integration test round-tripping a disabled parent through the real exporter and parser, verifying the consumer side observes mode=0.
- Existing parser test flipped from "silent drop" to "retained row" to match the new contract.
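A minimal sketch of the parser-side retain rule, assuming a simplified `MigGpuInfo` whose `instances` is just a list of ids. `retain_observed` is a hypothetical free function; in the PR the equivalent logic lives in the parse state's `finish`.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified record; `instances` holds only instance ids here.
#[derive(Debug, Clone)]
pub struct MigGpuInfo {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<u32>,
}

impl MigGpuInfo {
    /// The old filter on its own: true only for enabled or populated hosts.
    pub fn is_mig_active(&self) -> bool {
        self.mig_mode || !self.instances.is_empty()
    }
}

/// New rule: also keep any host whose UUID appeared on a `gpu_mig_mode`
/// line, even when that mode was 0 and no instances were reported.
pub fn retain_observed(hosts: &mut Vec<MigGpuInfo>, seen_mode: &HashSet<String>) {
    hosts.retain(|h| h.is_mig_active() || seen_mode.contains(&h.gpu_uuid));
}
```

A disabled parent that was explicitly reported survives the pass, while a host that was neither reported nor populated is still dropped.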
`decode_cstr` had no callers — the two raw FFI helpers that needed it during an earlier draft (`mig_gpu_instance_id`, `mig_compute_instance_id`) were rewritten to read `u32` directly. The helper sat behind `#[allow(dead_code)]` pulling in `std::ffi::CStr` for no reason. Drop both; any future FFI that needs NUL-terminated buffers can re-add it with an actual caller.
…tate

After the retain pass, iterate hosts and flip mig_mode to true whenever instances are present. A remote feed may emit mig_instance_* lines without a gpu_mig_mode line; the retain keeps such hosts alive but they would carry mig_mode=false, a contradictory state no real exporter produces. Add unit test mig_parser_infers_mig_mode_from_instance_presence that feeds only instance metrics with no gpu_mig_mode line and asserts mig_mode=true.
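The inference step might be sketched as follows, again with a simplified `MigGpuInfo`. `infer_mig_mode` is a hypothetical name for the post-retain pass.

```rust
/// Hypothetical, simplified record; `instances` holds only instance ids here.
#[derive(Debug, Clone)]
pub struct MigGpuInfo {
    pub gpu_uuid: String,
    pub mig_mode: bool,
    pub instances: Vec<u32>,
}

/// If a host has instances, MIG must be on, whether or not a
/// `gpu_mig_mode` line was seen for it.
pub fn infer_mig_mode(hosts: &mut [MigGpuInfo]) {
    for host in hosts.iter_mut() {
        if !host.instances.is_empty() {
            host.mig_mode = true;
        }
    }
}
```

The pass only ever flips `false` to `true`, so a disabled parent with no instances keeps `mig_mode = false`.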
Document the five new MIG metric families (all_smi_gpu_mig_mode, all_smi_mig_instance_utilization_gpu/memory, all_smi_mig_instance_memory_used/total_bytes), their full label set, TUI behavior, mock env var (ALL_SMI_MOCK_MIG), and example PromQL queries. Update README feature bullet, API metrics list, library API table, mock server section, and v0.21.0 changelog entry.
PR finalization complete. Lint/format, tests, and documentation: all checks passing. Ready for merge.
Summary
- Exports `all_smi_gpu_mig_mode` plus four `all_smi_mig_instance_*` families with `gpu_uuid`/`mig_instance`/`mig_uuid`/`mig_profile`/`gpu_instance_id` labels, and round-trips through the remote scrape parser losslessly.
- NVML calls are `.ok()`-wrapped and the feature degrades to an empty `Vec` on any error (driver too old, missing permissions, MIG mode disabled).
- The NVIDIA mock emits synthetic MIG profiles (`1g.5gb`/`2g.10gb`/`3g.20gb`/`7g.40gb`) when `ALL_SMI_MOCK_MIG` is set; otherwise the NVIDIA mock output is byte-identical to today.

Closes #131.
Files touched
- `src/device/types.rs`: new `MigInstanceInfo` and `MigGpuInfo` structs.
- `src/device/traits.rs`: `GpuReader::get_mig_info(&self)` with empty default.
- `src/device/readers/nvidia_mig.rs` (new) + `src/device/readers/nvidia.rs`: wire NVML probe (`mig_mode()` + `mig_device_by_index` enumeration, raw FFI for `nvmlDeviceGetGpuInstanceId`/`nvmlDeviceGetComputeInstanceId`).
- `src/app_state.rs`, `src/view/data_collection/strategy.rs`, `src/view/render_snapshot.rs`, `src/view/data_collection/{local,remote}_collector.rs`, `src/network/client.rs`: thread `mig_info: Vec<MigGpuInfo>` end-to-end (local + remote).
- `src/network/metrics_parser.rs`: new `MigParseState` with `MAX_MIG_GPUS=256`/`MAX_MIG_INSTANCES=4096`/`MAX_MIG_INSTANCE_INDEX=64` defensive caps; routed `mig_instance_*` and `gpu_mig_mode` lines.
- `src/api/metrics/mig.rs` (new) + `src/api/handlers.rs`: Prometheus exporter with precomputed row cache (matches the perf pattern landed in "feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12", #172).
- `src/ui/renderers/mig_renderer.rs` (new), `src/view/frame_renderer.rs`, `src/ui/renderers/gpu_renderer.rs`, `src/ui/layout.rs`: TUI nested rows, `find_matching_mig_gpu` UUID-first matcher, layout math updated so PgUp/PgDn page sizes account for MIG rows.
- `src/mock/templates/mig.rs` (new) + `src/mock/templates/nvidia.rs`: env-gated mock template.
- `src/client.rs`, `src/prelude.rs`: `AllSmi::get_mig_info()` and `MigGpuInfo`/`MigInstanceInfo` exports.
- `tests/mig_integration_test.rs` (4 tests, mirrors `tests/vgpu_integration_test.rs`); inline coverage added to `metrics_parser.rs` (parser caps, oversized indices, missing optional ids), `mig.rs` exporter (label set, silent-when-empty), `mig_renderer.rs` (ANSI-aware), `mig.rs` mock template (placeholder substitution, env gating). Existing `tests/{cpu_model,thermal_pstate_integration,vgpu_integration}_test.rs` updated for the new 6-tuple `parse_metrics` signature.

Test plan
- `cargo build --features mock`: green
- `cargo test --features mock`: 422 lib + 500 bin + 38 integration + 4 new MIG integration tests, 0 failures
- `cargo clippy --features mock --all-targets`: no MIG-related warnings (8 pre-existing in `amd.rs`)
- `cargo fmt`: clean
- `ALL_SMI_MOCK_MIG=1 all-smi-mock-server --port-range 19099 --platform nvidia` emits 168 MIG metric lines for 8 GPUs × 5 profiles; without the env var, `/metrics` shows 0 MIG lines (silent no-op verified)

Notes for reviewers
- `MIG_PROFILES` table in `src/mock/templates/mig.rs` documents the synthetic partitioning used in mock mode (1g.5gb x2, 2g.10gb, 3g.20gb, 7g.40gb). Easy to swap for a different mix without touching the parser.
- `compute_instance_id` is plumbed end-to-end (FFI → exporter labels → parser) but not currently used as a TUI label; kept available for future PRs that want to surface compute slice IDs in the UI.
- … `gpu_instance_id`/`compute_instance_id` per PID) against the new `MigInstanceInfo` records in a follow-up.
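For reviewers looking at the parser's defensive caps listed under "Files touched", the sketch below shows one way such limits can be enforced while ingesting remote metrics. The three constants use the values named in the PR; `CappedState`, `record`, and the choice to treat the index bound as inclusive are assumptions made for illustration, not the actual `MigParseState` implementation.

```rust
use std::collections::HashMap;

pub const MAX_MIG_GPUS: usize = 256;
pub const MAX_MIG_INSTANCES: usize = 4096;
pub const MAX_MIG_INSTANCE_INDEX: u32 = 64;

/// Hypothetical parse state that refuses input beyond the caps.
#[derive(Default)]
pub struct CappedState {
    instances_by_gpu: HashMap<String, Vec<u32>>,
    total_instances: usize,
}

impl CappedState {
    /// Returns true if the instance was recorded, false if a cap rejected it.
    pub fn record(&mut self, gpu_uuid: &str, index: u32) -> bool {
        // Assumption: indices strictly above the bound are rejected.
        if index > MAX_MIG_INSTANCE_INDEX {
            return false;
        }
        if self.total_instances >= MAX_MIG_INSTANCES {
            return false;
        }
        // A new parent GPU is only admitted while under the GPU cap.
        let is_new_gpu = !self.instances_by_gpu.contains_key(gpu_uuid);
        if is_new_gpu && self.instances_by_gpu.len() >= MAX_MIG_GPUS {
            return false;
        }
        self.instances_by_gpu
            .entry(gpu_uuid.to_string())
            .or_default()
            .push(index);
        self.total_instances += 1;
        true
    }
}
```

Rejecting instead of erroring keeps a malformed or hostile remote feed from growing memory without bound while the rest of the scrape is still parsed.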