feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12 by inureyes · Pull Request #172 · lablup/all-smi

inureyes · 2026-04-15T07:16:22Z

Summary

Adds first-class NVIDIA vGPU visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus /metrics endpoint, remote metrics parser, and mock server), while remaining a complete no-op on bare-metal and non-NVIDIA hosts.

Introduce VgpuInfo / VgpuHostInfo types plus a GpuReader::get_vgpu_info trait method with an empty default so only the NVIDIA reader opts in.
NVIDIA reader probes vgpu_host_mode per device; on success it collects scheduler state/capabilities, active vGPU instances, and per-instance accounting stats (UUID, VM id, framebuffer, utilization, memory utilization, active flag).
TUI renders a compact vGPU section beneath each matching physical GPU row, showing host mode, scheduler policy, ARR state, and a per-instance line per vGPU.
Prometheus exporter emits all_smi_vgpu_host_mode, _scheduler_state, _scheduler_policy, _utilization, _memory_utilization, _memory_used_bytes, _memory_total_bytes, and _active with the labels requested in the issue (gpu_index, gpu_uuid, vgpu_id, host, …). Stays silent on hosts with no vGPU data.
Remote metrics parser recognises the same families and reconstructs VgpuHostInfo records, so an all-smi instance scraping another all-smi node surfaces the exact same view.
Mock server optionally simulates vGPU responses when ALL_SMI_MOCK_VGPU is set, unlocking end-to-end exercising without real vGPU hardware.
All NVML calls are wrapped so NotSupported / FunctionNotFound / Uninitialized degrade to empty vectors — no panics, no error logs, no empty UI sections on bare-metal.

Closes #129

Test plan

cargo build / cargo build --features mock
cargo test --features mock (all 8 test binaries green, 360+ library tests + new integration tests)
New tests/vgpu_integration_test.rs round-trips exporter text through the remote parser and asserts every field survives.
New unit tests for VgpuMetricExporter, VgpuParseState, the TUI renderer (ANSI-aware), the mock template, and the NVIDIA reader host-mode mapping.
cargo clippy --features mock --all-targets introduces no new warnings (pre-existing amd.rs / tpu_grpc.rs warnings are untouched).
cargo fmt --check clean.
Smoke-test on a real vGPU-enabled host (not available in CI; the NVML error-swallowing contract is covered by unit tests + the no-op guarantee).

Notes

nvml-wrapper was already at 0.12.1; no dependency bump required.
On hosts without NVML, AllSmi::get_vgpu_info() returns an empty vector — confirmed by the library integration test that runs on every platform.
The exporter and parser agree on label ordering and naming; the round-trip test guards against future drift.

Introduce VgpuInfo and VgpuHostInfo types alongside a new GpuReader::get_vgpu_info trait method that defaults to an empty Vec so non-NVIDIA readers remain silent. The NVIDIA reader probes vgpu_host_mode per device and, when the host is vGPU-capable, collects scheduler state/capabilities, active vGPU instances, and per-instance accounting stats through vgpu_accounting_pids/vgpu_accounting_instance plus the raw FFI string queries (UUID, VmID, FbUsage, Type/FramebufferSize). All NVML calls are wrapped so any error (including the common NotSupported on bare-metal) degrades to an empty vector rather than a panic or log line. The local collector pulls vgpu data on every cycle and writes it into AppState; the remote collector reserves the field for parser-side wiring. Also expose AllSmi::get_vgpu_info and the types via the library prelude. Refs #129

When the active snapshot carries a VgpuHostInfo matching a GPU (by UUID, with a hostname + gpu_name fallback for remote metrics missing a UUID), the frame renderer emits a compact vGPU block directly under the parent GPU line. The block reports host mode, scheduler policy, ARR state, instance count, and a per-instance row showing type name, utilization, memory usage, VM id, and active/idle indicator. Bare-metal GPUs have no matching record so the section is entirely invisible — no empty placeholder, no header. Refs #129

Add a VgpuMetricExporter that emits one metric family per vGPU concern: all_smi_vgpu_host_mode, _scheduler_state, _scheduler_policy at the host level, and _utilization, _memory_utilization, _memory_used_bytes, _memory_total_bytes, _active per instance. Every line carries gpu_index / gpu_uuid / gpu / instance / host labels; per- instance lines additionally carry vgpu_id / vgpu_uuid / vgpu_type. The exporter writes nothing when the vgpu_info slice is empty, so non-vGPU hosts stay completely silent on the /metrics endpoint. Teach the remote metrics parser to recognise the same families and reconstruct VgpuHostInfo records, keyed by (gpu_uuid, vgpu_id) for instances and by gpu_uuid for host-scope metrics. The resulting Vec<VgpuHostInfo> now flows through the network client and into the AppState via the remote collector, completing the producer-exporter-consumer round trip. Refs #129

Add a vGPU template module that, when ALL_SMI_MOCK_VGPU is set in the mock server's environment, appends the full vGPU metric family collection to NVIDIA node responses. Each mock GPU gets four synthetic vGPU instances whose framebuffer totals sum to the parent GPU memory and whose utilization / memory / active state varies per render cycle. The env var gate preserves the default bare-metal mock behaviour, so existing test suites are unaffected unless they explicitly opt in. Add a cross-crate integration test that renders exporter-shaped metrics text, parses it, and asserts every field round-trips faithfully (including the 'no data line for missing utilization' case). Refs #129

The high-level Device::active_vgpus helper in nvml-wrapper 0.12.1 is gated on #[cfg(target_os = "linux")], so calling it unconditionally from nvidia_vgpu.rs (which is compiled on every platform) fails to build on Windows where reader_factory still instantiates NvidiaGpuReader. Switch to the raw nvmlDeviceGetActiveVgpus symbol via the existing libloading-based pattern used for the other vGPU queries in this module. Behaviour is unchanged on Linux; bare-metal and non-vGPU hosts still degrade to an empty vector. Also drop the dead parse_host_mode_code helper: it claimed to be called from api/metrics/vgpu.rs but is not, and its u32::MAX fallback for "Disabled" disagreed with the exporter's 2. The exporter's host_mode_code is now the single source of truth, eliminating a future drift hazard.

The NVIDIA reader already collects vgpu_vm_id and the TUI renders it as the "vm=" column, but the Prometheus exporter never emitted the field, so a remote all-smi scraping another all-smi reconstructed vm_id as an empty string. This broke the PR's claimed "remote surfaces the exact same view" property. Emit vgpu_vm_id as an extra instance label and read it back on the parser side. Update the mock template to include the label so the mock server mirrors the new contract, and extend the vgpu_integration_test round-trip to assert the field survives.

inureyes · 2026-04-15T07:41:12Z

Implementation Review Summary

Intent

Add first-class NVIDIA vGPU visibility (detection + TUI + Prometheus export + remote round-trip + mock) that is a silent no-op on bare-metal/non-NVIDIA hosts.

Findings Addressed (auto-fixed)

CRITICAL — Windows build break: device.active_vgpus() is #[cfg(target_os = "linux")] in nvml-wrapper 0.12.1. The PR called it unconditionally from nvidia_vgpu.rs (no cfg gate), which compiles on Linux but fails on Windows where reader_factory still instantiates NvidiaGpuReader. Windows is an active release target per .github/workflows/release.yml. Fixed by routing through the raw nvmlDeviceGetActiveVgpus FFI symbol, matching the pattern already used for vgpu_uuid/vgpu_vm_id/etc. in the same file. (commit e4f8c58)
MEDIUM — vm_id lost on remote round-trip: The reader collected vgpu_vm_id and the TUI rendered it, but the Prometheus exporter never emitted it, so remote scrapers saw blank vm=. Added vgpu_vm_id label to instance metrics, parser reads it, mock template emits it, integration test asserts it round-trips. (commit a287df4)
LOW — dead parse_host_mode_code encoded wrong convention: Returned u32::MAX for unknown/Disabled while the exporter uses 2. Its doc comment falsely claimed it was called from api/metrics/vgpu.rs. Removed to eliminate drift risk. (commit e4f8c58)

Remaining Items (not blocking)

MEDIUM — arbitrary PID selection in accounting stats — collect_single_vgpu picks pids[0] as the "representative" sample. NVML doesn't document PID ordering, so on vGPUs with multiple concurrent processes the reported util/mem-util/active can be non-deterministic. Reasonable fallback for a monitoring tool, worth revisiting later (aggregate or pick max util). (src/device/readers/nvidia_vgpu.rs:172-184)
LOW — unused _width parameter in print_vgpu_section (other renderers use width). (src/ui/renderers/vgpu_renderer.rs:39)
LOW — vgpu_info overwritten unconditionally in remote_collector.rs (unlike gpu_info which has a "preserve if invalid" guard). Could cause a one-frame flicker on transient poll failure. (src/view/data_collection/remote_collector.rs:209)

Verification

All stated requirements implemented (detection, collection, TUI, Prometheus, remote parser, mock)
No placeholder/mock code in production paths (the ALL_SMI_MOCK_VGPU-gated template lives in src/mock/templates/, correct location)
Fully integrated: reader → CollectionData → AppState → RenderSnapshot → frame_renderer + /metrics handler + NetworkClient round-trip
Project conventions followed (trait default for non-NVIDIA readers, MetricBuilder/MetricExporter reuse, renderer module layout, FFI pattern)
Existing modules reused (GpuReader trait, MetricBuilder, parser regex, frame renderer loop)
No unintended structural changes (existing GPU monitoring path untouched)
Bare-metal no-op verified (every NVML call wrapped; empty vec → empty exporter output → no TUI section)
Tests pass: 363 lib + 428 mock + 3 vGPU integration + 31 device tests, etc. — all green
cargo fmt --check clean, cargo clippy --features mock produces no new warnings (only the 6 pre-existing amd.rs/tpu_grpc.rs warnings)

The PR is now in a reviewable-done state.

sanitize_label_value panicked whenever the byte at MAX_LABEL_VALUE_LENGTH landed inside a multi-byte UTF-8 codepoint. A remote exporter emitting an overlong non-ASCII label could therefore crash the parser worker. Walk back to the nearest char boundary before slicing.

MetricBuilder::metric already escapes backslash, double-quote, and newline per the Prometheus exposition spec, but a stray carriage return in a label value (e.g. Windows-origin firmware strings) would pass through literally. Downstream scrapers treat \r as a line terminator on Windows, so an unescaped CR could split one metric into two and let the tail be reinterpreted as attacker-controlled content. Escape it to \\r to match the existing \\n handling.

active_vgpus_ffi previously trusted the count NVML wrote during the probe phase and passed it straight to vec![0; count as usize]. A misbehaving or hostile driver returning u32::MAX would drive an instant ~16 GiB allocation. Clamp to MAX_VGPU_INSTANCES_PER_DEVICE (256, well above any real deployment) and bail out to an empty vec when exceeded. Introduce a distinct count_in for the fetch call so the buffer capacity stays explicit and defensively clamp the post-call truncate length to the allocated buffer size.

export_instance_metrics used to call iter_instances five times per scrape (once per metric family). Each call rebuilt every gpu_index and instance_id string, producing ~5 * 2 * N allocations per scrape for N vGPUs. Collect the flattened (host, vgpu, gpu_index_str, instance_id_str) tuples into a Vec<Row> once and iterate over borrowed slices for each family. Cuts per-scrape allocation count by 4x and keeps the exposition output byte-for-byte identical.

Every other metric family in parse_metrics was already guarded by MAX_DEVICES_PER_TYPE=256, but the vGPU path had no bound. A malicious remote exporter could advertise unbounded gpu_uuid / (gpu_uuid, vgpu_id) values and drive VgpuParseState.hosts and .instances to exhaust memory. Introduce MAX_VGPU_HOSTS=256 and MAX_VGPU_INSTANCES=4096 (~16 vGPUs * 256 hosts) and gate only the new-key insertion path so scrapes that legitimately update existing entries are unaffected. Add two regression tests that push past each cap and assert the extras are silently dropped without panicking.

parse_labels naively split on every comma, which a VM-owner-controlled label value containing `",` could use to break out and inject fake labels (e.g. forging a `host=` key to misattribute metrics to another host). Combined with H-1, this unlocked cross-host metric spoofing. Introduce split_labels_respecting_quotes to honour double-quoted values and backslash escapes, and unescape_label_value to reverse the `\\`, `\"`, `\n`, and `\r` sequences produced by MetricBuilder::metric. Add regression tests covering comma-injection and the newline/carriage-return round-trip.

Document all eight vGPU Prometheus metric families (host-level host_mode, scheduler_state, scheduler_policy; per-instance utilization, memory_utilization, memory_used_bytes, memory_total_bytes, active) with complete label sets and descriptions. Add PromQL query examples for vGPU monitoring, note on ALL_SMI_MOCK_VGPU env var, note 12 in the API Notes section, get_vgpu_info() in the library API table, and a changelog entry for the upcoming release in README.

inureyes · 2026-04-15T08:10:20Z

PR Finalization Complete

Lint / Format

cargo fmt --all -- --check: clean (no formatting changes needed)
cargo clippy --features mock --all-targets: only pre-existing warnings in amd.rs and tpu_grpc.rs — zero new warnings from PR-introduced files

Tests

cargo test --features mock: all 17 vGPU-specific tests pass plus the full suite (368 lib + 434 bin + integration tests, 0 failures)
Test names are descriptive and accurate — no renames needed
Integration test in tests/vgpu_integration_test.rs covers the full exporter → parser round-trip

Documentation (1 commit: `0d36357`)

API.md — new "NVIDIA vGPU Metrics" section inserted after the existing NVIDIA GPU Specific Metrics section:

Host-level metric table: all_smi_vgpu_host_mode, all_smi_vgpu_scheduler_state, all_smi_vgpu_scheduler_policy with full label sets
Per-instance metric table: all_smi_vgpu_utilization, all_smi_vgpu_memory_utilization, all_smi_vgpu_memory_used_bytes, all_smi_vgpu_memory_total_bytes, all_smi_vgpu_active with full label sets
ALL_SMI_MOCK_VGPU env var documented in the notes
Six vGPU PromQL example queries added to the "Example Prometheus Queries" section
Note 12 added to the API Notes section covering behaviour and requirements

README.md:

NVIDIA feature bullet updated to mention vGPU SR-IOV monitoring
Mock server bullet updated to mention ALL_SMI_MOCK_VGPU=1
get_vgpu_info() row added to the Library API method table
Changelog entry added for the upcoming release

No Korean documentation exists in this repository; English-only update applied.

All checks passing. Ready for merge.

…#176) * fix(security): cap CPU core_id in remote parser to prevent OOM A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold. * fix(security): strip control characters from all metric label values A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings. * refactor: replace parse_metrics 6-tuple with ParsedMetrics struct The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests. * perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID. * fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively. * test: add missing vGPU memory_utilization round-trip assertion The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip. * fix: correct vgpu doc comment and sanitize GPU detail map values Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format. * perf: precompute GPU exporter row cache to reduce per-scrape allocations Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods. * chore: apply cargo fmt formatting * chore: fix clippy needless_lifetimes warnings * fix: update stale label names in hardware exporter doc and tests - Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests * docs: note breaking label rename in v0.21.0 changelog entry

inureyes added 4 commits April 15, 2026 16:15

inureyes added type:enhancement New feature or request priority:medium Medium priority issue device:nvidia-gpu NVIDIA GPU related status:review Under review labels Apr 15, 2026

inureyes added 2 commits April 15, 2026 16:40

inureyes added 7 commits April 15, 2026 17:00

inureyes added status:done Completed and removed status:review Under review labels Apr 15, 2026

inureyes merged commit 8ca866d into main Apr 15, 2026
2 checks passed

inureyes deleted the feature/issue-129-vgpu-monitoring branch April 15, 2026 08:11

This was referenced Apr 15, 2026

feat: add NVIDIA thermal thresholds and P-state monitoring #173

Merged

feat: add NVIDIA MIG (Multi-Instance GPU) monitoring #174

Merged

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175) #176

Merged

inureyes self-assigned this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12#172

feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12#172
inureyes merged 13 commits into
mainfrom
feature/issue-129-vgpu-monitoring

inureyes commented Apr 15, 2026

Uh oh!

inureyes commented Apr 15, 2026

Uh oh!

inureyes commented Apr 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Apr 15, 2026

Summary

Test plan

Notes

Uh oh!

inureyes commented Apr 15, 2026

Implementation Review Summary

Intent

Findings Addressed (auto-fixed)

Remaining Items (not blocking)

Verification

Uh oh!

inureyes commented Apr 15, 2026

PR Finalization Complete

Lint / Format

Tests

Documentation (1 commit: 0d36357)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Documentation (1 commit: `0d36357`)