Skip to content

feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12#172

Merged
inureyes merged 13 commits into
mainfrom
feature/issue-129-vgpu-monitoring
Apr 15, 2026
Merged

feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12#172
inureyes merged 13 commits into
mainfrom
feature/issue-129-vgpu-monitoring

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Adds first-class NVIDIA vGPU visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus /metrics endpoint, remote metrics parser, and mock server), while remaining a complete no-op on bare-metal and non-NVIDIA hosts.

  • Introduce VgpuInfo / VgpuHostInfo types plus a GpuReader::get_vgpu_info trait method with an empty default so only the NVIDIA reader opts in.
  • NVIDIA reader probes vgpu_host_mode per device; on success it collects scheduler state/capabilities, active vGPU instances, and per-instance accounting stats (UUID, VM id, framebuffer, utilization, memory utilization, active flag).
  • TUI renders a compact vGPU section beneath each matching physical GPU row, showing host mode, scheduler policy, ARR state, and a per-instance line per vGPU.
  • Prometheus exporter emits all_smi_vgpu_host_mode, _scheduler_state, _scheduler_policy, _utilization, _memory_utilization, _memory_used_bytes, _memory_total_bytes, and _active with the labels requested in the issue (gpu_index, gpu_uuid, vgpu_id, host, …). Stays silent on hosts with no vGPU data.
  • Remote metrics parser recognises the same families and reconstructs VgpuHostInfo records, so an all-smi instance scraping another all-smi node surfaces the exact same view.
  • Mock server optionally simulates vGPU responses when ALL_SMI_MOCK_VGPU is set, unlocking end-to-end exercising without real vGPU hardware.
  • All NVML calls are wrapped so NotSupported / FunctionNotFound / Uninitialized degrade to empty vectors — no panics, no error logs, no empty UI sections on bare-metal.

Closes #129

Test plan

  • cargo build / cargo build --features mock
  • cargo test --features mock (all 8 test binaries green, 360+ library tests + new integration tests)
  • New tests/vgpu_integration_test.rs round-trips exporter text through the remote parser and asserts every field survives.
  • New unit tests for VgpuMetricExporter, VgpuParseState, the TUI renderer (ANSI-aware), the mock template, and the NVIDIA reader host-mode mapping.
  • cargo clippy --features mock --all-targets introduces no new warnings (pre-existing amd.rs / tpu_grpc.rs warnings are untouched).
  • cargo fmt --check clean.
  • Smoke-test on a real vGPU-enabled host (not available in CI; the NVML error-swallowing contract is covered by unit tests + the no-op guarantee).

Notes

  • nvml-wrapper was already at 0.12.1; no dependency bump required.
  • On hosts without NVML, AllSmi::get_vgpu_info() returns an empty vector — confirmed by the library integration test that runs on every platform.
  • The exporter and parser agree on label ordering and naming; the round-trip test guards against future drift.

Introduce VgpuInfo and VgpuHostInfo types alongside a new
GpuReader::get_vgpu_info trait method that defaults to an empty Vec so
non-NVIDIA readers remain silent. The NVIDIA reader probes vgpu_host_mode
per device and, when the host is vGPU-capable, collects scheduler
state/capabilities, active vGPU instances, and per-instance accounting
stats through vgpu_accounting_pids/vgpu_accounting_instance plus the
raw FFI string queries (UUID, VmID, FbUsage, Type/FramebufferSize).

All NVML calls are wrapped so any error (including the common
NotSupported on bare-metal) degrades to an empty vector rather than a
panic or log line. The local collector pulls vgpu data on every cycle
and writes it into AppState; the remote collector reserves the field
for parser-side wiring. Also expose AllSmi::get_vgpu_info and the
types via the library prelude.

Refs #129
When the active snapshot carries a VgpuHostInfo matching a GPU (by
UUID, with a hostname + gpu_name fallback for remote metrics missing
a UUID), the frame renderer emits a compact vGPU block directly under
the parent GPU line. The block reports host mode, scheduler policy,
ARR state, instance count, and a per-instance row showing type name,
utilization, memory usage, VM id, and active/idle indicator.

Bare-metal GPUs have no matching record so the section is entirely
invisible — no empty placeholder, no header.

Refs #129
Add a VgpuMetricExporter that emits one metric family per vGPU
concern: all_smi_vgpu_host_mode, _scheduler_state, _scheduler_policy
at the host level, and _utilization, _memory_utilization,
_memory_used_bytes, _memory_total_bytes, _active per instance. Every
line carries gpu_index / gpu_uuid / gpu / instance / host labels; per-
instance lines additionally carry vgpu_id / vgpu_uuid / vgpu_type. The
exporter writes nothing when the vgpu_info slice is empty, so non-vGPU
hosts stay completely silent on the /metrics endpoint.

Teach the remote metrics parser to recognise the same families and
reconstruct VgpuHostInfo records, keyed by (gpu_uuid, vgpu_id) for
instances and by gpu_uuid for host-scope metrics. The resulting
Vec<VgpuHostInfo> now flows through the network client and into the
AppState via the remote collector, completing the
producer-exporter-consumer round trip.

Refs #129
Add a vGPU template module that, when ALL_SMI_MOCK_VGPU is set in the
mock server's environment, appends the full vGPU metric family
collection to NVIDIA node responses. Each mock GPU gets four synthetic
vGPU instances whose framebuffer totals sum to the parent GPU memory
and whose utilization / memory / active state varies per render cycle.

The env var gate preserves the default bare-metal mock behaviour, so
existing test suites are unaffected unless they explicitly opt in.
Add a cross-crate integration test that renders exporter-shaped
metrics text, parses it, and asserts every field round-trips
faithfully (including the 'no data line for missing utilization' case).

Refs #129
@inureyes inureyes added type:enhancement New feature or request priority:medium Medium priority issue device:nvidia-gpu NVIDIA GPU related status:review Under review labels Apr 15, 2026
The high-level Device::active_vgpus helper in nvml-wrapper 0.12.1 is
gated on #[cfg(target_os = "linux")], so calling it unconditionally
from nvidia_vgpu.rs (which is compiled on every platform) fails to
build on Windows where reader_factory still instantiates
NvidiaGpuReader. Switch to the raw nvmlDeviceGetActiveVgpus symbol
via the existing libloading-based pattern used for the other vGPU
queries in this module. Behaviour is unchanged on Linux; bare-metal and
non-vGPU hosts still degrade to an empty vector.

Also drop the dead parse_host_mode_code helper: it claimed to be
called from api/metrics/vgpu.rs but is not, and its u32::MAX fallback
for "Disabled" disagreed with the exporter's 2. The exporter's
host_mode_code is now the single source of truth, eliminating a
future drift hazard.
The NVIDIA reader already collects vgpu_vm_id and the TUI renders it as
the "vm=" column, but the Prometheus exporter never emitted the field,
so a remote all-smi scraping another all-smi reconstructed vm_id as an
empty string. This broke the PR's claimed "remote surfaces the exact
same view" property.

Emit vgpu_vm_id as an extra instance label and read it back on the
parser side. Update the mock template to include the label so the mock
server mirrors the new contract, and extend the vgpu_integration_test
round-trip to assert the field survives.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Add first-class NVIDIA vGPU visibility (detection + TUI + Prometheus export + remote round-trip + mock) that is a silent no-op on bare-metal/non-NVIDIA hosts.

Findings Addressed (auto-fixed)

  • CRITICAL — Windows build break: device.active_vgpus() is #[cfg(target_os = "linux")] in nvml-wrapper 0.12.1. The PR called it unconditionally from nvidia_vgpu.rs (no cfg gate), which compiles on Linux but fails on Windows where reader_factory still instantiates NvidiaGpuReader. Windows is an active release target per .github/workflows/release.yml. Fixed by routing through the raw nvmlDeviceGetActiveVgpus FFI symbol, matching the pattern already used for vgpu_uuid/vgpu_vm_id/etc. in the same file. (commit e4f8c58)
  • MEDIUM — vm_id lost on remote round-trip: The reader collected vgpu_vm_id and the TUI rendered it, but the Prometheus exporter never emitted it, so remote scrapers saw blank vm=. Added vgpu_vm_id label to instance metrics, parser reads it, mock template emits it, integration test asserts it round-trips. (commit a287df4)
  • LOW — dead parse_host_mode_code encoded wrong convention: Returned u32::MAX for unknown/Disabled while the exporter uses 2. Its doc comment falsely claimed it was called from api/metrics/vgpu.rs. Removed to eliminate drift risk. (commit e4f8c58)

Remaining Items (not blocking)

  • MEDIUM — arbitrary PID selection in accounting statscollect_single_vgpu picks pids[0] as the "representative" sample. NVML doesn't document PID ordering, so on vGPUs with multiple concurrent processes the reported util/mem-util/active can be non-deterministic. Reasonable fallback for a monitoring tool, worth revisiting later (aggregate or pick max util). (src/device/readers/nvidia_vgpu.rs:172-184)
  • LOW — unused _width parameter in print_vgpu_section (other renderers use width). (src/ui/renderers/vgpu_renderer.rs:39)
  • LOW — vgpu_info overwritten unconditionally in remote_collector.rs (unlike gpu_info which has a "preserve if invalid" guard). Could cause a one-frame flicker on transient poll failure. (src/view/data_collection/remote_collector.rs:209)

Verification

  • All stated requirements implemented (detection, collection, TUI, Prometheus, remote parser, mock)
  • No placeholder/mock code in production paths (the ALL_SMI_MOCK_VGPU-gated template lives in src/mock/templates/, correct location)
  • Fully integrated: reader → CollectionDataAppStateRenderSnapshot → frame_renderer + /metrics handler + NetworkClient round-trip
  • Project conventions followed (trait default for non-NVIDIA readers, MetricBuilder/MetricExporter reuse, renderer module layout, FFI pattern)
  • Existing modules reused (GpuReader trait, MetricBuilder, parser regex, frame renderer loop)
  • No unintended structural changes (existing GPU monitoring path untouched)
  • Bare-metal no-op verified (every NVML call wrapped; empty vec → empty exporter output → no TUI section)
  • Tests pass: 363 lib + 428 mock + 3 vGPU integration + 31 device tests, etc. — all green
  • cargo fmt --check clean, cargo clippy --features mock produces no new warnings (only the 6 pre-existing amd.rs/tpu_grpc.rs warnings)

The PR is now in a reviewable-done state.

sanitize_label_value panicked whenever the byte at MAX_LABEL_VALUE_LENGTH
landed inside a multi-byte UTF-8 codepoint. A remote exporter emitting
an overlong non-ASCII label could therefore crash the parser worker.
Walk back to the nearest char boundary before slicing.
MetricBuilder::metric already escapes backslash, double-quote, and
newline per the Prometheus exposition spec, but a stray carriage
return in a label value (e.g. Windows-origin firmware strings) would
pass through literally. Downstream scrapers treat \r as a line
terminator on Windows, so an unescaped CR could split one metric
into two and let the tail be reinterpreted as attacker-controlled
content. Escape it to \\r to match the existing \\n handling.
active_vgpus_ffi previously trusted the count NVML wrote during the
probe phase and passed it straight to vec![0; count as usize]. A
misbehaving or hostile driver returning u32::MAX would drive an
instant ~16 GiB allocation. Clamp to MAX_VGPU_INSTANCES_PER_DEVICE
(256, well above any real deployment) and bail out to an empty vec
when exceeded. Introduce a distinct count_in for the fetch call so
the buffer capacity stays explicit and defensively clamp the
post-call truncate length to the allocated buffer size.
export_instance_metrics used to call iter_instances five times per
scrape (once per metric family). Each call rebuilt every gpu_index
and instance_id string, producing ~5 * 2 * N allocations per scrape
for N vGPUs. Collect the flattened (host, vgpu, gpu_index_str,
instance_id_str) tuples into a Vec<Row> once and iterate over
borrowed slices for each family. Cuts per-scrape allocation count
by 4x and keeps the exposition output byte-for-byte identical.
Every other metric family in parse_metrics was already guarded by
MAX_DEVICES_PER_TYPE=256, but the vGPU path had no bound. A malicious
remote exporter could advertise unbounded gpu_uuid / (gpu_uuid,
vgpu_id) values and drive VgpuParseState.hosts and .instances to
exhaust memory. Introduce MAX_VGPU_HOSTS=256 and
MAX_VGPU_INSTANCES=4096 (~16 vGPUs * 256 hosts) and gate only the
new-key insertion path so scrapes that legitimately update existing
entries are unaffected. Add two regression tests that push past each
cap and assert the extras are silently dropped without panicking.
parse_labels naively split on every comma, which a VM-owner-controlled
label value containing `",` could use to break out and inject fake
labels (e.g. forging a `host=` key to misattribute metrics to another
host). Combined with H-1, this unlocked cross-host metric spoofing.
Introduce split_labels_respecting_quotes to honour double-quoted
values and backslash escapes, and unescape_label_value to reverse the
`\\`, `\"`, `\n`, and `\r` sequences produced by
MetricBuilder::metric. Add regression tests covering comma-injection
and the newline/carriage-return round-trip.
Document all eight vGPU Prometheus metric families (host-level host_mode,
scheduler_state, scheduler_policy; per-instance utilization,
memory_utilization, memory_used_bytes, memory_total_bytes, active) with
complete label sets and descriptions. Add PromQL query examples for vGPU
monitoring, note on ALL_SMI_MOCK_VGPU env var, note 12 in the API Notes
section, get_vgpu_info() in the library API table, and a changelog entry
for the upcoming release in README.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Lint / Format

  • cargo fmt --all -- --check: clean (no formatting changes needed)
  • cargo clippy --features mock --all-targets: only pre-existing warnings in amd.rs and tpu_grpc.rs — zero new warnings from PR-introduced files

Tests

  • cargo test --features mock: all 17 vGPU-specific tests pass plus the full suite (368 lib + 434 bin + integration tests, 0 failures)
  • Test names are descriptive and accurate — no renames needed
  • Integration test in tests/vgpu_integration_test.rs covers the full exporter → parser round-trip

Documentation (1 commit: 0d36357)

API.md — new "NVIDIA vGPU Metrics" section inserted after the existing NVIDIA GPU Specific Metrics section:

  • Host-level metric table: all_smi_vgpu_host_mode, all_smi_vgpu_scheduler_state, all_smi_vgpu_scheduler_policy with full label sets
  • Per-instance metric table: all_smi_vgpu_utilization, all_smi_vgpu_memory_utilization, all_smi_vgpu_memory_used_bytes, all_smi_vgpu_memory_total_bytes, all_smi_vgpu_active with full label sets
  • ALL_SMI_MOCK_VGPU env var documented in the notes
  • Six vGPU PromQL example queries added to the "Example Prometheus Queries" section
  • Note 12 added to the API Notes section covering behaviour and requirements

README.md:

  • NVIDIA feature bullet updated to mention vGPU SR-IOV monitoring
  • Mock server bullet updated to mention ALL_SMI_MOCK_VGPU=1
  • get_vgpu_info() row added to the Library API method table
  • Changelog entry added for the upcoming release

No Korean documentation exists in this repository; English-only update applied.

All checks passing. Ready for merge.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 15, 2026
@inureyes inureyes merged commit 8ca866d into main Apr 15, 2026
2 checks passed
@inureyes inureyes deleted the feature/issue-129-vgpu-monitoring branch April 15, 2026 08:11
inureyes added a commit that referenced this pull request Apr 16, 2026
…#176)

* fix(security): cap CPU core_id in remote parser to prevent OOM

A malicious upstream sending core_id="4294967295" via the Prometheus
exposition format would cause the parser to grow the per_core_utilization
vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES
cap (1024) and reject any core_id at or above that threshold.

* fix(security): strip control characters from all metric label values

A compromised remote endpoint can inject ANSI escape sequences (e.g.
\x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid,
cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values
directly, causing the terminal to execute the injected sequences.

Add strip_control_chars() to parsing/common.rs and apply it inside
sanitize_label_value() so ALL label values flowing through the parser
are cleaned. Also apply stripping in unescape_label_value() for quoted
values, and in MetricBuilder::metric() on the exporter side to defend
against local NVML returning unexpected strings.

* refactor: replace parse_metrics 6-tuple with ParsedMetrics struct

The parse_metrics return type was an unreadable and unsustainable
6-element tuple. Replace it with a named ParsedMetrics struct with
clearly documented fields. Update all call sites in client.rs and
all integration/unit tests.

* perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching

In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in
find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M)
per frame, causing millions of comparisons.

Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and
use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as
a rare linear scan for entries not found by UUID.

* fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters

BREAKING CHANGE: The GPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid`
to match the more explicit naming already used by the MIG and vGPU
exporters. The remote parser accepts both old and new label names for
backward compatibility with nodes running older versions.

Existing Prometheus dashboards and alert rules that filter on
`index=` or `uuid=` labels will need to be updated to use
`gpu_index=` and `gpu_uuid=` respectively.

* test: add missing vGPU memory_utilization round-trip assertion

The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture
omitted all_smi_vgpu_memory_utilization from the test data. Add the
metric line to the fixture and assert that memory_utilization=45
survives the parse round-trip.

* fix: correct vgpu doc comment and sanitize GPU detail map values

Fix the vgpu.rs module doc that incorrectly referenced
all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode).

Apply sanitize_label_value() to detail HashMap values in gpu.rs
export_device_info to strip control characters from NVML-sourced
strings before they reach the Prometheus exposition format.

* perf: precompute GPU exporter row cache to reduce per-scrape allocations

Add a GpuRow struct and collect_rows function to the GPU exporter,
following the same pattern used by the MIG, vGPU, and hardware
exporters. The stringified gpu_index is computed once per GPU per
scrape instead of being re-created 5x across the five export methods.

* chore: apply cargo fmt formatting

* chore: fix clippy needless_lifetimes warnings

* fix: update stale label names in hardware exporter doc and tests

- Update module doc comment to use `gpu_uuid`/`gpu_index` matching the
  actual `base_labels` implementation
- Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix
  so substring matching cannot mask a regression to old label names
- Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since
  it was superseded by `gpu_render_line_count_with_lookup` for production
  use and is now only called from tests

* docs: note breaking label rename in v0.21.0 changelog entry
@inureyes inureyes self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

device:nvidia-gpu NVIDIA GPU related priority:medium Medium priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add NVIDIA vGPU monitoring support via nvml-wrapper 0.12

1 participant