feat: add NVIDIA vGPU monitoring via nvml-wrapper 0.12#172
Merged
Conversation
Introduce VgpuInfo and VgpuHostInfo types alongside a new GpuReader::get_vgpu_info trait method that defaults to an empty Vec so non-NVIDIA readers remain silent. The NVIDIA reader probes vgpu_host_mode per device and, when the host is vGPU-capable, collects scheduler state/capabilities, active vGPU instances, and per-instance accounting stats through vgpu_accounting_pids/vgpu_accounting_instance plus the raw FFI string queries (UUID, VmID, FbUsage, Type/FramebufferSize). All NVML calls are wrapped so any error (including the common NotSupported on bare-metal) degrades to an empty vector rather than a panic or log line. The local collector pulls vgpu data on every cycle and writes it into AppState; the remote collector reserves the field for parser-side wiring. Also expose AllSmi::get_vgpu_info and the types via the library prelude. Refs #129
When the active snapshot carries a VgpuHostInfo matching a GPU (by UUID, with a hostname + gpu_name fallback for remote metrics missing a UUID), the frame renderer emits a compact vGPU block directly under the parent GPU line. The block reports host mode, scheduler policy, ARR state, instance count, and a per-instance row showing type name, utilization, memory usage, VM id, and active/idle indicator. Bare-metal GPUs have no matching record so the section is entirely invisible — no empty placeholder, no header. Refs #129
Add a VgpuMetricExporter that emits one metric family per vGPU concern: all_smi_vgpu_host_mode, _scheduler_state, _scheduler_policy at the host level, and _utilization, _memory_utilization, _memory_used_bytes, _memory_total_bytes, _active per instance. Every line carries gpu_index / gpu_uuid / gpu / instance / host labels; per- instance lines additionally carry vgpu_id / vgpu_uuid / vgpu_type. The exporter writes nothing when the vgpu_info slice is empty, so non-vGPU hosts stay completely silent on the /metrics endpoint. Teach the remote metrics parser to recognise the same families and reconstruct VgpuHostInfo records, keyed by (gpu_uuid, vgpu_id) for instances and by gpu_uuid for host-scope metrics. The resulting Vec<VgpuHostInfo> now flows through the network client and into the AppState via the remote collector, completing the producer-exporter-consumer round trip. Refs #129
Add a vGPU template module that, when ALL_SMI_MOCK_VGPU is set in the mock server's environment, appends the full vGPU metric family collection to NVIDIA node responses. Each mock GPU gets four synthetic vGPU instances whose framebuffer totals sum to the parent GPU memory and whose utilization / memory / active state varies per render cycle. The env var gate preserves the default bare-metal mock behaviour, so existing test suites are unaffected unless they explicitly opt in. Add a cross-crate integration test that renders exporter-shaped metrics text, parses it, and asserts every field round-trips faithfully (including the 'no data line for missing utilization' case). Refs #129
The high-level Device::active_vgpus helper in nvml-wrapper 0.12.1 is gated on #[cfg(target_os = "linux")], so calling it unconditionally from nvidia_vgpu.rs (which is compiled on every platform) fails to build on Windows where reader_factory still instantiates NvidiaGpuReader. Switch to the raw nvmlDeviceGetActiveVgpus symbol via the existing libloading-based pattern used for the other vGPU queries in this module. Behaviour is unchanged on Linux; bare-metal and non-vGPU hosts still degrade to an empty vector. Also drop the dead parse_host_mode_code helper: it claimed to be called from api/metrics/vgpu.rs but is not, and its u32::MAX fallback for "Disabled" disagreed with the exporter's 2. The exporter's host_mode_code is now the single source of truth, eliminating a future drift hazard.
The NVIDIA reader already collects vgpu_vm_id and the TUI renders it as the "vm=" column, but the Prometheus exporter never emitted the field, so a remote all-smi scraping another all-smi reconstructed vm_id as an empty string. This broke the PR's claimed "remote surfaces the exact same view" property. Emit vgpu_vm_id as an extra instance label and read it back on the parser side. Update the mock template to include the label so the mock server mirrors the new contract, and extend the vgpu_integration_test round-trip to assert the field survives.
Member
Author
Implementation Review SummaryIntentAdd first-class NVIDIA vGPU visibility (detection + TUI + Prometheus export + remote round-trip + mock) that is a silent no-op on bare-metal/non-NVIDIA hosts. Findings Addressed (auto-fixed)
Remaining Items (not blocking)
Verification
The PR is now in a reviewable-done state. |
sanitize_label_value panicked whenever the byte at MAX_LABEL_VALUE_LENGTH landed inside a multi-byte UTF-8 codepoint. A remote exporter emitting an overlong non-ASCII label could therefore crash the parser worker. Walk back to the nearest char boundary before slicing.
MetricBuilder::metric already escapes backslash, double-quote, and newline per the Prometheus exposition spec, but a stray carriage return in a label value (e.g. Windows-origin firmware strings) would pass through literally. Downstream scrapers treat \r as a line terminator on Windows, so an unescaped CR could split one metric into two and let the tail be reinterpreted as attacker-controlled content. Escape it to \\r to match the existing \\n handling.
active_vgpus_ffi previously trusted the count NVML wrote during the probe phase and passed it straight to vec![0; count as usize]. A misbehaving or hostile driver returning u32::MAX would drive an instant ~16 GiB allocation. Clamp to MAX_VGPU_INSTANCES_PER_DEVICE (256, well above any real deployment) and bail out to an empty vec when exceeded. Introduce a distinct count_in for the fetch call so the buffer capacity stays explicit and defensively clamp the post-call truncate length to the allocated buffer size.
export_instance_metrics used to call iter_instances five times per scrape (once per metric family). Each call rebuilt every gpu_index and instance_id string, producing ~5 * 2 * N allocations per scrape for N vGPUs. Collect the flattened (host, vgpu, gpu_index_str, instance_id_str) tuples into a Vec<Row> once and iterate over borrowed slices for each family. Cuts per-scrape allocation count by 4x and keeps the exposition output byte-for-byte identical.
Every other metric family in parse_metrics was already guarded by MAX_DEVICES_PER_TYPE=256, but the vGPU path had no bound. A malicious remote exporter could advertise unbounded gpu_uuid / (gpu_uuid, vgpu_id) values and drive VgpuParseState.hosts and .instances to exhaust memory. Introduce MAX_VGPU_HOSTS=256 and MAX_VGPU_INSTANCES=4096 (~16 vGPUs * 256 hosts) and gate only the new-key insertion path so scrapes that legitimately update existing entries are unaffected. Add two regression tests that push past each cap and assert the extras are silently dropped without panicking.
parse_labels naively split on every comma, which a VM-owner-controlled label value containing `",` could use to break out and inject fake labels (e.g. forging a `host=` key to misattribute metrics to another host). Combined with H-1, this unlocked cross-host metric spoofing. Introduce split_labels_respecting_quotes to honour double-quoted values and backslash escapes, and unescape_label_value to reverse the `\\`, `\"`, `\n`, and `\r` sequences produced by MetricBuilder::metric. Add regression tests covering comma-injection and the newline/carriage-return round-trip.
Document all eight vGPU Prometheus metric families (host-level host_mode, scheduler_state, scheduler_policy; per-instance utilization, memory_utilization, memory_used_bytes, memory_total_bytes, active) with complete label sets and descriptions. Add PromQL query examples for vGPU monitoring, note on ALL_SMI_MOCK_VGPU env var, note 12 in the API Notes section, get_vgpu_info() in the library API table, and a changelog entry for the upcoming release in README.
Member
Author
PR Finalization CompleteLint / Format
Tests
Documentation (1 commit:
|
This was referenced Apr 15, 2026
inureyes
added a commit
that referenced
this pull request
Apr 16, 2026
…#176) * fix(security): cap CPU core_id in remote parser to prevent OOM A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold. * fix(security): strip control characters from all metric label values A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings. * refactor: replace parse_metrics 6-tuple with ParsedMetrics struct The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests. * perf: replace O(N^2) linear scans with HashMap lookups for vGPU/MIG matching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID. * fix(api): standardize Prometheus labels to gpu_index/gpu_uuid across all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively. * test: add missing vGPU memory_utilization round-trip assertion The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip. * fix: correct vgpu doc comment and sanitize GPU detail map values Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format. * perf: precompute GPU exporter row cache to reduce per-scrape allocations Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods. * chore: apply cargo fmt formatting * chore: fix clippy needless_lifetimes warnings * fix: update stale label names in hardware exporter doc and tests - Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests * docs: note breaking label rename in v0.21.0 changelog entry
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds first-class NVIDIA vGPU visibility to all-smi across every surface (local NVML reader, TUI view, Prometheus
/metricsendpoint, remote metrics parser, and mock server), while remaining a complete no-op on bare-metal and non-NVIDIA hosts.VgpuInfo/VgpuHostInfotypes plus aGpuReader::get_vgpu_infotrait method with an empty default so only the NVIDIA reader opts in.vgpu_host_modeper device; on success it collects scheduler state/capabilities, active vGPU instances, and per-instance accounting stats (UUID, VM id, framebuffer, utilization, memory utilization, active flag).all_smi_vgpu_host_mode,_scheduler_state,_scheduler_policy,_utilization,_memory_utilization,_memory_used_bytes,_memory_total_bytes, and_activewith the labels requested in the issue (gpu_index,gpu_uuid,vgpu_id,host, …). Stays silent on hosts with no vGPU data.VgpuHostInforecords, so an all-smi instance scraping another all-smi node surfaces the exact same view.ALL_SMI_MOCK_VGPUis set, unlocking end-to-end exercising without real vGPU hardware.NotSupported/FunctionNotFound/Uninitializeddegrade to empty vectors — no panics, no error logs, no empty UI sections on bare-metal.Closes #129
Test plan
cargo build/cargo build --features mockcargo test --features mock(all 8 test binaries green, 360+ library tests + new integration tests)tests/vgpu_integration_test.rsround-trips exporter text through the remote parser and asserts every field survives.VgpuMetricExporter,VgpuParseState, the TUI renderer (ANSI-aware), the mock template, and the NVIDIA reader host-mode mapping.cargo clippy --features mock --all-targetsintroduces no new warnings (pre-existingamd.rs/tpu_grpc.rswarnings are untouched).cargo fmt --checkclean.Notes
nvml-wrapperwas already at0.12.1; no dependency bump required.AllSmi::get_vgpu_info()returns an empty vector — confirmed by the library integration test that runs on every platform.