fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)#176
Conversation
A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold.
A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings.
The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests.
…atching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID.
…all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively.
The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip.
Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format.
Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods.
Implementation Review SummaryIntent
Findings AddressedAll 10 commits in this PR were reviewed against their stated goals. No CRITICAL or HIGH findings require changes. Remaining Items
Verification
Detailed Findings by Concern AreaB1 (Label Rename -- BREAKING CHANGE) C2 (Control Char Stripping) D1 (HashMap Optimization) B7 (ParsedMetrics Struct) C1 (CPU Core ID OOM Cap) B5 (GPU Row Cache) |
Security & Performance Review -- PR #176Reviewed by: claude-opus-4-6 security/perf agent Verification of claimed security fixesC1 -- CPU core_id OOM cap: VERIFIED. C2 -- Control character stripping: VERIFIED.
Note: C3 -- GPU detail map value sanitization: VERIFIED. ParsedMetrics struct exposure: NO ISSUE. HashMap collision DoS surface: NO ISSUE. Label rename backward compatibility (index to gpu_index): VERIFIED. FindingsLOW-1: Stale doc comment in hardware exporter uses old label namesFile: The module-level doc comment says: But the actual LOW-2: Hardware exporter test uses old label names (weak assertion)File: The r#"uuid="GPU-ABC""#,
r#"index="0""#,These assertions pass only because r#"gpu_uuid="GPU-ABC""#,
r#"gpu_index="0""#,LOW-3: Dead function
|
- Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests
PR Finalization ReportLint & Format
Tests
Documentation
Commits Added
All checks passing. Ready for merge. |
Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases. This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`). The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels. The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable. Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order. API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename. BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility.
* feat(api): standardize NPU Prometheus labels to npu_index/npu_uuid Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases. This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`). The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels. The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable. Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order. API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename. BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility. * fix: resolve pre-existing clippy warnings in amd and tpu_grpc readers - Use `const {}` block for compile-time assertions in amd.rs version component validation test (clippy::assertions_on_constants) - Use range `.contains()` instead of manual bounds checks in amd.rs (clippy::manual_range_contains) - Use inline format argument in tpu_grpc.rs debug println (clippy::uninlined_format_args)
Summary
Cross-PR review fixes addressing issues found in a unified review of PRs #172-#175 (vGPU, thermal, MIG, hardware monitoring features).
Security fixes
core_idin remote parser to prevent OOM from malicious upstream sendingcore_id="4294967295"sanitize_label_value,unescape_label_value, andMetricBuilder::metricexport_device_info(previously only keys were sanitized)Performance fixes
frame_renderer.rsandgpu_renderer.rsGpuRowstruct) to reduce per-scrape string allocationsAPI standardization
index/uuidtogpu_index/gpu_uuidacross all exporters. Parser accepts both old and new names for backward compatibilityRefactoring
parse_metrics6-tuple return type with namedParsedMetricsstructvgpu.rs(all_smi_vgpu_host_info->all_smi_vgpu_host_mode)Test improvements
vgpu_memory_utilizationround-trip assertion in vGPU integration testHardens #129, #130, #131, #132
Test plan
cargo fmt-- cleancargo build --features mock-- greencargo clippy --features mock --all-targets-- only pre-existing warningscargo test --features mock-- 1132 tests pass, 0 failures