fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175) by inureyes · Pull Request #176 · lablup/all-smi

inureyes · 2026-04-16T06:25:06Z

Summary

Cross-PR review fixes addressing issues found in a unified review of PRs #172-#175 (vGPU, thermal, MIG, hardware monitoring features).

Security fixes

C1 (Critical): Cap CPU core_id in remote parser to prevent OOM from malicious upstream sending core_id="4294967295"
C2 (High): Strip control characters from ALL metric label values to prevent ANSI escape injection in TUI from compromised remote endpoints. Applied in sanitize_label_value, unescape_label_value, and MetricBuilder::metric
C3: Sanitize GPU detail map values in export_device_info (previously only keys were sanitized)

Performance fixes

D1 (High): Replace O(GV) + O(GM) per-frame linear scans with O(1) HashMap lookups for vGPU/MIG matching in frame_renderer.rs and gpu_renderer.rs
B5: Precompute GPU exporter row cache (GpuRow struct) to reduce per-scrape string allocations

API standardization

B1 (High, BREAKING): Standardize all Prometheus GPU labels from index/uuid to gpu_index/gpu_uuid across all exporters. Parser accepts both old and new names for backward compatibility

Refactoring

B7: Replace parse_metrics 6-tuple return type with named ParsedMetrics struct
A3: Fix stale doc comment in vgpu.rs (all_smi_vgpu_host_info -> all_smi_vgpu_host_mode)

Test improvements

A2: Add missing vgpu_memory_utilization round-trip assertion in vGPU integration test

Hardens #129, #130, #131, #132

Test plan

cargo fmt -- clean
cargo build --features mock -- green
cargo clippy --features mock --all-targets -- only pre-existing warnings
cargo test --features mock -- 1132 tests pass, 0 failures

A malicious upstream sending core_id="4294967295" via the Prometheus exposition format would cause the parser to grow the per_core_utilization vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES cap (1024) and reject any core_id at or above that threshold.

A compromised remote endpoint can inject ANSI escape sequences (e.g. \x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid, cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values directly, causing the terminal to execute the injected sequences. Add strip_control_chars() to parsing/common.rs and apply it inside sanitize_label_value() so ALL label values flowing through the parser are cleaned. Also apply stripping in unescape_label_value() for quoted values, and in MetricBuilder::metric() on the exporter side to defend against local NVML returning unexpected strings.

The parse_metrics return type was an unreadable and unsustainable 6-element tuple. Replace it with a named ParsedMetrics struct with clearly documented fields. Update all call sites in client.rs and all integration/unit tests.

…atching In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M) per frame, causing millions of comparisons. Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as a rare linear scan for entries not found by UUID.

…all exporters BREAKING CHANGE: The GPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid` to match the more explicit naming already used by the MIG and vGPU exporters. The remote parser accepts both old and new label names for backward compatibility with nodes running older versions. Existing Prometheus dashboards and alert rules that filter on `index=` or `uuid=` labels will need to be updated to use `gpu_index=` and `gpu_uuid=` respectively.

The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture omitted all_smi_vgpu_memory_utilization from the test data. Add the metric line to the fixture and assert that memory_utilization=45 survives the parse round-trip.

Fix the vgpu.rs module doc that incorrectly referenced all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode). Apply sanitize_label_value() to detail HashMap values in gpu.rs export_device_info to strip control characters from NVML-sourced strings before they reach the Prometheus exposition format.

Add a GpuRow struct and collect_rows function to the GPU exporter, following the same pattern used by the MIG, vGPU, and hardware exporters. The stringified gpu_index is computed once per GPU per scrape instead of being re-created 5x across the five export methods.

inureyes · 2026-04-16T06:28:52Z

Implementation Review Summary

Intent

Cross-PR review fixes for NVIDIA monitoring features (#172-#175): security hardening, performance improvements, API label standardization, and code quality refactoring.

Findings Addressed

All 10 commits in this PR were reviewed against their stated goals. No CRITICAL or HIGH findings require changes.

Remaining Items

(MEDIUM) API.md incomplete B1 migration for NPU sections: API.md still documents uuid, index as label names for NPU-specific metric tables (Tenstorrent, Rebellions, Furiosa, Gaudi, Google TPU -- lines 284-457). The corresponding NPU exporters also still emit uuid/index. This is technically consistent (exporter matches docs, parser handles it), but creates a split convention where GPU metrics use gpu_uuid/gpu_index while NPU metrics use uuid/index. Consider a follow-up to standardize NPU labels to npu_uuid/npu_index for consistency, or document the rationale for the split.
(MEDIUM) C2 (control char stripping) is lossy for embedded newlines: unescape_label_value at src/network/metrics_parser.rs:907 un-escapes \\n to a literal \n, then strip_control_chars at line 923 removes it. This means a label value containing a legitimate Prometheus-escaped newline (\\n in the text format) will silently lose the embedded newline. This is an acceptable security trade-off for a TUI application (embedded newlines in label values would corrupt terminal output), but worth documenting in a code comment.
(LOW) vgpu.rs doc comment fix introduces a duplicate: The A3 fix at src/api/metrics/vgpu.rs:26 changes all_smi_vgpu_host_info to all_smi_vgpu_host_mode, but line 24 already documents all_smi_vgpu_host_mode. The result is two consecutive bullet points both documenting all_smi_vgpu_host_mode with different descriptions ("0=NonSriov, 1=Sriov, 2=Disabled" vs "label-only info row per host GPU"). One should be a different metric name or the duplicate entry should be removed.
(LOW) Some integration tests still use old label names: Tests in tests/hardware_details_integration_test.rs at lines 137, 183-244 still use uuid= and index= (old format) in their test fixtures. This is not a bug -- it exercises the backward-compatibility parsing path -- but is inconsistent with the other test files that were migrated to gpu_uuid/gpu_index.

Verification

All stated requirements implemented
No placeholder/mock code remaining
Integrated into project code flow
Project conventions followed
Existing modules reused where applicable
No unintended structural changes
Tests pass (1132 tests, 0 failures)

Detailed Findings by Concern Area

B1 (Label Rename -- BREAKING CHANGE)
Correctly implemented across all GPU exporters (gpu.rs, hardware.rs), all 8 mock templates, API.md NVIDIA sections, and integration tests. Backward-compatible parsing of old uuid/index labels confirmed working via .or_else(|| labels.get("uuid")) fallback at src/network/metrics_parser.rs:224 and the existing test_parse_gpu_metrics test (line 1319) which uses old label names and passes.

C2 (Control Char Stripping)
Applied at three layers: (1) sanitize_label_value in src/parsing/common.rs:69, (2) unescape_label_value in src/network/metrics_parser.rs:923, (3) MetricBuilder::metric in src/api/metrics/mod.rs:97. Test coverage at src/parsing/common.rs:129-146 and src/api/metrics/mod.rs:159-167. The test_metric_builder_label_escaping_newline test at line 143 confirms that explicit \n in label values is escaped to \\n first (before control char stripping), so the exporter path preserves newlines correctly. The lossy behavior only occurs on the parser/inbound side.

D1 (HashMap Optimization)
Correctly implemented in both frame_renderer.rs (lines 264-265, 695-758) and gpu_renderer.rs (lines 75-77, 113-164). Lookup maps are built once per frame with HashMap::with_capacity. Fallback to hostname+gpu_name linear scan is correctly implemented for entries not found by UUID. Parallel implementations in gpu_renderer.rs (for layout calculation) and frame_renderer.rs (for rendering) are consistent.

B7 (ParsedMetrics Struct)
All callsites migrated: client.rs (line 428), all integration tests, and all internal parser tests. The struct definition at src/network/metrics_parser.rs:31-38 includes #[derive(Debug, Default)] which is appropriate.

C1 (CPU Core ID OOM Cap)
Cap at MAX_CPU_CORES = 1024 at src/network/metrics_parser.rs:588-589. Correctly rejects core_id >= 1024 before the vector resize at line 600.

B5 (GPU Row Cache)
GpuRow struct at src/api/metrics/gpu.rs:430-433 and collect_rows at line 412. Correctly precomputes idx.to_string() once per GPU instead of 5x per export method.

inureyes · 2026-04-16T06:29:18Z

Security & Performance Review -- PR #176

Reviewed by: claude-opus-4-6 security/perf agent
Branch: fix/nvidia-cross-pr-review (10 commits)
Test suite: 1153 tests pass (493 unit + 541 integration + rest doc), 0 failures

Verification of claimed security fixes

C1 -- CPU core_id OOM cap: VERIFIED.
MAX_CPU_CORES = 1024 constant at src/network/metrics_parser.rs:821 is checked at line 588 before the while-loop that extends per_core_utilization. A malicious core_id="4294967295" from a remote endpoint is correctly rejected before the Vec allocation. The guard uses >= comparison against usize, which is correct.

C2 -- Control character stripping: VERIFIED.
strip_control_chars at src/parsing/common.rs:54-56 uses char::is_control() which covers all Unicode control characters including ESC (\x1b), NUL (\x00), BEL (\x07), and newlines. Applied in all three required paths:

sanitize_label_value (line 69) -- unquoted label values
unescape_label_value (line 923 of metrics_parser.rs) -- quoted label values after unescape
MetricBuilder::metric (line 91-98 of api/metrics/mod.rs) -- exporter output

Note: unescape_label_value correctly processes escape sequences (e.g., \n to literal newline) before calling strip_control_chars, so the sequence \n in Prometheus exposition format becomes a literal newline and is then stripped. This is the correct order.

C3 -- GPU detail map value sanitization: VERIFIED.
src/api/metrics/gpu.rs:191 now calls sanitize_label_value(v) on detail map values. Previously only keys were sanitized via sanitize_label_name(k).

ParsedMetrics struct exposure: NO ISSUE.
ParsedMetrics (src/network/metrics_parser.rs:31-38) has pub fields but they are all Vec<T> of already-public device types. The struct is consumed only inside client.rs:422-440 via field destructuring. No internal implementation state is leaked.

HashMap collision DoS surface: NO ISSUE.
The build_vgpu_lookup and build_mig_lookup functions in frame_renderer.rs (lines 695-711) use standard HashMap with the default SipHash hasher. The keys are GPU UUIDs from parsed data, not directly attacker-controlled strings (they pass through sanitize_label_value first). For internal, bounded-length data with the default hasher, this is appropriate.

Label rename backward compatibility (index to gpu_index): VERIFIED.
Parser at src/network/metrics_parser.rs:222-231 uses .get("gpu_uuid").or_else(|| .get("uuid")) and the same pattern for gpu_index/index. This ensures nodes running older exporter versions (emitting uuid/index) are still parsed correctly. Existing unit tests at lines 1325-1331, 1357-1363 exercise the old label names, confirming the backward-compat path is tested.

Findings

LOW-1: Stale doc comment in hardware exporter uses old label names

File: /home/inureyes/Development/backend.ai/all-smi/src/api/metrics/hardware.rs, lines 37-38
Severity: LOW (documentation accuracy)

The module-level doc comment says:

//! All metrics carry the standard GPU label set (`gpu`, `instance`, `uuid`,
//! `index`)

But the actual base_labels function (lines 82-83) emits gpu_uuid and gpu_index. The doc should be updated to match the code.

LOW-2: Hardware exporter test uses old label names (weak assertion)

File: /home/inureyes/Development/backend.ai/all-smi/src/api/metrics/hardware.rs, lines 452-456
Severity: LOW (test quality)

The preserves_standard_gpu_labels_on_all_metrics test asserts:

r#"uuid="GPU-ABC""#,
r#"index="0""#,

These assertions pass only because uuid= is a substring of gpu_uuid= and index= is a substring of gpu_index=. The test would also pass with the old label names, making it unable to detect a regression to the old naming. Should assert the full label name:

r#"gpu_uuid="GPU-ABC""#,
r#"gpu_index="0""#,

LOW-3: Dead function `gpu_render_line_count` produces clippy warning

File: /home/inureyes/Development/backend.ai/all-smi/src/ui/renderers/gpu_renderer.rs, line 68
Severity: LOW (code quality)

gpu_render_line_count is a convenience wrapper superseded by gpu_render_line_count_with_lookup. It is only called from tests (15 call sites in the test module). Adding #[cfg(test)] would silence the warning and clarify intent.

Summary

All three claimed security fixes (C1, C2, C3) are correctly implemented and effective. The performance improvements (HashMap lookups, GpuRow cache) are sound. The label rename maintains backward compatibility. No CRITICAL or HIGH severity findings. Three LOW findings identified (stale doc, weak test assertion, dead code warning).

- Update module doc comment to use `gpu_uuid`/`gpu_index` matching the actual `base_labels` implementation - Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix so substring matching cannot mask a regression to old label names - Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since it was superseded by `gpu_render_line_count_with_lookup` for production use and is now only called from tests

inureyes · 2026-04-16T06:33:53Z

PR Finalization Report

Lint & Format

cargo fmt --all: no changes needed (already clean)
cargo clippy --features mock --all-targets: 6 warnings found, all in pre-existing files not touched by this PR (src/device/readers/amd.rs, src/device/readers/tpu_grpc.rs) — no action taken per scope restriction

Tests

cargo test --features mock: all passes (493 + 541 + 43 + misc = 1132 tests, 0 failures)

Documentation

API.md: already fully updated with gpu_index/gpu_uuid labels across all NVIDIA-specific metric tables (NVIDIA GPU Specific, Hardware Details, vGPU, MIG sections) — no changes needed
README.md: added **BREAKING** note to the v0.21.0 changelog entry documenting the index→gpu_index and uuid→gpu_uuid label rename with the backward-compatibility note

Commits Added

docs: note breaking label rename in v0.21.0 changelog entry

All checks passing. Ready for merge.

Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases. This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`). The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels. The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable. Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order. API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename. BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility.

* feat(api): standardize NPU Prometheus labels to npu_index/npu_uuid Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases. This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`). The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels. The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable. Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order. API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename. BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility. * fix: resolve pre-existing clippy warnings in amd and tpu_grpc readers - Use `const {}` block for compile-time assertions in amd.rs version component validation test (clippy::assertions_on_constants) - Use range `.contains()` instead of manual bounds checks in amd.rs (clippy::manual_range_contains) - Use inline format argument in tpu_grpc.rs debug println (clippy::uninlined_format_args)

inureyes added 10 commits April 16, 2026 15:08

chore: apply cargo fmt formatting

3ddd01a

chore: fix clippy needless_lifetimes warnings

9914953

inureyes added 2 commits April 16, 2026 15:30

docs: note breaking label rename in v0.21.0 changelog entry

2e5677a

inureyes merged commit 48ba01c into main Apr 16, 2026
1 check passed

inureyes deleted the fix/nvidia-cross-pr-review branch April 16, 2026 06:34

inureyes mentioned this pull request Apr 17, 2026

enhancement: Standardize NPU Prometheus label names to npu_index/npu_uuid #177

Closed

inureyes mentioned this pull request Apr 17, 2026

feat: standardize NPU Prometheus labels to npu_index/npu_uuid #181

Merged

5 tasks

inureyes self-assigned this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)#176

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)#176
inureyes merged 12 commits into
mainfrom
fix/nvidia-cross-pr-review

inureyes commented Apr 16, 2026

Uh oh!

inureyes commented Apr 16, 2026

Uh oh!

inureyes commented Apr 16, 2026

Uh oh!

inureyes commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Apr 16, 2026

Summary

Security fixes

Performance fixes

API standardization

Refactoring

Test improvements

Test plan

Uh oh!

inureyes commented Apr 16, 2026

Implementation Review Summary

Intent

Findings Addressed

Remaining Items

Verification

Detailed Findings by Concern Area

Uh oh!

inureyes commented Apr 16, 2026

Security & Performance Review -- PR #176

Verification of claimed security fixes

Findings

LOW-1: Stale doc comment in hardware exporter uses old label names

LOW-2: Hardware exporter test uses old label names (weak assertion)

LOW-3: Dead function gpu_render_line_count produces clippy warning

Summary

Uh oh!

inureyes commented Apr 16, 2026

PR Finalization Report

Lint & Format

Tests

Documentation

Commits Added

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LOW-3: Dead function `gpu_render_line_count` produces clippy warning