Skip to content

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)#176

Merged
inureyes merged 12 commits into
mainfrom
fix/nvidia-cross-pr-review
Apr 16, 2026
Merged

fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)#176
inureyes merged 12 commits into
mainfrom
fix/nvidia-cross-pr-review

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Cross-PR review fixes addressing issues found in a unified review of PRs #172-#175 (vGPU, thermal, MIG, hardware monitoring features).

Security fixes

  • C1 (Critical): Cap CPU core_id in remote parser to prevent OOM from malicious upstream sending core_id="4294967295"
  • C2 (High): Strip control characters from ALL metric label values to prevent ANSI escape injection in TUI from compromised remote endpoints. Applied in sanitize_label_value, unescape_label_value, and MetricBuilder::metric
  • C3: Sanitize GPU detail map values in export_device_info (previously only keys were sanitized)

Performance fixes

  • D1 (High): Replace O(GV) + O(GM) per-frame linear scans with O(1) HashMap lookups for vGPU/MIG matching in frame_renderer.rs and gpu_renderer.rs
  • B5: Precompute GPU exporter row cache (GpuRow struct) to reduce per-scrape string allocations

API standardization

  • B1 (High, BREAKING): Standardize all Prometheus GPU labels from index/uuid to gpu_index/gpu_uuid across all exporters. Parser accepts both old and new names for backward compatibility

Refactoring

  • B7: Replace parse_metrics 6-tuple return type with named ParsedMetrics struct
  • A3: Fix stale doc comment in vgpu.rs (all_smi_vgpu_host_info -> all_smi_vgpu_host_mode)

Test improvements

  • A2: Add missing vgpu_memory_utilization round-trip assertion in vGPU integration test

Hardens #129, #130, #131, #132

Test plan

  • cargo fmt -- clean
  • cargo build --features mock -- green
  • cargo clippy --features mock --all-targets -- only pre-existing warnings
  • cargo test --features mock -- 1132 tests pass, 0 failures

inureyes added 10 commits April 16, 2026 15:08
A malicious upstream sending core_id="4294967295" via the Prometheus
exposition format would cause the parser to grow the per_core_utilization
vector to ~4 billion entries, exhausting memory. Add a MAX_CPU_CORES
cap (1024) and reject any core_id at or above that threshold.
A compromised remote endpoint can inject ANSI escape sequences (e.g.
\x1b[2J clear-screen) into any string-valued label (gpu_name, gpu_uuid,
cpu_model, vgpu_vm_id, mig_profile, etc.). The TUI renders these values
directly, causing the terminal to execute the injected sequences.

Add strip_control_chars() to parsing/common.rs and apply it inside
sanitize_label_value() so ALL label values flowing through the parser
are cleaned. Also apply stripping in unescape_label_value() for quoted
values, and in MetricBuilder::metric() on the exporter side to defend
against local NVML returning unexpected strings.
The parse_metrics return type was an unreadable and unsustainable
6-element tuple. Replace it with a named ParsedMetrics struct with
clearly documented fields. Update all call sites in client.rs and
all integration/unit tests.
…atching

In remote/cluster mode with 800+ GPUs, the per-GPU linear scans in
find_matching_vgpu_host and find_matching_mig_gpu were O(G*V) + O(G*M)
per frame, causing millions of comparisons.

Build gpu_uuid -> index HashMap lookups once per frame in O(V+M) and
use O(1) lookups for each GPU. Keep the hostname+gpu_name fallback as
a rare linear scan for entries not found by UUID.
…all exporters

BREAKING CHANGE: The GPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `gpu_index` and `gpu_uuid`
to match the more explicit naming already used by the MIG and vGPU
exporters. The remote parser accepts both old and new label names for
backward compatibility with nodes running older versions.

Existing Prometheus dashboards and alert rules that filter on
`index=` or `uuid=` labels will need to be updated to use
`gpu_index=` and `gpu_uuid=` respectively.
The vgpu_metrics_parser_roundtrip_preserves_all_fields test fixture
omitted all_smi_vgpu_memory_utilization from the test data. Add the
metric line to the fixture and assert that memory_utilization=45
survives the parse round-trip.
Fix the vgpu.rs module doc that incorrectly referenced
all_smi_vgpu_host_info (should be all_smi_vgpu_host_mode).

Apply sanitize_label_value() to detail HashMap values in gpu.rs
export_device_info to strip control characters from NVML-sourced
strings before they reach the Prometheus exposition format.
Add a GpuRow struct and collect_rows function to the GPU exporter,
following the same pattern used by the MIG, vGPU, and hardware
exporters. The stringified gpu_index is computed once per GPU per
scrape instead of being re-created 5x across the five export methods.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Cross-PR review fixes for NVIDIA monitoring features (#172-#175): security hardening, performance improvements, API label standardization, and code quality refactoring.

Findings Addressed

All 10 commits in this PR were reviewed against their stated goals. No CRITICAL or HIGH findings require changes.

Remaining Items

  • (MEDIUM) API.md incomplete B1 migration for NPU sections: API.md still documents uuid, index as label names for NPU-specific metric tables (Tenstorrent, Rebellions, Furiosa, Gaudi, Google TPU -- lines 284-457). The corresponding NPU exporters also still emit uuid/index. This is technically consistent (exporter matches docs, parser handles it), but creates a split convention where GPU metrics use gpu_uuid/gpu_index while NPU metrics use uuid/index. Consider a follow-up to standardize NPU labels to npu_uuid/npu_index for consistency, or document the rationale for the split.

  • (MEDIUM) C2 (control char stripping) is lossy for embedded newlines: unescape_label_value at src/network/metrics_parser.rs:907 un-escapes \\n to a literal \n, then strip_control_chars at line 923 removes it. This means a label value containing a legitimate Prometheus-escaped newline (\\n in the text format) will silently lose the embedded newline. This is an acceptable security trade-off for a TUI application (embedded newlines in label values would corrupt terminal output), but worth documenting in a code comment.

  • (LOW) vgpu.rs doc comment fix introduces a duplicate: The A3 fix at src/api/metrics/vgpu.rs:26 changes all_smi_vgpu_host_info to all_smi_vgpu_host_mode, but line 24 already documents all_smi_vgpu_host_mode. The result is two consecutive bullet points both documenting all_smi_vgpu_host_mode with different descriptions ("0=NonSriov, 1=Sriov, 2=Disabled" vs "label-only info row per host GPU"). One should be a different metric name or the duplicate entry should be removed.

  • (LOW) Some integration tests still use old label names: Tests in tests/hardware_details_integration_test.rs at lines 137, 183-244 still use uuid= and index= (old format) in their test fixtures. This is not a bug -- it exercises the backward-compatibility parsing path -- but is inconsistent with the other test files that were migrated to gpu_uuid/gpu_index.

Verification

  • All stated requirements implemented
  • No placeholder/mock code remaining
  • Integrated into project code flow
  • Project conventions followed
  • Existing modules reused where applicable
  • No unintended structural changes
  • Tests pass (1132 tests, 0 failures)

Detailed Findings by Concern Area

B1 (Label Rename -- BREAKING CHANGE)
Correctly implemented across all GPU exporters (gpu.rs, hardware.rs), all 8 mock templates, API.md NVIDIA sections, and integration tests. Backward-compatible parsing of old uuid/index labels confirmed working via .or_else(|| labels.get("uuid")) fallback at src/network/metrics_parser.rs:224 and the existing test_parse_gpu_metrics test (line 1319) which uses old label names and passes.

C2 (Control Char Stripping)
Applied at three layers: (1) sanitize_label_value in src/parsing/common.rs:69, (2) unescape_label_value in src/network/metrics_parser.rs:923, (3) MetricBuilder::metric in src/api/metrics/mod.rs:97. Test coverage at src/parsing/common.rs:129-146 and src/api/metrics/mod.rs:159-167. The test_metric_builder_label_escaping_newline test at line 143 confirms that explicit \n in label values is escaped to \\n first (before control char stripping), so the exporter path preserves newlines correctly. The lossy behavior only occurs on the parser/inbound side.

D1 (HashMap Optimization)
Correctly implemented in both frame_renderer.rs (lines 264-265, 695-758) and gpu_renderer.rs (lines 75-77, 113-164). Lookup maps are built once per frame with HashMap::with_capacity. Fallback to hostname+gpu_name linear scan is correctly implemented for entries not found by UUID. Parallel implementations in gpu_renderer.rs (for layout calculation) and frame_renderer.rs (for rendering) are consistent.

B7 (ParsedMetrics Struct)
All callsites migrated: client.rs (line 428), all integration tests, and all internal parser tests. The struct definition at src/network/metrics_parser.rs:31-38 includes #[derive(Debug, Default)] which is appropriate.

C1 (CPU Core ID OOM Cap)
Cap at MAX_CPU_CORES = 1024 at src/network/metrics_parser.rs:588-589. Correctly rejects core_id >= 1024 before the vector resize at line 600.

B5 (GPU Row Cache)
GpuRow struct at src/api/metrics/gpu.rs:430-433 and collect_rows at line 412. Correctly precomputes idx.to_string() once per GPU instead of 5x per export method.

@inureyes

Copy link
Copy Markdown
Member Author

Security & Performance Review -- PR #176

Reviewed by: claude-opus-4-6 security/perf agent
Branch: fix/nvidia-cross-pr-review (10 commits)
Test suite: 1153 tests pass (493 unit + 541 integration + rest doc), 0 failures


Verification of claimed security fixes

C1 -- CPU core_id OOM cap: VERIFIED.
MAX_CPU_CORES = 1024 constant at src/network/metrics_parser.rs:821 is checked at line 588 before the while-loop that extends per_core_utilization. A malicious core_id="4294967295" from a remote endpoint is correctly rejected before the Vec allocation. The guard uses >= comparison against usize, which is correct.

C2 -- Control character stripping: VERIFIED.
strip_control_chars at src/parsing/common.rs:54-56 uses char::is_control() which covers all Unicode control characters including ESC (\x1b), NUL (\x00), BEL (\x07), and newlines. Applied in all three required paths:

  • sanitize_label_value (line 69) -- unquoted label values
  • unescape_label_value (line 923 of metrics_parser.rs) -- quoted label values after unescape
  • MetricBuilder::metric (line 91-98 of api/metrics/mod.rs) -- exporter output

Note: unescape_label_value correctly processes escape sequences (e.g., \n to literal newline) before calling strip_control_chars, so the sequence \n in Prometheus exposition format becomes a literal newline and is then stripped. This is the correct order.

C3 -- GPU detail map value sanitization: VERIFIED.
src/api/metrics/gpu.rs:191 now calls sanitize_label_value(v) on detail map values. Previously only keys were sanitized via sanitize_label_name(k).

ParsedMetrics struct exposure: NO ISSUE.
ParsedMetrics (src/network/metrics_parser.rs:31-38) has pub fields but they are all Vec<T> of already-public device types. The struct is consumed only inside client.rs:422-440 via field destructuring. No internal implementation state is leaked.

HashMap collision DoS surface: NO ISSUE.
The build_vgpu_lookup and build_mig_lookup functions in frame_renderer.rs (lines 695-711) use standard HashMap with the default SipHash hasher. The keys are GPU UUIDs from parsed data, not directly attacker-controlled strings (they pass through sanitize_label_value first). For internal, bounded-length data with the default hasher, this is appropriate.

Label rename backward compatibility (index to gpu_index): VERIFIED.
Parser at src/network/metrics_parser.rs:222-231 uses .get("gpu_uuid").or_else(|| .get("uuid")) and the same pattern for gpu_index/index. This ensures nodes running older exporter versions (emitting uuid/index) are still parsed correctly. Existing unit tests at lines 1325-1331, 1357-1363 exercise the old label names, confirming the backward-compat path is tested.


Findings

LOW-1: Stale doc comment in hardware exporter uses old label names

File: /home/inureyes/Development/backend.ai/all-smi/src/api/metrics/hardware.rs, lines 37-38
Severity: LOW (documentation accuracy)

The module-level doc comment says:

//! All metrics carry the standard GPU label set (`gpu`, `instance`, `uuid`,
//! `index`)

But the actual base_labels function (lines 82-83) emits gpu_uuid and gpu_index. The doc should be updated to match the code.

LOW-2: Hardware exporter test uses old label names (weak assertion)

File: /home/inureyes/Development/backend.ai/all-smi/src/api/metrics/hardware.rs, lines 452-456
Severity: LOW (test quality)

The preserves_standard_gpu_labels_on_all_metrics test asserts:

r#"uuid="GPU-ABC""#,
r#"index="0""#,

These assertions pass only because uuid= is a substring of gpu_uuid= and index= is a substring of gpu_index=. The test would also pass with the old label names, making it unable to detect a regression to the old naming. Should assert the full label name:

r#"gpu_uuid="GPU-ABC""#,
r#"gpu_index="0""#,

LOW-3: Dead function gpu_render_line_count produces clippy warning

File: /home/inureyes/Development/backend.ai/all-smi/src/ui/renderers/gpu_renderer.rs, line 68
Severity: LOW (code quality)

gpu_render_line_count is a convenience wrapper superseded by gpu_render_line_count_with_lookup. It is only called from tests (15 call sites in the test module). Adding #[cfg(test)] would silence the warning and clarify intent.


Summary

All three claimed security fixes (C1, C2, C3) are correctly implemented and effective. The performance improvements (HashMap lookups, GpuRow cache) are sound. The label rename maintains backward compatibility. No CRITICAL or HIGH severity findings. Three LOW findings identified (stale doc, weak test assertion, dead code warning).

- Update module doc comment to use `gpu_uuid`/`gpu_index` matching the
  actual `base_labels` implementation
- Strengthen test assertion to use full `gpu_uuid=`/`gpu_index=` prefix
  so substring matching cannot mask a regression to old label names
- Gate unused `gpu_render_line_count` wrapper with `#[cfg(test)]` since
  it was superseded by `gpu_render_line_count_with_lookup` for production
  use and is now only called from tests
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Report

Lint & Format

  • cargo fmt --all: no changes needed (already clean)
  • cargo clippy --features mock --all-targets: 6 warnings found, all in pre-existing files not touched by this PR (src/device/readers/amd.rs, src/device/readers/tpu_grpc.rs) — no action taken per scope restriction

Tests

  • cargo test --features mock: all passes (493 + 541 + 43 + misc = 1132 tests, 0 failures)

Documentation

  • API.md: already fully updated with gpu_index/gpu_uuid labels across all NVIDIA-specific metric tables (NVIDIA GPU Specific, Hardware Details, vGPU, MIG sections) — no changes needed
  • README.md: added **BREAKING** note to the v0.21.0 changelog entry documenting the indexgpu_index and uuidgpu_uuid label rename with the backward-compatibility note

Commits Added

  • docs: note breaking label rename in v0.21.0 changelog entry

All checks passing. Ready for merge.

@inureyes inureyes merged commit 48ba01c into main Apr 16, 2026
1 check passed
@inureyes inureyes deleted the fix/nvidia-cross-pr-review branch April 16, 2026 06:34
inureyes added a commit that referenced this pull request Apr 17, 2026
Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid`
across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa,
Intel Gaudi, Google TPU) and their vendor-specific metric families.
Align the NPU mock templates with the new labeling, and extend the
remote parser to accept both the new and legacy label aliases.

This completes the label standardization started in PR #176 for
NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`).
The GPU-family metrics that are shared across GPU and NPU devices
(e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry
`gpu_index`/`gpu_uuid`, matching the real exporter in
`src/api/metrics/gpu.rs`; only the NPU-specific metric families
(`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`,
`all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry
the NPU-prefixed labels.

The remote parser now falls back through `gpu_uuid` → `npu_uuid` →
`uuid` and `gpu_index` → `npu_index` → `index` (with a matching
`gpu` → `npu` fallback for the device name label), so nodes running
pre-v0.21.0 or mixed versions remain parseable. Two new round-trip
tests in `src/network/metrics_parser.rs` lock in the new aliasing
and the precedence order.

API.md and the v0.21.0 changelog entry in README.md are updated to
reflect the rename.

BREAKING CHANGE: The NPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `npu_index` and
`npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing
Prometheus dashboards and alert rules that filter NPU metrics by
`index=` or `uuid=` labels will need to be updated to use
`npu_index=` and `npu_uuid=` respectively. The remote parser
accepts both old and new label names for backward compatibility.
inureyes added a commit that referenced this pull request Apr 17, 2026
* feat(api): standardize NPU Prometheus labels to npu_index/npu_uuid

Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid`
across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa,
Intel Gaudi, Google TPU) and their vendor-specific metric families.
Align the NPU mock templates with the new labeling, and extend the
remote parser to accept both the new and legacy label aliases.

This completes the label standardization started in PR #176 for
NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`).
The GPU-family metrics that are shared across GPU and NPU devices
(e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry
`gpu_index`/`gpu_uuid`, matching the real exporter in
`src/api/metrics/gpu.rs`; only the NPU-specific metric families
(`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`,
`all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry
the NPU-prefixed labels.

The remote parser now falls back through `gpu_uuid` → `npu_uuid` →
`uuid` and `gpu_index` → `npu_index` → `index` (with a matching
`gpu` → `npu` fallback for the device name label), so nodes running
pre-v0.21.0 or mixed versions remain parseable. Two new round-trip
tests in `src/network/metrics_parser.rs` lock in the new aliasing
and the precedence order.

API.md and the v0.21.0 changelog entry in README.md are updated to
reflect the rename.

BREAKING CHANGE: The NPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `npu_index` and
`npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing
Prometheus dashboards and alert rules that filter NPU metrics by
`index=` or `uuid=` labels will need to be updated to use
`npu_index=` and `npu_uuid=` respectively. The remote parser
accepts both old and new label names for backward compatibility.

* fix: resolve pre-existing clippy warnings in amd and tpu_grpc readers

- Use `const {}` block for compile-time assertions in amd.rs version
  component validation test (clippy::assertions_on_constants)
- Use range `.contains()` instead of manual bounds checks in amd.rs
  (clippy::manual_range_contains)
- Use inline format argument in tpu_grpc.rs debug println
  (clippy::uninlined_format_args)
@inureyes inureyes self-assigned this Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant