feat: standardize NPU Prometheus labels to npu_index/npu_uuid by inureyes · Pull Request #181 · lablup/all-smi

inureyes · 2026-04-17T06:22:36Z

Summary

Completes the label standardization started in PR #176 for NVIDIA GPUs. All non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) now emit npu_index/npu_uuid in place of the generic index/uuid base labels.

Closes #177.

- Rename uuid/index → npu_uuid/npu_index in all NPU exporters under src/api/metrics/npu/ and in the NPU-specific mock templates under src/mock/templates/.
- Leave the GPU-family base labels (gpu_uuid/gpu_index) on metrics that are shared across GPU and NPU devices (all_smi_gpu_*, all_smi_ane_*), since those are emitted by the unified exporter in src/api/metrics/gpu.rs.
- Extend the remote parser to fall back through gpu_uuid → npu_uuid → uuid and gpu_index → npu_index → index (with a matching gpu → npu device-name fallback), so mixed-version clusters continue to parse. Two new round-trip tests lock in the aliasing and precedence order.
- Update the common::add_basic_gpu_metrics mock helper to emit gpu_uuid/gpu_index (closing a gap left by PR #176), so the mock stream now matches the real exporter for all_smi_gpu_* lines.
- Refresh API.md rows for all affected NPU metric families and extend the v0.21.0 entry in README.md to note the NPU rename.
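
The parser fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/network/metrics_parser.rs: the helper names `first_label`, `resolve_uuid`, and `resolve_index` are hypothetical, and the sketch assumes labels have already been parsed into a `HashMap`.

```rust
use std::collections::HashMap;

/// Return the value of the first key in `keys` that is present in `labels`.
/// Earlier keys take precedence over later ones.
fn first_label<'a>(labels: &'a HashMap<String, String>, keys: &[&str]) -> Option<&'a str> {
    keys.iter()
        .find_map(|k| labels.get(*k).map(String::as_str))
}

/// Hypothetical helper: resolve the device UUID using the precedence
/// gpu_uuid -> npu_uuid -> uuid described in the PR summary.
fn resolve_uuid(labels: &HashMap<String, String>) -> Option<&str> {
    first_label(labels, &["gpu_uuid", "npu_uuid", "uuid"])
}

/// Hypothetical helper: resolve the device index using the precedence
/// gpu_index -> npu_index -> index.
fn resolve_index(labels: &HashMap<String, String>) -> Option<&str> {
    first_label(labels, &["gpu_index", "npu_index", "index"])
}
```

With this shape, a sample that carries only the legacy `uuid` label still resolves, while a sample that carries both `gpu_uuid` and `npu_uuid` resolves to the `gpu_uuid` value, which is the precedence the two new tests are said to lock in.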

Breaking change

The NPU base labels index and uuid in the Prometheus exposition format are renamed to npu_index and npu_uuid for all non-NVIDIA NPU-specific metrics (all_smi_npu_*, all_smi_tenstorrent_*, all_smi_rebellions_*, all_smi_furiosa_*, all_smi_gaudi_*, all_smi_tpu_*).

Existing Prometheus dashboards and alert rules that filter NPU metrics by index= or uuid= labels must be migrated to use npu_index= and npu_uuid= respectively.

Migration

- all_smi_tpu_utilization_percent{uuid="TPU-0", index="0"}
+ all_smi_tpu_utilization_percent{npu_uuid="TPU-0", npu_index="0"}

- all_smi_tenstorrent_asic_temperature_celsius{uuid="Tenstorrent-0", index="0"}
+ all_smi_tenstorrent_asic_temperature_celsius{npu_uuid="Tenstorrent-0", npu_index="0"}

The remote all-smi view parser accepts both the new and legacy label names for one release so that nodes running older builds can still be consumed by newer clients during a rolling upgrade.
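
For anyone migrating many selectors at once, the rename can be scripted. The sketch below is an assumption-laden illustration rather than a tool shipped with all-smi: `migrate_line` is a hypothetical helper that rewrites the `uuid`/`index` label keys only on lines that start with one of the NPU metric-family prefixes listed under "Breaking change". It splits labels on commas, so it does not handle label values that contain commas, and it leaves `!=` matchers untouched.

```rust
/// NPU-specific metric family prefixes named in the PR description.
const NPU_PREFIXES: [&str; 6] = [
    "all_smi_npu_",
    "all_smi_tenstorrent_",
    "all_smi_rebellions_",
    "all_smi_furiosa_",
    "all_smi_gaudi_",
    "all_smi_tpu_",
];

/// Hypothetical helper: rewrite legacy `uuid`/`index` label keys to
/// `npu_uuid`/`npu_index` on NPU-family lines; return other lines unchanged.
fn migrate_line(line: &str) -> String {
    if !NPU_PREFIXES.iter().any(|p| line.starts_with(*p)) {
        return line.to_string();
    }
    // Locate the label set; lines without one are returned as-is.
    let (Some(open), Some(close)) = (line.find('{'), line.rfind('}')) else {
        return line.to_string();
    };
    if open > close {
        return line.to_string();
    }
    let renamed: Vec<String> = line[open + 1..close]
        .split(',')
        .map(|pair| {
            let pair = pair.trim();
            match pair.split_once('=') {
                Some(("uuid", rest)) => format!("npu_uuid={rest}"),
                Some(("index", rest)) => format!("npu_index={rest}"),
                _ => pair.to_string(),
            }
        })
        .collect();
    format!("{}{{{}}}{}", &line[..open], renamed.join(", "), &line[close + 1..])
}
```

Applied to the first "before" line in the Migration section, this produces the corresponding "after" line; lines for all_smi_gpu_* metrics pass through untouched.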

Test plan

- cargo build passes
- cargo test: all 495 unit tests + 543 binary tests + integration suites pass
- New tests parser_accepts_npu_labels_for_uuid_and_index and parser_prefers_gpu_uuid_when_both_gpu_and_npu_labels_present cover the new aliasing and precedence
- cargo fmt clean
- Manual inspection confirms all_smi_gpu_*/all_smi_ane_* metrics continue to use gpu_uuid/gpu_index while all_smi_npu_*/vendor-specific metrics use npu_uuid/npu_index

Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases.

This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`). The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels.

The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable. Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order.

API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename.

BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility.

- Use `const {}` block for compile-time assertions in amd.rs version component validation test (clippy::assertions_on_constants)
- Use range `.contains()` instead of manual bounds checks in amd.rs (clippy::manual_range_contains)
- Use inline format argument in tpu_grpc.rs debug println (clippy::uninlined_format_args)
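
For readers unfamiliar with these three lints, the idioms look roughly like this. The snippet is an illustrative sketch, not the code from amd.rs or tpu_grpc.rs: the constant `MAX_COMPONENT` and both function names are hypothetical, and the inline `const {}` block assumes Rust 1.79 or newer.

```rust
/// Hypothetical upper bound, used only for illustration.
const MAX_COMPONENT: u32 = 999;

/// Check that a version component falls inside the accepted range.
fn component_in_range(value: u32) -> bool {
    // clippy::assertions_on_constants: an assertion over constants is
    // evaluated at compile time when wrapped in a `const {}` block.
    const { assert!(MAX_COMPONENT > 0) };

    // clippy::manual_range_contains: prefer a range check over
    // `value >= 1 && value <= MAX_COMPONENT`.
    (1..=MAX_COMPONENT).contains(&value)
}

/// Build a debug message for a device count.
fn debug_line(device_count: usize) -> String {
    // clippy::uninlined_format_args: capture the variable inline instead
    // of writing `format!("devices: {}", device_count)`.
    format!("devices: {device_count}")
}
```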

inureyes · 2026-04-17T06:40:18Z

PR Finalization Complete

Summary

Tests: Both new parser tests (parser_accepts_npu_labels_for_uuid_and_index, parser_prefers_gpu_uuid_when_both_gpu_and_npu_labels_present) confirmed passing. Full suite: 495 lib unit tests + 543 binary tests + 21 doc tests + integration suites, all green.
Documentation: API.md rows updated for all NPU metric families (confirmed). README.md v0.21.0 changelog entry includes prominent BREAKING callout (confirmed). No Korean translation files exist in this repo, so no additional translation is needed.
Lint/Format: Fixed pre-existing clippy issues in two files, unrelated to this PR's changes:
- src/device/readers/amd.rs: moved constant assertions into a const {} block and converted manual range comparisons to .contains()
- src/device/readers/tpu_grpc.rs: converted to an inline format argument in the debug println!
Result: cargo fmt --check and cargo clippy --all-targets -- -D warnings (default features) are both clean.

All checks passing. Ready for merge.

inureyes added the type:enhancement, priority:low, impact:breaking, device:npu, and status:review labels Apr 17, 2026

inureyes added the status:done label and removed the status:review label Apr 17, 2026

inureyes merged commit a6015fa into main Apr 17, 2026
2 checks passed

inureyes deleted the feature/issue-177-npu-label-standardization branch April 17, 2026 06:41

inureyes self-assigned this Apr 17, 2026

inureyes mentioned this pull request Apr 17, 2026

feat(tui): topology view tab ('T') — NvLink/NUMA/PCIe graph and matrix #190

Closed

10 tasks

inureyes merged 2 commits into main from feature/issue-177-npu-label-standardization
