feat: standardize NPU Prometheus labels to npu_index/npu_uuid #181

Merged
inureyes merged 2 commits into main from feature/issue-177-npu-label-standardization on Apr 17, 2026

@inureyes

Member

Summary

Completes the label standardization started in PR #176 for NVIDIA GPUs. All non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) now emit npu_index/npu_uuid in place of the generic index/uuid base labels.

Closes #177.

  • Rename uuid/index → npu_uuid/npu_index in all NPU exporters under src/api/metrics/npu/ and in the NPU-specific mock templates under src/mock/templates/.
  • Leave the GPU-family base labels (gpu_uuid/gpu_index) on metrics that are shared across GPU and NPU devices (all_smi_gpu_*, all_smi_ane_*) since those are emitted by the unified exporter in src/api/metrics/gpu.rs.
  • Extend the remote parser to fall back through gpu_uuid → npu_uuid → uuid and gpu_index → npu_index → index (with a matching gpu → npu device-name fallback), so mixed-version clusters continue to parse. Two new round-trip tests lock in the aliasing and precedence order.
  • Update the common::add_basic_gpu_metrics mock helper to emit gpu_uuid/gpu_index (closing a gap left by PR #176) so the mock stream now matches the real exporter for all_smi_gpu_* lines.
  • Refresh API.md rows for all affected NPU metric families and extend the v0.21.0 entry in README.md to note the NPU rename.
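
The fallback order described for the remote parser can be sketched as a first-match lookup over an ordered list of label keys. This is a minimal illustration, not the code in src/network/metrics_parser.rs: the `first_label` and `labels_from` helpers and the `HashMap` representation of parsed labels are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Lookup order from the PR description: GPU-prefixed, then NPU-prefixed, then legacy.
const UUID_KEYS: [&str; 3] = ["gpu_uuid", "npu_uuid", "uuid"];
const INDEX_KEYS: [&str; 3] = ["gpu_index", "npu_index", "index"];

/// Hypothetical helper: return the value of the first key in `keys` that is
/// present in `labels`, or `None` if none of them is present.
fn first_label<'a>(labels: &'a HashMap<String, String>, keys: &[&str]) -> Option<&'a str> {
    keys.iter().find_map(|k| labels.get(*k).map(String::as_str))
}

/// Hypothetical helper: build a label map from string pairs.
fn labels_from(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With this shape, a node that still emits `uuid`/`index` resolves through the last entry of each list, while a line carrying both `gpu_uuid` and `npu_uuid` resolves to the `gpu_uuid` value, which is the precedence the new tests are described as locking in.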

Breaking change

The NPU base labels index and uuid in the Prometheus exposition format are renamed to npu_index and npu_uuid for all non-NVIDIA NPU-specific metrics (all_smi_npu_*, all_smi_tenstorrent_*, all_smi_rebellions_*, all_smi_furiosa_*, all_smi_gaudi_*, all_smi_tpu_*).

Existing Prometheus dashboards and alert rules that filter NPU metrics by index= or uuid= labels must be migrated to use npu_index= and npu_uuid= respectively.

Migration

- all_smi_tpu_utilization_percent{uuid="TPU-0", index="0"}
+ all_smi_tpu_utilization_percent{npu_uuid="TPU-0", npu_index="0"}

- all_smi_tenstorrent_asic_temperature_celsius{uuid="Tenstorrent-0", index="0"}
+ all_smi_tenstorrent_asic_temperature_celsius{npu_uuid="Tenstorrent-0", npu_index="0"}

The remote all-smi view parser accepts both the new and legacy label names for one release so that nodes running older builds can still be consumed by newer clients during a rolling upgrade.
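
For dashboards with many selectors, the rename shown in the Migration diff could be applied mechanically. The sketch below is a hypothetical helper, not part of all-smi: `migrate_selector` is an invented name, and the parsing is deliberately naive (it assumes a single `name{...}` selector with no commas or braces inside label values).

```rust
/// Metric-name prefixes that carry the NPU base labels, per the PR description.
const NPU_PREFIXES: [&str; 6] = [
    "all_smi_npu_",
    "all_smi_tenstorrent_",
    "all_smi_rebellions_",
    "all_smi_furiosa_",
    "all_smi_gaudi_",
    "all_smi_tpu_",
];

/// Hypothetical helper: rewrite legacy `uuid`/`index` label keys in a selector
/// of the form `name{matchers}`. Anything else is returned unchanged.
fn migrate_selector(selector: &str) -> String {
    let Some((name, rest)) = selector.split_once('{') else {
        return selector.to_string();
    };
    let Some(body) = rest.strip_suffix('}') else {
        return selector.to_string();
    };
    // Only the NPU-specific families were renamed; GPU-family metrics stay as-is.
    if !NPU_PREFIXES.iter().any(|p| name.starts_with(p)) {
        return selector.to_string();
    }
    let migrated: Vec<String> = body
        .split(',')
        .map(|matcher| {
            let matcher = matcher.trim();
            // The label key is the leading run of identifier characters, so
            // matchers such as `uuid=~"..."` or `index!="0"` are handled too.
            let key_len = matcher
                .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
                .unwrap_or(matcher.len());
            let (key, tail) = matcher.split_at(key_len);
            match key {
                "uuid" => format!("npu_uuid{tail}"),
                "index" => format!("npu_index{tail}"),
                _ => matcher.to_string(),
            }
        })
        .collect();
    format!("{name}{{{}}}", migrated.join(", "))
}
```

Matching on the whole key rather than on a substring keeps the rewrite idempotent: a selector that already uses `npu_uuid` is left alone instead of becoming `npu_npu_uuid`.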

Test plan

  • cargo build passes
  • cargo test — all 495 unit tests + 543 binary tests + integration suites pass
  • New tests parser_accepts_npu_labels_for_uuid_and_index and parser_prefers_gpu_uuid_when_both_gpu_and_npu_labels_present cover the new aliasing and precedence
  • cargo fmt clean
  • Manual inspection confirms all_smi_gpu_*/all_smi_ane_* metrics continue to use gpu_uuid/gpu_index while all_smi_npu_*/vendor-specific metrics use npu_uuid/npu_index
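
The manual inspection in the last item could be approximated by a small check over exposition lines. This is only a sketch under stated assumptions: `expected_prefix` and `line_uses_expected_labels` are hypothetical names, the family prefixes are taken from the PR description, and the label parsing assumes no commas or braces inside label values.

```rust
/// Expected base-label prefix for a metric family, per the PR description:
/// `gpu` for the shared GPU-family metrics, `npu` for NPU-specific ones.
fn expected_prefix(metric: &str) -> Option<&'static str> {
    const GPU_FAMILIES: [&str; 2] = ["all_smi_gpu_", "all_smi_ane_"];
    const NPU_FAMILIES: [&str; 6] = [
        "all_smi_npu_",
        "all_smi_tenstorrent_",
        "all_smi_rebellions_",
        "all_smi_furiosa_",
        "all_smi_gaudi_",
        "all_smi_tpu_",
    ];
    if GPU_FAMILIES.iter().any(|p| metric.starts_with(p)) {
        Some("gpu")
    } else if NPU_FAMILIES.iter().any(|p| metric.starts_with(p)) {
        Some("npu")
    } else {
        None
    }
}

/// Hypothetical check: does the line carry `<prefix>_uuid` and `<prefix>_index`?
/// Lines without a label set fail; metrics outside the known families pass.
fn line_uses_expected_labels(line: &str) -> bool {
    let Some((name, rest)) = line.split_once('{') else {
        return false;
    };
    let Some(prefix) = expected_prefix(name) else {
        return true;
    };
    // Everything up to the first closing brace is treated as the label set.
    let body = rest.split('}').next().unwrap_or("");
    let keys: Vec<&str> = body
        .split(',')
        .filter_map(|pair| pair.split_once('=').map(|(k, _)| k.trim()))
        .collect();
    let uuid = format!("{prefix}_uuid");
    let index = format!("{prefix}_index");
    keys.contains(&uuid.as_str()) && keys.contains(&index.as_str())
}
```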

Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid`
across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa,
Intel Gaudi, Google TPU) and their vendor-specific metric families.
Align the NPU mock templates with the new labeling, and extend the
remote parser to accept both the new and legacy label aliases.

This completes the label standardization started in PR #176 for
NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`).
The GPU-family metrics that are shared across GPU and NPU devices
(e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry
`gpu_index`/`gpu_uuid`, matching the real exporter in
`src/api/metrics/gpu.rs`; only the NPU-specific metric families
(`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`,
`all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry
the NPU-prefixed labels.

The remote parser now falls back through `gpu_uuid` → `npu_uuid` →
`uuid` and `gpu_index` → `npu_index` → `index` (with a matching
`gpu` → `npu` fallback for the device name label), so nodes running
pre-v0.21.0 or mixed versions remain parseable. Two new round-trip
tests in `src/network/metrics_parser.rs` lock in the new aliasing
and the precedence order.

API.md and the v0.21.0 changelog entry in README.md are updated to
reflect the rename.

BREAKING CHANGE: The NPU base labels `index` and `uuid` in the
Prometheus exposition format are renamed to `npu_index` and
`npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing
Prometheus dashboards and alert rules that filter NPU metrics by
`index=` or `uuid=` labels will need to be updated to use
`npu_index=` and `npu_uuid=` respectively. The remote parser
accepts both old and new label names for backward compatibility.
@inureyes added the type:enhancement, priority:low, impact:breaking, device:npu, and status:review labels on Apr 17, 2026
- Use `const {}` block for compile-time assertions in amd.rs version
  component validation test (clippy::assertions_on_constants)
- Use range `.contains()` instead of manual bounds checks in amd.rs
  (clippy::manual_range_contains)
- Use inline format argument in tpu_grpc.rs debug println
  (clippy::uninlined_format_args)
@inureyes

Member Author

PR Finalization Complete

Summary

  • Tests: Both new parser tests (parser_accepts_npu_labels_for_uuid_and_index, parser_prefers_gpu_uuid_when_both_gpu_and_npu_labels_present) confirmed passing. Full suite: 495 lib unit tests + 543 binary tests + 21 doc tests + integration suites — all green.
  • Documentation: API.md rows updated for all NPU metric families (confirmed). README.md v0.21.0 changelog entry includes prominent BREAKING callout (confirmed). No Korean translation files exist in this repo, so no additional translation needed.
  • Lint/Format: Fixed two pre-existing clippy issues not related to this PR's changes:
    • src/device/readers/amd.rs: moved constant assertions into a const {} block and converted manual range comparisons to .contains()
    • src/device/readers/tpu_grpc.rs: converted to inline format argument in debug println!
    • cargo fmt --check clean, cargo clippy --all-targets -- -D warnings (default features) clean
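
For readers unfamiliar with the three lints named above, the patterns look roughly like this. The constants, ranges, and function names are invented for illustration and are not the contents of amd.rs or tpu_grpc.rs; the inline `const { ... }` block assumes a toolchain of Rust 1.79 or newer.

```rust
// Hypothetical constants standing in for version components.
const MAJOR: u32 = 6;
const MINOR: u32 = 2;

/// clippy::assertions_on_constants: an assertion on constants is evaluated
/// at compile time when wrapped in an inline `const` block.
fn version_is_sane() -> bool {
    const { assert!(MAJOR > 0) };
    const { assert!(MINOR < 100) };
    true
}

/// clippy::manual_range_contains: prefer a range check over
/// `value >= 1 && value <= 99`.
fn in_range(value: u32) -> bool {
    (1..=99).contains(&value)
}

/// clippy::uninlined_format_args: name the variable inside the braces
/// instead of passing it as a trailing argument.
fn describe(device: &str) -> String {
    format!("device: {device}")
}
```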

All checks passing. Ready for merge.

@inureyes added the status:done label and removed the status:review label on Apr 17, 2026
@inureyes inureyes merged commit a6015fa into main Apr 17, 2026
2 checks passed
@inureyes inureyes deleted the feature/issue-177-npu-label-standardization branch April 17, 2026 06:41
@inureyes inureyes self-assigned this Apr 17, 2026
Labels

  • device:npu: NPU (Neural Processing Unit) related
  • impact:breaking: Breaking change that requires migration
  • priority:low: Low priority issue
  • status:done: Completed
  • type:enhancement: New feature or request

Development

Successfully merging this pull request may close these issues.

enhancement: Standardize NPU Prometheus label names to npu_index/npu_uuid