feat: standardize NPU Prometheus labels to npu_index/npu_uuid #181
Merged
Rename the generic `index`/`uuid` base labels to `npu_index`/`npu_uuid` across all non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) and their vendor-specific metric families. Align the NPU mock templates with the new labeling, and extend the remote parser to accept both the new and legacy label aliases.

This completes the label standardization started in PR #176 for NVIDIA GPUs (#176 renamed `index`/`uuid` to `gpu_index`/`gpu_uuid`).

The GPU-family metrics that are shared across GPU and NPU devices (e.g. `all_smi_gpu_utilization`, `all_smi_ane_*`) continue to carry `gpu_index`/`gpu_uuid`, matching the real exporter in `src/api/metrics/gpu.rs`; only the NPU-specific metric families (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`) carry the NPU-prefixed labels.

The remote parser now falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` fallback for the device name label), so nodes running pre-v0.21.0 or mixed versions remain parseable.

Two new round-trip tests in `src/network/metrics_parser.rs` lock in the new aliasing and the precedence order. API.md and the v0.21.0 changelog entry in README.md are updated to reflect the rename.

BREAKING CHANGE: The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics. Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels will need to be updated to use `npu_index=` and `npu_uuid=` respectively. The remote parser accepts both old and new label names for backward compatibility.
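The fallback order described above can be sketched as a first-match lookup over an ordered list of label names. This is a minimal illustration only: the helper names (`first_label`, `device_uuid`, `device_index`, `device_name`) and the `HashMap` representation of parsed labels are assumptions, not the actual code in `src/network/metrics_parser.rs`.

```rust
use std::collections::HashMap;

/// Returns the value of the first label in `candidates` that is present.
/// Hypothetical helper: the real parser may structure this differently.
fn first_label<'a>(labels: &'a HashMap<String, String>, candidates: &[&str]) -> Option<&'a str> {
    candidates
        .iter()
        .find_map(|name| labels.get(*name).map(String::as_str))
}

/// UUID lookup in the documented order: gpu_uuid, then npu_uuid, then legacy uuid.
fn device_uuid(labels: &HashMap<String, String>) -> Option<&str> {
    first_label(labels, &["gpu_uuid", "npu_uuid", "uuid"])
}

/// Index lookup in the documented order: gpu_index, then npu_index, then legacy index.
fn device_index(labels: &HashMap<String, String>) -> Option<&str> {
    first_label(labels, &["gpu_index", "npu_index", "index"])
}

/// Device-name lookup: gpu, then npu.
fn device_name(labels: &HashMap<String, String>) -> Option<&str> {
    first_label(labels, &["gpu", "npu"])
}
```

Keeping the precedence in one ordered slice per field means a line that carries both `gpu_uuid` and `npu_uuid` resolves to the GPU-prefixed value, which is the behavior the precedence test is described as locking in.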
- Use `const {}` block for compile-time assertions in amd.rs version
component validation test (clippy::assertions_on_constants)
- Use range `.contains()` instead of manual bounds checks in amd.rs
(clippy::manual_range_contains)
- Use inline format argument in tpu_grpc.rs debug println
(clippy::uninlined_format_args)
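For readers unfamiliar with these three lints, the patterns look roughly like the sketch below. The constants, function names, and bounds are hypothetical and do not come from `amd.rs` or `tpu_grpc.rs`; `format!` stands in for the debug `println!` so the result can be checked. The inline `const {}` block assumes Rust 1.79 or newer.

```rust
// Hypothetical constants: the real values checked in amd.rs are not shown in this PR.
const VERSION_MAJOR: u32 = 6;
const VERSION_MINOR: u32 = 2;

/// clippy::assertions_on_constants: an assertion on constants is moved into a
/// `const {}` block so it is evaluated at compile time instead of at run time.
fn check_version_constants() {
    const { assert!(VERSION_MAJOR <= 255 && VERSION_MINOR <= 255) };
}

/// clippy::manual_range_contains: `value >= 0 && value <= 255` style checks
/// are written as a range `.contains()` call.
fn is_valid_component(value: u32) -> bool {
    (0..=255).contains(&value)
}

/// clippy::uninlined_format_args: the variable is named inside the braces
/// rather than passed as a trailing positional argument.
fn describe(chip_count: usize) -> String {
    format!("detected {chip_count} chips")
}
```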
PR Finalization Complete
Summary: All checks passing. Ready for merge.
Summary
Completes the label standardization started in PR #176 for NVIDIA GPUs. All non-NVIDIA NPU exporters (Tenstorrent, Rebellions, Furiosa, Intel Gaudi, Google TPU) now emit `npu_index`/`npu_uuid` in place of the generic `index`/`uuid` base labels.

Closes #177.

- Rename `uuid`/`index` → `npu_uuid`/`npu_index` in all NPU exporters under `src/api/metrics/npu/` and in the NPU-specific mock templates under `src/mock/templates/`.
- Keep the GPU-prefixed labels (`gpu_uuid`/`gpu_index`) on metrics that are shared across GPU and NPU devices (`all_smi_gpu_*`, `all_smi_ane_*`), since those are emitted by the unified exporter in `src/api/metrics/gpu.rs`.
- The remote parser falls back through `gpu_uuid` → `npu_uuid` → `uuid` and `gpu_index` → `npu_index` → `index` (with a matching `gpu` → `npu` device-name fallback), so mixed-version clusters continue to parse. Two new round-trip tests lock in the aliasing and precedence order.
- Update the `common::add_basic_gpu_metrics` mock helper to emit `gpu_uuid`/`gpu_index` (closing a gap left by PR #176, "fix: cross-PR review fixes for NVIDIA monitoring features (#172-#175)") so the mock stream now matches the real exporter for `all_smi_gpu_*` lines.

Breaking change
The NPU base labels `index` and `uuid` in the Prometheus exposition format are renamed to `npu_index` and `npu_uuid` for all non-NVIDIA NPU-specific metrics (`all_smi_npu_*`, `all_smi_tenstorrent_*`, `all_smi_rebellions_*`, `all_smi_furiosa_*`, `all_smi_gaudi_*`, `all_smi_tpu_*`).

Existing Prometheus dashboards and alert rules that filter NPU metrics by `index=` or `uuid=` labels must be migrated to use `npu_index=` and `npu_uuid=` respectively.

Migration
The remote `all-smi view` parser accepts both the new and legacy label names for one release so that nodes running older builds can still be consumed by newer clients during a rolling upgrade.

Test plan
- `cargo build` passes
- `cargo test`: all 495 unit tests + 543 binary tests + integration suites pass
- `parser_accepts_npu_labels_for_uuid_and_index` and `parser_prefers_gpu_uuid_when_both_gpu_and_npu_labels_present` cover the new aliasing and precedence
- `cargo fmt` clean
- `all_smi_gpu_*`/`all_smi_ane_*` metrics continue to use `gpu_uuid`/`gpu_index` while `all_smi_npu_*`/vendor-specific metrics use `npu_uuid`/`npu_index`
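
The split between GPU-prefixed and NPU-prefixed labels in the last item can be expressed as a lookup from metric-name prefix to expected label names. The sketch below is one way to write such a check when auditing dashboards or mock output; `expected_labels` and `NPU_FAMILIES` are hypothetical names, not part of the all-smi codebase, and the prefixes are taken from the families listed in this PR.

```rust
/// Metric-name prefixes that, per this PR, carry `npu_index`/`npu_uuid`.
const NPU_FAMILIES: [&str; 6] = [
    "all_smi_npu_",
    "all_smi_tenstorrent_",
    "all_smi_rebellions_",
    "all_smi_furiosa_",
    "all_smi_gaudi_",
    "all_smi_tpu_",
];

/// Returns the expected (index, uuid) label names for a metric, or `None`
/// if the metric belongs to a family this PR does not mention.
/// Hypothetical helper for illustration only.
fn expected_labels(metric: &str) -> Option<(&'static str, &'static str)> {
    if metric.starts_with("all_smi_gpu_") || metric.starts_with("all_smi_ane_") {
        // Shared GPU-family metrics keep the GPU-prefixed labels.
        Some(("gpu_index", "gpu_uuid"))
    } else if NPU_FAMILIES.iter().any(|prefix| metric.starts_with(*prefix)) {
        // NPU-specific families use the renamed labels.
        Some(("npu_index", "npu_uuid"))
    } else {
        None
    }
}
```

For example, `expected_labels("all_smi_gpu_utilization")` yields the `gpu_index`/`gpu_uuid` pair, while any metric under `all_smi_tenstorrent_*` yields `npu_index`/`npu_uuid`.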