Overview
Currently, each accelerator platform uses platform-specific labels for AI acceleration libraries in API metrics:
- NVIDIA: cuda_version="13.0"
- AMD: rocm_version="7.0.2"
- Jetson: cuda_version="..." (same as NVIDIA)
This makes it difficult to create unified queries and dashboards that work across different accelerator types. We should add standardized lib_name and lib_version labels while keeping the platform-specific labels for backward compatibility.
Current State
NVIDIA GPUs
all_smi_gpu_info{gpu="NVIDIA H100",instance="...",uuid="...",index="0",
driver_version="580.82.07",cuda_version="13.0"} 1
AMD GPUs
all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="...",uuid="...",index="0",
driver_version="30.10.1",rocm_version="7.0.2"} 1
NVIDIA Jetson
all_smi_gpu_info{gpu="Jetson AGX Orin",instance="...",uuid="...",index="0",
driver_version="...",cuda_version="..."} 1
Proposed Solution
Add two new standardized labels to the all_smi_gpu_info metric:
- lib_name: Name of the AI acceleration library/framework
- lib_version: Version of the AI acceleration library/framework
After Implementation
NVIDIA GPUs:
all_smi_gpu_info{gpu="NVIDIA H100",instance="...",uuid="...",index="0",
driver_version="580.82.07",cuda_version="13.0",
lib_name="CUDA",lib_version="13.0"} 1
AMD GPUs:
all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="...",uuid="...",index="0",
driver_version="30.10.1",rocm_version="7.0.2",
lib_name="ROCm",lib_version="7.0.2"} 1
NVIDIA Jetson:
all_smi_gpu_info{gpu="Jetson AGX Orin",instance="...",uuid="...",index="0",
driver_version="...",cuda_version="...",
lib_name="CUDA",lib_version="..."} 1
Apple Silicon (future):
all_smi_gpu_info{gpu="Apple M4 Max",instance="...",uuid="...",index="0",
lib_name="Metal",lib_version="..."} 1
Benefits
1. Unified Queries
Users can write platform-agnostic PromQL queries:
# Get all devices with AI library major version >= 7 (also matches 10 and above)
all_smi_gpu_info{lib_version=~"([7-9]|[1-9][0-9]+)(\\..*)?"}
# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)
# Alert on old CUDA versions (major version < 12)
all_smi_gpu_info{lib_name="CUDA", lib_version!~"(1[2-9]|[2-9][0-9])(\\..*)?"}
2. Simplified Dashboards
Create single Grafana panels that work across all platforms:
- Library distribution pie chart
- Version upgrade tracking
- Cross-platform compatibility matrix
3. Better Cluster Management
- Easily identify which AI frameworks are deployed
- Track version consistency across heterogeneous clusters
- Plan upgrades based on library versions
4. Backward Compatibility
Keep existing platform-specific labels (cuda_version, rocm_version) for existing queries and tools.
Implementation Tasks
1. Update GPU Readers
Files: src/device/readers/{nvidia,amd,nvidia_jetson,apple_silicon}.rs
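The reader change could look roughly like the sketch below. It assumes the detail map is a `HashMap<String, String>`; the helper name `add_lib_labels` is hypothetical and does not come from the codebase. The point is only that the standardized labels are added alongside the existing platform-specific keys rather than replacing them.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not from the codebase): adds the standardized labels
/// to a reader's detail map without touching existing platform-specific keys.
fn add_lib_labels(
    detail: &mut HashMap<String, String>,
    lib_name: &str,
    lib_version: Option<&str>,
) {
    detail.insert("lib_name".to_string(), lib_name.to_string());
    // Only emit lib_version when the reader could determine one.
    if let Some(version) = lib_version {
        detail.insert("lib_version".to_string(), version.to_string());
    }
}
```

A reader that already stores `cuda_version` would pass that same value as `lib_version`, so both labels stay in sync.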
2. Update Mock Templates
Files: src/mock/templates/{nvidia,amd_gpu,jetson,apple_silicon}.rs
3. API Mode Integration
File: src/api/metrics/gpu.rs
The export_device_info() function already dynamically includes all fields from the detail HashMap, so:
- ✅ New labels will be automatically exposed
- ✅ No code changes needed in API metrics generation
- ✅ Label sanitization already handles spaces: "lib name" → "lib_name"
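To illustrate why no exporter change should be needed, here is a minimal sketch of an exporter that turns every detail field into a label. It is not the actual `export_device_info()` implementation: `sanitize_label_name` and `render_info_metric` are hypothetical names, a `BTreeMap` is used only to get deterministic label order, and label-value escaping is omitted for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical sanitizer: replaces any character that is not an ASCII
/// letter, digit, or underscore with an underscore.
fn sanitize_label_name(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect()
}

/// Renders one info-style sample from arbitrary detail fields.
/// Simplification: label values are not escaped here.
fn render_info_metric(metric: &str, detail: &BTreeMap<String, String>) -> String {
    let labels: Vec<String> = detail
        .iter()
        .map(|(key, value)| format!("{}=\"{}\"", sanitize_label_name(key), value))
        .collect();
    format!("{}{{{}}} 1", metric, labels.join(","))
}
```

Because the loop covers every key in the map, a newly inserted `lib_name` or `lib_version` entry appears in the output without further changes.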
4. Documentation Updates
Files: API.md, README.md
5. Testing
Platform Mapping
| Platform | driver_version | Platform-specific label | lib_name | lib_version |
| NVIDIA GPU | 580.82.07 | cuda_version="13.0" | CUDA | 13.0 |
| AMD GPU | 30.10.1 | rocm_version="7.0.2" | ROCm | 7.0.2 |
| Jetson | ... | cuda_version="..." | CUDA | ... |
| Apple Silicon | N/A | N/A | Metal | ... |
| Tenstorrent | ... | N/A | N/A* | N/A* |
| Rebellions | ... | N/A | N/A* | N/A* |
| Furiosa | ... | N/A | N/A* | N/A* |
* NPUs may not have an AI framework version in the traditional sense
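The mapping above can be expressed as a small lookup, which may be handy for tests. This is only a sketch: the `Platform` enum and both function names are hypothetical, and the NPU rows return `None` to mirror the N/A* entries.

```rust
/// Platforms from the mapping table; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Platform {
    NvidiaGpu,
    AmdGpu,
    Jetson,
    AppleSilicon,
    Tenstorrent,
    Rebellions,
    Furiosa,
}

/// Returns the standardized lib_name for a platform, or None where the
/// table lists N/A (the NPU rows).
fn lib_name_for(platform: Platform) -> Option<&'static str> {
    match platform {
        Platform::NvidiaGpu | Platform::Jetson => Some("CUDA"),
        Platform::AmdGpu => Some("ROCm"),
        Platform::AppleSilicon => Some("Metal"),
        Platform::Tenstorrent | Platform::Rebellions | Platform::Furiosa => None,
    }
}

/// Returns the platform-specific label whose value lib_version mirrors, if any.
fn legacy_version_label(platform: Platform) -> Option<&'static str> {
    match platform {
        Platform::NvidiaGpu | Platform::Jetson => Some("cuda_version"),
        Platform::AmdGpu => Some("rocm_version"),
        _ => None,
    }
}
```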
Example PromQL Queries
# Count devices by AI library
count by (lib_name) (all_smi_gpu_info)
# Get all CUDA devices version 12+
all_smi_gpu_info{lib_name="CUDA", lib_version=~"(1[2-9]|[2-9][0-9])(\\..*)?"}
# Alert on outdated ROCm versions (< 7.0)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"([7-9]|[1-9][0-9]+)(\\..*)?"} == 1
# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)
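Label matchers compare text, not numbers, so version thresholds written as regexes are easy to get wrong (for example, a pattern that only covers 7 to 9 misses 10 and above). When writing tests for such thresholds, a numeric comparison can serve as the reference. The sketch below is illustrative only; `parse_version` and `is_at_least` are hypothetical helpers, and it assumes purely numeric, dot-separated versions.

```rust
/// Parses a dot-separated numeric version such as "7.0.2" into its components.
/// Returns None if any component is not an unsigned integer.
fn parse_version(version: &str) -> Option<Vec<u64>> {
    version
        .split('.')
        .map(|part| part.parse::<u64>().ok())
        .collect()
}

/// Returns Some(true) if `version` >= `minimum`, comparing component by
/// component; missing trailing components are treated as zero.
fn is_at_least(version: &str, minimum: &str) -> Option<bool> {
    let mut v = parse_version(version)?;
    let mut m = parse_version(minimum)?;
    let len = v.len().max(m.len());
    v.resize(len, 0);
    m.resize(len, 0);
    Some(v >= m)
}
```

Placeholder values such as "..." do not parse and yield `None`, so callers can decide how to treat devices without a known version.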
Acceptance Criteria
Related Issues
Implementation Task Details
1. Update GPU Readers
Files: src/device/readers/{nvidia,amd,nvidia_jetson,apple_silicon}.rs
NVIDIA Reader (nvidia.rs):
- Add lib_name="CUDA" to detail HashMap
- Add lib_version=<cuda_version> to detail HashMap
- Keep existing cuda_version label
AMD Reader (amd.rs):
- Add lib_name="ROCm" to detail HashMap
- Add lib_version=<rocm_version> to detail HashMap
- Keep existing rocm_version label
Jetson Reader (nvidia_jetson.rs):
- Add lib_name="CUDA" to detail HashMap
- Add lib_version=<cuda_version> to detail HashMap
- Keep existing cuda_version label
Apple Silicon (apple_silicon.rs):
- Add lib_name="Metal" to detail HashMap
- Add lib_version=<metal_version> if available
2. Update Mock Templates
Files: src/mock/templates/{nvidia,amd_gpu,jetson,apple_silicon}.rs
NVIDIA Mock (nvidia.rs):
- Add lib_name="CUDA" to all_smi_gpu_info labels
- Add lib_version="13.0" (same as cuda_version)
AMD Mock (amd_gpu.rs):
- Add lib_name="ROCm" to all_smi_gpu_info labels
- Add lib_version="7.0.2" (same as rocm_version)
Jetson Mock (jetson.rs):
- Add lib_name="CUDA" to all_smi_gpu_info labels
- Add lib_version (same as cuda_version)
Apple Silicon Mock (apple_silicon.rs):
- Add lib_name="Metal" to all_smi_gpu_info labels
- Add lib_version if applicable