Add unified AI acceleration library naming for cross-platform consistency in API mode #66

Description

@inureyes

Overview

Currently, each accelerator platform uses platform-specific labels for AI acceleration libraries in API metrics:

  • NVIDIA: cuda_version="13.0"
  • AMD: rocm_version="7.0.2"
  • Jetson: cuda_version="..." (same as NVIDIA)

This makes it difficult to create unified queries and dashboards that work across different accelerator types. We should add standardized lib_name and lib_version labels while keeping the platform-specific labels for backward compatibility.

Current State

NVIDIA GPUs

all_smi_gpu_info{gpu="NVIDIA H100",instance="...",uuid="...",index="0",
  driver_version="580.82.07",cuda_version="13.0"} 1

AMD GPUs

all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="...",uuid="...",index="0",
  driver_version="30.10.1",rocm_version="7.0.2"} 1

NVIDIA Jetson

all_smi_gpu_info{gpu="Jetson AGX Orin",instance="...",uuid="...",index="0",
  driver_version="...",cuda_version="..."} 1

Proposed Solution

Add two new standardized labels to the all_smi_gpu_info metric:

  • lib_name: Name of the AI acceleration library/framework
  • lib_version: Version of the AI acceleration library/framework

After Implementation

NVIDIA GPUs:

all_smi_gpu_info{gpu="NVIDIA H100",instance="...",uuid="...",index="0",
  driver_version="580.82.07",cuda_version="13.0",
  lib_name="CUDA",lib_version="13.0"} 1

AMD GPUs:

all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="...",uuid="...",index="0",
  driver_version="30.10.1",rocm_version="7.0.2",
  lib_name="ROCm",lib_version="7.0.2"} 1

NVIDIA Jetson:

all_smi_gpu_info{gpu="Jetson AGX Orin",instance="...",uuid="...",index="0",
  driver_version="...",cuda_version="...",
  lib_name="CUDA",lib_version="..."} 1

Apple Silicon (future):

all_smi_gpu_info{gpu="Apple M4 Max",instance="...",uuid="...",index="0",
  lib_name="Metal",lib_version="..."} 1

Benefits

1. Unified Queries

Users can write platform-agnostic PromQL queries:

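# Get all devices with AI library major version >= 7
# (label matching is string-based, so two-digit majors such as "13.0" need their own alternative)
all_smi_gpu_info{lib_version=~"([7-9]|[1-9][0-9]+)[.].*"}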

# Count devices by AI library type
count by (lib_name) (all_smi_gpu_info)

# Alert on old CUDA versions (major version < 12)
all_smi_gpu_info{lib_name="CUDA", lib_version!~"(1[2-9]|[2-9][0-9])[.].*"}

2. Simplified Dashboards

Create single Grafana panels that work across all platforms:

  • Library distribution pie chart
  • Version upgrade tracking
  • Cross-platform compatibility matrix

3. Better Cluster Management

  • Easily identify which AI frameworks are deployed
  • Track version consistency across heterogeneous clusters
  • Plan upgrades based on library versions

4. Backward Compatibility

Keep the existing platform-specific labels (cuda_version, rocm_version) so that existing queries and tools continue to work unchanged.

Implementation Tasks

1. Update GPU Readers

Files: src/device/readers/{nvidia,amd,nvidia_jetson,apple_silicon}.rs

  • NVIDIA Reader (nvidia.rs):

    • Add lib_name="CUDA" to detail HashMap
    • Add lib_version=<cuda_version> to detail HashMap
    • Keep existing cuda_version label
  • AMD Reader (amd.rs):

    • Add lib_name="ROCm" to detail HashMap
    • Add lib_version=<rocm_version> to detail HashMap
    • Keep existing rocm_version label
  • Jetson Reader (nvidia_jetson.rs):

    • Add lib_name="CUDA" to detail HashMap
    • Add lib_version=<cuda_version> to detail HashMap
    • Keep existing cuda_version label
  • Apple Silicon (apple_silicon.rs):

    • Add lib_name="Metal" to detail HashMap
    • Add lib_version=<metal_version> if available
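
The reader changes listed above could be sketched as one shared helper. This is a minimal sketch, not the project's code: the function name `add_unified_lib_labels` and the assumption that `detail` is a `HashMap<String, String>` are illustrative and would need to be checked against the actual readers.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of the existing code base): adds the unified
/// `lib_name` / `lib_version` entries to a reader's detail map.
/// The platform-specific key (e.g. "cuda_version") is left untouched so that
/// existing queries keep working.
pub fn add_unified_lib_labels(
    detail: &mut HashMap<String, String>,
    lib_name: &str,
    platform_version_key: &str,
) {
    detail.insert("lib_name".to_string(), lib_name.to_string());

    // Emit lib_version only when the platform-specific version is known,
    // mirroring the "if available" note for Apple Silicon.
    let version = detail.get(platform_version_key).cloned();
    if let Some(version) = version {
        detail.insert("lib_version".to_string(), version);
    }
}
```

Under these assumptions, the NVIDIA and Jetson readers would call it with `("CUDA", "cuda_version")` and the AMD reader with `("ROCm", "rocm_version")`. For Apple Silicon, a key such as `"metal_version"` is itself hypothetical, since the issue does not say where a Metal version would come from.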

2. Update Mock Templates

Files: src/mock/templates/{nvidia,amd_gpu,jetson,apple_silicon}.rs

  • NVIDIA Mock (nvidia.rs):

    • Add lib_name="CUDA" to all_smi_gpu_info labels
    • Add lib_version="13.0" (same as cuda_version)
  • AMD Mock (amd_gpu.rs):

    • Add lib_name="ROCm" to all_smi_gpu_info labels
    • Add lib_version="7.0.2" (same as rocm_version)
  • Jetson Mock (jetson.rs):

    • Add lib_name="CUDA" to all_smi_gpu_info labels
    • Add lib_version (same as cuda_version)
  • Apple Silicon Mock (apple_silicon.rs):

    • Add lib_name="Metal" to all_smi_gpu_info labels
    • Add lib_version if applicable
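
For the mock templates, the target output line could be produced roughly as follows. The helper `mock_gpu_info_line` and its parameters are hypothetical, and the `instance` and `uuid` labels are omitted for brevity because the issue elides their values. How the real templates build their strings is not shown in this issue.

```rust
/// Hypothetical mock-template helper: renders one `all_smi_gpu_info` line that
/// carries both the platform-specific version label and the unified labels.
/// `platform_key` is the existing label name, e.g. "cuda_version" or "rocm_version".
pub fn mock_gpu_info_line(
    gpu: &str,
    index: usize,
    driver_version: &str,
    platform_key: &str,
    lib_name: &str,
    lib_version: &str,
) -> String {
    // The same version string is written to the platform-specific label and
    // to lib_version, as the task list asks ("same as cuda_version").
    format!(
        "all_smi_gpu_info{{gpu=\"{gpu}\",index=\"{index}\",driver_version=\"{driver_version}\",\
         {platform_key}=\"{lib_version}\",lib_name=\"{lib_name}\",lib_version=\"{lib_version}\"}} 1"
    )
}
```

Calling it with `("AMD Instinct MI300X", 0, "30.10.1", "rocm_version", "ROCm", "7.0.2")` would give the AMD variant of the line.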

3. API Mode Integration

File: src/api/metrics/gpu.rs

The export_device_info() function already dynamically includes all fields from the detail HashMap, so:

  • ✅ New labels will be automatically exposed
  • ✅ No code changes needed in API metrics generation
  • ✅ Label sanitization already handles spaces: "lib name" → "lib_name"
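
To illustrate why no exporter change should be needed, here is a simplified sketch of an exporter that turns every entry of the detail map into a label. It is not the actual `export_device_info()` implementation: `sanitize_label_name` and `render_info_metric` are hypothetical names, the sanitizer handles only the space case mentioned above, and label values are not escaped.

```rust
use std::collections::HashMap;

/// Hypothetical sanitizer covering only the case named in the issue:
/// spaces in a key become underscores ("lib name" -> "lib_name").
fn sanitize_label_name(name: &str) -> String {
    name.replace(' ', "_")
}

/// Renders an info-style metric line from all entries of a detail map.
/// Labels are sorted by name so the output is deterministic, because
/// HashMap iteration order is unspecified.
pub fn render_info_metric(metric: &str, detail: &HashMap<String, String>) -> String {
    let mut pairs: Vec<(String, &String)> = detail
        .iter()
        .map(|(k, v)| (sanitize_label_name(k), v))
        .collect();
    pairs.sort();

    let labels: Vec<String> = pairs
        .iter()
        .map(|(k, v)| format!("{k}=\"{v}\""))
        .collect();
    format!("{metric}{{{}}} 1", labels.join(","))
}
```

Because the sketch iterates over whatever the map contains, entries such as `lib_name` and `lib_version` added by a reader show up in the output without touching the exporter.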

4. Documentation Updates

Files: API.md, README.md

  • Update API.md with new unified labels section
  • Add PromQL query examples using lib_name and lib_version
  • Document label mapping for each platform
  • Update cross-platform comparison table

5. Testing

  • Verify NVIDIA GPUs expose lib_name="CUDA" and lib_version
  • Verify AMD GPUs expose lib_name="ROCm" and lib_version
  • Verify Jetson devices expose lib_name="CUDA" and lib_version
  • Verify backward compatibility with existing queries
  • Test PromQL queries with new labels
  • Update integration tests

Platform Mapping

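| Platform      | driver_version | Platform-Specific    | lib_name | lib_version |
|---------------|----------------|----------------------|----------|-------------|
| NVIDIA GPU    | 580.82.07      | cuda_version="13.0"  | CUDA     | 13.0        |
| AMD GPU       | 30.10.1        | rocm_version="7.0.2" | ROCm     | 7.0.2       |
| Jetson        | ...            | cuda_version="..."   | CUDA     | ...         |
| Apple Silicon | N/A            | N/A                  | Metal    | ...         |
| Tenstorrent   | ...            | N/A                  | N/A*     | N/A*        |
| Rebellions    | ...            | N/A                  | N/A*     | N/A*        |
| Furiosa       | ...            | N/A                  | N/A*     | N/A*        |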

* NPUs may not have an AI framework version in the traditional sense
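
The mapping in the table could be kept in one place in code, for example as a small lookup. This is a sketch only: the `Platform` enum and `unified_lib` function are hypothetical and do not reflect types that exist in the repository.

```rust
/// Hypothetical platform enum mirroring the rows of the mapping table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Platform {
    NvidiaGpu,
    AmdGpu,
    Jetson,
    AppleSilicon,
    Tenstorrent,
    Rebellions,
    Furiosa,
}

/// Returns `(lib_name, platform_specific_version_label)` for a platform.
/// `None` means no unified library label applies (the N/A* rows); the inner
/// `Option` is `None` when there is no platform-specific label to copy from.
pub fn unified_lib(platform: Platform) -> Option<(&'static str, Option<&'static str>)> {
    match platform {
        Platform::NvidiaGpu | Platform::Jetson => Some(("CUDA", Some("cuda_version"))),
        Platform::AmdGpu => Some(("ROCm", Some("rocm_version"))),
        Platform::AppleSilicon => Some(("Metal", None)),
        Platform::Tenstorrent | Platform::Rebellions | Platform::Furiosa => None,
    }
}
```

Returning `None` for the NPU rows keeps the open question from the footnote explicit instead of inventing a library name for those platforms.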

Example PromQL Queries

# Count devices by AI library
count by (lib_name) (all_smi_gpu_info)

# Get all CUDA devices version 12+
all_smi_gpu_info{lib_name="CUDA", lib_version=~"1[2-9].*|[2-9][0-9].*"}

# Alert on outdated ROCm versions (major version < 7)
all_smi_gpu_info{lib_name="ROCm", lib_version!~"([7-9]|[1-9][0-9]+)[.].*"} == 1

# Cross-platform library distribution
sum by (lib_name, lib_version) (all_smi_gpu_info)

Acceptance Criteria

  • All GPU readers add lib_name and lib_version to detail HashMap
  • All mock templates include lib_name and lib_version in all_smi_gpu_info
  • Platform-specific labels remain unchanged (backward compatibility)
  • API mode exposes new labels in all_smi_gpu_info metric
  • Documentation includes PromQL examples with new labels
  • All tests pass with new labels
  • PromQL queries work across different platforms

Related Issues
