Add AMD GPU driver version extraction and expose in API mode and mock server #63

@inureyes

Overview

Currently, all-smi reads the ROCm version for AMD GPUs but does not extract the kernel driver version. The libamdgpu_top library provides a get_drm_version_struct() method that returns the driver version as three fields (major, minor, patchlevel).

Current State

Already Implemented ✅

  • ROCm Version: Retrieved via libamdgpu_top::get_rocm_version()
    • Source: $ROCM_PATH/.info/version file
    • Currently added to detail HashMap as "ROCm Version"
    • Location: src/device/readers/amd.rs:219-221

Not Implemented ❌

  • Driver Version: Available but not extracted
    • Method: DeviceHandle::get_drm_version_struct() returns a drmVersion struct
    • Fields: version_major, version_minor, version_patchlevel
    • Reference: references/amdgpu_top/crates/libamdgpu_top/src/app.rs:386-388
    • Reference: references/amdgpu_top/crates/amdgpu_top_json/src/dump.rs (JSON export example)

Implementation Tasks

1. AMD GPU Reader Enhancement

File: src/device/readers/amd.rs

  • Add driver version extraction in collect_gpu_info() method
  • Call device_handle.get_drm_version_struct() to get drmVersion
  • Format as "major.minor.patchlevel" string (e.g., "6.12.0")
  • Add to detail HashMap as "Driver Version"

Code location: Around lines 210-220, where the ROCm version is currently added
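
A minimal sketch of the formatting step, assuming only what is listed above. DrmVersionFields is a hypothetical stand-in for the three version fields of the real struct, and both helper names are made up for illustration. The return type of get_drm_version_struct() is not stated in this issue, so the helper takes an Option; if the call returns a Result, `.ok().as_ref()` on it would give the same shape.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the version fields of the drm version struct.
/// The real struct comes from the libamdgpu_top bindings and uses C ints.
pub struct DrmVersionFields {
    pub version_major: i32,
    pub version_minor: i32,
    pub version_patchlevel: i32,
}

/// Formats the three fields as "major.minor.patchlevel", e.g. "6.12.0".
pub fn format_driver_version(v: &DrmVersionFields) -> String {
    format!(
        "{}.{}.{}",
        v.version_major, v.version_minor, v.version_patchlevel
    )
}

/// Adds "Driver Version" to the detail map only when the lookup succeeded,
/// so a failed DRM query leaves the other entries untouched.
pub fn add_driver_version(
    detail: &mut HashMap<String, String>,
    drm: Option<&DrmVersionFields>,
) {
    if let Some(v) = drm {
        detail.insert("Driver Version".to_string(), format_driver_version(v));
    }
}
```

Skipping the key on failure, rather than inserting a placeholder such as "unknown", keeps a bogus driver_version label out of the exported metric. That is a design choice of this sketch, not something the issue mandates.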

2. API Mode (Auto-exposed)

File: src/api/metrics/gpu.rs

  • Verify that the all_smi_gpu_info metric automatically includes the new driver_version label
  • The export_device_info() function already dynamically adds all detail HashMap fields as labels
  • Labels will be sanitized: "Driver Version" → "driver_version", "ROCm Version" → "rocm_version"

Expected output:

all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="hostname",uuid="...",index="0",type="GPU",driver_version="6.12.0",rocm_version="6.3.0"} 1
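
The label mapping above can be sketched as follows. This is not the actual sanitizer in src/api/metrics/gpu.rs; sanitize_label is a hypothetical name, and the rule here (lowercase ASCII letters and digits kept, everything else replaced by an underscore) is an assumption chosen to reproduce the two mappings listed in this section.

```rust
/// Hypothetical label sanitizer: keeps ASCII letters (lowercased) and digits,
/// and replaces every other character with an underscore.
pub fn sanitize_label(key: &str) -> String {
    key.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_lowercase()
            } else {
                '_'
            }
        })
        .collect()
}
```

With this rule, sanitize_label("Driver Version") gives "driver_version" and sanitize_label("ROCm Version") gives "rocm_version", so the new detail entry needs no special handling on the API side.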

3. Mock Server Enhancement

File: src/mock/templates/amd_gpu.rs

Currently, the AMD GPU mock template does NOT generate the all_smi_gpu_info metric at all.

  • Add add_gpu_info_metric() function similar to NVIDIA mock
  • Include driver_version and rocm_version labels
  • Use constants from src/mock/constants.rs:
    • Add DEFAULT_AMD_DRIVER_VERSION: &str = "6.12.0" (Linux kernel AMDGPU driver version)
    • Add DEFAULT_AMD_ROCM_VERSION: &str = "6.3.0"
  • Call from add_gpu_metrics() after basic GPU metrics
  • Ensure it does NOT include NVIDIA-specific labels (cuda_version)

Code location: After line 104 in build_amd_template()
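
A rough sketch of what the mock helper could look like. The two constants use the names and values proposed above. The function signature is hypothetical, since the NVIDIA mock's signature is not quoted in this issue, and the label set follows the expected output in section 2. Label values are written as-is, so this assumes they contain no quotes, backslashes, or newlines.

```rust
use std::fmt::Write;

// Proposed additions to src/mock/constants.rs (values taken from this issue).
pub const DEFAULT_AMD_DRIVER_VERSION: &str = "6.12.0";
pub const DEFAULT_AMD_ROCM_VERSION: &str = "6.3.0";

/// Appends one all_smi_gpu_info line for an AMD GPU to the template buffer.
/// Hypothetical signature; the real helper may take different arguments.
/// Deliberately emits no cuda_version label.
pub fn add_gpu_info_metric(
    out: &mut String,
    gpu_name: &str,
    instance: &str,
    uuid: &str,
    index: usize,
) {
    // Writing into a String cannot fail, so the result is ignored.
    let _ = writeln!(
        out,
        "all_smi_gpu_info{{gpu=\"{gpu}\",instance=\"{inst}\",uuid=\"{uuid}\",index=\"{idx}\",type=\"GPU\",driver_version=\"{drv}\",rocm_version=\"{rocm}\"}} 1",
        gpu = gpu_name,
        inst = instance,
        uuid = uuid,
        idx = index,
        drv = DEFAULT_AMD_DRIVER_VERSION,
        rocm = DEFAULT_AMD_ROCM_VERSION,
    );
}
```

Calling it once per mock GPU from add_gpu_metrics(), after the basic metrics, would match the call order described above.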

4. Documentation Updates

  • Update API.md AMD GPU section to list driver_version in metric labels table
  • Update README.md if necessary

Technical References

drmVersion Struct

From references/amdgpu_top/crates/libamdgpu_top/src/xdna/bindings.rs:412-417:

pub struct drm_version {
    pub version_major: ::std::os::raw::c_int,
    pub version_minor: ::std::os::raw::c_int,
    pub version_patchlevel: ::std::os::raw::c_int,
    pub name_len: __kernel_size_t,
    pub name: *mut ::std::os::raw::c_char,
    // ... other fields
}

Usage Example

From references/amdgpu_top/crates/amdgpu_top_json/src/dump.rs:

let drm = self.get_drm_version_struct().map_or(Value::Null, |drm| json!({
    "major": drm.version_major,
    "minor": drm.version_minor,
    "patchlevel": drm.version_patchlevel,
}));

Expected Benefits

  1. Consistency: Match NVIDIA GPU functionality, which already exposes driver_version and cuda_version
  2. Debugging: Easier troubleshooting with driver version information
  3. Monitoring: Better cluster management with version tracking
  4. API Completeness: Full hardware information in Prometheus metrics

Platform Requirements

  • Linux with AMD GPU and ROCm/AMDGPU drivers
  • glibc builds only (not available in musl due to library dependencies)
  • Requires sudo or membership in video and render groups

Related Files

  • src/device/readers/amd.rs - AMD GPU reader implementation
  • src/api/metrics/gpu.rs - API metrics generation
  • src/mock/templates/amd_gpu.rs - AMD GPU mock template
  • src/mock/constants.rs - Mock server constants
  • references/amdgpu_top/ - libamdgpu_top library reference

Acceptance Criteria

  • Driver version appears in detail HashMap for AMD GPUs
  • all_smi_gpu_info metric includes driver_version and rocm_version labels in API mode
  • Mock server generates all_smi_gpu_info with appropriate AMD-specific labels
  • Labels are properly sanitized (lowercased, spaces replaced with underscores)
  • Documentation is updated
  • No impact on NVIDIA or other platform metrics
