Overview
Currently, all-smi reads the ROCm version for AMD GPUs but does not extract the kernel driver version. The libamdgpu_top library provides a get_drm_version_struct() method that returns driver version information (major.minor.patchlevel format).
Current State
Already Implemented ✅
- ROCm Version: Retrieved via libamdgpu_top::get_rocm_version()
- Source: $ROCM_PATH/.info/version file
- Currently added to the detail HashMap as "ROCm Version"
- Location: src/device/readers/amd.rs:219-221
Not Implemented ❌
- Driver Version: Available but not extracted
- Method: DeviceHandle::get_drm_version_struct() returns a drmVersion struct
- Fields: version_major, version_minor, version_patchlevel
- Reference: references/amdgpu_top/crates/libamdgpu_top/src/app.rs:386-388
- Reference: references/amdgpu_top/crates/amdgpu_top_json/src/dump.rs (JSON export example)
Implementation Tasks
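1. AMD GPU Reader Enhancement
File: src/device/readers/amd.rs
- Add driver version extraction to the collect_gpu_info() method
- Call device_handle.get_drm_version_struct() to get drmVersion
- Format the version as major.minor.patchlevel
- Add it to the detail HashMap as "Driver Version"
Code location: Around lines 210-220, where the ROCm version is currently added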
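The reader change could look roughly like the sketch below. It is not the project's code: DrmVersionFields is a hypothetical stand-in for the drmVersion struct, and add_driver_version() and format_driver_version() are illustrative helper names. It assumes the lookup result has already been reduced to an Option (for example with .ok() if the real call returns a Result).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the version fields of the drmVersion struct
/// returned by get_drm_version_struct(); the real type comes from the library.
struct DrmVersionFields {
    version_major: i32,
    version_minor: i32,
    version_patchlevel: i32,
}

/// Formats the version as "major.minor.patchlevel".
fn format_driver_version(v: &DrmVersionFields) -> String {
    format!(
        "{}.{}.{}",
        v.version_major, v.version_minor, v.version_patchlevel
    )
}

/// Inserts "Driver Version" into the detail map when the lookup succeeded.
/// A failed lookup leaves the map untouched so that the rest of the GPU
/// information is still collected.
fn add_driver_version(
    detail: &mut HashMap<String, String>,
    drm: Option<DrmVersionFields>,
) {
    if let Some(v) = drm {
        detail.insert("Driver Version".to_string(), format_driver_version(&v));
    }
}
```

Skipping the entry on failure, rather than inserting a placeholder such as "unknown", keeps the label absent from the exported metric when the driver cannot be queried. Which of the two behaviours is preferable is a design choice the issue does not specify.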
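2. API Mode (Auto-exposed)
File: src/api/metrics/gpu.rs
- The all_smi_gpu_info metric automatically includes the new driver_version label
- The export_device_info() function already dynamically adds all detail HashMap fields as labels
Expected output: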
all_smi_gpu_info{gpu="AMD Instinct MI300X",instance="hostname",uuid="...",index="0",type="GPU",driver_version="6.12.0",rocm_version="6.3.0"} 1
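To illustrate why no API change should be needed, here is a minimal sketch of how detail entries can be turned into metric labels. The functions label_name() and render_gpu_info() are hypothetical, and the key normalization rule (lowercase, spaces to underscores) is an assumption rather than a description of what export_device_info() actually does. A BTreeMap is used only to make the label order deterministic in the example.

```rust
use std::collections::BTreeMap;

/// Converts a detail key such as "Driver Version" into a label name such as
/// "driver_version". This normalization rule is assumed for illustration.
fn label_name(key: &str) -> String {
    key.trim().to_lowercase().replace(' ', "_")
}

/// Renders one info-style metric line from two fixed labels plus every
/// entry of the detail map. Label values are not escaped here; real code
/// would need to escape backslashes, double quotes, and newlines.
fn render_gpu_info(gpu: &str, index: u32, detail: &BTreeMap<String, String>) -> String {
    let mut labels = vec![
        format!("gpu=\"{}\"", gpu),
        format!("index=\"{}\"", index),
    ];
    for (key, value) in detail {
        labels.push(format!("{}=\"{}\"", label_name(key), value));
    }
    format!("all_smi_gpu_info{{{}}} 1", labels.join(","))
}
```

With "Driver Version" and "ROCm Version" in the map, the rendered line carries driver_version and rocm_version labels without any AMD-specific code in the exporter.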
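3. Mock Server Enhancement
File: src/mock/templates/amd_gpu.rs
Currently, the AMD GPU mock template does NOT generate the all_smi_gpu_info metric at all.
- Add an add_gpu_info_metric() function similar to the NVIDIA mock
- Include driver_version and rocm_version labels
- Add constants to src/mock/constants.rs:
- DEFAULT_AMD_DRIVER_VERSION: &str = "6.12.0" (Linux kernel AMDGPU driver version)
- DEFAULT_AMD_ROCM_VERSION: &str = "6.3.0"
- Call it from add_gpu_metrics() after the basic GPU metrics
Code location: After line 104 in build_amd_template()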
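A possible shape for the mock helper is sketched below. The function signature, the HELP text, and the gauge type are assumptions made for the example. Only the function name, the two constants, and the label set come from this issue.

```rust
// Constants named in the task list; in the project they would live in
// src/mock/constants.rs.
const DEFAULT_AMD_DRIVER_VERSION: &str = "6.12.0";
const DEFAULT_AMD_ROCM_VERSION: &str = "6.3.0";

/// Hypothetical add_gpu_info_metric(): appends HELP and TYPE lines followed
/// by one all_smi_gpu_info sample for the given GPU to the template string.
fn add_gpu_info_metric(
    template: &mut String,
    gpu_name: &str,
    instance: &str,
    uuid: &str,
    index: usize,
) {
    template.push_str("# HELP all_smi_gpu_info GPU device information\n");
    template.push_str("# TYPE all_smi_gpu_info gauge\n");
    template.push_str(&format!(
        "all_smi_gpu_info{{gpu=\"{}\",instance=\"{}\",uuid=\"{}\",index=\"{}\",type=\"GPU\",driver_version=\"{}\",rocm_version=\"{}\"}} 1\n",
        gpu_name,
        instance,
        uuid,
        index,
        DEFAULT_AMD_DRIVER_VERSION,
        DEFAULT_AMD_ROCM_VERSION
    ));
}
```

Keeping the version strings in constants means the mock output and the expected output shown earlier can only drift apart in one place.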
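4. Documentation Updates
- Update the API.md AMD GPU section to list driver_version in the metric labels table
- Update README.md if necessary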
Technical References
drmVersion Struct
From references/amdgpu_top/crates/libamdgpu_top/src/xdna/bindings.rs:412-417:

    pub struct drm_version {
        pub version_major: ::std::os::raw::c_int,
        pub version_minor: ::std::os::raw::c_int,
        pub version_patchlevel: ::std::os::raw::c_int,
        pub name_len: __kernel_size_t,
        pub name: *mut ::std::os::raw::c_char,
        // ... other fields
    }

Usage Example
From references/amdgpu_top/crates/amdgpu_top_json/src/dump.rs:

    let drm = self.get_drm_version_struct().map_or(Value::Null, |drm| json!({
        "major": drm.version_major,
        "minor": drm.version_minor,
        "patchlevel": drm.version_patchlevel,
    }));

Expected Benefits
- Consistency: Matches NVIDIA GPU functionality, which already exposes driver_version and cuda_version
- Debugging: Easier troubleshooting with driver version information
- Monitoring: Better cluster management with version tracking
- API Completeness: Full hardware information in Prometheus metrics
Platform Requirements
- Linux with AMD GPU and ROCm/AMDGPU drivers
- glibc builds only (not available in musl due to library dependencies)
- Requires sudo or membership in the video and render groups
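One way to express the Linux and glibc constraints in code is a compile-time check, sketched below. The function name amd_driver_version_supported() is hypothetical, and whether the project gates its AMD code this way is an assumption. The cfg predicates themselves (target_os and target_env) are standard Rust.

```rust
/// Returns true when the build targets Linux with a non-musl libc, which is
/// the configuration the platform requirements above describe. The check is
/// resolved at compile time, so it adds no runtime cost.
fn amd_driver_version_supported() -> bool {
    cfg!(all(target_os = "linux", not(target_env = "musl")))
}
```

The same predicate can be placed in a #[cfg(...)] attribute on the extraction code so that musl builds compile without the library dependency. The group or sudo requirement is a runtime condition and is not covered by this check.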
Related Files
src/device/readers/amd.rs - AMD GPU reader implementation
src/api/metrics/gpu.rs - API metrics generation
src/mock/templates/amd_gpu.rs - AMD GPU mock template
src/mock/constants.rs - Mock server constants
references/amdgpu_top/ - libamdgpu_top library reference
Acceptance Criteria
- Driver version is added to the detail HashMap for AMD GPUs
- The all_smi_gpu_info metric includes driver_version and rocm_version labels in API mode
- The mock server generates all_smi_gpu_info with appropriate AMD-specific labels