feat(intel-gpu): real Linux utilization via perf engine-busy counters #246

Description

@inureyes

Problem / Background

PR #245 (closes #244) shipped a sysfs-based Intel client GPU reader for Linux. As a documented v1 limitation, IntelGpuLinuxReader::get_gpu_info() always reports utilization = 0.0 and tags the detail map with "Utilization": "Requires intel_gpu_top (perf engine counters)". The follow-up to compute a real value was explicitly deferred.

src/device/readers/intel_gpu_linux.rs does not currently read any engine-busy counters. As a result, downstream consumers that select an inference backend based on observed accelerator activity (e.g. the SYCL/oneAPI accelerator-auto-selection layer referenced by issue #244) see Intel hosts as "GPU present, but always idle" and cannot distinguish a free Intel GPU from a fully loaded one.

The AMD reader (src/device/readers/amd.rs) computes real utilization via libamdgpu_top; the NVIDIA reader uses NVML's nvmlDeviceGetUtilizationRates. The Intel reader is the only GPU reader currently reporting a fabricated-zero utilization.

Proposed Solution

Add real engine-busy% computation to the Intel Linux reader by reading kernel perf-style engine counters and tracking deltas across polling intervals:

  • For i915: read /sys/class/drm/card<N>/engine/<class>/<instance>/busy_ns (or the equivalent perf event)
  • For xe: read the equivalent under /sys/class/drm/card<N>/device/tile0/gt0/engines/...
  • Track the previous (busy_ns, wall_ns) pair per device in reader-owned state
  • Compute (delta_busy / delta_wall) * 100.0, clamped to [0, 100]
  • Aggregate across engine classes (render, compute, video, copy): surface render+compute as the primary utilization and expose the per-class breakdown via the detail map
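
The delta computation in the list above can be sketched as a pure function, which also makes it easy to unit-test with synthetic timestamps. This is a minimal sketch, not the reader's actual code: `Sample` and `utilization_percent` are hypothetical names, and returning `None` for a zero-length interval or a counter that moved backwards is an assumption about how the caller would want those cases reported.

```rust
/// One reading of a cumulative engine-busy counter plus the wall-clock time
/// at which it was taken. Hypothetical type, not part of the existing reader.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Sample {
    busy_ns: u64,
    wall_ns: u64,
}

/// Busy percentage between two samples, clamped to [0, 100].
///
/// Returns `None` when no wall time elapsed or when either counter moved
/// backwards (clock skew, counter reset), so the caller can decide whether
/// to report the last-known value or 0.0.
fn utilization_percent(prev: Sample, curr: Sample) -> Option<f64> {
    // checked_sub yields None on underflow instead of panicking or wrapping.
    let delta_wall = curr.wall_ns.checked_sub(prev.wall_ns)?;
    let delta_busy = curr.busy_ns.checked_sub(prev.busy_ns)?;
    if delta_wall == 0 {
        return None;
    }
    let pct = delta_busy as f64 / delta_wall as f64 * 100.0;
    // Summing several engine instances can exceed 100% of wall time,
    // so the clamp is part of the contract rather than a safety net.
    Some(pct.clamp(0.0, 100.0))
}
```

For example, 2.5 ms of busy time accumulated over a 10 ms polling interval gives `Some(25.0)`, while swapping the two samples gives `None`.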

Mirror the existing per-call vs. cached-static-info split used by the AMD reader (AmdGpuDevice.static_info: OnceLock<DeviceStaticInfo>).

Acceptance Criteria

  • IntelGpuLinuxReader::get_gpu_info() returns a non-zero utilization value when the GPU is actively executing work, verified on at least one Arc (discrete) or Iris Xe / Xe-LPG (integrated) host. (awaits hardware verification by maintainer)
  • The "Utilization": "Requires intel_gpu_top (perf engine counters)" placeholder in the detail map is removed (or replaced with an engine-class breakdown).
  • Per-engine-class utilization (render / compute / video / copy where available) is surfaced via the detail map.
  • State tracking handles the device-removed and clock-skew cases gracefully (return last-known or 0.0, never panic).
  • Unit tests cover the delta computation logic with synthetic timestamps.
  • The implementation is fully integrated into the codebase (registered in the existing reader, no orphaned modules).
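
One possible shape for the per-class breakdown and the primary value is sketched below. Everything here is an assumption made for illustration: the `EngineClass` enum, the `"Utilization (<class>)"` key format, a `HashMap<String, String>` detail map, and the choice to combine render and compute by summing and clamping (taking the maximum would be an equally valid reading of "render+compute").

```rust
use std::collections::HashMap;

/// Engine classes named in this issue. Hypothetical enum; the real reader
/// would derive these from the i915 or xe driver's own class names.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum EngineClass {
    Render,
    Compute,
    Video,
    CopyEngine,
}

impl EngineClass {
    fn label(self) -> &'static str {
        match self {
            EngineClass::Render => "render",
            EngineClass::Compute => "compute",
            EngineClass::Video => "video",
            EngineClass::CopyEngine => "copy",
        }
    }
}

/// Primary utilization: render + compute, clamped to [0, 100].
/// Classes missing from the map count as 0.0.
fn primary_utilization(per_class: &HashMap<EngineClass, f64>) -> f64 {
    let render = per_class.get(&EngineClass::Render).copied().unwrap_or(0.0);
    let compute = per_class.get(&EngineClass::Compute).copied().unwrap_or(0.0);
    (render + compute).clamp(0.0, 100.0)
}

/// Detail-map entries for whichever classes were actually observed.
fn detail_entries(per_class: &HashMap<EngineClass, f64>) -> HashMap<String, String> {
    per_class
        .iter()
        .map(|(class, pct)| {
            (
                format!("Utilization ({})", class.label()),
                format!("{:.1}%", pct),
            )
        })
        .collect()
}
```

Only classes present on the device produce an entry, which matches the "where available" wording in the criteria.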

Technical Considerations

  • The kernel exposes engine-busy via PMU events when CONFIG_PERF_EVENTS is enabled; these are the same counters intel_gpu_top reads. The sysfs path (engine/.../busy_ns) is the simpler entry point but is not universally available across kernel versions. PMU is more portable but requires perf_event_open(2) syscalls.
  • Engine class names and instance counts differ between i915 and xe. The reader already handles driver detection in discover_cards — extend that to drive the engine enumeration.
  • Holding mutable per-device state in the reader changes the existing stateless shape. Use the same OnceLock + interior-mutability pattern (Mutex<EngineCounters>) the AMD reader uses for vram_usage.
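
The per-device state could look roughly like the sketch below. It assumes `EngineCounters` holds only the previous `(busy_ns, wall_ns)` pair and the last computed value; `IntelGpuState` and `update` are hypothetical names, and the `OnceLock` for static info is left out. A failed read (for example, device removed) is modelled as `None`, and the lock is recovered if poisoned, so this path never panics.

```rust
use std::sync::Mutex;

/// Mutable polling state for one device. The field layout is an assumption.
#[derive(Clone, Copy, Debug, Default)]
struct EngineCounters {
    /// Previous (busy_ns, wall_ns) pair, if any sample has been seen.
    prev: Option<(u64, u64)>,
    /// Last successfully computed utilization, in percent.
    last_utilization: f64,
}

/// Hypothetical wrapper that owns the counters behind a Mutex so that a
/// `&self` reader method can update them.
struct IntelGpuState {
    counters: Mutex<EngineCounters>,
}

impl IntelGpuState {
    fn new() -> Self {
        Self { counters: Mutex::new(EngineCounters::default()) }
    }

    /// Feed a new sample and return the utilization to report.
    /// `None` means the counters could not be read.
    fn update(&self, sample: Option<(u64, u64)>) -> f64 {
        // Recover the inner value from a poisoned lock instead of panicking.
        let mut c = self.counters.lock().unwrap_or_else(|e| e.into_inner());
        let Some((busy, wall)) = sample else {
            // Read failed: report the last-known value (0.0 if none yet).
            return c.last_utilization;
        };
        let prev = c.prev;
        if let Some((prev_busy, prev_wall)) = prev {
            // A counter that moved backwards yields None here; the stale
            // value is kept and the baseline is reset below.
            if let (Some(db), Some(dw)) =
                (busy.checked_sub(prev_busy), wall.checked_sub(prev_wall))
            {
                if dw > 0 {
                    c.last_utilization =
                        (db as f64 / dw as f64 * 100.0).clamp(0.0, 100.0);
                }
            }
        }
        c.prev = Some((busy, wall));
        c.last_utilization
    }
}
```

The first call only establishes a baseline and reports 0.0, so a consumer needs two polls before it sees a real value.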
