
feat(intel-gpu): compute Linux utilization from engine-busy counters #249

Merged: inureyes merged 3 commits into main from feat/issue-246-intel-utilization-engine-counters on May 27, 2026

@inureyes

Member

Summary

The Intel client GPU reader merged in #245 reported utilization = 0.0 for every Arc/Iris/Xe card, with a placeholder detail["Utilization"] = "Requires intel_gpu_top..." line. This PR replaces that with a real engine-busy percentage computed from the kernel's per-engine monotonic busy counters in sysfs, mirroring how intel_gpu_top itself derives the same number.

What changed

  • New module src/device/readers/intel_gpu_engine.rs — owns the per-card delta tracker (EngineState), the refresh/refresh_with_lock core, the apply_engine_readout detail-map fold, and the EngineCounter/EngineReadout types. Mutex-poisoning recovery mirrors amd.rs's VramUsage pattern (warn, replace state, continue serving).
  • New helper module src/device/readers/intel_gpu_engine/discovery.rs — sysfs walking for both i915 (flat engine/rcs0/busy and nested engine/rcs/0/busy) and xe (flat tile*/gt*/engines/rcs0/busy_ns, nested tile*/gt*/engines/RENDER/0/busy_ns, multi-GT). Engine class names normalised to short tokens (render, compute, copy, video, video-enhance, other).
  • Reader integration in src/device/readers/intel_gpu_linux.rs: IntelGpuCard gains an engine_state: Mutex<EngineState> field initialised at discovery. get_gpu_info drives one refresh per card, clamps the primary utilization to [0, 100], and folds the per-class result into the detail map under Engine: <class> keys. The old Utilization: Requires intel_gpu_top... placeholder is removed; when the kernel does not expose counters the reader surfaces Engine counters unavailable (kernel does not expose engine busy) instead.
  • Tests — 19 engine-module tests (src/device/readers/intel_gpu_engine/tests.rs) cover class normalisation, every discovery layout, seeding semantics, delta computation, [0, 100] clamp, counter-reset safety, multi-engine aggregation, missing-counter graceful handling, and zero-wall-delta short-circuit. Synthetic clock via EngineState::with_clock(now_fn). Two reader-level tests (src/device/readers/intel_gpu_linux/tests.rs) confirm the seeding-call detail entry and the post-seeding Engine: render key + cleared Utilization note. The pre-existing reader test is tightened to assert the new no-counter message.
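The delta computation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Sample`, `BusyTracker`, and `update` are hypothetical names, and the sketch assumes the counter is a cumulative busy time in nanoseconds read alongside a monotonic timestamp.

```rust
use std::time::Instant;

/// Hypothetical sample: cumulative busy time and the instant it was read.
#[derive(Clone, Copy)]
struct Sample {
    busy_ns: u64,
    at: Instant,
}

/// Hypothetical tracker for a single engine counter.
#[derive(Default)]
struct BusyTracker {
    last: Option<Sample>,
}

impl BusyTracker {
    /// Returns the busy percentage since the previous call, in [0, 100].
    fn update(&mut self, busy_ns: u64, at: Instant) -> f64 {
        // Always store the new baseline, then work from the previous one.
        let prev = self.last.replace(Sample { busy_ns, at });

        // Seeding call: there is no earlier sample to diff against.
        let Some(prev) = prev else { return 0.0 };

        // Zero (or backwards) wall delta: avoid dividing by zero.
        let wall_ns = at.saturating_duration_since(prev.at).as_nanos();
        if wall_ns == 0 {
            return 0.0;
        }

        // Counter reset: the cumulative value went backwards.
        let Some(busy_delta) = busy_ns.checked_sub(prev.busy_ns) else {
            return 0.0;
        };

        let pct = busy_delta as f64 / wall_ns as f64 * 100.0;
        pct.clamp(0.0, 100.0)
    }
}
```

Injecting the timestamp as a parameter plays the same role as the synthetic clock mentioned above: tests can advance time without sleeping.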

v1 scope limitations (deferred follow-ups)

  • Sysfs only. The PMU perf_event_open(2) fallback used by intel_gpu_top on locked-down kernels is not shipped — when sysfs returns nothing, the reader continues to report 0.0 with the explanatory detail["Utilization"] note. Adding PMU is tracked as future work.
  • Seeding call returns 0.0. The first refresh per card stamps the counter baselines; real engine-busy percentages appear from the second refresh onward. This is the standard delta-tracker shape — the collector polls at refresh interval anyway.
  • Primary utilization = max(render, compute), not an aggregate. A copy-engine-heavy workload therefore does not inflate the compute-relevance number; the full per-class breakdown is available via detail.
  • Per-process engine-time deltas (referenced from feat(intel-gpu): per-process GPU memory accounting via fdinfo on Linux #247) remain deferred. get_process_info still returns an empty Vec for Intel — that requires /proc/<pid>/fdinfo/* parsing and is a separate piece of work.
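As a rough illustration of the primary-utilization rule in the list above, the sketch below picks max(render, compute) and folds every class into a detail map under `Engine: <class>` keys. The function names and the `"{:.1}%"` value format are assumptions made for the example; the PR does not state how detail values are rendered.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: headline utilization is max(render, compute),
/// with missing classes treated as 0.0 and the result clamped to [0, 100].
fn primary_utilization(per_class: &BTreeMap<String, f64>) -> f64 {
    let get = |k: &str| per_class.get(k).copied().unwrap_or(0.0);
    get("render").max(get("compute")).clamp(0.0, 100.0)
}

/// Hypothetical helper: expose every class under an "Engine: <class>" key.
/// The one-decimal percent format is an assumption, not the PR's format.
fn fold_into_detail(
    per_class: &BTreeMap<String, f64>,
    detail: &mut BTreeMap<String, String>,
) {
    for (class, pct) in per_class {
        detail.insert(format!("Engine: {class}"), format!("{pct:.1}%"));
    }
}
```

With render at 12.5, compute at 40.0, and copy at 97.0, the headline value is 40.0 while the copy engine still appears in the detail map.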

Hardware verification

The maintainer's first AC ("non-zero utilization on an Arc / Iris Xe / Xe-LPG host") is unchecked because this PR was authored on a machine without an Intel client GPU. The 60 unit tests in cargo test --lib device::readers::intel_gpu exercise the synthetic-sysfs / synthetic-clock paths comprehensively, but the real-hardware path (an i915 or xe kernel emitting actual busy counters) needs a maintainer with hardware access to confirm. Annotated on the issue.

Files touched

  • src/device/readers/mod.rs — declare the new intel_gpu_engine module
  • src/device/readers/intel_gpu_engine.rs — new
  • src/device/readers/intel_gpu_engine/discovery.rs — new
  • src/device/readers/intel_gpu_engine/tests.rs — new
  • src/device/readers/intel_gpu_linux.rs — engine integration in IntelGpuCard and get_gpu_info
  • src/device/readers/intel_gpu_linux/tests.rs — assertions for the new detail-map shape

No public API or Cargo.toml changes. All touched files stay under the 500-line cap.

Test plan

  • cargo check --lib --tests clean
  • cargo clippy --lib --tests -- -D warnings clean
  • cargo test --lib device::readers::intel_gpu_linux -> 16/16 pass
  • cargo test --lib device::readers::intel_gpu_engine -> 19/19 pass
  • cargo test --lib device::readers::intel_gpu_sysfs -> 10/10 pass
  • Real-hardware run on an Arc / Iris Xe / Xe-LPG host (awaits maintainer)

Closes #246

The Intel client GPU reader merged in #245 reported `utilization = 0.0` for every Arc/Iris/Xe card with a placeholder `detail["Utilization"] = "Requires intel_gpu_top..."`. Replace that with a real engine-busy percentage computed from the kernel's per-engine monotonic busy counters in sysfs.

What changed:

- New `device::readers::intel_gpu_engine` module owning the delta tracker, lock handling, and discovery walks. Splits cleanly into `intel_gpu_engine.rs` (state + refresh + reader-facing helpers) and `intel_gpu_engine/discovery.rs` (i915 flat/nested and xe flat/nested/multi-GT sysfs probing + class-name normalisation).
- `IntelGpuCard` gains an `engine_state: Mutex<EngineState>` field initialised at discovery time. Per-card mutex shape mirrors `AmdGpuDevice.vram_usage` including the poisoning-recovery flow (warn, replace with fresh `EngineState`, continue serving).
- `IntelGpuReader::get_gpu_info` now drives one engine refresh per card, clamps to `[0, 100]`, and folds the result into the produced `GpuInfo`. The primary utilization is `max(render, compute)`; per-class breakdown lands in `detail["Engine: <class>"]`. The first call per card is a seeding refresh that returns 0.0 and stamps the baseline; subsequent calls compute deltas.
- Kernels without engine counters surface a new explanatory `detail["Utilization"]` string (`"Engine counters unavailable (kernel does not expose engine busy)"`) instead of the old `intel_gpu_top` placeholder.

Tests added:

- 19 engine-module tests covering class-name normalisation, every discovery layout (i915 flat/nested, xe flat/nested/multi-GT), seeding semantics, delta computation, the `[0, 100]` clamp, counter-reset safety, multi-engine aggregation (render vs copy vs compute, multiple instances of the same class), graceful read-failure handling, and zero-wall-delta short-circuit. Wall clock injected via `EngineState::with_clock(now_fn)` so no real sleeps are involved.
- 2 reader-level tests confirming the seeding-call detail entry and the post-seeding `Engine: render` detail key + cleared `Utilization` note.
- Pre-existing reader test tightened to assert the new no-counter message rather than just `contains_key("Utilization")`.

v1 scope limitations (documented in module header and intentionally deferred):

- Sysfs only; the PMU `perf_event_open(2)` fallback used by `intel_gpu_top` on locked-down kernels is not shipped.
- First refresh per card returns `0.0` (seeding); real values appear from the second refresh onward.
- Primary `utilization` is `max(render, compute)`, not an aggregate across all engines.
- Per-process engine-time deltas (`/proc/<pid>/fdinfo`-driven) remain deferred — `get_process_info` still returns an empty Vec.

Verification (narrow scopes per the watchdog guard):

- `cargo check --lib --tests` clean
- `cargo clippy --lib --tests -- -D warnings` clean
- `cargo test --lib device::readers::intel_gpu_linux` -> 16/16 pass
- `cargo test --lib device::readers::intel_gpu_engine` -> 19/19 pass
- `cargo test --lib device::readers::intel_gpu_sysfs` -> 10/10 pass

All touched files stay under the 500-line cap (engine discovery split into `intel_gpu_engine/discovery.rs` and tests into `intel_gpu_engine/tests.rs`).

No public API or `Cargo.toml` changes.

Closes #246
@inureyes added the status:review (Under review), type:enhancement (New feature or request), and priority:medium (Medium priority issue) labels on May 27, 2026
…and doc updates

Add refresh_with_lock_recovers_from_poisoned_mutex test to cover the recovery path in refresh_with_lock: spawn a thread that panics while holding the lock, confirm the mutex is poisoned, then verify refresh_with_lock returns a valid readout without panicking and that the recovered EngineState is correctly reset. Note that std::sync::Mutex does not clear the poison flag via into_inner(), so the post-recovery lock is still technically poisoned and is acquired with unwrap_or_else.
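The recovery pattern described above might look roughly like this. `State` and `lock_or_reset` are hypothetical stand-ins for `EngineState` and the recovery branch of `refresh_with_lock`, and the warning log is omitted.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical stand-in for the per-card engine state.
#[derive(Default, Debug, PartialEq)]
struct State {
    samples: Vec<u64>,
}

/// Acquire the lock; if a previous holder panicked, take the guard out of
/// the PoisonError, replace the contents with a fresh state, and continue.
fn lock_or_reset(m: &Mutex<State>) -> MutexGuard<'_, State> {
    m.lock().unwrap_or_else(|poisoned| {
        let mut guard = poisoned.into_inner();
        *guard = State::default();
        guard
    })
}
```

Because `PoisonError::into_inner` only hands back the guard, `Mutex::is_poisoned` keeps returning `true` afterwards, which is why later inspections also need `unwrap_or_else` (or an explicit `Mutex::clear_poison` on toolchains that provide it).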

Add COMPUTE and VIDEO_DECODE upper-case assertions to normalize_engine_class_handles_known_tokens, covering the all-caps xe path variants that the normalize_engine_class function handles via to_ascii_lowercase.
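A minimal sketch of case-insensitive class normalisation follows. The function name is hypothetical, and the token table is an assumption: it combines the i915 short names (`rcs`, `ccs`, `bcs`, `vcs`, `vecs`) with the upper-case xe names mentioned in this PR, and the real function may accept more spellings.

```rust
/// Hypothetical sketch: map a raw sysfs engine class name to a short token.
/// Lower-casing first lets "COMPUTE" and "compute" share one match arm.
fn normalize_class_sketch(raw: &str) -> &'static str {
    match raw.to_ascii_lowercase().as_str() {
        "rcs" | "render" => "render",
        "ccs" | "compute" => "compute",
        "bcs" | "copy" => "copy",
        "vcs" | "video" | "video_decode" => "video",
        "vecs" | "video_enhance" => "video-enhance",
        // Anything unrecognised falls into a catch-all bucket.
        _ => "other",
    }
}
```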

Update ARCHITECTURE.md Intel GPU section to describe the engine-counter module and its v1 constraints (sysfs-only, seeding call, max(render,compute) primary, PMU deferred). Update README.md Linux feature list to mention engine-busy utilization and the seeding semantics. Update manpage Intel GPU entry to include engine-busy utilization.
@inureyes

Member Author

PR Finalization Complete

Tests

Added 3 tests to src/device/readers/intel_gpu_engine/tests.rs:

refresh_with_lock_recovers_from_poisoned_mutex — covers the mutex poisoning recovery path in refresh_with_lock. A spawned thread acquires the lock and panics (confirmed via state.lock().is_err()), then refresh_with_lock is called and must not panic. The recovered EngineState has samples.is_empty() and discovery_attempted == true. One implementation note documented in the test: std::sync::Mutex does not clear the poison flag via into_inner(), so the mutex remains poisoned after recovery; post-recovery inspection uses unwrap_or_else.

normalize_engine_class_handles_known_tokens — two new assertions appended: "COMPUTE" -> "compute" and "VIDEO_DECODE" -> "video", covering the all-caps xe directory-name variants.

Total: 61 intel_gpu tests pass (was 58 before this commit).

Documentation

  • docs/ARCHITECTURE.md: Added engine-counter bullet to the Intel Arc/Iris Xe section describing the sysfs module, both driver layouts (i915 flat/nested, xe single-GT/multi-GT), the max(render, compute) primary selection, seeding semantics, and PMU deferral.
  • README.md: Added engine-busy utilization line to the Linux Intel GPU feature list with seeding and PMU notes.
  • docs/man/all-smi.1: Extended the Intel GPU entry to mention engine-busy utilization computed from sysfs per-engine monotonic counters.

Lint/Format

cargo fmt --all applied (import ordering in tests.rs). cargo clippy --lib --tests -- -D warnings clean.

Status

Labels unchanged (status:review). PR is not merged. Branch: feat/issue-246-intel-utilization-engine-counters at commit 0cd6339.

@inureyes added the status:done (Completed) label and removed the status:review (Under review) label on May 27, 2026
The pre-existing 'pub use discovery::normalize_engine_class' is only consumed by the unit-test module via the 'use super::*' glob. When clippy runs in CI without --lib filtering (the default 'cargo clippy -- -D warnings'), the binary build sees the import as unused. Narrow the re-export to discover_engine_counters only and have tests import normalize_engine_class directly from the discovery submodule, matching the existing pattern for split_class_instance.
@inureyes inureyes merged commit 42a8dd9 into main May 27, 2026
6 of 7 checks passed
@inureyes inureyes deleted the feat/issue-246-intel-utilization-engine-counters branch May 27, 2026 01:30
inureyes added a commit that referenced this pull request May 27, 2026
…dinfo in README and manpage

new_with_roots is called by production IntelGpuReader::new(), not test-only. Update the doc comment to reflect that it is an internal constructor accepting arbitrary roots, with production code routing through IntelGpuReader::new.

README and manpage were updated in the engine-utilization PR (#249) but neither mentioned the per-process GPU memory tracking added by this PR. Add a bullet under the Intel Arc section in README and extend the manpage Intel entry to include the fdinfo-based process accounting, mirroring the detail already present in ARCHITECTURE.md line 217.
inureyes added a commit that referenced this pull request May 27, 2026
* feat(intel-gpu): add fdinfo parser and DRM-fd-to-card mapper

Introduces a new stateless module `intel_gpu_fdinfo` that parses Intel DRM client `/proc/<pid>/fdinfo/<fd>` blocks and correlates fds back to a reader-known card index. Provides:

- `parse_fdinfo` — pure-string parser for the i915 and xe schemas. Handles truncated / malformed input without panicking, rejects foreign drivers (amdgpu / nvidia), and normalises memory values from kB to bytes.
- `build_intel_drm_basenames` — walks `/sys/class/drm` once to find every `cardN` and `renderD<M>` minor that maps to one of the reader-enumerated Intel PCI devices. Both nodes share a card index so modern Vulkan / oneAPI / ffmpeg workloads opening the render node are captured.
- `intel_drm_fds_for_pid` — reads `/proc/<pid>/fd/` and returns the fds pointing at known Intel DRM nodes. Permission errors degrade silently.
- `collect_intel_gpu_processes` — top-level aggregator. Walks `/proc`, dedupes fds by `drm-client-id` per process+card (avoiding N× over-counting from `dup(2)`d fds) but sums across distinct clients (multi-context workloads). Returns deterministic, PID-sorted output.
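To make the parsing step above concrete, here is a hedged sketch of a pure-string fdinfo parser. The type and function names are hypothetical, and the `drm-memory-` key prefix is an assumption based on the generic DRM fdinfo format; the exact keys the PR reads for i915 and xe are not listed in this conversation.

```rust
/// Hypothetical parsed view of one fdinfo block.
#[derive(Debug, Default, PartialEq)]
struct FdinfoSketch {
    driver: String,
    client_id: u64,
    memory_bytes: u64,
}

/// Parse "<number> [kB|KiB|MiB]" into bytes; None on malformed input.
fn parse_size_bytes(value: &str) -> Option<u64> {
    let mut parts = value.split_whitespace();
    let n: u64 = parts.next()?.parse().ok()?;
    let scale: u64 = match parts.next() {
        None => 1,
        Some("kB") | Some("KiB") => 1024,
        Some("MiB") => 1024 * 1024,
        Some(_) => return None,
    };
    n.checked_mul(scale)
}

/// Returns None for foreign drivers or blocks without a client id.
/// Malformed lines are skipped rather than treated as fatal.
fn parse_fdinfo_sketch(text: &str) -> Option<FdinfoSketch> {
    let mut out = FdinfoSketch::default();
    let mut saw_client = false;
    for line in text.lines() {
        let Some((key, value)) = line.split_once(':') else { continue };
        let (key, value) = (key.trim(), value.trim());
        match key {
            "drm-driver" => out.driver = value.to_string(),
            "drm-client-id" => {
                if let Ok(id) = value.parse() {
                    out.client_id = id;
                    saw_client = true;
                }
            }
            // Assumed key prefix; sums every memory region it finds.
            k if k.starts_with("drm-memory-") => {
                if let Some(bytes) = parse_size_bytes(value) {
                    out.memory_bytes = out.memory_bytes.saturating_add(bytes);
                }
            }
            _ => {}
        }
    }
    let is_intel = matches!(out.driver.as_str(), "i915" | "xe");
    (is_intel && saw_client).then_some(out)
}
```

Keeping the parser free of filesystem access is what makes it easy to feed truncated or foreign-driver input in unit tests.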

The module is path-injection friendly: every public helper accepts an explicit `proc_root` / `drm_root` so the entire walker is testable under `tempfile::tempdir` fixtures without touching the real `/proc` or `/sys`.

23 unit tests cover: i915 + xe parsing, integrated vs discrete schemas, foreign-driver rejection, truncated input, kB-to-bytes conversion, the card/renderD mapping for one and two cards, AMD render-node rejection, connector-child filtering, client-id dedup, distinct-client summing, multi-card grouping, and graceful no-Intel-card short-circuit.
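The client-id dedup and distinct-client summing covered by these tests can be sketched as below. `FdRecord` and `aggregate` are hypothetical names; the sketch assumes each open fd has already been reduced to a (pid, card, client id, bytes) record.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical record for one Intel DRM fd found under /proc/<pid>/fd.
struct FdRecord {
    pid: u32,
    card: usize,
    client_id: u64,
    bytes: u64,
}

/// Count each (pid, card, client) once, so dup(2)'d fds are not
/// double-counted, then sum distinct clients per (pid, card).
fn aggregate(records: &[FdRecord]) -> Vec<(u32, usize, u64)> {
    let mut seen: HashSet<(u32, usize, u64)> = HashSet::new();
    let mut totals: BTreeMap<(u32, usize), u64> = BTreeMap::new();
    for r in records {
        if seen.insert((r.pid, r.card, r.client_id)) {
            *totals.entry((r.pid, r.card)).or_insert(0) += r.bytes;
        }
    }
    // BTreeMap iteration is ordered by key, giving PID-sorted output.
    totals
        .into_iter()
        .map(|((pid, card), bytes)| (pid, card, bytes))
        .collect()
}
```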

Refs #247

* feat(intel-gpu): wire fdinfo per-process accounting into IntelGpuReader

Replaces the `Vec::new()` stub in `IntelGpuReader::get_process_info()` with a full implementation built on top of the new `intel_gpu_fdinfo` module. The reader now:

- Caches an `intel_drm_basenames` map at construction (one entry per `cardN` and `renderD<M>` known to belong to an Intel PCI device), so the per-process refresh is a flat `/proc` walk with no extra sysfs probing.
- Threads a `proc_root` field through the constructor for test injection — production stays `/proc`.
- Builds a `card_index -> uuid` table on each call and delegates to `build_intel_process_infos`, which collects `(pid, card_index, used_memory_bytes)` aggregates, performs one minimal sysinfo refresh, and merges sysinfo metadata (cpu_percent, user, state, rss, vms, etc.) into the final `ProcessInfo` rows. Pattern matches AMD's reader exactly so cross-vendor consumers see a consistent shape.

To stay under the 500-line per-file budget after the integration, two cohesive subsections moved into siblings:

- `intel_gpu_linux/detection.rs` — `has_intel_client_gpu_from_root` and `line_matches_intel_gpu`. The public `has_intel_client_gpu` is now a 4-line wrapper.
- `intel_gpu_fdinfo/enrichment.rs` — the sysinfo merge helper. The parent module re-exports `build_intel_process_infos`, so the public API is unchanged.

File sizes after refactor (all <500):
- `intel_gpu_linux.rs`        474
- `intel_gpu_fdinfo.rs`       483
- `intel_gpu_fdinfo/enrichment.rs` 107
- `intel_gpu_fdinfo/tests.rs`     494
- `intel_gpu_linux/detection.rs`  86

Behaviour on hosts without Intel-GPU-using processes is unchanged: an empty basename map (no Intel GPUs detected) short-circuits to `Vec::new()`, and the fdinfo walker returns empty when no process holds an Intel DRM fd. The stretch goal (per-process engine-time deltas) is intentionally deferred to a follow-up; v1 reports `gpu_utilization = 0.0` per process.

Refs #247

* test(intel-gpu): add fdinfo integration tests and update ARCHITECTURE

Adds three end-to-end tests against `IntelGpuReader::get_process_info()` driven by a synthetic procfs and DRM sysfs tree:

- `get_process_info_returns_empty_when_no_intel_cards` — guarantees no regression on AMD-only or NVIDIA-only hosts: the empty basename map short-circuits to `Vec::new()` without touching `/proc`.
- `get_process_info_collects_fdinfo_from_render_node` — full pipeline: a synthetic Intel card with a matching `renderD<M>` render node, plus a synthetic `/proc/<pid>/fdinfo/<fd>` containing the i915 schema, must yield exactly one populated `ProcessInfo` with the correct PID, `device_id`, `device_uuid` (`Intel-GPU-<bus>` format matching `get_gpu_info`), and `used_memory` (16384 kB -> 16777216 bytes).
- `get_process_info_default_filter_keeps_uses_gpu_processes` — verifies the Intel reader is compatible with the trait's default `get_gpu_processes` filter (every emitted row has `uses_gpu = true`).

Also updates `docs/ARCHITECTURE.md`: the Intel client GPU section now lists per-process GPU memory accounting alongside engine-busy utilization, including the i915 / xe key sets parsed and the `drm-client-id` dedup behaviour.

Refs #247

* style(intel-gpu): apply rustfmt to fdinfo module

`cargo fmt --check` (which CI enforces) flagged three formatting nits in the new module: a path-join chain that fits on one line, a `let` binding with an unnecessary multi-line break, and a function signature that fits on one line. Run rustfmt to bring the new code in line with the rest of the workspace.

No behaviour change. All 42 fdinfo / intel_gpu_linux tests still pass; clippy clean on both `--lib --tests` and the default bin target.

* docs(intel-gpu): clarify new_with_roots doc and surface per-process fdinfo in README and manpage

new_with_roots is called by production IntelGpuReader::new(), not test-only. Update the doc comment to reflect that it is an internal constructor accepting arbitrary roots, with production code routing through IntelGpuReader::new.

README and manpage were updated in the engine-utilization PR (#249) but neither mentioned the per-process GPU memory tracking added by this PR. Add a bullet under the Intel Arc section in README and extend the manpage Intel entry to include the fdinfo-based process accounting, mirroring the detail already present in ARCHITECTURE.md line 217.
inureyes added a commit that referenced this pull request May 27, 2026
…251)

* feat(intel-gpu): add Level Zero backend skeleton behind a feature flag

Lay the opt-in `level_zero` Cargo feature, the cross-platform Level Zero (oneAPI) FFI shim, and the per-card state / readout types that subsequent commits wire into the Linux sysfs reader and the Windows WMI reader. Default build is unchanged — `cargo build` produces a binary with zero Level Zero references.

The module is split across four files to stay well under the 500-line per-file budget: `intel_gpu_level_zero.rs` (public API: `LevelZeroState`, `LevelZeroReadout`, `refresh`, `apply_to_gpu_info`), `intel_gpu_level_zero/ffi.rs` (hand-written `#[repr(C)]` typedefs and enum constants — no vendored headers, no bindgen), `intel_gpu_level_zero/loader.rs` (`libloading`-based dynamic load of `libze_loader.so.1` / `ze_loader.dll`, one-shot `ZES_ENABLE_SYSMAN=1` injection, driver / device enumeration keyed by PCI BDF), and `intel_gpu_level_zero/refresh.rs` (per-engine and per-power-domain delta tracking).

The v1 surface is intentionally narrow: per-engine activity (RENDER_SINGLE, COMPUTE_SINGLE — the XMX class — COPY_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE) plus power derived from `zesPowerGetEnergyCounter` deltas. Temperature, frequency, memory state, RAS, per-process L0 stats, and fine-grained power-limit control are explicitly deferred to follow-up issues so the PR stays reviewable.

The 28-test unit suite covers the spec-locked enum values, BDF formatting and round-tripping, engine-busy delta math (seeding, percentage, overrun clamp, backwards-clock guard, zero-delta guard), energy-counter delta math (with the correct (µJ/µs) = W conversion — no spurious 1e6 scaling), and the `GpuInfo` integration semantics on both platforms (Linux must NOT overwrite `utilization`; Windows MUST overwrite the WMI zeros). One test deliberately calls `try_load_library` against a bogus path to verify the graceful-degrade path the runtime relies on for hosts without the L0 loader.
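The unit conversion called out above works because microjoules divided by microseconds is already joules per second, that is, watts. A minimal sketch, with hypothetical names and assuming each sample carries a cumulative energy in µJ and a timestamp in µs:

```rust
/// Hypothetical snapshot of an energy counter.
#[derive(Clone, Copy)]
struct EnergySample {
    energy_uj: u64,    // cumulative energy, microjoules
    timestamp_us: u64, // time of the reading, microseconds
}

/// Average power in watts between two samples.
/// µJ / µs = J / s = W, so no extra 1e6 factor is needed.
/// Returns None if either counter went backwards or no time elapsed.
fn watts_between(prev: EnergySample, cur: EnergySample) -> Option<f64> {
    let de = cur.energy_uj.checked_sub(prev.energy_uj)?;
    let dt = cur.timestamp_us.checked_sub(prev.timestamp_us)?;
    if dt == 0 {
        return None;
    }
    Some(de as f64 / dt as f64)
}
```

For example, 15,000,000 µJ consumed over 1,000,000 µs is 15 J in 1 s, so the function yields 15 W.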

* feat(intel-gpu): wire the Level Zero backend into the Linux reader

Per-card `IntelGpuCard` now holds a `Mutex<LevelZeroState>` field, gated behind `#[cfg(feature = "level_zero")]` so the struct shape is byte-identical to today's on the default build. The field is constructed empty in `discover_cards` — the first `get_gpu_info()` call lazily binds the card to an L0 device handle via the cached `LzRuntime`, looking up by canonical-formatted PCI BDF (the same string sysfs exposes via `/sys/class/drm/cardN/device`).
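A short sketch of the canonical BDF string used as the lookup key follows. The helper names are hypothetical; the `dddd:bb:dd.f` lowercase-hex layout is the usual PCI address format seen in sysfs paths.

```rust
/// Format a PCI address as "dddd:bb:dd.f" in lowercase hex.
fn format_bdf(domain: u32, bus: u32, device: u32, function: u32) -> String {
    format!("{domain:04x}:{bus:02x}:{device:02x}.{function:x}")
}

/// Parse the same format back into its four components.
fn parse_bdf(s: &str) -> Option<(u32, u32, u32, u32)> {
    let (domain, rest) = s.split_once(':')?;
    let (bus, rest) = rest.split_once(':')?;
    let (device, function) = rest.split_once('.')?;
    Some((
        u32::from_str_radix(domain, 16).ok()?,
        u32::from_str_radix(bus, 16).ok()?,
        u32::from_str_radix(device, 16).ok()?,
        u32::from_str_radix(function, 16).ok()?,
    ))
}
```

Formatting both sides through one helper avoids mismatches such as upper-case hex or missing zero padding when the two sources are compared as strings.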

After the existing sysfs path emits the baseline `GpuInfo`, the L0 augmentation refreshes the per-card state, applies the readout in place, and — when L0 actually produced data — upgrades `detail["Metrics Source"]` from the new baseline `"sysfs (engine counters)"` to `"sysfs + Level Zero"`. The augmentation never overwrites `GpuInfo.utilization` on Linux: PR #249's sysfs engine counters remain authoritative for the headline percentage, and L0 only contributes additional `detail` entries (the XMX `COMPUTE_SINGLE` activity that sysfs cannot reach plus the energy-counter-derived `Power (L0)` reading).

The Linux test suite gains one assertion locking in the baseline `Metrics Source` so a regression that drops the marker (or that flips it without an L0 hardware verification) trips CI. The "sysfs + Level Zero" upgrade path requires a host with the Intel L0 runtime AND a supported GPU and is left for maintainer hardware verification per the issue ACs.

* refactor(intel-gpu): extract Linux L0 glue and add BDF enumeration helper

Split the Level Zero augmentation glue out of `intel_gpu_linux.rs` into a sibling `intel_gpu_linux/level_zero_glue.rs` so the per-OS reader file stays under the 500-line per-file budget once the L0 integration is wired in. The augmentation logic itself is unchanged — it still runs the L0 refresh against the just-pushed `GpuInfo` and is a noop on hosts without the runtime or for cards L0 cannot bind.

Move the baseline `Metrics Source = "sysfs (engine counters)"` insert into `ensure_static_info` (it is the same string for every call so caching it with the rest of the static identity costs nothing). Drop the explicit dynamic re-insert from `get_gpu_info`, which keeps the hot path one allocation lighter while preserving identical behaviour.

Add `enumerated_pci_bdfs()` to the Level Zero module so the Windows reader (commit follows) can pair its WMI controllers with L0 device handles by ordinal position when no shared per-card identifier is parseable from PNP IDs.

* feat(intel-gpu): wire the Level Zero backend into the Windows reader

`IntelWindowsGpuReader` gains a per-PNP-id `Mutex<HashMap<String, LevelZeroState>>` field gated behind `#[cfg(feature = "level_zero")]` so state persists across `get_gpu_info` calls (each call re-queries WMI, but the L0 energy-counter baseline must survive between calls or the delta-derived power reading is meaningless).

After the WMI baseline emits the list of GPUs, `augment_with_level_zero` walks the WMI controllers in parallel with the sorted list of L0 PCI BDFs (`enumerated_pci_bdfs`). On the typical single-Intel-GPU Windows host this is a perfect 1:1 match; for multi-GPU hosts (rare on Windows) the prefix pairs and the unpaired suffix keeps the WMI-only baseline. `Win32_VideoController.PNPDeviceID` does not expose the BDF in a stable, parseable form across driver versions, so we explicitly choose ordinal matching rather than guessing wrong on a heuristic — a follow-up issue can introduce `Win32_PnPEntity.LocationInformation` parsing if multi-GPU Windows hosts ever become common.
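The ordinal pairing described above can be illustrated with a small sketch. `pair_by_ordinal` is a hypothetical name, and plain strings stand in for WMI controller identifiers and L0 BDFs.

```rust
/// Pair controllers with sorted BDFs by position. Controllers beyond the
/// number of available BDFs get None and keep their WMI-only baseline.
fn pair_by_ordinal<'a>(
    controllers: &'a [String],
    bdfs: &[String],
) -> Vec<(&'a str, Option<String>)> {
    let mut sorted: Vec<String> = bdfs.to_vec();
    sorted.sort();
    let mut bdf_iter = sorted.into_iter();
    controllers
        .iter()
        .map(|c| (c.as_str(), bdf_iter.next()))
        .collect()
}
```

With one controller and one BDF this is the 1:1 case; with three controllers and two BDFs the first two are paired and the third is left unpaired.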

The baseline now records `detail["Metrics Source"] = "WMI"` (in addition to the legacy `Note`) so the augmentation can flip it to `"WMI + Level Zero"` consistently with the Linux path. When L0 produces a readout, it also overwrites the placeholder zeros WMI emits for `GpuInfo.utilization` (max of render / XMX compute) and `GpuInfo.power_consumption`, finally giving the Windows reader the real telemetry NVIDIA users already get via NVML.

* docs(intel-gpu): document the Level Zero augmentation and add no-runtime tests

ARCHITECTURE.md and README.md both pick up a paragraph describing the opt-in `--features level_zero` augmentation: what it covers (engine activity including XMX, energy-counter-derived power), how it interacts with the sysfs / WMI baseline (Linux augments, Windows overwrites the zeros), how `detail["Metrics Source"]` records the active backend, and the deferred surface that is explicitly out of scope (temperature, frequency, memory state, per-process L0, RAS, performance factor, power limits).

`intel_gpu_windows.rs`'s module-level doc gains a new "WMI-only baseline limitations" section that points at the augmentation rather than asserting the metrics are unreachable. The legacy `detail["Note"]` is kept for downstream-consumer compatibility.

Two graceful-degradation tests round out the suite: `enumerated_pci_bdfs_empty_when_runtime_absent` verifies the BDF helper returns a `Vec<String>` rather than panicking when no L0 loader is present (the case on every CI host), and `refresh_returns_none_without_runtime` exercises the same invariant for the per-card `refresh` path. Both tests pass on a host with the loader present too (the post-bind state simply reports `had_any_data = false` for an unknown BDF).

* style(intel-gpu): apply cargo fmt to Level Zero backend files

CI Test Suite failed on `cargo fmt --check`. Apply rustfmt to all seven files touched by the Level Zero backend so the formatting check is green. Pure mechanical formatting — no semantic changes, no test rewrites, all 30 L0 tests + 19 Linux tests still pass under both default and `--features level_zero` builds.

* fix(intel-gpu): correct ZES_STRUCTURE_TYPE constants to match Sysman spec

The hand-written FFI surface previously declared `ZES_STRUCTURE_TYPE_PCI_PROPERTIES = 0x1` and `ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES = 0xa`. Per the official `zes_api.h` (https://github.com/oneapi-src/level-zero/blob/master/include/zes_api.h, `typedef enum _zes_structure_type_t`) the correct values are `PCI_PROPERTIES = 0x2` and `ENGINE_PROPERTIES = 0x5`. The value `0x1` is actually `ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES` and `0xa` is `ZES_STRUCTURE_TYPE_LED_PROPERTIES`, so the `stype` field every refresh wrote into `zes_pci_properties_t` / `zes_engine_properties_t` was labelled as the wrong struct family.

In practice Intel's current oneAPI loader does not strictly validate `stype` on the top-level struct (it only chains `pNext` extension structs by `stype`), so the bug did not surface as a runtime failure on the developer host. But strict-validation drivers, the optional `ZE_ENABLE_VALIDATION_LAYER=1` path, and any future spec-compliant L0 implementation would reject these calls. The `structure_type_constants_match_spec` test that was supposed to lock the values to the spec instead locked them to the wrong values.

Bump both constants to the spec-correct values and update the unit test accordingly.

* fix(intel-gpu-level-zero): cap handle counts, correct ze_bool_t and engine-group constants

Three Sysman-layer fixes surfaced during PR security review.

1. DoS guard around driver-reported handle counts. Every count-then-buffer call site in the L0 loader and refresh paths previously did `vec![ptr; count as usize]` against a raw `u32` returned by the driver. A buggy or hostile driver returning `u32::MAX` would have triggered a ~32 GiB allocation. Added `MAX_L0_HANDLES = 256` (mirroring `MAX_DEVICES`) plus a shared `cap_handle_count` helper that clamps the count and emits a one-shot tracing warning on overflow. Applied at all four call sites (drivers + devices in `loader::enumerate_devices`, engine groups + power domains in `refresh::populate_*`). Each capped enumeration also truncates the Vec to the actual driver-written prefix.

2. `ze_bool_t` ABI mismatch in two FFI structs. Per the upstream header `typedef uint8_t ze_bool_t;`. The PR declared `zes_pci_properties_t::{have_bandwidth_counters,have_packet_counters,have_replay_counters}` and `zes_engine_properties_t::on_subdevice` as `u32`, inflating `zes_pci_properties_t` from the spec-correct 56 bytes to 64 bytes. Changed all four fields to `u8`. Added six `#[cfg(target_pointer_width = "64")]` size-assertion tests that lock the layouts to the C spec sizes (16/16/56/32/16/16). Verified the C struct sizes by compiling a faithful replica with the system C compiler.

3. Engine-group enum values 9..=14 corrected to match the spec. Cross-checked against the upstream `zes_api.h` `_zes_engine_group_t` definition. Fixed values: `MEDIA_ENHANCEMENT_SINGLE=9`, `3D_SINGLE=10` (was 11), `3D_RENDER_COMPUTE_ALL=11` (renamed from `RENDER_COMPUTE_ALL`, was 9), `RENDER_ALL=12` (new), `3D_ALL=13` (was 10), `MEDIA_CODEC_SINGLE=14` (new). Updated the lock-in test to assert the corrected values and added the two previously missing constants. Runtime classification logic (`is_tracked_engine` matches 4..=8) is unchanged because those values were already correct.
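The size difference described in fix 2 can be reproduced with a standalone layout sketch. The structs below are illustrative replicas written for this example, not the PR's FFI definitions; the field order follows my reading of the upstream `zes_pci_properties_t` and should be checked against `zes_api.h`. The sizes hold on 64-bit targets.

```rust
use std::ffi::c_void;
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)]
struct PciAddress {
    domain: u32,
    bus: u32,
    device: u32,
    function: u32,
}

#[allow(dead_code)]
#[repr(C)]
struct PciSpeed {
    generation: i32,
    width: i32,
    max_bandwidth: i64,
}

/// Boolean flags as one byte each (the uint8_t typedef quoted in fix 2).
#[allow(dead_code)]
#[repr(C)]
struct PciPropertiesU8 {
    stype: u32,
    p_next: *mut c_void,
    address: PciAddress,
    max_speed: PciSpeed,
    have_bandwidth_counters: u8,
    have_packet_counters: u8,
    have_replay_counters: u8,
}

/// Same struct with the flags mistakenly declared as u32.
#[allow(dead_code)]
#[repr(C)]
struct PciPropertiesU32 {
    stype: u32,
    p_next: *mut c_void,
    address: PciAddress,
    max_speed: PciSpeed,
    have_bandwidth_counters: u32,
    have_packet_counters: u32,
    have_replay_counters: u32,
}

/// Returns (u8 variant size, u32 variant size) for the current target.
fn layout_sizes() -> (usize, usize) {
    (size_of::<PciPropertiesU8>(), size_of::<PciPropertiesU32>())
}
```

On a 64-bit target the first 48 bytes are identical; three `u8` flags bring the total to 51, padded to 56, while three `u32` flags bring it to 60, padded to 64.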

Verification:
- cargo fmt --check
- cargo clippy --lib --tests --features level_zero -- -D warnings
- cargo clippy --features level_zero -- -D warnings
- cargo test --lib device::readers::intel_gpu_level_zero --features level_zero (36 passed, was 30 before the new size-assertion tests)

* chore(intel-gpu-level-zero): add MAX_L0_HANDLES cap tests and manpage feature note

Add four unit tests that exercise cap_handle_count directly: under-cap pass-through, over-cap clamp (u32::MAX), exact boundary (MAX_L0_HANDLES), and one-over boundary. These cover the DoS guard path that was only exercised indirectly through the enumerate_devices call chain, which requires a real L0 runtime.
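A minimal sketch of the clamp and the four boundary cases follows. The constant name and value come from the commit message above; the function bodies are assumptions, `read_handles` is a hypothetical helper, and the one-shot warning is omitted.

```rust
const MAX_L0_HANDLES: u32 = 256;

/// Clamp a driver-reported count before it is used as an allocation size.
fn cap_handle_count(reported: u32) -> usize {
    reported.min(MAX_L0_HANDLES) as usize
}

/// Hypothetical count-then-buffer helper: allocate at most the capped
/// count, let `fill` write handles, then keep only the written prefix.
fn read_handles(reported: u32, fill: impl FnOnce(&mut [usize]) -> usize) -> Vec<usize> {
    let mut buf = vec![0usize; cap_handle_count(reported)];
    let written = fill(&mut buf).min(buf.len());
    buf.truncate(written);
    buf
}
```

Even when the driver reports `u32::MAX`, the buffer stays at 256 entries and the result contains only what the callback actually wrote.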

Also add a brief note to the Intel Arc entry in docs/man/all-smi.1 describing the opt-in --features level_zero build flag, the runtime library names, and the graceful degradation contract. The README and ARCHITECTURE.md already carried this information; the manpage had none.

Labels

priority:medium (Medium priority issue), status:done (Completed), type:enhancement (New feature or request)

Development

Successfully merging this pull request may close these issues.

feat(intel-gpu): real Linux utilization via perf engine-busy counters

1 participant