Skip to content

feat(intel-gpu): add opt-in Level Zero backend for advanced metrics#251

Merged
inureyes merged 9 commits into
mainfrom
feat/issue-248-intel-level-zero-backend
May 27, 2026
Merged

feat(intel-gpu): add opt-in Level Zero backend for advanced metrics#251
inureyes merged 9 commits into
mainfrom
feat/issue-248-intel-level-zero-backend

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Adds an opt-in Intel Level Zero (oneAPI) backend behind a new level_zero Cargo feature (default OFF). On Linux it augments PR #249's sysfs engine counters with the XMX COMPUTE_SINGLE engine activity and energy-counter-derived power that sysfs cannot reach; on Windows it fills the WMI gap by overwriting the zeros that Win32_VideoController reports for utilization and power. Hosts without the L0 loader installed (or builds without the feature) behave identically to today — no panic, no log spam, one tracing::debug! line.

v1 scope (what this PR ships)

  • Cross-platform dynamic loading of libze_loader.so.1 (Linux) / ze_loader.dll (Windows) via libloading (no new Cargo dependency, no bindgen, no vendored ze_api.h).
  • ZES_ENABLE_SYSMAN=1 set once via Once::call_once before the first zeInit (Option A in the design notes — works on every shipping Intel runtime).
  • Hand-written #[repr(C)] FFI surface: 5 opaque handle typedefs, 4 concrete structs, 9 function-pointer typedefs, 13 enum constants. Spec-locked by unit tests.
  • Engine activity per zes_engine_group_t group: RENDER_SINGLE, COMPUTE_SINGLE (the XMX class on Arc / Battlemage), COPY_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE. Aggregated _ALL groups are skipped to avoid double-counting.
  • Power via zesPowerGetEnergyCounter delta-tracking (µJ / µs = W; the conversion was caught by a unit test that originally asserted the wrong factor).
  • Per-card state behind a Mutex<LevelZeroState> on IntelGpuCard (Linux, #[cfg(feature = "level_zero")]-gated) and a Mutex<HashMap<String, LevelZeroState>> keyed by PNP-id on IntelWindowsGpuReader.
  • PCI BDF matching: Linux looks up the L0 device by canonical lowercase DDDD:BB:DD.F extracted from the sysfs symlink; Windows pairs WMI controllers with L0 devices by ordinal position in the sorted-BDF enumeration, since Win32_VideoController.PNPDeviceID does not expose the BDF in a stable, parseable form across driver versions.
  • detail["Metrics Source"] advertises the active backend: "sysfs (engine counters)""sysfs + Level Zero" (Linux), "WMI""WMI + Level Zero" (Windows).
  • 30 unit tests: spec-locked enum values, BDF formatting, engine-busy delta math (seeding, percentage, overrun clamp, backwards-clock guard, zero-delta guard), energy-counter delta math (with the correct units), GpuInfo integration on both platforms, library-not-found graceful degradation, and runtime-absent behaviour.

Deferred (not in this PR — explicit follow-up surface)

Every item below is intentionally out of scope; each is reasonable as its own follow-up issue:

  • Temperature (zesTemperatureGet*)
  • Frequency (zesFrequencyGet*)
  • Memory state (zesMemoryGet*)
  • Performance factor / boost
  • RAS / error reporting
  • Per-process L0 stats (zesDeviceProcessesGetState)
  • Fine-grained power-limit control (zesPowerGetLimits*, zesPowerSetLimits*)
  • zesInit initialisation path (Option B) for newer Intel runtimes
  • BDF parsing from Win32_PnPEntity.LocationInformation for stronger Windows matching

Files touched

New:

  • src/device/readers/intel_gpu_level_zero.rs — public API (LevelZeroState, LevelZeroReadout, refresh, apply_to_gpu_info, ApplyPlatform, label / engine-group helpers, diagnostics).
  • src/device/readers/intel_gpu_level_zero/ffi.rs — hand-written #[repr(C)] types, enum constants, function-pointer typedefs.
  • src/device/readers/intel_gpu_level_zero/loader.rslibloading dynamic resolution, ZES_ENABLE_SYSMAN=1 injection, driver / device enumeration keyed by PCI BDF, with_runtime convenience.
  • src/device/readers/intel_gpu_level_zero/refresh.rs — per-engine and per-power-domain sample types and delta-tracking math.
  • src/device/readers/intel_gpu_level_zero/tests.rs — 30 unit tests.
  • src/device/readers/intel_gpu_linux/level_zero_glue.rs — Linux-side augmentation helper (extracted to keep intel_gpu_linux.rs under 500 lines).

Modified:

  • Cargo.toml — new opt-in level_zero feature, no new dependencies.
  • src/device/readers/mod.rs — module registration gated on cfg(any(target_os = "linux", target_os = "windows")) AND feature = "level_zero".
  • src/device/readers/intel_gpu_linux.rs — one new #[cfg(feature = "level_zero")]-gated Mutex<LevelZeroState> field on IntelGpuCard, baseline Metrics Source in ensure_static_info, L0 augmentation call after the baseline GpuInfo is pushed.
  • src/device/readers/intel_gpu_linux/tests.rs — asserts the baseline Metrics Source is "sysfs (engine counters)" on the default build.
  • src/device/readers/intel_gpu_windows.rs — per-PNP-id state map on IntelWindowsGpuReader, baseline Metrics Source = "WMI", ordinal-paired L0 augmentation after the WMI query returns.
  • README.md and docs/ARCHITECTURE.md — paragraph describing the augmentation, when it activates, and the deferred surface.

Default build is unchanged

cargo build (no feature) produces a binary with zero Level Zero references:

$ cargo build --bin all-smi
$ nm -D target/debug/all-smi | grep -i 'zes_\|ze_loader' | wc -l
0

The level_zero module is only compiled when the feature is on AND the target is Linux or Windows.

Test plan

  • cargo check --lib --tests
  • cargo check --lib --tests --features level_zero
  • cargo clippy --lib --tests -- -D warnings
  • cargo clippy --lib --tests --features level_zero -- -D warnings
  • cargo clippy -- -D warnings
  • cargo clippy --features level_zero -- -D warnings
  • cargo test --lib device::readers::intel_gpu_level_zero --features level_zero (30/30 pass)
  • cargo test --lib device::readers::intel_gpu_linux (19/19 pass on default)
  • cargo test --lib device::readers::intel_gpu_linux --features level_zero (19/19 pass)
  • cargo test --lib device::readers::intel_gpu_engine (20/20 pass, both modes)
  • cargo test --lib device::readers::intel_gpu_fdinfo (23/23 pass)
  • Binary contains zero Level Zero references on the default build (nm -D).

Hardware-dependent ACs (XMX activity visible on Arc / Battlemage / Lunar Lake / Meteor Lake; Windows utilization / power non-zero with the L0 runtime present; Linux sysfs vs. L0 agreement) are left unchecked in the issue body and await maintainer hardware verification.

Closes #248

inureyes added 5 commits May 27, 2026 12:59
Lay the opt-in `level_zero` Cargo feature, the cross-platform Level Zero (oneAPI) FFI shim, and the per-card state / readout types that subsequent commits wire into the Linux sysfs reader and the Windows WMI reader. Default build is unchanged — `cargo build` produces a binary with zero Level Zero references.

The module is split across four files to stay well under the 500-line per-file budget: `intel_gpu_level_zero.rs` (public API: `LevelZeroState`, `LevelZeroReadout`, `refresh`, `apply_to_gpu_info`), `intel_gpu_level_zero/ffi.rs` (hand-written `#[repr(C)]` typedefs and enum constants — no vendored headers, no bindgen), `intel_gpu_level_zero/loader.rs` (`libloading`-based dynamic load of `libze_loader.so.1` / `ze_loader.dll`, one-shot `ZES_ENABLE_SYSMAN=1` injection, driver / device enumeration keyed by PCI BDF), and `intel_gpu_level_zero/refresh.rs` (per-engine and per-power-domain delta tracking).

The v1 surface is intentionally narrow: per-engine activity (RENDER_SINGLE, COMPUTE_SINGLE — the XMX class — COPY_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE) plus power derived from `zesPowerGetEnergyCounter` deltas. Temperature, frequency, memory state, RAS, per-process L0 stats, and fine-grained power-limit control are explicitly deferred to follow-up issues so the PR stays reviewable.

The 28-test unit suite covers the spec-locked enum values, BDF formatting and round-tripping, engine-busy delta math (seeding, percentage, overrun clamp, backwards-clock guard, zero-delta guard), energy-counter delta math (with the correct (µJ/µs) = W conversion — no spurious 1e6 scaling), and the `GpuInfo` integration semantics on both platforms (Linux must NOT overwrite `utilization`; Windows MUST overwrite the WMI zeros). One test deliberately calls `try_load_library` against a bogus path to verify the graceful-degrade path the runtime relies on for hosts without the L0 loader.
Per-card `IntelGpuCard` now holds a `Mutex<LevelZeroState>` field, gated behind `#[cfg(feature = "level_zero")]` so the struct shape is byte-identical to today's on the default build. The field is constructed empty in `discover_cards` — the first `get_gpu_info()` call lazily binds the card to an L0 device handle via the cached `LzRuntime`, looking up by canonical-formatted PCI BDF (the same string sysfs exposes via `/sys/class/drm/cardN/device`).

After the existing sysfs path emits the baseline `GpuInfo`, the L0 augmentation refreshes the per-card state, applies the readout in place, and — when L0 actually produced data — upgrades `detail["Metrics Source"]` from the new baseline `"sysfs (engine counters)"` to `"sysfs + Level Zero"`. The augmentation never overwrites `GpuInfo.utilization` on Linux: PR #249's sysfs engine counters remain authoritative for the headline percentage, and L0 only contributes additional `detail` entries (the XMX `COMPUTE_SINGLE` activity that sysfs cannot reach plus the energy-counter-derived `Power (L0)` reading).

The Linux test suite gains one assertion locking in the baseline `Metrics Source` so a regression that drops the marker (or that flips it without an L0 hardware verification) trips CI. The "sysfs + Level Zero" upgrade path requires a host with the Intel L0 runtime AND a supported GPU and is left for maintainer hardware verification per the issue ACs.
…lper

Split the Level Zero augmentation glue out of `intel_gpu_linux.rs` into a sibling `intel_gpu_linux/level_zero_glue.rs` so the per-OS reader file stays under the 500-line per-file budget once the L0 integration is wired in. The augmentation logic itself is unchanged — it still runs the L0 refresh against the just-pushed `GpuInfo` and is a noop on hosts without the runtime or for cards L0 cannot bind.

Move the baseline `Metrics Source = "sysfs (engine counters)"` insert into `ensure_static_info` (it is the same string for every call so caching it with the rest of the static identity costs nothing). Drop the explicit dynamic re-insert from `get_gpu_info`, which keeps the hot path one allocation lighter while preserving identical behaviour.

Add `enumerated_pci_bdfs()` to the Level Zero module so the Windows reader (commit follows) can pair its WMI controllers with L0 device handles by ordinal position when no shared per-card identifier is parseable from PNP IDs.
`IntelWindowsGpuReader` gains a per-PNP-id `Mutex<HashMap<String, LevelZeroState>>` field gated behind `#[cfg(feature = "level_zero")]` so state persists across `get_gpu_info` calls (each call re-queries WMI, but the L0 energy-counter baseline must survive between calls or the delta-derived power reading is meaningless).

After the WMI baseline emits the list of GPUs, `augment_with_level_zero` walks the WMI controllers in parallel with the sorted list of L0 PCI BDFs (`enumerated_pci_bdfs`). On the typical single-Intel-GPU Windows host this is a perfect 1:1 match; for multi-GPU hosts (rare on Windows) the prefix pairs and the unpaired suffix keeps the WMI-only baseline. `Win32_VideoController.PNPDeviceID` does not expose the BDF in a stable, parseable form across driver versions, so we explicitly choose ordinal matching rather than guessing wrong on a heuristic — a follow-up issue can introduce `Win32_PnPEntity.LocationInformation` parsing if multi-GPU Windows hosts ever become common.

The baseline now records `detail["Metrics Source"] = "WMI"` (in addition to the legacy `Note`) so the augmentation can flip it to `"WMI + Level Zero"` consistently with the Linux path. When L0 produces a readout, it also overwrites the placeholder zeros WMI emits for `GpuInfo.utilization` (max of render / XMX compute) and `GpuInfo.power_consumption`, finally giving the Windows reader the real telemetry NVIDIA users already get via NVML.
…ime tests

ARCHITECTURE.md and README.md both pick up a paragraph describing the opt-in `--features level_zero` augmentation: what it covers (engine activity including XMX, energy-counter-derived power), how it interacts with the sysfs / WMI baseline (Linux augments, Windows overwrites the zeros), how `detail["Metrics Source"]` records the active backend, and the deferred surface that is explicitly out of scope (temperature, frequency, memory state, per-process L0, RAS, performance factor, power limits).

`intel_gpu_windows.rs`'s module-level doc gains a new "WMI-only baseline limitations" section that points at the augmentation rather than asserting the metrics are unreachable. The legacy `detail["Note"]` is kept for downstream-consumer compatibility.

Two graceful-degradation tests round out the suite: `enumerated_pci_bdfs_empty_when_runtime_absent` verifies the BDF helper returns a `Vec<String>` rather than panicking when no L0 loader is present (the case on every CI host), and `refresh_returns_none_without_runtime` exercises the same invariant for the per-card `refresh` path. Both tests pass on a host with the loader present too (the post-bind state simply reports `had_any_data = false` for an unknown BDF).
@inureyes inureyes added status:review Under review type:enhancement New feature or request priority:medium Medium priority issue labels May 27, 2026
inureyes added 2 commits May 27, 2026 13:20
CI Test Suite failed on `cargo fmt --check`. Apply rustfmt to all seven files touched by the Level Zero backend so the formatting check is green. Pure mechanical formatting — no semantic changes, no test rewrites, all 30 L0 tests + 19 Linux tests still pass under both default and `--features level_zero` builds.
…spec

The hand-written FFI surface previously declared `ZES_STRUCTURE_TYPE_PCI_PROPERTIES = 0x1` and `ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES = 0xa`. Per the official `zes_api.h` (https://github.com/oneapi-src/level-zero/blob/master/include/zes_api.h, `typedef enum _zes_structure_type_t`) the correct values are `PCI_PROPERTIES = 0x2` and `ENGINE_PROPERTIES = 0x5`. The value `0x1` is actually `ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES` and `0xa` is `ZES_STRUCTURE_TYPE_LED_PROPERTIES`, so the `stype` field every refresh wrote into `zes_pci_properties_t` / `zes_engine_properties_t` was labelled as the wrong struct family.

In practice Intel's current oneAPI loader does not strictly validate `stype` on the top-level struct (it only chains `pNext` extension structs by `stype`), so the bug did not surface as a runtime failure on the developer host. But strict-validation drivers, the optional `ZE_ENABLE_VALIDATION_LAYER=1` path, and any future spec-compliant L0 implementation would reject these calls. The `structure_type_constants_match_spec` test that was supposed to lock the values to the spec instead locked them to the wrong values.

Bump both constants to the spec-correct values and update the unit test accordingly.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Add an opt-in Intel Level Zero backend behind a default-off level_zero Cargo feature that augments the existing sysfs (Linux) and WMI (Windows) Intel GPU readers with engine activity (including the XMX COMPUTE_SINGLE class) and energy-counter-derived power, while leaving the default build's binary footprint and runtime behaviour unchanged.

Findings Addressed

  • CRITICAL — ZES_STRUCTURE_TYPE_* constants did not match the Sysman spec. The hand-written FFI declared ZES_STRUCTURE_TYPE_PCI_PROPERTIES = 0x1 and ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES = 0xa. Per oneapi-src/level-zero/include/zes_api.h the correct values are 0x2 and 0x5. The values 0x1 and 0xa correspond to DEVICE_PROPERTIES and LED_PROPERTIES respectively, so every refresh wrote the wrong stype into zes_pci_properties_t / zes_engine_properties_t. Intel's current loader does not strictly validate top-level stype, but strict validators and future-proofing both require the correct values. The structure_type_constants_match_spec test also locked the wrong values. Fixed in 2d39a77.
  • HIGH — CI Test Suite failed on cargo fmt --check. Seven Level Zero files (and the Linux / Windows glue) had formatting violations that the Test Suite workflow's Check formatting step rejected. Applied cargo fmt cleanly with no semantic changes; all 30 L0 tests + 19 Linux tests still pass under both default and --features level_zero builds. Fixed in a4c956a.

Verified Correct (no fix needed)

  • zes_engine_group_t v1-tracked values match the upstream zes_api.h exactly (COMPUTE_SINGLE=4, RENDER_SINGLE=5, MEDIA_DECODE_SINGLE=6, MEDIA_ENCODE_SINGLE=7, COPY_SINGLE=8). The mapping cited in the review prompt was off-by-one against the actual spec; the code is correct.
  • Energy-counter math: (delta_energy_uj as f64) / (delta_timestamp_us as f64) produces watts directly because µJ/µs = J/s = W. The power_watts_correct test asserts 30 J over 1 s → 30 W. Comment at refresh.rs:264 documents the unit relationship.
  • PCI BDF formatting: format_pci_bdf produces lowercase "{:04x}:{:02x}:{:02x}.{:x}" matching the sysfs symlink target basename (e.g. 0000:03:00.0); normalise_pci_bdf lowercases the sysfs string before lookup so the map hit is case-insensitive.
  • ZES_ENABLE_SYSMAN=1 is set inside Once::call_once strictly before the first zeInit call (lines 208–219 precede line 241 in loader.rs); unsafe is correctly scoped per Rust 2024 rules; existing operator override is preserved (if var_os().is_none()).
  • Linux augmentation adds Engine: <label> (L0) / Power (L0) detail entries and flips Metrics Source to "sysfs + Level Zero" but does NOT overwrite GpuInfo.utilization or GpuInfo.power_consumption. Test: apply_to_gpu_info_linux_does_not_overwrite_utilization.
  • Windows augmentation DOES overwrite utilization (with max(render, compute (XMX))) and power_consumption, and flips Metrics Source to "WMI + Level Zero". Test: apply_to_gpu_info_windows_overwrites_zero_fields.
  • Metrics Source baseline strings are emitted before L0 augmentation runs; without an L0 runtime present the baseline survives unchanged ("sysfs (engine counters)" / "WMI").
  • Graceful degradation: try_load_library_returns_none_for_nonexistent_path, enumerated_pci_bdfs_empty_when_runtime_absent, and refresh_returns_none_without_runtime cover the no-L0-runtime path. No panic! / unwrap() in production code (the one best.unwrap() in refresh_power is guarded by short-circuit is_none() ||).
  • Module gating: pub mod intel_gpu_level_zero in mod.rs is gated on #[cfg(all(any(target_os = "linux", target_os = "windows"), feature = "level_zero"))]. All non-mod.rs references in the per-OS readers (level_zero_state field, crate::device::readers::intel_gpu_level_zero::* paths) are #[cfg(feature = "level_zero")]-gated.
  • Default build: cargo build --bin all-smi produces a binary with zero zeInit / libze_loader / ZES_ENABLE_SYSMAN strings (verified locally via strings). ldd shows no L0 library linked even with --features level_zero (dynamic via libloading, as intended).
  • File-size budget: all six new files under 500 lines (ffi 246, loader 351, refresh 293, level_zero 342, tests 476, level_zero_glue 37; intel_gpu_linux.rs 497 — at the budget but compliant).
  • Commit messages: conventional-commits prefix, English-only, no co-authorship, no AI attribution.
  • PR body: explicit deferred surface listed, hardware ACs left unchecked with (awaits hardware verification by maintainer), no new Cargo dependencies, ends with Closes #248.

Remaining Items

  • Hardware-gated ACs (XMX activity visible on Arc / Battlemage / Lunar Lake / Meteor Lake; Windows utilization / power non-zero with L0 runtime present; Linux sysfs vs. L0 agreement) — explicitly deferred to maintainer hardware verification per the issue body. Not a finding.
  • LOW — stale Note on Windows after L0 succeeds. The WMI baseline writes detail["Note"] = "Detailed metrics require Level Zero / xpu-smi"; once L0 augmentation flips Metrics Source to "WMI + Level Zero", this Note is left in place and now contradicts the real state. The source comment acknowledges the legacy key is kept for backward compatibility. Cosmetic; not blocking.
  • LOW — best.unwrap() in refresh.rs:214. Guarded by best.is_none() || short-circuit so it is safe, but best.is_none_or(|b| b < w) would be idiomatic. Cosmetic; clippy passes.
  • MEDIUM (acknowledged in PR) — Windows ordinal-paired BDF matching. Pairing WMI controllers with L0 devices by sorted-BDF ordinal is correct for the common 1-Intel-GPU case but can misalign on multi-Intel-GPU Windows hosts. Explicitly flagged in augment_with_level_zero's doc comment with the follow-up path (Win32_PnPEntity.LocationInformation parsing). Not a regression from the WMI-only baseline; acceptable scope-bounded compromise.

Verification

  • All stated requirements implemented
  • No placeholder/mock code remaining (no TODOs, no todo!/unimplemented!)
  • Integrated into project code flow (Linux: level_zero_glue::augment called after baseline GpuInfo push; Windows: augment_with_level_zero called after query_intel_gpus)
  • Project conventions followed (Rust 2024 edition, cargo fmt clean, cargo clippy -- -D warnings clean under both feature settings, conventional-commits)
  • Existing modules reused where applicable (libloading pattern from tpu_pjrt.rs; Mutex<EngineState> pattern for per-card delta tracking; intel_gpu_names::classify_intel_architecture shared with both OS readers; DeviceStaticInfo cached once)
  • No unintended structural changes (only added the level_zero_state field on IntelGpuCard and IntelWindowsGpuReader, plus the module tree under intel_gpu_level_zero/)
  • Tests pass (30/30 L0; 19/19 Linux under both feature settings; 20/20 intel_gpu_engine; 23/23 intel_gpu_fdinfo; cargo fmt --check clean; cargo clippy --lib --tests -- -D warnings clean both modes; cargo clippy --lib --tests --features level_zero -- -D warnings clean)

Verdict

Ready for security/perf review pending CI green on the latest push. The one CRITICAL finding (wrong Sysman stype constants) and one HIGH finding (CI red on cargo fmt) have been fixed on the branch in 2d39a77 and a4c956a respectively. All v1-scoped functional checks pass; deferred items match what the PR body and issue #248 explicitly enumerate.

@inureyes

Copy link
Copy Markdown
Member Author

PR Security & Performance Review

Re-review after the two earlier auto-fixes (commits 2d39a77 and a4c956a). I exercised cargo check / cargo clippy / the targeted test suites under both default and --features level_zero builds — every command passes (30/30 L0 tests, 19/19 Linux tests, both with and without the feature; no clippy warnings; default binary contains zero L0 symbols per nm -D).

The remaining findings are all in the new Level Zero FFI surface, in descending severity:

HIGH — Unbounded driver-reported handle counts in vec allocation

src/device/readers/intel_gpu_level_zero/loader.rs:275-276, 294-295 and src/device/readers/intel_gpu_level_zero/refresh.rs:91-92, 139-140 all do vec![std::ptr::null_mut::<c_void>(); count as usize] where count: u32 comes verbatim from the L0 driver via the count-only first pass. A buggy or malicious driver that writes u32::MAX into count would trigger an allocation of ~32 GiB (8 bytes per pointer × 4 billion). The process either OOMs the host or panics on allocation failure — both are denial-of-service surfaces. Cap each enumeration with a guard similar to MAX_DEVICES = 256 used by discover_cards (e.g. let count = count.min(MAX_L0_HANDLES); before the vec![...; count]). Driver / power-domain / engine-group counts in practice are < 64, so a 256 cap is generous and matches the existing parity convention.

MEDIUM — ze_bool_t is uint8_t, not u32

Upstream ze_api.h: typedef uint8_t ze_bool_t;. The PR declares the four ze_bool_t fields as u32 in src/device/readers/intel_gpu_level_zero/ffi.rs:

  • zes_engine_properties_t::on_subdevice (line 162)
  • zes_pci_properties_t::have_bandwidth_counters (line 132)
  • zes_pci_properties_t::have_packet_counters (line 133)
  • zes_pci_properties_t::have_replay_counters (line 134)

For zes_engine_properties_t, the next field subdeviceId is uint32_t so the C compiler pads onSubdevice up to a 4-byte slot. Reading on_subdevice as u32 happens to work because Rust's Default zero-initialises the padding bytes before the driver writes its one byte (and subdeviceId lands at the same offset in both layouts). Functionally correct in practice, but the type does not match the spec.

For zes_pci_properties_t the divergence is larger. The C struct ends at byte 55 (three trailing uint8_t bool fields + 5 bytes of struct-end padding to align to the 8-byte void* alignment of the struct). The Rust struct as written is 60 bytes (three u32s, no end padding). When the driver writes 56 bytes per the C ABI, the trailing four Rust bytes (offsets 56-59) are never touched; Default zero-init makes them safe to read but they are not driver data. More importantly, the three Rust u32 fields read across the boundary of the three C u8 bools and the C padding — have_bandwidth_counters ends up reading bytes for all three bools mashed together. The PR never reads these fields (only .address), so today this is dormant. But it is spec-non-compliant, will silently corrupt the values if a future maintainer reads them, and risks compatibility issues with strict-validation drivers or the optional ZE_ENABLE_VALIDATION_LAYER=1 path.

Fix: change all four field types to u8. Re-derive the Rust struct size against the spec C layout while adding the change.

MEDIUM — Engine-group enum constants 9-12 do not match the spec

Spec (verified against https://github.com/oneapi-src/level-zero/blob/master/include/zes_api.h and https://oneapi-src.github.io/level-zero-spec/level-zero/latest/sysman/api.html#zes-engine-group-t):

Value Spec name PR name
9 MEDIA_ENHANCEMENT_SINGLE RENDER_COMPUTE_ALL ← wrong
10 3D_SINGLE (deprecated) 3D_ALL ← wrong
11 3D_RENDER_COMPUTE_ALL (deprecated) 3D_SINGLE ← wrong
12 RENDER_ALL MEDIA_ENHANCEMENT_SINGLE ← wrong
13 3D_ALL (deprecated) not declared
14 MEDIA_CODEC_SINGLE not declared

is_tracked_engine only matches values 4-8 (COMPUTE_SINGLE, RENDER_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE, COPY_SINGLE), so the runtime classification of tracked engines is correct on real Arc / Battlemage hardware. The bug is twofold:

  1. The lock-in test engine_group_enum_values_match_spec asserts these wrong values "match the spec" — the safety net the PR description praises is fictional from value 9 onwards. A future spec drift in the 4-8 range would still be caught, but the test reads as if the entire enum is locked.
  2. Future maintainers adding tracked engines (e.g. wiring RENDER_ALL to capture the unified render-engine aggregate the spec recommends for Arc) would write ffi::ZES_ENGINE_GROUP_RENDER_ALL expecting value 12, but the constant resolves to MEDIA_ENHANCEMENT_SINGLE (9 in spec). Silent misclassification at the source.

Fix: renumber 9-12 to the spec values, add RENDER_ALL = 12 and MEDIA_CODEC_SINGLE = 14, drop or rename the deprecated 3D_SINGLE = 10 / 3D_RENDER_COMPUTE_ALL = 11 slots, update the test to assert against the spec, and update the doc comment that references the spec URL.

LOW — std::env::set_var SAFETY contract violation

src/device/readers/intel_gpu_level_zero/loader.rs:206-219. The SAFETY comment claims "no L0 thread has begun yet (every L0 caller goes through ensure_runtime, which is what we are inside of)" — but the Rust 2024 unsafety of set_var is about any thread reading the environment, not specifically L0 threads. IntelGpuReader::new() is called inside run_collection_loop after the tokio runtime is up and after the axum server (and tracing, and reqwest, and ssh transport) threads are spawned. set_var therefore runs while many threads are alive and may be calling getenv indirectly via glibc internals (locale, time zone, libssl, etc.). This is the exact scenario Rust 2024 marked unsafe.

The window is small (one set_var call early in the first collection iteration) and ZES_ENABLE_SYSMAN itself is read only by the L0 loader, so the real-world chance of UB manifesting is very low. But the SAFETY comment is misleading and the contract is violated.

Either (a) move the env-var injection to a hook called by main before the tokio runtime spawns workers (cleanest, matches the Rust 2024 guidance), or (b) update the SAFETY comment to accurately describe the race window and document why the team accepts it for v1. I would prefer (a) but (b) is acceptable if the orchestration cost is too high.

LOW — level_zero_state map grows monotonically on Windows

src/device/readers/intel_gpu_windows.rs:286-300 inserts into level_zero_state: Mutex<HashMap<String, LevelZeroState>> keyed by gpu.uuid (derived from WMI PNPDeviceID) and never removes. In practice this is bounded by the small set of Intel GPUs in the machine, but on a system where PNPDeviceID is unstable across reboots (rare but possible after driver reinstall) the map would accumulate dead entries. Not worth fixing in v1; just worth noting in the module doc.

Items that are fine

  • FFI memory safety. Function pointers are extracted by *ze_init (deref to copy out a Copy fn pointer) and the underlying Library is held in LzRuntime inside a OnceCell for process lifetime — no dangling symbols. All unsafe { (api.fn)(...) } call sites have driver-side null-buffer / count-only semantics and check the return code before reading the out-parameter. Opaque handles (zes_device_handle_t, zes_engine_handle_t, zes_pwr_handle_t) are never freed, never Box::from_raw'd — they live for runtime lifetime, matching the spec's process-scoped opacity.
  • Library handle leak via OnceCell matches the tpu_pjrt pattern; no Drop impl on LzRuntime accidentally unloads.
  • Library search paths are static const strings — no user-controlled input reaches Library::new.
  • Energy and active-time counter deltas use saturating_sub (refresh.rs:234, 238, 259, 263). Backwards-clock, zero-delta, energy-reset, overrun-clamp, and seeding cases all have explicit tests that assert the right behaviour (return 0.0 / return None / clamp to 100).
  • The Metrics Source baseline / augmented strings flip exactly once and only when fresh data arrives.
  • PCI BDF lowercase formatting ({:04x}:{:02x}:{:02x}.{:x}) is locked in by pci_bdf_format_matches_sysfs and matches the /sys/class/drm/cardN/device symlink-target basename layout.
  • Per-card Mutex<LevelZeroState> provides correct serialisation; lock order (per-card → global LZ_RUNTIME) is consistent across Linux and Windows.
  • Library-not-found / runtime-absent paths return None cleanly with debug! logs only — verified by try_load_library_returns_none_for_nonexistent_path, enumerated_pci_bdfs_empty_when_runtime_absent, refresh_returns_none_without_runtime.
  • Default-feature binary contains zero L0 references (nm -D | grep -ciE 'zes_|ze_loader|level_zero' = 0).

Recommendation

Three changes recommended before merge:

  • HIGH: cap the driver-supplied count in all four vec![ptr; count as usize] sites.
  • MEDIUM: fix the ze_bool_t type (u8, not u32) for all four boolean struct fields.
  • MEDIUM: correct enum values 9-12 (and add RENDER_ALL=12, MEDIA_CODEC_SINGLE=14) so the lock-in test is honest.

The LOW items (set_var SAFETY comment, Windows state-map growth) can ship as-is for v1 with a note for a follow-up.

inureyes added 2 commits May 27, 2026 13:44
…ngine-group constants

Three Sysman-layer fixes surfaced during PR security review.

1. DoS guard around driver-reported handle counts. Every count-then-buffer call site in the L0 loader and refresh paths previously did `vec![ptr; count as usize]` against a raw `u32` returned by the driver. A buggy or hostile driver returning `u32::MAX` would have triggered a ~32 GiB allocation. Added `MAX_L0_HANDLES = 256` (mirroring `MAX_DEVICES`) plus a shared `cap_handle_count` helper that clamps the count and emits a one-shot tracing warning on overflow. Applied at all four call sites (drivers + devices in `loader::enumerate_devices`, engine groups + power domains in `refresh::populate_*`). Each capped enumeration also truncates the Vec to the actual driver-written prefix.

2. `ze_bool_t` ABI mismatch in two FFI structs. Per the upstream header `typedef uint8_t ze_bool_t;`. The PR declared `zes_pci_properties_t::{have_bandwidth_counters,have_packet_counters,have_replay_counters}` and `zes_engine_properties_t::on_subdevice` as `u32`, inflating `zes_pci_properties_t` from the spec-correct 56 bytes to 64 bytes. Changed all four fields to `u8`. Added six `#[cfg(target_pointer_width = "64")]` size-assertion tests that lock the layouts to the C spec sizes (16/16/56/32/16/16). Verified the C struct sizes by compiling a faithful replica with the system C compiler.

3. Engine-group enum values 9..=14 corrected to match the spec. Cross-checked against the upstream `zes_api.h` `_zes_engine_group_t` definition. Fixed values: `MEDIA_ENHANCEMENT_SINGLE=9`, `3D_SINGLE=10` (was 11), `3D_RENDER_COMPUTE_ALL=11` (renamed from `RENDER_COMPUTE_ALL`, was 9), `RENDER_ALL=12` (new), `3D_ALL=13` (was 10), `MEDIA_CODEC_SINGLE=14` (new). Updated the lock-in test to assert the corrected values and added the two previously missing constants. Runtime classification logic (`is_tracked_engine` matches 4..=8) is unchanged because those values were already correct.

Verification:
- cargo fmt --check
- cargo clippy --lib --tests --features level_zero -- -D warnings
- cargo clippy --features level_zero -- -D warnings
- cargo test --lib device::readers::intel_gpu_level_zero --features level_zero (36 passed, was 30 before the new size-assertion tests)
… feature note

Add four unit tests that exercise cap_handle_count directly: under-cap pass-through, over-cap clamp (u32::MAX), exact boundary (MAX_L0_HANDLES), and one-over boundary. These cover the DoS guard path that was only exercised indirectly through the enumerate_devices call chain, which requires a real L0 runtime.

Also add a brief note to the Intel Arc entry in docs/man/all-smi.1 describing the opt-in --features level_zero build flag, the runtime library names, and the graceful degradation contract. The README and ARCHITECTURE.md already carried this information; the manpage had none.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Tests

Added 4 unit tests for the cap_handle_count DoS guard in src/device/readers/intel_gpu_level_zero/tests.rs:

  • cap_handle_count_passes_through_when_under_cap — normal hardware count (6) passes through unchanged
  • cap_handle_count_clamps_when_over_cap — u32::MAX is clamped to MAX_L0_HANDLES (256)
  • cap_handle_count_at_exact_boundary — exactly MAX_L0_HANDLES passes through, is not clamped
  • cap_handle_count_one_over_boundary_is_clamped — MAX_L0_HANDLES + 1 is clamped

The previous test suite exercised this path only through enumerate_devices, which requires a real L0 runtime. These tests are runtime-free and run in CI.

Total unit tests in the L0 backend suite: 40 (was 36).

Documentation

  • Manpage (docs/man/all-smi.1): Added a note to the Intel Arc SUPPORTED PLATFORMS entry describing the opt-in --features level_zero build flag, the runtime library names (libze_loader.so.1 / ze_loader.dll), and the graceful degradation contract. README.md and ARCHITECTURE.md already carried this information; the manpage had none.
  • README.md, ARCHITECTURE.md, Cargo.toml: Verified accurate — no changes needed.

Lint / Format

cargo fmt --all, cargo clippy --lib --tests -- -D warnings, cargo clippy --lib --tests --features level_zero -- -D warnings, and the narrow bin-target passes all pass clean. No drift found.

Verification

  • All tests pass without --features level_zero (no regression)
  • All tests pass with --features level_zero (40 L0 tests, all OK)

Commit

7f3bd83 — chore(intel-gpu-level-zero): add MAX_L0_HANDLES cap tests and manpage feature note

@inureyes inureyes added status:done Completed and removed status:review Under review labels May 27, 2026
@inureyes inureyes merged commit 1a1f6da into main May 27, 2026
6 of 7 checks passed
@inureyes inureyes deleted the feat/issue-248-intel-level-zero-backend branch May 27, 2026 05:02
inureyes added a commit that referenced this pull request May 27, 2026
Move legacy Level Zero Sysman env setup to CLI startup, prefer zesInit when exported, and bound Intel fdinfo reads to avoid unbounded allocations during process scans.

Also share the test-only environment lock across config tests so cargo test remains stable under parallel execution.

Refs #244, #246, #247, #248, #251, #252.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:medium Medium priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(intel-gpu): Level Zero (oneAPI) integration for advanced metrics on Linux and Windows

1 participant