feat(intel-gpu): add opt-in Level Zero backend for advanced metrics #251
Lay the opt-in `level_zero` Cargo feature, the cross-platform Level Zero (oneAPI) FFI shim, and the per-card state / readout types that subsequent commits wire into the Linux sysfs reader and the Windows WMI reader. Default build is unchanged — `cargo build` produces a binary with zero Level Zero references. The module is split across four files to stay well under the 500-line per-file budget: `intel_gpu_level_zero.rs` (public API: `LevelZeroState`, `LevelZeroReadout`, `refresh`, `apply_to_gpu_info`), `intel_gpu_level_zero/ffi.rs` (hand-written `#[repr(C)]` typedefs and enum constants — no vendored headers, no bindgen), `intel_gpu_level_zero/loader.rs` (`libloading`-based dynamic load of `libze_loader.so.1` / `ze_loader.dll`, one-shot `ZES_ENABLE_SYSMAN=1` injection, driver / device enumeration keyed by PCI BDF), and `intel_gpu_level_zero/refresh.rs` (per-engine and per-power-domain delta tracking). The v1 surface is intentionally narrow: per-engine activity (RENDER_SINGLE, COMPUTE_SINGLE — the XMX class — COPY_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE) plus power derived from `zesPowerGetEnergyCounter` deltas. Temperature, frequency, memory state, RAS, per-process L0 stats, and fine-grained power-limit control are explicitly deferred to follow-up issues so the PR stays reviewable. The 28-test unit suite covers the spec-locked enum values, BDF formatting and round-tripping, engine-busy delta math (seeding, percentage, overrun clamp, backwards-clock guard, zero-delta guard), energy-counter delta math (with the correct (µJ/µs) = W conversion — no spurious 1e6 scaling), and the `GpuInfo` integration semantics on both platforms (Linux must NOT overwrite `utilization`; Windows MUST overwrite the WMI zeros). One test deliberately calls `try_load_library` against a bogus path to verify the graceful-degrade path the runtime relies on for hosts without the L0 loader.
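The delta math described above can be sketched as follows. This is a minimal illustration, not the PR's `refresh.rs`: the `Sample` type and both function names are hypothetical, and returning `None` for a stalled or backwards clock is an assumption (the commit only states that those guards exist). Seeding would fall out of the caller keeping an `Option<Sample>` and skipping the first reading.

```rust
/// One reading of a monotonically increasing counter and its timestamp.
/// `counter` is active time in microseconds for an engine, or energy in
/// microjoules for a power domain; `timestamp_us` is in microseconds.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Sample {
    pub counter: u64,
    pub timestamp_us: u64,
}

/// Elapsed microseconds between two samples, or `None` when the clock
/// stood still (zero delta) or ran backwards.
fn elapsed_us(prev: Sample, cur: Sample) -> Option<u64> {
    cur.timestamp_us
        .checked_sub(prev.timestamp_us)
        .filter(|dt| *dt > 0)
}

/// Engine busy percentage, clamped to 100 so a counter overrun cannot
/// report more than full utilisation.
pub fn busy_percent(prev: Sample, cur: Sample) -> Option<f64> {
    let dt = elapsed_us(prev, cur)?;
    let active = cur.counter.saturating_sub(prev.counter);
    Some((active as f64 / dt as f64 * 100.0).min(100.0))
}

/// Average power in watts. Microjoules divided by microseconds is already
/// joules per second, so no extra 1e6 factor is applied.
pub fn power_watts(prev: Sample, cur: Sample) -> Option<f64> {
    let dt = elapsed_us(prev, cur)?;
    let energy_uj = cur.counter.saturating_sub(prev.counter);
    Some(energy_uj as f64 / dt as f64)
}
```

For example, 15,000,000 µJ consumed over 1,000,000 µs works out to 15 W.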
Per-card `IntelGpuCard` now holds a `Mutex<LevelZeroState>` field, gated behind `#[cfg(feature = "level_zero")]` so the struct shape is byte-identical to today's on the default build. The field is constructed empty in `discover_cards` — the first `get_gpu_info()` call lazily binds the card to an L0 device handle via the cached `LzRuntime`, looking up by canonical-formatted PCI BDF (the same string sysfs exposes via `/sys/class/drm/cardN/device`). After the existing sysfs path emits the baseline `GpuInfo`, the L0 augmentation refreshes the per-card state, applies the readout in place, and — when L0 actually produced data — upgrades `detail["Metrics Source"]` from the new baseline `"sysfs (engine counters)"` to `"sysfs + Level Zero"`. The augmentation never overwrites `GpuInfo.utilization` on Linux: PR #249's sysfs engine counters remain authoritative for the headline percentage, and L0 only contributes additional `detail` entries (the XMX `COMPUTE_SINGLE` activity that sysfs cannot reach plus the energy-counter-derived `Power (L0)` reading). The Linux test suite gains one assertion locking in the baseline `Metrics Source` so a regression that drops the marker (or that flips it without an L0 hardware verification) trips CI. The "sysfs + Level Zero" upgrade path requires a host with the Intel L0 runtime AND a supported GPU and is left for maintainer hardware verification per the issue ACs.
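The augmentation semantics can be sketched roughly as below, including the Windows overwrite behaviour the PR describes elsewhere. All struct shapes here are simplified stand-ins: the real `GpuInfo` and `LevelZeroReadout` have other fields, the `f64` field types are assumed, and the `"XMX Compute (L0)"` detail key is a made-up label (only `"Power (L0)"` and the `Metrics Source` strings come from the commit message).

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ApplyPlatform {
    Linux,
    Windows,
}

/// Simplified stand-in for the project's `GpuInfo`.
#[derive(Debug, Default)]
pub struct GpuInfo {
    pub utilization: f64,
    pub power_consumption: f64,
    pub detail: HashMap<String, String>,
}

/// Simplified stand-in for the per-refresh readout.
#[derive(Debug, Default)]
pub struct LevelZeroReadout {
    pub render_pct: Option<f64>,
    pub compute_pct: Option<f64>,
    pub power_w: Option<f64>,
}

/// Applies a readout in place. Returns `false` (and touches nothing) when
/// Level Zero produced no data, so the baseline `Metrics Source` survives.
pub fn apply_to_gpu_info(
    readout: &LevelZeroReadout,
    info: &mut GpuInfo,
    platform: ApplyPlatform,
) -> bool {
    let busiest = [readout.render_pct, readout.compute_pct]
        .iter()
        .flatten()
        .copied()
        .reduce(f64::max);
    if busiest.is_none() && readout.power_w.is_none() {
        return false;
    }
    if let Some(pct) = readout.compute_pct {
        // Hypothetical detail key for the XMX engine activity.
        info.detail
            .insert("XMX Compute (L0)".to_string(), format!("{pct:.1}%"));
    }
    if let Some(watts) = readout.power_w {
        info.detail
            .insert("Power (L0)".to_string(), format!("{watts:.1} W"));
    }
    let source = match platform {
        // Linux: the sysfs headline utilization stays authoritative.
        ApplyPlatform::Linux => "sysfs + Level Zero",
        // Windows: replace the placeholder zeros that WMI reports.
        ApplyPlatform::Windows => {
            if let Some(pct) = busiest {
                info.utilization = pct;
            }
            if let Some(watts) = readout.power_w {
                info.power_consumption = watts;
            }
            "WMI + Level Zero"
        }
    };
    info.detail
        .insert("Metrics Source".to_string(), source.to_string());
    true
}
```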
…lper Split the Level Zero augmentation glue out of `intel_gpu_linux.rs` into a sibling `intel_gpu_linux/level_zero_glue.rs` so the per-OS reader file stays under the 500-line per-file budget once the L0 integration is wired in. The augmentation logic itself is unchanged — it still runs the L0 refresh against the just-pushed `GpuInfo` and is a noop on hosts without the runtime or for cards L0 cannot bind. Move the baseline `Metrics Source = "sysfs (engine counters)"` insert into `ensure_static_info` (it is the same string for every call so caching it with the rest of the static identity costs nothing). Drop the explicit dynamic re-insert from `get_gpu_info`, which keeps the hot path one allocation lighter while preserving identical behaviour. Add `enumerated_pci_bdfs()` to the Level Zero module so the Windows reader (commit follows) can pair its WMI controllers with L0 device handles by ordinal position when no shared per-card identifier is parseable from PNP IDs.
`IntelWindowsGpuReader` gains a per-PNP-id `Mutex<HashMap<String, LevelZeroState>>` field gated behind `#[cfg(feature = "level_zero")]` so state persists across `get_gpu_info` calls (each call re-queries WMI, but the L0 energy-counter baseline must survive between calls or the delta-derived power reading is meaningless). After the WMI baseline emits the list of GPUs, `augment_with_level_zero` walks the WMI controllers in parallel with the sorted list of L0 PCI BDFs (`enumerated_pci_bdfs`). On the typical single-Intel-GPU Windows host this is a perfect 1:1 match; for multi-GPU hosts (rare on Windows) the prefix pairs and the unpaired suffix keeps the WMI-only baseline. `Win32_VideoController.PNPDeviceID` does not expose the BDF in a stable, parseable form across driver versions, so we explicitly choose ordinal matching rather than guessing wrong on a heuristic — a follow-up issue can introduce `Win32_PnPEntity.LocationInformation` parsing if multi-GPU Windows hosts ever become common. The baseline now records `detail["Metrics Source"] = "WMI"` (in addition to the legacy `Note`) so the augmentation can flip it to `"WMI + Level Zero"` consistently with the Linux path. When L0 produces a readout, it also overwrites the placeholder zeros WMI emits for `GpuInfo.utilization` (max of render / XMX compute) and `GpuInfo.power_consumption`, finally giving the Windows reader the real telemetry NVIDIA users already get via NVML.
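As a rough illustration of the ordinal pairing and the persistent state map, consider the sketch below. The function names `pair_by_ordinal` and `state_for` are hypothetical, and `LevelZeroState` is reduced to a single assumed field; the real `augment_with_level_zero` is not shown in this conversation.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the per-GPU Level Zero state.
#[derive(Debug, Default)]
pub struct LevelZeroState {
    pub last_energy_uj: Option<u64>,
}

/// Pairs WMI controllers (in WMI order) with Level Zero BDFs (sorted).
/// Controllers beyond the shorter list get `None` and keep the WMI-only
/// baseline.
pub fn pair_by_ordinal(
    pnp_ids: &[String],
    mut bdfs: Vec<String>,
) -> Vec<(String, Option<String>)> {
    bdfs.sort();
    let mut sorted = bdfs.into_iter();
    pnp_ids
        .iter()
        .map(|id| (id.clone(), sorted.next()))
        .collect()
}

/// Fetches (or lazily creates) the state for one PNP id so the energy
/// baseline survives between `get_gpu_info` calls.
pub fn state_for<'a>(
    map: &'a mut HashMap<String, LevelZeroState>,
    pnp_id: &str,
) -> &'a mut LevelZeroState {
    map.entry(pnp_id.to_string()).or_default()
}
```

With three controllers and two L0 devices, the first two controllers pair with the two sorted BDFs and the third stays unpaired.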
…ime tests ARCHITECTURE.md and README.md both pick up a paragraph describing the opt-in `--features level_zero` augmentation: what it covers (engine activity including XMX, energy-counter-derived power), how it interacts with the sysfs / WMI baseline (Linux augments, Windows overwrites the zeros), how `detail["Metrics Source"]` records the active backend, and the deferred surface that is explicitly out of scope (temperature, frequency, memory state, per-process L0, RAS, performance factor, power limits). `intel_gpu_windows.rs`'s module-level doc gains a new "WMI-only baseline limitations" section that points at the augmentation rather than asserting the metrics are unreachable. The legacy `detail["Note"]` is kept for downstream-consumer compatibility. Two graceful-degradation tests round out the suite: `enumerated_pci_bdfs_empty_when_runtime_absent` verifies the BDF helper returns a `Vec<String>` rather than panicking when no L0 loader is present (the case on every CI host), and `refresh_returns_none_without_runtime` exercises the same invariant for the per-card `refresh` path. Both tests pass on a host with the loader present too (the post-bind state simply reports `had_any_data = false` for an unknown BDF).
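The graceful-degradation contract those two tests lock in might look roughly like this. It is a sketch only: `LzRuntime` is reduced to a list of BDFs, `try_load_runtime` is a stub that probes a deliberately bogus path instead of calling `libloading`, `refresh` returns a device index rather than a real readout, and `std::sync::OnceLock` stands in for the `OnceCell` mentioned in the review.

```rust
use std::path::Path;
use std::sync::OnceLock;

/// Reduced stand-in for the cached runtime.
pub struct LzRuntime {
    pub device_bdfs: Vec<String>,
}

static LZ_RUNTIME: OnceLock<Option<LzRuntime>> = OnceLock::new();

/// Stub loader: pretends the Level Zero loader lives at a path that does
/// not exist, so it always degrades to `None` here.
fn try_load_runtime() -> Option<LzRuntime> {
    let loader = Path::new("/nonexistent/libze_loader.so.1");
    if !loader.exists() {
        return None;
    }
    Some(LzRuntime { device_bdfs: Vec::new() })
}

/// Runs `f` against the runtime if one was loaded, else returns `None`.
pub fn with_runtime<T>(f: impl FnOnce(&LzRuntime) -> T) -> Option<T> {
    LZ_RUNTIME.get_or_init(try_load_runtime).as_ref().map(f)
}

/// Sorted BDF list, or an empty `Vec` when the runtime is absent.
pub fn enumerated_pci_bdfs() -> Vec<String> {
    with_runtime(|rt| {
        let mut bdfs = rt.device_bdfs.clone();
        bdfs.sort();
        bdfs
    })
    .unwrap_or_default()
}

/// Index of the device bound to `bdf`, or `None` without a runtime or for
/// an unknown BDF.
pub fn refresh(bdf: &str) -> Option<usize> {
    with_runtime(|rt| rt.device_bdfs.iter().position(|known| known == bdf)).flatten()
}
```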
CI Test Suite failed on `cargo fmt --check`. Apply rustfmt to all seven files touched by the Level Zero backend so the formatting check is green. Pure mechanical formatting — no semantic changes, no test rewrites, all 30 L0 tests + 19 Linux tests still pass under both default and `--features level_zero` builds.
…spec The hand-written FFI surface previously declared `ZES_STRUCTURE_TYPE_PCI_PROPERTIES = 0x1` and `ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES = 0xa`. Per the official `zes_api.h` (https://github.com/oneapi-src/level-zero/blob/master/include/zes_api.h, `typedef enum _zes_structure_type_t`) the correct values are `PCI_PROPERTIES = 0x2` and `ENGINE_PROPERTIES = 0x5`. The value `0x1` is actually `ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES` and `0xa` is `ZES_STRUCTURE_TYPE_LED_PROPERTIES`, so the `stype` field every refresh wrote into `zes_pci_properties_t` / `zes_engine_properties_t` was labelled as the wrong struct family. In practice Intel's current oneAPI loader does not strictly validate `stype` on the top-level struct (it only chains `pNext` extension structs by `stype`), so the bug did not surface as a runtime failure on the developer host. But strict-validation drivers, the optional `ZE_ENABLE_VALIDATION_LAYER=1` path, and any future spec-compliant L0 implementation would reject these calls. The `structure_type_constants_match_spec` test that was supposed to lock the values to the spec instead locked them to the wrong values. Bump both constants to the spec-correct values and update the unit test accordingly.
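One way to make such a lock-in harder to get wrong is to keep the constants and their checks side by side. The sketch below uses only the four values quoted in this commit message; the `stype_name` helper is hypothetical, and the compile-time assertions are an alternative to (not a description of) the PR's `structure_type_constants_match_spec` unit test.

```rust
/// Sysman structure-type tags, using the values quoted from `zes_api.h`
/// in the commit message above.
pub const ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES: u32 = 0x1;
pub const ZES_STRUCTURE_TYPE_PCI_PROPERTIES: u32 = 0x2;
pub const ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES: u32 = 0x5;
pub const ZES_STRUCTURE_TYPE_LED_PROPERTIES: u32 = 0xa;

// Compile-time checks: a wrong edit to the constants fails the build
// instead of waiting for a test run.
const _: () = assert!(ZES_STRUCTURE_TYPE_PCI_PROPERTIES == 0x2);
const _: () = assert!(ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES == 0x5);

/// Hypothetical helper that names a tag, handy in debug logs when a
/// struct was labelled as the wrong family.
pub fn stype_name(stype: u32) -> Option<&'static str> {
    match stype {
        ZES_STRUCTURE_TYPE_DEVICE_PROPERTIES => Some("DEVICE_PROPERTIES"),
        ZES_STRUCTURE_TYPE_PCI_PROPERTIES => Some("PCI_PROPERTIES"),
        ZES_STRUCTURE_TYPE_ENGINE_PROPERTIES => Some("ENGINE_PROPERTIES"),
        ZES_STRUCTURE_TYPE_LED_PROPERTIES => Some("LED_PROPERTIES"),
        _ => None,
    }
}
```

Either form still depends on a human copying the right number from the header, which is exactly how the original test ended up locking in the wrong values.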
Implementation Review Summary
Intent
Add an opt-in Intel Level Zero backend behind a default-off …
Findings Addressed
Verified Correct (no fix needed)
Remaining Items
Verification
Verdict
Ready for security/perf review pending CI green on the latest push. The one CRITICAL finding (wrong Sysman …
PR Security & Performance Review
Re-review after the two earlier auto-fixes (commits …). The remaining findings are all in the new Level Zero FFI surface, in descending severity:
HIGH: Unbounded driver-reported handle counts in vec allocation
MEDIUM —
| Value | Spec name | PR name |
|---|---|---|
| 9 | MEDIA_ENHANCEMENT_SINGLE | RENDER_COMPUTE_ALL ← wrong |
| 10 | 3D_SINGLE (deprecated) | 3D_ALL ← wrong |
| 11 | 3D_RENDER_COMPUTE_ALL (deprecated) | 3D_SINGLE ← wrong |
| 12 | RENDER_ALL | MEDIA_ENHANCEMENT_SINGLE ← wrong |
| 13 | 3D_ALL (deprecated) | not declared |
| 14 | MEDIA_CODEC_SINGLE | not declared |
is_tracked_engine only matches values 4-8 (COMPUTE_SINGLE, RENDER_SINGLE, MEDIA_DECODE_SINGLE, MEDIA_ENCODE_SINGLE, COPY_SINGLE), so the runtime classification of tracked engines is correct on real Arc / Battlemage hardware. The bug is twofold:
- The lock-in test `engine_group_enum_values_match_spec` asserts these wrong values "match the spec"; the safety net the PR description praises is fictional from value 9 onwards. A future spec drift in the 4-8 range would still be caught, but the test reads as if the entire enum is locked.
- Future maintainers adding tracked engines (e.g. wiring `RENDER_ALL` to capture the unified render-engine aggregate the spec recommends for Arc) would write `ffi::ZES_ENGINE_GROUP_RENDER_ALL` expecting value 12, but the constant resolves to `MEDIA_ENHANCEMENT_SINGLE` (9 in spec). Silent misclassification at the source.
Fix: renumber 9-12 to the spec values, add RENDER_ALL = 12 and MEDIA_CODEC_SINGLE = 14, drop or rename the deprecated 3D_SINGLE = 10 / 3D_RENDER_COMPUTE_ALL = 11 slots, update the test to assert against the spec, and update the doc comment that references the spec URL.
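A sketch of what the corrected constant set could look like is given below. Values 9 through 14 follow the table above; the assignment of 4 through 8 to individual names is taken from the order in which the review lists them and should be checked against `zes_api.h`. The body of `is_tracked_engine` is an assumption based on the statement that it matches values 4-8.

```rust
// Single-engine groups tracked at runtime (4..=8).
pub const ZES_ENGINE_GROUP_COMPUTE_SINGLE: u32 = 4;
pub const ZES_ENGINE_GROUP_RENDER_SINGLE: u32 = 5;
pub const ZES_ENGINE_GROUP_MEDIA_DECODE_SINGLE: u32 = 6;
pub const ZES_ENGINE_GROUP_MEDIA_ENCODE_SINGLE: u32 = 7;
pub const ZES_ENGINE_GROUP_COPY_SINGLE: u32 = 8;

// Values the review found mislabelled or missing (9..=14).
pub const ZES_ENGINE_GROUP_MEDIA_ENHANCEMENT_SINGLE: u32 = 9;
pub const ZES_ENGINE_GROUP_3D_SINGLE: u32 = 10; // deprecated in the spec
pub const ZES_ENGINE_GROUP_3D_RENDER_COMPUTE_ALL: u32 = 11; // deprecated
pub const ZES_ENGINE_GROUP_RENDER_ALL: u32 = 12;
pub const ZES_ENGINE_GROUP_3D_ALL: u32 = 13; // deprecated
pub const ZES_ENGINE_GROUP_MEDIA_CODEC_SINGLE: u32 = 14;

/// Only the per-engine `_SINGLE` groups in 4..=8 are tracked; aggregate
/// `_ALL` groups are skipped to avoid double-counting.
pub fn is_tracked_engine(group: u32) -> bool {
    (ZES_ENGINE_GROUP_COMPUTE_SINGLE..=ZES_ENGINE_GROUP_COPY_SINGLE).contains(&group)
}
```

With this numbering, a maintainer who writes `ZES_ENGINE_GROUP_RENDER_ALL` gets 12 as expected, and it remains untracked until someone opts it in deliberately.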
LOW — std::env::set_var SAFETY contract violation
src/device/readers/intel_gpu_level_zero/loader.rs:206-219. The SAFETY comment claims "no L0 thread has begun yet (every L0 caller goes through ensure_runtime, which is what we are inside of)" — but the Rust 2024 unsafety of set_var is about any thread reading the environment, not specifically L0 threads. IntelGpuReader::new() is called inside run_collection_loop after the tokio runtime is up and after the axum server (and tracing, and reqwest, and ssh transport) threads are spawned. set_var therefore runs while many threads are alive and may be calling getenv indirectly via glibc internals (locale, time zone, libssl, etc.). This is the exact scenario Rust 2024 marked unsafe.
The window is small (one set_var call early in the first collection iteration) and ZES_ENABLE_SYSMAN itself is read only by the L0 loader, so the real-world chance of UB manifesting is very low. But the SAFETY comment is misleading and the contract is violated.
Either (a) move the env-var injection to a hook called by main before the tokio runtime spawns workers (cleanest, matches the Rust 2024 guidance), or (b) update the SAFETY comment to accurately describe the race window and document why the team accepts it for v1. I would prefer (a) but (b) is acceptable if the orchestration cost is too high.
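Option (a) could be sketched as a small startup hook like the one below. The function name is hypothetical, and skipping the write when the variable is already set is an assumption about the desired behaviour. The `unsafe` block is required on the Rust 2024 edition and merely redundant on earlier editions, hence the lint allowance.

```rust
use std::env;

/// Sets `ZES_ENABLE_SYSMAN=1` unless the user already provided a value.
/// Returns `true` if this call performed the write.
///
/// Intended to be the first thing `main` does, before the async runtime
/// (or anything else) spawns threads.
#[allow(unused_unsafe)]
pub fn enable_sysman_before_threads() -> bool {
    if env::var_os("ZES_ENABLE_SYSMAN").is_some() {
        return false;
    }
    // SAFETY: the caller guarantees no other thread exists yet, so nothing
    // can be reading the process environment concurrently.
    unsafe {
        env::set_var("ZES_ENABLE_SYSMAN", "1");
    }
    true
}
```

A binary would call `enable_sysman_before_threads()` at the top of a plain `fn main` and only then build the tokio runtime, which keeps the SAFETY comment accurate by construction.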
LOW — level_zero_state map grows monotonically on Windows
src/device/readers/intel_gpu_windows.rs:286-300 inserts into level_zero_state: Mutex<HashMap<String, LevelZeroState>> keyed by gpu.uuid (derived from WMI PNPDeviceID) and never removes. In practice this is bounded by the small set of Intel GPUs in the machine, but on a system where PNPDeviceID is unstable across reboots (rare but possible after driver reinstall) the map would accumulate dead entries. Not worth fixing in v1; just worth noting in the module doc.
Items that are fine
- FFI memory safety. Function pointers are extracted by `*ze_init` (deref to copy out a `Copy` `fn` pointer) and the underlying `Library` is held in `LzRuntime` inside a `OnceCell` for process lifetime, so there are no dangling symbols. All `unsafe { (api.fn)(...) }` call sites have driver-side null-buffer / count-only semantics and check the return code before reading the out-parameter. Opaque handles (`zes_device_handle_t`, `zes_engine_handle_t`, `zes_pwr_handle_t`) are never freed and never `Box::from_raw`'d; they live for runtime lifetime, matching the spec's process-scoped opacity.
- Library handle leak via `OnceCell` matches the `tpu_pjrt` pattern; no `Drop` impl on `LzRuntime` accidentally unloads.
- Library search paths are static `const` strings; no user-controlled input reaches `Library::new`.
- Energy and active-time counter deltas use `saturating_sub` (`refresh.rs:234, 238, 259, 263`). Backwards-clock, zero-delta, energy-reset, overrun-clamp, and seeding cases all have explicit tests that assert the right behaviour (return 0.0 / return None / clamp to 100).
- The `Metrics Source` baseline / augmented strings flip exactly once and only when fresh data arrives.
- PCI BDF lowercase formatting (`{:04x}:{:02x}:{:02x}.{:x}`) is locked in by `pci_bdf_format_matches_sysfs` and matches the `/sys/class/drm/cardN/device` symlink-target basename layout.
- Per-card `Mutex<LevelZeroState>` provides correct serialisation; lock order (per-card → global `LZ_RUNTIME`) is consistent across Linux and Windows.
- Library-not-found / runtime-absent paths return `None` cleanly with `debug!` logs only, verified by `try_load_library_returns_none_for_nonexistent_path`, `enumerated_pci_bdfs_empty_when_runtime_absent`, `refresh_returns_none_without_runtime`.
- Default-feature binary contains zero L0 references (`nm -D | grep -ciE 'zes_|ze_loader|level_zero'` = 0).
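The BDF format string quoted in the list above can be exercised with a small round-trip sketch. The `PciAddress` struct and both function names are hypothetical; only the format string itself comes from the review.

```rust
/// Hypothetical PCI address holder (domain:bus:device.function).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PciAddress {
    pub domain: u32,
    pub bus: u32,
    pub device: u32,
    pub function: u32,
}

/// Formats the address the way sysfs names the device directory:
/// lowercase hex, zero-padded as `DDDD:BB:DD.F`.
pub fn format_bdf(addr: PciAddress) -> String {
    format!(
        "{:04x}:{:02x}:{:02x}.{:x}",
        addr.domain, addr.bus, addr.device, addr.function
    )
}

/// Parses a `DDDD:BB:DD.F` string back into its parts; `None` on any
/// malformed input.
pub fn parse_bdf(text: &str) -> Option<PciAddress> {
    let (domain, rest) = text.split_once(':')?;
    let (bus, rest) = rest.split_once(':')?;
    let (device, function) = rest.split_once('.')?;
    let hex = |part: &str| u32::from_str_radix(part, 16).ok();
    Some(PciAddress {
        domain: hex(domain)?,
        bus: hex(bus)?,
        device: hex(device)?,
        function: hex(function)?,
    })
}
```

Keeping the output lowercase matters because the lookup compares strings: `0000:0A:00.0` and `0000:0a:00.0` would not match.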
Recommendation
Three changes recommended before merge:
- HIGH: cap the driver-supplied count in all four `vec![ptr; count as usize]` sites.
- MEDIUM: fix the `ze_bool_t` type (`u8`, not `u32`) for all four boolean struct fields.
- MEDIUM: correct enum values 9-12 (and add `RENDER_ALL=12`, `MEDIA_CODEC_SINGLE=14`) so the lock-in test is honest.
The LOW items (set_var SAFETY comment, Windows state-map growth) can ship as-is for v1 with a note for a follow-up.
…ngine-group constants
Three Sysman-layer fixes surfaced during PR security review.
1. DoS guard around driver-reported handle counts. Every count-then-buffer call site in the L0 loader and refresh paths previously did `vec![ptr; count as usize]` against a raw `u32` returned by the driver. A buggy or hostile driver returning `u32::MAX` would have triggered a ~32 GiB allocation. Added `MAX_L0_HANDLES = 256` (mirroring `MAX_DEVICES`) plus a shared `cap_handle_count` helper that clamps the count and emits a one-shot tracing warning on overflow. Applied at all four call sites (drivers + devices in `loader::enumerate_devices`, engine groups + power domains in `refresh::populate_*`). Each capped enumeration also truncates the Vec to the actual driver-written prefix.
2. `ze_bool_t` ABI mismatch in two FFI structs. Per the upstream header `typedef uint8_t ze_bool_t;`. The PR declared `zes_pci_properties_t::{have_bandwidth_counters,have_packet_counters,have_replay_counters}` and `zes_engine_properties_t::on_subdevice` as `u32`, inflating `zes_pci_properties_t` from the spec-correct 56 bytes to 64 bytes. Changed all four fields to `u8`. Added six `#[cfg(target_pointer_width = "64")]` size-assertion tests that lock the layouts to the C spec sizes (16/16/56/32/16/16). Verified the C struct sizes by compiling a faithful replica with the system C compiler.
3. Engine-group enum values 9..=14 corrected to match the spec. Cross-checked against the upstream `zes_api.h` `_zes_engine_group_t` definition. Fixed values: `MEDIA_ENHANCEMENT_SINGLE=9`, `3D_SINGLE=10` (was 11), `3D_RENDER_COMPUTE_ALL=11` (renamed from `RENDER_COMPUTE_ALL`, was 9), `RENDER_ALL=12` (new), `3D_ALL=13` (was 10), `MEDIA_CODEC_SINGLE=14` (new). Updated the lock-in test to assert the corrected values and added the two previously missing constants. Runtime classification logic (`is_tracked_engine` matches 4..=8) is unchanged because those values were already correct.
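The `ze_bool_t` size difference from item 2 can be reproduced with a replica like the one below. The field layout is written from memory of the upstream `zes_pci_properties_t` and should be treated as an assumption to verify against `zes_api.h`; the Rust type and field names are hypothetical. On a 64-bit target the three trailing `u8` flags pad out to 56 bytes, while `u32` flags push the struct to 64.

```rust
use std::ffi::c_void;
use std::mem::size_of;

/// `ze_bool_t` is `uint8_t` in the upstream header.
pub type ZeBool = u8;

#[repr(C)]
pub struct ZesPciAddress {
    pub domain: u32,
    pub bus: u32,
    pub device: u32,
    pub function: u32,
}

#[repr(C)]
pub struct ZesPciSpeed {
    pub generation: i32,
    pub width: i32,
    pub max_bandwidth: i64,
}

/// Replica with spec-correct one-byte booleans.
#[repr(C)]
pub struct ZesPciProperties {
    pub stype: u32,
    pub p_next: *mut c_void,
    pub address: ZesPciAddress,
    pub max_speed: ZesPciSpeed,
    pub have_bandwidth_counters: ZeBool,
    pub have_packet_counters: ZeBool,
    pub have_replay_counters: ZeBool,
}

/// The same struct with the original four-byte booleans, for comparison.
#[repr(C)]
pub struct ZesPciPropertiesU32Bools {
    pub stype: u32,
    pub p_next: *mut c_void,
    pub address: ZesPciAddress,
    pub max_speed: ZesPciSpeed,
    pub have_bandwidth_counters: u32,
    pub have_packet_counters: u32,
    pub have_replay_counters: u32,
}

// Lock the layout at compile time on 64-bit targets.
#[cfg(target_pointer_width = "64")]
const _: () = assert!(size_of::<ZesPciProperties>() == 56);
```

The eight extra bytes matter because the driver writes into caller-provided memory using its own idea of the layout, so every field after the first boolean would be read at the wrong offset.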
Verification:
- cargo fmt --check
- cargo clippy --lib --tests --features level_zero -- -D warnings
- cargo clippy --features level_zero -- -D warnings
- cargo test --lib device::readers::intel_gpu_level_zero --features level_zero (36 passed, was 30 before the new size-assertion tests)
… feature note Add four unit tests that exercise cap_handle_count directly: under-cap pass-through, over-cap clamp (u32::MAX), exact boundary (MAX_L0_HANDLES), and one-over boundary. These cover the DoS guard path that was only exercised indirectly through the enumerate_devices call chain, which requires a real L0 runtime. Also add a brief note to the Intel Arc entry in docs/man/all-smi.1 describing the opt-in --features level_zero build flag, the runtime library names, and the graceful degradation contract. The README and ARCHITECTURE.md already carried this information; the manpage had none.
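A minimal sketch of the guard and the four boundary cases follows. `MAX_L0_HANDLES = 256` and the name `cap_handle_count` come from the commit message; the signature, the `eprintln!` stand-in for the tracing warning, and the `alloc_handle_buffer` helper are assumptions.

```rust
use std::ffi::c_void;
use std::sync::Once;

pub const MAX_L0_HANDLES: usize = 256;

static WARN_ONCE: Once = Once::new();

/// Clamps a driver-reported count so a bogus value such as `u32::MAX`
/// cannot drive a multi-gigabyte allocation. Warns only the first time.
pub fn cap_handle_count(reported: u32) -> usize {
    let reported = reported as usize;
    if reported > MAX_L0_HANDLES {
        WARN_ONCE.call_once(|| {
            // Stand-in for a one-shot `tracing::warn!`.
            eprintln!("L0 reported {reported} handles; capping at {MAX_L0_HANDLES}");
        });
        MAX_L0_HANDLES
    } else {
        reported
    }
}

/// Hypothetical helper: allocates the buffer for the second
/// (count-then-buffer) call using the capped count.
pub fn alloc_handle_buffer(reported: u32) -> Vec<*mut c_void> {
    vec![std::ptr::null_mut(); cap_handle_count(reported)]
}
```

After the second driver call, the caller would `truncate` the vector to the count the driver actually wrote, as the earlier commit describes.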
PR Finalization Complete
Tests
Added 4 unit tests for the …
The previous test suite exercised this path only through … Total unit tests in the L0 backend suite: 40 (was 36).
Documentation
Lint / Format
Verification
Commit
Move legacy Level Zero Sysman env setup to CLI startup, prefer zesInit when exported, and bound Intel fdinfo reads to avoid unbounded allocations during process scans. Also share the test-only environment lock across config tests so cargo test remains stable under parallel execution. Refs #244, #246, #247, #248, #251, #252.
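Bounding a read is commonly done with `Read::take`. The sketch below shows that general technique and is not the commit's actual code: the `read_bounded` name and the 64 KiB limit are hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical upper bound for a single fdinfo file.
pub const MAX_FDINFO_BYTES: u64 = 64 * 1024;

/// Reads at most `limit` bytes from `reader`. Input beyond the limit is
/// silently dropped, and a multi-byte character cut at the boundary is
/// replaced rather than treated as an error.
pub fn read_bounded<R: Read>(reader: R, limit: u64) -> io::Result<String> {
    let mut bytes = Vec::new();
    reader.take(limit).read_to_end(&mut bytes)?;
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}
```

In a process scan this would wrap the opened file, for example `read_bounded(File::open(path)?, MAX_FDINFO_BYTES)`, so one misbehaving entry cannot grow the buffer without limit.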
Summary
Adds an opt-in Intel Level Zero (oneAPI) backend behind a new `level_zero` Cargo feature (default OFF). On Linux it augments PR #249's sysfs engine counters with the XMX `COMPUTE_SINGLE` engine activity and energy-counter-derived power that sysfs cannot reach; on Windows it fills the WMI gap by overwriting the zeros that `Win32_VideoController` reports for utilization and power. Hosts without the L0 loader installed (or builds without the feature) behave identically to today: no panic, no log spam, one `tracing::debug!` line.
v1 scope (what this PR ships)
- `libze_loader.so.1` (Linux) / `ze_loader.dll` (Windows) via `libloading` (no new Cargo dependency, no `bindgen`, no vendored `ze_api.h`).
- `ZES_ENABLE_SYSMAN=1` set once via `Once::call_once` before the first `zeInit` (Option A in the design notes; works on every shipping Intel runtime).
- `#[repr(C)]` FFI surface: 5 opaque handle typedefs, 4 concrete structs, 9 function-pointer typedefs, 13 enum constants. Spec-locked by unit tests.
- `zes_engine_group_t` group: `RENDER_SINGLE`, `COMPUTE_SINGLE` (the XMX class on Arc / Battlemage), `COPY_SINGLE`, `MEDIA_DECODE_SINGLE`, `MEDIA_ENCODE_SINGLE`. Aggregated `_ALL` groups are skipped to avoid double-counting.
- `zesPowerGetEnergyCounter` delta-tracking (µJ / µs = W; the conversion was caught by a unit test that originally asserted the wrong factor).
- `Mutex<LevelZeroState>` on `IntelGpuCard` (Linux, `#[cfg(feature = "level_zero")]`-gated) and a `Mutex<HashMap<String, LevelZeroState>>` keyed by PNP-id on `IntelWindowsGpuReader`.
- `DDDD:BB:DD.F` extracted from the sysfs symlink; Windows pairs WMI controllers with L0 devices by ordinal position in the sorted-BDF enumeration, since `Win32_VideoController.PNPDeviceID` does not expose the BDF in a stable, parseable form across driver versions.
- `detail["Metrics Source"]` advertises the active backend: `"sysfs (engine counters)"` → `"sysfs + Level Zero"` (Linux), `"WMI"` → `"WMI + Level Zero"` (Windows).
- `GpuInfo` integration on both platforms, library-not-found graceful degradation, and runtime-absent behaviour.
Deferred (not in this PR; explicit follow-up surface)
Every item below is intentionally out of scope; each is reasonable as its own follow-up issue:
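- Temperature (`zesTemperatureGet*`)
- Frequency (`zesFrequencyGet*`)
- Memory state (`zesMemoryGet*`)
- Per-process L0 stats (`zesDeviceProcessesGetState`)
- Fine-grained power-limit control (`zesPowerGetLimits*`, `zesPowerSetLimits*`)
- `zesInit` initialisation path (Option B) for newer Intel runtimes
- `Win32_PnPEntity.LocationInformation` for stronger Windows matching
Files touched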
New:
- `src/device/readers/intel_gpu_level_zero.rs`: public API (`LevelZeroState`, `LevelZeroReadout`, `refresh`, `apply_to_gpu_info`, `ApplyPlatform`, label / engine-group helpers, diagnostics).
- `src/device/readers/intel_gpu_level_zero/ffi.rs`: hand-written `#[repr(C)]` types, enum constants, function-pointer typedefs.
- `src/device/readers/intel_gpu_level_zero/loader.rs`: `libloading` dynamic resolution, `ZES_ENABLE_SYSMAN=1` injection, driver / device enumeration keyed by PCI BDF, `with_runtime` convenience.
- `src/device/readers/intel_gpu_level_zero/refresh.rs`: per-engine and per-power-domain sample types and delta-tracking math.
- `src/device/readers/intel_gpu_level_zero/tests.rs`: 30 unit tests.
- `src/device/readers/intel_gpu_linux/level_zero_glue.rs`: Linux-side augmentation helper (extracted to keep `intel_gpu_linux.rs` under 500 lines).
Modified:
- `Cargo.toml`: new opt-in `level_zero` feature, no new dependencies.
- `src/device/readers/mod.rs`: module registration gated on `cfg(any(target_os = "linux", target_os = "windows"))` AND `feature = "level_zero"`.
- `src/device/readers/intel_gpu_linux.rs`: one new `#[cfg(feature = "level_zero")]`-gated `Mutex<LevelZeroState>` field on `IntelGpuCard`, baseline `Metrics Source` in `ensure_static_info`, L0 augmentation call after the baseline `GpuInfo` is pushed.
- `src/device/readers/intel_gpu_linux/tests.rs`: asserts the baseline `Metrics Source` is `"sysfs (engine counters)"` on the default build.
- `src/device/readers/intel_gpu_windows.rs`: per-PNP-id state map on `IntelWindowsGpuReader`, baseline `Metrics Source = "WMI"`, ordinal-paired L0 augmentation after the WMI query returns.
- `README.md` and `docs/ARCHITECTURE.md`: paragraph describing the augmentation, when it activates, and the deferred surface.
Default build is unchanged
`cargo build` (no feature) produces a binary with zero Level Zero references. The `level_zero` module is only compiled when the feature is on AND the target is Linux or Windows.
Test plan
- `cargo check --lib --tests`
- `cargo check --lib --tests --features level_zero`
- `cargo clippy --lib --tests -- -D warnings`
- `cargo clippy --lib --tests --features level_zero -- -D warnings`
- `cargo clippy -- -D warnings`
- `cargo clippy --features level_zero -- -D warnings`
- `cargo test --lib device::readers::intel_gpu_level_zero --features level_zero` (30/30 pass)
- `cargo test --lib device::readers::intel_gpu_linux` (19/19 pass on default)
- `cargo test --lib device::readers::intel_gpu_linux --features level_zero` (19/19 pass)
- `cargo test --lib device::readers::intel_gpu_engine` (20/20 pass, both modes)
- `cargo test --lib device::readers::intel_gpu_fdinfo` (23/23 pass)
- Default build contains zero Level Zero references (`nm -D`).

Hardware-dependent ACs (XMX activity visible on Arc / Battlemage / Lunar Lake / Meteor Lake; Windows utilization / power non-zero with the L0 runtime present; Linux sysfs vs. L0 agreement) are left unchecked in the issue body and await maintainer hardware verification.
Closes #248