fix: harden Intel Level Zero and fdinfo paths #253

Merged
inureyes merged 1 commit into main from fix/issue-248-intel-postmerge-hardening
May 27, 2026
@inureyes

Summary

This is the follow-up hardening PR after reviewing the merged Intel client GPU work from issues #244, #246, #247, and #248, implemented through PRs #245, #249, #250, and #251.

The merged implementation is broadly aligned with the intended direction:

  • Intel Linux/Windows baseline readers are present in the default build.
  • Linux Intel utilization and process memory use sysfs / DRM fdinfo fallbacks without requiring Level Zero.
  • Level Zero remains opt-in behind the level_zero feature and dynamically loads the Intel loader at runtime.
  • The merged Level Zero code already fixed the earlier high-risk FFI issues around driver-reported count caps and Sysman struct layout/constants.

This PR fixes the remaining hardening gaps found during the post-merge audit.

Changes

Intel fdinfo process scan hardening

  • Replaces unbounded std::fs::read_to_string() on /proc/<pid>/fdinfo/* with a bounded helper.
  • Caps each fdinfo read at 64 KiB, which is far above real DRM fdinfo sizes but prevents pathological procfs or synthetic test inputs from forcing unbounded allocation during process scans.
  • Adds unit coverage for normal and oversized fdinfo reads.
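
A bounded read of this kind can be sketched as follows. This is an illustration, not the code merged in this PR: the names `read_bounded`, `read_fdinfo_bounded`, and `MAX_FDINFO_BYTES` are hypothetical, and rejecting oversized input (rather than truncating it) is an assumption.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical constant mirroring the 64 KiB cap described above.
const MAX_FDINFO_BYTES: u64 = 64 * 1024;

/// Reads at most `limit` bytes from `reader` and returns them as text.
/// Returns `Ok(None)` when the input is larger than `limit`
/// (assumed policy: oversized input is rejected, not truncated).
fn read_bounded<R: Read>(reader: R, limit: u64) -> io::Result<Option<String>> {
    let mut buf = Vec::new();
    // Allow one byte past the limit so oversized input can be detected
    // without ever allocating more than `limit + 1` bytes.
    reader.take(limit.saturating_add(1)).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Ok(None);
    }
    Ok(Some(String::from_utf8_lossy(&buf).into_owned()))
}

/// Convenience wrapper for a single `/proc/<pid>/fdinfo/<fd>` path.
fn read_fdinfo_bounded(path: &Path) -> io::Result<Option<String>> {
    read_bounded(File::open(path)?, MAX_FDINFO_BYTES)
}
```

Taking a generic `Read` keeps the helper testable with in-memory byte slices, so the oversized case can be covered without touching procfs.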

Level Zero Sysman initialization hardening

  • Resolves optional zesInit from the dynamically loaded Level Zero loader.
  • Uses zesInit for modern runtimes instead of relying only on ZES_ENABLE_SYSMAN.
  • Moves the legacy ZES_ENABLE_SYSMAN=1 mutation out of lazy runtime initialization and into CLI startup, before Tokio runtime creation or signal/background task spawning.
  • Keeps the unsafe environment mutation behind an explicit startup-only function with a documented safety contract.
  • If a library caller does not use the CLI startup path and the loader lacks zesInit, the Level Zero backend now degrades cleanly instead of mutating the process environment after threads may exist.
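
The resulting initialization decision can be summarized in a small sketch. The enum and function names are hypothetical, and the real backend resolves `zesInit` from the dynamically loaded loader rather than receiving booleans; only the ordering is taken from the bullets above.

```rust
/// Hypothetical outcome of the Sysman initialization decision.
#[derive(Debug, PartialEq, Eq)]
enum SysmanInit {
    /// The loader exports `zesInit`, so it is called directly.
    ZesInit,
    /// Legacy path: `ZES_ENABLE_SYSMAN=1` was set at CLI startup,
    /// before any threads existed.
    LegacyEnv,
    /// Neither path is usable; the backend degrades to "unavailable".
    Unavailable,
}

/// Picks an initialization path without ever mutating the environment.
/// The environment is only inspected here via a flag recorded at startup.
fn choose_sysman_init(
    loader_exports_zes_init: bool,
    legacy_env_set_at_startup: bool,
) -> SysmanInit {
    if loader_exports_zes_init {
        SysmanInit::ZesInit
    } else if legacy_env_set_at_startup {
        SysmanInit::LegacyEnv
    } else {
        SysmanInit::Unavailable
    }
}
```

The point of this shape is that the lazy path is read-only: a library caller that skips the CLI startup function and loads an older loader ends up in `Unavailable` instead of triggering a late environment write.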

Test stability hardening

  • Shares one test-only environment lock between common::config and common::config_file tests.
  • This fixes a real cargo test race where parallel tests mutated overlapping ALL_SMI_ENERGY_* variables with separate mutexes.
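
A shared lock of this kind might look like the sketch below. `ENV_LOCK` and `lock_env` are hypothetical names, and recovering from a poisoned mutex is an assumption rather than something the PR states.

```rust
use std::sync::{Mutex, MutexGuard};

/// Single process-wide lock for tests that mutate environment variables.
/// Both test modules must use this same static; two separate mutexes
/// would not exclude each other.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Acquires the shared lock. A poisoned lock is recovered so that one
/// panicking test does not make every later environment test fail.
fn lock_env() -> MutexGuard<'static, ()> {
    ENV_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}
```

Each test would hold the returned guard for as long as it reads or writes the `ALL_SMI_ENERGY_*` variables, which serializes those tests even when `cargo test` runs them on parallel threads.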

Build/default-feature behavior verified

The default build still does not include the Level Zero backend. I rebuilt all-smi without --features level_zero and verified the binary contains no strings matching:

  • libze_loader
  • ze_loader
  • ZES_ENABLE_SYSMAN
  • zesInit
  • zesDevice
  • Level Zero

So the current behavior remains:

  • AMD Linux glibc support is included by default through the target-specific libamdgpu_top dependency.
  • Intel baseline Linux/Windows readers are included by default.
  • Intel Level Zero Sysman advanced metrics remain opt-in via --features level_zero.

Follow-up design issue

A larger Sysman-first metric-source redesign is intentionally not implemented here because it changes metric precedence and expands the Level Zero FFI surface. I filed that as #252 with detailed implementation guidance.

The intended long-term policy is:

  • temperature: Sysman -> hwmon -> unavailable
  • power: Sysman energy counter delta -> hwmon power -> unavailable
  • memory: Sysman memory state -> DRM/sysfs -> unavailable
  • engine util: Sysman engine activity delta -> DRM/sysfs/perf fallback -> unavailable
  • fan: hwmon -> Sysman fan if available -> unavailable
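
Each line of that policy is an ordered fallback chain, which maps naturally onto `Option::or`. The sketch below is illustrative only: the function names and value types are assumptions, and #252 may choose a different design.

```rust
/// Temperature policy: Sysman first, then hwmon, else unavailable (`None`).
fn pick_temperature_c(sysman: Option<f64>, hwmon: Option<f64>) -> Option<f64> {
    sysman.or(hwmon)
}

/// Fan policy uses the opposite order: hwmon first, then Sysman.
fn pick_fan_rpm(hwmon: Option<u32>, sysman: Option<u32>) -> Option<u32> {
    hwmon.or(sysman)
}
```

Modeling "unavailable" as `None` keeps the precedence visible in one expression per metric.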

Validation

Passed locally:

  • cargo fmt --all --check
  • git diff --check
  • cargo check --lib --tests
  • cargo check --lib --tests --features level_zero
  • cargo clippy --lib --tests -- -D warnings
  • cargo clippy --lib --tests --features level_zero -- -D warnings
  • cargo clippy -- -D warnings
  • cargo clippy --features level_zero -- -D warnings
  • cargo test --lib common::config -- --nocapture
  • cargo test --lib common::config_file -- --nocapture
  • cargo test --lib device::readers::intel_gpu_fdinfo
  • cargo test --lib device::readers::intel_gpu_level_zero --features level_zero
  • cargo build --bin all-smi
  • default binary Level Zero string audit with strings target/debug/all-smi | rg -i "libze_loader|ze_loader|ZES_ENABLE_SYSMAN|zesInit|zesDevice|Level Zero" returned no matches
  • cargo test --verbose

Note: the local commit hook also attempted a furiosa feature build, which failed with a host toolchain/sysroot error from furiosa-smi-rs bindgen (stdarg.h not found). That path is unrelated to this Intel hardening PR. The commit was still created, and the default and Level Zero validation above passed.

Refs #244.
Refs #246.
Refs #247.
Refs #248.
Refs #251.
Refs #252.

Move legacy Level Zero Sysman env setup to CLI startup, prefer zesInit when exported, and bound Intel fdinfo reads to avoid unbounded allocations during process scans.

Also share the test-only environment lock across config tests so cargo test remains stable under parallel execution.

Refs #244, #246, #247, #248, #251, #252.
@inureyes added the type:bug, status:review, and priority:medium labels on May 27, 2026
@inureyes inureyes merged commit f2fff7e into main May 27, 2026
4 checks passed
@inureyes inureyes deleted the fix/issue-248-intel-postmerge-hardening branch May 27, 2026 07:24
@inureyes added the status:done label and removed the status:review label on May 27, 2026