fix: harden Intel Level Zero and fdinfo paths#253
Merged
Conversation
Move legacy Level Zero Sysman env setup to CLI startup, prefer zesInit when exported, and bound Intel fdinfo reads to avoid unbounded allocations during process scans. Also share the test-only environment lock across config tests so cargo test remains stable under parallel execution. Refs #244, #246, #247, #248, #251, #252.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is the follow-up hardening PR after reviewing the merged Intel client GPU work from issues #244, #246, #247, and #248, implemented through PRs #245, #249, #250, and #251.
The merged implementation is broadly aligned with the intended direction:
level_zerofeature and dynamically loads the Intel loader at runtime.This PR fixes the remaining hardening gaps found during the post-merge audit.
Changes
Intel fdinfo process scan hardening
std::fs::read_to_string()on/proc/<pid>/fdinfo/*with a bounded helper.Level Zero Sysman initialization hardening
zesInitfrom the dynamically loaded Level Zero loader.zesInitfor modern runtimes instead of relying only onZES_ENABLE_SYSMAN.ZES_ENABLE_SYSMAN=1mutation out of lazy runtime initialization and into CLI startup, before Tokio runtime creation or signal/background task spawning.unsafeenvironment mutation behind an explicit startup-only function with a documented safety contract.zesInit, the Level Zero backend now degrades cleanly instead of mutating the process environment after threads may exist.Test stability hardening
common::configandcommon::config_filetests.cargo testrace where parallel tests mutated overlappingALL_SMI_ENERGY_*variables with separate mutexes.Build/default-feature behavior verified
The default build still does not include the Level Zero backend. I rebuilt
all-smiwithout--features level_zeroand verified the binary contains no strings matching:libze_loaderze_loaderZES_ENABLE_SYSMANzesInitzesDeviceLevel ZeroSo the current behavior remains:
libamdgpu_topdependency.--features level_zero.Follow-up design issue
A larger Sysman-first metric-source redesign is intentionally not implemented here because it changes metric precedence and expands the Level Zero FFI surface. I filed that as #252 with detailed implementation guidance.
The intended long-term policy is:
Validation
Passed locally:
cargo fmt --all --checkgit diff --checkcargo check --lib --testscargo check --lib --tests --features level_zerocargo clippy --lib --tests -- -D warningscargo clippy --lib --tests --features level_zero -- -D warningscargo clippy -- -D warningscargo clippy --features level_zero -- -D warningscargo test --lib common::config -- --nocapturecargo test --lib common::config_file -- --nocapturecargo test --lib device::readers::intel_gpu_fdinfocargo test --lib device::readers::intel_gpu_level_zero --features level_zerocargo build --bin all-smistrings target/debug/all-smi | rg -i "libze_loader|ze_loader|ZES_ENABLE_SYSMAN|zesInit|zesDevice|Level Zero"returned no matchescargo test --verboseNote: the local commit hook attempted an additional
furiosafeature build and failed before commit completion logs with a host toolchain/sysroot issue fromfuriosa-smi-rsbindgen (stdarg.hnot found). That path is unrelated to this Intel hardening PR; the commit was still created, and the default plus Level Zero validation above passed.Refs #244.
Refs #246.
Refs #247.
Refs #248.
Refs #251.
Refs #252.