feat(intel-gpu): per-process GPU memory accounting via fdinfo by inureyes · Pull Request #250 · lablup/all-smi

inureyes · 2026-05-27T03:13:53Z

Summary

Implements per-process GPU memory accounting for the Intel client GPU reader on Linux by parsing DRM client /proc/<pid>/fdinfo/<fd> blocks for both the i915 and xe kernel drivers. Replaces the Vec::new() stub in IntelGpuReader::get_process_info() so any TUI tab, dashboard, or scraper that consumes ProcessInfo (e.g. the existing per-process columns in the all-smi view TUI) now sees Intel-GPU-using processes alongside AMD/NVIDIA ones.

What changed

New module src/device/readers/intel_gpu_fdinfo.rs (and intel_gpu_fdinfo/enrichment.rs, intel_gpu_fdinfo/tests.rs):

parse_fdinfo — pure-Rust parser for the i915 and xe fdinfo schemas. Sums drm-resident-{local0,system} (i915) or drm-resident-{vram0,gtt} (xe) into a single resident_bytes total (correct on integrated GPUs too, where only the system/GTT key is present). Rejects foreign drivers (amdgpu / nvidia / nouveau) so cross-vendor hosts cannot leak the wrong reader's processes. Tolerates truncated content during process teardown without panicking.
build_intel_drm_basenames — walks /sys/class/drm/ once and resolves both cardN and renderD<M> minors that point at each reader-enumerated Intel PCI bus. Both nodes share a card index so modern Vulkan / oneAPI / ffmpeg workloads that open the render node (no master/setmaster permission flow) are captured.
intel_drm_fds_for_pid — reads /proc/<pid>/fd/ and returns the fds pointing at known Intel DRM nodes. Permission errors (EACCES for fds owned by another user) degrade silently per-process.
collect_intel_gpu_processes — top-level aggregator. Walks /proc, dedupes fds by drm-client-id per (pid, card_index) (avoiding N× over-counting from dup(2)d fds), but sums across distinct clients (multi-context workloads). Bounded by a MAX_GPU_PROCESSES = 4096 cap. Returns deterministic, PID-sorted output.
build_intel_process_infos (in enrichment.rs) — runs one minimal sysinfo refresh_processes_specifics (cpu + memory + user) and merges sysinfo metadata into ProcessInfo rows. The approach mirrors src/device/readers/amd.rs exactly so cross-vendor consumers see a consistent shape.
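
The parsing rules described for parse_fdinfo can be illustrated with a minimal sketch. This is not the code in intel_gpu_fdinfo.rs: the FdInfoSummary struct, its field names, and the MiB branch of the unit table are assumptions made for the example. Only the behaviour (accept i915 and xe, sum every drm-resident-* key, skip malformed lines, convert kB to bytes with saturating arithmetic) comes from the description above.

```rust
/// Summary of one DRM fdinfo block (hypothetical shape for this sketch).
#[derive(Debug, PartialEq)]
pub struct FdInfoSummary {
    pub driver: String,
    pub client_id: Option<u64>,
    pub resident_bytes: u64,
}

/// Convert a value such as "16384 KiB" into bytes.
/// A bare number is taken as bytes; an unknown unit yields None.
fn parse_memory_value(raw: &str) -> Option<u64> {
    let mut parts = raw.split_whitespace();
    let number: u64 = parts.next()?.parse().ok()?;
    let unit = parts.next().map(str::to_ascii_lowercase);
    let multiplier: u64 = match unit.as_deref() {
        None => 1,
        Some("kb") | Some("kib") => 1024,
        Some("mb") | Some("mib") => 1024 * 1024, // assumption: not stated in the PR
        Some(_) => return None,
    };
    Some(number.saturating_mul(multiplier))
}

/// Parse one fdinfo block. Returns None unless drm-driver is i915 or xe.
pub fn parse_fdinfo(content: &str) -> Option<FdInfoSummary> {
    let mut driver: Option<&str> = None;
    let mut client_id: Option<u64> = None;
    let mut resident_bytes: u64 = 0;

    for line in content.lines() {
        // Lines look like "key:\tvalue"; a truncated tail without ':' is skipped.
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let (key, value) = (key.trim(), value.trim());
        match key {
            "drm-driver" => driver = Some(value),
            "drm-client-id" => client_id = value.parse().ok(),
            k if k.starts_with("drm-resident-") => {
                // Non-numeric or empty values are ignored rather than treated as errors.
                if let Some(bytes) = parse_memory_value(value) {
                    resident_bytes = resident_bytes.saturating_add(bytes);
                }
            }
            _ => {}
        }
    }

    // Foreign drivers (amdgpu, nouveau, ...) are rejected here.
    let driver = driver.filter(|d| matches!(*d, "i915" | "xe"))?;
    Some(FdInfoSummary {
        driver: driver.to_string(),
        client_id,
        resident_bytes,
    })
}
```

Because drm-total-* keys never match the drm-resident- prefix, they do not contribute to the total in this sketch.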

src/device/readers/intel_gpu_linux.rs:

IntelGpuReader gains an intel_drm_basenames: HashMap<String, usize> cached at construction time and a proc_root: PathBuf field for test injection.
IntelGpuCard struct shape is unchanged — no new Mutex<...> fields — so the upcoming Level Zero work in #248 (feat(intel-gpu): Level Zero (oneAPI) integration for advanced metrics on Linux and Windows) has no rebase friction.
get_process_info() body is ~10 lines: builds the (card_index, uuid) table fresh each call (so ProcessInfo.device_uuid always matches the contemporaneous GpuInfo.uuid) and delegates to build_intel_process_infos.
Detection helpers (has_intel_client_gpu_from_root, line_matches_intel_gpu) moved into a new sibling intel_gpu_linux/detection.rs to stay under the 500-line per-file budget. The public has_intel_client_gpu() is a 4-line wrapper.

Path-injection coverage: every public helper in intel_gpu_fdinfo accepts an explicit proc_root / drm_root, and IntelGpuReader has a private new_with_roots(drm_root, proc_root) constructor. The full pipeline is tested under tempfile::tempdir fixtures without touching the real /proc or /sys/class/drm.
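
To show why an explicit drm_root makes the walker testable, here is a hedged sketch of a basename resolver that takes the sysfs root as a parameter. The function name build_basenames, its signature, and the helper is_drm_node_name are hypothetical; the real build_intel_drm_basenames may differ. The sketch assumes each `<node>/device` entry is a symlink whose target basename is the PCI bus identifier.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::Path;

/// True for "card0" or "renderD128"; false for connector children such as "card0-DP-1".
fn is_drm_node_name(name: &str) -> bool {
    ["renderD", "card"].iter().any(|prefix| {
        name.strip_prefix(prefix).map_or(false, |rest| {
            !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit())
        })
    })
}

/// Map DRM node basenames to the index of the Intel card they belong to.
/// `intel_bus_ids[i]` is the PCI bus id of card index `i`, e.g. "0000:00:02.0".
pub fn build_basenames(drm_root: &Path, intel_bus_ids: &[&str]) -> HashMap<String, usize> {
    let mut map = HashMap::new();
    // An unreadable or missing root simply yields an empty map.
    let Ok(entries) = fs::read_dir(drm_root) else {
        return map;
    };
    for entry in entries.flatten() {
        let name = entry.file_name().to_string_lossy().into_owned();
        if !is_drm_node_name(&name) {
            continue;
        }
        // Read the symlink target without following it.
        let Ok(target) = fs::read_link(entry.path().join("device")) else {
            continue;
        };
        let Some(bus) = target.file_name().and_then(|b| b.to_str()) else {
            continue;
        };
        // Nodes whose PCI bus id is not in the Intel list are left out.
        if let Some(index) = intel_bus_ids.iter().position(|id| *id == bus) {
            map.insert(name, index);
        }
    }
    map
}
```

A test can then point drm_root at a temporary directory containing `card0/device` and `renderD128/device` symlinks and check that both names map to the same index.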

Files touched:

src/device/readers/mod.rs (registered new module)
src/device/readers/intel_gpu_linux.rs (integration + detection split)
src/device/readers/intel_gpu_linux/detection.rs (new, extracted)
src/device/readers/intel_gpu_linux/tests.rs (3 new end-to-end tests)
src/device/readers/intel_gpu_fdinfo.rs (new module)
src/device/readers/intel_gpu_fdinfo/enrichment.rs (new, sysinfo merge)
src/device/readers/intel_gpu_fdinfo/tests.rs (23 unit tests)
docs/ARCHITECTURE.md (Intel section updated)

Deferred: per-process engine-time utilization

The issue body lists per-process engine-time utilization as a stretch goal that would reuse PR #249's EngineState machinery with a per-PID delta tracker keyed on drm-client-id. That work is intentionally deferred to a separate follow-up PR. Reasoning: the stretch goal would add a Mutex<ProcessEngineState> field to IntelGpuCard, and the next issue in the queue (#248, Level Zero integration) will also touch the struct shape. Keeping IntelGpuCard unchanged here makes #248's rebase trivial. The deferred work has a clean integration point — the same fdinfo walker can sample drm-engine-* counters next to the drm-resident-* keys it already reads — so the follow-up is mechanical.
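
As a rough illustration of what the deferred delta tracker would compute, the sketch below turns two cumulative busy-time samples into a utilization percentage. Nothing here is part of this PR: EngineSample and utilization_percent are hypothetical names, and the sketch assumes the drm-engine-* counters are cumulative nanoseconds paired with a monotonic sample timestamp.

```rust
/// One busy-time sample for a DRM client (hypothetical shape).
#[derive(Clone, Copy, Debug)]
pub struct EngineSample {
    /// Cumulative busy time in nanoseconds, as read from a drm-engine-* key.
    pub busy_ns: u64,
    /// Monotonic timestamp of the sample, in nanoseconds.
    pub wall_ns: u64,
}

/// Utilization between two samples, in percent.
/// Returns None when no time has elapsed or when a counter moved backwards
/// (for example, because the client was recreated between samples).
pub fn utilization_percent(prev: EngineSample, curr: EngineSample) -> Option<f64> {
    let wall = curr.wall_ns.checked_sub(prev.wall_ns).filter(|d| *d > 0)?;
    let busy = curr.busy_ns.checked_sub(prev.busy_ns)?;
    // Clamp to 100 so sampling jitter cannot report more than full utilization.
    Some((busy as f64 / wall as f64 * 100.0).min(100.0))
}
```

A tracker keyed on (pid, drm-client-id) would keep the previous EngineSample per key and call this on each refresh.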

Test plan

cargo check --lib --tests clean
cargo clippy --lib --tests -- -D warnings clean
cargo clippy -- -D warnings clean (bin-target path that caught the bin-only unused-import issue in PR #249, feat(intel-gpu): compute Linux utilization from engine-busy counters)
cargo test --lib device::readers::intel_gpu_fdinfo — 23 tests pass (i915 + xe parsing, integrated vs discrete, foreign-driver rejection, truncated/malformed input, kB-to-bytes, card+render mapping for one and two cards, AMD render-node rejection, connector-child filtering, client-id dedup, distinct-client summing, multi-card grouping, no-Intel-card short-circuit, missing-client-id tolerance)
cargo test --lib device::readers::intel_gpu_linux — 19 tests pass (16 prior + 3 new: empty-when-no-cards, render-node fdinfo end-to-end, trait default filter compatibility)

Acceptance criteria status

get_process_info() returns non-empty on a host with Intel-GPU-using processes (awaits hardware verification by maintainer)
used_memory matches intel_gpu_top -p within rounding tolerance (awaits hardware verification by maintainer)
Permission errors degrade gracefully — EACCES on /proc/<pid>/fd/ is silently skipped per-process, no panics, no per-process log spam
Both i915 and xe drivers are handled — the parser branches on drm-driver and accepts both schemas; unit tests cover both
Integrated end-to-end — ProcessInfo flows through the existing GpuReader trait without per-vendor special casing (the trait's default get_gpu_processes filter is verified compatible)
(Stretch) Per-process engine-time utilization — deferred to follow-up per the scope decision above

Closes #247

Introduces a new stateless module `intel_gpu_fdinfo` that parses Intel DRM client `/proc/<pid>/fdinfo/<fd>` blocks and correlates fds back to a reader-known card index. Provides:

- `parse_fdinfo` — pure-string parser for the i915 and xe schemas. Handles truncated / malformed input without panicking, rejects foreign drivers (amdgpu / nvidia), and normalises memory values from kB to bytes.
- `build_intel_drm_basenames` — walks `/sys/class/drm` once to find every `cardN` and `renderD<M>` minor that maps to one of the reader-enumerated Intel PCI devices. Both nodes share a card index so modern Vulkan / oneAPI / ffmpeg workloads opening the render node are captured.
- `intel_drm_fds_for_pid` — reads `/proc/<pid>/fd/` and returns the fds pointing at known Intel DRM nodes. Permission errors degrade silently.
- `collect_intel_gpu_processes` — top-level aggregator. Walks `/proc`, dedupes fds by `drm-client-id` per process+card (avoiding N× over-counting from `dup(2)`d fds) but sums across distinct clients (multi-context workloads). Returns deterministic, PID-sorted output.

The module is path-injection friendly: every public helper accepts an explicit `proc_root` / `drm_root` so the entire walker is testable under `tempfile::tempdir` fixtures without touching the real `/proc` or `/sys`.

23 unit tests cover: i915 + xe parsing, integrated vs discrete schemas, foreign-driver rejection, truncated input, kB-to-bytes conversion, the card/renderD mapping for one and two cards, AMD render-node rejection, connector-child filtering, client-id dedup, distinct-client summing, multi-card grouping, and graceful no-Intel-card short-circuit.

Refs #247

Replaces the `Vec::new()` stub in `IntelGpuReader::get_process_info()` with a full implementation built on top of the new `intel_gpu_fdinfo` module. The reader now:

- Caches an `intel_drm_basenames` map at construction (one entry per `cardN` and `renderD<M>` known to belong to an Intel PCI device), so the per-process refresh is a flat `/proc` walk with no extra sysfs probing.
- Threads a `proc_root` field through the constructor for test injection — production stays `/proc`.
- Builds a `card_index -> uuid` table on each call and delegates to `build_intel_process_infos`, which collects `(pid, card_index, used_memory_bytes)` aggregates, performs one minimal sysinfo refresh, and merges sysinfo metadata (cpu_percent, user, state, rss, vms, etc.) into the final `ProcessInfo` rows. Pattern matches AMD's reader exactly so cross-vendor consumers see a consistent shape.

To stay under the 500-line per-file budget after the integration, two cohesive subsections moved into siblings:

- `intel_gpu_linux/detection.rs` — `has_intel_client_gpu_from_root` and `line_matches_intel_gpu`. The public `has_intel_client_gpu` is now a 4-line wrapper.
- `intel_gpu_fdinfo/enrichment.rs` — the sysinfo merge helper. The parent module re-exports `build_intel_process_infos`, so the public API is unchanged.

File sizes after refactor (all <500):

- `intel_gpu_linux.rs` 474
- `intel_gpu_fdinfo.rs` 483
- `intel_gpu_fdinfo/enrichment.rs` 107
- `intel_gpu_fdinfo/tests.rs` 494
- `intel_gpu_linux/detection.rs` 86

Behaviour on hosts without Intel-GPU-using processes is unchanged: an empty basename map (no Intel GPUs detected) short-circuits to `Vec::new()`, and the fdinfo walker returns empty when no process holds an Intel DRM fd.

The stretch goal (per-process engine-time deltas) is intentionally deferred to a follow-up; v1 reports `gpu_utilization = 0.0` per process.

Refs #247

Adds three end-to-end tests against `IntelGpuReader::get_process_info()` driven by a synthetic procfs and DRM sysfs tree:

- `get_process_info_returns_empty_when_no_intel_cards` — guarantees no regression on AMD-only or NVIDIA-only hosts: the empty basename map short-circuits to `Vec::new()` without touching `/proc`.
- `get_process_info_collects_fdinfo_from_render_node` — full pipeline: a synthetic Intel card with a matching `renderD<M>` render node, plus a synthetic `/proc/<pid>/fdinfo/<fd>` containing the i915 schema, must yield exactly one populated `ProcessInfo` with the correct PID, `device_id`, `device_uuid` (`Intel-GPU-<bus>` format matching `get_gpu_info`), and `used_memory` (16384 kB -> 16777216 bytes).
- `get_process_info_default_filter_keeps_uses_gpu_processes` — verifies the Intel reader is compatible with the trait's default `get_gpu_processes` filter (every emitted row has `uses_gpu = true`).

Also updates `docs/ARCHITECTURE.md`: the Intel client GPU section now lists per-process GPU memory accounting alongside engine-busy utilization, including the i915 / xe key sets parsed and the `drm-client-id` dedup behaviour.

Refs #247

inureyes · 2026-05-27T03:20:18Z

Implementation Review Summary

Intent

Fill in IntelGpuReader::get_process_info() (previously Vec::new()) with real per-process GPU memory accounting from /proc/<pid>/fdinfo/<fd> for both i915 and xe drivers. Stretch goal (per-process engine-time) intentionally deferred to keep IntelGpuCard struct shape stable for #248's rebase.

Findings Addressed

None — the implementation is correct and complete as submitted. No fixes required.

Verification

All stated requirements implemented (memory accounting via fdinfo for i915 + xe, render+card node mapping, drm-client-id dedup, EACCES silent skip, kB→bytes conversion, sysinfo enrichment mirroring AMD pattern)
No placeholder/mock code remaining (the Vec::new() stub is fully replaced; gpu_utilization: 0.0 is a documented stretch-goal deferral, not a stub)
Integrated into project code flow (IntelGpuReader::new() already registered in reader_factory.rs; get_process_info() flows through the GpuReader trait without per-vendor specialization)
Project conventions followed (conventional-commits subjects all ≤72 chars; English-only; no AI attribution in commits, PR body, or code; no unwrap()/expect() in library code; let-else and Rust 1.58+ inline format args used)
Existing modules reused (with_global_system, crate::device::process_list::get_all_processes, ProcessRefreshKind::nothing().with_cpu().with_memory().with_user(UpdateKind::OnlyIfNotSet) — exact mirror of the AMD reader pattern at src/device/readers/amd.rs:599-608)
No unintended structural changes (IntelGpuCard struct shape is unchanged per the explicit orchestrator instruction — diff confirms only IntelGpuReader gained intel_drm_basenames and proc_root fields, and the 25-line reduction in intel_gpu_linux.rs is pure code movement to detection.rs)
Tests pass (cargo check --lib --tests, cargo clippy --lib --tests -- -D warnings, cargo clippy -- -D warnings, 23 fdinfo tests, 19 linux tests — all green from a clean build)

Critical correctness checks

cardN ↔ renderD<M> mapping: build_intel_drm_basenames correctly registers BOTH nodes by walking /sys/class/drm and matching the basename of each node's device symlink target (the PCI bus identifier) against the pre-enumerated card list. Tests build_intel_drm_basenames_maps_card_and_render_to_same_index and build_intel_drm_basenames_two_cards_two_render_nodes cover the common case. Non-Intel render nodes are excluded by PCI-bus mismatch (test build_intel_drm_basenames_ignores_non_intel_render_nodes).
drm-client-id deduplication: Per-(pid, card_index), collect_intel_gpu_processes keeps a HashMap<client_id, max_bytes> (so N fds sharing one client report ONE memory amount, not N) and sums by_client.values() across distinct client IDs. Test collect_intel_gpu_processes_dedupes_by_client_id proves no double-count; collect_intel_gpu_processes_sums_distinct_clients proves distinct clients still aggregate. The dedup key is the client ID, not (pid, fd).
i915 vs xe schema: Parser uses k.starts_with("drm-resident-") to sum every resident-memory key regardless of suffix (-system, -local0, -gtt, -vram0). Integrated GPUs (with only one key present) resolve correctly. Three tests cover both schemas explicitly.
Permission errors: EACCES on /proc/<pid>/fd/, read_link, and fdinfo all degrade silently via let-else/match — no eprintln! in the entire new module.
kB → bytes: parse_memory_value multiplies by 1024 for kb/kib. Test parse_fdinfo_kb_multiplier_is_1024 verifies 1 kB → 1024 bytes.
Malformed fdinfo: Truncated mid-key, non-numeric values, missing client-id — all skipped or treated as None without panic (4 tests cover these paths).
/proc injection: IntelGpuReader::new_with_roots(drm_root, proc_root) enables tempdir-based integration tests; IntelGpuReader::new() passes Path::new("/proc") for production. Integration tests in intel_gpu_linux/tests.rs exercise the full pipeline end-to-end.
No regression for non-Intel hosts: collect_intel_gpu_processes short-circuits when intel_drm_basenames is empty. Test get_process_info_returns_empty_when_no_intel_cards covers this directly. AMD-only hosts skip the IntelGpuReader entirely via reader_factory.rs registration.
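
The dedup-then-sum rule in the drm-client-id check can be sketched as a pure function. This is an illustration under assumptions, not the body of collect_intel_gpu_processes: FdSample and aggregate are hypothetical names, and the sketch assumes every sample already carries a client ID.

```rust
use std::collections::HashMap;

/// One parsed fdinfo sample (hypothetical shape for this sketch).
pub struct FdSample {
    pub pid: u32,
    pub card_index: usize,
    pub client_id: u64,
    pub resident_bytes: u64,
}

/// Returns `(pid, card_index, used_bytes)` rows sorted by pid, then card.
pub fn aggregate(samples: &[FdSample]) -> Vec<(u32, usize, u64)> {
    // Outer key: (pid, card). Inner map: client id -> largest value seen.
    let mut by_key: HashMap<(u32, usize), HashMap<u64, u64>> = HashMap::new();
    for s in samples {
        let by_client = by_key.entry((s.pid, s.card_index)).or_default();
        let slot = by_client.entry(s.client_id).or_insert(0);
        // dup(2)'d fds share a client id, so keep one value instead of adding.
        *slot = (*slot).max(s.resident_bytes);
    }
    let mut rows: Vec<(u32, usize, u64)> = by_key
        .into_iter()
        .map(|((pid, card), by_client)| {
            // Distinct clients are separate contexts, so their memory is summed.
            let total = by_client
                .values()
                .fold(0u64, |acc, bytes| acc.saturating_add(*bytes));
            (pid, card, total)
        })
        .collect();
    // HashMap iteration order is arbitrary; sort for deterministic output.
    rows.sort_unstable();
    rows
}
```

With two fds on client 7 (1024 bytes each) and one fd on client 8 (2048 bytes) in the same process and card, the row reports 3072 bytes rather than 4096.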

Module structure

intel_gpu_fdinfo{.rs,/enrichment.rs,/tests.rs} mirrors the intel_gpu_engine{.rs,/discovery.rs,/tests.rs} 3-file pattern.
File-size budgets: intel_gpu_fdinfo.rs 483 lines, enrichment.rs 107, tests.rs 494, intel_gpu_linux.rs 474 (was 499 — kept under 500 by extracting detection.rs), detection.rs 86. All under 500.
mod.rs declares intel_gpu_fdinfo under the same #[cfg(target_os = "linux")] gate as intel_gpu_engine.

Remaining items

Hardware verification ACs (get_process_info non-empty on real Intel GPU host, used_memory matches intel_gpu_top -p) remain unchecked and are marked "(awaits hardware verification by maintainer)", which is appropriate per the convention.
Stretch goal (per-process engine-time utilization) intentionally deferred to a follow-up PR; documented in PR body and tracked by ACs.

Minor docs nit (LOW)

intel_gpu_linux.rs:146 comment says "Test-only constructor" for new_with_roots, but it is also called by IntelGpuReader::new() (production). The function is correctly not #[cfg(test)] gated — the comment should read "Internal constructor that accepts arbitrary DRM and proc roots (production code uses default paths via [IntelGpuReader::new])." Not worth a separate fix cycle.

Final verdict

Ready to advance to security/perf review. Implementation is correct, complete, properly integrated, conforms to project conventions, and preserves IntelGpuCard struct shape for #248's rebase. All narrow-scope verifications pass on a clean build.

`cargo fmt --check` (which CI enforces) flagged three formatting nits in the new module: a path-join chain that fits on one line, a `let` binding with an unnecessary multi-line break, and a function signature that fits on one line. Run rustfmt to bring the new code in line with the rest of the workspace. No behaviour change. All 42 fdinfo / intel_gpu_linux tests still pass; clippy clean on both `--lib --tests` and the default bin target.

inureyes · 2026-05-27T03:27:35Z

Security and performance review

Comprehensive review of the per-process fdinfo accounting work against the surface areas called out in the briefing. The reviewer ran cargo check --lib --tests, cargo clippy --lib --tests -- -D warnings, cargo clippy -- -D warnings, cargo test --lib device::readers::intel_gpu_fdinfo (23 pass), and cargo test --lib device::readers::intel_gpu_linux (19 pass) on this branch.

Findings

MEDIUM (auto-fixed in 9bb3d75)

cargo fmt --check flagged three formatting nits in the new module (one path-join chain, one let binding, one function signature). CI runs cargo fmt --check (see .github/workflows/ci.yml:49), so this would have blocked merge. Applied cargo fmt, verified all 42 tests still pass, pushed as a separate style(intel-gpu) commit on this branch.

LOW (informational, not blocking)

collect_intel_gpu_processes only increments the MAX_GPU_PROCESSES cap counter for processes that have at least one Intel DRM fd (src/device/readers/intel_gpu_fdinfo.rs:407-410). On a host with thousands of idle PIDs and zero Intel-using processes, the full /proc walk still pays one read_dir(/proc/<pid>/fd/) per PID. This matches the AMD reader's cost shape via libamdgpu_top::FdInfoStat::update_proc_usage and is acceptable for v1, but it means the cap bounds how many GPU-using processes are reported, not how many PIDs are enumerated. Worth noting in a follow-up if /proc walk latency ever becomes a hotspot.
The missing-drm-client-id fallback at src/device/readers/intel_gpu_fdinfo.rs:428-437 keys by fdinfo_path (which is unique per fd). Two fds in the same process to the same DRM client on a pre-drm-client-id kernel would still double-count. In practice this is a non-issue: the kernels that expose drm-resident-* (Linux >=5.19 i915, all xe) also emit drm-client-id, so the over-count only triggers on a hypothetical kernel that emits memory keys without the client-id — and on such a kernel resident_bytes is typically zero anyway.
read_to_string on fdinfo files has no take(N) upper bound (src/device/readers/intel_gpu_fdinfo.rs:413). In practice the kernel emits at most ~30 lines (~2 KB) per fdinfo, but a defence-in-depth byte cap would harden against a future kernel change emitting unbounded counters.
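
The byte cap suggested in the last finding could look roughly like the sketch below. It is a suggestion only, not code from this PR: read_capped, read_fdinfo_capped, and the 64 KiB value of MAX_FDINFO_BYTES are hypothetical. It relies on std::io::Read::take to stop reading after a fixed number of bytes.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical upper bound; the review does not specify a value.
const MAX_FDINFO_BYTES: u64 = 64 * 1024;

/// Read at most `max_bytes` from `path` and return the content as text.
pub fn read_capped(path: &Path, max_bytes: u64) -> io::Result<String> {
    let mut buf = Vec::new();
    File::open(path)?.take(max_bytes).read_to_end(&mut buf)?;
    // Lossy conversion: a multi-byte character cut at the cap does not fail the read.
    Ok(String::from_utf8_lossy(&buf).into_owned())
}

/// Convenience wrapper using the sketch's default cap.
pub fn read_fdinfo_capped(path: &Path) -> io::Result<String> {
    read_capped(path, MAX_FDINFO_BYTES)
}
```

A line cut off at the cap is no different from a file truncated during process teardown, so a parser that already tolerates truncated input needs no extra handling.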

Critical surface areas — verdict

/proc/<pid>/fd/ traversal & EACCES: intel_drm_fds_for_pid uses let Ok(entries) = std::fs::read_dir(&fd_dir) else { return Vec::new(); } (line 307). Silently degrades on EACCES and ENOENT. No eprintln!, no unwrap, no panic! anywhere in the production path. Pass.
Symlink-follow safety: The walker uses std::fs::read_link(&fd_path) (line 320), returning only the symlink target as a PathBuf without following it. It then matches target.file_name() (basename only) against the HashMap<String, usize> of known Intel DRM nodes. No read_to_string on the fd entry itself; no path-join with the read_link result; no opportunity to follow a fd pointing at a sensitive file. Pass.
Path-injection via drm-pdev: drm_pdev is captured into FdInfo (line 159) but never used to construct a path anywhere in the pipeline. Pass.
TOCTOU on process exit: read_to_string(&fd.fdinfo_path) Err → continue (line 415); read_link Err → continue (line 321); read_dir(proc_root) Err → empty Vec. Process disappearance is a normal-not-error case throughout. Pass.
fdinfo parsing safety: All parse::<u64>() calls use .ok() (line 129) or ? chains; truncated/malformed input skips lines via let Some(...) = ... else { continue; } patterns; saturating_add/saturating_mul used on every memory math op (lines 141, 181, 456, 461). Non-UTF8 fdinfo would fail at read_to_string and skip cleanly. Pass.
resident vs total selection: Only drm-resident-* keys are summed (line 139: k.starts_with("drm-resident-")). drm-total-* is intentionally ignored: per the kernel drm-usage-stats.rst spec, total counts every created buffer, including buffers whose backing store is not currently instantiated. Tests verify this. Pass.
Cache reuse: build_intel_drm_basenames is called exactly once at IntelGpuReader::new_with_roots (line 154 of intel_gpu_linux.rs); subsequent get_process_info calls read self.intel_drm_basenames (line 299). The only per-call rebuild is card_uuids (O(cards), microseconds). Pass.
sysinfo refresh narrowness: ProcessRefreshKind::nothing().with_cpu().with_memory().with_user(UpdateKind::OnlyIfNotSet) (enrichment.rs:63-66). Matches AMD exactly; no with_disk_usage, no with_environ. Pass.
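
The traversal and symlink-safety properties checked above can be illustrated with a small sketch. It is written from the review's description rather than copied from the source: IntelDrmFd, intel_fds_for_pid, and the parameter order are assumptions. The key points are that read_link inspects the symlink target without opening it, and that every error path returns or continues silently.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::{Path, PathBuf};

/// An fd that points at a known Intel DRM node (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct IntelDrmFd {
    pub card_index: usize,
    pub fdinfo_path: PathBuf,
}

pub fn intel_fds_for_pid(
    proc_root: &Path,
    pid: u32,
    basenames: &HashMap<String, usize>,
) -> Vec<IntelDrmFd> {
    let pid_dir = proc_root.join(pid.to_string());
    // EACCES (another user's process) and ENOENT (process exited) both end up here.
    let Ok(entries) = fs::read_dir(pid_dir.join("fd")) else {
        return Vec::new();
    };
    let mut found = Vec::new();
    for entry in entries.flatten() {
        // read_link returns the target path; the target itself is never opened.
        let Ok(target) = fs::read_link(entry.path()) else {
            continue;
        };
        // Only the basename is compared, and it is never joined into a new path.
        let Some(base) = target.file_name().and_then(|b| b.to_str()) else {
            continue;
        };
        if let Some(&card_index) = basenames.get(base) {
            found.push(IntelDrmFd {
                card_index,
                fdinfo_path: pid_dir.join("fdinfo").join(entry.file_name()),
            });
        }
    }
    // read_dir order is unspecified; sort so callers see a stable order.
    found.sort_by(|a, b| a.fdinfo_path.cmp(&b.fdinfo_path));
    found
}
```

An fd that points at an unrelated file is dropped by the basename lookup, so its target is never read.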

Verdict

Ready for finalizer. One MEDIUM (formatting) was auto-fixed on this branch (9bb3d75). The remaining LOW findings are informational and do not block merge.

…dinfo in README and manpage

new_with_roots is called by production IntelGpuReader::new(), not test-only. Update the doc comment to reflect that it is an internal constructor accepting arbitrary roots, with production code routing through IntelGpuReader::new.

README and manpage were updated in the engine-utilization PR (#249) but neither mentioned the per-process GPU memory tracking added by this PR. Add a bullet under the Intel Arc section in README and extend the manpage Intel entry to include the fdinfo-based process accounting, mirroring the detail already present in ARCHITECTURE.md line 217.

inureyes · 2026-05-27T03:30:51Z

PR Finalization Complete

Summary

Lint/Format: cargo fmt idempotent (no changes after security checker's prior run). cargo clippy --lib --tests and cargo clippy bin target both clean with -D warnings.

Tests: 1120 lib tests pass. No new tests added — reviewer coverage of drm-client-id dedup, cardN/renderD mapping, and i915/xe schema branch was assessed as complete.

Docs: Three targeted fixes committed as f8743b1.

src/device/readers/intel_gpu_linux.rs line 146: stale "Test-only constructor" doc comment replaced with accurate description of the function as an internal constructor called by both production code and tests.
README.md: added per-process fdinfo bullet under the Intel Arc section (was missing despite ARCHITECTURE.md already documenting it); updated the Processes metrics line to include Intel Arc/Xe alongside AMD.
docs/man/all-smi.1: extended the Intel Arc entry to mention per-process GPU memory tracking via /proc fdinfo, consistent with the ARCHITECTURE.md description.

All checks passing. Ready for merge.

inureyes added 3 commits May 27, 2026 12:05

inureyes added the status:review (Under review), type:enhancement (New feature or request), and priority:low (Low priority issue) labels May 27, 2026

inureyes added the status:done (Completed) label and removed the status:review (Under review) label May 27, 2026

inureyes merged commit c95c559 into main May 27, 2026
4 checks passed

inureyes deleted the feat/issue-247-intel-fdinfo-per-process branch May 27, 2026 03:41

This was referenced May 27, 2026:

feat(intel-gpu): make Level Zero Sysman the primary metrics source #252 (Closed)
fix: harden Intel Level Zero and fdinfo paths #253 (Merged)

inureyes merged 5 commits into main from feat/issue-247-intel-fdinfo-per-process
