AMD GPU readers leak a file descriptor per instantiation on Linux (fixed in libamdgpu_top 0.11.5) #218

Description

@joshhansen

Summary

On Linux hosts with an AMD GPU present, each call into the public API leaks a single file descriptor that is never released. The following all leak one fd per call on the reporter's system:

  1. AllSmi::new()
  2. get_gpu_readers()
  3. AmdGpuReader::default() / AmdGpuReader::new()

With a typical per-process fd limit of 1024, repeatedly instantiating these (~1020 times) exhausts the limit and crashes the process. These descriptors should be released on drop.

Reported by @joshhansen with a minimal reproduction: https://github.com/joshhansen/allsmi-libamdgpu_top-bug (run cargo run --release). Reproduced on commit d5b678d.

Background

The leak originates in the upstream libamdgpu_top crate, not in all-smi's own code. DevicePath opened the DRM device via into_raw_fd(), which transfers ownership away from the RAII wrapper so the descriptor is never closed. Every new AmdGpuReader (constructed through get_gpu_readers() or AllSmi::new()) opens a fresh device handle, so file descriptors accumulate one per instantiation.
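The mechanism can be illustrated without libamdgpu_top at all. The following is a minimal sketch, assuming a Linux host where /proc/self/fd is readable; open_leaky, open_owned, and open_fd_count are hypothetical helper names, and /dev/null stands in for the DRM device node. It is not the upstream code, only the same ownership pattern.

```rust
use std::fs::{self, File};
use std::io;
use std::os::fd::{IntoRawFd, OwnedFd, RawFd};

/// Counts descriptors currently open in this process (Linux only).
/// The directory handle used for counting is included in both the
/// "before" and "after" numbers, so differences are still accurate.
fn open_fd_count() -> usize {
    fs::read_dir("/proc/self/fd")
        .map(|entries| entries.count())
        .unwrap_or(0)
}

/// Leaky pattern: into_raw_fd() gives up ownership, so nothing
/// closes the descriptor unless the caller does so manually.
fn open_leaky(path: &str) -> io::Result<RawFd> {
    Ok(File::open(path)?.into_raw_fd())
}

/// RAII pattern: OwnedFd closes the descriptor when it is dropped.
fn open_owned(path: &str) -> io::Result<OwnedFd> {
    Ok(OwnedFd::from(File::open(path)?))
}

fn main() -> io::Result<()> {
    let start = open_fd_count();

    for _ in 0..10 {
        let _raw = open_leaky("/dev/null")?; // never closed
    }
    let after_leaky = open_fd_count();

    for _ in 0..10 {
        let _owned = open_owned("/dev/null")?; // closed at end of each iteration
    }
    let after_owned = open_fd_count();

    println!("leaked by raw fds:   {}", after_leaky - start);
    println!("leaked by owned fds: {}", after_owned - after_leaky);
    Ok(())
}
```

In this sketch only the raw-fd loop grows the descriptor count, which is the same shape as the one-fd-per-instantiation growth described above.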

all-smi currently exact-pins libamdgpu_top = "=0.11.4" (Cargo.toml:76). That pin was added in #207 (closing #205) because 0.11.4 renamed get_all_proc_usage to update_proc_usage in a patch release, violating semver and breaking caret resolution on fresh installs. The pin comment explicitly notes it must be re-evaluated when bumping.

Upstream issue: Umio-Yasuno/amdgpu_top#163

Proposed Solution

Bump the exact pin to libamdgpu_top = "=0.11.5".

Version 0.11.5 (published 2026-05-18) fixes the leak in upstream commit 8ade0d5: DevicePath now caches an Arc<OwnedFd> in a OnceLock<Arc<OwnedFd>> and get_fd() returns the cached RawFd, so the descriptor is owned and closed via RAII on drop instead of leaked.
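The caching approach described above can be sketched roughly as follows. This is an illustration only, not the upstream implementation: CachedDevice is a hypothetical stand-in for DevicePath, /dev/null stands in for the DRM node, and this version of get_fd() returns io::Result<RawFd> so that an open failure is reported rather than hidden, whereas the upstream 0.11.5 signature is described as returning RawFd directly.

```rust
use std::fs::File;
use std::io;
use std::os::fd::{AsRawFd, OwnedFd, RawFd};
use std::path::PathBuf;
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for a device path that caches its descriptor.
struct CachedDevice {
    path: PathBuf,
    fd: OnceLock<Arc<OwnedFd>>,
}

impl CachedDevice {
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into(), fd: OnceLock::new() }
    }

    /// Opens the device on first use and returns the cached raw fd on
    /// every later call. The OwnedFd stays owned by `self`, so it is
    /// closed when the last Arc clone is dropped.
    fn get_fd(&self) -> io::Result<RawFd> {
        if let Some(fd) = self.fd.get() {
            return Ok(fd.as_raw_fd());
        }
        let opened = Arc::new(OwnedFd::from(File::open(&self.path)?));
        // If another thread initialized the cell first, `opened` is
        // dropped (and closed) here and the existing fd is returned.
        Ok(self.fd.get_or_init(|| opened).as_raw_fd())
    }
}

fn main() -> io::Result<()> {
    let dev = CachedDevice::new("/dev/null");
    let first = dev.get_fd()?;
    let second = dev.get_fd()?;
    println!("same descriptor reused: {}", first == second);
    Ok(())
}
```

The returned RawFd is only valid while the CachedDevice (or another Arc clone of its OwnedFd) is alive, which is the usual caveat when handing out raw descriptors from an owning wrapper.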

Implementation Notes

  • In Cargo.toml:76, change libamdgpu_top = "=0.11.4" to "=0.11.5"; update the adjacent comment (currently explaining the 0.11.4 rename rationale) to also note the fd-leak fix and that 0.11.5 retains the update_proc_usage API.
  • Regenerate the lockfile: cargo update -p libamdgpu_top --precise 0.11.5.
  • No source changes expected:
    • all-smi never calls the changed get_fd() API (verified: no get_fd/into_raw_fd usage anywhere in src/), so the upstream signature change from io::Result<RawFd> to RawFd does not affect our call sites.
    • The only AMD API call site, update_proc_usage at src/device/readers/amd.rs:572, was introduced in 0.11.4 and is retained in 0.11.5.
    • Confirm with cargo build --release and cargo clippy on a Linux glibc target.
  • Severity context: all-smi's own view/api binaries construct readers once — outside the collection loop (src/api/collection_loop.rs:60) and behind the guarded one-time init (src/view/data_collection/local_collector.rs:139) — so the CLI leaks at most one fd per AMD reader, once. The unbounded leak primarily affects consumers of the public library API (AllSmi::new, get_gpu_readers, AmdGpuReader) that re-instantiate repeatedly, which is the reporter's scenario.
  • Build matrix scope: the dependency only applies to cfg(all(target_os = "linux", not(target_env = "musl"))); musl/static and non-Linux targets are unaffected.

Acceptance Criteria

  • libamdgpu_top pinned to =0.11.5 in Cargo.toml, with the explanatory comment updated.
  • Cargo.lock regenerated to 0.11.5.
  • cargo build --release and cargo clippy pass on Linux glibc (AMD code path compiles; update_proc_usage call site unchanged).
  • Repeatedly instantiating AllSmi::new() / get_gpu_readers() / AmdGpuReader::default() (>1020 iterations) no longer grows the process fd count unbounded and no longer crashes — verified against the reporter's reproduction.
  • No regression in AMD GPU metrics collection (view/api modes still report AMD GPU info correctly).
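The fd-growth criterion could be checked with a small harness along these lines. This is a sketch under stated assumptions: it relies on /proc/self/fd (Linux only), fd_growth and open_fd_count are hypothetical helper names, and File::open("/dev/null") is a stand-in constructor. For the real check, the closure would construct AllSmi::new(), get_gpu_readers(), or AmdGpuReader::default() on a host with an AMD GPU, which this example cannot do on its own.

```rust
use std::fs;
use std::io;

/// Counts descriptors currently open in this process (Linux only).
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

/// Calls `construct` `iterations` times, dropping each value right away,
/// and returns the net change in the number of open descriptors.
fn fd_growth<T>(iterations: usize, mut construct: impl FnMut() -> T) -> io::Result<i64> {
    let before = open_fd_count()? as i64;
    for _ in 0..iterations {
        drop(construct());
    }
    Ok(open_fd_count()? as i64 - before)
}

fn main() -> io::Result<()> {
    // Stand-in constructor; replace with the all-smi call under test.
    let growth = fd_growth(1100, || fs::File::open("/dev/null"))?;
    println!("net fd growth after 1100 iterations: {growth}");
    assert!(growth <= 0, "descriptor count grew by {growth}");
    Ok(())
}
```

Running more iterations than the fd limit allows (1100 against a limit of 1024 here) means a leaking constructor fails outright, while a fixed one leaves the count flat.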

Original Suggestion

Title: AllSmi::new, get_gpu_readers, and AmdGpuReader::new leak file descriptors on Linux with AMD GPU present

A minimal reproduction can be seen here: https://github.com/joshhansen/allsmi-libamdgpu_top-bug
Just run cargo run --release.

This appears to stem ultimately from crate libamdgpu_top --- I've filed an issue there but wanted to make this project aware.

These three lines of code invoking the allsmi API each appear to leak a single file descriptor on my system:

  1. AllSmi::new().unwrap();
  2. get_gpu_readers();
  3. AmdGpuReader::default()

Since my per-process file descriptor limit is 1024, all it takes to trigger this is to instantiate AllSmi 1020 times. (The process's own fds do the rest.)

It should be expected that these would be cleaned up on drop.

Thanks
