feat(lib): targeted device refresh and stable correlation IDs for the AllSmi API #211

Description

@joshhansen

Summary

The public library API (AllSmi, src/client.rs) returns owned, point-in-time snapshots (Vec<GpuInfo>, Vec<CpuInfo>, Vec<MemoryInfo>). Refreshing today means re-calling the getter, which re-enumerates every device. This issue requests two ergonomic improvements for library consumers:

  1. Stable correlation identifiers so a re-fetched batch can be matched, entry-by-entry, to a previously held batch.
  2. Targeted refresh of a single device of interest without re-enumerating everything.

Background

AllSmi was introduced for embedding in external Rust projects (#106) and has since grown (e.g. get_storage_info, #115). Its getters — get_gpu_info (src/client.rs:259), get_cpu_info (src/client.rs:383), get_memory_info (src/client.rs:412) — take &self and rebuild the result on each call by iterating every reader (src/client.rs:260-264). So the underlying source is not frozen: re-calling returns fresh values (bounded by the platform sample interval on Apple Silicon). What goes stale is the owned copy the caller is holding, and after a refresh the caller is left to (a) re-enumerate and (b) work out which new entry maps to which old one.

State of per-entry identifiers today:

  • GpuInfo.uuid exists (src/device/types.rs:36) → GPUs/NPUs are already correlatable. The GPU gap is purely efficiency (re-enumerating 8+ devices to refresh one).
  • StorageInfo.index exists (src/storage/info.rs:24) → storage is correlatable.
  • CpuInfo (src/device/types.rs:290) has no per-entry unique id — only host-level host_id / hostname / instance, which are identical across every CPU entry from one host. This is exactly the gap the reporter hit. (Per-socket detail is nested in per_socket_info[].socket_id, but there is no key for the CpuInfo itself.)
  • MemoryInfo (src/device/types.rs:365) likewise has no id, though it is effectively a host singleton.

Cost that motivates targeted refresh: the NVIDIA reader (src/device/readers/nvidia.rs:370) queries NVML for every device on each get_gpu_info() and, on NVML failure, shells out to nvidia-smi (src/device/readers/nvidia.rs:383). Refreshing a single device of interest should not pay for all of them.

Proposed Solution

1. Targeted refresh on AllSmi (not on the info structs)

impl AllSmi {
    /// Fetch fresh info for one GPU/NPU by UUID. `None` if it is no longer present.
    pub fn get_gpu_by_uuid(&self, uuid: &str) -> Option<GpuInfo>;

    /// Re-fetch `info` in place by its UUID. Returns `true` if the device was
    /// found and the struct overwritten, `false` if it has disappeared.
    pub fn refresh_gpu(&self, info: &mut GpuInfo) -> bool;
}

Rationale for putting this on AllSmi rather than the reporter's GpuInfo::update(&mut self) option: GpuInfo / CpuInfo / MemoryInfo are plain Serialize + Deserialize + Clone DTOs with no handle to hardware. The same types are also produced by the remote-monitoring network parser and by snapshot deserialization. A self-updating struct would have to embed a #[serde(skip)] reader handle that is None for any deserialized value, giving update() a confusing partial contract (and complicating Clone / Send / Sync / serialization). The readers already live inside AllSmi (src/client.rs:134-138), so that is the correct owner of refresh logic.

refresh_gpu returns bool to match the current infallible reader behavior (get_gpu_info swallows read errors and returns zeros). A Result<bool> variant — matching the reporter's fallible suggestion — becomes worthwhile if/when readers start propagating read errors.
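As a rough illustration of how the two proposed methods could relate, the sketch below builds refresh_gpu on top of get_gpu_by_uuid. The GpuInfo and AllSmi types here are simplified stand-ins, not the library's real definitions: the `devices` field and the `utilization` field are assumptions made only so the example compiles on its own.

```rust
/// Stand-in for the library's `GpuInfo`; only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct GpuInfo {
    pub uuid: String,
    pub utilization: f64,
}

/// Stand-in for `AllSmi`. `devices` is a hypothetical substitute for the
/// live readers the real client iterates over.
pub struct AllSmi {
    pub devices: Vec<GpuInfo>,
}

impl AllSmi {
    /// Full enumeration, as the existing getter does today.
    pub fn get_gpu_info(&self) -> Vec<GpuInfo> {
        self.devices.clone()
    }

    /// Fetch one device by UUID; `None` if it is no longer present.
    pub fn get_gpu_by_uuid(&self, uuid: &str) -> Option<GpuInfo> {
        self.get_gpu_info().into_iter().find(|g| g.uuid == uuid)
    }

    /// Overwrite `info` in place when its UUID is still present.
    /// On a miss the caller's struct is left untouched.
    pub fn refresh_gpu(&self, info: &mut GpuInfo) -> bool {
        match self.get_gpu_by_uuid(&info.uuid) {
            Some(fresh) => {
                *info = fresh;
                true
            }
            None => false,
        }
    }
}
```

Leaving the struct untouched on a miss is one possible choice; it lets the caller keep the last known values for a device that has disappeared.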

2. Optional, backward-compatible reader hook for efficiency

pub trait GpuReader: Send + Sync {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;
    // …existing methods…

    /// Fetch a single device by UUID. Default filters the full enumeration;
    /// readers that can address one device directly should override.
    fn get_gpu_info_by_uuid(&self, uuid: &str) -> Option<GpuInfo> {
        self.get_gpu_info().into_iter().find(|g| g.uuid == uuid)
    }
}

The default impl keeps every existing reader compiling unchanged. The NVIDIA reader can override it: NVML supports opening a handle by UUID (nvmlDeviceGetHandleByUUID, surfaced as Nvml::device_by_uuid in nvml-wrapper 0.12.1, confirmed present; the reader currently uses only device_by_index), which lets it skip the full-device loop in get_gpu_info_nvml.
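To make the cost difference concrete, here is a minimal sketch with two hypothetical readers over a fake backend that counts per-device reads. Backend, DefaultReader, and DirectReader are invented for illustration and do not exist in the repository; only the trait shape follows the proposal above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

#[derive(Clone, Debug, PartialEq)]
pub struct GpuInfo {
    pub uuid: String,
}

pub trait GpuReader: Send + Sync {
    fn get_gpu_info(&self) -> Vec<GpuInfo>;

    /// Default: filter the full enumeration.
    fn get_gpu_info_by_uuid(&self, uuid: &str) -> Option<GpuInfo> {
        self.get_gpu_info().into_iter().find(|g| g.uuid == uuid)
    }
}

/// Hypothetical hardware backend; `reads` counts simulated device queries.
pub struct Backend {
    uuids: Vec<String>,
    reads: AtomicUsize,
}

impl Backend {
    pub fn new(uuids: &[&str]) -> Self {
        Backend {
            uuids: uuids.iter().map(|u| u.to_string()).collect(),
            reads: AtomicUsize::new(0),
        }
    }

    /// Simulate one (expensive) per-device query.
    fn read(&self, uuid: &str) -> GpuInfo {
        self.reads.fetch_add(1, Ordering::Relaxed);
        GpuInfo { uuid: uuid.to_string() }
    }

    pub fn reads(&self) -> usize {
        self.reads.load(Ordering::Relaxed)
    }
}

/// Relies on the trait's default by-UUID method.
pub struct DefaultReader(pub Backend);

impl GpuReader for DefaultReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        self.0.uuids.iter().map(|u| self.0.read(u)).collect()
    }
}

/// Overrides the by-UUID method to query only the requested device.
pub struct DirectReader(pub Backend);

impl GpuReader for DirectReader {
    fn get_gpu_info(&self) -> Vec<GpuInfo> {
        self.0.uuids.iter().map(|u| self.0.read(u)).collect()
    }

    fn get_gpu_info_by_uuid(&self, uuid: &str) -> Option<GpuInfo> {
        self.0
            .uuids
            .iter()
            .find(|u| u.as_str() == uuid)
            .map(|u| self.0.read(u))
    }
}
```

With four devices, a by-UUID lookup through DefaultReader performs four simulated reads, while DirectReader performs one (and none on a miss).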

3. Stable correlation identifier for CpuInfo / MemoryInfo

Add a 0-based index: u32 field (mirroring StorageInfo.index). AllSmi::get_cpu_info / get_memory_info assign it while concatenating reader outputs (enumerate() over the flattened result). The field is marked #[serde(default)] for wire/snapshot back-compat, the same convention already used for temperature_threshold_* and bandwidth_mb_s in types.rs. CPU/memory topology is static, so the index is a stable key across refreshes. GPUs, by contrast, can hot-plug or be MIG-reconfigured and therefore key on uuid.

Optional companion for symmetry: get_cpu_by_index(u32) -> Option<CpuInfo>. The efficiency win is marginal for CPU (typically a single aggregate entry per host), so the identifier itself is the real deliverable here.
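The index assignment described in this section might look roughly like the sketch below. CpuInfo is reduced to a stand-in with assumed fields, and assign_indices is a hypothetical free function standing in for the logic that would live inside AllSmi::get_cpu_info; serde is left out so the example needs only the standard library.

```rust
/// Stand-in for the library's `CpuInfo`. In the real struct the `index`
/// field would additionally carry `#[serde(default)]`.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct CpuInfo {
    pub index: u32,
    pub hostname: String,
    pub cpu_model: String,
}

/// Flatten the per-reader outputs and stamp each entry with its position.
/// Hypothetical helper; the proposal places this logic in `get_cpu_info`.
pub fn assign_indices(per_reader: Vec<Vec<CpuInfo>>) -> Vec<CpuInfo> {
    per_reader
        .into_iter()
        .flatten()
        .enumerate()
        .map(|(i, mut cpu)| {
            cpu.index = i as u32;
            cpu
        })
        .collect()
}

/// Sketch of the optional companion lookup, here over an already fetched batch.
pub fn get_cpu_by_index(batch: &[CpuInfo], index: u32) -> Option<CpuInfo> {
    batch.iter().find(|c| c.index == index).cloned()
}
```

The index is stable across calls only as long as the readers return their entries in the same order each time, which this sketch assumes.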

Implementation Notes

  • Files: src/client.rs (new AllSmi methods + index assignment), src/device/traits.rs (default trait method), src/device/readers/nvidia.rs (override), src/device/types.rs (index fields).
  • NVML static cache: get_gpu_info_nvml caches static per-device data keyed by index (src/device/readers/nvidia.rs:269, :305). A UUID-keyed override needs a uuid→index resolution or a uuid-keyed static cache; the simplest first cut is to ship the default filter for NVIDIA and add the device_by_uuid fast path as a follow-up.
  • Exporter / parser scope: the index field is only required on the library/JSON path. The Prometheus exporter and remote network parser do not have to carry a CPU/memory index label — #[serde(default)] keeps older snapshots deserializing cleanly, so this change can stay contained to the library layer.
  • Thread-safety: all new methods stay &self; AllSmi: Send + Sync is preserved (src/client.rs:549-550).
  • Back-compat / breaking surface: the new methods are additive. Adding #[serde(default)] fields follows existing repo precedent and is treated as non-breaking; the only caveat is downstream code that constructs CpuInfo / MemoryInfo via exhaustive struct literals (these structs are normally returned by the library, not built by consumers).
  • Docs: extend examples/library_usage.rs with a correlate-and-refresh loop, and add a short "Refreshing data" section to the AllSmi rustdoc in src/client.rs / src/lib.rs.
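One possible shape for the correlate-and-refresh loop mentioned in the Docs note is sketched below. It matches a freshly fetched batch against a held batch by uuid; merge_refresh is a hypothetical helper name and GpuInfo is again a reduced stand-in with an assumed `utilization` field.

```rust
use std::collections::HashMap;

/// Stand-in for the library's `GpuInfo`; only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct GpuInfo {
    pub uuid: String,
    pub utilization: f64,
}

/// Overwrite each held entry with the fresh entry that shares its UUID.
/// Returns the UUIDs of held entries with no match in `fresh`; those
/// entries keep their last known values.
pub fn merge_refresh(held: &mut [GpuInfo], fresh: Vec<GpuInfo>) -> Vec<String> {
    // Index the fresh batch by UUID so each held entry is matched in O(1).
    let mut by_uuid: HashMap<String, GpuInfo> =
        fresh.into_iter().map(|g| (g.uuid.clone(), g)).collect();

    let mut missing = Vec::new();
    for slot in held.iter_mut() {
        match by_uuid.remove(&slot.uuid) {
            Some(g) => *slot = g,
            None => missing.push(slot.uuid.clone()),
        }
    }
    missing
}
```

A caller would pass the result of a new get_gpu_info() call as `fresh`. Devices that appear only in the fresh batch are ignored here, since the sketch tracks only entries the caller already holds.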

Acceptance Criteria

  • AllSmi::get_gpu_by_uuid(&str) -> Option<GpuInfo> returns fresh data for a present UUID and None for an absent one.
  • AllSmi::refresh_gpu(&mut GpuInfo) -> bool overwrites the struct in place and reports found/absent.
  • GpuReader::get_gpu_info_by_uuid exists with a default implementation; all existing readers compile without change.
  • CpuInfo and MemoryInfo carry a stable, serializable index populated by AllSmi::get_cpu_info / get_memory_info, with #[serde(default)] for back-compat.
  • Unit/doc tests cover: by-uuid hit and miss, in-place refresh, and CpuInfo index stability across two consecutive get_cpu_info() calls.
  • examples/library_usage.rs demonstrates correlating and refreshing a previously fetched device; cargo run --example library_usage succeeds.
  • cargo fmt --check, cargo clippy, and cargo test pass.

Original Suggestion

Title: Allow targeted updates

Right now the returned infos are static and so grow stale the longer it's been since e.g. AllSmi::get_gpu_info was called.

This can be addressed by repeated calls to AllSmi::get_gpu_info, AllSmi::get_cpu_info, etc.

However, at least in the case of CpuInfo, there is no unique identifier by which to clearly determine which in the new batch of CpuInfos corresponds to which in the prior batch.

It would be sufficient if such a unique identifier were provided, allowing targeted updates to be done by re-generating all infos, and selecting the one desired.

It would be more efficient, though, in cases where only a small number of devices are of interest, to allow targeted updates.

One approach would be to enable the info structs to update themselves. This might look like:

impl GpuInfo {
  pub fn update(&mut self) { ... }
}

Or if it needs to be fallible (such as if the device disappears, permissions change, etc.):

impl GpuInfo {
  pub fn update(self) -> Result<Self, ...> { ... }
}

Or it could live in AllSmi:

impl AllSmi {
  pub fn update(gpu_info: &mut GpuInfo) { ... }
}

etc.

This would allow the staleness of returned information to be addressed as desired by the user, without having to re-enumerate all devices such as when calling AllSmi::get_gpu_info. Thanks for considering.
