Closed
Enhancement
Description
Problem
When CDI is enabled, the snapshot agent pod needs GPU access to run nvidia-smi. The existing --require-gpu flag solves this by requesting nvidia.com/gpu: 1, but this fails when all GPUs are already allocated to workloads.
Solution
Add a --runtime-class flag to aicr snapshot that:
- Sets runtimeClassName on the agent Job's pod spec (e.g., nvidia)
- Injects the NVIDIA_VISIBLE_DEVICES=all environment variable into the container
This gives the agent access to nvidia-smi via the NVIDIA container runtime without consuming a GPU from the Device Plugin. The snapshot GPU collector only needs to run nvidia-smi -q -x — it does not need a dedicated GPU allocation.
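A minimal Go sketch of the pod-spec change described above. The PodSpec, Container, and EnvVar types are simplified stand-ins for the Kubernetes core/v1 types (an assumption about how the agent Job is built), and applyRuntimeClass is a hypothetical helper name, not existing aicr code.

```go
package main

import "fmt"

// EnvVar, Container, and PodSpec are simplified stand-ins for the
// Kubernetes core/v1 types of the same names (assumption: the real
// agent code builds its Job with k8s.io/api/core/v1).
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	RuntimeClassName *string
	Containers       []Container
}

// applyRuntimeClass is a hypothetical helper. When runtimeClass is
// non-empty it sets RuntimeClassName and adds NVIDIA_VISIBLE_DEVICES=all
// to every container. An empty value leaves the spec untouched.
// No nvidia.com/gpu resource request is added in either case.
func applyRuntimeClass(spec *PodSpec, runtimeClass string) {
	if runtimeClass == "" {
		return
	}
	spec.RuntimeClassName = &runtimeClass
	for i := range spec.Containers {
		spec.Containers[i].Env = append(spec.Containers[i].Env,
			EnvVar{Name: "NVIDIA_VISIBLE_DEVICES", Value: "all"})
	}
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "agent"}}}
	applyRuntimeClass(&spec, "nvidia")
	fmt.Println(*spec.RuntimeClassName, spec.Containers[0].Env)
}
```

Because the helper never touches resource requests, the scheduler does not count the agent pod against Device Plugin capacity, which is the point of the proposal.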
Flags behavior
- --runtime-class and --require-gpu are mutually exclusive
- --runtime-class is the preferred approach; the error message when both are set recommends it
- Supports the AICR_RUNTIME_CLASS environment variable
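The flag rules above could be validated roughly as follows. This is a sketch under stated assumptions: resolveRuntimeClass and the error wording are hypothetical, the flag is assumed to take precedence over AICR_RUNTIME_CLASS, and a runtime class supplied only through the environment is assumed to conflict with --require-gpu as well. The issue does not specify either of those last two points.

```go
package main

import (
	"errors"
	"fmt"
)

// runtimeClassEnv is the environment variable named in the issue.
const runtimeClassEnv = "AICR_RUNTIME_CLASS"

// errConflictingGPUFlags uses illustrative wording; the issue only requires
// that the message be clear and recommend --runtime-class.
var errConflictingGPUFlags = errors.New(
	"--require-gpu and --runtime-class are mutually exclusive; " +
		"prefer --runtime-class, which does not consume a GPU allocation")

// resolveRuntimeClass is a hypothetical helper. It returns the effective
// runtime class, falling back to AICR_RUNTIME_CLASS when the flag is empty
// (assumed precedence), and rejects the combination with --require-gpu.
// getenv is injected so the function can be tested without touching the
// process environment; production code would pass os.Getenv.
func resolveRuntimeClass(flagValue string, requireGPU bool, getenv func(string) string) (string, error) {
	runtimeClass := flagValue
	if runtimeClass == "" {
		runtimeClass = getenv(runtimeClassEnv)
	}
	if runtimeClass != "" && requireGPU {
		return "", errConflictingGPUFlags
	}
	return runtimeClass, nil
}

func main() {
	noEnv := func(string) string { return "" }

	rc, err := resolveRuntimeClass("nvidia", false, noEnv)
	fmt.Println(rc, err)

	_, err = resolveRuntimeClass("nvidia", true, noEnv)
	fmt.Println(err)
}
```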
Acceptance criteria
- aicr snapshot --runtime-class nvidia sets runtimeClassName on the agent pod
- NVIDIA_VISIBLE_DEVICES=all is injected when --runtime-class is set
- --require-gpu and --runtime-class together produce a clear error recommending --runtime-class
- Table-driven tests cover the new flag in job_test.go
- make qualify passes