feat(snapshot): add --runtime-class flag as alternative to --require-gpu for CDI environments #433

@atif1996

Problem

When CDI is enabled, the snapshot agent pod needs GPU access to run nvidia-smi. The existing --require-gpu flag provides it by requesting nvidia.com/gpu: 1, but that request cannot be satisfied when all GPUs are already allocated to workloads, so the agent pod cannot be scheduled.

Solution

Add a --runtime-class flag to aicr snapshot that:

  1. Sets runtimeClassName on the agent Job's pod spec (e.g., nvidia)
  2. Injects NVIDIA_VISIBLE_DEVICES=all environment variable into the container

This gives the agent access to nvidia-smi through the NVIDIA container runtime without consuming a GPU from the Device Plugin. The snapshot GPU collector only runs nvidia-smi -q -x, so it does not need a dedicated GPU allocation.
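
The two steps above could be sketched roughly as follows. This is an illustration only, not the actual aicr code: `PodSpec`, `Container`, and `EnvVar` are simplified local stand-ins for the Kubernetes core/v1 types, and `applyRuntimeClass` and `setEnv` are hypothetical helper names. The sketch also assumes the variable is injected into every container in the pod, whereas the real Job may only have one.

```go
package main

import "fmt"

// EnvVar, Container, and PodSpec are simplified stand-ins for the
// Kubernetes core/v1 types, kept local so the sketch has no dependencies.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	RuntimeClassName *string
	Containers       []Container
}

// applyRuntimeClass is a hypothetical helper. When runtimeClass is non-empty
// it sets runtimeClassName on the pod spec and injects
// NVIDIA_VISIBLE_DEVICES=all into each container. An empty value is a no-op.
func applyRuntimeClass(spec *PodSpec, runtimeClass string) {
	if runtimeClass == "" {
		return
	}
	spec.RuntimeClassName = &runtimeClass
	for i := range spec.Containers {
		spec.Containers[i].Env = setEnv(spec.Containers[i].Env, "NVIDIA_VISIBLE_DEVICES", "all")
	}
}

// setEnv overwrites an existing variable with the same name or appends a new one,
// so the variable never appears twice in the container environment.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	for i := range env {
		if env[i].Name == name {
			env[i].Value = value
			return env
		}
	}
	return append(env, EnvVar{Name: name, Value: value})
}

func main() {
	spec := PodSpec{Containers: []Container{{Name: "agent"}}}
	applyRuntimeClass(&spec, "nvidia")
	fmt.Println(*spec.RuntimeClassName, spec.Containers[0].Env)
}
```

Note that no nvidia.com/gpu resource request appears anywhere in this path, which is what lets the pod schedule on a node whose GPUs are fully allocated.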

Flag behavior

  • --runtime-class and --require-gpu are mutually exclusive
  • --runtime-class is the preferred approach; the error message when both are set recommends it
  • The flag can also be set through the AICR_RUNTIME_CLASS environment variable
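
A minimal sketch of how these rules might be enforced is shown below. The function names (`resolveRuntimeClass`, `validateGPUAccess`) and the error wording are hypothetical, and the precedence of the flag over AICR_RUNTIME_CLASS is an assumption based on common CLI convention rather than something this issue specifies.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errConflictingFlags uses illustrative wording; the real message may differ.
var errConflictingFlags = errors.New(
	"--require-gpu and --runtime-class are mutually exclusive; " +
		"prefer --runtime-class, which does not consume a GPU allocation")

// resolveRuntimeClass returns the flag value when set, otherwise the value of
// AICR_RUNTIME_CLASS. Flag-over-environment precedence is an assumption.
// lookupEnv is injected so the function can be tested without touching the
// process environment.
func resolveRuntimeClass(flagValue string, lookupEnv func(string) (string, bool)) string {
	if flagValue != "" {
		return flagValue
	}
	if v, ok := lookupEnv("AICR_RUNTIME_CLASS"); ok {
		return v
	}
	return ""
}

// validateGPUAccess rejects the combination of --require-gpu and a runtime class.
func validateGPUAccess(requireGPU bool, runtimeClass string) error {
	if requireGPU && runtimeClass != "" {
		return errConflictingFlags
	}
	return nil
}

func main() {
	rc := resolveRuntimeClass("nvidia", os.LookupEnv)
	fmt.Println(validateGPUAccess(true, rc))  // both set: error
	fmt.Println(validateGPUAccess(false, rc)) // runtime class only: no error
}
```

Because the check runs on the resolved value, setting AICR_RUNTIME_CLASS together with --require-gpu is rejected the same way as passing both flags.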

Acceptance criteria

  • aicr snapshot --runtime-class nvidia sets runtimeClassName on the agent pod
  • NVIDIA_VISIBLE_DEVICES=all is injected when --runtime-class is set
  • --require-gpu and --runtime-class together produce a clear error recommending --runtime-class
  • Table-driven tests cover the new flag in job_test.go
  • make qualify passes
