
feat(snapshot): add --runtime-class flag for CDI environments#434

Merged
atif1996 merged 10 commits into main from feat/snapshot-runtime-class
Mar 20, 2026

Conversation

@atif1996
Contributor

Summary

  • Add --runtime-class flag to aicr snapshot that sets runtimeClassName on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all, giving the agent nvidia-smi access without consuming a GPU allocation
  • Mutually exclusive with --require-gpu — clear error recommends --runtime-class as the preferred approach for CDI environments where all GPUs are allocated
  • Pre-flight check validates the RuntimeClass exists before creating any resources, failing fast with a clear message on clusters without GPU Operator
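The mutual-exclusion rule above can be sketched as follows. Note that `snapshotOpts` and `validateFlags` are hypothetical names for illustration; the real wiring lives in `pkg/cli/snapshot.go`:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotOpts is a hypothetical stand-in for the CLI flag values.
type snapshotOpts struct {
	requireGPU   bool
	runtimeClass string
}

// validateFlags mirrors the mutual-exclusion rule described above:
// --require-gpu and --runtime-class cannot be combined, and the error
// steers users toward --runtime-class on CDI clusters.
func validateFlags(o snapshotOpts) error {
	if o.requireGPU && o.runtimeClass != "" {
		return errors.New("--require-gpu and --runtime-class are mutually exclusive; " +
			"prefer --runtime-class on CDI clusters where all GPUs are allocated")
	}
	return nil
}

func main() {
	err := validateFlags(snapshotOpts{requireGPU: true, runtimeClass: "nvidia"})
	fmt.Println(err != nil) // conflicting flags are rejected
}
```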

Usage

```shell
aicr snapshot \
  --runtime-class nvidia \
  --node-selector nvidia.com/gpu.present=true \
  --output snapshot.yaml
```
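The flags above would land in the agent pod spec roughly like this (an illustrative fragment, not the exact manifest the CLI emits):

```yaml
# Illustrative sketch of the agent pod spec fields set by the flags above.
spec:
  runtimeClassName: nvidia          # from --runtime-class
  nodeSelector:
    nvidia.com/gpu.present: "true"  # from --node-selector
  containers:
    - name: agent
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"              # injected; no GPU resource request, so no allocation is consumed
```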

Test plan

  • --require-gpu + --runtime-class rejected with clear error
  • --runtime-class nonexistent fails fast with RuntimeClass not found error
  • --runtime-class nvidia --node-selector nvidia.com/gpu.present=true captures snapshot with 8 GPUs detected
  • --node-selector alone still works (gpu-count=0 on CDI cluster without runtime class, as expected)
  • Table-driven unit tests for buildPodSpec, buildEnvVars, and combined runtime+node-selector
  • All affected package tests pass with -race
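The table-driven style mentioned above can be sketched like this. Here `envVar` stands in for `corev1.EnvVar` and `buildEnvVars` is a hypothetical reconstruction of the helper named in the test plan, not the actual implementation:

```go
package main

import (
	"fmt"
	"reflect"
)

// envVar is a minimal stand-in for corev1.EnvVar.
type envVar struct{ Name, Value string }

// buildEnvVars sketches the helper: only inject NVIDIA_VISIBLE_DEVICES
// when a runtime class is set (the non-CDI-allocation path).
func buildEnvVars(runtimeClass string) []envVar {
	if runtimeClass == "" {
		return nil
	}
	return []envVar{{Name: "NVIDIA_VISIBLE_DEVICES", Value: "all"}}
}

func main() {
	cases := []struct {
		name         string
		runtimeClass string
		want         []envVar
	}{
		{"no runtime class", "", nil},
		{"nvidia runtime class", "nvidia", []envVar{{"NVIDIA_VISIBLE_DEVICES", "all"}}},
	}
	for _, tc := range cases {
		got := buildEnvVars(tc.runtimeClass)
		fmt.Println(tc.name, reflect.DeepEqual(got, tc.want))
	}
}
```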

Closes #433

Add --runtime-class flag to `aicr snapshot` that sets runtimeClassName
on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all. This gives the
agent nvidia-smi access via the NVIDIA container runtime without
consuming a GPU from the Device Plugin — useful when all GPUs are
already allocated to workloads.

- Mutually exclusive with --require-gpu (clear error recommends
  --runtime-class as the preferred approach)
- Pre-flight check validates RuntimeClass exists before creating RBAC
  or Job resources
- CLI docs recommend combining with --node-selector to target GPU nodes

Closes #433
@atif1996 atif1996 requested a review from a team as a code owner March 19, 2026 16:54
@github-actions

Welcome to AICR, @atif1996! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

github-actions bot commented Mar 19, 2026

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 73.6% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.6%25-green)

Merging this branch changes the coverage (1 decrease, 1 increase)

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli | 27.27% (-0.10%) | 👎 |
| github.com/NVIDIA/aicr/pkg/k8s/agent | 77.48% (+1.46%) | 👍 |
| github.com/NVIDIA/aicr/pkg/snapshotter | 50.45% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli/snapshot.go | 5.00% (-0.26%) | 40 (+2) | 2 | 38 (+2) | 👎 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/deployer.go | 72.13% (+4.78%) | 61 (+12) | 44 (+11) | 17 (+1) | 👍 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/job.go | 87.69% (+2.45%) | 65 (+4) | 57 (+5) | 8 (-1) | 👍 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/types.go | 100.00% (ø) | 1 | 1 | 0 | |
| github.com/NVIDIA/aicr/pkg/snapshotter/agent.go | 38.00% (ø) | 150 | 57 | 93 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Member

@mchmarny mchmarny left a comment


Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?

Also, can we add some tests so we don't decrease the test coverage?

@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 19, 2026
@yuanchen8911 previously approved these changes Mar 19, 2026
Contributor

@yuanchen8911 yuanchen8911 left a comment


LGTM. Clean implementation — good mutual exclusion, pre-flight validation, and test coverage.

A few observations (non-blocking):

  1. NVIDIA_VISIBLE_DEVICES=all may conflict with CDI injection. On clusters where GPU Operator uses CDI mode (CDI-enabled ClusterPolicy), the runtime injects NVIDIA_VISIBLE_DEVICES via the CDI spec. Setting it explicitly in the env may override or conflict depending on the container runtime version. Worth a comment noting this is intentional for the non-CDI-allocation path (runtimeClass without resource request).

  2. Timeout reuse. validateRuntimeClass uses defaults.K8sJobCreationTimeout for a single GET call. This works but is semantically mismatched — consider defaults.K8sAPICallTimeout or similar if one exists, or just inline a short timeout (5s).

  3. Nit: The TestAgentConfigWithRuntimeClassName test only checks struct field defaults — it doesn't exercise any behavior. The job_test.go tests already cover the real logic well, so this test adds limited value.

@yuanchen8911
Contributor

Medium finding: validateRuntimeClass() correctly returns ErrCodeNotFound when the RuntimeClass is missing (deployer.go:165), but the caller at agent.go:135 wraps all deploy errors as ErrCodeInternal:

```go
return nil, errors.Wrap(errors.ErrCodeInternal, "failed to deploy agent", deployErr)
```

This loses the NOT_FOUND signal for callers/automation. Consider returning deployErr directly to preserve the original error code.

… RuntimeClassName

Add table-driven tests for validateRuntimeClass (exists, not found)
and a Deploy integration test verifying early failure when RuntimeClass
is missing. Improves pkg/k8s/agent coverage from 73.7% to 77.5%.
Add missing sidebar entry for docs/integrator/gke-tcpxo-networking.md.
Return deployErr directly instead of wrapping with ErrCodeInternal,
which was overwriting the original error code (e.g. ErrCodeNotFound
from validateRuntimeClass, ErrCodeUnavailable from network errors).
- Add comment to NVIDIA_VISIBLE_DEVICES explaining it is intentional for
  the non-CDI-allocation path (runtimeClassName without GPU resource request)
- Replace defaults.K8sJobCreationTimeout with inline 5s in validateRuntimeClass;
  a single GET pre-flight check does not warrant the job-creation timeout constant
- Remove TestAgentConfigWithRuntimeClassName; test only asserted struct field
  defaults with no behavioral coverage, real logic covered by job_test.go
@atif1996
Contributor Author

> Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?
>
> Also, can we add some tests so we don't decrease the test coverage?

Just fixed the lint issues; taking a look at coverage now.

> LGTM. Clean implementation — good mutual exclusion, pre-flight validation, and test coverage.
>
> A few observations (non-blocking):
>
> 1. NVIDIA_VISIBLE_DEVICES=all may conflict with CDI injection. On clusters where GPU Operator uses CDI mode (CDI-enabled ClusterPolicy), the runtime injects NVIDIA_VISIBLE_DEVICES via the CDI spec. Setting it explicitly in the env may override or conflict depending on the container runtime version. Worth a comment noting this is intentional for the non-CDI-allocation path (runtimeClass without resource request).
> 2. Timeout reuse. validateRuntimeClass uses defaults.K8sJobCreationTimeout for a single GET call. This works but is semantically mismatched — consider defaults.K8sAPICallTimeout or similar if one exists, or just inline a short timeout (5s).
> 3. Nit: The TestAgentConfigWithRuntimeClassName test only checks struct field defaults — it doesn't exercise any behavior. The job_test.go tests already cover the real logic well, so this test adds limited value.

Removed TestAgentConfigWithRuntimeClassName and addressed the other feedback. Ready for re-review.

@atif1996 atif1996 closed this Mar 19, 2026
@atif1996
Contributor Author

> Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?
>
> Also, can we add some tests so we don't decrease the test coverage?

Corrected. Should be ready to re-review.

@atif1996 atif1996 reopened this Mar 19, 2026
@atif1996 atif1996 requested a review from mchmarny March 19, 2026 20:55
@atif1996 atif1996 merged commit 3e46e47 into main Mar 20, 2026
26 checks passed
@atif1996 atif1996 deleted the feat/snapshot-runtime-class branch March 20, 2026 14:39
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
…#434)

Add --runtime-class flag to aicr snapshot that sets runtimeClassName on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all. This gives the agent nvidia-smi access via the NVIDIA container runtime without consuming a GPU allocation — useful in CDI environments where all GPUs are already allocated to workloads.

Mutually exclusive with --require-gpu; clear error recommends --runtime-class as the preferred approach
Pre-flight check validates RuntimeClass exists before creating RBAC or Job resources, failing fast on clusters without GPU Operator
CLI docs recommend combining with --node-selector to target GPU nodes
Preserves original error codes from Deploy callers (no blanket ErrCodeInternal wrapping)
Closes NVIDIA#433


Successfully merging this pull request may close these issues.

feat(snapshot): add --runtime-class flag as alternative to --require-gpu for CDI environments

3 participants