
feat(snapshot): add --runtime-class flag for CDI environments#434

Merged
atif1996 merged 10 commits into main from feat/snapshot-runtime-class
Mar 20, 2026

Conversation

@atif1996
Contributor

Summary

  • Add --runtime-class flag to aicr snapshot that sets runtimeClassName on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all, giving the agent nvidia-smi access without consuming a GPU allocation
  • Mutually exclusive with --require-gpu — clear error recommends --runtime-class as the preferred approach for CDI environments where all GPUs are allocated
  • Pre-flight check validates the RuntimeClass exists before creating any resources, failing fast with a clear message on clusters without GPU Operator
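The mutual-exclusion rule above can be sketched as follows. Note that `snapshotOpts` and `validateFlags` are hypothetical names for illustration; the real wiring lives in `pkg/cli/snapshot.go`:

```go
package main

import (
	"errors"
	"fmt"
)

// snapshotOpts is a hypothetical stand-in for the CLI flag values.
type snapshotOpts struct {
	requireGPU   bool
	runtimeClass string
}

// validateFlags mirrors the mutual-exclusion rule described above:
// --require-gpu and --runtime-class cannot be combined, and the error
// steers users toward --runtime-class on CDI clusters.
func validateFlags(o snapshotOpts) error {
	if o.requireGPU && o.runtimeClass != "" {
		return errors.New("--require-gpu and --runtime-class are mutually exclusive; " +
			"prefer --runtime-class on CDI clusters where all GPUs are allocated")
	}
	return nil
}

func main() {
	err := validateFlags(snapshotOpts{requireGPU: true, runtimeClass: "nvidia"})
	fmt.Println(err != nil) // conflicting flags are rejected
}
```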

Usage

```shell
aicr snapshot \
  --runtime-class nvidia \
  --node-selector nvidia.com/gpu.present=true \
  --output snapshot.yaml
```
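The flags above would land in the agent pod spec roughly like this (an illustrative fragment, not the exact manifest the CLI emits):

```yaml
# Illustrative sketch of the agent pod spec fields set by the flags above.
spec:
  runtimeClassName: nvidia          # from --runtime-class
  nodeSelector:
    nvidia.com/gpu.present: "true"  # from --node-selector
  containers:
    - name: agent
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"              # injected; no GPU resource request, so no allocation is consumed
```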

Test plan

  • --require-gpu + --runtime-class rejected with clear error
  • --runtime-class nonexistent fails fast with RuntimeClass not found error
  • --runtime-class nvidia --node-selector nvidia.com/gpu.present=true captures snapshot with 8 GPUs detected
  • --node-selector alone still works (gpu-count=0 on CDI cluster without runtime class, as expected)
  • Table-driven unit tests for buildPodSpec, buildEnvVars, and combined runtime+node-selector
  • All affected package tests pass with -race
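The table-driven style mentioned above can be sketched like this. Here `envVar` stands in for `corev1.EnvVar` and `buildEnvVars` is a hypothetical reconstruction of the helper named in the test plan, not the actual implementation:

```go
package main

import (
	"fmt"
	"reflect"
)

// envVar is a minimal stand-in for corev1.EnvVar.
type envVar struct{ Name, Value string }

// buildEnvVars sketches the helper: only inject NVIDIA_VISIBLE_DEVICES
// when a runtime class is set (the non-CDI-allocation path).
func buildEnvVars(runtimeClass string) []envVar {
	if runtimeClass == "" {
		return nil
	}
	return []envVar{{Name: "NVIDIA_VISIBLE_DEVICES", Value: "all"}}
}

func main() {
	cases := []struct {
		name         string
		runtimeClass string
		want         []envVar
	}{
		{"no runtime class", "", nil},
		{"nvidia runtime class", "nvidia", []envVar{{"NVIDIA_VISIBLE_DEVICES", "all"}}},
	}
	for _, tc := range cases {
		got := buildEnvVars(tc.runtimeClass)
		fmt.Println(tc.name, reflect.DeepEqual(got, tc.want))
	}
}
```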

Closes #433

Add --runtime-class flag to `aicr snapshot` that sets runtimeClassName
on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all. This gives the
agent nvidia-smi access via the NVIDIA container runtime without
consuming a GPU from the Device Plugin — useful when all GPUs are
already allocated to workloads.

- Mutually exclusive with --require-gpu (clear error recommends
  --runtime-class as the preferred approach)
- Pre-flight check validates RuntimeClass exists before creating RBAC
  or Job resources
- CLI docs recommend combining with --node-selector to target GPU nodes

Closes #433
@atif1996 atif1996 requested a review from a team as a code owner March 19, 2026 16:54
@github-actions

Welcome to AICR, @atif1996! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

github-actions bot commented Mar 19, 2026

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 73.6% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.6%25-green)

Merging this branch changes the coverage (1 decrease, 1 increase)

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli | 27.27% (-0.10%) | 👎 |
| github.com/NVIDIA/aicr/pkg/k8s/agent | 77.48% (+1.46%) | 👍 |
| github.com/NVIDIA/aicr/pkg/snapshotter | 50.45% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/aicr/pkg/cli/snapshot.go | 5.00% (-0.26%) | 40 (+2) | 2 | 38 (+2) | 👎 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/deployer.go | 72.13% (+4.78%) | 61 (+12) | 44 (+11) | 17 (+1) | 👍 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/job.go | 87.69% (+2.45%) | 65 (+4) | 57 (+5) | 8 (-1) | 👍 |
| github.com/NVIDIA/aicr/pkg/k8s/agent/types.go | 100.00% (ø) | 1 | 1 | 0 | |
| github.com/NVIDIA/aicr/pkg/snapshotter/agent.go | 38.00% (ø) | 150 | 57 | 93 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Member

@mchmarny mchmarny left a comment


Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?

Also, can we add some tests so we don't decrease the test coverage?

@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 19, 2026
@yuanchen8911 previously approved these changes Mar 19, 2026
Contributor

@yuanchen8911 yuanchen8911 left a comment


LGTM. Clean implementation — good mutual exclusion, pre-flight validation, and test coverage.

A few observations (non-blocking):

  1. NVIDIA_VISIBLE_DEVICES=all may conflict with CDI injection. On clusters where GPU Operator uses CDI mode (CDI-enabled ClusterPolicy), the runtime injects NVIDIA_VISIBLE_DEVICES via the CDI spec. Setting it explicitly in the env may override or conflict depending on the container runtime version. Worth a comment noting this is intentional for the non-CDI-allocation path (runtimeClass without resource request).

  2. Timeout reuse. validateRuntimeClass uses defaults.K8sJobCreationTimeout for a single GET call. This works but is semantically mismatched — consider defaults.K8sAPICallTimeout or similar if one exists, or just inline a short timeout (5s).

  3. Nit: The TestAgentConfigWithRuntimeClassName test only checks struct field defaults — it doesn't exercise any behavior. The job_test.go tests already cover the real logic well, so this test adds limited value.

@yuanchen8911
Contributor

Medium finding: validateRuntimeClass() correctly returns ErrCodeNotFound when the RuntimeClass is missing (deployer.go:165), but the caller at agent.go:135 wraps all deploy errors as ErrCodeInternal:

```go
return nil, errors.Wrap(errors.ErrCodeInternal, "failed to deploy agent", deployErr)
```

This loses the NOT_FOUND signal for callers/automation. Consider returning deployErr directly to preserve the original error code.

… RuntimeClassName

Add table-driven tests for validateRuntimeClass (exists, not found)
and a Deploy integration test verifying early failure when RuntimeClass
is missing. Improves pkg/k8s/agent coverage from 73.7% to 77.5%.
Add missing sidebar entry for docs/integrator/gke-tcpxo-networking.md.
Return deployErr directly instead of wrapping with ErrCodeInternal,
which was overwriting the original error code (e.g. ErrCodeNotFound
from validateRuntimeClass, ErrCodeUnavailable from network errors).
- Add comment to NVIDIA_VISIBLE_DEVICES explaining it is intentional for
  the non-CDI-allocation path (runtimeClassName without GPU resource request)
- Replace defaults.K8sJobCreationTimeout with inline 5s in validateRuntimeClass;
  a single GET pre-flight check does not warrant the job-creation timeout constant
- Remove TestAgentConfigWithRuntimeClassName; test only asserted struct field
  defaults with no behavioral coverage, real logic covered by job_test.go
@atif1996
Contributor Author

> Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?
>
> Also, can we add some tests so we don't decrease the test coverage?

Just fixed the lint issues; taking a look at coverage now.

> LGTM. Clean implementation — good mutual exclusion, pre-flight validation, and test coverage.
>
> A few observations (non-blocking):
>
> 1. NVIDIA_VISIBLE_DEVICES=all may conflict with CDI injection. On clusters where GPU Operator uses CDI mode (CDI-enabled ClusterPolicy), the runtime injects NVIDIA_VISIBLE_DEVICES via the CDI spec. Setting it explicitly in the env may override or conflict depending on the container runtime version. Worth a comment noting this is intentional for the non-CDI-allocation path (runtimeClass without resource request).
> 2. Timeout reuse. validateRuntimeClass uses defaults.K8sJobCreationTimeout for a single GET call. This works but is semantically mismatched — consider defaults.K8sAPICallTimeout or similar if one exists, or just inline a short timeout (5s).
> 3. Nit: The TestAgentConfigWithRuntimeClassName test only checks struct field defaults — it doesn't exercise any behavior. The job_test.go tests already cover the real logic well, so this test adds limited value.

Removed TestAgentConfigWithRuntimeClassName and addressed the other feedback. Ready for re-review.

@atif1996 atif1996 closed this Mar 19, 2026
@atif1996
Contributor Author

> Generally speaking it looks good, but there appear to be linting errors. Did `make qualify` pass locally?
>
> Also, can we add some tests so we don't decrease the test coverage?

Corrected. Should be ready to re-review.

@atif1996 atif1996 reopened this Mar 19, 2026
@atif1996 atif1996 requested a review from mchmarny March 19, 2026 20:55
@atif1996 atif1996 merged commit 3e46e47 into main Mar 20, 2026
26 checks passed
@atif1996 atif1996 deleted the feat/snapshot-runtime-class branch March 20, 2026 14:39
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
…#434)

Add --runtime-class flag to aicr snapshot that sets runtimeClassName on the agent pod and injects NVIDIA_VISIBLE_DEVICES=all. This gives the agent nvidia-smi access via the NVIDIA container runtime without consuming a GPU allocation — useful in CDI environments where all GPUs are already allocated to workloads.

Mutually exclusive with --require-gpu; clear error recommends --runtime-class as the preferred approach
Pre-flight check validates RuntimeClass exists before creating RBAC or Job resources, failing fast on clusters without GPU Operator
CLI docs recommend combining with --node-selector to target GPU nodes
Preserves original error codes from Deploy callers (no blanket ErrCodeInternal wrapping)
Closes NVIDIA#433


Successfully merging this pull request may close these issues.

feat(snapshot): add --runtime-class flag as alternative to --require-gpu for CDI environments

3 participants