feat(snapshot): add --runtime-class flag for CDI environments #434
Conversation
Add `--runtime-class` flag to `aicr snapshot` that sets `runtimeClassName` on the agent pod and injects `NVIDIA_VISIBLE_DEVICES=all`. This gives the agent `nvidia-smi` access via the NVIDIA container runtime without consuming a GPU from the Device Plugin — useful when all GPUs are already allocated to workloads.

- Mutually exclusive with `--require-gpu` (clear error recommends `--runtime-class` as the preferred approach)
- Pre-flight check validates the RuntimeClass exists before creating RBAC or Job resources
- CLI docs recommend combining with `--node-selector` to target GPU nodes

Closes #433
Coverage Report ✅

Merging this branch changes the coverage (1 decrease, 1 increase).

Coverage by file — changed files (no unit tests).

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
mchmarny
left a comment
Generally speaking it looks good. Looks like there are linting errors though; did `make qualify` pass locally?
Also, can we add some tests so as not to decrease the test coverage?
Resolve gocritic `elseif` lint: collapse `else { if }` into `else if`.
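The `elseif` fix above is purely mechanical; a minimal sketch of the rewrite (the `classify` function and its conditions are hypothetical, chosen only to illustrate the pattern):

```go
package main

import "fmt"

func classify(n int) string {
	// Before (flagged by gocritic's elseif check):
	//   if n > 0 {
	//       return "positive"
	//   } else {
	//       if n < 0 {
	//           return "negative"
	//       }
	//   }
	// After: collapse `else { if ... }` into `else if`.
	if n > 0 {
		return "positive"
	} else if n < 0 {
		return "negative"
	}
	return "zero"
}

func main() {
	fmt.Println(classify(3), classify(-2), classify(0)) // positive negative zero
}
```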
yuanchen8911
left a comment
LGTM. Clean implementation — good mutual exclusion, pre-flight validation, and test coverage.
A few observations (non-blocking):
- `NVIDIA_VISIBLE_DEVICES=all` may conflict with CDI injection. On clusters where GPU Operator uses CDI mode (CDI-enabled `ClusterPolicy`), the runtime injects `NVIDIA_VISIBLE_DEVICES` via the CDI spec. Setting it explicitly in the env may override or conflict depending on the container runtime version. Worth a comment noting this is intentional for the non-CDI-allocation path (runtimeClass without resource request).
- Timeout reuse. `validateRuntimeClass` uses `defaults.K8sJobCreationTimeout` for a single GET call. This works but is semantically mismatched — consider `defaults.K8sAPICallTimeout` or similar if one exists, or just inline a short timeout (5s).
- Nit: the `TestAgentConfigWithRuntimeClassName` test only checks struct field defaults — it doesn't exercise any behavior. The `job_test.go` tests already cover the real logic well, so this test adds limited value.
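The inline-timeout suggestion above can be sketched roughly as follows. This is a hedged illustration, not the project's actual code: `getRuntimeClass` is a stand-in for the single Kubernetes GET (which in reality would go through client-go), and the 5s value is the one suggested in the review.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// getRuntimeClass is a stand-in for the single API GET; the real code
// would call something like RuntimeClasses().Get via client-go.
func getRuntimeClass(ctx context.Context, name string) error {
	select {
	case <-time.After(10 * time.Millisecond): // simulated API latency
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func validateRuntimeClass(name string) error {
	// Inline short timeout for a single pre-flight GET, rather than
	// reusing a job-creation timeout constant with different semantics.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return getRuntimeClass(ctx, name)
}

func main() {
	fmt.Println(validateRuntimeClass("nvidia"))
}
```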
Medium finding: `return nil, errors.Wrap(errors.ErrCodeInternal, "failed to deploy agent", deployErr)` loses the original error code carried by `deployErr`.
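The finding above can be illustrated with a toy coded-error type. `CodedError` and the code strings here are hypothetical stand-ins for the project's `errors` package, used only to show why returning the callee's error directly preserves its code:

```go
package main

import "fmt"

// CodedError is a toy stand-in for the project's coded error type.
type CodedError struct {
	Code string
	Msg  string
}

func (e *CodedError) Error() string { return e.Code + ": " + e.Msg }

func validateRuntimeClass(name string) error {
	// Pre-flight failure carries a specific code.
	return &CodedError{Code: "NotFound", Msg: "runtimeclass " + name + " not found"}
}

func deploy(name string) error {
	if err := validateRuntimeClass(name); err != nil {
		// Anti-pattern: re-wrapping with a blanket Internal code
		// overwrites the NotFound code from the callee:
		//   return &CodedError{Code: "Internal", Msg: err.Error()}
		// Fix: return the error directly so callers see the original code.
		return err
	}
	return nil
}

func main() {
	fmt.Println(deploy("nonexistent"))
}
```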
… RuntimeClassName

Add table-driven tests for `validateRuntimeClass` (exists, not found) and a Deploy integration test verifying early failure when the RuntimeClass is missing. Improves pkg/k8s/agent coverage from 73.7% to 77.5%.
Add missing sidebar entry for docs/integrator/gke-tcpxo-networking.md.
Return deployErr directly instead of wrapping with ErrCodeInternal, which was overwriting the original error code (e.g. ErrCodeNotFound from validateRuntimeClass, ErrCodeUnavailable from network errors).
- Add comment to `NVIDIA_VISIBLE_DEVICES` explaining it is intentional for the non-CDI-allocation path (runtimeClassName without GPU resource request)
- Replace `defaults.K8sJobCreationTimeout` with an inline 5s timeout in `validateRuntimeClass`; a single GET pre-flight check does not warrant the job-creation timeout constant
- Remove `TestAgentConfigWithRuntimeClassName`; the test only asserted struct field defaults with no behavioral coverage, and the real logic is covered by `job_test.go`
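The env-var comment described in the first bullet could look roughly like this. This is a sketch, not the repository's code: `EnvVar` is a local type mirroring the shape of `corev1.EnvVar`, and `buildEnvVars`'s signature is assumed.

```go
package main

import "fmt"

// EnvVar mirrors the shape of corev1.EnvVar for illustration.
type EnvVar struct {
	Name  string
	Value string
}

// buildEnvVars sketches env construction when a runtime class is set.
func buildEnvVars(runtimeClass string) []EnvVar {
	env := []EnvVar{}
	if runtimeClass != "" {
		// Intentional for the non-CDI-allocation path (runtimeClassName
		// without a GPU resource request): the NVIDIA container runtime
		// exposes all GPUs to the agent. On CDI-mode clusters the runtime
		// injects NVIDIA_VISIBLE_DEVICES itself, so setting it explicitly
		// may override that value depending on the runtime version.
		env = append(env, EnvVar{Name: "NVIDIA_VISIBLE_DEVICES", Value: "all"})
	}
	return env
}

func main() {
	fmt.Println(buildEnvVars("nvidia"))
}
```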
Just fixed the lint issues. Taking a look at coverage.
Removed `TestAgentConfigWithRuntimeClassName` and addressed the other feedback. Ready for re-review.
Corrected. Should be ready for re-review.
…#434)

Add `--runtime-class` flag to `aicr snapshot` that sets `runtimeClassName` on the agent pod and injects `NVIDIA_VISIBLE_DEVICES=all`. This gives the agent `nvidia-smi` access via the NVIDIA container runtime without consuming a GPU allocation — useful in CDI environments where all GPUs are already allocated to workloads.

- Mutually exclusive with `--require-gpu`; clear error recommends `--runtime-class` as the preferred approach
- Pre-flight check validates the RuntimeClass exists before creating RBAC or Job resources, failing fast on clusters without GPU Operator
- CLI docs recommend combining with `--node-selector` to target GPU nodes
- Preserves original error codes from Deploy callers (no blanket `ErrCodeInternal` wrapping)

Closes NVIDIA#433
Summary
- Add `--runtime-class` flag to `aicr snapshot` that sets `runtimeClassName` on the agent pod and injects `NVIDIA_VISIBLE_DEVICES=all`, giving the agent `nvidia-smi` access without consuming a GPU allocation
- Mutually exclusive with `--require-gpu` — clear error recommends `--runtime-class` as the preferred approach for CDI environments where all GPUs are allocated

Usage
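The Usage section's example appears to have been lost in extraction. Based on the flags exercised in the test plan, a typical invocation would look something like this (the `nvidia` RuntimeClass name and the node-selector label are the ones used in the test plan, not guaranteed defaults):

```shell
# Capture a snapshot via the NVIDIA runtime class without consuming a GPU
# allocation, targeting GPU nodes via a node selector.
aicr snapshot --runtime-class nvidia --node-selector nvidia.com/gpu.present=true
```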
Test plan
- `--require-gpu` + `--runtime-class` rejected with clear error
- `--runtime-class nonexistent` fails fast with RuntimeClass not found error
- `--runtime-class nvidia --node-selector nvidia.com/gpu.present=true` captures snapshot with 8 GPUs detected
- `--node-selector` alone still works (gpu-count=0 on CDI cluster without runtime class, as expected)
- Unit tests for `buildPodSpec`, `buildEnvVars`, and combined runtime+node-selector pass with `-race`

Closes #433