fix: skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035

Merged
enoodle merged 1 commit into kai-scheduler:main from yuanchen8911:fix/skip-runtimeclass-when-empty
Feb 19, 2026

@yuanchen8911
Contributor

Summary

When gpuPodRuntimeClassName is set to empty string (""), the admission webhook should not inject runtimeClassName into GPU pods. Currently, even with an empty value, the Mutate function still proceeds to evaluate the pod and may set runtimeClassName to an empty string.

This fix adds an early return in RuntimeEnforcement.Mutate() when gpuPodRuntimeClassName is empty, completely skipping the runtimeClassName injection.
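
The early return described above can be sketched roughly as follows. This is a minimal illustration, not the code from runtime_enforcement.go: the `PodSpec` type, its `RequestsGPU` field, and the struct layout of `RuntimeEnforcement` are hypothetical stand-ins so that the example runs without Kubernetes dependencies.

```go
package main

import "fmt"

// PodSpec is a hypothetical stand-in for the parts of a Kubernetes pod spec
// that matter here; the real webhook operates on a full pod object.
type PodSpec struct {
	RuntimeClassName *string
	RequestsGPU      bool // assumption: the real code inspects container resources
}

// RuntimeEnforcement holds the configured runtime class name (layout assumed).
type RuntimeEnforcement struct {
	gpuPodRuntimeClassName string
}

// Mutate sets the runtime class on GPU pods unless injection is disabled.
func (r *RuntimeEnforcement) Mutate(spec *PodSpec) {
	// Early return: an empty name means "do not inject anything".
	if r.gpuPodRuntimeClassName == "" {
		return
	}
	// Non-GPU pods are left alone in this sketch.
	if !spec.RequestsGPU {
		return
	}
	name := r.gpuPodRuntimeClassName
	spec.RuntimeClassName = &name
}

func main() {
	disabled := &RuntimeEnforcement{gpuPodRuntimeClassName: ""}
	enabled := &RuntimeEnforcement{gpuPodRuntimeClassName: "nvidia"}

	a := &PodSpec{RequestsGPU: true}
	disabled.Mutate(a)
	fmt.Println(a.RuntimeClassName == nil) // true: the field stays unset

	b := &PodSpec{RequestsGPU: true}
	enabled.Mutate(b)
	fmt.Println(*b.RuntimeClassName) // nvidia
}
```

Placing the check first means the pod is never inspected when injection is disabled, so the field stays nil rather than pointing at an empty string.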

Problem

With GPU Operator v25.10.0+, nvidia is configured as the default containerd runtime. KAI scheduler's admission webhook injecting runtimeClassName: nvidia triggers the management.nvidia.com CDI management path, causing pod startup failures:

Error: failed to inject CDI devices: unresolvable CDI devices
management.nvidia.com/gpu=GPU-<UUID>: unknown

The --gpu-pod-runtime-class-name flag help text already documents "Set to empty string to disable", but the Mutate function did not check for this case.

Changes

  • runtime_enforcement.go: Add early return when gpuPodRuntimeClassName is empty
  • runtime_enforcement_test.go: Add test case for empty string behavior

Test plan

  • Existing unit tests pass
  • New test case verifies GPU pods are not mutated when gpuPodRuntimeClassName is empty

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai Bot commented Feb 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

…mpty

When gpuPodRuntimeClassName is set to empty string, the admission
webhook should not inject runtimeClassName into GPU pods. This allows
environments where the nvidia runtime is already the default containerd
runtime (e.g., GPU Operator v25.10.0+) to avoid triggering the
management.nvidia.com CDI management path, which fails with
"unresolvable CDI devices" on nodes without UUID-based CDI specs.

The --gpu-pod-runtime-class-name flag help text already documents
"Set to empty string to disable" but the Mutate function did not
check for this case.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the fix/skip-runtimeclass-when-empty branch from 0d995ee to 4991337 Compare February 18, 2026 21:45
@yuanchen8911
Contributor Author

Thanks for the review. The empty string is preserved through the operator pipeline because SetDefault only replaces nil, not empty string:

// pkg/apis/kai/v1/common/set_default.go
func SetDefault[T any](target *T, value *T) *T {
    if target == nil {
        return value
    }
    return target  // *string("") is not nil, so it's kept
}
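
To make the nil versus empty-string distinction concrete, here is a small runnable check that reuses the SetDefault function quoted above. The default value "nvidia" is an assumption for illustration; the real default comes from constants.DefaultRuntimeClassName.

```go
package main

import "fmt"

// SetDefault is copied from the snippet above: only a nil target is replaced.
func SetDefault[T any](target *T, value *T) *T {
	if target == nil {
		return value
	}
	return target
}

func main() {
	def := "nvidia" // assumed default value, for illustration only

	var unset *string // field omitted by the user
	empty := ""       // field explicitly set to ""

	fmt.Printf("%q\n", *SetDefault(unset, &def))  // "nvidia"
	fmt.Printf("%q\n", *SetDefault(&empty, &def)) // ""
}
```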

Full flow when user sets admission.gpuPodRuntimeClassName: "":

  1. Helm (after PR #972, "fix: allow setting empty gpuPodRuntimeClassName during helm install") renders gpuPodRuntimeClassName: "" in ConfigMap
  2. Operator deserializes → *string pointing to "" (not nil)
  3. SetDefaultsWhereNeeded at admission.go:70: SetDefault keeps "" because target is not nil
  4. buildArgsList at resources.go:362: config.GPUPodRuntimeClassName is non-nil, so it passes --gpu-pod-runtime-class-name ""
  5. This fix: Mutate() sees empty string, returns early, no runtimeClassName injection
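
Step 4 of the flow above can be sketched as follows. The `admissionConfig` type and the function signature are hypothetical; only the flag name and the nil check are taken from the description. The real buildArgsList at resources.go:362 will differ in detail.

```go
package main

import "fmt"

// admissionConfig is a hypothetical stand-in for the operator's admission config.
type admissionConfig struct {
	GPUPodRuntimeClassName *string
}

// buildArgsList emits the flag whenever the pointer is non-nil,
// even if it points at an empty string.
func buildArgsList(config admissionConfig) []string {
	args := []string{}
	if config.GPUPodRuntimeClassName != nil {
		args = append(args, "--gpu-pod-runtime-class-name", *config.GPUPodRuntimeClassName)
	}
	return args
}

func main() {
	empty := ""
	fmt.Printf("%q\n", buildArgsList(admissionConfig{}))
	// []
	fmt.Printf("%q\n", buildArgsList(admissionConfig{GPUPodRuntimeClassName: &empty}))
	// ["--gpu-pod-runtime-class-name" ""]
}
```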

Without this fix, step 5 would still proceed to evaluate the pod and call setRuntimeClass(pod, ""), which sets runtimeClassName to an empty string pointer rather than leaving it unset.
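
The difference between an unset field and a pointer to an empty string shows up in the serialized pod. The sketch below uses a hypothetical `podSpec` type with the same JSON tag shape as the Kubernetes field (a `*string` with `omitempty`), and a simplified `setRuntimeClass` whose signature is assumed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podSpec is a hypothetical, reduced pod spec with a pointer field and omitempty.
type podSpec struct {
	RuntimeClassName *string `json:"runtimeClassName,omitempty"`
}

// setRuntimeClass is a simplified stand-in for the helper named above.
func setRuntimeClass(spec *podSpec, name string) {
	spec.RuntimeClassName = &name
}

// render marshals the spec so the two cases can be compared as text.
func render(spec podSpec) string {
	out, err := json.Marshal(spec)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	var untouched podSpec
	fmt.Println(render(untouched)) // {}

	var mutated podSpec
	setRuntimeClass(&mutated, "")
	fmt.Println(render(mutated)) // {"runtimeClassName":""}
}
```

With `omitempty`, only a nil pointer is dropped from the output; a non-nil pointer to "" is still emitted, which is why returning early is not equivalent to setting an empty name.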

The cmd/admission/app/options.go:70 default (constants.DefaultRuntimeClassName) only applies when the flag is not provided at all — since the operator explicitly passes the flag via buildArgsList, the pflag default is overridden.
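
The flag-default behavior can be illustrated with Go's standard `flag` package, used here as a stand-in for pflag so the example has no external dependencies. The default value "nvidia", the help string, and the `parseRuntimeClass` helper are assumptions for illustration.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultRuntimeClassName is an assumed value standing in for
// constants.DefaultRuntimeClassName.
const defaultRuntimeClassName = "nvidia"

// parseRuntimeClass parses the flag from args and returns the effective value.
func parseRuntimeClass(args []string) (string, error) {
	fs := flag.NewFlagSet("admission", flag.ContinueOnError)
	name := fs.String("gpu-pod-runtime-class-name", defaultRuntimeClassName,
		"Runtime class for GPU pods. Set to empty string to disable")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *name, nil
}

func main() {
	// Flag omitted: the default applies.
	omitted, _ := parseRuntimeClass(nil)
	// Flag passed explicitly with an empty value: the default is overridden.
	explicit, _ := parseRuntimeClass([]string{"--gpu-pod-runtime-class-name", ""})
	fmt.Printf("%q %q\n", omitted, explicit) // "nvidia" ""
}
```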

@github-actions

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/admission/webhook/v1alpha2/runtimeenforcement | 78.57% (+3.57%) | 👍 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/KAI-scheduler/pkg/admission/webhook/v1alpha2/runtimeenforcement/runtime_enforcement.go | 78.57% (+3.57%) | 14 (+2) | 11 (+2) | 3 | 👍 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/admission/webhook/v1alpha2/runtimeenforcement/runtime_enforcement_test.go

@enoodle enoodle added this pull request to the merge queue Feb 19, 2026
Merged via the queue into kai-scheduler:main with commit 9675f37 Feb 19, 2026
13 of 15 checks passed
@KaiPilotBot

Backport failed for v0.9, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.9
git worktree add -d .worktree/backport-1035-to-v0.9 origin/v0.9
cd .worktree/backport-1035-to-v0.9
git switch --create backport-1035-to-v0.9
git cherry-pick -x 9675f37d32d202b7eeb96485c85fdd399d37b12b

@KaiPilotBot

Backport failed for v0.10, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.10
git worktree add -d .worktree/backport-1035-to-v0.10 origin/v0.10
cd .worktree/backport-1035-to-v0.10
git switch --create backport-1035-to-v0.10
git cherry-pick -x 9675f37d32d202b7eeb96485c85fdd399d37b12b

@KaiPilotBot

Backport failed for v0.12, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.12
git worktree add -d .worktree/backport-1035-to-v0.12 origin/v0.12
cd .worktree/backport-1035-to-v0.12
git switch --create backport-1035-to-v0.12
git cherry-pick -x 9675f37d32d202b7eeb96485c85fdd399d37b12b

enoodle pushed a commit that referenced this pull request Mar 4, 2026
…mpty (#1035)

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>