feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling #450
Merged
yuanchen8911 merged 1 commit into NVIDIA:main on Mar 21, 2026
mchmarny approved these changes on Mar 20, 2026
Upgrade KAI Scheduler from v0.12.17 to v0.13.0. Key changes:
1. Runtime class injection fix (KAI-Scheduler#1035):
- Add admission.gpuPodRuntimeClassName: "" to Helm values
- Remove post-install workaround Job (disable-runtime-class-injection.yaml)
- v0.13.0 chart respects empty string instead of falling back to "nvidia"
2. DRA ResourceClaim queue label (v0.13.0 requirement):
- KAI v0.13.0 requires kai.scheduler/queue label on DRA claims
- Add label to gang scheduling validator (buildGangResourceClaim)
- Update evidence manifest to use DRA ResourceClaims instead of
device-plugin nvidia.com/gpu requests
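The chart behavior described in item 1 can be illustrated with a minimal sketch. It assumes the difference comes down to a `default "nvidia"` pipeline versus a `quote` pipeline in the chart template. The `default` and `quote` functions below are simplified stand-ins for the Sprig helpers that Helm exposes, written only for this example, and `render` is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"text/template"
)

// funcs emulates two Sprig helpers available in Helm templates.
// These are simplified stand-ins for this sketch, not Sprig itself.
var funcs = template.FuncMap{
	// default returns the fallback d when the piped value v is empty.
	"default": func(d, v string) string {
		if v == "" {
			return d
		}
		return v
	},
	// quote wraps the value in double quotes, so an empty string stays empty.
	"quote": strconv.Quote,
}

// render executes a one-line template against a flat values map.
func render(tmpl string, values map[string]string) string {
	t := template.Must(template.New("t").Funcs(funcs).Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, values); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	// The Helm value set by this PR: an explicit empty string.
	values := map[string]string{"gpuPodRuntimeClassName": ""}

	// Fallback-style pipeline: the empty string is replaced by "nvidia".
	fmt.Println(render(`{{ .gpuPodRuntimeClassName | default "nvidia" }}`, values))

	// Quote-style pipeline: the empty string is preserved.
	fmt.Println(render(`{{ .gpuPodRuntimeClassName | quote }}`, values))
}
```

With a fallback-style pipeline, an empty value cannot disable injection because it is treated the same as an unset value, which is why the earlier chart needed a post-install workaround.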
Tested on EKS with H100: all 11 conformance checks pass including
gang scheduling with DRA + KAI v0.13.0.
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
GPU H100 CI checks (conformance, training, inference) were cancelled because of a pre-existing nvkind runner infrastructure issue: all H100 tests across all PRs and main have been failing or cancelled today. This is not related to this change. Manually validated on an EKS H100 cluster (aicr-cuj2).
Summary
Upgrade KAI Scheduler from v0.12.17 to v0.13.0 and fix DRA compatibility for gang scheduling validation.
Motivation / Context
KAI v0.12.x injected `runtimeClassName: nvidia` into GPU pods via admission webhook, which conflicted with GPU Operator v25.10.0+ CDI defaults (causing "unresolvable CDI devices" errors). We worked around this with a post-install Job that patched the Config CR. KAI v0.13.0 fixes this upstream (KAI-Scheduler#1035).
Additionally, KAI v0.13.0 requires `kai.scheduler/queue` labels on DRA ResourceClaims. Without this label, the scheduler rejects device-plugin-style GPU requests on DRA-enabled nodes with "device-plugin GPU requests cannot be scheduled on DRA-only nodes".
Fixes: N/A
Related: kai-scheduler/KAI-Scheduler#1035
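As a rough illustration of the queue label requirement above, the following sketch adds `kai.scheduler/queue` to an object's labels without mutating the caller's map. The `claimMeta` type is a hypothetical, trimmed-down stand-in for Kubernetes object metadata, and `withQueueLabel` and the claim name are made up for this example; the actual change in `buildGangResourceClaim` is not shown in this conversation and may look different.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueLabelKey is the label that KAI v0.13.0 expects on DRA ResourceClaims.
const queueLabelKey = "kai.scheduler/queue"

// claimMeta is a hypothetical stand-in for the metadata of a ResourceClaim.
type claimMeta struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// withQueueLabel returns a copy of meta whose labels include the queue label.
// Existing labels are copied so the input map is left untouched.
func withQueueLabel(meta claimMeta, queue string) claimMeta {
	labels := make(map[string]string, len(meta.Labels)+1)
	for k, v := range meta.Labels {
		labels[k] = v
	}
	labels[queueLabelKey] = queue
	meta.Labels = labels
	return meta
}

func main() {
	// "gang-claim-0" is an illustrative name; "default-queue" is the value used in this PR.
	meta := withQueueLabel(claimMeta{Name: "gang-claim-0", Namespace: "default"}, "default-queue")

	out, err := json.MarshalIndent(meta, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Copying the label map rather than writing into it keeps the helper safe to call on metadata shared between several claims in the same gang.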
Type of Change
Component(s) Affected
- (`cmd/aicr`, `pkg/cli`)
- (`cmd/aicrd`, `pkg/api`, `pkg/server`)
- (`pkg/recipe`)
- (`pkg/bundler`, `pkg/component/*`)
- (`pkg/collector`, `pkg/snapshotter`)
- (`pkg/validator`)
- (`pkg/errors`, `pkg/k8s`)
- (`docs/`, `examples/`)
Implementation Notes
KAI v0.13.0 upgrade:
- `recipes/registry.yaml`: version bump v0.12.17 → v0.13.0
- `recipes/overlays/base.yaml`: version bump + remove manifestFiles workaround
- `recipes/components/kai-scheduler/values.yaml`: add `admission.gpuPodRuntimeClassName: ""`
- `recipes/components/kai-scheduler/manifests/disable-runtime-class-injection.yaml`: removed
DRA gang scheduling fix:
- `validators/conformance/gang_scheduling_check.go`: add `kai.scheduler/queue: default-queue` label to `buildGangResourceClaim`
- `pkg/evidence/scripts/manifests/gang-scheduling-test.yaml`: replace device-plugin `nvidia.com/gpu: 1` with DRA ResourceClaims + queue labels
Chart template change:
- v0.12.x: `gpuPodRuntimeClassName | default "nvidia"`
- v0.13.0: `gpuPodRuntimeClassName | quote`
Testing
Tested on EKS with H100 (p5.48xlarge), KAI v0.13.0, DRA driver, GPU Operator v25.12.0.
Risk Assessment
Rollout notes: Existing clusters using KAI v0.12.x with the post-install workaround will upgrade cleanly — the workaround Job is idempotent and its removal is safe. New deployments get the fix natively via Helm values.
Checklist
- (`make test` with `-race`)
- (`make lint`)
- (`git commit -S`)