
feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling#450

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:chore/kai-scheduler-v0.13.0
Mar 21, 2026

Conversation

@yuanchen8911
Contributor

Summary

Upgrade KAI Scheduler from v0.12.17 to v0.13.0 and fix DRA compatibility for gang scheduling validation.

Motivation / Context

KAI v0.12.x injected runtimeClassName: nvidia into GPU pods via its admission webhook, which conflicted with the CDI defaults in GPU Operator v25.10.0+ and caused "unresolvable CDI devices" errors. We worked around this with a post-install Job that patched the Config CR. KAI v0.13.0 fixes this upstream (KAI-Scheduler#1035).

Additionally, KAI v0.13.0 requires kai.scheduler/queue labels on DRA ResourceClaims — without this label, the scheduler rejects device-plugin-style GPU requests on DRA-enabled nodes with "device-plugin GPU requests cannot be scheduled on DRA-only nodes".

Fixes: N/A
Related: kai-scheduler/KAI-Scheduler#1035
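
The new label requirement can be illustrated with a minimal ResourceClaim manifest (a sketch, not the manifest from this PR; the claim name and device class name are assumptions, while the `kai.scheduler/queue: default-queue` label matches the value this PR adds):

```yaml
# Sketch: a DRA ResourceClaim that KAI v0.13.0 will accept.
# Without the kai.scheduler/queue label, the scheduler rejects
# the request on DRA-enabled nodes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gang-gpu-claim                   # assumed name
  labels:
    kai.scheduler/queue: default-queue   # required by KAI v0.13.0
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com  # assumed device class
```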

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/overlays, recipes/registry, evidence scripts

Implementation Notes

KAI v0.13.0 upgrade:

  • recipes/registry.yaml: version bump v0.12.17 → v0.13.0
  • recipes/overlays/base.yaml: version bump + remove manifestFiles workaround
  • recipes/components/kai-scheduler/values.yaml: add admission.gpuPodRuntimeClassName: ""
  • Delete recipes/components/kai-scheduler/manifests/disable-runtime-class-injection.yaml
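
The values change amounts to a one-key override (sketch of the relevant fragment; surrounding structure is an assumption):

```yaml
# recipes/components/kai-scheduler/values.yaml (fragment)
admission:
  # Empty string disables runtime class injection entirely.
  # v0.12.x charts ignored "" and fell back to "nvidia";
  # the v0.13.0 chart honors it.
  gpuPodRuntimeClassName: ""
```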

DRA gang scheduling fix:

  • validators/conformance/gang_scheduling_check.go: add kai.scheduler/queue: default-queue label to buildGangResourceClaim
  • pkg/evidence/scripts/manifests/gang-scheduling-test.yaml: replace device-plugin nvidia.com/gpu: 1 with DRA ResourceClaims + queue labels
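
The evidence-manifest change swaps a device-plugin resource request for a DRA claim reference, roughly like this (illustrative sketch only; pod name, image, and claim name are assumptions):

```yaml
# Before (device-plugin style, rejected by KAI v0.13.0 on DRA-only nodes):
#   resources:
#     limits:
#       nvidia.com/gpu: 1
#
# After (DRA style):
apiVersion: v1
kind: Pod
metadata:
  name: gang-member-0                     # assumed name
  labels:
    kai.scheduler/queue: default-queue
spec:
  schedulerName: kai-scheduler
  resourceClaims:
    - name: gpu
      resourceClaimName: gang-gpu-claim   # claim carries the queue label too
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed image
      resources:
        claims:
          - name: gpu
```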
|                         | v0.12.17                                 | v0.13.0                           |
|-------------------------|------------------------------------------|-----------------------------------|
| Chart template          | `gpuPodRuntimeClassName \| default "nvidia"` | `gpuPodRuntimeClassName \| quote` |
| Empty string override   | Ignored                                  | Works                             |
| Post-install patch Job  | Required                                 | Not needed                        |
| DRA claim queue label   | Not required                             | Required                          |
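
The first row of the table comes down to how Helm's template functions treat empty strings (illustrative fragment in the style of the chart, not the actual KAI template):

```yaml
# v0.12.x-style template: `default` treats "" as unset,
# so an empty override silently falls back to "nvidia".
runtimeClassName: {{ .Values.admission.gpuPodRuntimeClassName | default "nvidia" }}

# v0.13.0-style template: `quote` renders whatever is set,
# including the empty string, so the override takes effect.
runtimeClassName: {{ .Values.admission.gpuPodRuntimeClassName | quote }}
```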

Testing

```shell
# Conformance validation (all 11 checks)
aicr validate --phase conformance  # 11/11 PASS

# Gang scheduling evidence collection
aicr validate --phase conformance --cncf-submission \
  --feature gang-scheduling --evidence-dir ./evidence  # PASS
```

Tested on EKS with H100 (p5.48xlarge), KAI v0.13.0, DRA driver, GPU Operator v25.12.0.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Existing clusters running KAI v0.12.x with the post-install workaround will upgrade cleanly: the workaround Job is idempotent, so removing it is safe. New deployments get the fix natively via Helm values.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from mchmarny March 20, 2026 20:35
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from d8804ae to c548b35 Compare March 20, 2026 20:35
@yuanchen8911 yuanchen8911 requested review from a team as code owners March 20, 2026 20:35
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from c548b35 to dd95884 Compare March 20, 2026 20:41
@yuanchen8911 yuanchen8911 changed the title chore(deps): bump kai-scheduler to v0.13.0, fix DRA gang scheduling feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling Mar 20, 2026
Upgrade KAI Scheduler from v0.12.17 to v0.13.0. Key changes:

1. Runtime class injection fix (KAI-Scheduler#1035):
   - Add admission.gpuPodRuntimeClassName: "" to Helm values
   - Remove post-install workaround Job (disable-runtime-class-injection.yaml)
   - v0.13.0 chart respects empty string instead of falling back to "nvidia"

2. DRA ResourceClaim queue label (v0.13.0 requirement):
   - KAI v0.13.0 requires kai.scheduler/queue label on DRA claims
   - Add label to gang scheduling validator (buildGangResourceClaim)
   - Update evidence manifest to use DRA ResourceClaims instead of
     device-plugin nvidia.com/gpu requests

Tested on EKS with H100: all 11 conformance checks pass including
gang scheduling with DRA + KAI v0.13.0.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from dd95884 to 4edc021 Compare March 20, 2026 23:27
@yuanchen8911
Contributor Author

The GPU H100 CI checks (conformance, training, inference) were cancelled due to a pre-existing nvkind runner infrastructure issue — all H100 tests across all PRs and main have been failing or cancelled today. This is not related to this change.

Manually validated on EKS H100 cluster (aicr-cuj2):

  • All 11 conformance checks pass (including gang scheduling with DRA + KAI v0.13.0)
  • CNCF evidence collection for gang-scheduling passes
  • KAI v0.13.0 admission.gpuPodRuntimeClassName: "" works without the post-install workaround

@yuanchen8911 yuanchen8911 merged commit 76d27c7 into NVIDIA:main Mar 21, 2026
61 of 64 checks passed