
feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling#450

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:chore/kai-scheduler-v0.13.0
Mar 21, 2026

Conversation

@yuanchen8911
Contributor

Summary

Upgrade KAI Scheduler from v0.12.17 to v0.13.0 and fix DRA compatibility for gang scheduling validation.

Motivation / Context

KAI v0.12.x injected runtimeClassName: nvidia into GPU pods via its admission webhook, which conflicted with the CDI defaults in GPU Operator v25.10.0+ and caused "unresolvable CDI devices" errors. We worked around this with a post-install Job that patched the Config CR. KAI v0.13.0 fixes this upstream (KAI-Scheduler#1035).

Additionally, KAI v0.13.0 requires kai.scheduler/queue labels on DRA ResourceClaims — without this label, the scheduler rejects device-plugin-style GPU requests on DRA-enabled nodes with "device-plugin GPU requests cannot be scheduled on DRA-only nodes".

Fixes: N/A
Related: kai-scheduler/KAI-Scheduler#1035
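
The new label requirement can be illustrated with a minimal ResourceClaim manifest (a sketch, not the manifest from this PR; the claim name and device class name are assumptions, while the `kai.scheduler/queue: default-queue` label matches the value this PR adds):

```yaml
# Sketch: a DRA ResourceClaim that KAI v0.13.0 will accept.
# Without the kai.scheduler/queue label, the scheduler rejects
# the request on DRA-enabled nodes.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gang-gpu-claim                   # assumed name
  labels:
    kai.scheduler/queue: default-queue   # required by KAI v0.13.0
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com  # assumed device class
```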

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes/overlays, recipes/registry, evidence scripts

Implementation Notes

KAI v0.13.0 upgrade:

  • recipes/registry.yaml: version bump v0.12.17 → v0.13.0
  • recipes/overlays/base.yaml: version bump + remove manifestFiles workaround
  • recipes/components/kai-scheduler/values.yaml: add admission.gpuPodRuntimeClassName: ""
  • Delete recipes/components/kai-scheduler/manifests/disable-runtime-class-injection.yaml
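
The values change amounts to a one-key override (sketch of the relevant fragment; surrounding structure is an assumption):

```yaml
# recipes/components/kai-scheduler/values.yaml (fragment)
admission:
  # Empty string disables runtime class injection entirely.
  # v0.12.x charts ignored "" and fell back to "nvidia";
  # the v0.13.0 chart honors it.
  gpuPodRuntimeClassName: ""
```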

DRA gang scheduling fix:

  • validators/conformance/gang_scheduling_check.go: add kai.scheduler/queue: default-queue label to buildGangResourceClaim
  • pkg/evidence/scripts/manifests/gang-scheduling-test.yaml: replace device-plugin nvidia.com/gpu: 1 with DRA ResourceClaims + queue labels
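
The evidence-manifest change swaps a device-plugin resource request for a DRA claim reference, roughly like this (illustrative sketch only; pod name, image, and claim name are assumptions):

```yaml
# Before (device-plugin style, rejected by KAI v0.13.0 on DRA-only nodes):
#   resources:
#     limits:
#       nvidia.com/gpu: 1
#
# After (DRA style):
apiVersion: v1
kind: Pod
metadata:
  name: gang-member-0                     # assumed name
  labels:
    kai.scheduler/queue: default-queue
spec:
  schedulerName: kai-scheduler
  resourceClaims:
    - name: gpu
      resourceClaimName: gang-gpu-claim   # claim carries the queue label too
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed image
      resources:
        claims:
          - name: gpu
```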
|                         | v0.12.17                                 | v0.13.0                           |
|-------------------------|------------------------------------------|-----------------------------------|
| Chart template          | `gpuPodRuntimeClassName \| default "nvidia"` | `gpuPodRuntimeClassName \| quote` |
| Empty string override   | Ignored                                  | Works                             |
| Post-install patch Job  | Required                                 | Not needed                        |
| DRA claim queue label   | Not required                             | Required                          |
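
The first row of the table comes down to how Helm's template functions treat empty strings (illustrative fragment in the style of the chart, not the actual KAI template):

```yaml
# v0.12.x-style template: `default` treats "" as unset,
# so an empty override silently falls back to "nvidia".
runtimeClassName: {{ .Values.admission.gpuPodRuntimeClassName | default "nvidia" }}

# v0.13.0-style template: `quote` renders whatever is set,
# including the empty string, so the override takes effect.
runtimeClassName: {{ .Values.admission.gpuPodRuntimeClassName | quote }}
```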

Testing

```shell
# Conformance validation (all 11 checks)
aicr validate --phase conformance  # 11/11 PASS

# Gang scheduling evidence collection
aicr validate --phase conformance --cncf-submission \
  --feature gang-scheduling --evidence-dir ./evidence  # PASS
```

Tested on EKS with H100 (p5.48xlarge), KAI v0.13.0, DRA driver, GPU Operator v25.12.0.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Existing clusters running KAI v0.12.x with the post-install workaround will upgrade cleanly: the workaround Job is idempotent, so removing it is safe. New deployments get the fix natively via Helm values.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from mchmarny March 20, 2026 20:35
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from d8804ae to c548b35 Compare March 20, 2026 20:35
@yuanchen8911 yuanchen8911 requested review from a team as code owners March 20, 2026 20:35
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from c548b35 to dd95884 Compare March 20, 2026 20:41
@yuanchen8911 yuanchen8911 changed the title chore(deps): bump kai-scheduler to v0.13.0, fix DRA gang scheduling feat(recipes): bump kai-scheduler to v0.13.0, fix DRA gang scheduling Mar 20, 2026
Upgrade KAI Scheduler from v0.12.17 to v0.13.0. Key changes:

1. Runtime class injection fix (KAI-Scheduler#1035):
   - Add admission.gpuPodRuntimeClassName: "" to Helm values
   - Remove post-install workaround Job (disable-runtime-class-injection.yaml)
   - v0.13.0 chart respects empty string instead of falling back to "nvidia"

2. DRA ResourceClaim queue label (v0.13.0 requirement):
   - KAI v0.13.0 requires kai.scheduler/queue label on DRA claims
   - Add label to gang scheduling validator (buildGangResourceClaim)
   - Update evidence manifest to use DRA ResourceClaims instead of
     device-plugin nvidia.com/gpu requests

Tested on EKS with H100: all 11 conformance checks pass including
gang scheduling with DRA + KAI v0.13.0.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the chore/kai-scheduler-v0.13.0 branch from dd95884 to 4edc021 Compare March 20, 2026 23:27
@yuanchen8911
Contributor Author

The GPU H100 CI checks (conformance, training, inference) were cancelled due to a pre-existing nvkind runner infrastructure issue — all H100 tests across all PRs and main have been failing or cancelled today. This is not related to this change.

Manually validated on EKS H100 cluster (aicr-cuj2):

  • All 11 conformance checks pass (including gang scheduling with DRA + KAI v0.13.0)
  • CNCF evidence collection for gang-scheduling passes
  • KAI v0.13.0 admission.gpuPodRuntimeClassName: "" works without the post-install workaround

@yuanchen8911 yuanchen8911 merged commit 76d27c7 into NVIDIA:main Mar 21, 2026
61 of 64 checks passed