feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays #415

Merged
mchmarny merged 10 commits into NVIDIA:main from Jont828:add-aks-support
Mar 17, 2026

@Jont828
Contributor

@Jont828 Jont828 commented Mar 16, 2026

Mirrors the existing EKS overlay structure with AKS-specific changes:

  • Storage: gp2 → managed-csi (Azure Disk CSI, built-in AKS addon)
  • Networking: No aws-efa equivalent needed (InfiniBand native on ND-series VMs)
  • GPU drivers: Disabled in GPU Operator (AKS pre-installs NVIDIA drivers/toolkit)
  • Skyhook: Customizations omitted (packages don't support aks yet; follows Kind pattern)
  • H100 only (GB200 not available on Azure)

Summary

Motivation / Context

Fixes:
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

Signed-off-by: Jont828 <jt572@cornell.edu>
@Jont828 Jont828 requested review from a team as code owners March 16, 2026 20:05
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Welcome to AICR, @Jont828! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@mchmarny mchmarny added the enhancement label Mar 16, 2026
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 16, 2026
Member

@mchmarny mchmarny left a comment

Great start on AKS support! The overlay hierarchy mirrors other services and the Azure-specific adaptations (managed-csi storage, disabled driver/toolkit, no EFA equivalent) are well-reasoned. Left a few inline comments — nothing major, mostly verification questions and a minor cleanup suggestion. Looking forward to seeing this land!

Contributor

@yuanchen8911 yuanchen8911 left a comment

Thanks for the PR! Please take a look at the comment: #415 (comment)

@Jont828 Jont828 changed the title from "[WIP] feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays" to "feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays" Mar 16, 2026
@yuanchen8911
Contributor

Thanks for the PR! cc @mchmarny

| # | Severity | Issue | Suggested Fix |
|---|----------|-------|---------------|
| 1 | High | helm-values check not in validator catalog; will fail deployment validation at runtime | Remove `- helm-values` from deployment checks in h100-aks-ubuntu-training.yaml:50, h100-aks-ubuntu-inference-dynamo.yaml:75, and examples/recipes/aks-training.yaml:171. EKS equivalents already removed it (PR #388). |
| 2 | High | AKS inference inherits default GPU Operator values with driver/toolkit enabled; AKS pre-installs both | Move `valuesFile: components/gpu-operator/values-aks-training.yaml` (or a renamed values-aks.yaml) to aks.yaml base componentRefs so both training and inference inherit it. Currently only aks-training.yaml:38 sets it. |
| 3 | Low (optional) | kubeflow-trainer missing dependencyRefs in h100-aks-ubuntu-training-kubeflow.yaml:45 | Add `dependencyRefs: [cert-manager, kube-prometheus-stack, gpu-operator]` to match EKS/GKE equivalents. |
| 4 | Low (optional) | No AKS conformance test in conformance_test.go | Add h100-aks-ubuntu-inference-dynamo and/or h100-aks-ubuntu-training test cases to TestConformanceRecipeInvariants. |
| 5 | Low (optional) | examples/recipes/aks-training.yaml drifted from generated output | Pins skyhook-operator v0.13.1 (current: v0.14.0); includes helm-values check (line 171). Regenerate with aicr recipe. |
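
Issue 1 in the review table above amounts to an overlay naming a check that the validator does not recognize. The following is a minimal sketch of that kind of lint, not AICR's actual validator: the catalog contents, the nested dict layout, and the `unknown_checks` helper are all assumptions made for illustration. Only the `helm-values` and `dra-support` names appear in this PR.

```python
# Hypothetical catalog of check names the validator knows about.
# "dra-support" is mentioned in this PR; "operator-health" is made up.
KNOWN_CHECKS = {"operator-health", "dra-support"}


def unknown_checks(overlay: dict, catalog: set) -> list:
    """Return deployment check names in the overlay that are not in the catalog."""
    # Assumed layout: overlay["validation"]["deployment"]["checks"] is a list of names.
    checks = overlay.get("validation", {}).get("deployment", {}).get("checks", [])
    return [name for name in checks if name not in catalog]


# An overlay that still lists the stale helm-values check.
overlay = {"validation": {"deployment": {"checks": ["operator-health", "helm-values"]}}}
print(unknown_checks(overlay, KNOWN_CHECKS))
```

Running a check like this over every overlay before merge would surface a stale name at review time rather than at deployment validation.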

Jont828 added 3 commits March 16, 2026 21:01
Signed-off-by: Jont828 <jt572@cornell.edu>
…erlay

AKS inference recipes silently inherited the base values.yaml (with
toolkit.enabled: true) because neither aks.yaml nor aks-inference.yaml
overrode the gpu-operator valuesFile. Since AKS pre-installs the NVIDIA
container toolkit, this caused conflicts on inference deployments.

Create values-aks.yaml with the shared toolkit disable and wire it into
the aks.yaml base overlay so all AKS intents inherit it. Slim down
values-aks-training.yaml to only training-specific settings.

Add docs/integrator/aks-gpu-setup.md documenting the --gpu-driver none
nodepool prerequisite to avoid driver conflicts with GPU Operator.

Signed-off-by: Jont828 <jt572@cornell.edu>
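
The inheritance problem described in the commit message above can be illustrated with a small layered-values sketch. It assumes, hypothetically, that overlay values are deep-merged from the base outward; the `deep_merge` helper and the `example.trainingOnly` key are illustrative and are not taken from AICR's recipe engine. Only `toolkit.enabled` comes from the commit message.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override's keys win; nested dicts merge recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


BASE_VALUES = {"toolkit": {"enabled": True}}           # stands in for the base values.yaml
AKS_VALUES = {"toolkit": {"enabled": False}}           # stands in for values-aks.yaml
TRAINING_VALUES = {"example": {"trainingOnly": True}}  # hypothetical training-only setting

# Before the fix: inference had no AKS layer, so the base default survived.
inference_before = deep_merge(BASE_VALUES, {})
# After the fix: the AKS layer sits in the shared base, so every intent picks it up.
inference_after = deep_merge(BASE_VALUES, AKS_VALUES)
training_after = deep_merge(inference_after, TRAINING_VALUES)

print(inference_before["toolkit"]["enabled"], inference_after["toolkit"]["enabled"])
```

Placing the shared override in the common layer means a new AKS intent cannot forget it, which is the motivation the commit gives for wiring values-aks.yaml into aks.yaml.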
Remove non-existent network-operator-health check from aks.yaml
conformance validation, remove stale helm-values check references,
fix YAML comment indentation for yamllint compliance, add missing
AKS GPU Setup sidebar entry, and add kubeflow-trainer dependency
refs.

Signed-off-by: Jont828 <jt572@cornell.edu>
Jont828 and others added 3 commits March 16, 2026 22:07
DRA (Dynamic Resource Allocation) graduated to GA in Kubernetes 1.34
with stable resource.k8s.io/v1 APIs. Bump the AKS overlay K8s version
constraint from >= 1.28 to >= 1.34, update integrator and user docs
with version requirements, feature gate timeline, CLI overrides, and
device-plugin vs DRA guidance. Add AKS to supported platforms in README.

Signed-off-by: Jont828 <jt572@cornell.edu>
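
A constraint such as `>= 1.34` has to be compared component by component as integers, because a plain string comparison would rank 1.9 above 1.34. A minimal sketch follows, assuming version strings shaped like `v1.34.2` or `1.34`; the `parse_major_minor` and `satisfies_min` helpers are hypothetical, and AICR's own constraint parser is not shown in this PR.

```python
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) as integers from strings like 'v1.34.2' or '1.34'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Kubernetes version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def satisfies_min(version: str, minimum: str) -> bool:
    """True if version >= minimum, comparing major and minor numerically."""
    return parse_major_minor(version) >= parse_major_minor(minimum)


print(satisfies_min("v1.34.0", "1.34"))
print(satisfies_min("1.28.5", "1.34"))
```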
…overlay

Add K8s >= 1.34 constraint, nvidia-dra-driver-gpu component ref with
gpuResourcesEnabledOverride, and dra-support conformance check to the
h100-aks-ubuntu-inference-dynamo overlay.

Signed-off-by: Jont828 <jt572@cornell.edu>
@yuanchen8911 yuanchen8911 self-requested a review March 17, 2026 15:54
Contributor

@yuanchen8911 yuanchen8911 left a comment

/lgtm

@yuanchen8911
Contributor

Need to rebase.

@mchmarny mchmarny enabled auto-merge (squash) March 17, 2026 15:56
@mchmarny mchmarny merged commit 87fd28f into NVIDIA:main Mar 17, 2026
60 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
…#415)

Signed-off-by: Jont828 <jt572@cornell.edu>
Co-authored-by: Mark Chmarny <mchmarny@users.noreply.github.com>