feat(validator): add EKS/GKE cluster autoscaling fallback#438

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/cluster-autoscaling-platform-fallback
Mar 19, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add platform-aware fallback validation for cluster autoscaling on EKS and GKE when Karpenter is absent. Also fix validator image pull policy for dev builds.

Motivation / Context

The cluster-autoscaling conformance check previously only validated Karpenter and skipped on all other clusters. EKS and GKE clusters that use native autoscaling (ASG node groups, GKE built-in cluster autoscaler) were incorrectly reported as skipped.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Cluster autoscaling fallback (validators/conformance/cluster_autoscaling_check.go):

  • Detects platform from node providerID (aws:// → EKS, gce:// → GKE)
  • EKS: Validates GPU nodes belong to a managed node group via eks.amazonaws.com/nodegroup labels, scans all GPU nodes (not just first), optionally checks for Cluster Autoscaler deployment
  • GKE: Validates GPU nodes, checks cluster-autoscaler-status ConfigMap, records node pool annotations and autoscaler events
  • Fallback only triggers on ErrCodeNotFound (Karpenter truly absent). Unhealthy Karpenter is reported as a failure, not masked by fallback.
  • Skips for unrecognized platforms (e.g., Kind)
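The providerID-based detection in the first bullet can be sketched as a small helper. This is a minimal sketch; the type and function names are illustrative, not the actual identifiers in cluster_autoscaling_check.go:

```go
package main

import (
	"fmt"
	"strings"
)

// Platform is the managed Kubernetes flavor inferred from a node's
// spec.providerID. Names here are illustrative.
type Platform string

const (
	PlatformEKS     Platform = "eks"
	PlatformGKE     Platform = "gke"
	PlatformUnknown Platform = "unknown"
)

// detectPlatform maps the providerID prefix to a platform: EKS nodes carry
// "aws://..." and GKE nodes carry "gce://...". Anything else (e.g. a Kind
// node) is unrecognized, and the conformance check is skipped.
func detectPlatform(providerID string) Platform {
	switch {
	case strings.HasPrefix(providerID, "aws://"):
		return PlatformEKS
	case strings.HasPrefix(providerID, "gce://"):
		return PlatformGKE
	default:
		return PlatformUnknown
	}
}

func main() {
	for _, id := range []string{
		"aws://us-west-2a/i-0abc123",
		"gce://my-project/us-central1-a/gke-node-1",
		"kind://docker/kind/kind-control-plane",
	} {
		fmt.Printf("%-45s -> %s\n", id, detectPlatform(id))
	}
}
```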

Image pull policy fix (pkg/validator/job/deployer.go):

  • Uses PullAlways for :latest tags (dev builds) to prevent stale cached images from causing `exec format error` on cluster nodes
  • Versioned tags continue to use PullIfNotPresent
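The tag-based policy selection can be sketched as below. The function name `pullPolicyFor` is hypothetical, and the returned strings mirror the corev1.PullAlways / corev1.PullIfNotPresent constants the real deployer would use:

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicyFor sketches the rule described for pkg/validator/job/deployer.go:
// a ":latest" tag marks a dev build, so the pod spec uses Always to force a
// fresh pull and avoid running a stale cached image built for the wrong
// architecture ("exec format error"); versioned tags keep IfNotPresent.
func pullPolicyFor(image string) string {
	if strings.HasSuffix(image, ":latest") {
		return "Always"
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr-validator:latest")) // Always
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr-validator:v0.4.1")) // IfNotPresent
}
```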

Testing

go test -race ./validators/conformance/ -run 'Test(Check|Detect|Deref)' -count=1
go test -race ./pkg/validator/job/ -run 'TestImagePullPolicy' -count=1
golangci-lint run ./validators/conformance/ ./pkg/validator/job/

End-to-end conformance validation on real clusters:

  • GKE aicr-demo4: 10/10 passed (cluster-autoscaling via GKE cluster autoscaler)
  • EKS aicr-cuj1: 10/10 passed (cluster-autoscaling via EKS ASG node group)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — additive change, existing Karpenter validation path unchanged.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 19, 2026 17:34
@yuanchen8911 yuanchen8911 added enhancement New feature or request area/validator labels Mar 19, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch 2 times, most recently from b8f92bb to fb15adc on March 19, 2026 at 17:40
dims previously approved these changes Mar 19, 2026

@njtran njtran left a comment

There are some assumptions I think we need to generalize better.

@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch from fb15adc to 1aedd80 on March 19, 2026 at 18:05
@yuanchen8911 yuanchen8911 requested review from dims and njtran March 19, 2026 18:21
njtran previously approved these changes Mar 19, 2026
When Karpenter is absent, fall back to platform-specific autoscaling
validation: EKS node group (ASG-backed) or GKE built-in cluster
autoscaler.

- Search Karpenter by label across all namespaces (not hardcoded)
- Use deployment's desired replicas for health check
- Search multiple namespaces for Cluster Autoscaler components
- Distinguish NotFound from API errors to avoid masking failures
@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch from 1aedd80 to 0c67fa3 on March 19, 2026 at 19:22
@yuanchen8911 yuanchen8911 merged commit d3fd483 into NVIDIA:main Mar 19, 2026
15 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 25, 2026
…local images

Split accelerator_metrics/ai_service_metrics evidence into separate paths
with auto-detection of inference (Dynamo) vs training (PyTorch) workloads.

Fix imagePullPolicy regression from NVIDIA#438: local images (ko.local, kind.local,
localhost) now use IfNotPresent instead of Always, preventing 5-minute pull
timeout per validator on nvkind CI clusters.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
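The refinement this commit describes can be sketched as follows. Function names are hypothetical and the local-registry prefixes are taken from the commit message; this is a sketch of the stated rule, not the actual deployer code:

```go
package main

import (
	"fmt"
	"strings"
)

// isLocalImage reports whether an image reference points at a local registry
// that cluster nodes cannot pull from (prefixes from the commit message).
func isLocalImage(image string) bool {
	for _, prefix := range []string{"ko.local", "kind.local", "localhost"} {
		if strings.HasPrefix(image, prefix+"/") || strings.HasPrefix(image, prefix+":") {
			return true
		}
	}
	return false
}

// pullPolicyFor refines the NVIDIA#438 rule: local images always use
// IfNotPresent (a pull would only time out), and only remote ":latest"
// dev builds force Always.
func pullPolicyFor(image string) string {
	if isLocalImage(image) {
		return "IfNotPresent"
	}
	if strings.HasSuffix(image, ":latest") {
		return "Always"
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ko.local/validator:latest"))       // IfNotPresent
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/validator:latest")) // Always
}
```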
3 participants