feat(validator): add EKS/GKE cluster autoscaling fallback#438

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/cluster-autoscaling-platform-fallback
Mar 19, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add platform-aware fallback validation for cluster autoscaling on EKS and GKE when Karpenter is absent. Also fix validator image pull policy for dev builds.

Motivation / Context

The cluster-autoscaling conformance check previously only validated Karpenter and skipped on all other clusters. EKS and GKE clusters that use native autoscaling (ASG node groups, GKE built-in cluster autoscaler) were incorrectly reported as skipped.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Cluster autoscaling fallback (validators/conformance/cluster_autoscaling_check.go):

  • Detects platform from node providerID (aws:// → EKS, gce:// → GKE)
  • EKS: Validates GPU nodes belong to a managed node group via eks.amazonaws.com/nodegroup labels, scans all GPU nodes (not just first), optionally checks for Cluster Autoscaler deployment
  • GKE: Validates GPU nodes, checks cluster-autoscaler-status ConfigMap, records node pool annotations and autoscaler events
  • Fallback only triggers on ErrCodeNotFound (Karpenter truly absent). Unhealthy Karpenter is reported as a failure, not masked by fallback.
  • Skips for unrecognized platforms (e.g., Kind)
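The providerID-based detection in the first bullet can be sketched as a small helper. This is a minimal sketch; the type and function names are illustrative, not the actual identifiers in cluster_autoscaling_check.go:

```go
package main

import (
	"fmt"
	"strings"
)

// Platform is the managed Kubernetes flavor inferred from a node's
// spec.providerID. Names here are illustrative.
type Platform string

const (
	PlatformEKS     Platform = "eks"
	PlatformGKE     Platform = "gke"
	PlatformUnknown Platform = "unknown"
)

// detectPlatform maps the providerID prefix to a platform: EKS nodes carry
// "aws://..." and GKE nodes carry "gce://...". Anything else (e.g. a Kind
// node) is unrecognized, and the conformance check is skipped.
func detectPlatform(providerID string) Platform {
	switch {
	case strings.HasPrefix(providerID, "aws://"):
		return PlatformEKS
	case strings.HasPrefix(providerID, "gce://"):
		return PlatformGKE
	default:
		return PlatformUnknown
	}
}

func main() {
	for _, id := range []string{
		"aws://us-west-2a/i-0abc123",
		"gce://my-project/us-central1-a/gke-node-1",
		"kind://docker/kind/kind-control-plane",
	} {
		fmt.Printf("%-45s -> %s\n", id, detectPlatform(id))
	}
}
```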

Image pull policy fix (pkg/validator/job/deployer.go):

  • Uses PullAlways for :latest tags (dev builds) to prevent stale cached images from causing `exec format error` on cluster nodes
  • Versioned tags continue to use PullIfNotPresent
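The tag-based policy selection can be sketched as below. The function name `pullPolicyFor` is hypothetical, and the returned strings mirror the corev1.PullAlways / corev1.PullIfNotPresent constants the real deployer would use:

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicyFor sketches the rule described for pkg/validator/job/deployer.go:
// a ":latest" tag marks a dev build, so the pod spec uses Always to force a
// fresh pull and avoid running a stale cached image built for the wrong
// architecture ("exec format error"); versioned tags keep IfNotPresent.
func pullPolicyFor(image string) string {
	if strings.HasSuffix(image, ":latest") {
		return "Always"
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr-validator:latest")) // Always
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr-validator:v0.4.1")) // IfNotPresent
}
```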

Testing

go test -race ./validators/conformance/ -run 'Test(Check|Detect|Deref)' -count=1
go test -race ./pkg/validator/job/ -run 'TestImagePullPolicy' -count=1
golangci-lint run ./validators/conformance/ ./pkg/validator/job/

End-to-end conformance validation on real clusters:

  • GKE aicr-demo4: 10/10 passed (cluster-autoscaling via GKE cluster autoscaler)
  • EKS aicr-cuj1: 10/10 passed (cluster-autoscaling via EKS ASG node group)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — additive change, existing Karpenter validation path unchanged.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 19, 2026 17:34
@yuanchen8911 yuanchen8911 added enhancement New feature or request area/validator labels Mar 19, 2026
@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch 2 times, most recently from b8f92bb to fb15adc on March 19, 2026 at 17:40
dims previously approved these changes Mar 19, 2026

@njtran njtran left a comment

There are some assumptions I think we need to generalize better.

@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch from fb15adc to 1aedd80 on March 19, 2026 at 18:05
@yuanchen8911 yuanchen8911 requested review from dims and njtran March 19, 2026 18:21
njtran previously approved these changes Mar 19, 2026
When Karpenter is absent, fall back to platform-specific autoscaling
validation: EKS node group (ASG-backed) or GKE built-in cluster
autoscaler.

- Search Karpenter by label across all namespaces (not hardcoded)
- Use deployment's desired replicas for health check
- Search multiple namespaces for Cluster Autoscaler components
- Distinguish NotFound from API errors to avoid masking failures
@yuanchen8911 yuanchen8911 force-pushed the feat/cluster-autoscaling-platform-fallback branch from 1aedd80 to 0c67fa3 on March 19, 2026 at 19:22
@yuanchen8911 yuanchen8911 merged commit d3fd483 into NVIDIA:main Mar 19, 2026
15 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 25, 2026
…local images

Split accelerator_metrics/ai_service_metrics evidence into separate paths
with auto-detection of inference (Dynamo) vs training (PyTorch) workloads.

Fix imagePullPolicy regression from NVIDIA#438: local images (ko.local, kind.local,
localhost) now use IfNotPresent instead of Always, preventing 5-minute pull
timeout per validator on nvkind CI clusters.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
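The refinement this commit describes can be sketched as follows. Function names are hypothetical and the local-registry prefixes are taken from the commit message; this is a sketch of the stated rule, not the actual deployer code:

```go
package main

import (
	"fmt"
	"strings"
)

// isLocalImage reports whether an image reference points at a local registry
// that cluster nodes cannot pull from (prefixes from the commit message).
func isLocalImage(image string) bool {
	for _, prefix := range []string{"ko.local", "kind.local", "localhost"} {
		if strings.HasPrefix(image, prefix+"/") || strings.HasPrefix(image, prefix+":") {
			return true
		}
	}
	return false
}

// pullPolicyFor refines the NVIDIA#438 rule: local images always use
// IfNotPresent (a pull would only time out), and only remote ":latest"
// dev builds force Always.
func pullPolicyFor(image string) string {
	if isLocalImage(image) {
		return "IfNotPresent"
	}
	if strings.HasSuffix(image, ":latest") {
		return "Always"
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ko.local/validator:latest"))       // IfNotPresent
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/validator:latest")) // Always
}
```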
3 participants