feat(validator): add EKS/GKE cluster autoscaling fallback#438
Merged
yuanchen8911 merged 1 commit into NVIDIA:main on Mar 19, 2026
b8f92bb to fb15adc
dims previously approved these changes on Mar 19, 2026
njtran reviewed on Mar 19, 2026
njtran (Contributor) left a comment:
There are some assumptions I think we need to generalize better.
fb15adc to 1aedd80
njtran previously approved these changes on Mar 19, 2026
When Karpenter is absent, fall back to platform-specific autoscaling validation: EKS node group (ASG-backed) or GKE built-in cluster autoscaler.

- Search Karpenter by label across all namespaces (not hardcoded)
- Use deployment's desired replicas for health check
- Search multiple namespaces for Cluster Autoscaler components
- Distinguish NotFound from API errors to avoid masking failures
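The decision logic in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ErrNotFound`, `Deployment`, `ChoosePath`, and the path names are hypothetical stand-ins (the real check uses the project's `ErrCodeNotFound` and Kubernetes client types, which are not shown on this page). It shows the two ideas from the bullets: health is judged against the deployment's own desired replica count, and only a genuine not-found result triggers the fallback.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel standing in for the project's
// ErrCodeNotFound; the real error type is not shown on this page.
var ErrNotFound = errors.New("not found")

// errAPITimeout is a made-up transient API failure used for the demo.
var errAPITimeout = errors.New("api timeout")

// Deployment is a simplified stand-in for a Kubernetes Deployment status.
type Deployment struct {
	Name            string
	DesiredReplicas int32
	ReadyReplicas   int32
}

// Healthy compares ready replicas against the deployment's own desired
// count rather than a hardcoded number. Treating zero desired replicas
// as unhealthy is an assumption of this sketch.
func (d Deployment) Healthy() bool {
	return d.DesiredReplicas > 0 && d.ReadyReplicas >= d.DesiredReplicas
}

// Path names which validation path the check should take.
type Path string

const (
	PathKarpenter Path = "karpenter"
	PathFallback  Path = "platform-fallback"
)

// ChoosePath takes the result of a Karpenter lookup and decides what to do.
// Only a not-found error selects the fallback; any other error, or an
// unhealthy deployment, is surfaced as a failure instead of being masked.
func ChoosePath(d Deployment, lookupErr error) (Path, error) {
	switch {
	case errors.Is(lookupErr, ErrNotFound):
		return PathFallback, nil
	case lookupErr != nil:
		return "", fmt.Errorf("karpenter lookup failed: %w", lookupErr)
	case !d.Healthy():
		return "", fmt.Errorf("karpenter deployment %q unhealthy: %d/%d replicas ready",
			d.Name, d.ReadyReplicas, d.DesiredReplicas)
	}
	return PathKarpenter, nil
}

func main() {
	healthy := Deployment{Name: "karpenter", DesiredReplicas: 2, ReadyReplicas: 2}
	degraded := Deployment{Name: "karpenter", DesiredReplicas: 2, ReadyReplicas: 1}

	fmt.Println(ChoosePath(healthy, nil))
	fmt.Println(ChoosePath(degraded, nil))
	fmt.Println(ChoosePath(Deployment{}, ErrNotFound))
	fmt.Println(ChoosePath(Deployment{}, errAPITimeout))
}
```

Returning an error for the degraded and API-failure cases, rather than silently taking the fallback, is what keeps a broken Karpenter install from being reported as a passing platform check.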
1aedd80 to 0c67fa3
dims approved these changes on Mar 19, 2026
xdu31 pushed a commit to xdu31/aicr that referenced this pull request on Mar 24, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on Mar 25, 2026
…local images

Split accelerator_metrics/ai_service_metrics evidence into separate paths with auto-detection of inference (Dynamo) vs training (PyTorch) workloads.

Fix imagePullPolicy regression from NVIDIA#438: local images (ko.local, kind.local, localhost) now use IfNotPresent instead of Always, preventing 5-minute pull timeout per validator on nvkind CI clusters.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Summary
Add platform-aware fallback validation for cluster autoscaling on EKS and GKE when Karpenter is absent. Also fix validator image pull policy for dev builds.
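As a rough illustration of the two changes summarized above, the sketch below follows the rules listed in the Implementation Notes: the platform is inferred from the node `providerID` prefix (`aws://` for EKS, `gce://` for GKE), and the pull policy is `Always` for `:latest` tags and `IfNotPresent` otherwise. The function names, the `Platform` and `PullPolicy` types, and the example provider IDs and image names are assumptions made for this example; the PR's actual code in `cluster_autoscaling_check.go` and `deployer.go` is not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// Platform is a hypothetical label for the detected cluster platform.
type Platform string

const (
	PlatformEKS     Platform = "eks"
	PlatformGKE     Platform = "gke"
	PlatformUnknown Platform = "unknown"
)

// DetectPlatform maps a node's spec.providerID to a platform using the
// prefixes named in the PR description.
func DetectPlatform(providerID string) Platform {
	switch {
	case strings.HasPrefix(providerID, "aws://"):
		return PlatformEKS
	case strings.HasPrefix(providerID, "gce://"):
		return PlatformGKE
	default:
		return PlatformUnknown
	}
}

// PullPolicy mirrors the Kubernetes imagePullPolicy string values.
type PullPolicy string

const (
	PullAlways       PullPolicy = "Always"
	PullIfNotPresent PullPolicy = "IfNotPresent"
)

// PullPolicyFor returns Always for an explicit :latest tag so that dev
// builds are not served from a stale node cache, and IfNotPresent for
// everything else. Untagged images and digests are not special-cased
// here; that simplification is an assumption of this sketch.
func PullPolicyFor(image string) PullPolicy {
	if strings.HasSuffix(image, ":latest") {
		return PullAlways
	}
	return PullIfNotPresent
}

func main() {
	// Illustrative provider IDs, not taken from the PR.
	ids := []string{
		"aws:///us-west-2a/i-0123456789abcdef0",
		"gce://my-project/us-central1-a/gpu-node-1",
		"kind://docker/kind/kind-worker",
	}
	for _, id := range ids {
		fmt.Printf("%s -> %s\n", id, DetectPlatform(id))
	}

	// Illustrative image references, not taken from the PR.
	for _, img := range []string{"example.com/validator:latest", "example.com/validator:v1.2.3"} {
		fmt.Printf("%s -> %s\n", img, PullPolicyFor(img))
	}
}
```

Note that a follow-up commit referenced in the conversation reports a regression from this rule for local images (`ko.local`, `kind.local`, `localhost`), which were later switched back to `IfNotPresent`; the sketch does not include that exception.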
Motivation / Context
The `cluster-autoscaling` conformance check previously only validated Karpenter and skipped on all other clusters. EKS and GKE clusters that use native autoscaling (ASG node groups, GKE built-in cluster autoscaler) were incorrectly reported as skipped.

Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
Cluster autoscaling fallback (`validators/conformance/cluster_autoscaling_check.go`):
- Platform detection from `providerID` (`aws://` → EKS, `gce://` → GKE)
- EKS: `eks.amazonaws.com/nodegroup` labels, scans all GPU nodes (not just first), optionally checks for Cluster Autoscaler deployment
- GKE: `cluster-autoscaler-status` ConfigMap, records node pool annotations and autoscaler events
- Fallback only on `ErrCodeNotFound` (Karpenter truly absent). Unhealthy Karpenter is reported as a failure, not masked by fallback.

Image pull policy fix (`pkg/validator/job/deployer.go`):
- `PullAlways` for `:latest` tags (dev builds) to prevent stale cached images causing `exec format error` on cluster nodes
- `PullIfNotPresent` otherwise

Testing
End-to-end conformance validation on real clusters:
Risk Assessment
Rollout notes: N/A — additive change, existing Karpenter validation path unchanged.
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S` (GPG signing)