
test(autoscaling): expand e2e coverage for HPA external metrics#137142

Open
atilsensalduz wants to merge 1 commit into kubernetes:master from atilsensalduz:test/hpa-external-metrics-coverage

Conversation

@atilsensalduz (Contributor)

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR expands the e2e test coverage for the HPA External Metrics framework introduced in #136799. Three new test cases are added to horizontal_pod_autosclaing_external_metrics.go, covering multi-metric scaling, downscale stabilization window enforcement, and scale-up rate limiting. A new test-framework helper, CreateMultiMetricExternalHPAWithBehavior, is introduced in autoscaling_utils.go to fill a gap in the existing API: it enables creating an HPA with both multiple MetricSpec entries and a custom HorizontalPodAutoscalerBehavior in a single call.

Which issue(s) this PR is related to:

Fixes #137132
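For context, an HPA that combines multiple external metrics with a custom behavior block (the shape the new helper is meant to produce in one call) might look roughly like this autoscaling/v2 manifest. This is an illustrative sketch, not the object built by the PR; the names, metric targets, and policy values are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-external-example      # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer       # placeholder scale target
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: http_requests_total
      target:
        type: AverageValue
        averageValue: "100"
  - type: External
    external:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # downscale stabilization window
    scaleUp:
      policies:
      - type: Pods                      # scale-up rate limit: at most
        value: 1                        # one pod per periodSeconds
        periodSeconds: 60
```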

@k8s-ci-robot (Contributor)

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added labels on Feb 19, 2026: size/L (Denotes a PR that changes 100-499 lines, ignoring generated files), kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt), do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one)
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot added labels on Feb 19, 2026: needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one), needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test)
@k8s-ci-robot (Contributor)

Hi @atilsensalduz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot added the needs-priority label (Indicates a PR lacks a `priority/foo` label and requires one) on Feb 19, 2026
@k8s-ci-robot added labels area/e2e-test-framework (Issues or PRs related to refactoring the kubernetes e2e test framework), area/test, sig/autoscaling (Categorizes an issue or PR as relevant to SIG Autoscaling), sig/testing (Categorizes an issue or PR as relevant to SIG Testing), and removed do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one) on Feb 19, 2026
@atilsensalduz (Contributor Author)

❯ ./_output/bin/ginkgo run \
  --label-filter="Feature:HPA" \
  --focus="Horizontal pod autoscaling \(external metrics\)" \
  --timeout=60m \
  ./_output/bin/e2e.test -- \
  --kubeconfig=$HOME/.kube/config \
  --provider=local

  I0219 21:29:26.448267   60445 e2e.go:109] Starting e2e run "72deda95-7fd2-4f9e-9e68-841a2b22686c" on Ginkgo node 1
Running Suite: Kubernetes e2e suite - ~/workspace/dev/kubernetes/_output/bin
===============================================================================================
Random Seed: 1771525765 - will randomize all specs

Will run 4 of 7435 specs

Ran 4 of 7435 Specs in 417.481 seconds
SUCCESS! -- 4 Passed | 0 Failed | 0 Pending | 7431 Skipped
PASS

@adrianmoisey (Member)

Thanks for this!
/ok-to-test

@k8s-ci-robot added ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) and removed needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) labels on Feb 19, 2026
@atilsensalduz (Contributor Author)

/retest

@@ -83,4 +84,135 @@ var _ = SIGDescribe(feature.HPA, "Horizontal pod autoscaling (external metrics)"
rc.WaitForReplicas(ctx, int(*hpa.Spec.MinReplicas), maxResourceConsumerDelay+waitBuffer)
ginkgo.DeferCleanup(e2eautoscaling.DeleteHorizontalPodAutoscaler, rc, hpa.Name)
})

ginkgo.It("should scale based on multiple external metrics", func(ctx context.Context) {
Member

I would expect a test with multiple external metrics, where the HPA scales based on the maximum value among them. In this test you can set one metric to the value of 0 and another has a higher value then the HPA should scale according to the maximum metric.

Contributor Author

Thank you for the review and recommendation, @omerap12. I've updated the multi-metric test and added a new test case that uses the average value metric type.
The existing multi-metric test verifies that setting one metric to 0 while another stays high prevents scale-down, as the HPA selects the maximum across all metrics.
The new test extends this validation: after scaling to max, dropping the dominant metric causes the HPA to scale down only to the level required by the remaining metric, not to the minimum. This demonstrates that MAX selection holds at intermediate replica counts; the average value metric type enables precise control over those intermediate counts during testing.
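The MAX-across-metrics selection and the AverageValue formula described above can be sketched as a small standalone calculation. This is an illustrative model of the HPA rule, not code from the PR; the metric names, targets, and replica bounds are placeholders.

```go
package main

import (
	"fmt"
	"math"
)

// externalMetric pairs an observed external metric value with its
// AverageValue target. Names and values are illustrative.
type externalMetric struct {
	name   string
	value  float64 // current metric value
	target float64 // target average value per replica
}

// desiredReplicas mimics the rule exercised by the tests: each metric
// proposes ceil(value/target) replicas, the HPA takes the maximum
// proposal, and the result is clamped to [minReplicas, maxReplicas].
func desiredReplicas(metrics []externalMetric, minReplicas, maxReplicas int) int {
	desired := minReplicas
	for _, m := range metrics {
		proposal := int(math.Ceil(m.value / m.target))
		if proposal > desired {
			desired = proposal
		}
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	metrics := []externalMetric{
		{name: "http_requests_total", value: 0, target: 100},   // proposes 0, floored to min
		{name: "queue_messages_ready", value: 100, target: 50}, // proposes ceil(100/50) = 2
	}
	// MAX(1, 2) = 2: the HPA scales down only as far as the remaining
	// metric allows, not to minReplicas.
	fmt.Println(desiredReplicas(metrics, 1, 5))
}
```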

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 2 times, most recently from c853200 to b47e163 on February 21, 2026 16:48
@k8s-ci-robot added the do-not-merge/contains-merge-commits label (Indicates a PR which contains merge commits) on Feb 21, 2026
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 7a62ee0 to 46cfa73 on February 21, 2026 17:11
@k8s-ci-robot removed the do-not-merge/contains-merge-commits label on Feb 21, 2026
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 46cfa73 to 6916cd8 on February 21, 2026 18:32
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 3 times, most recently from 7fa7ddb to d0eb960 on February 22, 2026 02:46
@atilsensalduz (Contributor Author)

/retest

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 8 times, most recently from a9bc82c to 5ba7d4d on February 23, 2026 07:58
@atilsensalduz (Contributor Author)

/retest

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 5ba7d4d to 4f5159c on February 23, 2026 10:26
@atilsensalduz (Contributor Author)

Hi @omerap12 , I've completed the PR. While working on it, I noticed the external metrics test filename had a typo, so I fixed that as well. Could you please review when you have a chance?

@omerap12 (Member) left a comment

Some comments from me

}
})

ginkgo.It("should scale up and down based on external metric value", framework.WithSlow(), framework.WithSerial(), func(ctx context.Context) {
Member

Why do we need WithSlow?

Contributor Author

I initially added the slow labeling while investigating test failures in CI, but since these tests only take 2-4 minutes, I've now removed it. Thanks for catching that!

Comment on lines +105 to +128
behavior := e2eautoscaling.HPABehaviorWithStabilizationWindows(0, 0)
hpa := e2eautoscaling.CreateMultiMetricHPAWithBehavior(ctx, rc, metrics, int32(initPods), 5, behavior)
ginkgo.DeferCleanup(e2eautoscaling.DeleteHorizontalPodAutoscaler, rc, hpa.Name)

ginkgo.By("Waiting for HPA to scale up to max replicas")
rc.WaitForReplicas(ctx, int(hpa.Spec.MaxReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)

ginkgo.By("Setting queue_messages_ready to 0 while keeping http_requests_total high")
// queue_messages_ready=0 -> calculation yields 0, floored to minReplicas=1
// http_requests_total=500 -> wants 5 replicas
// HPA takes MAX(1, 5) = 5 -> should NOT scale down
err := metricsController.SetMetricValue(ctx, "queue_messages_ready", 0, nil)
framework.ExpectNoError(err)

ginkgo.By("Verifying HPA does not scale down while http_requests_total still requires max replicas")
rc.EnsureDesiredReplicasInRange(ctx, int(hpa.Spec.MaxReplicas), int(hpa.Spec.MaxReplicas),
2*hpaReconciliationInterval+waitBuffer, hpa.Name)

ginkgo.By("Setting http_requests_total to 0 to trigger scale down")
err = metricsController.SetMetricValue(ctx, "http_requests_total", 0, nil)
framework.ExpectNoError(err)

ginkgo.By("Waiting for HPA to scale down to min replicas")
rc.WaitForReplicas(ctx, int(*hpa.Spec.MinReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)
Member

This can be simplified by creating one metric with value 0 and another with a value > 0, then validating that the HPA scales the workload up to max replicas.
I don't see the point in waiting for the HPA to scale up to max replicas and then waiting again to verify it does not scale down.

Contributor Author

Updated accordingly. You're right, that's much simpler.

Comment on lines +113 to +115
// queue_messages_ready=0 -> calculation yields 0, floored to minReplicas=1
// http_requests_total=500 -> wants 5 replicas
// HPA takes MAX(1, 5) = 5 -> should NOT scale down
Member

This is not accurate; the HPA should be scaled to max replicas in this test.

Comment on lines +144 to +146
// AverageValueMetricType formula: desiredReplicas = ceil(metricValue / target), independent of currentReplicas.
// This is required here because ValueMetricType computes ceil(currentReplicas * metricValue/target),
// which would keep driving scale-up as replica count grows, preventing stable intermediate values.
Member

This is unnecessary.

Suggested change
// AverageValueMetricType formula: desiredReplicas = ceil(metricValue / target), independent of currentReplicas.
// This is required here because ValueMetricType computes ceil(currentReplicas * metricValue/target),
// which would keep driving scale-up as replica count grows, preventing stable intermediate values.

Comment on lines +159 to +160
ginkgo.By("Waiting for HPA to scale up to max replicas (http_requests_total dominates)")
rc.WaitForReplicas(ctx, int(hpa.Spec.MaxReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)
Member

Why? If we are using AverageValue, we can pick different values that cause the HPA to scale out but not all the way to max replicas.

Contributor Author

Actually, I'm thinking of adding a dedicated test for the average value metric type. The idea is to make it distinct from the other tests so we can specifically validate how HPA handles intermediate replica counts when metrics change.

Comment on lines +163 to +166
// http_requests_total=0 -> calculation yields 0, floored to minReplicas=1
// queue_messages_ready=100 -> wants ceil(100/50)=2 replicas
// HPA takes MAX(1, 2) = 2 -> should scale down to 2, not to min
err := metricsController.SetMetricValue(ctx, "http_requests_total", 0, nil)
Member

Let's use a different non-zero value, so the comparison is between two real values rather than 0 versus another value.

Comment on lines +250 to +253
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer
Member

Suggested change
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := deadlineFor2 + waitBuffer

// Delete the APIService
config, err := framework.LoadConfig()
if err == nil {
if err != nil {
Member

I am not sure that this logic makes sense, how about:

func CleanupExternalMetricsServer(ctx context.Context, c clientset.Interface, ns, name string) {
	ginkgo.By(fmt.Sprintf("Cleaning up external metrics server %s", name))

	// Delete the APIService
	config, err := framework.LoadConfig()
	if err != nil {
		framework.Failf("%s", err)
	} 
	aggClient, err := aggregatorclient.NewForConfig(config)
	if err != nil {
		framework.Failf("could not create aggregator client: %v", err)
	}
	
	err = aggClient.ApiregistrationV1().APIServices().Delete(ctx, "v1beta1.external.metrics.k8s.io", metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		framework.Logf("Error deleting APIService: %v", err)
	}
	
	// Delete the deployment
	err = c.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil {
		framework.Logf("Error deleting deployment: %v", err)
	}

	// Delete the service
	err = c.CoreV1().Services(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil {
		framework.Logf("Error deleting service: %v", err)
	}
}

Contributor Author

Good point, thank you @omerap12. I updated it to fail fast instead of silently skipping the APIService deletion.

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 4f5159c to 6ef047e on February 23, 2026 18:20
@atilsensalduz (Contributor Author)

/retest

@omerap12 (Member) left a comment

More comments from me; overall LGTM. Thanks for taking care of this :)
/assign @adrianmoisey

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atilsensalduz
Once this PR has been reviewed and has the lgtm label, please ask for approval from adrianmoisey. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 784fdf2 to cc62f37 on February 24, 2026 11:40
Comment on lines +810 to +835
// Try to create the APIService. If it already exists (left over from a previous
// run that failed without cleanup), update it so the Service reference points to the
// current test's namespace.
created, err := aggClient.ApiregistrationV1().APIServices().Create(ctx, apiService, metav1.CreateOptions{})
if err == nil {
return created, nil
}
if !apierrors.IsAlreadyExists(err) {
return nil, err
}
for {
existing, err := aggClient.ApiregistrationV1().APIServices().Get(ctx, apiService.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
apiService.ResourceVersion = existing.ResourceVersion
updated, err := aggClient.ApiregistrationV1().APIServices().Update(ctx, apiService, metav1.UpdateOptions{})
if err == nil {
return updated, nil
}
if !apierrors.IsConflict(err) {
return nil, err
}
// 409 Conflict: resourceVersion changed between Get and Update; retry.
framework.Logf("Retrying APIService update after conflict: %v", err)
}
Member

TBH I am not sure this is needed. Resources should be cleaned up using the DeferCleanup function, and I think that if a test fails you get a new cluster from scratch, so no "old resources" should live there.
@adrianmoisey?

Contributor Author

Actually, I added it to fix an "already exists" error in CI. The tests were previously running in parallel; when I switched them to serial, the APIService issues disappeared. However, I kept the fix in place as a safety measure in case we hit the same issue again when adding new tests that run in parallel.
I agree we can remove it if all tests will run serially going forward.

Member

This is likely happening because ApiregistrationV1 is being created in every test. When the tests run in parallel, the resources may not be cleaned up before other tests finish, causing conflicts. Ideally, we should support parallel execution, but for now it’s fine to keep the tests running serially and improve this later.

@atilsensalduz (Contributor Author)

/retest

@antonio-mazzini left a comment

Well-structured expansion of the HPA external metrics test suite, @atilsensalduz! All three requested test scenarios (Multiple Metrics, Stabilization Window, Scaling Limits) are properly implemented. The filename typo fix (autosclaing -> autoscaling) is a nice cleanup too. A few suggestions:

1. Bound the retry loop in createAPIService (Medium)

for {
    existing, err := aggClient.ApiregistrationV1().APIServices().Get(...)
    // retries on 409 Conflict indefinitely
}

If another controller keeps modifying the APIService resource, this loop will hang forever.

Suggestion: Add a max retry count (e.g., 5 attempts) or use wait.PollUntilContextTimeout. At minimum, check ctx.Err() to respect context cancellation.

2. Tighten deadlineFor2 in the scaling limits test (Low)

deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay

The comment says "first scale is not rate-limited" but the deadline includes scaleLimitWindow (1 min). This overly generous deadline could hide a regression where the first scale IS incorrectly rate-limited.

Suggestion: deadlineFor2 := maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer

3. Cleanup logging vs failing on APIService delete (Low)

if err != nil && !apierrors.IsNotFound(err) {
    framework.Logf("Error deleting APIService: %v", err)
}

This logs instead of failing, which is inconsistent with other cleanup steps. Consider using framework.Failf or at least framework.ExpectNoError to catch cleanup issues.

4. Minor: descriptive HPA names per test (Nit)

All tests share a global hpaName constant. Using descriptive names per test (e.g., "hpa-multi-metric", "hpa-stabilization") would help when debugging CI failures.

LGTM with the retry loop fix (issue 1). Solid work!

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from cc62f37 to 0d34ae9 on March 6, 2026 13:46
@atilsensalduz (Contributor Author)

Thank you for the suggestion, @antonio-mazzini! I've updated the PR accordingly.

@k8s-ci-robot (Contributor)

@atilsensalduz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
pull-kubernetes-linter-hints | 0d34ae9 | false | /test pull-kubernetes-linter-hints
pull-kubernetes-verify | 0d34ae9 | true | /test pull-kubernetes-verify
pull-kubernetes-e2e-autoscaling-hpa-cpu-alpha-beta | 0d34ae9 | false | /test pull-kubernetes-e2e-autoscaling-hpa-cpu-alpha-beta

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



Labels

area/e2e-test-framework - Issues or PRs related to refactoring the kubernetes e2e test framework
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed - Indicates that a PR should not merge because it's missing one of the release note labels.
kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
sig/autoscaling - Categorizes an issue or PR as relevant to SIG Autoscaling.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Add additional test coverage for HPA External Metrics framework

5 participants