
test(autoscaling): expand e2e coverage for HPA external metrics#137142

Open
atilsensalduz wants to merge 1 commit into kubernetes:master from atilsensalduz:test/hpa-external-metrics-coverage

Conversation

@atilsensalduz (Contributor)

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR expands the e2e test coverage for the HPA External Metrics framework introduced in #136799. Three new test cases are added to horizontal_pod_autosclaing_external_metrics.go, covering multi-metric scaling, downscale stabilization window enforcement, and scale-up rate limiting. A new test-framework helper, CreateMultiMetricExternalHPAWithBehavior, is introduced in autoscaling_utils.go to fill a gap in the existing API: it enables creating an HPA with both multiple MetricSpec entries and a custom HorizontalPodAutoscalerBehavior in a single call.

Which issue(s) this PR is related to:

Fixes #137132
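For context, an HPA that combines multiple external metrics with a custom behavior block (the shape the new helper is meant to produce in one call) might look roughly like this autoscaling/v2 manifest. This is an illustrative sketch, not the object built by the PR; the names, metric targets, and policy values are placeholders.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-external-example      # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer       # placeholder scale target
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: External
    external:
      metric:
        name: http_requests_total
      target:
        type: AverageValue
        averageValue: "100"
  - type: External
    external:
      metric:
        name: queue_messages_ready
      target:
        type: AverageValue
        averageValue: "50"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # downscale stabilization window
    scaleUp:
      policies:
      - type: Pods                      # scale-up rate limit: at most
        value: 1                        # one pod per periodSeconds
        periodSeconds: 60
```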

@k8s-ci-robot (Contributor)

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected. Please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added labels on Feb 19, 2026: size/L (Denotes a PR that changes 100-499 lines, ignoring generated files), kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt), do-not-merge/release-note-label-needed (Indicates that a PR should not merge because it's missing one of the release note labels), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one)
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot added labels on Feb 19, 2026: needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one), needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test)
@k8s-ci-robot (Contributor)

Hi @atilsensalduz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot added the needs-priority label (Indicates a PR lacks a `priority/foo` label and requires one) on Feb 19, 2026
@k8s-ci-robot added labels area/e2e-test-framework (Issues or PRs related to refactoring the kubernetes e2e test framework), area/test, sig/autoscaling (Categorizes an issue or PR as relevant to SIG Autoscaling), sig/testing (Categorizes an issue or PR as relevant to SIG Testing), and removed do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one) on Feb 19, 2026
@atilsensalduz (Contributor Author)

❯ ./_output/bin/ginkgo run \
  --label-filter="Feature:HPA" \
  --focus="Horizontal pod autoscaling \(external metrics\)" \
  --timeout=60m \
  ./_output/bin/e2e.test -- \
  --kubeconfig=$HOME/.kube/config \
  --provider=local

  I0219 21:29:26.448267   60445 e2e.go:109] Starting e2e run "72deda95-7fd2-4f9e-9e68-841a2b22686c" on Ginkgo node 1
Running Suite: Kubernetes e2e suite - ~/workspace/dev/kubernetes/_output/bin
===============================================================================================
Random Seed: 1771525765 - will randomize all specs

Will run 4 of 7435 specs

Ran 4 of 7435 Specs in 417.481 seconds
SUCCESS! -- 4 Passed | 0 Failed | 0 Pending | 7431 Skipped
PASS

@adrianmoisey (Member)

Thanks for this!
/ok-to-test

@k8s-ci-robot added ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) and removed needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) labels on Feb 19, 2026
@atilsensalduz (Contributor Author)

/retest

@@ -83,4 +84,135 @@ var _ = SIGDescribe(feature.HPA, "Horizontal pod autoscaling (external metrics)"
rc.WaitForReplicas(ctx, int(*hpa.Spec.MinReplicas), maxResourceConsumerDelay+waitBuffer)
ginkgo.DeferCleanup(e2eautoscaling.DeleteHorizontalPodAutoscaler, rc, hpa.Name)
})

ginkgo.It("should scale based on multiple external metrics", func(ctx context.Context) {
Member

I would expect a test with multiple external metrics, where the HPA scales based on the maximum value among them. In this test you can set one metric to the value of 0 and another has a higher value then the HPA should scale according to the maximum metric.

Contributor Author

Thank you for the review and recommendation, @omerap12. I've updated the multi-metric test and added a new test case that uses the average value metric type.
The existing multi-metric test verifies that setting one metric to 0 while another stays high prevents scale-down, as the HPA selects the maximum across all metrics.
The new test extends this validation: after scaling to max, dropping the dominant metric causes the HPA to scale down only to the level required by the remaining metric, not to the minimum. This demonstrates that MAX selection holds at intermediate replica counts; the average value metric type enables precise control over those intermediate counts during testing.
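The MAX-across-metrics selection and the AverageValue formula described above can be sketched as a small standalone calculation. This is an illustrative model of the HPA rule, not code from the PR; the metric names, targets, and replica bounds are placeholders.

```go
package main

import (
	"fmt"
	"math"
)

// externalMetric pairs an observed external metric value with its
// AverageValue target. Names and values are illustrative.
type externalMetric struct {
	name   string
	value  float64 // current metric value
	target float64 // target average value per replica
}

// desiredReplicas mimics the rule exercised by the tests: each metric
// proposes ceil(value/target) replicas, the HPA takes the maximum
// proposal, and the result is clamped to [minReplicas, maxReplicas].
func desiredReplicas(metrics []externalMetric, minReplicas, maxReplicas int) int {
	desired := minReplicas
	for _, m := range metrics {
		proposal := int(math.Ceil(m.value / m.target))
		if proposal > desired {
			desired = proposal
		}
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	metrics := []externalMetric{
		{name: "http_requests_total", value: 0, target: 100},   // proposes 0, floored to min
		{name: "queue_messages_ready", value: 100, target: 50}, // proposes ceil(100/50) = 2
	}
	// MAX(1, 2) = 2: the HPA scales down only as far as the remaining
	// metric allows, not to minReplicas.
	fmt.Println(desiredReplicas(metrics, 1, 5))
}
```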

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 2 times, most recently from c853200 to b47e163 on February 21, 2026 16:48
@k8s-ci-robot added the do-not-merge/contains-merge-commits label (Indicates a PR which contains merge commits) on Feb 21, 2026
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 7a62ee0 to 46cfa73 on February 21, 2026 17:11
@k8s-ci-robot removed the do-not-merge/contains-merge-commits label on Feb 21, 2026
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 46cfa73 to 6916cd8 on February 21, 2026 18:32
@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 3 times, most recently from 7fa7ddb to d0eb960 on February 22, 2026 02:46
@atilsensalduz (Contributor Author)

/retest

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch 8 times, most recently from a9bc82c to 5ba7d4d on February 23, 2026 07:58
@atilsensalduz (Contributor Author)

/retest

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 5ba7d4d to 4f5159c on February 23, 2026 10:26
@atilsensalduz (Contributor Author)

Hi @omerap12 , I've completed the PR. While working on it, I noticed the external metrics test filename had a typo, so I fixed that as well. Could you please review when you have a chance?

@omerap12 (Member) left a comment

Some comments from me

}
})

ginkgo.It("should scale up and down based on external metric value", framework.WithSlow(), framework.WithSerial(), func(ctx context.Context) {
Member

Why do we need WithSlow?

Contributor Author

I initially added the slow labeling while investigating test failures in CI, but since these tests only take 2-4 minutes, I've now removed it. Thanks for catching that!

Comment on lines +105 to +128
behavior := e2eautoscaling.HPABehaviorWithStabilizationWindows(0, 0)
hpa := e2eautoscaling.CreateMultiMetricHPAWithBehavior(ctx, rc, metrics, int32(initPods), 5, behavior)
ginkgo.DeferCleanup(e2eautoscaling.DeleteHorizontalPodAutoscaler, rc, hpa.Name)

ginkgo.By("Waiting for HPA to scale up to max replicas")
rc.WaitForReplicas(ctx, int(hpa.Spec.MaxReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)

ginkgo.By("Setting queue_messages_ready to 0 while keeping http_requests_total high")
// queue_messages_ready=0 -> calculation yields 0, floored to minReplicas=1
// http_requests_total=500 -> wants 5 replicas
// HPA takes MAX(1, 5) = 5 -> should NOT scale down
err := metricsController.SetMetricValue(ctx, "queue_messages_ready", 0, nil)
framework.ExpectNoError(err)

ginkgo.By("Verifying HPA does not scale down while http_requests_total still requires max replicas")
rc.EnsureDesiredReplicasInRange(ctx, int(hpa.Spec.MaxReplicas), int(hpa.Spec.MaxReplicas),
2*hpaReconciliationInterval+waitBuffer, hpa.Name)

ginkgo.By("Setting http_requests_total to 0 to trigger scale down")
err = metricsController.SetMetricValue(ctx, "http_requests_total", 0, nil)
framework.ExpectNoError(err)

ginkgo.By("Waiting for HPA to scale down to min replicas")
rc.WaitForReplicas(ctx, int(*hpa.Spec.MinReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)
Member

This can be simplified by creating one metric with value 0 and another with a value > 0, then validating that the HPA scales the workload up to max replicas.
I don't see the point in waiting for the HPA to scale up to max replicas and then waiting again to verify it does not scale down.

Contributor Author

Updated accordingly. You're right, that's much simpler.

Comment on lines +113 to +115
// queue_messages_ready=0 -> calculation yields 0, floored to minReplicas=1
// http_requests_total=500 -> wants 5 replicas
// HPA takes MAX(1, 5) = 5 -> should NOT scale down
Member

This is not accurate; the HPA should be scaled to max replicas in this test.

Comment on lines +144 to +146
// AverageValueMetricType formula: desiredReplicas = ceil(metricValue / target), independent of currentReplicas.
// This is required here because ValueMetricType computes ceil(currentReplicas * metricValue/target),
// which would keep driving scale-up as replica count grows, preventing stable intermediate values.
Member

This is unnecessary.

Suggested change
// AverageValueMetricType formula: desiredReplicas = ceil(metricValue / target), independent of currentReplicas.
// This is required here because ValueMetricType computes ceil(currentReplicas * metricValue/target),
// which would keep driving scale-up as replica count grows, preventing stable intermediate values.

Comment on lines +159 to +160
ginkgo.By("Waiting for HPA to scale up to max replicas (http_requests_total dominates)")
rc.WaitForReplicas(ctx, int(hpa.Spec.MaxReplicas), maxHPAReactionTime+maxResourceConsumerDelay+waitBuffer)
Member

Why? If we are using AverageValue, we can pick different values that cause the HPA to scale out but not all the way to max replicas.

Contributor Author

Actually, I'm thinking of adding a dedicated test for the average value metric type. The idea is to make it distinct from the other tests so we can specifically validate how HPA handles intermediate replica counts when metrics change.

Comment on lines +163 to +166
// http_requests_total=0 -> calculation yields 0, floored to minReplicas=1
// queue_messages_ready=100 -> wants ceil(100/50)=2 replicas
// HPA takes MAX(1, 2) = 2 -> should scale down to 2, not to min
err := metricsController.SetMetricValue(ctx, "http_requests_total", 0, nil)
Member

Let's use a different non-zero value, so the comparison is between two real values rather than 0 versus another value.

Comment on lines +250 to +253
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer
Member

Suggested change
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer
deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay
// deadlineFor3: second scale must wait the full rate-limit window, plus HPA reaction and
// resource consumer delay.
deadlineFor3 := deadlineFor2 + waitBuffer

// Delete the APIService
config, err := framework.LoadConfig()
if err == nil {
if err != nil {
Member

I am not sure that this logic makes sense, how about:

func CleanupExternalMetricsServer(ctx context.Context, c clientset.Interface, ns, name string) {
	ginkgo.By(fmt.Sprintf("Cleaning up external metrics server %s", name))

	// Delete the APIService
	config, err := framework.LoadConfig()
	if err != nil {
		framework.Failf("%s", err)
	} 
	aggClient, err := aggregatorclient.NewForConfig(config)
	if err != nil {
		framework.Failf("could not create aggregator client: %v", err)
	}
	
	err = aggClient.ApiregistrationV1().APIServices().Delete(ctx, "v1beta1.external.metrics.k8s.io", metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		framework.Logf("Error deleting APIService: %v", err)
	}
	
	// Delete the deployment
	err = c.AppsV1().Deployments(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil {
		framework.Logf("Error deleting deployment: %v", err)
	}

	// Delete the service
	err = c.CoreV1().Services(ns).Delete(ctx, name, metav1.DeleteOptions{})
	if err != nil {
		framework.Logf("Error deleting service: %v", err)
	}
}

Contributor Author

Good point, thank you @omerap12. I updated it to fail fast instead of silently skipping the APIService deletion.

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 4f5159c to 6ef047e on February 23, 2026 18:20
@atilsensalduz (Contributor Author)

/retest

@omerap12 (Member) left a comment

More comments from me; overall LGTM. Thanks for taking care of this :)
/assign @adrianmoisey

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atilsensalduz
Once this PR has been reviewed and has the lgtm label, please ask for approval from adrianmoisey. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from 784fdf2 to cc62f37 on February 24, 2026 11:40
Comment on lines +810 to +835
// Try to create the APIService. If it already exists (left over from a previous
// run that failed without cleanup), update it so the Service reference points to the
// current test's namespace.
created, err := aggClient.ApiregistrationV1().APIServices().Create(ctx, apiService, metav1.CreateOptions{})
if err == nil {
return created, nil
}
if !apierrors.IsAlreadyExists(err) {
return nil, err
}
for {
existing, err := aggClient.ApiregistrationV1().APIServices().Get(ctx, apiService.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
apiService.ResourceVersion = existing.ResourceVersion
updated, err := aggClient.ApiregistrationV1().APIServices().Update(ctx, apiService, metav1.UpdateOptions{})
if err == nil {
return updated, nil
}
if !apierrors.IsConflict(err) {
return nil, err
}
// 409 Conflict: resourceVersion changed between Get and Update; retry.
framework.Logf("Retrying APIService update after conflict: %v", err)
}
Member

TBH I am not sure this is needed. Resources should be cleaned up using the DeferCleanup function, and I think that if a test fails you get a new cluster from scratch, so no "old resources" should live there.
@adrianmoisey?

Contributor Author

Actually, I added it to fix an "already exists" error in CI. The tests were previously running in parallel; when I switched them to serial, the APIService issues disappeared. However, I kept the fix in place as a safety measure in case we hit the same issue again when adding new tests that run in parallel.
I agree we can remove it if all tests will run serially going forward.

Member

This is likely happening because ApiregistrationV1 is being created in every test. When the tests run in parallel, the resources may not be cleaned up before other tests finish, causing conflicts. Ideally, we should support parallel execution, but for now it’s fine to keep the tests running serially and improve this later.

@atilsensalduz (Contributor Author)

/retest

@antonio-mazzini left a comment

Well-structured expansion of the HPA external metrics test suite, @atilsensalduz! All three requested test scenarios (Multiple Metrics, Stabilization Window, Scaling Limits) are properly implemented. The filename typo fix (autosclaing -> autoscaling) is a nice cleanup too. A few suggestions:

1. Bound the retry loop in createAPIService (Medium)

for {
    existing, err := aggClient.ApiregistrationV1().APIServices().Get(...)
    // retries on 409 Conflict indefinitely
}

If another controller keeps modifying the APIService resource, this loop will hang forever.

Suggestion: Add a max retry count (e.g., 5 attempts) or use wait.PollUntilContextTimeout. At minimum, check ctx.Err() to respect context cancellation.

2. Tighten deadlineFor2 in the scaling limits test (Low)

deadlineFor2 := scaleLimitWindow + maxHPAReactionTime + maxResourceConsumerDelay

The comment says "first scale is not rate-limited" but the deadline includes scaleLimitWindow (1 min). This overly generous deadline could hide a regression where the first scale IS incorrectly rate-limited.

Suggestion: deadlineFor2 := maxHPAReactionTime + maxResourceConsumerDelay + waitBuffer

3. Cleanup logging vs failing on APIService delete (Low)

if err != nil && !apierrors.IsNotFound(err) {
    framework.Logf("Error deleting APIService: %v", err)
}

This logs instead of failing, which is inconsistent with other cleanup steps. Consider using framework.Failf or at least framework.ExpectNoError to catch cleanup issues.

4. Minor: descriptive HPA names per test (Nit)

All tests share a global hpaName constant. Using descriptive names per test (e.g., "hpa-multi-metric", "hpa-stabilization") would help when debugging CI failures.

LGTM with the retry loop fix (issue 1). Solid work!

@atilsensalduz force-pushed the test/hpa-external-metrics-coverage branch from cc62f37 to 0d34ae9 on March 6, 2026 13:46
@atilsensalduz (Contributor Author)

Thank you for the suggestion, @antonio-mazzini! I've updated the PR accordingly.

@k8s-ci-robot (Contributor)

@atilsensalduz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
pull-kubernetes-linter-hints | 0d34ae9 | false | /test pull-kubernetes-linter-hints
pull-kubernetes-verify | 0d34ae9 | true | /test pull-kubernetes-verify
pull-kubernetes-e2e-autoscaling-hpa-cpu-alpha-beta | 0d34ae9 | false | /test pull-kubernetes-e2e-autoscaling-hpa-cpu-alpha-beta

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.



Labels

area/e2e-test-framework - Issues or PRs related to refactoring the kubernetes e2e test framework
area/test
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed - Indicates that a PR should not merge because it's missing one of the release note labels.
kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
sig/autoscaling - Categorizes an issue or PR as relevant to SIG Autoscaling.
sig/testing - Categorizes an issue or PR as relevant to SIG Testing.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Add additional test coverage for HPA External Metrics framework

5 participants