Kafka e2e: Bump Strimzi/Kafka for Kube 1.33, fix offset test flakes by jkyros · Pull Request #6929 · kedacore/keda

jkyros · 2025-07-25T02:22:50Z

Running our Kafka tests against kube 1.33, the version of Strimzi we're pegged to doesn't like it:

2025-07-16 22:50:17 ERROR PlatformFeaturesAvailability:141 - Detection of Kubernetes version failed.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.restCall(OperationSupport.java:711) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.impl.BaseClient.getVersionInfo(BaseClient.java:298) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.impl.KubernetesClientImpl.getKubernetesVersion(KubernetesClientImpl.java:639) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.lambda$getVersionInfoFromKubernetes$4(PlatformFeaturesAvailability.java:139) ~[io.strimzi.cluster-operator-0.38.0.jar:0.38.0]
	at io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:180) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
	at io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:277) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
	at io.vertx.core.impl.ContextBase.lambda$internalExecuteBlocking$2(ContextBase.java:199) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
	at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76) ~[io.vertx.vertx-core-4.4.6.jar:4.4.6]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.utils.KubernetesSerialization.unmarshal(KubernetesSerialization.java:258) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:551) ~[io.fabric8.kubernetes-client-6.9.0.jar:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
	at io.fabric8.kubernetes.client.utils.AsyncUtils.lambda$retryWithExponentialBackoff$3(AsyncUtils.java:90) ~[io.fabric8.kubernetes-client-api-6.9.0.jar:?]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
	at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:614) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:844) ~[?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482) ~[?:?]
	... 1 more
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "emulationMajor" (class io.fabric8.kubernetes.client.VersionInfo), not marked as ignorable (9 known properties: "goVersion", "gitTreeState", "platform", "minor", "gitVersion", "gitCommit", "buildDate", "compiler", "major"])
 at [Source: (BufferedInputStream); line: 4, column: 22] (through reference chain: io.fabric8.kubernetes.client.VersionInfo["emulationMajor"])

Strimzi fixed it in 0.46.0: strimzi/strimzi-kafka-operator#11456 (comment), but they aren't backporting, so we need to upgrade to a new version.
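
The root cause in the trace above is strict deserialization: the client's `VersionInfo` type only knows nine fields, so the new `emulationMajor` field is rejected rather than ignored. As a rough Go analogue (this is not Strimzi's or fabric8's code; the `versionInfo` struct, the `decodeVersion` helper, and the sample payload are made up for illustration), strict decoding with `encoding/json` fails in the same way, while lenient decoding simply skips the extra field:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// versionInfo is a hypothetical, trimmed-down stand-in for a client-side
// version type that predates the emulationMajor field.
type versionInfo struct {
	Major      string `json:"major"`
	Minor      string `json:"minor"`
	GitVersion string `json:"gitVersion"`
}

// decodeVersion parses a version payload. With strict=true, unknown fields
// are treated as errors, which mirrors the behavior seen in the stack trace.
func decodeVersion(payload []byte, strict bool) (versionInfo, error) {
	var v versionInfo
	dec := json.NewDecoder(bytes.NewReader(payload))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&v)
	return v, err
}

func main() {
	// Illustrative payload only; a real /version response has more fields.
	payload := []byte(`{"major":"1","minor":"33","emulationMajor":"1","gitVersion":"v1.33.0"}`)

	_, err := decodeVersion(payload, true)
	fmt.Println("strict decode error:", err)

	v, err := decodeVersion(payload, false)
	fmt.Println("lenient decode:", v.Major+"."+v.Minor, "error:", err)
}
```

The lenient path is the reason newer clients keep working when the API server adds fields.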

This:

- Adds code to wait for the Strimzi deployment to become available on setup (otherwise we don't notice it breaks until we get to the Kafka part of the suite)
- Bumps Strimzi version for tests to 0.46.0
- Bumps Kafka version for tests to 4.0.0 (3.4.0 is too old for Strimzi)
- Configures Kafka for KRaft since Zookeeper has apparently been deprecated
- Adds a Kafka NodePool since KRaft config requires it
- Disables topic finalization for the Kafka deployments so topics don't block namespace deletion - that was a fun one 😄

Also:

- Moves the Kafka offset tests to their own consumer groups -- depending on timing, the earlier tests seem to pollute the later ones because they shared a consumer group. That didn't seem to be desired behavior. It's not a regression from the version bump; these tests have always been flaky.

Checklist

~~[ ] When introducing a new scaler, I agree with the scaling governance policy~~
[x] I have verified that my change is according to the deprecations & breaking changes policy
~~[ ] Tests have been added~~
~~[ ] Changelog has been updated and is aligned with our changelog requirements~~
[ ] A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified)
[ ] A PR is opened to update the documentation on (repo) (if applicable)
[x] Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #

Relates to #

joelsmith · 2025-07-28T15:40:30Z

Wow, that kafka test! I'm so sorry about that! But thanks for running it to ground and fixing it!

/lgtm

dttung2905

Thanks a lot for the fix. I agree it is high time we upgraded the Kafka test suites, considering Kafka 4.0 with KRaft support is out.

dttung2905 · 2025-07-28T22:04:03Z

tests/helper/helper.go

 	StringTrue  = "true"

-	StrimziVersion   = "0.35.0"
+	StrimziVersion   = "0.46.0"


I see that the latest is 0.47 now. Should we bump to 0.47, or is there another reason you specifically chose 0.46?
https://github.com/strimzi/strimzi-kafka-operator/releases

I went to 0.46.0 because that was the first version where it would work again. We sat on 0.35.0 for a while and that was from like...June of 2023? 😄

I can bump this to 0.47.0 if you want to go to 0.47.0, I was going for least change.

There is another PR updating the Strimzi version. So please go to 0.47.0 :)

#6928

Thanks! Bumped this to 0.47.0. I was like "how are they passing e2e over in that other one without the rest of the fixes" but it looks like they only ran cron?

Sorry, that was my mistake. That should indeed have been Kafka.

Kube 1.33 added an emulationMajor field to the version API, which breaks version parsing in Strimzi older than 0.46.0 and causes our Strimzi setup to fail to start on kube >= 1.33. Additionally, the failure mode for this was silent, as all we currently test for is whether the helm chart for Strimzi was successfully applied.

To rectify this, this does the following to the kafka scaler tests:

- Waits for the Strimzi deployment to become available on setup
- Bumps Strimzi version for tests to 0.47.0
- Bumps Kafka version for tests to 4.0.0 (3.4.0 is too old for Strimzi)
- Configures Kafka for KRaft since Zookeeper has been deprecated
- Disables topic finalization so topics don't block namespace deletion

Signed-off-by: John Kyros <jkyros@redhat.com>

The Kafka offset tests flake in some environments if you move through the test cases too fast -- the state from the consumer group in the previous test seems to leak through to the next one because they are sharing a consumer group, and thus will share offsets.

This fixes these flakes by moving each of these tests to its own consumer group to prevent this test pollution. They will still share the same topic, which is fine, since the offset is consumer-group specific.

Signed-off-by: John Kyros <jkyros@redhat.com>
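
To make the pollution concrete, here is a toy, single-partition model of committed offsets. It is an illustration only, not Kafka or KEDA code: `fakeBroker`, the group names, and the topic name are hypothetical, and a group with no committed offset is assumed to start from offset 0. Because committed offsets are keyed by (group, topic), a test that reuses the previous test's group inherits its position, while a test with its own group does not:

```go
package main

import "fmt"

// groupTopic identifies one consumer group's position on one topic.
type groupTopic struct{ group, topic string }

// fakeBroker is a toy model: one partition per topic, offsets only.
type fakeBroker struct {
	endOffset map[string]int64     // next offset to be written, per topic
	committed map[groupTopic]int64 // committed offset, per (group, topic)
}

func newFakeBroker() *fakeBroker {
	return &fakeBroker{
		endOffset: map[string]int64{},
		committed: map[groupTopic]int64{},
	}
}

// produce appends n messages to the topic.
func (b *fakeBroker) produce(topic string, n int64) { b.endOffset[topic] += n }

// commit records that group has processed everything before offset on topic.
func (b *fakeBroker) commit(group, topic string, offset int64) {
	b.committed[groupTopic{group, topic}] = offset
}

// lag is end offset minus committed offset, the quantity an offset-based
// check would look at. Unknown groups are assumed to start at offset 0.
func (b *fakeBroker) lag(group, topic string) int64 {
	return b.endOffset[topic] - b.committed[groupTopic{group, topic}]
}

func main() {
	b := newFakeBroker()
	const topic = "kafka-topic"

	// "Test 1" produces 5 messages and commits offset 5 for its group.
	b.produce(topic, 5)
	b.commit("shared-group", topic, 5)

	// "Test 2" produces 3 more messages on the same topic.
	b.produce(topic, 3)

	fmt.Println("test 2 reusing shared-group sees lag:", b.lag("shared-group", topic))
	fmt.Println("test 2 with its own group sees lag:", b.lag("test2-group", topic))
}
```

Sharing the topic is harmless in this model because nothing about a group's position is stored on the topic itself.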

jkyros · 2025-08-01T21:52:37Z

Rebased and bumped strimzi to 0.47.0 (latest).

SpiritZhou · 2025-08-04T01:56:57Z

/run-e2e kafka
Update: You can check the progress here

jkyros · 2025-08-04T06:26:33Z

I did run these locally before I PR'd it, I promise 😄. In the new CRD definition zookeeper isn't required. So hmm...somehow...are we getting the old one during the test run?

 helper.go:575: Applying template: kafkaClusterTemplate
    helper.go:589: 
        	Error Trace:	/__w/keda/keda/tests/helper/helper.go:589
        	            				/__w/keda/keda/tests/scalers/kafka/kafka_test.go:784
        	            				/__w/keda/keda/tests/scalers/kafka/kafka_test.go:46***
        	Error:      	Received unexpected error:
        	            	The Kafka "kafka-test-kafka" is invalid: spec.zookeeper: Required value
        	Test:       	TestScaler
        	Messages:   	cannot apply file - The Kafka "kafka-test-kafka" is invalid: spec.zookeeper: Required value
    kafka_test.go:786: 
        	Error Trace:	/__w/keda/keda/tests/scalers/kafka/kafka_test.go:786
        	            				/__w/keda/keda/tests/scalers/kafka/kafka_test.go:46***
        	Error:      	Received unexpected error:
        	            	Error from server (NotFound): kafkas.kafka.strimzi.io "kafka-test-kafka" not found
        	Test:       	TestScaler
        	Messages:   	cannot execute command - Error from server (NotFound): kafkas.kafka.strimzi.io "kafka-test-kafka" not found
    helper.go:644: Deleting template: kafkaClientTemplate
    helper.go:***60: deleting namespace kafka-test-ns
    helper.go:***4: waiting for namespace kafka-test-ns deletion
    helper.go:***4: waiting for namespace kafka-test-ns deletion
    helper.go:***4: waiting for namespace kafka-test-ns deletion

Strimzi installs its CRDs in the cluster when it does a helm install during the e2e test run, but helm is a big chicken and won't overwrite or remove any CRDs during cleanup/reinstall, so we're stuck with the first versions helm installed unless something explicitly removes them. That wouldn't be a problem if we were grabbing a fresh cluster every time, but we're not. We just scale up an existing one and create some testing namespaces, so those old CRDs conflict with the newer versions of Strimzi we're trying to move to. This just adds strimzi CRD cleanup to the e2e cleanup script so they get removed at the end of a test run, so the next test run can install the proper ones. Signed-off-by: John Kyros <jkyros@redhat.com>
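
The cleanup described here amounts to selecting the CRDs that belong to Strimzi and deleting them explicitly. The sketch below shows only the selection logic, under stated assumptions: it is not the cleanup code from this PR, `strimziCRDs` and `deleteArgs` are hypothetical helpers, the CRD list is hard-coded sample data, and matching on the `.strimzi.io` suffix is an assumed heuristic. It only prints the `kubectl` command rather than running it:

```go
package main

import (
	"fmt"
	"strings"
)

// strimziCRDs returns the CRD names that belong to a strimzi.io API group.
func strimziCRDs(all []string) []string {
	var out []string
	for _, name := range all {
		if strings.HasSuffix(name, ".strimzi.io") {
			out = append(out, name)
		}
	}
	return out
}

// deleteArgs builds kubectl arguments that delete the given CRDs and do not
// fail if one of them is already gone.
func deleteArgs(crds []string) []string {
	return append([]string{"delete", "crd", "--ignore-not-found"}, crds...)
}

func main() {
	// Sample listing; in a real run this would come from the cluster.
	all := []string{
		"kafkas.kafka.strimzi.io",
		"kafkatopics.kafka.strimzi.io",
		"scaledobjects.keda.sh",
		"strimzipodsets.core.strimzi.io",
	}

	fmt.Println("kubectl", strings.Join(deleteArgs(strimziCRDs(all)), " "))
}
```

Deleting a CRD also deletes every custom resource of that kind, so this belongs at the end of a run, after the test namespaces are gone.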

jkyros · 2025-08-12T05:23:41Z

So it looks like:

- The cluster that we run the tests in gets re-used (I did not know that, today I learned!)
- We never clean those strimzi CRDs out of that cluster -- we just scale it up/down and delete the test namespaces and keda stuff (keda/.github/workflows/pr-e2e.yml, line 226 in adbda79: `run: make scale-node-pool`)
- Helm won't replace or remove the Strimzi CRDs once they're there.

So we're stuck with those old CRDs unless we delete them. I added a "clean up Strimzi CRDs" section to the e2e-cleanup, but I'm having "race condition" feelings. Like if two concurrent runs of kafka e2e happen on separate PRs, the first one will probably blow the CRDs out from under the second one.

It looks like we already have most of that risk, given that we install Strimzi to just the strimzi namespace, so the tests can already step on each other, but I want to make sure I'm not misunderstanding the intent and the risks we've already accepted here 😄

JorTurFer · 2025-08-17T17:47:15Z

/run-e2e kafka
Update: You can check the progress here

JorTurFer · 2025-08-17T17:49:10Z

I've deleted all the Kafka CRDs from both clusters and triggered the e2e test again. In any case, old CRDs can be reinstalled by other e2e tests. Currently, @zroubalik, @wozniakjan and I have access to the cluster, so just ping us if they have to be deleted again.

…edacore#6929)

* Update strimzi to 0.47.0 for Kube 1.33, fix setup

Kube 1.33 added an emulationMajor field to the version API, which breaks version parsing in Strimzi older than 0.46.0 and causes our Strimzi setup to fail to start on kube >= 1.33. Additionally, the failure mode for this was silent, as all we currently test for is whether the helm chart for Strimzi was successfully applied.

To rectify this, this does the following to the kafka scaler tests:

- Waits for the Strimzi deployment to become available on setup
- Bumps Strimzi version for tests to 0.47.0
- Bumps Kafka version for tests to 4.0.0 (3.4.0 is too old for Strimzi)
- Configures Kafka for KRaft since Zookeeper has been deprecated
- Disables topic finalization so topics don't block namespace deletion

Signed-off-by: John Kyros <jkyros@redhat.com>

* Move Kafka offset tests to own consumer group

The Kafka offset tests flake in some environments if you move through the test cases too fast -- the state from the consumer group in the previous test seems to leak through to the next one because they are sharing a consumer group, and thus will share offsets.

This fixes these flakes by moving each of these tests to its own consumer group to prevent this test pollution. They will still share the same topic, which is fine, since the offset is consumer-group specific.

Signed-off-by: John Kyros <jkyros@redhat.com>

* Delete strimzi CRDs during e2e test cleanup

Strimzi installs its CRDs in the cluster when it does a helm install during the e2e test run, but helm is a big chicken and won't overwrite or remove any CRDs during cleanup/reinstall, so we're stuck with the first versions helm installed unless something explicitly removes them. That wouldn't be a problem if we were grabbing a fresh cluster every time, but we're not. We just scale up an existing one and create some testing namespaces, so those old CRDs conflict with the newer versions of Strimzi we're trying to move to.

This just adds strimzi CRD cleanup to the e2e cleanup script so they get removed at the end of a test run, so the next test run can install the proper ones.

Signed-off-by: John Kyros <jkyros@redhat.com>

---------

Signed-off-by: John Kyros <jkyros@redhat.com>


jkyros force-pushed the fix-kafka-tests-kube-1.33 branch from f6b071c to 0c36265 Compare July 25, 2025 02:44

jkyros mentioned this pull request Jul 25, 2025

AUTOSCALE-294: CMA 2.17.2 Rebase openshift/kedacore-keda#46

Merged

dttung2905 reviewed Jul 28, 2025


jkyros added 2 commits August 1, 2025 14:48

jkyros force-pushed the fix-kafka-tests-kube-1.33 branch from 0c36265 to 8b4dfe4 Compare August 1, 2025 19:48

JorTurFer mentioned this pull request Aug 17, 2025

Update Strimzi version #6928

Closed

JorTurFer approved these changes Aug 17, 2025


rickbrouwer approved these changes Aug 18, 2025


JorTurFer requested review from a team, SpiritZhou and dttung2905 August 18, 2025 07:58

JorTurFer merged commit 794fe80 into kedacore:main Aug 18, 2025
23 checks passed

rickbrouwer mentioned this pull request Aug 18, 2025

Need Strimzi in the e2e helper file an version update? #6918

Closed

wozniakjan mentioned this pull request Sep 24, 2025

add DeliverGetRate, PublishedToDeliveredRatio and ExpectedQueueConsumptionTime trigger modes to RabbitMQ scaler #6933

Merged

7 tasks
