
Fix Kafka flaky tests #48785

Merged
belimawr merged 7 commits into elastic:main from rdner:fix-flaky-kafka-tests
Feb 26, 2026

Conversation

@rdner
Member

@rdner rdner commented Feb 10, 2026

Proposed commit message

  • Added delays and waits at certain steps to ensure the Kafka configuration is running properly before starting the test.
  • Add connection retries for the Kafka producer

Assisted by Cursor.

The Kafka tests are flaky and sometimes fail with:

=== FAIL: filebeat/input/kafka TestSASLAuthentication/PLAIN (0.93s)
kafka_integration_test.go:369: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes

For example, in this build https://buildkite.com/elastic/filebeat/builds/28161#019c482f-481b-40db-85e6-34da7056207f

Relates to #48026

@rdner rdner self-assigned this Feb 10, 2026
@rdner rdner added Filebeat Filebeat flaky-test Unstable or unreliable test cases. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Feb 10, 2026
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 10, 2026
@rdner rdner added needs_team Indicates that the issue/PR needs a Team:* label backport-active-all Automated backport with mergify to all the active branches labels Feb 10, 2026
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 10, 2026
* Added delays and waits at certain steps to ensure the Kafka
configuration is running properly before starting the test.
* Add connection retries for the Kafka producer

Assisted by Cursor.
@rdner rdner force-pushed the fix-flaky-kafka-tests branch from b7f8b79 to 722b13f Compare February 10, 2026 19:34
@rdner rdner marked this pull request as ready for review February 10, 2026 20:08
@rdner rdner requested a review from a team as a code owner February 10, 2026 20:08
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@rdner
Member Author

rdner commented Feb 10, 2026

Looks like it still fails:

=== FAIL: filebeat/input/kafka TestTest (1.03s)
kafka_integration_test.go:462: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes

I'm taking this back to draft and will investigate.

@rdner rdner marked this pull request as draft February 10, 2026 20:22
@rdner
Member Author

rdner commented Feb 10, 2026

@faec is it a requirement to have 3 partitions for Kafka? I cannot find any test that requires this:

https://github.com/elastic/beats/blame/b34256a57d1a1d441a2787bb55302434d4978c94/testing/environments/docker/kafka/run.sh#L37

@rdner
Member Author

rdner commented Feb 11, 2026

I'm going to run the tests on this PR multiple times to make sure it actually fixed the flaky behavior. After this confirmation the PR can be merged.

@rdner
Member Author

rdner commented Feb 11, 2026

Failed again with:

kafka_integration_test.go:463: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes

I'll investigate further but the easiest way to fix this would be switching to a single partition instead of 3.

@rdner rdner requested a review from faec February 11, 2026 12:31
Create the test topic explicitly and wait for partition leaders before producing messages, preventing transient CI failures with “no leader for this partition” during topic auto-creation.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor CLI, Model: GPT-5.3 Codex
@rdner rdner marked this pull request as ready for review February 25, 2026 13:59
@rdner
Member Author

rdner commented Feb 25, 2026

/test

@rdner
Member Author

rdner commented Feb 26, 2026

Looks like 3 consecutive CI runs didn't fail. Let's merge it and observe.
@belimawr if you share my confidence, please approve the PR 🙂

@belimawr
Contributor

> @belimawr if you share my confidence, please approve the PR 🙂

Yes I share the confidence!

Also, the test is already flaky: in the best case the flakiness is fixed, in the worst case it stays flaky and we learn that this was not the root cause. So merging it is a win-win.

@belimawr belimawr merged commit 0132a25 into elastic:main Feb 26, 2026
191 checks passed
@github-actions
Contributor

@Mergifyio backport 8.19 9.2 9.3

@mergify
Contributor

mergify bot commented Feb 26, 2026

backport 8.19 9.2 9.3

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 26, 2026
* Added delays and waits at certain steps to ensure the Kafka
configuration is running properly before starting the test.
* Add connection retries for the Kafka producer

Assisted by Cursor.

* Try to disable leader election

* Replace the attempts loop with the eventually call

* Make the helper more debuggable

* filebeat/input/kafka: stabilize TestSASLAuthentication topic setup

Create the test topic explicitly and wait for partition leaders before producing messages, preventing transient CI failures with “no leader for this partition” during topic auto-creation.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor CLI, Model: GPT-5.3 Codex

---------

Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
(cherry picked from commit 0132a25)
@rdner rdner deleted the fix-flaky-kafka-tests branch February 26, 2026 14:08
rdner added a commit that referenced this pull request Feb 26, 2026
* Added delays and waits at certain steps to ensure the Kafka
configuration is running properly before starting the test.
* Add connection retries for the Kafka producer
* Create the test topic explicitly and wait for partition leaders before producing messages, preventing transient CI failures with “no leader for this partition” during topic auto-creation.

Assisted by Cursor.

---------


(cherry picked from commit 0132a25)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
@belimawr
Contributor

belimawr commented Mar 2, 2026

It failed again in a backport PR to 9.2, even after the fix was merged.

Stack trace:

=== Failed
=== FAIL: filebeat/input/kafka TestTest (1.07s)
    kafka_integration_test.go:676:
        	Error Trace:	/opt/buildkite-agent/builds/bk-agent-prod-gcp-1772484810880252862/elastic/filebeat/filebeat/input/kafka/kafka_integration_test.go:676
        	            				/opt/buildkite-agent/builds/bk-agent-prod-gcp-1772484810880252862/elastic/filebeat/filebeat/input/kafka/kafka_integration_test.go:464
        	Error:      	Received unexpected error:
        	            	kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes
        	Test:       	TestTest

I'll create a flaky test issue for this.

@rdner
Member Author

rdner commented Mar 3, 2026

@belimawr thanks for creating the issue and investigating!

It might be an environmental problem too. Perhaps some buildkite runners have something different about them.

@belimawr
Contributor

belimawr commented Mar 3, 2026

> @belimawr thanks for creating the issue and investigating!
>
> It might be an environmental problem too. Perhaps some buildkite runners have something different about them.

I looked into it further, though I mostly delegated the analysis to AI. The error is the same, but the failure is in a different test; the root cause is the same.

We (AI and I) will fix it in all tests for this input.


Labels

backport-active-all Automated backport with mergify to all the active branches Filebeat Filebeat flaky-test Unstable or unreliable test cases. skip-changelog Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team


6 participants