KAFKA-17463: Fixing test share groups test by apoorvmittal10 · Pull Request #17645 · apache/kafka

apoorvmittal10 · 2024-10-31T00:45:42Z

Fixing the test by adding retry in Persister for NOT_COORDINATOR error.

Do we need a retry in Persister for the NOT_COORDINATOR error as well? The error is retriable. I have added that check in this PR, and it fixes this test; without it we receive the exception below:

[2024-10-31 00:48:31,821] ERROR Unable to perform read state RPC: This is not the correct coordinator. (org.apache.kafka.server.share.persister.PersisterStateManager$ReadStateHandler:694)
[2024-10-31 00:48:31,849] ERROR Unable to perform read state RPC: This is not the correct coordinator. (org.apache.kafka.server.share.persister.PersisterStateManager$ReadStateHandler:694)

However, when running the test multiple times I notice that the error above can still occur occasionally, and the test passes anyway. This is because the test does not wait for a successful response from SharePartition initialization, i.e. it never consumes from the share partition. Hence, do we also need to fix something in the test itself?

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

chia7712 · 2024-10-31T10:55:39Z

                            // check retryable errors
                            case COORDINATOR_NOT_AVAILABLE:
                            case COORDINATOR_LOAD_IN_PROGRESS:
+                            case NOT_COORDINATOR:


Could you please adjust the behavior of testReadStateRequestFailButCoordinatorFoundSuccessfully to match this change?

AndrewJSchofield · 2024-10-31T10:59:23Z

I remain unconvinced by this PR. There is certainly a problem somewhere in this area, but I was unable to reproduce it yesterday in spite of trying. I'll try again today. But I want to understand WHY the problem is occurring before approving a change to fix it.

apoorvmittal10 · 2024-10-31T14:45:43Z

> I remain unconvinced by this PR. There is certainly a problem somewhere in this area, but I was unable to reproduce it yesterday in spite of trying. I'll try again today. But I want to understand WHY the problem is occurring before approving a change to fix it.

Previously we never removed the cached SharePartition in the manager despite receiving a NOT_COORDINATOR error during initialization. testShareGroups used to pass because that test validates group-related data, not consumption.

The response from the read state persister RPC is flaky: sometimes a successful response arrives, sometimes it does not. The check in this PR adds retries on NOT_COORDINATOR.
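To make the retry check concrete, here is a minimal, self-contained sketch of the classification idea. It is not the PersisterStateManager code: `RetryClassifier`, its nested `ErrorCode` enum, and `shouldRetry` are hypothetical names used only for illustration, and the enum is a stand-in rather than Kafka's real error type. The three retriable codes are the ones shown in the diff above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: decides whether a persister RPC error should be retried.
final class RetryClassifier {

    // Stand-in for broker error codes; NOT Kafka's real error enum.
    enum ErrorCode {
        NONE,
        COORDINATOR_NOT_AVAILABLE,
        COORDINATOR_LOAD_IN_PROGRESS,
        NOT_COORDINATOR,
        UNKNOWN_SERVER_ERROR
    }

    // The codes treated as retriable in the diff, including the newly added NOT_COORDINATOR.
    private static final Set<ErrorCode> RETRIABLE = EnumSet.of(
        ErrorCode.COORDINATOR_NOT_AVAILABLE,
        ErrorCode.COORDINATOR_LOAD_IN_PROGRESS,
        ErrorCode.NOT_COORDINATOR);

    static boolean shouldRetry(ErrorCode code) {
        return RETRIABLE.contains(code);
    }

    public static void main(String[] args) {
        for (ErrorCode code : ErrorCode.values()) {
            System.out.println(code + " -> retry=" + shouldRetry(code));
        }
    }
}
```

Keeping the retriable codes in one set means the read state and write state handlers could share the same decision instead of each maintaining its own list of cases.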

AndrewJSchofield · 2024-10-31T15:25:51Z

It appears to me that this NOT_COORDINATOR is coming from CoordinatorRuntime.contextOrThrow. I don't yet know whether it's an acceptable thing to just add it to the list of retriable errors, or whether it is necessary to track down why the context is not present for the share-group state partition which is being looked for. Is it asynchronous startup, or is something more serious wrong?

AndrewJSchofield · 2024-10-31T16:50:08Z

I think this PR is on the wrong track. The problem seems to be a mismatch between two pieces of code using the SharePartitionKey. There are two different string representations of this class, and both appear to be used when dealing with the share coordinator. As a result, requests are going to the wrong coordinator.
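As a rough illustration of why two string representations of the same key can route requests to different coordinators, consider the sketch below. Everything in it is an assumption made for the example: the two string formats, the hash-modulo mapping in `partitionFor`, and the partition count of 50 are hypothetical and are not taken from the Kafka code.

```java
// Hypothetical sketch: the same logical key, rendered two ways, may hash to
// different state-topic partitions and therefore to different coordinators.
final class ShareKeyRouting {

    // Assumed compact form, e.g. "group-1:topic-uuid:0".
    static String coordinatorKey(String groupId, String topicId, int partition) {
        return groupId + ":" + topicId + ":" + partition;
    }

    // Assumed verbose, toString-style form of the same key.
    static String debugString(String groupId, String topicId, int partition) {
        return "Key{groupId=" + groupId + ", topicId=" + topicId + ", partition=" + partition + "}";
    }

    // Assumed mapping from key string to a partition of the state topic.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 50; // illustrative value only
        String compact = coordinatorKey("group-1", "topic-uuid", 0);
        String verbose = debugString("group-1", "topic-uuid", 0);
        System.out.println(compact + " -> partition " + partitionFor(compact, numPartitions));
        System.out.println(verbose + " -> partition " + partitionFor(verbose, numPartitions));
    }
}
```

If the code that looks up the coordinator uses one form and the code that owns the partition uses the other, the two sides can disagree about which partition, and hence which broker, is responsible for the key. A request then lands on a broker that answers NOT_COORDINATOR no matter how often it is retried.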

AndrewJSchofield · 2024-10-31T17:24:41Z

I think this PR fixes the test failure: #17656.

Whether to handle NOT_COORDINATOR as a retriable error in this situation is another matter. I think the answer is still yes, but the error code would only happen in usual operation as a share coordinator is unloading one of the partitions it used to own. So, a request sent to the right coordinator which is no longer the coordinator would get this error, and then it should be retried.
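The scenario described here, where a coordinator stops owning a partition and the request must be retried, might be sketched as follows. This is a simplified illustration under stated assumptions, not the persister's implementation: `CoordinatorRetry`, `CoordinatorLookup`, `StateRpc`, and `sendWithRetry` are hypothetical names, and a real implementation would also apply backoff between attempts.

```java
// Hypothetical sketch: re-resolve the coordinator and retry when a node
// reports that it is no longer the coordinator for the key.
final class CoordinatorRetry {

    enum Result { OK, NOT_COORDINATOR, FATAL }

    // Hypothetical lookup: returns the node id currently believed to own the key.
    interface CoordinatorLookup { int findCoordinator(String key); }

    // Hypothetical RPC: sends a state request for the key to the given node.
    interface StateRpc { Result send(int nodeId, String key); }

    static Result sendWithRetry(String key, CoordinatorLookup lookup, StateRpc rpc, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Result last = Result.FATAL;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Look the coordinator up again on every attempt, since ownership may have moved.
            int node = lookup.findCoordinator(key);
            last = rpc.send(node, key);
            if (last != Result.NOT_COORDINATOR) {
                return last; // success or a non-retriable failure
            }
        }
        return last; // still NOT_COORDINATOR after exhausting all attempts
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        // Simulate the coordinator moving from node 1 to node 2 after the first lookup.
        CoordinatorLookup lookup = key -> lookups[0]++ == 0 ? 1 : 2;
        StateRpc rpc = (node, key) -> node == 2 ? Result.OK : Result.NOT_COORDINATOR;
        System.out.println(sendWithRetry("group-1:topic-uuid:0", lookup, rpc, 3));
    }
}
```

Note that retrying only helps when the lookup eventually returns the correct node. If the lookup itself is consistently wrong, as in the key mismatch described earlier in this thread, every attempt fails the same way.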

apoorvmittal10 · 2024-10-31T17:31:04Z

Makes sense, thanks @AndrewJSchofield. Once the other PR is merged, I'll close this.

AndrewJSchofield

There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).

Apart from that, this PR looks good to me.

chia7712

LGTM

chia7712 · 2024-11-03T20:36:11Z

> There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).

@apoorvmittal10 Could you please address this comment?

apoorvmittal10 · 2024-11-03T21:17:58Z

> > There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).
>
> @apoorvmittal10 Could you please address this comment?

Yes, I will do that tomorrow.

apoorvmittal10 · 2024-11-04T10:35:28Z

@AndrewJSchofield I have addressed the comment about retrying on the write state RPC.

AndrewJSchofield · 2024-11-04T14:07:01Z

lgtm

…17645) Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>

KAFKA-17463: Fixing test share groups test

5d6988b

apoorvmittal10 requested a review from AndrewJSchofield October 31, 2024 00:45

github-actions bot added the "KIP-932" (Queues for Kafka) and "small" (Small PRs) labels Oct 31, 2024

apoorvmittal10 marked this pull request as ready for review October 31, 2024 00:46

chia7712 mentioned this pull request Oct 31, 2024

KAFKA-17903: Remove KafkaFuture#Function and KafkaFuture#thenApply #17644

Merged

3 tasks

FrankYang0529 mentioned this pull request Oct 31, 2024

KAFKA-17880: Move integration test from streams module to streams/integration-tests module #17615

Merged

3 tasks

chia7712 reviewed Oct 31, 2024

frankvicky mentioned this pull request Oct 31, 2024

KAFKA-17837: Rewrite DeleteTopicTest #17579

Merged

3 tasks

Fixing test

8d0748e

apoorvmittal10 requested a review from chia7712 October 31, 2024 15:21

apoorvmittal10 mentioned this pull request Nov 1, 2024

KAFKA-17919: enable back the failing testShareGroups test #17658

Merged

3 tasks

AndrewJSchofield requested changes Nov 1, 2024

chia7712 approved these changes Nov 1, 2024

apoorvmittal10 added 2 commits November 4, 2024 10:29

Merge remote-tracking branch 'upstream/trunk' into KAFKA-17463

2e9a716

Adding retry on write state rpc

4c7900d

apoorvmittal10 requested a review from AndrewJSchofield November 4, 2024 10:35

chia7712 merged commit a0292ba into apache:trunk Nov 4, 2024

abhishekgiri23 pushed a commit to abhishekgiri23/kafka that referenced this pull request Nov 5, 2024

KAFKA-17463 add retry in Persister for NOT_COORDINATOR error (apache#…

f99cfa1

…17645) Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>

chiacyu pushed a commit to chiacyu/kafka that referenced this pull request Nov 30, 2024

KAFKA-17463 add retry in Persister for NOT_COORDINATOR error (apache#…

d2c3383

…17645) Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>

tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025

KAFKA-17463 add retry in Persister for NOT_COORDINATOR error (apache#…

27d651e

…17645) Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
