Skip to content

KAFKA-17463: Fixing test share groups test#17645

Merged
chia7712 merged 4 commits into
apache:trunkfrom
apoorvmittal10:KAFKA-17463
Nov 4, 2024
Merged

KAFKA-17463: Fixing test share groups test#17645
chia7712 merged 4 commits into
apache:trunkfrom
apoorvmittal10:KAFKA-17463

Conversation

@apoorvmittal10

@apoorvmittal10 apoorvmittal10 commented Oct 31, 2024

Copy link
Copy Markdown
Contributor

Fixing the test by adding retry in Persister for NOT_COORDINATOR error.

Do we need to have a retry in Persister for NOT_COORDINATOR error as well? The error is retriable. I have added that check in this PR and that fixes this test else we do receive below exception:

[2024-10-31 00:48:31,821] ERROR Unable to perform read state RPC: This is not the correct coordinator. (org.apache.kafka.server.share.persister.PersisterStateManager$ReadStateHandler:694)
[2024-10-31 00:48:31,849] ERROR Unable to perform read state RPC: This is not the correct coordinator. (org.apache.kafka.server.share.persister.PersisterStateManager$ReadStateHandler:694)

However I can notice following while running test multiple times, the above mentioned error can also occur sometimes but then also test passes as test do not wait for successful response from SharePartition initialization i.e. never consume from the share partition in test. Hence do we also need to fix something in test as well?

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

// check retryable errors
case COORDINATOR_NOT_AVAILABLE:
case COORDINATOR_LOAD_IN_PROGRESS:
case NOT_COORDINATOR:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please adjust behavior of testReadStateRequestFailButCoordinatorFoundSuccessfully to match this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@AndrewJSchofield

Copy link
Copy Markdown
Member

I remain unconvinced by this PR. There is certainly a problem somewhere in this area, but I was unable to reproduce it yesterday in spite of trying. I'll try again today. But I want to understand WHY the problem is occurring before approving a change to fix it.

@apoorvmittal10

Copy link
Copy Markdown
Contributor Author

I remain unconvinced by this PR. There is certainly a problem somewhere in this area, but I was unable to reproduce it yesterday in spite of trying. I'll try again today. But I want to understand WHY the problem is occurring before approving a change to fix it.

So previously we never removed cached SharePartition in manager despite receiving NOT_COORDINATOR error during initialization. The testShareGroups used to pass as that test vaildates group related data not consumption.

The response from read state persister RPC is flaky, sometimes successful response do appear but sometimes not. This check in the PR adds retries on NOT_COORDINATOR.

@AndrewJSchofield

Copy link
Copy Markdown
Member

It appears to me that this NOT_COORDINATOR is coming from CoordinatorRuntime.contextOrThrow. I don't yet know whether it's an acceptable thing to just add it to the list of retriable errors, or whether it is necessary to track down why the context is not present for the share-group state partition which is being looked for. Is it asynchronous startup, or is something more serious wrong?

@AndrewJSchofield

Copy link
Copy Markdown
Member

I think this PR is on the wrong track. The problem seems to be a mismatch between two pieces of code using the SharePartitionKey. There are two different string representations of this class, and both appear to be used when dealing with the share coordinator. As a result, requests are going to the wrong coordinator.

@AndrewJSchofield

Copy link
Copy Markdown
Member

I think this PR fixes the test failure: #17656.

Whether to handle NOT_COORDINATOR as a retriable error in this situation is another matter. I think the answer is still yes, but the error code would only happen in usual operation as a share coordinator is unloading one of the partitions it used to own. So, a request sent to the right coordinator which is no longer the coordinator would get this error, and then it should be retried.

@apoorvmittal10

Copy link
Copy Markdown
Contributor Author

Make sense, thanks @AndrewJSchofield. Once the other PR is merged, I ll close this.

@AndrewJSchofield AndrewJSchofield left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).

Apart from that, this PR looks good to me.

@chia7712 chia7712 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chia7712

chia7712 commented Nov 3, 2024

Copy link
Copy Markdown
Member

There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).

@apoorvmittal10 Could you please address this comment ?

@apoorvmittal10

Copy link
Copy Markdown
Contributor Author

There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).

@apoorvmittal10 Could you please address this comment ?

Yes, I will do that tomorrow.

@apoorvmittal10

Copy link
Copy Markdown
Contributor Author

@AndrewJSchofield I have addressed the comment for retry on write state rpc.

@chia7712 chia7712 merged commit a0292ba into apache:trunk Nov 4, 2024
@AndrewJSchofield

Copy link
Copy Markdown
Member

lgtm

abhishekgiri23 pushed a commit to abhishekgiri23/kafka that referenced this pull request Nov 5, 2024
…17645)

Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
chiacyu pushed a commit to chiacyu/kafka that referenced this pull request Nov 30, 2024
…17645)

Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
…17645)

Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

KIP-932 Queues for Kafka small Small PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants