KAFKA-17463: Fixing test share groups test #17645
// check retryable errors
case COORDINATOR_NOT_AVAILABLE:
case COORDINATOR_LOAD_IN_PROGRESS:
case NOT_COORDINATOR:
Could you please adjust behavior of testReadStateRequestFailButCoordinatorFoundSuccessfully to match this change?
I remain unconvinced by this PR. There is certainly a problem somewhere in this area, but I was unable to reproduce it yesterday in spite of trying. I'll try again today. But I want to understand WHY the problem is occurring before approving a change to fix it.
So previously we never removed the cached SharePartition in the manager despite receiving a NOT_COORDINATOR error during initialization. The response from the read state persister RPC is flaky: sometimes a successful response appears, and sometimes it does not. The check in this PR adds retries on NOT_COORDINATOR.
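For illustration, the retry behaviour described here could look roughly like the following. This is a minimal, self-contained sketch and not the PersisterStateManager code: the `Err` enum, `isRetriable`, and `callWithRetry` are hypothetical names, the RPC is simulated by a list of canned responses, and backoff between attempts is left out.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class PersisterRetrySketch {
    // Hypothetical stand-in for the error codes discussed in this PR.
    enum Err { NONE, COORDINATOR_NOT_AVAILABLE, COORDINATOR_LOAD_IN_PROGRESS, NOT_COORDINATOR, UNKNOWN_SERVER_ERROR }

    // Mirrors the switch in the diff: the three coordinator errors are treated as retriable.
    static boolean isRetriable(Err error) {
        switch (error) {
            case COORDINATOR_NOT_AVAILABLE:
            case COORDINATOR_LOAD_IN_PROGRESS:
            case NOT_COORDINATOR:
                return true;
            default:
                return false;
        }
    }

    // Calls the RPC until it returns a non-retriable result or maxAttempts is reached.
    static Err callWithRetry(Supplier<Err> rpc, int maxAttempts) {
        Err last = rpc.get();
        for (int attempt = 1; attempt < maxAttempts && isRetriable(last); attempt++) {
            last = rpc.get();
        }
        return last;
    }

    public static void main(String[] args) {
        // Simulated flaky read state RPC: NOT_COORDINATOR twice, then success.
        Iterator<Err> responses = List.of(Err.NOT_COORDINATOR, Err.NOT_COORDINATOR, Err.NONE).iterator();
        System.out.println(callWithRetry(responses::next, 5)); // prints NONE
    }
}
```

Without NOT_COORDINATOR in the switch, the first response would be returned to the caller as a terminal failure, which matches the flaky initialization described above.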
It appears to me that this |
I think this PR is on the wrong track. The problem seems to be a mismatch between two pieces of code using the |
I think this PR fixes the test failure: #17656. Whether to handle |
Makes sense, thanks @AndrewJSchofield. Once the other PR is merged, I'll close this.
AndrewJSchofield left a comment
There is one more case for COORDINATOR_NOT_AVAILABLE in PersisterStateManager which needs NOT_COORDINATOR added (the write state RPC).
Apart from that, this PR looks good to me.
@apoorvmittal10 Could you please address this comment?
Yes, I will do that tomorrow.
@AndrewJSchofield I have addressed the comment about retrying on the write state RPC.
lgtm
…17645) Reviewers: Andrew Schofield <aschofield@confluent.io>, Chia-Ping Tsai <chia7712@gmail.com>
Fixing the test by adding a retry in the Persister for the NOT_COORDINATOR error.
Do we need to have a retry in the Persister for the NOT_COORDINATOR error as well? The error is retriable. I have added that check in this PR, and it fixes this test; otherwise we receive the exception below:
However, when running the test multiple times, I notice that the above error can still occur sometimes, yet the test passes anyway, because the test does not wait for a successful response from SharePartition initialization, i.e. it never consumes from the share partition. Hence, do we also need to fix something in the test as well?
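One way the test could wait for a successful outcome is to poll until a record has actually been consumed, rather than returning as soon as the request is sent. The sketch below only illustrates that idea under stated assumptions: `waitUntil` is a hypothetical helper written for this example, and the consumer is simulated by a counter that "succeeds" on the third poll.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

class WaitForInitSketch {
    // Hypothetical helper: polls the condition until it holds or the timeout expires.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) return true;
            if (System.currentTimeMillis() >= deadline) return false;
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Simulated share consumer: records only become available on the third poll,
        // standing in for initialization that succeeds after retries.
        boolean consumed = waitUntil(() -> polls.incrementAndGet() >= 3, 1000, 10);
        if (!consumed) throw new AssertionError("share partition never became consumable");
        System.out.println("consumed after " + polls.get() + " polls"); // prints: consumed after 3 polls
    }
}
```

With a check like this, a test run in which initialization never succeeds would fail on the timeout instead of passing silently.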