MINOR: Fix bug in AdminClient node reassignment following connection failure#5112
Conversation
ed382c6 to
052ab94
Compare
ijuma
left a comment
There was a problem hiding this comment.
Thanks for the PR, left a few questions.
There was a problem hiding this comment.
Sorry, did this for debugging and forgot to remove.
There was a problem hiding this comment.
Why did we rename this method? The javadoc and logging seems to indicate that we are still trying to choose nodes for the pending calls?
There was a problem hiding this comment.
I renamed it for consistency with the singular checkPendingCall. I could have named that method chooseNodeForPendingCall instead, but I found the return argument not very intuitive since it would not necessarily indicate that a node was successfully chosen, but only that the call had been handled and should be removed from pendingCalls. I was not too thrilled with either naming, so happy to hear suggestions.
There was a problem hiding this comment.
Could checkPendingCalls be maybeChooseNodeForPendingCall? A bit of a mouthful, but having a checkX method that does much more than checking is a bit confusing.
There was a problem hiding this comment.
Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.
There was a problem hiding this comment.
Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.
There was a problem hiding this comment.
The comment seems to suggest that we should use !client.hasInFlightRequests() && !pendingCalls.isEmpty() instead of client.hasInFlightRequests() && !pendingCalls.isEmpty()?
There was a problem hiding this comment.
Yes, you are right.
There was a problem hiding this comment.
Would the logic be slightly simpler by just doing pendingCalls.add(metadataCall) here without invoking checkPendingCall, and move pollTimeout = Math.min(pollTimeout, checkPendingCalls(now)) to be after the if (metadataFetchDelayMs == 0) {...}?
The benefit is that the choose-node-for-call logic will only be invoked through checkPendingCalls() and checkPendingCalls() will be invoked in only one place here.
There was a problem hiding this comment.
I think the reason it is done in this order is that the node selection itself may trigger the metadata update. What if we just add the metadata call to pendingCalls and continue?
There was a problem hiding this comment.
Ah I see. Makes sense. Then I think the current code looks good.
lindong28
left a comment
There was a problem hiding this comment.
Thanks for the update. The change LGTM. Could you also explain what is bug that this patch tries to fix? I tried the two newly added tests testUnreachableBootstrapServer and testConnectionFailureOnMetadataUpdate and it seems that both two tests can pass without the diff in KafkaAdminClient.
There was a problem hiding this comment.
nits: it seems more intuitive to name it expired. But not a big deal. It is update to you.
|
@lindong28 Thanks for reviewing. See the description for the explanation of the bug. It is indeed difficult to hit with our MockClient. I had it failing at one point, but I may have altered the test later so that it no longer works. Let me try to reproduce it again. |
|
@hachikuji Ah I forgot to read the description. I guess it will help me understand it better if we can reproduce the error. Otherwise, if the issue is hard to reproduce, I am OK with this patch as well. Please let me know whether you can fix the test and whether you would like further review by @rajinisivaram. Thanks. |
4fe3cdd to
8873984
Compare
|
@lindong28 Thanks, I think I worked out the issue. There was inconsistent behavior between NetworkClient and MockClient in the handling of |
lindong28
left a comment
There was a problem hiding this comment.
@hachikuji Thanks for the update! Left one minor comment. It seems that the both two tests can pass even after I revert the change in KafkaAdminClient. Does these two tests consistently reproduce the error (if without the change in KafkaAdminClient) on you side?
| } | ||
|
|
||
| // Ensure that we use a small poll timeout if there are pending calls which need to be sent | ||
| if (!pendingCalls.isEmpty()) |
There was a problem hiding this comment.
Not sure this optimization is useful. If there is pending calls, it means these calls can not find nodes to send to and we need to wait for the metadata update before trying again. We already have the logic for reducing pollTimeout based on metadataFetchDelayMs. And client.poll() will return immediately after metadata response is received even if we use large pollTimeout. Did I miss something here?
There was a problem hiding this comment.
It's less of an optimization and more of a safety net. The intent was to ensure we are not stuck in poll() with a long timeout while we have pending requests waiting to be sent. It may be unnecessary if we're convinced that the poll timeout is computed correctly.
There was a problem hiding this comment.
I see. It is reasonable to have a safety net.
|
@lindong28 You need the changes in MockClient to reproduce the failure. You should see |
lindong28
left a comment
There was a problem hiding this comment.
@hachikuji Not sure what has changed on my side. testUnreachableBootstrapServer could succeed on both my mac and desktop if I apply all changes except those changes of KafkaAdminClient.
Thanks for the explanation. Though I don't fully understand the bug, the patch LGTM in the sense that it makes the code cleaner/safer. Not sure if you would like to have further review by @rajinisivaram.
rajinisivaram
left a comment
There was a problem hiding this comment.
@hachikuji Thanks for the PR, LGTM.
|
Merging to trunk. |
…failure (apache#5112) We added logic to reassign nodes in callToSend after a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when using MetadataUpdateNodeIdProvider if all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added to pendingCalls if a new node cannot be found.
We added logic to reassign nodes in
callToSendafter a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when usingMetadataUpdateNodeIdProviderif all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added topendingCallsif a new node cannot be found.Committer Checklist (excluded from commit message)