MINOR: Fix bug in AdminClient node reassignment following connection failure by hachikuji · Pull Request #5112 · apache/kafka

hachikuji · 2018-05-31T23:46:06Z

We added logic to reassign nodes in callToSend after a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when using MetadataUpdateNodeIdProvider if all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added to pendingCalls if a new node cannot be found.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

ijuma

Thanks for the PR, left a few questions.

ijuma · 2018-06-01T06:16:19Z

Why did we increase this?

Sorry, did this for debugging and forgot to remove.

ijuma · 2018-06-01T06:17:05Z

Method references, fancy.

ijuma · 2018-06-01T06:17:42Z

Nit: hasInFlightRequests

ijuma · 2018-06-01T06:41:53Z

Why did we rename this method? The javadoc and logging seems to indicate that we are still trying to choose nodes for the pending calls?

I renamed it for consistency with the singular checkPendingCall. I could have named that method chooseNodeForPendingCall instead, but I found the return argument not very intuitive since it would not necessarily indicate that a node was successfully chosen, but only that the call had been handled and should be removed from pendingCalls. I was not too thrilled with either naming, so happy to hear suggestions.

Could checkPendingCalls be maybeChooseNodeForPendingCall? A bit of a mouthful, but having a checkX method that does much more than checking is a bit confusing.

Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.

lindong28 · 2018-06-01T19:28:15Z

Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.

lindong28 · 2018-06-01T19:29:55Z

The comment seems to suggest that we should use !client.hasInFlightRequests() && !pendingCalls.isEmpty() instead of client.hasInFlightRequests() && !pendingCalls.isEmpty()?

Yes, you are right.

lindong28 · 2018-06-01T19:39:47Z

Would the logic be slightly simpler by just doing pendingCalls.add(metadataCall) here without invoking checkPendingCall, and move pollTimeout = Math.min(pollTimeout, checkPendingCalls(now)) to be after the if (metadataFetchDelayMs == 0) {...}?

The benefit is that the choose-node-for-call logic will only be invoked through checkPendingCalls() and checkPendingCalls() will be invoked in only one place here.

I think the reason it is done in this order is that the node selection itself may trigger the metadata update. What if we just add the metadata call to pendingCalls and continue?

Ah I see. Makes sense. Then I think the current code looks good.

lindong28

Thanks for the update. The change LGTM. Could you also explain what is bug that this patch tries to fix? I tried the two newly added tests testUnreachableBootstrapServer and testConnectionFailureOnMetadataUpdate and it seems that both two tests can pass without the diff in KafkaAdminClient.

lindong28 · 2018-06-01T23:31:06Z

nits: it seems more intuitive to name it expired. But not a big deal. It is update to you.

hachikuji · 2018-06-02T00:25:39Z

@lindong28 Thanks for reviewing. See the description for the explanation of the bug. It is indeed difficult to hit with our MockClient. I had it failing at one point, but I may have altered the test later so that it no longer works. Let me try to reproduce it again.

lindong28 · 2018-06-02T00:50:38Z

@hachikuji Ah I forgot to read the description. I guess it will help me understand it better if we can reproduce the error. Otherwise, if the issue is hard to reproduce, I am OK with this patch as well. Please let me know whether you can fix the test and whether you would like further review by @rajinisivaram. Thanks.

hachikuji · 2018-06-02T18:26:51Z

@lindong28 Thanks, I think I worked out the issue. There was inconsistent behavior between NetworkClient and MockClient in the handling of leastLoadedNode. For NetworkClient, we will not return blacked out nodes that are awaiting the reconnect backoff, but we would in MockClient. Unfortunately, we had a few unrelated test cases which were erroneously depending on this difference in behavior. We probably need to spend some time cleaning up MockClient to ensure that it's behavior actually matches NetworkClient in all cases, especially since our client testing depends on it so heavily. At the moment, it's just a tangle of one-off hacks.

lindong28

@hachikuji Thanks for the update! Left one minor comment. It seems that the both two tests can pass even after I revert the change in KafkaAdminClient. Does these two tests consistently reproduce the error (if without the change in KafkaAdminClient) on you side?

lindong28 · 2018-06-02T19:21:11Z

                }

+                // Ensure that we use a small poll timeout if there are pending calls which need to be sent
+                if (!pendingCalls.isEmpty())


Not sure this optimization is useful. If there is pending calls, it means these calls can not find nodes to send to and we need to wait for the metadata update before trying again. We already have the logic for reducing pollTimeout based on metadataFetchDelayMs. And client.poll() will return immediately after metadata response is received even if we use large pollTimeout. Did I miss something here?

It's less of an optimization and more of a safety net. The intent was to ensure we are not stuck in poll() with a long timeout while we have pending requests waiting to be sent. It may be unnecessary if we're convinced that the poll timeout is computed correctly.

I see. It is reasonable to have a safety net.

hachikuji · 2018-06-02T23:28:42Z

@lindong28 You need the changes in MockClient to reproduce the failure. You should see testUnreachableBootstrapServer hang until the test ultimately times out. The other test case was added for completeness and passes on trunk.

lindong28

@hachikuji Not sure what has changed on my side. testUnreachableBootstrapServer could succeed on both my mac and desktop if I apply all changes except those changes of KafkaAdminClient.

Thanks for the explanation. Though I don't fully understand the bug, the patch LGTM in the sense that it makes the code cleaner/safer. Not sure if you would like to have further review by @rajinisivaram.

rajinisivaram

@hachikuji Thanks for the PR, LGTM.

rajinisivaram · 2018-06-04T09:22:37Z

Merging to trunk.

…failure (apache#5112) We added logic to reassign nodes in callToSend after a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when using MetadataUpdateNodeIdProvider if all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added to pendingCalls if a new node cannot be found.

hachikuji requested a review from rajinisivaram May 31, 2018 23:46

hachikuji force-pushed the fix-admin-client-connection-failure-detection branch from ed382c6 to 052ab94 Compare May 31, 2018 23:51

ijuma reviewed Jun 1, 2018

View reviewed changes

ijuma added this to the 2.0.0 milestone Jun 1, 2018

ijuma assigned rajinisivaram Jun 1, 2018

lindong28 reviewed Jun 1, 2018

View reviewed changes

lindong28 self-assigned this Jun 1, 2018

hachikuji added 6 commits June 2, 2018 11:12

MINOR: Fix bug in AdminClient connection failure detection

4a75c3e

Small cleanups

c93ce30

A bit of cleanup

e849eea

Fix retry backoff timeout check

34afa9b

Fix inconsistent leastLoadedNode behavior in MockClient

171be20

Fix compilation errors

8873984

hachikuji force-pushed the fix-admin-client-connection-failure-detection branch from 4fe3cdd to 8873984 Compare June 2, 2018 18:24

lindong28 reviewed Jun 2, 2018

View reviewed changes

lindong28 approved these changes Jun 3, 2018

View reviewed changes

rajinisivaram approved these changes Jun 4, 2018

View reviewed changes

rajinisivaram merged commit d02f021 into apache:trunk Jun 4, 2018

Conversation

hachikuji commented May 31, 2018

Committer Checklist (excluded from commit message)

Uh oh!

ijuma left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lindong28 Jun 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lindong28 left a comment

Choose a reason for hiding this comment

Uh oh!

lindong28 Jun 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hachikuji commented Jun 2, 2018

Uh oh!

lindong28 commented Jun 2, 2018

Uh oh!

hachikuji commented Jun 2, 2018

Uh oh!

lindong28 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hachikuji commented Jun 2, 2018

Uh oh!

lindong28 left a comment

Choose a reason for hiding this comment

Uh oh!

rajinisivaram left a comment

Choose a reason for hiding this comment

Uh oh!

rajinisivaram commented Jun 4, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lindong28 Jun 1, 2018 •

edited

Loading

lindong28 Jun 1, 2018 •

edited

Loading