Skip to content

MINOR: Fix bug in AdminClient node reassignment following connection failure#5112

Merged
rajinisivaram merged 6 commits into
apache:trunkfrom
hachikuji:fix-admin-client-connection-failure-detection
Jun 4, 2018
Merged

MINOR: Fix bug in AdminClient node reassignment following connection failure#5112
rajinisivaram merged 6 commits into
apache:trunkfrom
hachikuji:fix-admin-client-connection-failure-detection

Conversation

@hachikuji

Copy link
Copy Markdown
Contributor

We added logic to reassign nodes in callToSend after a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when using MetadataUpdateNodeIdProvider if all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added to pendingCalls if a new node cannot be found.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji hachikuji requested a review from rajinisivaram May 31, 2018 23:46
@hachikuji hachikuji force-pushed the fix-admin-client-connection-failure-detection branch from ed382c6 to 052ab94 Compare May 31, 2018 23:51

@ijuma ijuma left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, left a few questions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we increase this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did this for debugging and forgot to remove.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method references, fancy.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: hasInFlightRequests

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we rename this method? The javadoc and logging seems to indicate that we are still trying to choose nodes for the pending calls?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it for consistency with the singular checkPendingCall. I could have named that method chooseNodeForPendingCall instead, but I found the return argument not very intuitive since it would not necessarily indicate that a node was successfully chosen, but only that the call had been handled and should be removed from pendingCalls. I was not too thrilled with either naming, so happy to hear suggestions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could checkPendingCalls be maybeChooseNodeForPendingCall? A bit of a mouthful, but having a checkX method that does much more than checking is a bit confusing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.

@ijuma ijuma added this to the 2.0.0 milestone Jun 1, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems that checkPendingCalls could be a bit misleading because the name suggests this is a stateless method. But the method actually changes the state of KafkaAdminClient.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment seems to suggest that we should use !client.hasInFlightRequests() && !pendingCalls.isEmpty() instead of client.hasInFlightRequests() && !pendingCalls.isEmpty()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right.

@lindong28 lindong28 Jun 1, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the logic be slightly simpler by just doing pendingCalls.add(metadataCall) here without invoking checkPendingCall, and move pollTimeout = Math.min(pollTimeout, checkPendingCalls(now)) to be after the if (metadataFetchDelayMs == 0) {...}?

The benefit is that the choose-node-for-call logic will only be invoked through checkPendingCalls() and checkPendingCalls() will be invoked in only one place here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason it is done in this order is that the node selection itself may trigger the metadata update. What if we just add the metadata call to pendingCalls and continue?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Makes sense. Then I think the current code looks good.

@lindong28 lindong28 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The change LGTM. Could you also explain what is bug that this patch tries to fix? I tried the two newly added tests testUnreachableBootstrapServer and testConnectionFailureOnMetadataUpdate and it seems that both two tests can pass without the diff in KafkaAdminClient.

@lindong28 lindong28 Jun 1, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: it seems more intuitive to name it expired. But not a big deal. It is update to you.

@lindong28 lindong28 self-assigned this Jun 1, 2018
@hachikuji

Copy link
Copy Markdown
Contributor Author

@lindong28 Thanks for reviewing. See the description for the explanation of the bug. It is indeed difficult to hit with our MockClient. I had it failing at one point, but I may have altered the test later so that it no longer works. Let me try to reproduce it again.

@lindong28

Copy link
Copy Markdown
Member

@hachikuji Ah I forgot to read the description. I guess it will help me understand it better if we can reproduce the error. Otherwise, if the issue is hard to reproduce, I am OK with this patch as well. Please let me know whether you can fix the test and whether you would like further review by @rajinisivaram. Thanks.

@hachikuji hachikuji force-pushed the fix-admin-client-connection-failure-detection branch from 4fe3cdd to 8873984 Compare June 2, 2018 18:24
@hachikuji

Copy link
Copy Markdown
Contributor Author

@lindong28 Thanks, I think I worked out the issue. There was inconsistent behavior between NetworkClient and MockClient in the handling of leastLoadedNode. For NetworkClient, we will not return blacked out nodes that are awaiting the reconnect backoff, but we would in MockClient. Unfortunately, we had a few unrelated test cases which were erroneously depending on this difference in behavior. We probably need to spend some time cleaning up MockClient to ensure that it's behavior actually matches NetworkClient in all cases, especially since our client testing depends on it so heavily. At the moment, it's just a tangle of one-off hacks.

@lindong28 lindong28 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji Thanks for the update! Left one minor comment. It seems that the both two tests can pass even after I revert the change in KafkaAdminClient. Does these two tests consistently reproduce the error (if without the change in KafkaAdminClient) on you side?

}

// Ensure that we use a small poll timeout if there are pending calls which need to be sent
if (!pendingCalls.isEmpty())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this optimization is useful. If there is pending calls, it means these calls can not find nodes to send to and we need to wait for the metadata update before trying again. We already have the logic for reducing pollTimeout based on metadataFetchDelayMs. And client.poll() will return immediately after metadata response is received even if we use large pollTimeout. Did I miss something here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's less of an optimization and more of a safety net. The intent was to ensure we are not stuck in poll() with a long timeout while we have pending requests waiting to be sent. It may be unnecessary if we're convinced that the poll timeout is computed correctly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. It is reasonable to have a safety net.

@hachikuji

Copy link
Copy Markdown
Contributor Author

@lindong28 You need the changes in MockClient to reproduce the failure. You should see testUnreachableBootstrapServer hang until the test ultimately times out. The other test case was added for completeness and passes on trunk.

@lindong28 lindong28 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji Not sure what has changed on my side. testUnreachableBootstrapServer could succeed on both my mac and desktop if I apply all changes except those changes of KafkaAdminClient.

Thanks for the explanation. Though I don't fully understand the bug, the patch LGTM in the sense that it makes the code cleaner/safer. Not sure if you would like to have further review by @rajinisivaram.

@rajinisivaram rajinisivaram left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji Thanks for the PR, LGTM.

@rajinisivaram

Copy link
Copy Markdown
Contributor

Merging to trunk.

@rajinisivaram rajinisivaram merged commit d02f021 into apache:trunk Jun 4, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…failure (apache#5112)

We added logic to reassign nodes in callToSend after a connection failure, but we do not handle the case when there is no node currently available to reassign the request to. This can happen when using MetadataUpdateNodeIdProvider if all of the known nodes are blacked out awaiting the retry backoff. To fix this, we need to ensure that the call is added to pendingCalls if a new node cannot be found.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants