Skip to content

KAFKA-9069: Flaky Test AdminClientIntegrationTest.testCreatePartitions#7619

Closed
huxihx wants to merge 1 commit into
apache:trunkfrom
huxihx:KAFKA-9069
Closed

KAFKA-9069: Flaky Test AdminClientIntegrationTest.testCreatePartitions#7619
huxihx wants to merge 1 commit into
apache:trunkfrom
huxihx:KAFKA-9069

Conversation

@huxihx

@huxihx huxihx commented Oct 31, 2019

Copy link
Copy Markdown
Contributor

https://issues.apache.org/jira/browse/KAFKA-9069

Make getTopicMetadata in AdminClientIntegrationTest always read metadata from controller to get a consistent view.

More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.

Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@huxihx

huxihx commented Oct 31, 2019

Copy link
Copy Markdown
Contributor Author

@guozhangwang @hachikuji Please review this patch. Thanks.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DescribeTopicsOptions is a public API so if we do want to add more interfaces we need to do a small KIP..

On the other hand, I'm wondering how certain we are that the root cause of the flaky test failure is due to not reading from the controller? Eventually the metadata should be propagated to other nodes as well and our default waitUntilTrue is taking 15 secs, so not sure if we should fix the test if it just taking more than 15 secs by changing the behavior of the admin client to always talk to controller (since it would also introduce large amount of loads on the controller).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of whether failing reading from controller is the root cause, do you think it makes sense to offer an alternative option for users to let them select which nodes to read the metadata?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the current metadata propagation staleness could be an issue, in fact one of the major motivations of KIP-500 is to make the metadata propagation more controllable and up-to-date.

However, I think let users select if they should read from controller or not could be a risky option, since once opens up this opportunity I'd predict everyone would select to read from controller to get most up-to-date metadata which would bring a large load on the controller.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what about retrofitting numPartitions to accept an expected partition number and having getTopicMetadata more time to await?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds promising. Could you provide a revision of the PR so I can take a closer look at the proposal?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems getTopicMetadata is only called at the beginning of testCreatePartitions, but from the reported trace it is not thrown from here.

@huxihx

huxihx commented Nov 19, 2019

Copy link
Copy Markdown
Contributor Author

retest this please.

@jsancio jsancio left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @huxihx. Left a few suggestions.

The change conflicts with the trunk branch. This test was moved to PlaintextAdminIntegrationTest.scala.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 580, I think we should change it to assertEqual(3, numParititons(topic1, expectedNumPartitionsOpt = Some(3))).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 612, I think we should change it to assertEqual(3, numParititons(topic1, expectedNumPartitionsOpt = Some(3))). Even though we know that it had to have succeeded once, future describe may go to a different broker which may return a different value.

In other words tests should always wait for describe to return the expected value an not assume that all describes will return the expected value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few more cases. I think every call to numPartitions should be changed to pass the expected number of partitions. Having said that maybe we should change this signature to always take an Int. E.g.

def assertNumPartitions(topic: String, expectedNumPartitions: Int): Unit

@jsancio jsancio left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@omkreddy omkreddy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@huxihx Thanks for the PR. LGTM.

@omkreddy omkreddy closed this in 9dbca4c Nov 29, 2019
@omkreddy

Copy link
Copy Markdown
Contributor

Merging to trunk and 2.4 (copy the changes to old classes).

omkreddy added a commit that referenced this pull request Nov 29, 2019
https://issues.apache.org/jira/browse/KAFKA-9069

Make `getTopicMetadata` in AdminClientIntegrationTest always read metadata from controller to get a consistent view.

Reviewers: Guozhang Wang <wangguoz@gmail.com>, José Armando García Sancio <jsancio@gmail.com>

Closes #7619 from huxihx/KAFKA-9069
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants