Skip to content

KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment#7795

Merged
ijuma merged 3 commits into
apache:trunkfrom
hachikuji:KAFKA-9212
Dec 8, 2019
Merged

KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment#7795
ijuma merged 3 commits into
apache:trunkfrom
hachikuji:KAFKA-9212

Conversation

@hachikuji

@hachikuji hachikuji commented Dec 7, 2019

Copy link
Copy Markdown
Contributor

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on reliable propagation of leader epoch information from the controller. Unfortunately, we have encountered a bug during partition reassignment in which the leader epoch in the controller context does not get properly updated. This causes UpdateMetadata requests to be sent with stale epoch information which results in the metadata caches on the brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch validation done by the client. Because the client includes the stale leader epoch in its requests, the leader rejects them, yet the stale metadata cache on the brokers prevents the consumer from getting the latest epoch. Hence the consumer cannot make progress while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases (which this patch does), it is not so easy to fix older brokers which means new clients could still encounter brokers with this bug. To address this problem, this patch also modifies the client to treat the leader epoch returned from the Metadata response as "unreliable" if it comes from an older version of the protocol. The client in this case will discard the returned epoch and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas correctly in the LeaderAndIsr request, so this bug does not affect replication.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@hachikuji hachikuji changed the title KAFKA-9212; Update LeaderAndIsr state in controller context after reassignment KAFKA-9212; Ensure LeaderAndIsr state updated in controller context during reassignment Dec 7, 2019

@junrao junrao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hachikuji : Thanks for the PR. LGTM

for (MetadataResponse.PartitionMetadata partitionMetadata : metadata.partitionMetadata()) {

Consumer<PartitionInfo> addToPartitions = partitionInfo -> {
int epoch = partitionMetadata.leaderEpoch().orElse(RecordBatch.NO_PARTITION_LEADER_EPOCH);

@hachikuji hachikuji Dec 7, 2019

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this was a bug. We were using the leader epoch from the response even if it was stale and we had taken the PartitionInfo from the previous update.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that covers this bug too?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe testStaleMetadata() does that

@ijuma ijuma left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Looks good overall, just a couple of comments below.

assertEquals(-1, info.epoch());
}

for (short version = 9; version <= ApiKeys.METADATA.oldestVersion(); version++) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ApiKeys.METADATA.oldestVersion() -> ApiKeys.METADATA.latestVersion()?

finishedUpdates.headOption.map {
case (partition, Right(leaderAndIsr)) =>
finalLeaderIsrAndControllerEpoch = Some(LeaderIsrAndControllerEpoch(leaderAndIsr, epoch))
finishedUpdates.get(partition).exists {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it would probably be clearer to just go with a Some/None pattern match given the side-effects we're doing below (throwing exceptions, mutations, etc).

@ijuma

ijuma commented Dec 8, 2019

Copy link
Copy Markdown
Member

One more question, did the reassignment test fail without the fix?

}

@Test
public void testIgnoreLeaderEpochInOlderMetadataResponse() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we also briefly explain the issue in the tests? Personally, I tend to read tests to understand the expected behavior and the issue with versions earlier than 9 is not immediately apparent

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK now?

}

@Test
def testProduceAndConsumeWithReassignmentInProgress(): Unit = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice test!

@stanislavkozlovski

Copy link
Copy Markdown
Contributor

@ijuma

One more question, did the reassignment test fail without the fix?

Yes, it does

@ijuma

ijuma commented Dec 8, 2019

Copy link
Copy Markdown
Member

I pushed a commit with the test fix and a couple of other minor change.

@ijuma ijuma left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ijuma ijuma merged commit 5d0cb14 into apache:trunk Dec 8, 2019
@ijuma

ijuma commented Dec 8, 2019

Copy link
Copy Markdown
Member

Merged to trunk. I think we should cherry-pick to 2.4 and 2.3 as well, but there are some conflicts. Will check if they're trivial or require a PR.

@ijuma

ijuma commented Dec 9, 2019

Copy link
Copy Markdown
Member

Submitted #7800 for the 2.4 cherry-pick. That version should hopefully apply cleanly to 2.3 as well.

ijuma pushed a commit to ijuma/kafka that referenced this pull request Dec 9, 2019
…uring reassignment (apache#7795)

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on
reliable propagation of leader epoch information from the controller. Unfortunately, we
have encountered a bug during partition reassignment in which the leader epoch in the
controller context does not get properly updated. This causes UpdateMetadata requests
to be sent with stale epoch information which results in the metadata caches on the
brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch
validation done by the client. Because the client includes the stale leader epoch in its
requests, the leader rejects them, yet the stale metadata cache on the brokers prevents
the consumer from getting the latest epoch. Hence the consumer cannot make progress
while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases
(which this patch does), it is not so easy to fix older brokers which means new clients
could still encounter brokers with this bug. To address this problem, this patch also
modifies the client to treat the leader epoch returned from the Metadata response as
"unreliable" if it comes from an older version of the protocol. The client in this case will
discard the returned epoch and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas correctly in the
LeaderAndIsr request, so this bug does not affect replication.

Reviewers: Jun Rao <junrao@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
hachikuji added a commit to hachikuji/kafka that referenced this pull request Dec 9, 2019
…uring reassignment (apache#7795)

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on
reliable propagation of leader epoch information from the controller. Unfortunately, we
have encountered a bug during partition reassignment in which the leader epoch in the
controller context does not get properly updated. This causes UpdateMetadata requests
to be sent with stale epoch information which results in the metadata caches on the
brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch
validation done by the client. Because the client includes the stale leader epoch in its
requests, the leader rejects them, yet the stale metadata cache on the brokers prevents
the consumer from getting the latest epoch. Hence the consumer cannot make progress
while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases
(which this patch does), it is not so easy to fix older brokers which means new clients
could still encounter brokers with this bug. To address this problem, this patch also
modifies the client to treat the leader epoch returned from the Metadata response as
"unreliable" if it comes from an older version of the protocol. The client in this case will
discard the returned epoch and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas correctly in the
LeaderAndIsr request, so this bug does not affect replication.

Reviewers: Jun Rao <junrao@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
hachikuji added a commit that referenced this pull request Dec 10, 2019
…uring reassignment (#7795) (#7805)

KIP-320 improved fetch semantics by adding leader epoch validation. This relies on
reliable propagation of leader epoch information from the controller. Unfortunately, we
have encountered a bug during partition reassignment in which the leader epoch in the
controller context does not get properly updated. This causes UpdateMetadata requests
to be sent with stale epoch information which results in the metadata caches on the
brokers falling out of sync.

This bug has existed for a long time, but it is only a problem due to the new epoch
validation done by the client. Because the client includes the stale leader epoch in its
requests, the leader rejects them, yet the stale metadata cache on the brokers prevents
the consumer from getting the latest epoch. Hence the consumer cannot make progress
while a reassignment is ongoing.

Although it is straightforward to fix this problem in the controller for the new releases
(which this patch does), it is not so easy to fix older brokers which means new clients
could still encounter brokers with this bug. To address this problem, this patch also
modifies the client to treat the leader epoch returned from the Metadata response as
"unreliable" if it comes from an older version of the protocol. The client in this case will
discard the returned epoch and it won't be included in any requests.

Also, note that the correct epoch is still forwarded to replicas correctly in the
LeaderAndIsr request, so this bug does not affect replication.

Reviewers: Jun Rao <junrao@gmail.com>, Stanislav Kozlovski <stanislav_kozlovski@outlook.com>, Ismael Juma <ismael@juma.me.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants