Skip to content

KAFKA-2557: separate REBALANCE_IN_PROGRESS and ILLEGAL_GENERATION error codes#222

Closed
onurkaraman wants to merge 1 commit into
apache:trunkfrom
onurkaraman:KAFKA-2557
Closed

KAFKA-2557: separate REBALANCE_IN_PROGRESS and ILLEGAL_GENERATION error codes#222
onurkaraman wants to merge 1 commit into
apache:trunkfrom
onurkaraman:KAFKA-2557

Conversation

@onurkaraman

Copy link
Copy Markdown
Contributor

No description provided.

@becketqin

Copy link
Copy Markdown
Contributor

LGTM.

@asfbot

asfbot commented Sep 18, 2015

Copy link
Copy Markdown

kafka-trunk-git-pr #448 FAILURE
Looks like there's a problem with this pull request

@hachikuji

Copy link
Copy Markdown
Contributor

LGTM

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got the ordering wrong here. We probably should check for generation id mismatch before validating group stability to give a more accurate error code back to the user.

I'll update the PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha. Good call.

@asfbot

asfbot commented Sep 21, 2015

Copy link
Copy Markdown

kafka-trunk-git-pr #472 SUCCESS
This pull request looks good

@ijuma

ijuma commented Sep 22, 2015

Copy link
Copy Markdown
Member

@gwenshap or @guozhangwang, maybe one of you two would like to review this? Two LGTMs from @hachikuji and @becketqin and it's a small PR.

@guozhangwang

Copy link
Copy Markdown
Contributor

LGTM.

@asfgit asfgit closed this in 2837fa5 Sep 22, 2015
@onurkaraman

Copy link
Copy Markdown
Contributor Author

I thought about this some more and noticed some strange result worth sharing. I think ILLEGAL_GENERATION errors would now be very rare, as there is an overlap in the UNKNOWN_CONSUMER_ID and ILLEGAL_GENERATION codes, and in the PR's condition ordering, UNKNOWN_CONSUMER_ID gets priority.

I think ILLEGAL_GENERATION would only occur with faulty consumer implementations where the HeartbeatRequest correctly provided an up-to-date consumer id, but provided a stale generation id.

@hachikuji

Copy link
Copy Markdown
Contributor

@onurkaraman Funny, I was also wondering earlier today if the generation was still needed. To actually get that error instead of UNKNOWN_CONSUMER, the client implementation would have to successfully join the group, but fail to update its own generation. Perhaps that might be possible if the implementation timed out the JoinGroup request before receiving the response and then continued consuming anyway? Seems like a rather far-fetched case to dedicate a whole concept to. The generation might be more useful if clients provided their own consumerId, but we just use uuids, so the chance of collision is basically zero.

@hachikuji

Copy link
Copy Markdown
Contributor

After thinking a little more, the generation seems more useful when group state is persisted in zookeeper. I think it tries to handle an edge case where the coordinator fails immediately after writing the group state. When the new coordinator takes over, it initially has no idea whether all members in the last group actually received the assignment or not. After it receives the heartbeats from the consumers, however, it can can tell by comparing the generation with the one stored in zookeeper. I guess that's the main case where we'd see ILLEGAL_GENERATION.

@guozhangwang

Copy link
Copy Markdown
Contributor

It can happen that for example, the old "consumer-3" falls out of the group and a new comer is assigned "consumer-3", and later when the older "consumer-3" resumes and tries to talk to the coordinator again. Admittedly this is an edge case, and I think @hachikuji 's point is also common.

@onurkaraman

Copy link
Copy Markdown
Contributor Author

I'm not suggesting we drop the notion of generation id or anything like that. At the very least, it helps you sanely reason about the rounds of rebalance. I think this would end up being helpful for debugging.

I think the main cause of the overlap in the two error codes is because we now increment generation id at the beginning of the Rebalancing state instead of at the beginning of PreparingRebalance.

Previously, with incrementing at the beginning of PreparingRebalance, it was easy to hit the ILLEGAL_GENERATION in the common rebalance case:
When the group was PreparingRebalance, a preexisting member's heartbeat would pass the group.has(consumerId) check and fail the generation id check, resulting in an ILLEGAL_GENERATION which tells the consumer that it needs to rejoin so it can be included in the following generation of the group.

Now, with incrementing at the beginning of Rebalancing, it's harder to hit the ILLEGAL_GENERATION in the common rebalance case:
When the group was PreparingRebalance, a preexisting member's heartbeat would pass the group.has(consumerId) check, pass the generation id check, and fail the group.is(Stable) check, resulting in a REBALANCE_IN_PROGRESS which tells consumer that it needs to rejoin so it can be included in the following generation of the group.

@guozhangwang

Copy link
Copy Markdown
Contributor

Yeah I agree.

xiowu0 pushed a commit to xiowu0/kafka that referenced this pull request Dec 9, 2021
…e#222)

TICKET = N/A
LI_DESCRIPTION = While we would still suffer from the flaky tests, this should speed up the release build time a lot.
EXIT_CRITERIA = When not using Github Actions for build
davide-armand pushed a commit to aiven/kafka that referenced this pull request Dec 1, 2025
jeqo added a commit to aiven/kafka that referenced this pull request Jan 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants