KAFKA-2557: separate REBALANCE_IN_PROGRESS and ILLEGAL_GENERATION error codes#222
KAFKA-2557: separate REBALANCE_IN_PROGRESS and ILLEGAL_GENERATION error codes#222onurkaraman wants to merge 1 commit into
Conversation
|
LGTM. |
|
kafka-trunk-git-pr #448 FAILURE |
|
LGTM |
There was a problem hiding this comment.
I think I got the ordering wrong here. We probably should check for generation id mismatch before validating group stability to give a more accurate error code back to the user.
I'll update the PR.
f92a9f7 to
e0f14df
Compare
|
kafka-trunk-git-pr #472 SUCCESS |
|
@gwenshap or @guozhangwang, maybe one of you two would like to review this? Two LGTMs from @hachikuji and @becketqin and it's a small PR. |
|
LGTM. |
|
I thought about this some more and noticed some strange result worth sharing. I think ILLEGAL_GENERATION errors would now be very rare, as there is an overlap in the UNKNOWN_CONSUMER_ID and ILLEGAL_GENERATION codes, and in the PR's condition ordering, UNKNOWN_CONSUMER_ID gets priority. I think ILLEGAL_GENERATION would only occur with faulty consumer implementations where the HeartbeatRequest correctly provided an up-to-date consumer id, but provided a stale generation id. |
|
@onurkaraman Funny, I was also wondering earlier today if the generation was still needed. To actually get that error instead of UNKNOWN_CONSUMER, the client implementation would have to successfully join the group, but fail to update its own generation. Perhaps that might be possible if the implementation timed out the JoinGroup request before receiving the response and then continued consuming anyway? Seems like a rather far-fetched case to dedicate a whole concept to. The generation might be more useful if clients provided their own consumerId, but we just use uuids, so the chance of collision is basically zero. |
|
After thinking a little more, the generation seems more useful when group state is persisted in zookeeper. I think it tries to handle an edge case where the coordinator fails immediately after writing the group state. When the new coordinator takes over, it initially has no idea whether all members in the last group actually received the assignment or not. After it receives the heartbeats from the consumers, however, it can can tell by comparing the generation with the one stored in zookeeper. I guess that's the main case where we'd see ILLEGAL_GENERATION. |
|
It can happen that for example, the old "consumer-3" falls out of the group and a new comer is assigned "consumer-3", and later when the older "consumer-3" resumes and tries to talk to the coordinator again. Admittedly this is an edge case, and I think @hachikuji 's point is also common. |
|
I'm not suggesting we drop the notion of generation id or anything like that. At the very least, it helps you sanely reason about the rounds of rebalance. I think this would end up being helpful for debugging. I think the main cause of the overlap in the two error codes is because we now increment generation id at the beginning of the Rebalancing state instead of at the beginning of PreparingRebalance. Previously, with incrementing at the beginning of PreparingRebalance, it was easy to hit the ILLEGAL_GENERATION in the common rebalance case: Now, with incrementing at the beginning of Rebalancing, it's harder to hit the ILLEGAL_GENERATION in the common rebalance case: |
|
Yeah I agree. |
…e#222) TICKET = N/A LI_DESCRIPTION = While we would still suffer from the flaky tests, this should speed up the release build time a lot. EXIT_CRITERIA = When not using Github Actions for build
No description provided.