Skip to content

KAFKA-9701 (fix): Only check protocol name when generation is valid#8324

Merged
guozhangwang merged 2 commits into
apache:trunkfrom
abbccdda:KAFKA-9701-fix
Mar 21, 2020
Merged

KAFKA-9701 (fix): Only check protocol name when generation is valid#8324
guozhangwang merged 2 commits into
apache:trunkfrom
abbccdda:KAFKA-9701-fix

Conversation

@abbccdda

Copy link
Copy Markdown

This bug was incurred by #7994 with a too-strong consistency check. It is because a reset generation operation could be called in between the joinGroupRequest -> joinGroupResponse -> SyncGroupRequest -> SyncGroupResponse sequence of events, if user calls unsubscribe in the middle of consumer#poll().

Proper fix is to avoid the protocol name check when the generation is invalid.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@apurvam

apurvam commented Mar 21, 2020

Copy link
Copy Markdown
Contributor

Thanks for the patch.. How was this found? Should we add tests?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just side cleanups starting L700

@mjsax

mjsax commented Mar 21, 2020

Copy link
Copy Markdown
Member

@abbccdda Would it be valid to call unsubscribe() during poll()? (I assume you mean a call in one of the RebalanceListener callback methods?)

If we consider it not valid, we might rather raise an IllegalStateException of somebody make such a call. What would be the use case to call unsubscribe() in the callback?

@abbccdda

Copy link
Copy Markdown
Author

@apurvam It was reproduced from the soak test. Will discuss with the team whether we need a stream side test as the case to mock here is a little bit tricky.

@guozhangwang

Copy link
Copy Markdown
Contributor

@mjsax I think in streams case we are not unsubscribe during poll, as a rebalance may be completed in several consecutive poll calls, and we call unsubscribe in between, which to the consumer's pov is totally legitimate.

@abbccdda I think this is not a streams-only issue, as mentioned above, any consumer could call unsubscribe after a poll call while the rebalance is actually not completed yet. It is not end of world that we did not capture it previously until we hit in soak, since unsubscribe is not a common callee indeed, and with that regard I think covering this on the consumer side is fine.

@guozhangwang

Copy link
Copy Markdown
Contributor

LGTM! Merging to trunk and 2.5

@guozhangwang guozhangwang merged commit c75dc5e into apache:trunk Mar 21, 2020
guozhangwang pushed a commit that referenced this pull request Mar 21, 2020
…8324)

This bug was incurred by #7994 with a too-strong consistency check. It is because a reset generation operation could be called in between the joinGroupRequest -> joinGroupResponse -> SyncGroupRequest -> SyncGroupResponse sequence of events, if user calls unsubscribe in the middle of consumer#poll().

Proper fix is to avoid the protocol name check when the generation is invalid.

Reviewers: Matthias J. Sax <matthias@confluent.io>, Guozhang Wang <wangguoz@gmail.com>
@mjsax mjsax added the consumer label Mar 21, 2020
@dajac

dajac commented Mar 21, 2020

Copy link
Copy Markdown
Member

Nice find!

Comment on lines +916 to +917
return protocolName != null && generation() != Generation.NO_GENERATION
&& !protocolName.equals(generation().protocolName);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guozhangwang @abbccdda Shouldn't we synchronise this or use a local reference of the generation to be 100% safe?

@dajac

dajac commented Mar 23, 2020

Copy link
Copy Markdown
Member

@guozhangwang @abbccdda While I agree with the fix to unblock the release, I wonder if it wouldn't be better to explicitly handle this case in the SyncGroupResponseHandler and fail the rebalance with when it happens. I think that an explicit handling would avoid such subtile mistakes in the future. We already do something like this in the JoinGroupResponseHandler. What do you think?

@alfhv

alfhv commented May 11, 2022

Copy link
Copy Markdown

I'm running on version 2.6.1 and I'm having exactly same issue described here.
During a DR test some kafka nodes and consumers went down. When consumers were restarted they were rejected by broker with InconsistencyGroupProtocol exception.
Stopping and restarting all consumers solved the issue.
It is possible that scenario described on this thread could arrive on version 2.6.1 (maybe due to other reason)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants