KAFKA-8715: fix timestamp bug for static member.id generation#7116
Conversation
| } | ||
|
|
||
| @Test | ||
| def shouldGetDifferentStaticMemberIdBasedOnTimestamp() { |
There was a problem hiding this comment.
Does this test expose the original bug? I thought the bug would only surface if we have a loaded group metadata with old versioned format; otherwise the time stays the same but not None.
In this test MockTimer does not advance after the staticMembersJoinAndRebalance call, so the joinTimeAsStr would be the same?
I think we need to explicitly call timer.advanceClock() within the loop as well.
There was a problem hiding this comment.
We do have that within staticMembersJoinAndRebalance
There was a problem hiding this comment.
Yes but I'm thinking that we should also advance time within the forloop right? Otherwise the timestamp would still be static.
guozhangwang
left a comment
There was a problem hiding this comment.
Thanks for the quick fix, I have a question about the test coverage.
hachikuji
left a comment
There was a problem hiding this comment.
@abbccdda Thanks for the fix. I think this bug has both a direct and indirect impact. The direct impact is the NPE. The indirect impact is that the memberId is not being uniquely generated. This seems to have an impact for fencing? I think I'm wondering whether we should use UUID.randomUUID() rather than timestamp, similar to the dynamic path.
| staticMembersJoinAndRebalance(leaderInstanceId, followerInstanceId) | ||
|
|
||
| val timeAdvance = 1 | ||
| for (_ <- 1 to 20) { |
There was a problem hiding this comment.
20 seems like overkill. Maybe two times is enough.
Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk is not worth dropping the debugging efficiency? |
|
I still feel timestamp is useful for debugging like Guozhang suggested. For uniqueness I'm fine with both. |
|
Ok, we can stick with timestamp. |
|
LGTM. |
|
Hmm, the reasoning for using a timestamp seems a bit odd. Are we sure about this? |
|
Yea, I think we discussed a couple of times during past reviews, and felt that a timestamp suffix could help debug when a static member actually rejoins the group. |
|
get 1/3 build with jdk 8 unknown failure retest this please |
|
We can use a time-based UUID or you could add the timestamp in addition to the UUID. Relying on a timestamp for uniqueness is an anti-pattern. In particular, the following statement is weird: "Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk" We know things like this happen, it's not theoretical. |
|
I'm fine with a |
|
I prefer |
|
Per offline discussion, we shall use UUID for consistency. Thank you all! |
… generation (#7116) The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id. Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
* apache-github/2.3: MINOR: Avoid dividing by zero (apache#7143) KAFKA-8731: InMemorySessionStore throws NullPointerException on startup (apache#7132) KAFKA-8715; Fix buggy reliance on state timestamp in static member.id generation (apache#7116) KAFKA-8678; Fix leave group protocol bug in throttling and error response (apache#7101)
…mestamp in static member.id generation (apache#7116) TICKET = KAFKA-8715 LI_DESCRIPTION = EXIT_CRITERIA = HASH [c29380f] ORIGINAL_DESCRIPTION = The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id. Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io> (cherry picked from commit c29380f)
The bug is that we accidentally used the loaded timestamp for the group instead of the real current time. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.
Committer Checklist (excluded from commit message)