Skip to content

KAFKA-8715: fix timestamp bug for static member.id generation#7116

Merged
hachikuji merged 3 commits into
apache:trunkfrom
abbccdda:timestamp
Jul 26, 2019
Merged

KAFKA-8715: fix timestamp bug for static member.id generation#7116
hachikuji merged 3 commits into
apache:trunkfrom
abbccdda:timestamp

Conversation

@abbccdda

Copy link
Copy Markdown

The bug is that we accidentally used the loaded timestamp for the group instead of the real current time. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@abbccdda

Copy link
Copy Markdown
Author

@guozhangwang @hachikuji FYI

}

@Test
def shouldGetDifferentStaticMemberIdBasedOnTimestamp() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test expose the original bug? I thought the bug would only surface if we have a loaded group metadata with old versioned format; otherwise the time stays the same but not None.

In this test MockTimer does not advance after the staticMembersJoinAndRebalance call, so the joinTimeAsStr would be the same?

I think we need to explicitly call timer.advanceClock() within the loop as well.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have that within staticMembersJoinAndRebalance

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I'm thinking that we should also advance time within the forloop right? Otherwise the timestamp would still be static.

@guozhangwang guozhangwang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix, I have a question about the test coverage.

@hachikuji hachikuji left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abbccdda Thanks for the fix. I think this bug has both a direct and indirect impact. The direct impact is the NPE. The indirect impact is that the memberId is not being uniquely generated. This seems to have an impact for fencing? I think I'm wondering whether we should use UUID.randomUUID() rather than timestamp, similar to the dynamic path.

Comment thread core/src/test/scala/unit/kafka/coordinator/group/GroupCoordinatorTest.scala Outdated
staticMembersJoinAndRebalance(leaderInstanceId, followerInstanceId)

val timeAdvance = 1
for (_ <- 1 to 20) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

20 seems like overkill. Maybe two times is enough.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, will reduce it to 5.

@guozhangwang

Copy link
Copy Markdown
Contributor

@abbccdda Thanks for the fix. I think this bug has both a direct and indirect impact. The direct impact is the NPE. The indirect impact is that the memberId is not being uniquely generated. This seems to have an impact for fencing? I think I'm wondering whether we should use UUID.randomUUID() rather than timestamp, similar to the dynamic path.

Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk is not worth dropping the debugging efficiency?

@abbccdda

Copy link
Copy Markdown
Author

I still feel timestamp is useful for debugging like Guozhang suggested. For uniqueness I'm fine with both.

@hachikuji

Copy link
Copy Markdown
Contributor

Ok, we can stick with timestamp.

@guozhangwang

Copy link
Copy Markdown
Contributor

LGTM.

@ijuma

ijuma commented Jul 26, 2019

Copy link
Copy Markdown
Member

Hmm, the reasoning for using a timestamp seems a bit odd. Are we sure about this?

@abbccdda

Copy link
Copy Markdown
Author

Yea, I think we discussed a couple of times during past reviews, and felt that a timestamp suffix could help debug when a static member actually rejoins the group.

@abbccdda

Copy link
Copy Markdown
Author

get 1/3 build with jdk 8 unknown failure

retest this please

@ijuma

ijuma commented Jul 26, 2019

Copy link
Copy Markdown
Member

We can use a time-based UUID or you could add the timestamp in addition to the UUID. Relying on a timestamp for uniqueness is an anti-pattern. In particular, the following statement is weird:

"Assuming the instance-id is unique already, then practically time.millisecond() should be sufficient as you will only break fencing if the clients configured with the same instance id are joining at the exact same milli-second. So I think this theoretical risk"

We know things like this happen, it's not theoretical.

@guozhangwang

Copy link
Copy Markdown
Contributor

I'm fine with a instanceId + UUID + timestamp format than instanceId + timestamp -- I do feel maintaining the timestamp would be convenient for debugging.

@hachikuji

Copy link
Copy Markdown
Contributor

I prefer instanceId + UUID for consistency. I think the debugging benefit of having a timestamp is marginal.

@abbccdda

Copy link
Copy Markdown
Author

Per offline discussion, we shall use UUID for consistency. Thank you all!

@hachikuji hachikuji merged commit 79341a1 into apache:trunk Jul 26, 2019
hachikuji pushed a commit that referenced this pull request Jul 26, 2019
… generation (#7116)

The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
ijuma added a commit to confluentinc/kafka that referenced this pull request Aug 4, 2019
* apache-github/2.3:
  MINOR: Avoid dividing by zero (apache#7143)
  KAFKA-8731: InMemorySessionStore throws NullPointerException on startup (apache#7132)
  KAFKA-8715; Fix buggy reliance on state timestamp in static member.id generation (apache#7116)
  KAFKA-8678; Fix leave group protocol bug in throttling and error response (apache#7101)
xiowu0 pushed a commit to linkedin/kafka that referenced this pull request Aug 22, 2019
…mestamp in static member.id generation (apache#7116)

TICKET = KAFKA-8715
LI_DESCRIPTION =
EXIT_CRITERIA = HASH [c29380f]
ORIGINAL_DESCRIPTION =

The bug is that we accidentally used the current state timestamp for the group instead of the real current time. When a group is first loaded, this timestamp is not initialized, so this resulted in a `NoSuchElementException`. Additionally this violated the intended uniqueness of the memberId, which could have broken the group instance fencing. Fix is made and unit test to make sure the timestamp is properly encoded within the returned member.id.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Guozhang Wang <wangguoz@gmail.com>, Jason Gustafson <jason@confluent.io>
(cherry picked from commit c29380f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants