Skip to content

KAFKA-17912: Align string representations of SharePartitionKey#17656

Merged
chia7712 merged 2 commits into
apache:trunkfrom
AndrewJSchofield:KAFKA-17912
Nov 1, 2024
Merged

KAFKA-17912: Align string representations of SharePartitionKey#17656
chia7712 merged 2 commits into
apache:trunkfrom
AndrewJSchofield:KAFKA-17912

Conversation

@AndrewJSchofield

Copy link
Copy Markdown
Member

The recent test failure documented by #17649 seems to be related to the string representation of SharePartitionKey. This is a combination of group-id:topic-id:partition used by share groups to refer to a "share-partition" which is a particular share group's view of a topic-partition. The SharePartitionKey is used to find the share coordinator responsible for a share-partition.

It looks like some code was using a simple concatenation of the fields, while others was using toString which is human-readable. This mismatch seems to have caused requests to be sent to the wrong share coordinator, resulting in tests timing out.

This PR aligns all lookup-based uses of SharePartitionKey on the simple concatenation of its constituent parts.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@mumrah mumrah left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix on this @AndrewJSchofield 👍


private TopicPartition topicPartitionFor(SharePartitionKey key) {
return new TopicPartition(Topic.SHARE_GROUP_STATE_TOPIC_NAME, partitionFor(key.toString()));
return new TopicPartition(Topic.SHARE_GROUP_STATE_TOPIC_NAME, partitionFor(key.asCoordinatorKey()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndrewJSchofield could we move away from using String keys to avoid issues like this in the future? IOW, could we refactor partitionFor to accept a SharePartitionKey?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created https://issues.apache.org/jira/browse/KAFKA-17914 to cover this. Unfortunately, it's a little more work than just changing this class, and I think it would be better to prioritise getting this merged.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah Changing partitionFor would be counter productive, IMO. This is because it will be invoked by the KafkaApis FIND_COORD API handler.

The find coordinator request comprises of the coordinator keys as Strings (as its a JSON defined RPC spec). The find coordinator handler will simply invoke partitionFor with the key specified in the request argument.

If we change partitionFor signature to accept the SharePartitionKey, we would not have enough information in FIND_COORD API handler to create a SharePartitionKey object and call partitionFor unless we resort to splitting and formatting string in the request and creating the required object and pass to partitionFor.

cc: @AndrewJSchofield

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smjn I don't quite agree. I understand that the FindCoordinator RPC includes a string group-id:topic-id:partition. We could transform that into a SharePartitionKey on receipt, and then ShareCoordinator.partitionFor could take a SharePartitionKey as its parameter.

One other thing jumped out at me when I was debugging this. SharePartitionKey is a combination of a group ID and a TopicIdPartition. The topic name in the TopicIdPartition was null because it was constructed when no topic name was available. While a partition can be identified by just a topic ID and a partition, it does strike me that we need to be super-careful that we don't have a topic name sometimes, and no topic name at others, and introduce a bug due to this difference. One thing we could do for instance is ensure that the TopicIdPartition in a SharePartitionKey always has a null topic name. Maybe it doesn't make sense to use TopicIdPartition.toString() as part of the SharePartitionKey.toString() because the topic name is always null. Just wondering whether there's an easy win here in terms of defensive coding.

@smjn smjn Nov 1, 2024

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndrewJSchofield

We could transform that into a SharePartitionKey on receipt

Which would mean writing code to split and format strings - was trying to avoid it.

is a combination of a group ID and a TopicIdPartition

I did not want to do this. The initial impl of SPK clearly defined 3 fields -String groupId, Uuid topicId, int partition. In fact, the entire share-coordinator flow does not use the TopicIdPartition argument to the constructor at all. Not sure whether it is necessary for the SharePartition flow or was it done for aesthetic reasons.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the entire share-coordinator flow does not use the TopicIdPartition argument to the constructor at all

IMO, we should use TopicIdPartition class itself as used elsewhere, not beacause of aesthetic reasons but it's a logical encapsulated entity.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's a logical encapsulated entity.

Yes thats true but it might not be a good fit since topicName is always (or mostly) null.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats true but it might not be a good fit since topicName is always (or mostly) null.

And why is that? For the calls from SharePartition we can fill up topicName as well. We do a metadata read to verify first.

@smjn smjn Nov 1, 2024

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you plan to use it then it should be fine. Otherwise, serves no purpose.

It can be argued that doing this causes additional problems, both with FIND_COORD RPC as well as well as persistence RPCs.

If we forgo the topicName from both the SharePartition and the ShareCoordinator side, we could directly invoke SPK.hashCode() method in the partitionFor method thus completely abstracting out the reliance on groupId:topicId:partitionId.

And as pointed out above we do not have access to the topic name in the above RPCs.

@@ -539,7 +539,7 @@ public void onNewMetadataImage(MetadataImage newImage, MetadataDelta delta) {
}

private TopicPartition topicPartitionFor(SharePartitionKey key) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a unit test for this method?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the string representation String.format("%s:%s:%d", groupId, topicId, partition) included by KIP-932? if not, maybe we should add it to the KIP and json file (FindCoordinatorRequest).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added it to the JSON file. I'll make sure the KIP contains this too.

@smjn smjn Nov 1, 2024

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apoorvmittal10

Copy link
Copy Markdown
Contributor

Thanks @AndrewJSchofield for the quick fix.

@chia7712

Copy link
Copy Markdown
Member

@AndrewJSchofield nice find. It seems like it could pass sometimes if we're lucky, as both string representations might end up mapping to the same partition.

@apoorvmittal10 apoorvmittal10 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also enable back the failing testShareGroups test?

@chia7712 chia7712 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I'll merge this to unblock the CI. Please feel free to leave more comments or open a JIRA for further improvements.

@chia7712 chia7712 merged commit 346fdba into apache:trunk Nov 1, 2024
@chia7712

chia7712 commented Nov 1, 2024

Copy link
Copy Markdown
Member

Shall we also enable back the failing testShareGroups test?

https://issues.apache.org/jira/browse/KAFKA-17919

@AndrewJSchofield AndrewJSchofield deleted the KAFKA-17912 branch November 1, 2024 13:01
abhishekgiri23 pushed a commit to abhishekgiri23/kafka that referenced this pull request Nov 2, 2024
chiacyu pushed a commit to chiacyu/kafka that referenced this pull request Nov 30, 2024
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants