KAFKA-17848: Fixing share purgatory request and locks handling by apoorvmittal10 · Pull Request #17583 · apache/kafka

apoorvmittal10 · 2024-10-22T22:29:51Z

For a delayed share fetch, tryComplete can be called again after onComplete has run. This can happen because requests are processed by parallel threads. We acquire locks in tryComplete, and they stay held because onComplete is never called again once the request is already completed.

Locally, I added a Uuid to each DelayedShareFetch to track the calls. The second tryComplete call did invoke forceComplete(), but because the request was already completed, onComplete was never called.

The PR adds a safety check on isCompleted(). That check alone is not sufficient, so the PR also checks the response of forceComplete and releases the locks when it returns false.

I have also moved the purgatory calls in share partition out of the partition lock, as holding it is not required.

[2024-10-23 16:27:03,875] INFO Try to complete. Member NDHrnOecQ8aTXA6PXfdcTw, Uuid: mQMiOtV9RYSQciVQ9dT3AA (kafka.server.share.DelayedShareFetch)

[2024-10-23 16:27:03,875] INFO Completing the delayed share fetch request for group perf-share-consumer, member NDHrnOecQ8aTXA6PXfdcTw, Uuid: mQMiOtV9RYSQciVQ9dT3AA (kafka.server.share.DelayedShareFetch)

[2024-10-23 16:27:03,907] INFO Try to complete. Member NDHrnOecQ8aTXA6PXfdcTw, mQMiOtV9RYSQciVQ9dT3AA (kafka.server.share.DelayedShareFetch)
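The sequence in these logs (try to complete, complete, try to complete again) can be sketched in isolation. The following is a minimal, single-threaded illustration and not the actual DelayedShareFetch code: `ShareFetchSketch`, `partitionLock`, and `onCompleteCalls` are hypothetical names, and an `AtomicBoolean` stands in for the real share-partition locks. It assumes the contract discussed in this PR, where forceComplete returns true only for the caller that actually completes the request.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class ShareFetchSketch {
    // Hypothetical stand-in for the share-partition lock taken in tryComplete.
    static final AtomicBoolean partitionLock = new AtomicBoolean(false);
    static final AtomicBoolean completed = new AtomicBoolean(false);
    static int onCompleteCalls = 0;

    // Only the first caller flips the flag and runs onComplete.
    static boolean forceComplete() {
        if (completed.compareAndSet(false, true)) {
            onComplete();
            return true;
        }
        return false;
    }

    // Completion logic; in this sketch it is also where the lock is released.
    static void onComplete() {
        onCompleteCalls++;
        partitionLock.set(false);
    }

    static boolean tryComplete() {
        if (!partitionLock.compareAndSet(false, true)) {
            return false; // lock not available, nothing to do
        }
        boolean completedByMe = forceComplete();
        if (!completedByMe) {
            // Already completed elsewhere: onComplete will not run again,
            // so the lock taken above has to be released here.
            partitionLock.set(false);
        }
        return completedByMe;
    }
}
```

Without the `!completedByMe` branch, the second tryComplete call would leave `partitionLock` held indefinitely, which is the symptom described above.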

apoorvmittal10 · 2024-10-22T23:15:26Z

@junrao For my understanding, what can trigger onComplete without invoking tryComplete? Is it the number of requests that can be in purgatory? I was running 10 share consumers reading in parallel, with 5 million records already produced over 16 partitions.

adixitconfluent · 2024-10-23T05:09:36Z

        Map<TopicIdPartition, FetchRequest.PartitionData> topicPartitionData;
        // tryComplete did not invoke forceComplete, so we need to check if we have any partitions to fetch.
-        if (topicPartitionDataFromTryComplete.isEmpty())
+        if (topicPartitionDataFromTryComplete == null || topicPartitionDataFromTryComplete.isEmpty())


For my understanding, how can this value be null? We initialize it in the DelayedShareFetch constructor, and every update to it returns a map.

I misread one thing: instead of null, it is always empty. I am re-checking why this PR change fixes the issue consistently. I can easily reproduce the issue without the change and never with it. I will update.

adixitconfluent · 2024-10-23T18:04:32Z

+        // However, this check alone cannot guarantee that request is really completed. It is possible that
+        // tryComplete is invoked by multiple threads and state has yet not updated. Hence, we need to check
+        // the forceComplete response as well.
+        if (isCompleted()) {


Can we also add unit tests for the conditions to verify this line and line 164?

Yeah, let me do that tomorrow. I have tested the runs manually and verified with 25 parallel share consumers and 25 million messages.

I have added tests.

AndrewJSchofield

Nice catch.

junrao

@apoorvmittal10 : Thanks for the PR. Left a few comments.

junrao · 2024-10-23T20:28:01Z

        }
+        // If we have an acquisition lock timeout for a share-partition, then we should check if
+        // there is a pending share fetch request for the share-partition and complete it.
+        DelayedShareFetchKey delayedShareFetchKey = new DelayedShareFetchGroupKey(groupId, topicIdPartition.topicId(), topicIdPartition.partition());


Should we call this under if (!stateBatches.isEmpty())?

My bad, corrected it.

junrao · 2024-10-23T20:31:08Z

+        // However, this check alone cannot guarantee that request is really completed. It is possible that
+        // tryComplete is invoked by multiple threads and state has yet not updated. Hence, we need to check
+        // the forceComplete response as well.
+        if (isCompleted()) {


This is unnecessary since the caller DelayedOperationPurgatory.Watchers.tryCompleteWatched already does this.

Oh, I added a log inside the method and can see that a lot of requests land here.

So I rechecked and added a log line, and I can see that tryComplete is being called even when completed is true.

Here is my understanding, I can see in DelayedOperation.scala:

tryComplete() is always executed safely, i.e. from safeTryComplete or safeTryCompleteOrElse, which take a lock on the DelayedOperation itself, so no two threads can execute tryComplete() simultaneously. Correct?

You are right that tryCompleteWatched already has a completed check. But the issue does exist.

This triggers only when there are multiple share consumers for the same group and the same topic partition. I traced the calls and found the following: the calls originate from addToActionQueue, defined in onComplete of DelayedShareFetch. The request goes through tryCompleteWatched, but tryComplete is then called again despite the request being completed. The conditional variable in DelayedOperation etc. seems fine to me. I am not sure how it triggers.

[2024-10-25 17:34:32,670] INFO Share fetch request for group SG1, member hYDnbPjATHqM7uFXBGYKTw is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:32,703] INFO Share fetch request for group SG1, member hYDnbPjATHqM7uFXBGYKTw is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:32,754] INFO Share fetch request for group SG1, member hYDnbPjATHqM7uFXBGYKTw is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:33,191] INFO Share fetch request for group SG1, member hYDnbPjATHqM7uFXBGYKTw is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:33,391] INFO Share fetch request for group SG1, member OE4al-DNR6C8u3tiD05rGQ is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:33,682] INFO Share fetch request for group SG1, member OE4al-DNR6C8u3tiD05rGQ is already completed (kafka.server.share.DelayedShareFetch)

[2024-10-25 17:34:34,363] INFO Share fetch request for group SG1, member OE4al-DNR6C8u3tiD05rGQ is already completed (kafka.server.share.DelayedShareFetch)

But once a DelayedShareFetch reaches tryComplete with isCompleted already true, it never arrives in tryComplete of that DelayedShareFetch again.

Interesting. tryCompleteWatched checks isCompleted without the lock. So, it's possible that multiple callers check isCompleted and get false. They all get queued up on the lock and will call safeTryComplete and tryComplete multiple times. Perhaps we could further add an isCompleted check inside safeTryComplete before making the tryComplete call.

if (curr.isCompleted) {
  // another thread has completed this operation, just remove it
  iter.remove()
} else if (curr.safeTryComplete()) {

Yeah, it makes more sense to move this to DelayedOperation. Done.
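The idea of re-checking completion after acquiring the lock can be sketched as below. This is an illustrative Java analogue under stated assumptions, not the Scala code in DelayedOperation: `GuardedOperation` and `tryCompleteCalls` are hypothetical names, and tryComplete here always succeeds on its first call. The already-completed case returns false, following the return-value contract settled on later in this review.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class GuardedOperation {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicBoolean completed = new AtomicBoolean(false);
    int tryCompleteCalls = 0;

    boolean isCompleted() {
        return completed.get();
    }

    // Stand-in for the operation-specific logic; it completes on the first call.
    private boolean tryComplete() {
        tryCompleteCalls++;
        return completed.compareAndSet(false, true);
    }

    // Callers may have seen isCompleted() == false before queuing on the lock.
    // Checking again under the lock lets late arrivals skip tryComplete.
    boolean safeTryComplete() {
        lock.lock();
        try {
            if (isCompleted()) {
                return false;
            }
            return tryComplete();
        } finally {
            lock.unlock();
        }
    }
}
```

With the in-lock check, a second call to `safeTryComplete()` on the same instance returns without invoking `tryComplete()`, so any locks that tryComplete would acquire are never taken for an already-completed request.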

junrao · 2024-10-23T20:33:53Z

+                        log.trace("Record lock partition limit exceeded for SharePartition {}, " +
                            "cannot acquire more records", sharePartition);
+                    }
+                } catch (Exception e) {


Where is the exception coming from?

I have added the exception block so that if there is any unexpected exception, the lock is at least released.
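A defensive block of this kind might look like the following sketch. It is an assumption-laden illustration rather than the PR's code: `LockReleaseSketch`, `fetchLock`, and `acquireWithCleanup` are hypothetical names, an `AtomicBoolean` stands in for the share-partition lock, and the caller-supplied `canAcquireRecords` check represents whatever logic runs while the lock is held.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class LockReleaseSketch {
    // Hypothetical stand-in for the per-partition fetch lock.
    static final AtomicBoolean fetchLock = new AtomicBoolean(false);

    // Returns true if the lock is held and records can be acquired.
    // On any other outcome the lock is released before returning.
    static boolean acquireWithCleanup(BooleanSupplier canAcquireRecords) {
        if (!fetchLock.compareAndSet(false, true)) {
            return false; // lock already held elsewhere
        }
        try {
            if (canAcquireRecords.getAsBoolean()) {
                return true; // keep the lock; it is released on completion
            }
            fetchLock.set(false);
            return false;
        } catch (Exception e) {
            // Unexpected failure while holding the lock: release it so that
            // later requests for the same partition are not blocked.
            fetchLock.set(false);
            return false;
        }
    }
}
```

Whether to swallow or rethrow the exception after releasing is a design choice; this sketch swallows it only to keep the example short.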

AndrewJSchofield · 2024-10-25T12:17:35Z

+            }
+            return result;
+        }
+        log.trace("Can't acquire records for any partition in the share fetch request for group {}, member {}, " +


The frequency of this log line is phenomenal. I wonder whether it's really helpful or just likely to flood the logs to the extent that it's impossible to see anything else.

yeah, I think I made it INFO at the beginning because I was testing the purgatory stuff, but going forward we will make the logs as trace/debug. I forgot to change it to DEBUG when it got merged.

I have removed this log line, we can check the purgatory metric to see waiting requests.

Thanks. That seems much more appropriate.

apoorvmittal10 · 2024-10-25T17:22:49Z

@junrao @adixitconfluent @AndrewJSchofield Can I please get a re-review.

junrao

@apoorvmittal10 : Thanks for the updated PR. Just a minor comment.

junrao · 2024-10-25T20:37:01Z

-                "topic partitions {}", shareFetchData.groupId(),
-                shareFetchData.memberId(), shareFetchData.partitionMaxBytes().keySet());
+        if (!topicPartitionDataFromTryComplete.isEmpty()) {
+            boolean result = forceComplete();


result => completedByMe?

apoorvmittal10 · 2024-10-25T22:03:02Z

@junrao Thanks for the suggestion, I have addressed the feedback.

junrao

@apoorvmittal10 : Thanks for the updated PR. One more comment.

junrao · 2024-10-25T22:56:45Z

-  private[server] def safeTryComplete(): Boolean = inLock(lock)(tryComplete())
+  private[server] def safeTryComplete(): Boolean = inLock(lock) {
+    if (isCompleted)
+      true


If yes execute the completion logic by calling
forceComplete() and return true iff forceComplete returns true; otherwise return false

This is the return value definition. So if the request is completed, we should return false.

Yeah, you are right. Since the request is already completed, it should return false, as the other thread should have already bumped the completed count. It should also be removed from the iteration once the watcher sees it completed. I made the change.
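The reason for returning false becomes clearer when looking at how a watcher loop counts completions. The sketch below is a simplified, single-threaded Java illustration modelled on the snippet quoted earlier in this thread, not the real purgatory code; `WatcherSketch`, `Op`, and `completedByMe` are hypothetical names.

```java
import java.util.Iterator;
import java.util.List;

class WatcherSketch {
    // Hypothetical stand-in for a delayed operation; not the Kafka class.
    static class Op {
        boolean completed;

        Op(boolean completed) {
            this.completed = completed;
        }

        boolean isCompleted() {
            return completed;
        }

        // Returns true only when this call performed the completion.
        boolean safeTryComplete() {
            if (completed) {
                return false; // someone else already completed and counted it
            }
            completed = true;
            return true;
        }
    }

    // Returns how many operations this pass completed itself.
    static int tryCompleteWatched(List<Op> watched) {
        int completedByMe = 0;
        Iterator<Op> iter = watched.iterator();
        while (iter.hasNext()) {
            Op curr = iter.next();
            if (curr.isCompleted()) {
                iter.remove(); // completed elsewhere: drop it, do not count it
            } else if (curr.safeTryComplete()) {
                iter.remove();
                completedByMe++;
            }
        }
        return completedByMe;
    }
}
```

If `safeTryComplete()` returned true for an operation that another thread had already completed, the second branch would count the same completion twice. Returning false leaves the entry in place until a later pass removes it through the `isCompleted()` branch.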

junrao

@apoorvmittal10 : Thanks for the updated PR. LGTM

…e#17583) For delayed fetch, tryComplete can be called again after onComplete. As the requests are processed with parallel threads hence this scenario can occur. We attain locks in tryComplete which keeps pending as onComplete is never called when request is already completed. Reviewers: Abhinav Dixit <adixit@confluent.io>, Andrew Schofield <aschofield@confluent.io>, Jun Rao <junrao@gmail.com>

KAFKA-17848: Fixing NPE in delayed share fetch

f3a60a1

github-actions Bot added core Kafka Broker KIP-932 Queues for Kafka small Small PRs labels Oct 22, 2024

adixitconfluent reviewed Oct 23, 2024


apoorvmittal10 marked this pull request as draft October 23, 2024 07:43

Fixing issues with purgatory

1d8f489

apoorvmittal10 changed the title ~~KAFKA-17848: Fixing NPE in delayed share fetch~~ KAFKA-17848: Fixing share purgatory request and locks handling Oct 23, 2024

spotless fixes

fb7f482

apoorvmittal10 requested a review from adixitconfluent October 23, 2024 17:55

apoorvmittal10 marked this pull request as ready for review October 23, 2024 17:57

apoorvmittal10 requested review from junrao and removed request for adixitconfluent October 23, 2024 17:57

adixitconfluent suggested changes Oct 23, 2024


AndrewJSchofield reviewed Oct 23, 2024


junrao reviewed Oct 23, 2024


AndrewJSchofield added the ci-approved label Oct 24, 2024

Merge remote-tracking branch 'upstream/trunk' into KAFKA-17848

c3d3d05

AndrewJSchofield requested changes Oct 25, 2024


Adding tests

1c75cd5

github-actions Bot removed the small Small PRs label Oct 25, 2024

apoorvmittal10 requested review from AndrewJSchofield, adixitconfluent and junrao October 25, 2024 17:22

junrao reviewed Oct 25, 2024


Moving check to delayed operation

06988fa

apoorvmittal10 requested a review from junrao October 25, 2024 22:02

junrao reviewed Oct 25, 2024


apoorvmittal10 added 2 commits October 26, 2024 11:08

Merge remote-tracking branch 'upstream/trunk' into KAFKA-17848

9ae0a9f

Correcting return value

8ddc810

apoorvmittal10 requested a review from junrao October 26, 2024 10:16

junrao approved these changes Oct 26, 2024


junrao merged commit 397ae59 into apache:trunk Oct 26, 2024

apoorvmittal10 mentioned this pull request Nov 4, 2024

KAFKA-17890: Move DelayedOperationPurgatory to server-common #17636

Merged


4 participants
