KAFKA-5886: Implement KIP-91 delivery.timeout.ms by sutambe · Pull Request #3849 · apache/kafka

sutambe · 2017-09-13T17:40:32Z

First shot at implementing KIP-91 delivery.timeout.ms. Summary:

Added the delivery.timeout.ms config. Default: 120,000.
Changed retries to MAX_INT.
Batches may expire whether they are in flight or not, so muted is no longer used in RecordAccumulator.expiredBatches.
In some rare situations batch.done may be called twice. Attempted transitions from failed to succeeded are logged. Succeeded to succeeded is an error (exception, as before). Other transitions (failed-->aborted, aborted-->failed) are ignored.
The old test in RecordAccumulatorTest is removed. RecordAccumulatorTest has three additional tests: testExpiredBatchSingle, testExpiredBatchesSize, and testExpiredBatchesRetry. All of them test that expiry is independent of muted.
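
For readers who want to see what the summary means from the client side, here is a minimal sketch using plain java.util.Properties. The keys delivery.timeout.ms, retries, and request.timeout.ms are the producer config names discussed in this thread; the class name, the method name, and the request.timeout.ms value are illustrative assumptions, and nothing here touches the Kafka client library.

```java
import java.util.Properties;

/** Hypothetical helper: collects the settings named in the PR summary. */
final class DeliveryTimeoutConfigSketch {

    static Properties producerProps() {
        Properties props = new Properties();
        // Default proposed in this PR: 120,000 ms.
        props.setProperty("delivery.timeout.ms", "120000");
        // The PR changes retries to MAX_INT, leaving the delivery timeout as the effective bound.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Illustrative value only; it is not taken from this PR.
        props.setProperty("request.timeout.ms", "30000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = producerProps();
        System.out.println("delivery.timeout.ms=" + props.getProperty("delivery.timeout.ms"));
        System.out.println("retries=" + props.getProperty("retries"));
    }
}
```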

tedyu · 2017-09-13T20:14:46Z

Should <= be used ?

tedyu · 2017-09-13T20:16:27Z

deliveryTimeoutMs should be mentioned

apurvam · 2017-09-13T23:27:51Z

Thanks for the PR, @sutambe. Looking over the changes, there seem to be two cases from the KIP that are not covered:

Setting the pollTimeout to be the expiryTime of the oldest batch being sent in the produce request. I think we need this to make sure that we expire batches in a timely manner.
Related to the previous point, the current patch doesn't seem to expire inflight requests, which is another feature that the KIP seems to promise.

Have I missed something? Or are you planning on adding the functionality above?

becketqin

Thanks for the patch. Left some comments.

becketqin · 2017-09-13T18:15:47Z

It seems better to say Producer.send() instead of send.

becketqin · 2017-09-13T18:18:06Z

We are passing now everywhere else. Maybe we can just keep the argument name the same.

The actual argument is now. However, I would like the formal argument name to be createTime because it is a fixed value at batch construction, whereas now is, by definition, changing.

becketqin · 2017-09-14T03:06:17Z

Should we validate that delivery.timeout.ms is greater than request.timeout.ms?
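
A minimal sketch of the check being asked about, assuming the rule is exactly the one stated in the question (delivery.timeout.ms strictly greater than request.timeout.ms). The class and method names are hypothetical, and IllegalArgumentException is used only to keep the example free of Kafka dependencies; the rule that was eventually adopted may differ.

```java
/** Hypothetical validator for the relationship between the two timeouts. */
final class TimeoutValidationSketch {

    /** Throws if the delivery timeout is not strictly greater than the request timeout. */
    static void validate(long deliveryTimeoutMs, long requestTimeoutMs) {
        if (deliveryTimeoutMs <= requestTimeoutMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms (" + deliveryTimeoutMs + ") must be greater than "
                    + "request.timeout.ms (" + requestTimeoutMs + ")");
        }
    }

    public static void main(String[] args) {
        validate(120_000L, 30_000L); // accepted
        try {
            validate(10_000L, 30_000L); // rejected
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```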

becketqin · 2017-09-14T03:31:27Z

It is probably cleaner to have an explicit EXPIRED state.

I did some digging around. An expired batch's final state is FAILED. I don't feel great about adding yet another finalState. We already have ABORTED and FAILED. The ProducerBatch.done will get even more complicated.

becketqin · 2017-09-14T03:46:07Z

The logic here probably needs more comments. There are three cases in which the state of a batch may already have been updated before the ProduceResponse returns:

A transaction is aborted. The state of the batches would have been updated to ABORTED.

The producer is closed forcefully. The state of the batches would have been updated to ABORTED.

The batch expires while it is in-flight. The state of the batch would have been updated to EXPIRED.

In all other cases, we should throw IllegalStateException.

Please review the updated method documentation.
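
The three cases in the comment above can be sketched as a small state holder. This is an illustration under assumptions, not the patch itself: the class name, the done method shape, and the EXPIRED constant (the state proposed in this review) are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the final-state handling of a batch. */
final class BatchStateSketch {
    enum FinalState { ABORTED, FAILED, SUCCEEDED, EXPIRED }

    private final AtomicReference<FinalState> finalState = new AtomicReference<>(null);

    /**
     * Returns true if this call set the final state, false if the batch was already
     * ABORTED or EXPIRED (a late ProduceResponse), and throws in every other case.
     */
    boolean done(FinalState newState) {
        if (finalState.compareAndSet(null, newState)) {
            return true; // the first completion wins
        }
        FinalState current = finalState.get();
        if (current == FinalState.ABORTED || current == FinalState.EXPIRED) {
            return false; // transaction abort, forced close, or in-flight expiry happened first
        }
        throw new IllegalStateException(
            "Batch already completed as " + current + ", cannot move to " + newState);
    }

    FinalState state() {
        return finalState.get();
    }

    public static void main(String[] args) {
        BatchStateSketch batch = new BatchStateSketch();
        System.out.println(batch.done(FinalState.EXPIRED));   // state set by expiry
        System.out.println(batch.done(FinalState.SUCCEEDED)); // late response is ignored
        System.out.println(batch.state());
    }
}
```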

becketqin · 2017-09-14T03:50:16Z

The batches still need to be expired in order if max.in.flight.requests.per.connection is set to 1, so we probably still want to check whether the partition is muted. That said, if we guarantee that when RecordAccumulator.expiredBatches() returns a non-empty list, all the earlier batches have already been expired, we can remove the muted check here.

BTW, I did not see the logic of expiring an in-flight batch in the current patch. Am I missing something?

It's there now

@becketqin Added the muted check back

becketqin · 2017-09-14T03:50:53Z

isFull is no longer used.

Still not used.

apurvam · 2017-09-14T23:22:39Z

Heads up @sutambe , the following PR just got merged, and may have some conflicts with the current patch : #3743

There shouldn't be any impact on the logic, however.

ijuma · 2017-09-18T16:41:36Z

Friendly reminder that the feature freeze is this Wednesday.

becketqin · 2017-09-18T18:43:40Z

@ijuma Just want to check. Do you think this feature is a "minor" feature?

ijuma · 2017-09-19T04:31:01Z

@becketqin, it is possible to classify this as a minor feature, but the fact that it affects a core part of the Producer puts it in a bit of a grey area. If the PR is almost ready and we miss the feature freeze, my take is that it would be OK to merge it by the end of this week. Later than that and it seems a bit risky.

It's a bit worrying that the merge conflicts haven't been resolved since last week.

sutambe · 2017-09-19T15:52:00Z

@ijuma @becketqin I have a new PR, but after a rebase I have to fix one more test. Working on that now.

ijuma · 2017-09-19T15:58:56Z

Thanks @sutambe!

sutambe · 2017-09-19T16:44:02Z

@apurvam It's not clear to me why testExpiryOfFirstBatchShouldCauseResetIfFutureBatchesFail and testExpiryOfFirstBatchShouldNotCauseUnresolvedSequencesIfFutureBatchesSucceed are failing. It looks like a batch that should be in IncompleteBatches isn't there. Any thoughts?

tedyu · 2017-09-19T17:53:22Z

This variable can be dropped.

tedyu · 2017-09-19T21:08:10Z

I added the following to testExpiryOfFirstBatchShouldCauseResetIfFutureBatchesFail before the first sender.run() call

        Sender sender = new Sender(logContext, client, metadata, this.accumulator, true, MAX_REQUEST_SIZE, ACKS_ALL, 10,
            new Metrics(), new SenderMetricsRegistry(), time, REQUEST_TIMEOUT, DELIVERY_TIMEOUT, 50, transactionManager, apiVersions);

The test still fails.

apurvam · 2017-09-19T23:10:16Z

@sutambe where are those tests failing? The latest PR builder suggests that the clients and core tests all passed.

sutambe · 2017-09-19T23:16:08Z

@apurvam @ijuma @becketqin The Sender and RecordAccumulator are passing now. The failing tests are connect tests that are irrelevant.

apurvam · 2017-09-19T23:21:58Z

@sutambe I don't think the test failures are irrelevant since the same 3 tests failed in all the runs. Further, the cause of the failure is:

java.lang.AssertionError: 
  Unexpected method call Listener.onFailure(job-0, org.apache.kafka.common.KafkaException: Failed to construct kafka producer):

I think their mocks may need to be updated to account for the new configs and the attendant ConfigExceptions.

tedyu · 2017-09-19T23:24:22Z

in not -> is not

tedyu · 2017-09-19T23:27:37Z

The check 'if (deliveryTimeoutMs <= (now - this.createdMs))' inside maybeExpire() would be true.
Looks like another method can be created inside ProducerBatch which expires the batch.

maybeExpire has a side-effect of setting errorMessage internally. Hence calling it again in if.

Understood.
That part can be refactored; the goal is to reduce unnecessary comparisons.
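
As a side note on the comparison quoted above, the form deliveryTimeoutMs <= (now - createdMs) has the useful property that it does not overflow when the timeout is very large, whereas adding the timeout to the creation time does. The sketch below illustrates this; the class and method names are hypothetical.

```java
/** Hypothetical illustration of two ways to write the expiry comparison. */
final class ExpiryCheckSketch {

    /** Overflow-safe: subtracts two timestamps instead of adding the timeout to one. */
    static boolean hasExpired(long createdMs, long deliveryTimeoutMs, long now) {
        return deliveryTimeoutMs <= now - createdMs;
    }

    /** Naive variant: createdMs + deliveryTimeoutMs wraps to a negative value for huge timeouts. */
    static boolean hasExpiredNaive(long createdMs, long deliveryTimeoutMs, long now) {
        return createdMs + deliveryTimeoutMs <= now;
    }

    public static void main(String[] args) {
        long createdMs = 1_000L;
        long now = 2_000L;
        System.out.println(hasExpired(createdMs, 500L, now));
        System.out.println(hasExpired(createdMs, Long.MAX_VALUE, now));
        System.out.println(hasExpiredNaive(createdMs, Long.MAX_VALUE, now));
    }
}
```

With a timeout of Long.MAX_VALUE the safe form reports the batch as not expired, while the naive form wrongly reports it as expired.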

@apurvam Those tests don't even compile or run on my machine. What's up with those tests?

They can't construct a kafka producer with the changes made in this PR.

becketqin

@sutambe Thanks for updating the patch. Made a pass on the non-test file and left some comments. Will review the tests tomorrow. We may need to have some quick turnaround to get this into 1.0.0.

becketqin · 2017-09-20T04:01:43Z

Is this comment accurate? The new state is not necessarily SUCCEEDED.

becketqin · 2017-09-20T04:04:58Z

Maybe it's not a big deal but just want to call out that this is a behavior change. Currently the producer will throw exception when transition from FAILED state to another state due to some reason other than expiration. If we change this logic, we may miss those cases which are not failed by expiration but still got state update twice. It may not be that important if we do not have programming bugs.

Personally I think it is better to clearly define the states of the batches even if additional complexity is necessary.

The comments should probably also cover the force close case for completeness.

becketqin · 2017-09-20T04:33:40Z

Some typos in this comment. "Expire the batch if no outcome is known within delivery.timeout.ms"

becketqin · 2017-09-20T04:44:01Z

Does this have to be a per partition Map? Intuitively we just need a TreeSet<ProducerBatch> with a comparator?

Apparently my understanding of TreeSet was not accurate. It uses the comparator to decide whether two entries are the same. We can use a TreeMap<Long, Set> then. We may also want to bucket the timestamps a little, say by one second, to avoid creating a huge number of Sets, one per millisecond, in the TreeMap.

I was thinking about this too. Using millisecond as unit for Map key is not prudent.

After the switch to second as unit, we may need to check the two adjacent buckets keyed by ts-1 (sec) and ts+1 (sec).

As we discussed, TreeSet does not cut it. The naming is consistent. A TreeSet is a set. It's just that equality criterion is different.
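
The TreeSet behavior discussed above, and the bucketed TreeMap alternative, can be sketched with the standard library alone. The Batch class, the one-second bucket size, and all method names are hypothetical stand-ins chosen for illustration.

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InFlightIndexSketch {
    /** Hypothetical stand-in for a batch: only an id and a creation time. */
    static final class Batch {
        final int id;
        final long createdMs;

        Batch(int id, long createdMs) {
            this.id = id;
            this.createdMs = createdMs;
        }
    }

    /** A TreeSet ordered only by createdMs silently drops a second batch with the same timestamp. */
    static int treeSetSizeWithEqualTimestamps() {
        TreeSet<Batch> set = new TreeSet<>(Comparator.comparingLong((Batch b) -> b.createdMs));
        set.add(new Batch(1, 1_000L));
        set.add(new Batch(2, 1_000L)); // compare() returns 0, so this is treated as a duplicate
        return set.size();
    }

    static final long BUCKET_MS = 1_000L; // one-second buckets, as suggested in the thread

    private final TreeMap<Long, Set<Batch>> buckets = new TreeMap<>();

    /** Files the batch under its creation time rounded down to the bucket size. */
    void add(Batch batch) {
        buckets.computeIfAbsent(batch.createdMs / BUCKET_MS, k -> new HashSet<>()).add(batch);
    }

    int bucketCount() {
        return buckets.size();
    }

    int batchCount() {
        int n = 0;
        for (Set<Batch> bucket : buckets.values()) {
            n += bucket.size();
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("TreeSet size: " + treeSetSizeWithEqualTimestamps());
        InFlightIndexSketch index = new InFlightIndexSketch();
        index.add(new Batch(1, 1_000L));
        index.add(new Batch(2, 1_500L));
        index.add(new Batch(3, 2_100L));
        System.out.println("buckets=" + index.bucketCount() + " batches=" + index.batchCount());
    }
}
```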

becketqin · 2017-09-20T04:45:39Z

Assuming inFlightBatches is a TreeSet as suggested above, this code can be simplified to:

while (!inFlightBatches.isEmpty() && inFlightBatches.first().maybeExpire(deliveryTimeoutMs, now)) { expiredBatches.add(inFlightBatches.pollFirst()); }
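
For context, here is a self-contained sketch of that loop. It assumes a hypothetical Batch class and a comparator that breaks timestamp ties by id, so that two batches created in the same millisecond are both retained; neither detail comes from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class ExpireLoopSketch {
    /** Hypothetical stand-in for a batch. */
    static final class Batch {
        final int id;
        final long createdMs;

        Batch(int id, long createdMs) {
            this.id = id;
            this.createdMs = createdMs;
        }

        boolean maybeExpire(long deliveryTimeoutMs, long now) {
            return deliveryTimeoutMs <= now - createdMs;
        }
    }

    /** Oldest batch first; ties on createdMs are broken by id so no batch is dropped. */
    static TreeSet<Batch> newInFlightSet() {
        return new TreeSet<>(
            Comparator.comparingLong((Batch b) -> b.createdMs).thenComparingInt(b -> b.id));
    }

    /** Removes and returns every batch at the head of the set that has expired. */
    static List<Batch> expire(TreeSet<Batch> inFlightBatches, long deliveryTimeoutMs, long now) {
        List<Batch> expiredBatches = new ArrayList<>();
        while (!inFlightBatches.isEmpty()
                && inFlightBatches.first().maybeExpire(deliveryTimeoutMs, now)) {
            expiredBatches.add(inFlightBatches.pollFirst());
        }
        return expiredBatches;
    }

    public static void main(String[] args) {
        TreeSet<Batch> inFlight = newInFlightSet();
        inFlight.add(new Batch(1, 0L));
        inFlight.add(new Batch(2, 100L));
        inFlight.add(new Batch(3, 900L));
        List<Batch> expired = expire(inFlight, 500L, 700L);
        System.out.println("expired=" + expired.size() + " remaining=" + inFlight.size());
    }
}
```

Because the set is ordered oldest first, the loop can stop at the first batch that has not expired.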

becketqin · 2017-09-20T04:50:43Z

This logic would become inFlightRequests.remove(batch) when a TreeSet is used for this.

becketqin · 2017-09-20T04:52:44Z

This would be just inFlightBatches.add(batch)

becketqin · 2017-09-20T04:54:13Z

We usually just use earliestDeliveryTimeout in Kafka.

becketqin · 2017-09-20T04:58:47Z

It seems we don't need the deliveryTimeoutMs in the sender. It is only used as an argument passed to the accumulator. But the accumulator already has the config.

becketqin · 2017-09-20T05:06:51Z

This seems to be an existing issue. When we expire the batches here, the memory of those batches is deallocated. It seems that we will deallocate the same batch again when the ProduceResponse returns?

apurvam · 2017-09-21T01:24:08Z

@sutambe I had a look at the failing Sender expiry tests. What is happening is that the tests have not been modified to account for the fact that in-flight batches can now be expired. In the tests, we used to expire a batch sitting in the accumulator but not the in-flight batch. When the in-flight batch returned, it would be requeued.

But now, the test sends the response for the in-flight batch, and when it goes to requeue it, it discovers that there shouldn't be an in-flight request and raises an exception.

The tests should be updated to account for the new behavior and make sure that the inflight batch is not expired.

apurvam · 2017-09-21T01:26:25Z

Actually, the test reveals a bug in the current patch: the response for the inflight batch which expired is not being handled correctly. We should not be trying to requeue it to start with.

So we need two tests: one where the inflight batch is not expired, and the current case. The reenqueue logic in the sender needs to be updated to not reenqueue the expired batches.
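
The fix described here amounts to a guard in the retry path. A minimal sketch, assuming a hypothetical Batch class that records whether it has been expired and a hypothetical shouldReenqueue helper; the real Sender logic involves more conditions than shown.

```java
final class ReenqueueDecisionSketch {
    /** Hypothetical batch stand-in: tracks only whether it has already been expired. */
    static final class Batch {
        private boolean expired;

        void expire() {
            expired = true;
        }

        boolean isExpired() {
            return expired;
        }
    }

    /**
     * Decides whether a batch whose response carried a retriable error goes back to the
     * accumulator. An already expired batch has been failed, so it is never retried.
     */
    static boolean shouldReenqueue(Batch batch, boolean retriableError) {
        return retriableError && !batch.isExpired();
    }

    public static void main(String[] args) {
        Batch live = new Batch();
        Batch expired = new Batch();
        expired.expire();
        System.out.println(shouldReenqueue(live, true));
        System.out.println(shouldReenqueue(expired, true));
    }
}
```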

becketqin

@sutambe Thanks for updating the patch. A few comments:

For a batch that expired prematurely, we should not re-enqueue the batch (as @apurvam noticed), and we should not deallocate its memory twice.
There are a few earlier review comments that are not addressed yet (such as unused local variables).
We may want to revisit some of the tests and see if they still make sense.
It would be good to add more unit tests to the patch. More specifically, we may want to have the following tests:

Test that a batch is correctly inserted into the in.flight.batches if needed, and not inserted if not needed.
Test that the callback of an expired batch is fired in time, whether it is in-flight or not.
Test that when we expire an in-flight batch, we still wait for the request to finish before sending the next batch.
Test that we do not retry an already expired batch.
Test that when a batch is expired prematurely, the buffer pool memory is only deallocated after the response is returned (because we are still holding the batch until the response is returned).

becketqin · 2017-09-21T01:50:36Z

This test has nothing to do with linger.ms anymore...

We should change the test name to something like testBatchExpiration, and the test below to testBatchExpirationAfterReenqueue.

becketqin · 2017-09-21T01:51:43Z

Similar to above we should rename this.

becketqin · 2017-09-21T01:52:35Z

typo: timeout

The typo is still there.

becketqin · 2017-09-21T01:53:02Z

typo: timeout

becketqin · 2017-09-21T02:19:02Z

Should we still expire batches individually when they are due, instead of expiring the whole bucket? Having second-granularity buckets does not prevent us from doing that, right?

sutambe · 2017-12-20T23:47:57Z

@apurvam @becketqin I updated the implementation to use ConcurrentMap<TopicPartition, Deque<ProducerBatch>>. Please take a look. I don't see the above test failures on my machine.

becketqin

Thanks for updating the patch. Left some comments.

becketqin · 2018-01-04T06:27:34Z

Still not used.

becketqin · 2018-01-04T06:34:00Z

We don't need a PriorityQueue for this because the batches in the RecordAccumulator are already in order. So we just need to keep the draining order.

becketqin · 2018-01-04T06:39:32Z

If we always insert the batch to the inFlightBatches queue and there is no bug, the batch to be removed should always be the first batch. Can we assert on that?

becketqin · 2018-01-04T06:44:30Z

The original reason we had this optimization is that we used to have a big sorted data structure, so avoiding inserting elements into it made sense. Given that the batch order in the RecordAccumulator is now already guaranteed, it seems we can just put all the drained batches into the inFlightBatches queue, which is simpler.
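
The two suggestions above (always enqueue drained batches, and assert that a completed batch is the head of its queue) can be sketched as follows. The String partition key stands in for TopicPartition, the Batch class and method names are hypothetical, and the sketch is single-threaded even though it uses a ConcurrentMap like the one mentioned earlier in the thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class InFlightDequeSketch {
    /** Hypothetical stand-in for a batch. */
    static final class Batch {
        final String partition;
        final long createdMs;

        Batch(String partition, long createdMs) {
            this.partition = partition;
            this.createdMs = createdMs;
        }
    }

    private final ConcurrentMap<String, Deque<Batch>> inFlightBatches = new ConcurrentHashMap<>();

    /** Every drained batch is appended, preserving the accumulator's draining order. */
    void onDrained(Batch batch) {
        inFlightBatches.computeIfAbsent(batch.partition, p -> new ArrayDeque<>()).addLast(batch);
    }

    /** Removes a completed batch and asserts that it was the oldest one for its partition. */
    void onCompleted(Batch batch) {
        Deque<Batch> queue = inFlightBatches.get(batch.partition);
        if (queue == null || queue.peekFirst() != batch) {
            throw new IllegalStateException("Completed batch is not the head of its in-flight queue");
        }
        queue.pollFirst();
    }

    int inFlightCount(String partition) {
        Deque<Batch> queue = inFlightBatches.get(partition);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) {
        InFlightDequeSketch sketch = new InFlightDequeSketch();
        Batch first = new Batch("t-0", 1L);
        Batch second = new Batch("t-0", 2L);
        sketch.onDrained(first);
        sketch.onDrained(second);
        sketch.onCompleted(first);
        System.out.println("in flight for t-0: " + sketch.inFlightCount("t-0"));
    }
}
```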

becketqin · 2018-01-04T07:41:33Z

The while loop may break once the request size limit has been reached, so there is no guarantee that it will iterate over all the partitions. One alternative is to find the nextBatchExpiryTimeMs in the expireBatches.
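
One way to read this suggestion: compute the next expiry during the expiry pass itself, since that pass visits every partition. The sketch below is an assumption-laden illustration: batches are reduced to their creation timestamps, each per-partition deque is assumed to be ordered oldest first, and the method returns the remaining wait rather than an absolute time so that a very large timeout cannot overflow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NextExpirySketch {

    /**
     * Moves expired creation timestamps into expiredOut and returns how long to wait until
     * the next batch expires, or Long.MAX_VALUE if nothing is left.
     */
    static long expireAndGetNextWaitMs(Map<String, Deque<Long>> createdMsByPartition,
                                       long deliveryTimeoutMs, long now, List<Long> expiredOut) {
        long nextWaitMs = Long.MAX_VALUE;
        for (Deque<Long> queue : createdMsByPartition.values()) {
            // Expire from the head while the oldest batch is past its delivery timeout.
            while (!queue.isEmpty() && deliveryTimeoutMs <= now - queue.peekFirst()) {
                expiredOut.add(queue.pollFirst());
            }
            // The surviving head, if any, is the next batch to expire in this partition.
            if (!queue.isEmpty()) {
                long remainingMs = deliveryTimeoutMs - (now - queue.peekFirst());
                nextWaitMs = Math.min(nextWaitMs, remainingMs);
            }
        }
        return nextWaitMs;
    }

    public static void main(String[] args) {
        Map<String, Deque<Long>> byPartition = new LinkedHashMap<>();
        byPartition.put("a", new ArrayDeque<>(Arrays.asList(0L, 400L)));
        byPartition.put("b", new ArrayDeque<>(Arrays.asList(300L)));
        List<Long> expired = new ArrayList<>();
        long waitMs = expireAndGetNextWaitMs(byPartition, 500L, 600L, expired);
        System.out.println("expired=" + expired + " nextWaitMs=" + waitMs);
    }
}
```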

becketqin · 2018-01-04T07:50:21Z

It seems intuitively this should be the earliest batch in the entire record accumulator?

becketqin · 2018-01-04T08:36:38Z

It seems we may release the memory for the expired batches before the response is returned. This means the underlying ByteBuffer is still referenced by the ProducerBatch instance in the inFlightRequests. I am not sure if this would cause any problem, but it seems a little dangerous.

becketqin · 2018-01-04T08:50:18Z

Is the response preparation needed in this case?

apurvam · 2018-01-16T21:42:58Z

retest this please

apurvam · 2018-01-17T19:32:17Z

retest this please

apurvam · 2018-01-17T23:08:17Z

So the org.apache.kafka.clients.producer.internals.SenderTest.testMetadataTopicExpiry test has failed twice in a row with:

java.lang.ArrayIndexOutOfBoundsException
	at java.base/java.util.zip.CRC32C.update(CRC32C.java:151)
	at org.apache.kafka.common.utils.Checksums.update(Checksums.java:42)
	at org.apache.kafka.common.utils.Crc32C.compute(Crc32C.java:72)
	at org.apache.kafka.common.record.DefaultRecordBatch.writeHeader(DefaultRecordBatch.java:468)
	at org.apache.kafka.common.record.MemoryRecordsBuilder.writeDefaultBatchHeader(MemoryRecordsBuilder.java:357)
	at org.apache.kafka.common.record.MemoryRecordsBuilder.close(MemoryRecordsBuilder.java:311)
	at org.apache.kafka.clients.producer.internals.ProducerBatch.close(ProducerBatch.java:427)
	at org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:614)
	at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:270)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
	at org.apache.kafka.clients.producer.internals.SenderTest.testMetadataTopicExpiry(SenderTest.java:473)

Given that these changes are on the same code, and given the consistent failure of this test, it is probably a regression. @sutambe can you reproduce the failure locally?

apurvam · 2018-01-17T23:08:58Z

Just looking at the stack trace and the test, it may be that an expired batch is being closed twice in some cases.

hachikuji · 2018-02-24T22:43:34Z

@sutambe @becketqin It would be nice to unblock this. Can someone else pick up the work?

becketqin · 2018-03-05T18:02:09Z

@hachikuji Yeah, this has been pending for too long. I have spoken to @sutambe and he said he still wants to finish the patch. He will figure out the ETA and see if that works.

bbejeck · 2018-05-01T15:56:47Z

@sutambe @becketqin is there any update on the status of this PR? It would be great if we could get this in the next release.

Ishiihara · 2018-05-09T01:18:58Z

@guozhangwang @junrao @bbejeck @becketqin We also hit this issue when running Kafka Streams library with some high volume output topics. It would be nice to get this moving and push it to the next release.

radai-rosenblatt · 2018-05-09T01:35:41Z

Becket can't load this page for some reason (some weird issue with his GitHub profile?).
We are OK with you taking over this patch.

sutambe · 2018-05-09T03:28:23Z

@bbejeck @apurvam @becketqin @hachikuji I don't have any update since December of last year. Sorry, the work has stalled and it has been very hard to find cycles for this effort. I don't mind if Confluent wants to take this effort forward. Better late than never.

Ishiihara · 2018-05-09T06:41:26Z

cc @abbccdda @yuyang08

yuyang08 · 2018-05-31T17:49:33Z

@sutambe I made some changes based on your pull request to fix the style check and test failures. Do you mind if I amend this pull request with those changes? cc @becketqin @apurvam @hachikuji
https://github.com/yuyang08/kafka/commit/69fc79a91d0556408c8037649f1e03aa56206ef2

guozhangwang · 2018-05-31T22:22:49Z

@yuyang08 I'd suggest you create your own PR against Apache Kafka trunk and let other reviewers continue reviewing that one.

yuyang08 · 2018-06-01T05:18:09Z

@guozhangwang sure. will create a separate pull request

yuyang08 · 2018-06-21T23:12:09Z

@guozhangwang @apurvam @becketqin created new pr #5270 for KAFKA-5886

ijuma · 2019-02-18T19:07:09Z

This has been merged via a different PR, closing.

sutambe changed the title ~~Implement KIP-91 delivery.timeout.ms~~ KAFKA-5886: Implement KIP-91 delivery.timeout.ms · Sep 13, 2017

tedyu reviewed · Sep 13, 2017

becketqin reviewed · Sep 14, 2017

sutambe force-pushed the kip91 branch from 43fa462 to 8823134 · September 19, 2017 16:18

tedyu reviewed · Sep 19, 2017

sutambe force-pushed the kip91 branch 4 times, most recently from 00145bf to 9d8b7ea · September 19, 2017 20:27

sutambe force-pushed the kip91 branch from 9d8b7ea to 9ad558c · September 19, 2017 22:39

tedyu reviewed · Sep 19, 2017

becketqin reviewed · Sep 20, 2017

sutambe force-pushed the kip91 branch from 9ad558c to 26513c0 · September 20, 2017 23:51

becketqin reviewed · Sep 21, 2017

becketqin reviewed · Jan 4, 2018

Commit 588e26b: Rebasing KIP-91 delivery.timeout.ms for kafka 1.1.0

Avoiding overflow when deliveryTimeoutMs is MAX_VALUE; per-partition map for tracking soon-to-expire batches; updated tests.

sutambe force-pushed the kip91 branch from a5b06ed to 588e26b · May 9, 2018 05:17

yuyang08 mentioned this pull request · Jun 21, 2018

KAFKA-5886: Introduce delivery.timeout.ms producer config (KIP-91) #5270 (merged)

ijuma closed this · Feb 18, 2019
