KAFKA-19479: at_least_once mode in Kafka Streams silently drops messages when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees #20285

Merged: junrao merged 20 commits into apache:trunk from shashankhs11:KAFKA-19479-more-tests on Oct 16, 2025

@shashankhs11 shashankhs11 commented Jul 31, 2025

Contributor

Fixes a bug in the producer where flush() does not wait for a batch to complete after splitting.

See #20254 (comment) and KAFKA-19479 for more details.

github-actions Bot commented Aug 8, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the PR. Left a comment.

@github-actions github-actions Bot removed needs-attention triage PRs from the community labels Sep 30, 2025
@shashankhs11 shashankhs11 force-pushed the KAFKA-19479-more-tests branch from 467de10 to a51d19b Compare October 6, 2025 01:34

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. A few more comments.

@shashankhs11 shashankhs11 requested a review from junrao October 7, 2025 02:12

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. A few more comments.

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. A few more comments.

@shashankhs11 shashankhs11 requested a review from junrao October 12, 2025 15:17

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. A few more comments.

junrao commented Oct 15, 2025

Contributor

@shashankhs11 : Could you fix the build error? Also, could you run the same jmh test again to make sure there is no regression on the flush() time?

@shashankhs11

Contributor Author

JMH Test Report

This is how I ran the test:

  1. Created a new local branch based off trunk containing only the JMH test, and ran the test there.
  2. Ran the same test on this branch.
  3. Covered two cases in each run: numRecords = 5000 and numRecords = 10000.

Result BEFORE fix

[image: JMH output]

Case 1: numRecords = 5000 [image]

Case 2: numRecords = 10000 [image]

Result AFTER fix (current branch)

[image: JMH output]

Case 1: numRecords = 5000 [image]

Case 2: numRecords = 10000 [image]

@shashankhs11

Contributor Author

@junrao: Fixed the build error and added the JMH test report in a separate comment above. There seems to be a regression when numRecords = 10000. Would this be acceptable?

Also left a comment in reply to this - #20285 (comment)

junrao commented Oct 16, 2025

Contributor

@shashankhs11 : The small regression from 0.004 to 0.005 ms/op is negligible and is fine.

@shashankhs11 shashankhs11 force-pushed the KAFKA-19479-more-tests branch from 0e9c522 to 6a4745c Compare October 16, 2025 02:02
@shashankhs11

Contributor Author

CI check with Java 25 seemed to be failing. Rebased with latest trunk.

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. Just a minor comment. Also, could you rebase to pick up a flaky test fix #20713 ?

@shashankhs11 shashankhs11 force-pushed the KAFKA-19479-more-tests branch from 6a4745c to 896ecc1 Compare October 16, 2025 17:41
@shashankhs11

Contributor Author

> Also, could you rebase to pick up a flaky test fix

@junrao: Done :)

@junrao junrao left a comment

Contributor

@shashankhs11 : Thanks for the updated PR. LGTM. Could you update the description of the PR before merging?

@shashankhs11 shashankhs11 changed the title from "KAFKA-19479: at_least_once mode in Kafka Streams silently drops messages when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees" to "KAFKA-19479: Bug Fix in Producer where flush() does not wait for its dependents" on Oct 16, 2025
@shashankhs11

Contributor Author

> Could you update the description of the PR before merging?

@junrao: Done! Changed the title as well as the description.

Thank you very much for your time and patience, Jun. I am very very grateful for the opportunity to have worked closely with you over the past couple of weeks. I got to learn a lot from this :)

@junrao junrao changed the title from "KAFKA-19479: Bug Fix in Producer where flush() does not wait for its dependents" to "KAFKA-19479: at_least_once mode in Kafka Streams silently drops messages when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees" on Oct 16, 2025
@junrao junrao merged commit 6f4ce76 into apache:trunk Oct 16, 2025
24 of 25 checks passed

junrao commented Oct 16, 2025

Contributor

@shashankhs11 : I kept the PR name to match the jira name, but adjusted the description in the commit message. Thanks for working on this issue!

mjsax commented Oct 17, 2025

Member

Thanks for reviewing this PR @junrao!

mjsax pushed a commit that referenced this pull request Oct 17, 2025
…ges when the producer fails with MESSAGE_TOO_LARGE, violating delivery guarantees (#20285)

Bug Fix in Producer where flush() does not wait for a batch to complete after splitting.

Cf - #20254 (comment)
and [KAFKA-19479](https://issues.apache.org/jira/browse/KAFKA-19479) for
more details

Reviewers: Jun Rao <junrao@gmail.com>
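
The commit message above describes the bug as flush() not waiting for a batch to complete after splitting. As a rough illustration of that idea only, and not of the actual patch, the toy Java model below ties a parent batch's completion to the batches split from it. Every name here (`ToyBatch`, `split`, `ack`, `flush`) is hypothetical and is not part of the Kafka producer API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Toy model only: hypothetical classes, not Kafka's ProducerBatch or KafkaProducer.
class SplitAwareFlushSketch {

    static final class ToyBatch {
        final String name;
        // Completes when this batch is acknowledged or, if it was split,
        // when every batch split from it has completed.
        final CompletableFuture<Void> done = new CompletableFuture<>();

        ToyBatch(String name) {
            this.name = name;
        }

        // Simulates a successful acknowledgement for this batch.
        void ack() {
            done.complete(null);
        }

        // Simulates a "message too large" rejection: the batch is split, and its
        // own completion is chained to the completion of all resulting children.
        List<ToyBatch> split(int parts) {
            List<ToyBatch> children = new ArrayList<>();
            for (int i = 0; i < parts; i++) {
                children.add(new ToyBatch(name + "-" + i));
            }
            CompletableFuture<?>[] pending =
                children.stream().map(c -> c.done).toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(pending).whenComplete((ignored, error) -> {
                if (error != null) {
                    done.completeExceptionally(error);
                } else {
                    done.complete(null);
                }
            });
            return children;
        }
    }

    // Blocks until every given batch is done, which includes split-off children
    // because a split parent only completes after all of its children.
    static void flush(List<ToyBatch> inFlight) {
        CompletableFuture<?>[] pending =
            inFlight.stream().map(b -> b.done).toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(pending).join();
    }

    public static void main(String[] args) {
        ToyBatch parent = new ToyBatch("batch");
        List<ToyBatch> children = parent.split(2);

        children.get(0).ack();
        System.out.println(parent.done.isDone()); // false: one child is still pending

        children.get(1).ack();
        flush(List.of(parent));                    // returns at once: all children are done
        System.out.println(parent.done.isDone()); // true
    }
}
```

In this sketch the parent only counts as complete once both children are acknowledged, so a flush() that waits on the parent cannot return while a split-off child is still pending.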
mjsax pushed a commit that referenced this pull request Oct 17, 2025
mjsax pushed a commit that referenced this pull request Oct 17, 2025

mjsax commented Oct 17, 2025

Member

Cherry-picked to 4.1, 4.0, and 3.9 branches.

eduwercamacaro pushed a commit to littlehorse-enterprises/kafka that referenced this pull request Nov 12, 2025
TaiJuWu pushed a commit to TaiJuWu/kafka that referenced this pull request Dec 3, 2025
shashankhs11 added a commit to shashankhs11/kafka that referenced this pull request Dec 15, 2025
3 participants