Skip to content

KAFKA-18019: Make INVALID_PRODUCER_ID_MAPPING a fatal error#17822

Merged
jolshan merged 4 commits into
apache:trunkfrom
rreddy-22:KAFKA-18019-InvalidPidError_Abortable_to_Fatal
Nov 18, 2024
Merged

KAFKA-18019: Make INVALID_PRODUCER_ID_MAPPING a fatal error#17822
jolshan merged 4 commits into
apache:trunkfrom
rreddy-22:KAFKA-18019-InvalidPidError_Abortable_to_Fatal

Conversation

@rreddy-22

@rreddy-22 rreddy-22 commented Nov 14, 2024

Copy link
Copy Markdown
Contributor

This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error.
Quoted from KIP-890
Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance.

With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example:

Consider an application that is copying data from one partition to another

Application instance A processes to offset 4
Application instance B comes up and fences application instance A
Application instance B processes to offset 5
Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server
Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing
Thus, INVALID_PID_MAPPING should be fatal to the producer.

This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision.

@jolshan jolshan self-requested a review November 14, 2024 23:24
Comment thread core/src/test/scala/integration/kafka/api/TransactionsExpirationTest.scala Outdated

@CalvinLiu7947 CalvinLiu7947 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Left some minor comments

enqueueRequest(handler);

// If an epoch bump is required for recovery, initialize the transaction after completing the EndTxn request.
if (epochBumpRequired) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check. The long-term goal is to only call initProducerId request when initializing the producer and we will not call it again when the producer is running. I guess it will be a separate change?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be a separate change, because for for old protocol we still need to do the bump explicitly on the client. For the new protocol, we won't need to to bump from the client.

Comment thread core/src/test/scala/integration/kafka/api/TransactionsTest.scala Outdated

@artemlivshits artemlivshits left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

enqueueRequest(handler);

// If an epoch bump is required for recovery, initialize the transaction after completing the EndTxn request.
if (epochBumpRequired) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be a separate change, because for for old protocol we still need to do the bump explicitly on the client. For the new protocol, we won't need to to bump from the client.

@jolshan jolshan left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jolshan jolshan merged commit e4c0034 into apache:trunk Nov 18, 2024
chiacyu pushed a commit to chiacyu/kafka that referenced this pull request Nov 30, 2024
…7822)

This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error.
Quoted from KIP-890
Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance.

With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example:

Consider an application that is copying data from one partition to another

Application instance A processes to offset 4
Application instance B comes up and fences application instance A
Application instance B processes to offset 5
Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server
Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing
Thus, INVALID_PID_MAPPING should be fatal to the producer.

This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision.

This PR also fixes the flakiness of TransactionsExpirationTest.

Reviewers:  Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>, Calvin Liu <caliu@confluent.io>
tedyu pushed a commit to tedyu/kafka that referenced this pull request Jan 6, 2025
…7822)

This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error.
Quoted from KIP-890
Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance.

With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example:

Consider an application that is copying data from one partition to another

Application instance A processes to offset 4
Application instance B comes up and fences application instance A
Application instance B processes to offset 5
Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server
Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing
Thus, INVALID_PID_MAPPING should be fatal to the producer.

This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision.

This PR also fixes the flakiness of TransactionsExpirationTest.

Reviewers:  Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>, Calvin Liu <caliu@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants