KAFKA-18019: Make INVALID_PRODUCER_ID_MAPPING a fatal error#17822
Conversation
CalvinLiu7947
left a comment
There was a problem hiding this comment.
Thanks for the PR! Left some minor comments
| enqueueRequest(handler); | ||
|
|
||
| // If an epoch bump is required for recovery, initialize the transaction after completing the EndTxn request. | ||
| if (epochBumpRequired) { |
There was a problem hiding this comment.
Just to double check. The long-term goal is to only call initProducerId request when initializing the producer and we will not call it again when the producer is running. I guess it will be a separate change?
There was a problem hiding this comment.
It'll be a separate change, because for for old protocol we still need to do the bump explicitly on the client. For the new protocol, we won't need to to bump from the client.
| enqueueRequest(handler); | ||
|
|
||
| // If an epoch bump is required for recovery, initialize the transaction after completing the EndTxn request. | ||
| if (epochBumpRequired) { |
There was a problem hiding this comment.
It'll be a separate change, because for for old protocol we still need to do the bump explicitly on the client. For the new protocol, we won't need to to bump from the client.
…7822) This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error. Quoted from KIP-890 Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance. With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example: Consider an application that is copying data from one partition to another Application instance A processes to offset 4 Application instance B comes up and fences application instance A Application instance B processes to offset 5 Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing Thus, INVALID_PID_MAPPING should be fatal to the producer. This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision. This PR also fixes the flakiness of TransactionsExpirationTest. Reviewers: Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>, Calvin Liu <caliu@confluent.io>
…7822) This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error. Quoted from KIP-890 Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance. With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example: Consider an application that is copying data from one partition to another Application instance A processes to offset 4 Application instance B comes up and fences application instance A Application instance B processes to offset 5 Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing Thus, INVALID_PID_MAPPING should be fatal to the producer. This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision. This PR also fixes the flakiness of TransactionsExpirationTest. Reviewers: Artem Livshits <alivshits@confluent.io>, Justine Olshan <jolshan@confluent.io>, Calvin Liu <caliu@confluent.io>
This patch contains changes to the handling of the INVALID_PRODUCER_ID_MAPPING error.
Quoted from KIP-890
Since we bump epoch on abort, we no longer need to call InitProducerId to fence requests. InitProducerId will only be called when the producer starts up to fence a previous instance.
With this change, some other calls to InitProducerId were inspected including the call after receiving an InvalidPidMappingException. This exception was changed to abortable as part of KIP-360: Improve reliability of idempotent/transactional producer. However, this change means that we can violate EOS guarantees. As an example:
Consider an application that is copying data from one partition to another
Application instance A processes to offset 4
Application instance B comes up and fences application instance A
Application instance B processes to offset 5
Application instances A and B are idle for transaction.id.expiration.ms, transaction id expires on server
Application instance A attempts to process offset 5 (since in its view, that is next) -- if we recover from invalid pid mapping, we can duplicate this processing
Thus, INVALID_PID_MAPPING should be fatal to the producer.
This is consistent with KIP-1050: Consistent error handling for Transactions where errors that are fatal to the producer are in the "application recoverable" category. This is a grouping that indicates to the client that the producer needs to restart and recovery on the application side is necessary. KIP-1050 is approved so we are consistent with that decision.