fix(replica_cluster): resolve race condition in designated primary transition #9601
Merged
gbartolini merged 1 commit into main on Jan 19, 2026
Contributor
❗ By default, the pull request is configured to backport to all release branches.
Member, Author
/test
Contributor
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20601810333
Member, Author
/test
Contributor
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20604778441
armru approved these changes on Jan 8, 2026
eda3325 to 7b11de8
Collaborator
/test
Contributor
@jbattiato, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20992783571
gbartolini approved these changes on Jan 19, 2026
…ansition
When a replica cluster switch is initiated, a race condition could occur where the instance manager fails to set the designated primary transition completion condition after an optimistic lock conflict, causing the operator to wait indefinitely.
The root cause was in the RequiresDesignatedPrimaryTransition sentinel calculation, which used IsPrimary() to check for the absence of standby.signal. After RefreshReplicaConfiguration() creates standby.signal during the first reconciliation loop, IsPrimary() returns false, making the sentinel false and causing subsequent loops to return early without retrying the status update.
This fix changes the sentinel to use CurrentPrimary status instead of IsPrimary(), keeping it true throughout the transition and allowing retries when status updates fail. Additionally, the RetryOnConflict wrapper is removed since the reconciliation loop itself provides the retry mechanism, simplifying the code and making all conflicts follow the same clear path.
Closes #9591
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
cnpg-bot pushed a commit that referenced this pull request on Jan 19, 2026
…ansition (#9601) (cherry picked from commit a73b322)
cnpg-bot pushed a commit that referenced this pull request on Jan 19, 2026
…ansition (#9601) (cherry picked from commit a73b322)
cnpg-bot pushed a commit that referenced this pull request on Jan 19, 2026
…ansition (#9601) (cherry picked from commit a73b322)
mnencia added a commit that referenced this pull request on Jan 20, 2026
…ansition (#9601) (cherry picked from commit a73b322)
When a replica cluster switch is initiated, a race condition could occur where the instance manager fails to set the designated primary transition completion condition after an optimistic lock conflict, causing the operator to wait indefinitely.
The root cause was in the RequiresDesignatedPrimaryTransition sentinel calculation, which used IsPrimary() to check for the absence of standby.signal. After RefreshReplicaConfiguration() creates standby.signal during the first reconciliation loop, IsPrimary() returns false, making the sentinel false and causing subsequent loops to return early without retrying the status update.
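The failure sequence can be reproduced in a small, self-contained model. This is only a sketch: the `instance` type, `reconcileOld`, and `simulateOld` are hypothetical stand-ins invented for illustration and are not the operator's actual code. Only the method names `IsPrimary` and `RefreshReplicaConfiguration` are taken from the description above.

```go
package main

// instance is a hypothetical stand-in for the instance manager's local state.
type instance struct {
	hasStandbySignal bool // true once standby.signal exists
}

// IsPrimary models the check described above: primary means no standby.signal.
func (i *instance) IsPrimary() bool { return !i.hasStandbySignal }

// RefreshReplicaConfiguration models the demotion step that creates standby.signal.
func (i *instance) RefreshReplicaConfiguration() { i.hasStandbySignal = true }

// reconcileOld models one reconciliation loop using the old sentinel.
// updateStatus stands in for the status update that sets the completion
// condition; it returns false on an optimistic lock conflict.
func reconcileOld(inst *instance, updateStatus func() bool) bool {
	requiresTransition := inst.IsPrimary() // old sentinel
	if !requiresTransition {
		return false // early return: the status update is never retried
	}
	inst.RefreshReplicaConfiguration()
	return updateStatus()
}

// simulateOld runs up to `loops` reconciliations in which only the first
// status update conflicts. It reports whether the condition was ever set.
func simulateOld(loops int) bool {
	inst := &instance{}
	attempts := 0
	update := func() bool {
		attempts++
		return attempts > 1 // the first attempt hits a conflict
	}
	for n := 0; n < loops; n++ {
		if reconcileOld(inst, update) {
			return true
		}
	}
	return false
}
```

In this model, `simulateOld` returns false no matter how many loops run: the first loop creates standby.signal and then loses the status update to the conflict, and every later loop exits early because the sentinel has already flipped.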
This fix changes the sentinel to use CurrentPrimary status instead of IsPrimary(), keeping it true throughout the transition and allowing retries when status updates fail. Additionally, the RetryOnConflict wrapper is removed since the reconciliation loop itself provides the retry mechanism, simplifying the code and making all conflicts follow the same clear path.
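A minimal sketch of the corrected behaviour follows, under stated assumptions: the `clusterStatus` fields `TransitionInProgress` and `ConditionSet`, the `reconcileNew` and `simulateNew` functions, and the pod name are all hypothetical and chosen only for illustration. The real sentinel expression in the operator may differ; the point here is that a sentinel derived from cluster status does not change when standby.signal is created, so a failed status update is retried by the next loop.

```go
package main

import "errors"

// errConflict models an optimistic lock conflict on the status update.
var errConflict = errors.New("optimistic lock conflict")

// clusterStatus is a hypothetical, reduced view of the cluster status.
type clusterStatus struct {
	CurrentPrimary       string // pod currently recorded as primary
	TransitionInProgress bool   // assumed flag: a replica cluster switch is ongoing
	ConditionSet         bool   // completion condition observed by the operator
}

// instance is a hypothetical stand-in for the instance manager's local state.
type instance struct {
	podName          string
	hasStandbySignal bool
}

// reconcileNew models one reconciliation loop with a status-based sentinel.
// There is no inner retry wrapper: an error simply ends this loop, and the
// next loop repeats the attempt.
func reconcileNew(inst *instance, st *clusterStatus, updateStatus func() error) error {
	requiresTransition := st.TransitionInProgress && st.CurrentPrimary == inst.podName
	if !requiresTransition {
		return nil
	}
	inst.hasStandbySignal = true // idempotent: safe to repeat on every loop
	if err := updateStatus(); err != nil {
		return err // the sentinel stays true, so the next loop retries
	}
	st.ConditionSet = true
	return nil
}

// simulateNew makes the first `conflicts` status updates fail and runs at
// most `loops` reconciliations, stopping once the condition is set.
func simulateNew(conflicts, loops int) (bool, int) {
	inst := &instance{podName: "cluster-1"}
	st := &clusterStatus{CurrentPrimary: "cluster-1", TransitionInProgress: true}
	attempts := 0
	update := func() error {
		attempts++
		if attempts <= conflicts {
			return errConflict
		}
		return nil
	}
	for n := 0; n < loops && !st.ConditionSet; n++ {
		_ = reconcileNew(inst, st, update) // an error means: requeue and try again
	}
	return st.ConditionSet, attempts
}
```

With one simulated conflict, `simulateNew(1, 5)` sets the condition on the second loop. Because every conflict takes the same path (return the error and let the loop run again), there is a single retry mechanism to reason about.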
Closes #9591