fix(replica_cluster): resolve race condition in designated primary transition by mnencia · Pull Request #9601 · cloudnative-pg/cloudnative-pg

mnencia · 2025-12-30T17:00:27Z

When a replica cluster switch is initiated, a race condition could occur where the instance manager fails to set the designated primary transition completion condition after an optimistic lock conflict, causing the operator to wait indefinitely.

The root cause was in the RequiresDesignatedPrimaryTransition sentinel calculation, which used IsPrimary() to check for the absence of standby.signal. After RefreshReplicaConfiguration() creates standby.signal during the first reconciliation loop, IsPrimary() returns false, making the sentinel false and causing subsequent loops to return early without retrying the status update.
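The failure mode described above can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: the `instance` and `cluster` types, `reconcileBuggy`, `runBuggy`, and `errConflict` are hypothetical stand-ins, not the operator's real code, and the conflict is simulated with a counter rather than a real API server.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes optimistic lock conflict (hypothetical).
var errConflict = errors.New("conflict: object has been modified")

// instance is a hypothetical model of the local PostgreSQL instance.
type instance struct {
	standbySignal bool // true once standby.signal exists on disk
}

// isPrimary mimics a check based on the absence of standby.signal.
func (i *instance) isPrimary() bool { return !i.standbySignal }

// cluster is a hypothetical model of the status shared with the operator.
type cluster struct {
	transitionCompleted bool
	pendingConflicts    int // status updates that fail before one succeeds
}

// markTransitionCompleted simulates the status update that may hit a conflict.
func (c *cluster) markTransitionCompleted() error {
	if c.pendingConflicts > 0 {
		c.pendingConflicts--
		return errConflict
	}
	c.transitionCompleted = true
	return nil
}

// reconcileBuggy models one loop with the sentinel derived from isPrimary().
func reconcileBuggy(inst *instance, c *cluster) error {
	requiresTransition := inst.isPrimary()
	if !requiresTransition {
		return nil // early return: the failed status update is never retried
	}
	inst.standbySignal = true // models RefreshReplicaConfiguration()
	return c.markTransitionCompleted()
}

// runBuggy runs several loops and reports whether the condition was ever set.
func runBuggy(conflicts, loops int) bool {
	inst, c := &instance{}, &cluster{pendingConflicts: conflicts}
	for n := 0; n < loops; n++ {
		_ = reconcileBuggy(inst, c)
	}
	return c.transitionCompleted
}

func main() {
	fmt.Println("no conflict:", runBuggy(0, 5))  // prints "no conflict: true"
	fmt.Println("one conflict:", runBuggy(1, 5)) // prints "one conflict: false"
}
```

In this model a single conflict on the first loop is enough to leave the condition unset forever, because the only state the sentinel looks at has already been changed by that same loop.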

This fix changes the sentinel to use CurrentPrimary status instead of IsPrimary(), keeping it true throughout the transition and allowing retries when status updates fail. Additionally, the RetryOnConflict wrapper is removed since the reconciliation loop itself provides the retry mechanism, simplifying the code and making all conflicts follow the same clear path.
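A minimal sketch of the corrected behavior, using an in-memory model. Everything here is an assumption made for illustration: the `instance` and `cluster` types, `reconcileFixed`, `runFixed`, and the exact expression of the sentinel (comparing a `currentPrimary` field with the pod name while the condition is unset) are hypothetical and are not taken from the actual patch. The point is only that a sentinel based on cluster status stays true across loops, so returning the conflict error and letting the next reconciliation run is sufficient.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes optimistic lock conflict (hypothetical).
var errConflict = errors.New("conflict: object has been modified")

// instance is a hypothetical model of the local PostgreSQL instance.
type instance struct {
	podName       string
	standbySignal bool
}

// cluster is a hypothetical model of the status shared with the operator.
type cluster struct {
	currentPrimary      string
	transitionCompleted bool
	pendingConflicts    int // status updates that fail before one succeeds
}

// markTransitionCompleted simulates the status update that may hit a conflict.
func (c *cluster) markTransitionCompleted() error {
	if c.pendingConflicts > 0 {
		c.pendingConflicts--
		return errConflict
	}
	c.transitionCompleted = true
	return nil
}

// reconcileFixed derives the sentinel from cluster status, not from local files,
// so creating standby.signal does not flip it to false mid-transition.
func reconcileFixed(inst *instance, c *cluster) error {
	requiresTransition := c.currentPrimary == inst.podName && !c.transitionCompleted
	if !requiresTransition {
		return nil
	}
	inst.standbySignal = true // idempotent: safe to repeat on every loop
	// No inner retry wrapper: a conflict is returned and the next loop retries.
	return c.markTransitionCompleted()
}

// runFixed models the outer reconciliation loop requeueing on error.
// It returns whether the condition was set and how many loops were needed.
func runFixed(conflicts, maxLoops int) (bool, int) {
	inst := &instance{podName: "cluster-1"}
	c := &cluster{currentPrimary: "cluster-1", pendingConflicts: conflicts}
	loops := 0
	for loops < maxLoops {
		loops++
		if err := reconcileFixed(inst, c); err == nil {
			break
		}
	}
	return c.transitionCompleted, loops
}

func main() {
	done, loops := runFixed(2, 5)
	fmt.Println(done, loops) // prints "true 3"
}
```

With two simulated conflicts the condition is set on the third loop, which matches the description: every conflict follows the same path through the outer loop instead of a separate inner retry.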

Closes #9591

github-actions · 2025-12-30T17:00:36Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label backport-requested ◀️ or add the label 'do not backport'.
To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

mnencia · 2025-12-30T17:08:33Z

/test

github-actions · 2025-12-30T17:08:43Z

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20601810333

mnencia · 2025-12-30T19:59:37Z

/test

github-actions · 2025-12-30T19:59:45Z

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20604778441

jbattiato · 2026-01-14T11:40:22Z

/test

github-actions · 2026-01-14T11:40:38Z

@jbattiato, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20992783571

…ansition (#9601)

When a replica cluster switch is initiated, a race condition could occur where the instance manager fails to set the designated primary transition completion condition after an optimistic lock conflict, causing the operator to wait indefinitely.

The root cause was in the RequiresDesignatedPrimaryTransition sentinel calculation, which used IsPrimary() to check for the absence of standby.signal. After RefreshReplicaConfiguration() creates standby.signal during the first reconciliation loop, IsPrimary() returns false, making the sentinel false and causing subsequent loops to return early without retrying the status update.

This fix changes the sentinel to use CurrentPrimary status instead of IsPrimary(), keeping it true throughout the transition and allowing retries when status updates fail. Additionally, the RetryOnConflict wrapper is removed since the reconciliation loop itself provides the retry mechanism, simplifying the code and making all conflicts follow the same clear path.

Closes #9591

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit a73b322)

mnencia requested a review from a team as a code owner December 30, 2025 17:00

cnpg-bot added the backport-requested ◀️, release-1.25, release-1.27, and release-1.28 labels Dec 30, 2025

dosubot bot added the size:M label (this PR changes 30-99 lines, ignoring generated files) Dec 30, 2025

dosubot bot added the bug 🐛 and ok to merge 👌 labels Dec 30, 2025

mnencia removed the ok to merge 👌 label Dec 30, 2025

cnpg-bot added the ok to merge 👌 label Dec 30, 2025

mnencia force-pushed the dev/9591 branch from 7336247 to 9aa8f7c Compare January 7, 2026 17:55

armru approved these changes Jan 8, 2026

dosubot bot added the lgtm label Jan 8, 2026

mnencia force-pushed the dev/9591 branch from 9aa8f7c to 0b10263 Compare January 9, 2026 13:30

jbattiato force-pushed the dev/9591 branch 2 times, most recently from eda3325 to 7b11de8 Compare January 14, 2026 11:39

jbattiato force-pushed the dev/9591 branch from 7b11de8 to 3c51c43 Compare January 14, 2026 14:41

mnencia force-pushed the dev/9591 branch from 3c51c43 to 0e6302c Compare January 15, 2026 16:45

gbartolini approved these changes Jan 19, 2026

gbartolini force-pushed the dev/9591 branch from 0e6302c to a3931e6 Compare January 19, 2026 07:17

gbartolini merged commit a73b322 into main Jan 19, 2026
34 checks passed

gbartolini deleted the dev/9591 branch January 19, 2026 07:34
