fix(replica_cluster): resolve race condition in designated primary transition #9601

Merged
gbartolini merged 1 commit into main from dev/9591
Jan 19, 2026

mnencia (Member) commented Dec 30, 2025

When a replica cluster switch is initiated, a race condition could occur where the instance manager fails to set the designated primary transition completion condition after an optimistic lock conflict, causing the operator to wait indefinitely.

The root cause was in the RequiresDesignatedPrimaryTransition sentinel calculation, which used IsPrimary() to check for the absence of standby.signal. After RefreshReplicaConfiguration() creates standby.signal during the first reconciliation loop, IsPrimary() returns false, making the sentinel false and causing subsequent loops to return early without retrying the status update.

This fix changes the sentinel to use CurrentPrimary status instead of IsPrimary(), keeping it true throughout the transition and allowing retries when status updates fail. Additionally, the RetryOnConflict wrapper is removed since the reconciliation loop itself provides the retry mechanism, simplifying the code and making all conflicts follow the same clear path.
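
The difference between the two sentinel calculations can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `instance` and `cluster` types, the `sentinelOld`, `sentinelNew`, `reconcile`, and `simulate` functions, and the pod name `cluster-1` are hypothetical stand-ins and not the operator's actual code. Only the idea (a sentinel derived from `standby.signal` versus one derived from the recorded current primary, with the reconciliation loop as the sole retry mechanism) comes from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for a Kubernetes optimistic lock conflict.
var errConflict = errors.New("conflict: the object has been modified")

// instance is a hypothetical model of the local PostgreSQL instance.
type instance struct {
	podName       string
	standbySignal bool // true once standby.signal exists on disk
}

// cluster is a hypothetical model of the relevant cluster status.
type cluster struct {
	currentPrimary     string
	transitionComplete bool
	failNextUpdates    int // number of upcoming status updates that conflict
}

// updateStatus marks the transition as complete unless a simulated
// conflict is pending.
func (c *cluster) updateStatus() error {
	if c.failNextUpdates > 0 {
		c.failNextUpdates--
		return errConflict
	}
	c.transitionComplete = true
	return nil
}

type sentinelFunc func(i *instance, c *cluster) bool

// sentinelOld mimics the pre-fix check: true only while standby.signal
// is absent, so it flips to false after the first loop.
func sentinelOld(i *instance, c *cluster) bool { return !i.standbySignal }

// sentinelNew mimics the post-fix check: true while this pod is still
// recorded as the current primary, regardless of standby.signal.
func sentinelNew(i *instance, c *cluster) bool { return c.currentPrimary == i.podName }

// reconcile is one pass of a simplified reconciliation loop. There is no
// inner retry: a failed update is retried by the next pass.
func reconcile(i *instance, c *cluster, sentinel sentinelFunc) error {
	if !sentinel(i, c) {
		return nil // early return: nothing to do
	}
	i.standbySignal = true // stands in for RefreshReplicaConfiguration()
	return c.updateStatus()
}

// simulate runs up to `loops` passes and reports whether the transition
// was ever marked complete.
func simulate(sentinel sentinelFunc, conflicts, loops int) bool {
	i := &instance{podName: "cluster-1"}
	c := &cluster{currentPrimary: "cluster-1", failNextUpdates: conflicts}
	for n := 0; n < loops && !c.transitionComplete; n++ {
		_ = reconcile(i, c, sentinel) // an error simply means "try again next loop"
	}
	return c.transitionComplete
}

func main() {
	fmt.Println("old sentinel, one conflict:", simulate(sentinelOld, 1, 5))
	fmt.Println("new sentinel, one conflict:", simulate(sentinelNew, 1, 5))
}
```

In this model a single conflict is enough to stall the old sentinel: the first pass creates `standby.signal`, the update fails, and every later pass returns early, so the program prints `false` for the old sentinel and `true` for the new one.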

Closes #9591

mnencia requested a review from a team as a code owner December 30, 2025 17:00
cnpg-bot added the backport-requested ◀️, release-1.25, release-1.27, and release-1.28 labels Dec 30, 2025
dosubot (bot) added the size:M label Dec 30, 2025
github-actions (Contributor) commented

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label backport-requested ◀️ or add the label 'do not backport'.
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y.

dosubot (bot) added the bug 🐛 and ok to merge 👌 labels Dec 30, 2025
mnencia removed the ok to merge 👌 label Dec 30, 2025
mnencia (Member, Author) commented Dec 30, 2025

/test

github-actions (Contributor) commented

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20601810333

cnpg-bot added the ok to merge 👌 label Dec 30, 2025
mnencia (Member, Author) commented Dec 30, 2025

/test

github-actions (Contributor) commented

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20604778441

dosubot (bot) added the lgtm label Jan 8, 2026
jbattiato force-pushed the dev/9591 branch 2 times, most recently from eda3325 to 7b11de8 January 14, 2026 11:39
jbattiato (Collaborator) commented

/test

github-actions (Contributor) commented

@jbattiato, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20992783571

…ansition

When a replica cluster switch is initiated, a race condition could occur
where the instance manager fails to set the designated primary transition
completion condition after an optimistic lock conflict, causing the operator
to wait indefinitely.

The root cause was in the RequiresDesignatedPrimaryTransition sentinel
calculation, which used IsPrimary() to check for the absence of standby.signal.
After RefreshReplicaConfiguration() creates standby.signal during the first
reconciliation loop, IsPrimary() returns false, making the sentinel false and
causing subsequent loops to return early without retrying the status update.

This fix changes the sentinel to use CurrentPrimary status instead of
IsPrimary(), keeping it true throughout the transition and allowing retries
when status updates fail. Additionally, the RetryOnConflict wrapper is
removed since the reconciliation loop itself provides the retry mechanism,
simplifying the code and making all conflicts follow the same clear path.

Closes #9591

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
gbartolini merged commit a73b322 into main Jan 19, 2026
34 checks passed
gbartolini deleted the dev/9591 branch January 19, 2026 07:34
cnpg-bot pushed a commit that referenced this pull request Jan 19, 2026
…ansition (#9601)

(cherry picked from commit a73b322)
cnpg-bot pushed a commit that referenced this pull request Jan 19, 2026
…ansition (#9601)

(cherry picked from commit a73b322)
cnpg-bot pushed a commit that referenced this pull request Jan 19, 2026
…ansition (#9601)

(cherry picked from commit a73b322)
mnencia added a commit that referenced this pull request Jan 20, 2026
…ansition (#9601)

(cherry picked from commit a73b322)
Labels

  • backport-requested ◀️: This pull request should be backported to all supported releases
  • bug 🐛: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • ok to merge 👌: This PR can be merged
  • release-1.25
  • release-1.27
  • release-1.28
  • size:M: This PR changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug]: Race condition in replica cluster switch prevents designated primary transition completion

5 participants