[Bug]: Race condition in replica cluster switch prevents designated primary transition completion #9591

@mnencia

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Contact Details

No response

Version

trunk (main)

What version of Kubernetes are you using?

1.34

What is your Kubernetes environment?

Self-managed: kind (evaluation)

How did you install the operator?

YAML manifest

What happened?

During a replica cluster switch, a rare race condition (observed in ~1-5% of test runs) occurs: the operator and the instance manager concurrently modify the cluster object, and the resulting optimistic lock conflict prevents the designated primary transition from completing.

From E2E test failure: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20547048546
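
To illustrate the failure mode described above, here is a minimal, self-contained sketch of how two writers that read the same object version can produce an optimistic lock conflict, and how a re-read-and-retry loop gets past it. This is not CloudNativePG code: `Store`, `Cluster`, its fields, and `UpdateWithRetry` are hypothetical names made up for this example, and the in-memory store only imitates the API server's resource version check.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict imitates the API server's conflict response
// ("the object has been modified").
var ErrConflict = errors.New("operation cannot be fulfilled: the object has been modified")

// Cluster is a hypothetical, heavily simplified stand-in for the cluster object.
type Cluster struct {
	ResourceVersion int
	CurrentPrimary  string
	Phase           string
}

// Store is a toy in-memory object store that enforces optimistic locking.
type Store struct {
	mu      sync.Mutex
	current Cluster
}

// Get returns a copy of the latest stored object.
func (s *Store) Get() Cluster {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// Update accepts the write only if it is based on the latest resource version.
func (s *Store) Update(c Cluster) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	c.ResourceVersion++
	s.current = c
	return nil
}

// UpdateWithRetry re-reads the object and reapplies mutate after each conflict,
// giving up after the given number of attempts.
func UpdateWithRetry(s *Store, attempts int, mutate func(*Cluster)) error {
	for i := 0; i < attempts; i++ {
		c := s.Get()
		mutate(&c)
		err := s.Update(c)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, ErrConflict)
}

func main() {
	store := &Store{current: Cluster{ResourceVersion: 1}}

	// Both writers read the same version of the object.
	fromOperator := store.Get()
	fromInstance := store.Get()

	// The first writer succeeds and bumps the resource version.
	fromOperator.Phase = "switching"
	fmt.Println("operator update:", store.Update(fromOperator))

	// The second write is based on a stale version and is rejected.
	fromInstance.CurrentPrimary = "pod-1"
	fmt.Println("instance manager update:", store.Update(fromInstance))

	// Retrying against a fresh read goes through.
	err := UpdateWithRetry(store, 3, func(c *Cluster) { c.CurrentPrimary = "pod-1" })
	fmt.Printf("retry: %v, object: %+v\n", err, store.Get())
}
```

In real Kubernetes clients the same pattern is usually written with `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`. Whether a retry alone is the right fix for this issue is not established here; the sketch only shows why a single unretried update can leave the transition incomplete.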

Relevant log output

Attempted transition:

{"ts":"2025-12-28T02:52:54.810548Z",
 "msg":"Setting myself as the current designated primary"}

Optimistic lock conflict:

{"ts":"2025-12-28T02:52:55.372302Z",
 "error":"Operation cannot be fulfilled... the object has been modified"}

Test timeout:

Timed out after 300.000s.
Expected pod to be in recovery mode

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

bug 🐛 Something isn't working
