Description
Version
trunk (main)
What happened?
i'm reviewing the code in `reconcileTargetPrimaryForNonReplicaCluster()` and it might be missing some logic?
i'm thinking automatic promotion as part of self-healing is only safe (i.e. no data loss) in conjunction with synchronous replication. i'm not sure we should fully trust any methodology based on trying to flush the WAL.
Edit (4-May-2025): you can skip the initial design proposal below this line and go straight to the design proposal in the first comments.
- Quorum-Based: assuming that `standbyNamesPre`/`standbyNamesPost` and `maxStandbyNamesFromCluster` are unset, only consider promoting the first N entries in `instancesStatus[]`, where N is `synchronous.number - unreachable_replicas`.
  - (if `synchronous.number == 0` then we should not auto-promote anyone. possibly we should default `synchronous.number` to `1` on all cnpg clusters with at least one replica? and should `dataDurability` default to `preferred` with 1 replica and `required` with 2+ replicas?)
  - still need to work out the formula when `standbyNamesPre`/`standbyNamesPost` or `maxStandbyNamesFromCluster` are present
  - still need to think through `preferred` durability a bit more
  - with `synchronous.number = 1` even a single unreachable replica would mean no promotion, but i think this is likely the behavior we want? would it make sense to have the default be up to `2` if there are 3+ replicas? i think we always want at least one async replica available for pod disruption budgets during maintenance.
- Priority-Based: only consider promoting entries in `instancesStatus[]` that are known sync replicas?
  - there might be a race condition where an unavailable higher-priority replica has just come back online, and postgres has just demoted the final `sync` replica back to `potential`, but cnpg is not yet aware of this fact. if cnpg promotes the replica that it thought was `sync`, then there could be data loss.
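The priority-based race can be sketched as a stale-cache problem: the operator's cached view of `pg_stat_replication.sync_state` lags behind what postgres currently reports. A minimal illustration (the types and the `safeToPromote` helper are hypothetical; `sync` and `potential` are the real `sync_state` values under priority-based replication):

```go
package main

import "fmt"

// syncState mirrors the values postgres reports in
// pg_stat_replication.sync_state for priority-based replication.
type syncState string

const (
	stateSync      syncState = "sync"
	statePotential syncState = "potential"
)

// safeToPromote applies the naive rule "promote only known sync
// replicas" against a possibly stale cached view of sync_state.
func safeToPromote(cached syncState) bool {
	return cached == stateSync
}

func main() {
	// t0: the operator caches the replica's state as sync.
	cached := stateSync

	// t1: a higher-priority replica reconnects, and postgres demotes
	// this replica back to potential. The operator has not yet
	// re-read pg_stat_replication, so its cache is stale.
	actual := statePotential

	// t2: the primary fails. The operator consults its stale cache
	// and concludes promotion is safe, even though the replica may
	// be missing transactions confirmed on the returning standby.
	fmt.Printf("cached=%s actual=%s promote=%v\n", cached, actual, safeToPromote(cached))
}
```

The sketch is only meant to show why the cached `sync_state` alone is not a sufficient promotion guard in this window.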
this is fairly complex to reason through, so it would be good to discuss further.