[Bug]: CNPG 1.26 -> 1.27 migration stuck #8371

@usernamedt

Description

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

No response

Version

1.27 (latest patch)

What version of Kubernetes are you using?

1.33

What is your Kubernetes environment?

Other

How did you install the operator?

YAML manifest

What happened?

After migrating from CNPG 1.26 to 1.27 I see the following:

Pods(namespace):

NAME                   PF  READY  STATUS   RESTARTS  IP   NODE       AGE
postgresql-cluster-1   ●   1/1    Running  0         IP1  host1.net  7d2h
postgresql-cluster-2   ●   1/1    Running  0         IP2  host2.net  7d2h
postgresql-cluster-3   ●   1/1    Running  0         IP3  host3.net  52m

Looking inside the pod specs, postgresql-cluster-3 is running the 1.27 image, while the others (postgresql-cluster-1 and postgresql-cluster-2) are still on 1.26.

Cluster phase is stuck in "Waiting for the instances to become active".

Judging by the operator log, it cannot proceed any further because of a PostgreSQL configuration mismatch across the pods. Config diff:

➜  ~ diff pg3 pg2
5d4
< cnpg.synchronous_standby_names_metadata = '{"method":"ANY","number":1,"standbyNames":["postgresql-cluster-2","postgresql-cluster-3"]}'

Given this diff, the configurations will never match, since cnpg.synchronous_standby_names_metadata was introduced only in 1.27.
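For reference, the mismatch can be illustrated locally; a minimal sketch (the file contents below are abridged to the one differing line from the report, plus an invented `shared_buffers` line for context; the `kubectl exec` steps and the in-pod config path are assumptions, shown as comments):

```shell
# Hypothetical retrieval of each instance's generated config (path is an assumption):
# kubectl exec -n NS postgresql-cluster-3 -- cat "$PGDATA/custom.conf" > pg3
# kubectl exec -n NS postgresql-cluster-2 -- cat "$PGDATA/custom.conf" > pg2

# Simulated here: pg3 (1.27 instance) carries the extra metadata parameter,
# pg2 (1.26 instance) does not.
printf "shared_buffers = '128MB'\n" > pg2
printf "cnpg.synchronous_standby_names_metadata = '...'\nshared_buffers = '128MB'\n" > pg3

# diff reports the 1.27-only line as a deletion, mirroring the report's "5d4" hunk.
diff pg3 pg2 || true
```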

So the only solution is either to drop the existing 1.26 pods by hand or to remove the following settings from the cluster spec:

  maxSyncReplicas: 1
  minSyncReplicas: 1
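A hedged sketch of the second workaround (cluster name and namespace are taken from this report; using a JSON patch for the removal is my own approach, not a documented migration step):

```shell
# Build a JSON patch that drops the two sync-replica settings from the spec.
PATCH='[{"op":"remove","path":"/spec/maxSyncReplicas"},{"op":"remove","path":"/spec/minSyncReplicas"}]'

# Sanity-check the payload locally before applying it.
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch ok"

# Apply it to the stuck cluster (commented out: requires cluster access).
# kubectl patch cluster postgresql-cluster -n NS --type=json -p "$PATCH"
```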

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  creationTimestamp: "2025-05-29T09:37:55Z"
  generation: 3
  labels:
    app.kubernetes.io/name: postgresql-cluster
  name: postgresql-cluster
  namespace: NS
  resourceVersion: "433189235"
  uid: 7ab5fa98-8e5b-4fb1-82a8-6196027bd65e
spec:
  affinity:
    additionalPodAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: cnpg.io/podRole
            operator: In
            values:
            - instance
        topologyKey: host.dev/rack
    enablePodAntiAffinity: false
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: host.dev/rack
            operator: Exists
  backup:
    barmanObjectStore:
      data:
        compression: gzip
        encryption: AES256
        jobs: 2
      destinationPath: s3://bucket
      endpointCA:
        key: ca.crt
        name: name
      endpointURL: https://s3.net
      s3Credentials:
        accessKeyId:
          key: keyId
          name: s3-creds
        secretAccessKey:
          key: secret
          name: s3-creds
      wal:
        compression: gzip
        encryption: AES256
        maxParallel: 2
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    recovery:
      database: DB
      owner: OWNER
      recoveryTarget:
        backupID: 20250527T220200
        targetTime: "2025-05-28 10:36:00.000000"
      secret:
        name: bootstrap-secret
      source: cluster
  certificates:
    serverCASecret: CA
    serverTLSSecret: SECRET
  description: DESC
  enablePDB: true
  enableSuperuserAccess: false
  externalClusters:
  - barmanObjectStore:
      destinationPath: s3://BUCKET
      endpointCA:
        key: ca.crt
        name: NAME
      endpointURL: https://s3.net
      s3Credentials:
        accessKeyId:
          key: keyId
          name: s3-creds
        secretAccessKey:
          key: secret
          name: s3-creds
      wal:
        maxParallel: 8
    name: postgresql-cluster
  failoverDelay: 0
  imageName: pg16_image
  instances: 3
  logLevel: trace
  maxSyncReplicas: 1
  minSyncReplicas: 1
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: false
  postgresGID: ...
  postgresUID: ...
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      full_page_writes: "on"
      hot_standby_feedback: "on"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      max_connections: "400"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      password_encryption: scram-sha-256
      pg_failover_slots.drop_extra_slots: "on"
      pg_failover_slots.synchronize_slot_names: name_like:%
      pg_failover_slots.worker_nap_time: "60000"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      pg_stat_statements.track_utility: "off"
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      wal_keep_size: 512MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
    pg_hba:
    - hostssl support streaming_replica all cert
    syncReplicaElectionConstraint:
      enabled: true
      nodeLabelsAntiAffinity:
      - host.net/rack
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  probes:
    liveness:
      isolationCheck:
        connectionTimeout: 1000
        enabled: true
        requestTimeout: 1000
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      cpu: "2"
      memory: 7Gi
    requests:
      cpu: 1800m
      memory: 7Gi
  smartShutdownTimeout: 180
  startDelay: 3600
  stopDelay: 1800
  storage:
    resizeInUseVolumes: true
    size: 10Gi
    storageClass: CLASS
  switchoverDelay: 3600
status:
  availableArchitectures:
  - goArch: amd64
    hash: 0ac9a3dc1e7e0122ae5a89f03626ad5cbf3ba637a1232a6f95576d4801a043d9
  certificates:
    clientCASecret: CA
    expirations:
      some1: 2025-10-26 09:37:51 +0000 UTC
      some2: 2025-08-27 09:32:56 +0000 UTC
      some3: 2025-08-27 09:32:56 +0000 UTC
      some4: 2029-05-02 08:00:00 +0000 UTC
    replicationTLSSecret: secret
    serverAltDNSNames:
    - postgresql-cluster-rw
    - postgresql-cluster-rw.NAMESPACE
    - postgresql-cluster-rw.NAMESPACE.svc
    - postgresql-cluster-rw.NAMESPACE.svc.cluster.local
    - postgresql-cluster-r
    - postgresql-cluster-r.NAMESPACE
    - postgresql-cluster-r.NAMESPACE.svc
    - postgresql-cluster-r.NAMESPACE.svc.cluster.local
    - postgresql-cluster-ro
    - postgresql-cluster-ro.NAMESPACE
    - postgresql-cluster-ro.NAMESPACE.svc
    - postgresql-cluster-ro.NAMESPACE.svc.cluster.local
    serverCASecret: SECRET
    serverTLSSecret: SECRET
  cloudNativePGCommitHash: 1dc9a2909
  cloudNativePGOperatorHash: 0ac9a3dc1e7e0122ae5a89f03626ad5cbf3ba637a1232a6f95576d4801a043d9
  conditions:
  - lastTransitionTime: "2025-08-14T12:44:47Z"
    message: Cluster Is Not Ready
    reason: ClusterIsNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-08-08T13:54:53Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  - lastTransitionTime: "2025-08-14T22:02:56Z"
    message: Backup was successful
    reason: LastBackupSucceeded
    status: "True"
    type: LastBackupSucceeded
  - lastTransitionTime: "2025-08-14T12:44:42Z"
    message: A single, unique system ID was found across reporting instances.
    reason: Unique
    status: "True"
    type: ConsistentSystemID
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "411429773"
  currentPrimary: postgresql-cluster-1
  currentPrimaryTimestamp: "2025-08-08T13:54:52.003603Z"
  firstRecoverabilityPoint: "2025-08-07T22:02:24Z"
  firstRecoverabilityPointByMethod:
    barmanObjectStore: "2025-08-07T22:02:24Z"
  healthyPVC:
  - postgresql-cluster-1
  - postgresql-cluster-2
  - postgresql-cluster-3
  image: PG16_IMG
  instanceNames:
  - postgresql-cluster-1
  - postgresql-cluster-2
  - postgresql-cluster-3
  instances: 3
  instancesReportedState:
    postgresql-cluster-1:
      ip: IP1
      isPrimary: true
      timeLineID: 30
    postgresql-cluster-2:
      ip: IP2
      isPrimary: false
      timeLineID: 30
    postgresql-cluster-3:
      ip: IP3
      isPrimary: false
      timeLineID: 30
  instancesStatus:
    healthy:
    - postgresql-cluster-1
    - postgresql-cluster-2
    - postgresql-cluster-3
  lastSuccessfulBackup: "2025-08-14T22:02:54Z"
  lastSuccessfulBackupByMethod:
    barmanObjectStore: "2025-08-14T22:02:54Z"
  latestGeneratedNode: 3
  managedRolesStatus: {}
  pgDataImageInfo:
    image: PG16_IMG
    majorVersion: 16
  phase: Waiting for the instances to become active
  phaseReason: Some instances are not yet active. Please wait.
  poolerIntegrations:
    pgBouncerIntegration:
      secrets:
      - postgresql-cluster-pooler
  pvcCount: 3
  readService: postgresql-cluster-r
  readyInstances: 3
  secretsResourceVersion:
    applicationSecretVersion: "366149351"
    barmanEndpointCA: "382525586"
    clientCaSecretVersion: "366149447"
    replicationSecretVersion: "366149448"
    serverCaSecretVersion: "382525586"
    serverSecretVersion: "416429560"
  switchReplicaClusterStatus: {}
  systemID: "7412960607476973597"
  targetPrimary: postgresql-cluster-1
  targetPrimaryTimestamp: "2025-08-08T13:54:47.926682Z"
  timelineID: 30
  topology:
    instances:
      postgresql-cluster-1:
        host.dev/rack: rack1
      postgresql-cluster-2:
        host.dev/rack: rack2
      postgresql-cluster-3:
        host.dev/rack: rack3
    nodesUsed: 3
    successfullyExtracted: true
  writeService: postgresql-cluster-rw

Relevant log output

The following operator log output (level=trace) seems relevant:

2025-08-15 15:32:33.784  msg=try getting connection
2025-08-15 15:32:33.784  msg=Reconciliation loop start
2025-08-15 15:32:33.784  msg=Reconciliation loop end
2025-08-15 15:32:33.784  msg=Released logical plugin connection
2025-08-15 15:32:33.784  msg=Waiting for all Pods to have the same PostgreSQL configuration
2025-08-15 15:32:33.784  msg=haven't found any instance to create
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping cluster annotations reconciliation, because they are already present on pod
2025-08-15 15:32:33.784  msg=Skipping cluster label reconciliation, because they are already present on pod
2025-08-15 15:32:33.784  msg=Skipping cluster annotations reconciliation, because they are already present on pod
2025-08-15 15:32:33.784  msg=Skipping cluster label reconciliation, because they are already present on pod
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.784  msg=Skipping reconciliation, no changes to be done
2025-08-15 15:32:33.494  msg=correctly loaded the plugin client
2025-08-15 15:32:33.459  msg=correctly loaded the plugin client
2025-08-15 15:32:33.292  msg=Acquired logical plugin connection

Code of Conduct

  • I agree to follow this project's Code of Conduct
