[Bug]: Replicas crash-loop when future timeline history files exist in WAL archive #4188

@jakubhajek

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Version

1.22.2

What version of Kubernetes are you using?

1.26

What is your Kubernetes environment?

Cloud: Amazon EKS

How did you install the operator?

Helm

What happened?

We are unable to create cluster replicas by scaling up the Cluster resource. After the pg_basebackup phase finishes, the new replica fails with the following error:
requested timeline 5 does not contain minimum recovery point 466/FD7555E8 on timeline 4
I have tried creating other replicas, but each one fails at the same phase.

The cluster is currently on timeline 4:
Current Write LSN: 481/769AC000 (Timeline: 4 - WAL File: 000000040000048100000076)
Please let me know what additional information is required to help solve this issue.
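For readers cross-checking the values above, here is a minimal Python sketch of how an LSN and a timeline ID map to a WAL segment file name. It assumes the default 16 MiB `wal_segment_size`; the function names are illustrative and are not part of CloudNativePG or PostgreSQL.

```python
# Sketch: derive a WAL segment file name from a timeline ID and an LSN.
# Assumption: default 16 MiB wal_segment_size (not stated in this report).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HIGH/LOW' (hex) into a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline: int, lsn: str,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the 24-character WAL file name that contains the given LSN."""
    segno = parse_lsn(lsn) // segment_size
    segments_per_xlogid = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


if __name__ == "__main__":
    # Current write LSN reported for the primary on timeline 4.
    print(wal_file_name(4, "481/769AC000"))
    # Minimum recovery point from the error message, also on timeline 4.
    print(wal_file_name(4, "466/FD7555E8"))
```

Under that assumption, the first call reproduces the WAL file name shown in the status line, and the minimum recovery point falls in segment `0000000400000466000000FD`, well behind the current write position.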

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    cnpg.io/reloadedAt: "2024-03-29T11:07:33.020678+01:00"
    kubectl.kubernetes.io/restartedAt: "2024-03-29T11:52:20+01:00"
  creationTimestamp: "2024-03-29T11:17:22Z"
  generation: 7
  labels:
    argocd.argoproj.io/instance: prd-infra-eks-cnpg0-pg
  name: pg-database0
  namespace: cnpg1
  resourceVersion: "223159316"
  uid: cda6c776-0499-42bd-a7f2-14ac204ab15a
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cnpg-workload-32
            operator: In
            values:
            - enabled
    podAntiAffinityType: preferred
    tolerations:
    - effect: NoSchedule
      key: cnpg-workload-32
      operator: Equal
      value: enabled
    topologyKey: topology.kubernetes.io/zone
  backup:
    barmanObjectStore:
      destinationPath: s3://prd-cnpg-backups/backups
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: gzip
        encryption: AES256
        maxParallel: 4
    retentionPolicy: 30d
    target: prefer-standby
    volumeSnapshot:
      className: pg-data-snapshot-class
      labels:
        app.kubernetes.io/managed-by: cloudnative-pg
      online: true
      onlineConfiguration:
        immediateCheckpoint: true
        waitForArchive: true
      snapshotOwnerReference: backup
      walClassName: pg-wal-snapshot-class
  bootstrap:
    initdb:
      dataChecksums: false
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: ecr-private-repo/infra/postgresql:13.11-10-202308231244
  imagePullPolicy: Always
  instances: 1
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: false
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      auto_explain.log_min_duration: 10s
      autovacuum_analyze_scale_factor: "0.05"
      autovacuum_max_workers: "6"
      autovacuum_naptime: "30"
      autovacuum_vacuum_cost_delay: "20"
      autovacuum_vacuum_scale_factor: "0.1"
      checkpoint_completion_target: "0.9"
      checkpoint_timeout: "300"
      client_encoding: UTF8
      dynamic_shared_memory_type: posix
      effective_cache_size: 16GB
      huge_pages: "on"
      idle_in_transaction_session_timeout: "300000"
      log_checkpoints: "1"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_hostname: "1"
      log_min_duration_statement: "5000"
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 1024MB
      max_connections: "2000"
      max_locks_per_transaction: "64"
      max_logical_replication_workers: "10"
      max_parallel_workers: "32"
      max_replication_slots: "10"
      max_stack_depth: "6144"
      max_sync_workers_per_subscription: "4"
      max_wal_senders: "20"
      max_wal_size: "2048"
      max_worker_processes: "15"
      min_wal_size: "192"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      pgaudit.log: all, -misc
      pgaudit.log_catalog: "off"
      pgaudit.log_parameter: "on"
      pgaudit.log_relation: "on"
      shared_buffers: 4000MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      tcp_keepalives_count: "2"
      tcp_keepalives_idle: "300"
      tcp_keepalives_interval: "30"
      timezone: UTC
      track_activity_query_size: "2048"
      track_commit_timestamp: "on"
      track_io_timing: "on"
      wal_keep_size: 512MB
      wal_level: logical
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 32MB
    pg_hba:
    - hostssl app all all cert
    shared_preload_libraries:
    - pg_stat_statements
    - pg_stat_monitor
    - pg_qualstats
    - pg_stat_kcache
    - pg_wait_sampling
    - pgaudit
    - auto_explain
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: restart
  primaryUpdateStrategy: unsupervised
  priorityClassName: infra-cluster-high
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    updateInterval: 30
  resources:
    limits:
      hugepages-2Mi: 5Gi
      memory: 25Gi
    requests:
      cpu: "7"
      hugepages-2Mi: 5Gi
      memory: 16Gi
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::1234:role/eks-cnpg0-cloudnative-pg
  smartShutdownTimeout: 180
  startDelay: 3600
  stopDelay: 1800
  storage:
    resizeInUseVolumes: true
    size: 220Gi
    storageClass: io2-pg-data
  switchoverDelay: 3600
  walStorage:
    resizeInUseVolumes: true
    size: 200Gi
    storageClass: io2
status:
  availableArchitectures:
  - goArch: amd64
    hash: 72462f8f498e9a1e31eeb819b74038375e02b5bf3d5f9adb6016ab713c96d9dc
  - goArch: arm64
    hash: 14ec408750588c000a891253608b5f6a84cc436f607b99a6c15611b106136e44
  certificates:
    clientCASecret: pg-database0-ca
    expirations:
      pg-database0-ca: 2024-06-27 11:12:23 +0000 UTC
      pg-database0-replication: 2024-06-27 11:12:23 +0000 UTC
      pg-database0-server: 2024-06-27 11:12:23 +0000 UTC
    replicationTLSSecret: pg-database0-replication
    serverAltDNSNames:
    - pg-database0-rw
    - pg-database0-rw.cnpg1
    - pg-database0-rw.cnpg1.svc
    - pg-database0-r
    - pg-database0-r.cnpg1
    - pg-database0-r.cnpg1.svc
    - pg-database0-ro
    - pg-database0-ro.cnpg1
    - pg-database0-ro.cnpg1.svc
    serverCASecret: pg-database0-ca
    serverTLSSecret: pg-database0-server
  cloudNativePGCommitHash: bcdcd885
  cloudNativePGOperatorHash: 72462f8f498e9a1e31eeb819b74038375e02b5bf3d5f9adb6016ab713c96d9dc
  conditions:
  - lastTransitionTime: "2024-03-29T13:06:28Z"
    message: Cluster is Ready
    reason: ClusterIsReady
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-03-29T11:18:08Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  - lastTransitionTime: "2024-03-29T11:28:32Z"
    message: Backup was successful
    reason: LastBackupSucceeded
    status: "True"
    type: LastBackupSucceeded
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "174342296"
  currentPrimary: pg-database0-2
  currentPrimaryTimestamp: "2024-03-29T11:18:03.180909Z"
  firstRecoverabilityPoint: "2024-03-29T11:28:32Z"
  firstRecoverabilityPointByMethod:
    volumeSnapshot: "2024-03-29T11:28:32Z"
  healthyPVC:
  - pg-database0-2
  - pg-database0-2-wal
  instanceNames:
  - pg-database0-2
  instances: 1
  instancesReportedState:
    pg-database0-2:
      isPrimary: true
      timeLineID: 4
  instancesStatus:
    healthy:
    - pg-database0-2
  lastSuccessfulBackup: "2024-03-29T11:28:32Z"
  lastSuccessfulBackupByMethod:
    volumeSnapshot: "2024-03-29T11:28:32Z"
  latestGeneratedNode: 5
  managedRolesStatus: {}
  phase: Cluster in healthy state
  poolerIntegrations:
    pgBouncerIntegration:
      secrets:
      - pg-database0-pooler
  pvcCount: 2
  readService: pg-database0-r
  readyInstances: 1
  secretsResourceVersion:
    applicationSecretVersion: "223091672"
    clientCaSecretVersion: "223091668"
    replicationSecretVersion: "223091670"
    serverCaSecretVersion: "223091668"
    serverSecretVersion: "223091669"
    superuserSecretVersion: "223091671"
  targetPrimary: pg-database0-2
  timelineID: 4
  topology:
    instances:
      pg-database0-2: {}
    nodesUsed: 1
    successfullyExtracted: true
  writeService: pg-database0-rw

Relevant log output

{"level":"info","ts":"2024-03-29T11:58:29Z","logger":"postgres","msg":"record","logging_pod":"pg-database0-4","record":{"log_time":"2024-03-29 11:58:29.032 UTC","process_id":"27","session_id":"6606ace0.1b","session_line_num":"7","session_start_time":"2024-03-29 11:58:24 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"XX000","message":"requested timeline 5 is not a child of this server's history","detail":"Latest checkpoint is at 478/18379B60 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 466/EB0000A0.","backend_type":"startup"}}
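The FATAL message above can be reproduced on paper. The following Python sketch is a simplified model of the timeline ancestry check, not PostgreSQL's actual code. It assumes a history file for timeline 5 exists in the WAL archive and contains an entry saying timeline 4 ended at 466/EB0000A0 (the fork point named in the log). The history text is hypothetical: only that single entry is known from the log, the entries for earlier timelines are omitted, and the reason field is a placeholder.

```python
# Simplified model of the "is this checkpoint on the requested timeline's
# history?" check. Illustrative only; names are not PostgreSQL APIs.
from typing import List, Tuple


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HIGH/LOW' (hex) into a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def parse_history(text: str) -> List[Tuple[int, int]]:
    """Parse history lines of the form '<parent tli> <switch LSN> <reason>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        entries.append((int(fields[0]), parse_lsn(fields[1])))
    return entries


def timeline_of_point(lsn: int, history: List[Tuple[int, int]],
                      requested_tli: int) -> int:
    """Return the timeline that owns `lsn` according to the history.

    Entries are ordered oldest first; a timeline owns every position
    strictly before its switch point.
    """
    for tli, switch in history:
        if lsn < switch:
            return tli
    return requested_tli


def is_child_of(checkpoint_tli: int, checkpoint_lsn: str,
                history: List[Tuple[int, int]], requested_tli: int) -> bool:
    """True if the checkpoint lies on the requested timeline's history."""
    owner = timeline_of_point(parse_lsn(checkpoint_lsn), history, requested_tli)
    return owner == checkpoint_tli


# Hypothetical, partial content of 00000005.history (fork point from the log).
HISTORY_5 = "4\t466/EB0000A0\tplaceholder-reason\n"

if __name__ == "__main__":
    history = parse_history(HISTORY_5)
    # Latest checkpoint reported by the replica: 478/18379B60 on timeline 4.
    print(is_child_of(4, "478/18379B60", history, requested_tli=5))
```

In this model the checkpoint at 478/18379B60 lies past the point where timeline 5 forked from timeline 4, so the check fails. That is consistent with the replica following a timeline 5 history file from the archive while the primary is still on timeline 4.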

Code of Conduct

  • I agree to follow this project's Code of Conduct
