[Bug]: Former master cannot perform pg_rewind after failover if plugin is used for WAL archiving #9067
Defect
Description
Is there an existing issue already for this bug?
- I have searched for an existing issue, and could not find anything. I believe this is a new bug.
I have read the troubleshooting guide
- I have read the troubleshooting guide and I think this is a new bug.
I am running a supported version of CloudNativePG
- I am running a supported version of CloudNativePG.
Contact Details
No response
Version
1.27 (latest patch)
What version of Kubernetes are you using?
1.34
What is your Kubernetes environment?
Self-managed: kind (evaluation)
How did you install the operator?
YAML manifest
What happened?
Hi!
While testing operator behavior under a sudden network partition, I encountered a situation where the former primary fails to perform pg_rewind after network connectivity is restored, so it cannot catch up and rejoin as a replica.
This only occurs when plugins are used for WAL archiving/restoration.
Precise steps for reproduction:
- Create a 3-node cluster with synchronous replication in kind
- Create a pgbench deployment to put constant load on the cluster
- Disconnect the kind node hosting the current primary (e.g. docker network disconnect kind cnpg-worker3)
- Wait until failover is performed
- Reconnect the kind node hosting the former primary (e.g. docker network connect kind cnpg-worker3)
- Observe that the former primary fails to perform pg_rewind with --restore-target-wal enabled, because it is unable to fetch WAL files
pg_rewind calls manager wal-restore with the FULL path to the target file:
/controller/manager wal-restore --log-destination /controller/log/postgres.json 00000001000000010000005F /var/lib/postgresql/data/pgdata/pg_wal/00000001000000010000005F
which we do not handle correctly at
walFound, err := restoreWALViaPlugins(ctx, cluster, walName, path.Join(pgData, destinationPath))
so the gRPC call to the plugin is made with the PGDATA part of the restore filepath duplicated:
/var/lib/postgresql/data/pgdata/var/lib/postgresql/data/pgdata/pg_wal/00000001000000010000005F
Cluster resource
Cluster:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3
  plugins:
    - name: cnpg-extensions.yandex.cloud # Should always be "cnpg-extensions.yandex.cloud" for wal-g plugin
      isWALArchiver: true # Set to true to enable continuous WAL archiving
      parameters:
        backupConfig: example-backup-config # Name of BackupConfig used to create backups
  postgresql:
    parameters:
      max_connections: "250"
      shared_buffers: 4GB
      effective_cache_size: 12GB
      maintenance_work_mem: 1GB
      checkpoint_completion_target: "0.9"
      wal_buffers: 16MB
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      work_mem: 16513kB
      huge_pages: "off"
      min_wal_size: 1GB
      max_wal_size: 4GB
      max_worker_processes: "4"
      max_parallel_workers_per_gather: "2"
      max_parallel_workers: "4"
      max_parallel_maintenance_workers: "2"
    synchronous:
      method: any
      number: 1
      dataDurability: required
  storage:
    size: 1Gi
PGBench deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbench-runner-cnpg
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pgbench-runner-cnpg
  template:
    metadata:
      labels:
        app: pgbench-runner-cnpg
    spec:
      restartPolicy: Always # important to avoid infinite restart of completed pods
      containers:
        - name: pgbench
          image: postgres:17
          resources:
            requests:
              cpu: "10m"
              memory: "128Mi"
            limits: {}
          env:
            - name: PGHOST
              valueFrom:
                secretKeyRef:
                  name: cluster-example-app
                  key: host
            - name: PGPORT
              valueFrom:
                secretKeyRef:
                  name: cluster-example-app
                  key: port
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: cluster-example-app
                  key: user
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: cluster-example-app
                  key: password
            - name: PGDATABASE
              valueFrom:
                secretKeyRef:
                  name: cluster-example-app
                  key: dbname
          command: ["sh", "-c"]
          args:
            - |
              echo "Initializing benchmark database if needed...";
              pgbench -i -s 60 -d $PGDATABASE || true;
              echo "Running pgbench...";
              pgbench -j 6 -c 60 -b tpcb-like --time 600 -d $PGDATABASE
Relevant log output
Postgres instance logs:
{"level":"info","ts":"2025-11-05T15:32:38.4179151Z","msg":"Starting up pg_rewind","logger":"instance-manager","logging_pod":"cluster-separate-wal-storage-1","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"cluster-separate-wal-storage","namespace":"default"},"namespace":"default","name":"cluster-separate-wal-storage","reconcileID":"9063eeb8-8b27-45ec-a1cf-45305a5d63ff","logging_pod":"cluster-separate-wal-storage-1","pgdata":"/var/lib/postgresql/data/pgdata","options":["-P","--source-server","host=cluster-separate-wal-storage-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=cluster-separate-wal-storage-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"]}
{"level":"info","ts":"2025-11-05T15:32:38.46084317Z","logger":"pg_rewind","msg":"pg_rewind: connected to server","pipe":"stderr","logging_pod":"cluster-separate-wal-storage-1"}
{"level":"info","ts":"2025-11-05T15:32:38.463751188Z","logger":"pg_rewind","msg":"pg_rewind: servers diverged at WAL location 1/7F1322D0 on timeline 1","pipe":"stderr","logging_pod":"cluster-separate-wal-storage-1"}
{"level":"info","ts":"2025-11-05T15:32:39.396083475Z","logger":"pg_rewind","msg":"pg_rewind: error: could not restore file \"00000001000000010000005F\" from archive","pipe":"stderr","logging_pod":"cluster-separate-wal-storage-1"}
{"level":"info","ts":"2025-11-05T15:32:39.396124895Z","logger":"pg_rewind","msg":"pg_rewind: error: could not find previous WAL record at 1/5FFFE728","pipe":"stderr","logging_pod":"cluster-separate-wal-storage-1"}
{"level":"error","ts":"2025-11-05T15:32:39.398028104Z","msg":"Failed to execute pg_rewind","logger":"instance-manager","logging_pod":"cluster-separate-wal-storage-1","controller":"instance-cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"cluster-separate-wal-storage","namespace":"default"},"namespace":"default","name":"cluster-separate-wal-storage","reconcileID":"9063eeb8-8b27-45ec-a1cf-45305a5d63ff","logging_pod":"cluster-separate-wal-storage-1","options":["-P","--source-server","host=cluster-separate-wal-storage-rw user=streaming_replica port=5432 sslkey=/controller/certificates/streaming_replica.key sslcert=/controller/certificates/streaming_replica.crt sslrootcert=/controller/certificates/server-ca.crt application_name=cluster-separate-wal-storage-1 sslmode=verify-ca dbname=postgres","--target-pgdata","/var/lib/postgresql/data/pgdata","--restore-target-wal"],"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/machinery/pkg/log.(*logger).Error\n\tpkg/mod/github.com/cloudnative-pg/machinery@v0.3.1/pkg/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/postgres.(*Instance).Rewind\n\tpkg/management/postgres/instance.go:1165\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).verifyPgDataCoherenceForPrimary\n\tinternal/management/controller/instance_startup.go:140\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).initialize\n\tinternal/management/controller/instance_controller.go:457\ngithub.com/cloudnative-pg/cloudnative-pg/internal/management/controller.(*InstanceReconciler).Reconcile\n\tinternal/management/controller/instance_controller.go:193\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:340\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:300\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\tpkg/mod/sigs.k8s.io/controller-runtime@v0.21.0/pkg/internal/controller/controller.go:202"}
Plugin logs (using cnpg-plugin-wal-g):
{"level":"error","ts":"2025-11-05T15:32:39Z","logger":"plugin_wal","msg":"Failed to run wal-g fetch 00000001000000010000005F /var/lib/postgresql/data/pgdata/var/lib/postgresql/data/pgdata/pg_wal/00000001000000010000005F","method":"Restore","stdout":"","stderr":"ERROR: 2025/11/05 15:32:39.100804 open /var/lib/postgresql/data/pgdata/var/lib/postgresql/data/pgdata/pg_wal/00000001000000010000005F: no such file or directory\n","error":"subprocess run cmd result exit status 1"}
Code of Conduct
- I agree to follow this project's Code of Conduct
Metadata
Labels
bug 🐛 Something isn't working
Status
Done