fix: auto-recover replicas with diverged timelines by re-cloning by armru · Pull Request #9637 · cloudnative-pg/cloudnative-pg

armru · 2026-01-07T12:26:01Z

fix: auto-recover replicas with diverged timelines by re-cloning

When a replica's timeline diverges from the primary after a failover
(e.g., due to network issues or WAL archiving lag), PostgreSQL fails
to start with "requested timeline X is not a child of this server's
history". This left replicas in a crash-loop requiring manual PVC
deletion to recover.

This fix adds automatic recovery by:

- Detecting a timeline mismatch between the replica and the primary during
  instance startup by checking the cluster status
- Exiting with a specific exit code (6) when divergence is detected
- Having the operator detect this exit code and mark the instance as
  unrecoverable, triggering automatic PVC deletion and re-cloning
  via pg_basebackup
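
The exit-code handshake described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the exit code value (6) comes from the description, while every identifier (`ExitCodeTimelineDiverged`, `startupExitCode`, `actionForExitCode`, the action names) is hypothetical, and the divergence check itself is reduced to a boolean input because the description does not spell out how it is computed.

```go
package main

import "fmt"

// ExitCodeTimelineDiverged mirrors the exit code named in the PR
// description (6). The constant name itself is an assumption.
const ExitCodeTimelineDiverged = 6

// RecoveryAction is a hypothetical label for what the operator does next.
type RecoveryAction string

const (
	// ActionDefault stands in for whatever the operator normally does
	// with an exited instance (assumption, not taken from the PR).
	ActionDefault RecoveryAction = "default-handling"
	// ActionReclone stands in for "mark unrecoverable, delete the PVC,
	// and re-clone via pg_basebackup".
	ActionReclone RecoveryAction = "delete-pvc-and-reclone"
)

// startupExitCode is the instance-side half of the sketch: when the
// startup check reports a diverged timeline, exit with the dedicated code.
// Returning 0 otherwise is a simplification for the example.
func startupExitCode(timelineDiverged bool) int {
	if timelineDiverged {
		return ExitCodeTimelineDiverged
	}
	return 0
}

// actionForExitCode is the operator-side half: only the dedicated exit
// code triggers re-cloning; every other code keeps the default handling.
func actionForExitCode(code int) RecoveryAction {
	if code == ExitCodeTimelineDiverged {
		return ActionReclone
	}
	return ActionDefault
}

func main() {
	code := startupExitCode(true)
	fmt.Println(code, actionForExitCode(code))
}
```

Using a dedicated exit code lets the operator tell a diverged replica apart from any other startup failure without parsing PostgreSQL logs.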

Closes #4990

github-actions · 2026-01-07T12:26:13Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

armru · 2026-01-07T12:26:18Z

/test

github-actions · 2026-01-07T12:26:26Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20781437972

armru · 2026-01-07T13:44:51Z

/test

github-actions · 2026-01-07T13:45:03Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20783536292

…story files

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:

- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
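
The validation steps listed in this commit message can be sketched in Go along these lines. It is an illustrative sketch, not the code merged in #9650: the function names `parseHistoryTimeline` and `shouldRejectHistoryFile` are hypothetical, and the only facts relied on are the rule stated in the message (replicas reject files where fileTimeline > clusterTimeline) and PostgreSQL's convention of naming history files with an 8-digit hexadecimal timeline ID.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseHistoryTimeline extracts the timeline ID from a name such as
// "00000022.history". PostgreSQL writes the ID as 8 hexadecimal digits,
// so "00000022" is timeline 0x22 (34 in decimal).
// The second return value is false when the name is not a history file.
func parseHistoryTimeline(name string) (uint32, bool) {
	const suffix = ".history"
	if !strings.HasSuffix(name, suffix) {
		return 0, false
	}
	base := strings.TrimSuffix(name, suffix)
	if len(base) != 8 {
		return 0, false
	}
	tli, err := strconv.ParseUint(base, 16, 32)
	if err != nil {
		return 0, false
	}
	return uint32(tli), true
}

// shouldRejectHistoryFile applies the rule from the commit message:
// primaries retain full access, while replicas must not fetch history
// files for timelines beyond the cluster's current one. A rejected file
// would then be reported to PostgreSQL as "file not found".
func shouldRejectHistoryFile(name string, isPrimary bool, clusterTimeline uint32) bool {
	if isPrimary {
		return false
	}
	fileTimeline, ok := parseHistoryTimeline(name)
	if !ok {
		// Not a timeline history file: this check does not apply.
		return false
	}
	return fileTimeline > clusterTimeline
}

func main() {
	// A replica in a cluster on timeline 0x21 asks for timeline 0x22's history.
	fmt.Println(shouldRejectHistoryFile("00000022.history", false, 0x21))
	// The same request from the primary is allowed through.
	fmt.Println(shouldRejectHistoryFile("00000022.history", true, 0x21))
}
```

Files that are not timeline history files (for example, ordinary WAL segments) fall through unchanged, so the filter only affects the case the commit message targets.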

When a replica's timeline diverges from the primary after a failover (e.g., due to network issues or WAL archiving lag), PostgreSQL fails to start with "requested timeline X is not a child of this server's history". This left replicas in a crash-loop requiring manual PVC deletion to recover.

This fix adds automatic recovery by:

- Detecting timeline mismatch between replica and primary during instance startup by checking cluster status
- Exiting with a specific exit code (6) when divergence is detected
- Operator detects this exit code and marks the instance as unrecoverable, triggering automatic PVC deletion and re-cloning via pg_basebackup

Closes #4990

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

armru · 2026-01-12T15:06:30Z

/test

github-actions · 2026-01-12T15:06:41Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20924190000

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

…story files (#9650)

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:

- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

…story files (#9650)

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:

- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit ef73994)

cnpg-bot added the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.27, and release-1.28 labels Jan 7, 2026

armru force-pushed the dev/bug-history branch from d2f86e5 to d6a5a90 Compare January 7, 2026 12:37

armru changed the title ~~fix: auto-recover replicas with diverged timelines using pg_rewind~~ fix: auto-recover replicas with diverged timelines Jan 7, 2026

armru force-pushed the dev/bug-history branch 2 times, most recently from 56c6d16 to c60e639 Compare January 7, 2026 13:28

armru changed the title ~~fix: auto-recover replicas with diverged timelines~~ fix: auto-recover replicas with diverged timelines by re-cloning Jan 7, 2026

armru force-pushed the dev/bug-history branch from c60e639 to 78b3ecc Compare January 7, 2026 13:44

armru force-pushed the dev/bug-history branch from 6cbb563 to 45eb718 Compare January 7, 2026 16:35

mnencia mentioned this pull request Jan 8, 2026

fix(walrestore): prevent replicas from downloading future timeline history files #9650

Merged

armru added 2 commits January 12, 2026 15:38

feat: fork diff

750490d

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

armru force-pushed the dev/bug-history branch from 45eb718 to f8fd5cb Compare January 12, 2026 15:02

chore: improve code quality

2ecfe3a

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

armru force-pushed the dev/bug-history branch from f8fd5cb to 2ecfe3a Compare January 12, 2026 15:10

dosubot bot mentioned this pull request Jan 17, 2026

[Bug]: Problem upgrading on a restored cluster. #9764

Closed

4 tasks

TsengSR mentioned this pull request Jan 29, 2026

[Bug]: Replica does not come up: checkpoint not found in timeline #4990

Open

4 tasks

armru wants to merge 3 commits into main from dev/bug-history


2 participants
