fix: auto-recover replicas with diverged timelines by re-cloning#9637
Draft
Contributor
❗ By default, the pull request is configured to backport to all release branches.
Member
Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20781437972
56c6d16 to c60e639
Member
Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20783536292
mnencia
added a commit
that referenced
this pull request
Jan 8, 2026
…story files

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:
- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
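The validation steps listed in this commit message could be sketched roughly as follows. This is a minimal illustration, not the code from the referenced commit: the function names `parseHistoryTimeline` and `shouldRejectHistoryFile` and their parameters are hypothetical. It relies on the fact that PostgreSQL names timeline history files with the timeline ID as eight hexadecimal digits, so `00000022.history` refers to timeline 34.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// historyFileRe matches timeline history file names such as
// "00000022.history": eight hexadecimal digits followed by ".history".
var historyFileRe = regexp.MustCompile(`^([0-9A-Fa-f]{8})\.history$`)

// parseHistoryTimeline (hypothetical helper) returns the timeline ID encoded
// in a history file name. The boolean is false when name is not a history file.
func parseHistoryTimeline(name string) (int, bool) {
	m := historyFileRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	tl, err := strconv.ParseInt(m[1], 16, 64)
	if err != nil {
		return 0, false
	}
	return int(tl), true
}

// shouldRejectHistoryFile (hypothetical helper) reports whether a WAL restore
// request should be answered with "file not found". Primaries keep full
// access; replicas are refused history files from a future timeline.
func shouldRejectHistoryFile(name string, isPrimary bool, clusterTimeline int) bool {
	if isPrimary {
		return false
	}
	fileTimeline, ok := parseHistoryTimeline(name)
	if !ok {
		// Not a history file (e.g. a WAL segment): this check does not apply.
		return false
	}
	return fileTimeline > clusterTimeline
}

func main() {
	// A replica in a cluster on timeline 33 asks for the history of timeline 34.
	fmt.Println(shouldRejectHistoryFile("00000022.history", false, 33)) // true
}
```

In this sketch, anything that is not a history file passes through untouched, so only the "future timeline" case described above is affected.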
mnencia
added a commit
that referenced
this pull request
Jan 9, 2026
armru
pushed a commit
that referenced
this pull request
Jan 12, 2026
When a replica's timeline diverges from the primary after a failover (e.g., due to network issues or WAL archiving lag), PostgreSQL fails to start with "requested timeline X is not a child of this server's history". This left replicas in a crash-loop requiring manual PVC deletion to recover.

This fix adds automatic recovery by:
- Detecting timeline mismatch between replica and primary during instance startup by checking cluster status
- Exiting with a specific exit code (6) when divergence is detected
- Operator detects this exit code and marks the instance as unrecoverable, triggering automatic PVC deletion and re-cloning via pg_basebackup

Closes #4990

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
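The hand-off between the instance and the operator described in this commit message could look roughly like the sketch below. Only the exit code value (6) comes from the text; the constant, type, and function names are hypothetical, the generic failure code 1 is an assumption, and the divergence check itself is left out and passed in as a boolean because the message does not spell out how it is computed.

```go
package main

import (
	"errors"
	"fmt"
)

// exitCodeTimelineDiverged is the exit code named in the commit message.
// The constant name is hypothetical.
const exitCodeTimelineDiverged = 6

// recoveryAction is a hypothetical label for what the operator does next.
type recoveryAction string

const (
	actionNone    recoveryAction = "none"
	actionRestart recoveryAction = "restart"
	actionReclone recoveryAction = "delete PVC and re-clone"
)

// instanceExitCode (hypothetical) picks the exit code of the instance process.
// diverged stands for the result of the startup timeline check.
func instanceExitCode(diverged bool, startErr error) int {
	switch {
	case diverged:
		return exitCodeTimelineDiverged
	case startErr != nil:
		return 1 // assumption: generic failure code
	default:
		return 0
	}
}

// actionForExitCode (hypothetical) is the operator-side mapping: the dedicated
// exit code marks the instance as unrecoverable, anything else is retried.
func actionForExitCode(code int) recoveryAction {
	switch code {
	case 0:
		return actionNone
	case exitCodeTimelineDiverged:
		return actionReclone
	default:
		return actionRestart
	}
}

func main() {
	code := instanceExitCode(true, nil)
	fmt.Printf("exit code %d -> %s\n", code, actionForExitCode(code))

	code = instanceExitCode(false, errors.New("postgres failed to start"))
	fmt.Printf("exit code %d -> %s\n", code, actionForExitCode(code))
}
```

Using a dedicated exit code lets the operator tell an unrecoverable divergence apart from an ordinary crash without parsing PostgreSQL logs.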
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
45eb718 to f8fd5cb
Member
Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20924190000
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
f8fd5cb to 2ecfe3a
mnencia
added a commit
that referenced
this pull request
Jan 12, 2026
gbartolini
pushed a commit
that referenced
this pull request
Jan 14, 2026
mnencia
added a commit
that referenced
this pull request
Jan 19, 2026
mnencia
added a commit
that referenced
this pull request
Jan 20, 2026
mnencia
added a commit
that referenced
this pull request
Jan 20, 2026
…story files (#9650)

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:
- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
cnpg-bot
pushed a commit
that referenced
this pull request
Jan 20, 2026
…story files (#9650)

Replicas can crash-loop when orphaned "future timeline" .history files exist in the WAL archive. This can occur during split-brain scenarios or other conditions where timeline history files are created for timelines that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from downloading timeline history files with timeline IDs greater than the cluster's current timeline. Primary instances retain full access.

The validation works by:
- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files

This prevents PostgreSQL from ever seeing the problematic history file, allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery via re-cloning), this provides complete coverage of timeline divergence scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit ef73994)
fix: auto-recover replicas with diverged timelines by re-cloning

When a replica's timeline diverges from the primary after a failover
(e.g., due to network issues or WAL archiving lag), PostgreSQL fails
to start with "requested timeline X is not a child of this server's
history". This left replicas in a crash-loop requiring manual PVC
deletion to recover.

This fix adds automatic recovery by:
- Detecting timeline mismatch between replica and primary during
  instance startup by checking cluster status
- Exiting with a specific exit code (6) when divergence is detected
- Operator detects this exit code and marks the instance as
  unrecoverable, triggering automatic PVC deletion and re-cloning
  via pg_basebackup

Closes #4990