fix: auto-recover replicas with diverged timelines by re-cloning #9637

Draft
armru wants to merge 3 commits into main from dev/bug-history

Conversation

@armru
Member

armru commented Jan 7, 2026

fix: auto-recover replicas with diverged timelines by re-cloning

When a replica's timeline diverges from the primary after a failover
(e.g., due to network issues or WAL archiving lag), PostgreSQL fails
to start with "requested timeline X is not a child of this server's
history". This left replicas in a crash-loop requiring manual PVC
deletion to recover.

This fix adds automatic recovery by:

  • Detecting timeline mismatch between replica and primary during
    instance startup by checking cluster status
  • Exiting with a specific exit code (6) when divergence is detected
  • Having the operator detect this exit code and mark the instance as
    unrecoverable, which triggers automatic PVC deletion and re-cloning
    via pg_basebackup
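
The flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the exit code value (6) comes from the description, while the names `exitCodeTimelineDiverged`, `startupExitCode`, and `actionForExitCode`, and the plain inequality used as the "mismatch" rule, are assumptions made for the example. The real check may well be stricter, for instance to avoid flagging a replica that is merely behind.

```go
package main

import "fmt"

// exitCodeTimelineDiverged is the exit code (6) named in the PR description.
// The constant name is hypothetical.
const exitCodeTimelineDiverged = 6

// Actions the operator could take after an instance exits (hypothetical names).
const (
	actionNone    = "none"
	actionRestart = "restart"
	actionReclone = "delete PVC and re-clone"
)

// startupExitCode returns the exit code the instance would use at startup.
// ASSUMPTION: "mismatch" is modeled as simple inequality of timeline IDs.
func startupExitCode(replicaTimeline, primaryTimeline int) int {
	if replicaTimeline != primaryTimeline {
		return exitCodeTimelineDiverged
	}
	return 0
}

// actionForExitCode maps an exit code to the operator's reaction.
// Only the dedicated divergence code leads to a re-clone.
func actionForExitCode(code int) string {
	switch code {
	case 0:
		return actionNone
	case exitCodeTimelineDiverged:
		return actionReclone
	default:
		return actionRestart
	}
}

func main() {
	code := startupExitCode(3, 4)
	fmt.Printf("exit code %d, operator action: %s\n", code, actionForExitCode(code))
}
```

Using a dedicated exit code keeps the divergence case distinguishable from ordinary crashes, so the destructive step (deleting the PVC) is only reached through one explicit path.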

Closes #4990

@cnpg-bot added the backport-requested ◀️, release-1.25, release-1.27, and release-1.28 labels Jan 7, 2026
@github-actions
Contributor

github-actions bot commented Jan 7, 2026

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

@armru
Member Author

armru commented Jan 7, 2026

/test

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20781437972

@armru changed the title from "fix: auto-recover replicas with diverged timelines using pg_rewind" to "fix: auto-recover replicas with diverged timelines" Jan 7, 2026
@armru force-pushed the dev/bug-history branch 2 times, most recently from 56c6d16 to c60e639 January 7, 2026 13:28
@armru changed the title from "fix: auto-recover replicas with diverged timelines" to "fix: auto-recover replicas with diverged timelines by re-cloning" Jan 7, 2026
@armru
Member Author

armru commented Jan 7, 2026

/test

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20783536292

mnencia added a commit that referenced this pull request Jan 8, 2026
…story files

Replicas can crash-loop when orphaned "future timeline" .history files
exist in the WAL archive. This can occur during split-brain scenarios
or other conditions where timeline history files are created for timelines
that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from
downloading timeline history files with timeline IDs greater than the
cluster's current timeline. Primary instances retain full access.

The validation works by:
- Parsing timeline ID from .history filenames (e.g., 00000022.history)
- Checking if the instance is a primary or replica
- For replicas, rejecting files where fileTimeline > clusterTimeline
- Returning "file not found" to PostgreSQL for rejected files
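
The validation steps above could look roughly like the sketch below. It is an illustration under stated assumptions rather than the referenced commit's code: the function names `parseHistoryTimeline` and `shouldRejectHistoryFile` are hypothetical, and how the cluster's current timeline and the instance role are obtained is left out. The one detail taken from PostgreSQL itself is that history file names encode the timeline ID as eight hexadecimal digits, so `00000022.history` is timeline 34.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// historyFileRe matches timeline history file names: eight hex digits
// followed by ".history".
var historyFileRe = regexp.MustCompile(`^([0-9A-Fa-f]{8})\.history$`)

// parseHistoryTimeline extracts the timeline ID from a history file name.
// The second return value is false when the name is not a history file.
func parseHistoryTimeline(name string) (int, bool) {
	m := historyFileRe.FindStringSubmatch(name)
	if m == nil {
		return 0, false
	}
	tli, err := strconv.ParseUint(m[1], 16, 32)
	if err != nil {
		return 0, false
	}
	return int(tli), true
}

// shouldRejectHistoryFile reports whether a WAL restore request should be
// answered with "file not found". Primaries are never restricted; replicas
// are refused history files newer than the cluster's current timeline.
func shouldRejectHistoryFile(name string, isPrimary bool, clusterTimeline int) bool {
	if isPrimary {
		return false
	}
	fileTimeline, ok := parseHistoryTimeline(name)
	if !ok {
		return false // not a history file, so this rule does not apply
	}
	return fileTimeline > clusterTimeline
}

func main() {
	// A replica in a cluster on timeline 33 asks for the history of timeline 0x22 (34).
	fmt.Println(shouldRejectHistoryFile("00000022.history", false, 33))
}
```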

This prevents PostgreSQL from ever seeing the problematic history file,
allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery
via re-cloning), this provides complete coverage of timeline divergence
scenarios.

Closes #4188

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
mnencia added a commit that referenced this pull request Jan 9, 2026
…story files

armru pushed a commit that referenced this pull request Jan 12, 2026
…story files

armru added 2 commits January 12, 2026 15:38
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@armru
Member Author

armru commented Jan 12, 2026

/test

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20924190000

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
mnencia added a commit that referenced this pull request Jan 12, 2026
…story files

gbartolini pushed a commit that referenced this pull request Jan 14, 2026
…story files

mnencia added a commit that referenced this pull request Jan 19, 2026
…story files

mnencia added a commit that referenced this pull request Jan 20, 2026
…story files

mnencia added a commit that referenced this pull request Jan 20, 2026
…story files (#9650)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
cnpg-bot pushed a commit that referenced this pull request Jan 20, 2026
…story files (#9650)

(cherry picked from commit ef73994)
Development

Successfully merging this pull request may close these issues.

[Bug]: Replica does not come up: checkpoint not found in timeline

2 participants