fix(walrestore): prevent replicas from downloading future timeline history files #9650

Merged
mnencia merged 6 commits into main from dev/4188 on Jan 20, 2026
mnencia (Member) commented Jan 8, 2026

Replicas can crash-loop when orphaned "future timeline" .history files
exist in the WAL archive. This can occur during split-brain scenarios
or other conditions where timeline history files are created for timelines
that the cluster never officially adopts.

This fix adds validation during WAL restore to prevent replicas from
downloading timeline history files with timeline IDs greater than the
cluster's current timeline. Primary instances retain full access.

The validation works by:

  • Parsing the timeline ID from .history filenames (e.g., 00000022.history, where the 8-digit prefix is hexadecimal)
  • Checking whether the instance is a primary or a replica
  • For replicas, rejecting files where fileTimeline > clusterTimeline
  • Returning "file not found" to PostgreSQL for rejected files
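
The four steps above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the names `parseHistoryTimeline`, `checkHistoryFile`, and `errFileNotFound` are hypothetical, and the sketch assumes that names which are not well-formed history files are simply left to the normal restore path.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errFileNotFound is a hypothetical sentinel standing in for whatever the
// restore command uses to tell PostgreSQL that the file does not exist.
var errFileNotFound = errors.New("file not found")

// parseHistoryTimeline returns the timeline ID encoded in a name such as
// "00000022.history" (8 hexadecimal digits). ok is false for any other name.
func parseHistoryTimeline(name string) (tli uint32, ok bool) {
	if !strings.HasSuffix(name, ".history") {
		return 0, false
	}
	prefix := strings.TrimSuffix(name, ".history")
	if len(prefix) != 8 {
		return 0, false
	}
	v, err := strconv.ParseUint(prefix, 16, 32)
	if err != nil {
		return 0, false
	}
	return uint32(v), true
}

// checkHistoryFile applies the replica-only rule: a history file whose
// timeline is greater than clusterTimeline is rejected. Primaries and
// non-history files pass through unchanged (assumption of this sketch).
func checkHistoryFile(name string, isPrimary bool, clusterTimeline uint32) error {
	fileTimeline, isHistory := parseHistoryTimeline(name)
	if !isHistory || isPrimary {
		return nil
	}
	if fileTimeline > clusterTimeline {
		return fmt.Errorf("%w: %s (timeline %d) is ahead of cluster timeline %d",
			errFileNotFound, name, fileTimeline, clusterTimeline)
	}
	return nil
}

func main() {
	// 0x22 is 34 in decimal, so a replica on timeline 3 must not see this file.
	fmt.Println(checkHistoryFile("00000022.history", false, 3))
	// The same request on a primary is allowed.
	fmt.Println(checkHistoryFile("00000022.history", true, 3))
	// A history file for the current timeline is allowed on a replica.
	fmt.Println(checkHistoryFile("00000003.history", false, 3))
}
```

Parsing the prefix as base 16 matters: reading 00000022 as decimal would give 22 rather than 34 and could let a future timeline slip past the comparison.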

This prevents PostgreSQL from ever seeing the problematic history file,
allowing normal recovery to proceed. Combined with PR #9637 (auto-recovery
via re-cloning), this provides complete coverage of timeline divergence
scenarios.

Closes #4188

mnencia requested a review from a team as a code owner January 8, 2026 16:32
dosubot bot added the size:L label Jan 8, 2026
cnpg-bot added the backport-requested ◀️, release-1.25, release-1.27, and release-1.28 labels Jan 8, 2026

github-actions bot (Contributor) commented Jan 8, 2026

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

dosubot bot added the bug 🐛 label Jan 8, 2026

mnencia (Member, Author) commented Jan 8, 2026

/test

github-actions bot (Contributor) commented Jan 8, 2026

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20824279808

mnencia force-pushed the dev/4188 branch 3 times, most recently from 3ffa226 to 34bf9be January 8, 2026 17:17
cnpg-bot added the ok to merge 👌 label Jan 8, 2026
mnencia force-pushed the dev/4188 branch 3 times, most recently from b60ebdc to 6e3957e January 9, 2026 14:11

mnencia (Member, Author) commented Jan 9, 2026

/test

github-actions bot (Contributor) commented Jan 9, 2026

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20854827166

armru (Member) commented Jan 12, 2026

/test

github-actions bot (Contributor) commented

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20917830755

dosubot bot added the lgtm label Jan 12, 2026

mnencia (Member, Author) commented Jan 12, 2026

/test

github-actions bot (Contributor) commented

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20930738286

mnencia and others added 6 commits January 20, 2026 10:16
…story files

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Add comprehensive unit tests for the timeline history file
validation logic. Tests cover regular WAL files, invalid filenames,
primary behavior, and replica behavior with current, past, and
future timelines.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
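
The case matrix named in this commit message (regular WAL files, invalid filenames, primary, and replica with current, past, and future timelines) could be laid out as a table like the one below. This is only a sketch: `allowed`, `testCase`, and `cases` are hypothetical names, the cluster timeline of 2 is an arbitrary choice, and the behaviour for invalid filenames (pass through to the normal restore path) is an assumption rather than something this PR states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allowed reports whether a restore request may proceed. It is a compact,
// hypothetical stand-in for the validation under test.
func allowed(name string, isPrimary bool, clusterTimeline uint32) bool {
	prefix := strings.TrimSuffix(name, ".history")
	if prefix == name || len(prefix) != 8 {
		return true // not a well-formed history file name: nothing to check (assumption)
	}
	tli, err := strconv.ParseUint(prefix, 16, 32)
	if err != nil {
		return true // unparsable timeline ID: left to the normal restore path (assumption)
	}
	return isPrimary || uint32(tli) <= clusterTimeline
}

// testCase describes one row of the matrix.
type testCase struct {
	desc      string
	name      string
	isPrimary bool
	want      bool
}

// cases returns the matrix for a cluster whose current timeline is 2.
func cases() []testCase {
	return []testCase{
		{"regular WAL file", "000000020000000000000001", false, true},
		{"invalid history filename", "zzzzzzzz.history", false, true},
		{"primary, future timeline", "00000003.history", true, true},
		{"replica, current timeline", "00000002.history", false, true},
		{"replica, past timeline", "00000001.history", false, true},
		{"replica, future timeline", "00000003.history", false, false},
	}
}

func main() {
	const clusterTimeline = 2
	for _, c := range cases() {
		status := "ok"
		if allowed(c.name, c.isPrimary, clusterTimeline) != c.want {
			status = "FAIL"
		}
		fmt.Printf("%-28s %s\n", c.desc, status)
	}
}
```

Only the last row is expected to be rejected; every other row checks that the validation does not interfere with requests it should ignore.
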
Add E2E test demonstrating protection against future timeline
history files through a backup and restore scenario. The test creates
two clusters sharing a WAL archive, where cluster 2 creates timeline 2
history files. When cluster 1 scales up, the new replica successfully
joins without crash-looping, validating the protection works correctly.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Move the timeline divergence test into its own Context block with its
own BeforeAll hook that creates a dedicated namespace and MinIO setup.
This allows the test to run independently without requiring the main
backup test cluster to remain running, reducing peak resource usage
from 5 to 3 PostgreSQL instances during timeline test execution.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
mnencia merged commit ef73994 into main Jan 20, 2026
34 checks passed
mnencia deleted the dev/4188 branch January 20, 2026 13:02
cnpg-bot pushed a commit that referenced this pull request Jan 20, 2026
…story files (#9650)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit ef73994)
mnencia added a commit that referenced this pull request Feb 3, 2026
Move timeline history file validation to execute before any WAL restore
attempt (plugin or in-tree) rather than only for in-tree restores. This
prevents replicas from downloading timeline history files with timeline
IDs higher than the cluster's current timeline when plugins handle WAL
restore.

The timeline protection added in #9650 applied only to in-tree WAL
restore, not to plugin-based restore. This allowed the protection to be
bypassed entirely when plugins handled WAL restore, so replicas could
download future timeline history files and fail with timeline mismatch errors.

Fixes the timeline protection introduced in ef73994.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
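
The ordering change described in this commit message can be sketched as follows: the validation runs once, before any restore backend is consulted, so it applies regardless of which backend would have served the file. The types and names here (`restoreFunc`, `restore`, `errFileNotFound`) are hypothetical, and the "try the plugin, then fall back to in-tree" dispatch is an assumption made for illustration, not a description of the project's actual plugin interface.

```go
package main

import (
	"errors"
	"fmt"
)

// errFileNotFound is a hypothetical sentinel for "file not found".
var errFileNotFound = errors.New("file not found")

// restoreFunc fetches one WAL file. handled is false when the backend does
// not manage this archive and the next backend should be tried.
type restoreFunc func(name string) (handled bool, err error)

// restore runs validate before any backend, so a rejection cannot be
// bypassed by whichever backend would otherwise have served the file.
func restore(name string, validate func(string) error, backends ...restoreFunc) error {
	if err := validate(name); err != nil {
		return err
	}
	for _, b := range backends {
		handled, err := b(name)
		if handled || err != nil {
			return err
		}
	}
	return errFileNotFound
}

func main() {
	calls := 0
	plugin := func(string) (bool, error) { calls++; return true, nil }
	inTree := func(string) (bool, error) { calls++; return true, nil }
	rejectAll := func(string) error { return errFileNotFound }

	// With a rejecting validator, neither backend is ever invoked.
	err := restore("00000022.history", rejectAll, plugin, inTree)
	fmt.Println(err, calls)
}
```

Placing the check inside only one backend reproduces the gap this commit closes: a request served by the other backend would never be validated.
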
leonardoce pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
leonardoce added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
leonardoce added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit 1044c8e)
(cherry picked from commit d0e801ee5392465d8629056c79e58c32b215dcff)
leonardoce added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit 1044c8e)
(cherry picked from commit d0e801ee5392465d8629056c79e58c32b215dcff)
(cherry picked from commit 9e2e469)
mnencia added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit 1044c8e)
mnencia added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit 1044c8e)
(cherry picked from commit d0e801ee5392465d8629056c79e58c32b215dcff)
mnencia added a commit that referenced this pull request Feb 4, 2026
…9849)

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
(cherry picked from commit 1044c8e)
(cherry picked from commit d0e801ee5392465d8629056c79e58c32b215dcff)
(cherry picked from commit 9e2e469)
Labels

  • backport-requested ◀️: This pull request should be backported to all supported releases
  • bug 🐛: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • ok to merge 👌: This PR can be merged
  • release-1.25
  • release-1.27
  • release-1.28
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug]: Replicas crash-loop when future timeline history files exist in WAL archive

4 participants