fix(backup): use instance session ID to detect instance manager restarts #9370
Merged
Member, Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19924624981
1d9f291 to e2ceb17
Member, Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19925770807
e2ceb17 to 30d2b47
Member, Author
/test
Contributor
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20268263352
6568549 to eea3f0b
Member
/test
Contributor
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20432990340
mnencia approved these changes Dec 22, 2025
cc85f68 to ac94a7c
36534cc to 30ee553
Member
/test ft=backup-restore
Contributor
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/21032882195
jbattiato approved these changes Jan 16, 2026
This fix addresses an issue where backups would hang indefinitely during operator releases. When the operator and barman plugin were updated simultaneously, the backup goroutine running in the instance manager would be killed (due to the instance manager binary being replaced via exec()), but the operator would still think the backup was running because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager restarts by tracking the ExecutableHash:

- Added `ExecutableHash` field to `InstanceID` struct in backup status
- When a backup starts, the instance manager's ExecutableHash is stored
- On each reconcile, the stored hash is compared with the current hash
- If the hashes differ, the instance manager was restarted/upgraded and the backup process is dead, marking the backup as FAILED

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
This commit improves the detection of instance manager restarts during backup operations by introducing an instance session ID instead of relying on the executable hash alone.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
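The commit message does not say how the session ID is produced. The sketch below assumes it is a random value generated once when the instance manager process starts, so that a restart of the same binary still yields a new ID, which an executable hash cannot do. `newSessionID` is a hypothetical helper, not a function from the CloudNativePG codebase.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newSessionID is a hypothetical helper: it returns 16 random bytes
// encoded as 32 hex characters. The real project may build its
// session ID differently.
func newSessionID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	// Assumption: the ID is generated once per process start and kept
	// in memory, so every restart (including exec() of a new binary)
	// reports a different value.
	id, err := newSessionID()
	if err != nil {
		panic(err)
	}
	fmt.Println("session ID for this process:", id)
}
```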
Remove fallback that returned partial status without SessionID. Return retriable error when status is unavailable to ensure SessionID is always present for restart detection.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
When a backup starts before SessionID support and the operator upgrades, the stored SessionID is empty but the current one is not. Previously this returned false (not restarted), causing backups to hang. Now we detect this scenario and fail safe by marking as restarted.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Remove the guard condition that prevented isInstanceManagerRestarted() from being called when SessionID is empty. The function itself correctly handles the empty SessionID case by detecting operator upgrades during running backups.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
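The two commits above describe the comparison rules, which can be sketched as follows. The function name `isInstanceManagerRestarted` comes from the commit message, but the signature and body here are assumptions made for illustration, not the project's actual code. The case where both IDs are empty is not covered by the commit messages, and returning `false` for it is a guess.

```go
package main

import "fmt"

// isInstanceManagerRestarted reports whether the instance manager that
// started the backup is no longer the one currently running.
// Hypothetical signature: stored is the SessionID recorded when the
// backup started, current is the one reported now.
func isInstanceManagerRestarted(stored, current string) bool {
	if stored == "" {
		// The backup started before SessionID support existed. If the
		// current instance manager reports a SessionID, it has been
		// upgraded in the meantime: fail safe and report a restart.
		// Assumption: with both IDs empty there is nothing to compare,
		// so no restart is reported.
		return current != ""
	}
	// Normal case: any change of session ID means a restart.
	return stored != current
}

func main() {
	cases := []struct{ stored, current string }{
		{"a1", "a1"}, // same session
		{"a1", "b2"}, // restarted
		{"", "b2"},   // backup predates SessionID support
	}
	for _, c := range cases {
		fmt.Printf("stored=%q current=%q restarted=%v\n",
			c.stored, c.current, isInstanceManagerRestarted(c.stored, c.current))
	}
}
```

With this shape, the caller needs no guard on an empty stored SessionID, which matches the removal of the guard condition described in the last commit.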
c6b757d to 82d0789
mnencia added a commit that referenced this pull request Jan 16, 2026

…rts (#9370)

This fix addresses an issue where backups would hang indefinitely during operator releases. When the operator and barman plugin were updated simultaneously, the backup goroutine running in the instance manager would be killed (due to the instance manager binary being replaced via exec()), but the operator would still think the backup was running because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager restarts by tracking the SessionID.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 102d785)
mnencia added a commit that referenced this pull request Jan 16, 2026

…rts (#9370)

This fix addresses an issue where backups would hang indefinitely during operator releases. When the operator and barman plugin were updated simultaneously, the backup goroutine running in the instance manager would be killed (due to the instance manager binary being replaced via exec()), but the operator would still think the backup was running because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager restarts by tracking the SessionID.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 102d785)
This fix addresses an issue where backups would hang indefinitely during operator releases. When the operator and barman plugin were updated simultaneously, the backup goroutine running in the instance manager would be killed (due to the instance manager binary being replaced via exec()), but the operator would still think the backup was running because the container ID hadn't changed.
The fix introduces a deterministic mechanism to detect instance manager restarts by tracking the SessionID.