Skip to content

fix(backup): use instance session ID to detect instance manager restarts#9370

Merged
mnencia merged 7 commits intomainfrom
dev/fix-backup-stuck
Jan 16, 2026
Merged

fix(backup): use instance session ID to detect instance manager restarts#9370
mnencia merged 7 commits intomainfrom
dev/fix-backup-stuck

Conversation

@armru
Copy link
Member

@armru armru commented Dec 4, 2025

This fix addresses an issue where backups would hang indefinitely during operator releases. When the operator and barman plugin were updated simultaneously, the backup goroutine running in the instance manager would be killed (due to the instance manager binary being replaced via exec()), but the operator would still think the backup was running because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager restarts by tracking the SessionID

@armru armru requested a review from a team as a code owner December 4, 2025 09:46
@armru armru added no-issue do not backport This PR must not be backported - it will be in the next minor release labels Dec 4, 2025
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug 🐛 Something isn't working ok to merge 👌 This PR can be merged labels Dec 4, 2025
@armru
Copy link
Member Author

armru commented Dec 4, 2025

/test

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19924624981

@armru armru force-pushed the dev/fix-backup-stuck branch from 1d9f291 to e2ceb17 Compare December 4, 2025 10:27
@armru armru requested a review from jsilvela as a code owner December 4, 2025 10:27
@armru
Copy link
Member Author

armru commented Dec 4, 2025

/test

@github-actions
Copy link
Contributor

github-actions bot commented Dec 4, 2025

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19925770807

@armru armru force-pushed the dev/fix-backup-stuck branch from e2ceb17 to 30d2b47 Compare December 16, 2025 12:39
@armru armru changed the title fix(backup): detect instance manager hash change fix(backup): use instance session ID to detect instance manager restarts Dec 16, 2025
@armru
Copy link
Member Author

armru commented Dec 16, 2025

/test

@github-actions
Copy link
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20268263352

@mnencia mnencia force-pushed the dev/fix-backup-stuck branch from 6568549 to eea3f0b Compare December 22, 2025 13:15
@mnencia
Copy link
Member

mnencia commented Dec 22, 2025

/test

@github-actions
Copy link
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/20432990340

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 22, 2025
@jbattiato jbattiato force-pushed the dev/fix-backup-stuck branch from cc85f68 to ac94a7c Compare December 24, 2025 17:35
@mnencia mnencia force-pushed the dev/fix-backup-stuck branch 2 times, most recently from 36534cc to 30ee553 Compare January 15, 2026 13:08
@mnencia
Copy link
Member

mnencia commented Jan 15, 2026

/test ft=backup-restore

@github-actions
Copy link
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/21032882195

armru and others added 3 commits January 16, 2026 12:59
This fix addresses an issue where backups would hang indefinitely
during operator releases. When the operator and barman plugin were updated
simultaneously, the backup goroutine running in the instance manager would
be killed (due to the instance manager binary being replaced via exec()),
but the operator would still think the backup was running because the
container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager
restarts by tracking the ExecutableHash:

- Added `ExecutableHash` field to `InstanceID` struct in backup status
- When a backup starts, the instance manager's ExecutableHash is stored
- On each reconcile, the stored hash is compared with the current hash
- If the hashes differ, the instance manager was restarted/upgraded and
  the backup process is dead, marking the backup as FAILED

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
This commit improves the detection of instance manager restarts during
backup operations by introducing an instance session ID instead of relying
on the executable hash alone.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Remove fallback that returned partial status without SessionID.
Return retriable error when status is unavailable to ensure
SessionID is always present for restart detection.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
When a backup starts before SessionID support and the operator upgrades,
the stored SessionID is empty but the current one is not. Previously
this returned false (not restarted), causing backups to hang.

Now we detect this scenario and fail safe by marking as restarted.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Remove the guard condition that prevented isInstanceManagerRestarted()
from being called when SessionID is empty. The function itself correctly
handles the empty SessionID case by detecting operator upgrades during
running backups.

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@jbattiato jbattiato force-pushed the dev/fix-backup-stuck branch from c6b757d to 82d0789 Compare January 16, 2026 11:59
@mnencia mnencia added backport-requested ◀️ This pull request should be backported to all supported releases release-1.25 release-1.27 release-1.28 and removed do not backport This PR must not be backported - it will be in the next minor release labels Jan 16, 2026
@mnencia mnencia merged commit 102d785 into main Jan 16, 2026
44 checks passed
@mnencia mnencia deleted the dev/fix-backup-stuck branch January 16, 2026 12:59
mnencia added a commit that referenced this pull request Jan 16, 2026
…rts (#9370)

This fix addresses an issue where backups would hang indefinitely during
operator releases. When the operator and barman plugin were updated
simultaneously, the backup goroutine running in the instance manager
would be killed (due to the instance manager binary being replaced via
exec()), but the operator would still think the backup was running
because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager
restarts by tracking the SessionID

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 102d785)
mnencia added a commit that referenced this pull request Jan 16, 2026
…rts (#9370)

This fix addresses an issue where backups would hang indefinitely during
operator releases. When the operator and barman plugin were updated
simultaneously, the backup goroutine running in the instance manager
would be killed (due to the instance manager binary being replaced via
exec()), but the operator would still think the backup was running
because the container ID hadn't changed.

The fix introduces a deterministic mechanism to detect instance manager
restarts by tracking the SessionID

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 102d785)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-requested ◀️ This pull request should be backported to all supported releases bug 🐛 Something isn't working lgtm This PR has been approved by a maintainer no-issue ok to merge 👌 This PR can be merged release-1.27 release-1.28 size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants