Added logic for DB recovery from backup #1737

Walkthrough

Replaces `Backup` CR usage with VolumeSnapshot-based scheduled backups and retention. Adds a recovery-first bootstrap path when `DBRecovery` exists: initializing and updating `RecoveryStatus`, stopping NooBaa pods as needed, and cleaning up scheduled-backup resources. Adds helpers for snapshot listing and scheduled-backup naming; adds `ObjectMeta` to the CNPG `Backup` object construction.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Reconciler as Controller
    participant K8sAPI as Kubernetes API (CRs/Status)
    participant VS as VolumeSnapshot API
    participant CNPG as CNPG Cluster Runtime
    Note over Reconciler,K8sAPI: Reconciliation loop for CNPG cluster
    Reconciler->>K8sAPI: fetch CNPG cluster and DBRecovery
    alt DBRecovery exists
        Reconciler->>VS: listVolumeSnapshotsOrderByCreate(labels)
        VS-->>Reconciler: ordered VolumeSnapshot list
        Reconciler->>K8sAPI: create/update BootstrapConfiguration.Recovery (ref snapshot)
        Reconciler->>K8sAPI: create/initialize RecoveryStatus (Pending → Running)
        Reconciler->>K8sAPI: cleanupDBBackup(getBackupResourceName)
        Reconciler->>CNPG: trigger bootstrap from VolumeSnapshot (recovery bootstrap)
        CNPG-->>K8sAPI: update cluster status (Recovering → Ready)
    else No DBRecovery
        Reconciler->>K8sAPI: proceed with fresh bootstrap configuration
        CNPG-->>K8sAPI: update cluster status (Ready)
    end
    Note over Reconciler,VS: Scheduled backup reconciliation and retention operate on VolumeSnapshot objects (Next/Last times, cleanup of oldest snapshots)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- pkg/cnpg/cnpg.go (1 hunk)
- pkg/system/db_reconciler.go (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/cnpg/cnpg.go (2)
- pkg/hac/hac.go (1): `Name` (22-22)
- pkg/options/options.go (1): `Namespace` (57-57)
🪛 GitHub Actions: KMS Test - TLS Vault, K8S ServiceAccount Auth
pkg/system/db_reconciler.go
[error] 359-359: go build failed: undefined: r.cleanupDBBackup. Reconciler has no field or method cleanupDBBackup in pkg/system/db_reconciler.go. Command 'go build -o build/_output/bin/noobaa-operator-local -mod=vendor cmd/manager/main.go' exited with code 1.
🪛 GitHub Actions: KMS Test - TLS Vault, Token Auth
pkg/system/db_reconciler.go
[error] 359-359: r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)
🪛 GitHub Check: run-kms-tls-sa-test
pkg/system/db_reconciler.go
[failure] 359-359:
r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)
🪛 GitHub Check: run-kms-tls-token-test
pkg/system/db_reconciler.go
[failure] 359-359:
r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: run-operator-tests
- GitHub Check: run-cli-tests
- GitHub Check: run-core-config-map-tests
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-admission-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: golangci-lint
- GitHub Check: run-azure-vault-test
- GitHub Check: run-kms-dev-test
- GitHub Check: run-kms-kmip-test
- GitHub Check: run-hac-test
🔇 Additional comments (7)
pkg/cnpg/cnpg.go (1)

619-622: LGTM! The addition of `ObjectMeta` with `Name` and `Namespace` aligns with how other CNPG object factory functions in this file are structured (e.g., `GetCnpgScheduledBackupObj`, `GetCnpgClusterObj`). This ensures the Backup resource is properly identifiable when created or referenced.

pkg/system/db_reconciler.go (6)
151-191: Recovery-driven bootstrap flow looks well-structured. The conditional logic correctly separates fresh DB initialization from recovery scenarios. The status tracking with `DBClusterStatusCreating` vs `DBClusterStatusRecovering` and the `RecoveryStatus` struct provides good observability.
319-356: Recovery setup logic is correct. The approach of stopping NooBaa pods before initiating recovery, then configuring `Bootstrap.Recovery` with the `VolumeSnapshots` data source, follows the CNPG recovery pattern correctly. The error handling for pod termination is appropriate: returning an error to retry on the next reconciliation loop.
492-524: LGTM! The function correctly:
- Uses label-based filtering for CNPG cluster snapshots
- Applies name prefix filtering to isolate scheduled backup snapshots
- Sorts by creation timestamp (oldest first) which aligns with the retention deletion logic
301-306: Good defensive nil check. Adding the nil check for `VolumeSnapshot` before setting its properties prevents a potential nil pointer dereference when `Backup` exists but `VolumeSnapshot` hasn't been initialized.
79-81: LGTM!Correctly updates the recovery status to completed when the CNPG cluster becomes ready, with proper nil-check protection.
407-409: Verify the naming convention aligns with CNPG snapshot naming. The suffix changed from `-backup` to `-scheduled-backup`. Ensure this matches the naming pattern CNPG uses when creating VolumeSnapshots from ScheduledBackup resources, as `listVolumeSnapshotsOrderByCreate` filters snapshots by this prefix.
Force-pushed from 4e78f18 to 6840ac2
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)
451-500: Retention logic: typos and recovery-snapshot protection still need tightening

Two earlier concerns still appear here:

1. Typos in variable names: `avaialableItems` and `avaiallableSnapshots` are still misspelled; this is minor but noisy and easy to fix.
2. Recovery snapshot protection and MaxSnapshots semantics: deletion is skipped whenever `DBRecovery` is configured and the snapshot name matches `DBRecovery.VolumeSnapshotName`, regardless of whether recovery has already completed:

```go
if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
	// ...
}
```

Given you now track `RecoveryStatus.Status` (Pending/Running/Completed), this will keep the recovery snapshot indefinitely (and effectively allow `MaxSnapshots+1` snapshots) even long after a successful recovery, as long as `DBRecovery` remains set in the spec.

A more precise condition would only protect the snapshot while recovery is actually pending/running, not once it's completed. For example:

```diff
-	for _, snapshot := range snapshots[:numToDelete] {
-		if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+	recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+	isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+	for _, snapshot := range snapshots[:numToDelete] {
+		if isRecoveryActive &&
+			r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
+			r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
 			r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
 			avaialableItems = append(avaialableItems, snapshot)
 			continue
 		}
```

And optionally clean up the DBRecovery spec once recovery is fully completed if you don't want it to keep influencing retention.

Also, consider fixing the typos in the same area:

```diff
-	avaialableItems := snapshots
+	availableItems := snapshots
...
-	avaialableItems = avaialableItems[numToDelete:]
+	availableItems = availableItems[numToDelete:]
...
-	avaialableItems = append(avaialableItems, snapshot)
+	availableItems = append(availableItems, snapshot)
...
-	avaiallableSnapshots := []string{}
-	for _, item := range avaialableItems {
-		avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
-	}
-	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+	availableSnapshots := []string{}
+	for _, item := range availableItems {
+		availableSnapshots = append(availableSnapshots, item.Name)
+	}
+	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots
```
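The retention rule discussed here can be checked in isolation with plain slices. The sketch below is a stdlib-only model (function and parameter names are illustrative, not from the PR): it deletes the oldest surplus snapshots down to `maxSnapshots`, but keeps the recovery snapshot only while a recovery is still pending or running.

```go
package main

import "fmt"

// applyRetention models the retention rule over an oldest-first list of
// snapshot names. The oldest surplus snapshots are deleted down to
// maxSnapshots, except that the recovery snapshot is protected while a
// recovery is still in progress (recoveryActive).
func applyRetention(oldestFirst []string, maxSnapshots int, recoverySnapshot string, recoveryActive bool) (toDelete, available []string) {
	numToDelete := len(oldestFirst) - maxSnapshots
	if numToDelete <= 0 {
		return nil, oldestFirst
	}
	for _, name := range oldestFirst[:numToDelete] {
		if recoveryActive && name == recoverySnapshot {
			// protect the snapshot only while recovery is in progress
			available = append(available, name)
			continue
		}
		toDelete = append(toDelete, name)
	}
	available = append(available, oldestFirst[numToDelete:]...)
	return toDelete, available
}

func main() {
	snaps := []string{"s1", "s2", "s3", "s4"}

	// recovery still running: s1 is protected, so MaxSnapshots+1 survive
	del, avail := applyRetention(snaps, 2, "s1", true)
	fmt.Println(del, avail)

	// recovery completed: normal MaxSnapshots enforcement applies again
	del, avail = applyRetention(snaps, 2, "s1", false)
	fmt.Println(del, avail)
}
```

Note that in the protected case the model keeps `MaxSnapshots+1` snapshots, which is exactly the over-retention the review points out for the always-protected variant.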
📒 Files selected for processing (2)
- pkg/cnpg/cnpg.go (1 hunk)
- pkg/system/db_reconciler.go (10 hunks)
✅ Files skipped from review due to trivial changes (1)
- pkg/cnpg/cnpg.go
🔇 Additional comments (6)
pkg/system/db_reconciler.go (6)
3-23: Recovery completion wiring and new imports look correct

The added `time` and `labels` imports are used appropriately, and updating `RecoveryStatus.Status` to `DBRecoveryStatusCompleted` when `isClusterReady` is true cleanly closes the recovery state machine without risking nil-pointer access (guarded by the nil-check).

Also applies to: 79-81
287-306: Backup spec reconciliation for VolumeSnapshot mode looks consistent

Ensuring `Spec.Backup.VolumeSnapshot` is non-nil and forcing `Online` to `false` via a single `offlineBackup` variable keeps the CNPG backup configuration in sync with the NooBaa spec and avoids nil-pointer issues when DBBackup is toggled on/off.
319-330: cleanupDBBackup implementation matches intended behavior

Defining `cleanupDBBackup` to `KubeCheckQuiet` the scheduled backup and delete it if present (ignoring NotFound) matches the usage sites and resolves the earlier build failures where this method was missing. This should be safe to call from both the recovery and "backup disabled" flows.
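The delete-if-present, ignore-NotFound pattern described above can be sketched without a Kubernetes client. In the real code the check would use `apierrors.IsNotFound` on the client's `Delete` result; here a map-backed store and a sentinel error stand in, and all names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the Kubernetes NotFound error.
var errNotFound = errors.New("not found")

// store models a namespaced API holding ScheduledBackup-like objects.
type store map[string]struct{}

func (s store) delete(name string) error {
	if _, ok := s[name]; !ok {
		return errNotFound
	}
	delete(s, name)
	return nil
}

// cleanupScheduledBackup sketches the reviewed behavior: attempt deletion
// and treat "not found" as success, so the call is idempotent and safe
// from both the recovery flow and the "backup disabled" flow.
func cleanupScheduledBackup(s store, name string) error {
	if err := s.delete(name); err != nil && !errors.Is(err, errNotFound) {
		return err
	}
	return nil
}

func main() {
	s := store{"db-scheduled-backup": {}}
	fmt.Println(cleanupScheduledBackup(s, "db-scheduled-backup")) // deletes the object
	fmt.Println(cleanupScheduledBackup(s, "db-scheduled-backup")) // already gone: still nil
}
```

Idempotency is the point: the reconciler may call the cleanup on every loop, and a missing object must not surface as an error.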
332-380: Recovery bootstrap flow is generally solid; consider requeue semantics

The recovery setup sequence (marking `RecoveryStatus` as Pending, stopping core/endpoint pods, short-circuiting until all pods are gone, then wiring `Bootstrap.Recovery` and cleaning up the scheduled backup) is coherent and re-entrant. Once pods terminate, the reconcile proceeds to cluster creation with a consistent recovery bootstrap configuration. No functional issues spotted here.
505-537: VolumeSnapshot listing and ordering logic looks sound

`listVolumeSnapshotsOrderByCreate()` correctly:
- Lists VolumeSnapshots in the cluster namespace, filtered by `cnpg.io/cluster: <cluster-name>`.
- Narrows to those whose names start with `getBackupResourceName()` (aligning with your scheduled backup naming).
- Sorts the filtered slice by `CreationTimestamp` before returning.

This should give `reconcileBackupRetention` a stable oldest→newest ordering over just the scheduled-backup snapshots.
420-422: Backup resource name change may affect upgrade/cleanup behavior

Switching `getBackupResourceName()` to `"<cluster>-scheduled-backup"` changes both the `ScheduledBackup` name and the VolumeSnapshot name prefix used elsewhere. If existing clusters previously used a different backup resource name, those old `ScheduledBackup` CRs and their snapshots will no longer be cleaned up by `cleanupDBBackup` or `reconcileBackupRetention`.

If backward compatibility is needed, consider:
- Keeping the old name for existing clusters and only using the new suffix for fresh installs, or
- Extending `cleanupDBBackup` / retention logic to handle both the old and new naming schemes during an upgrade window.
Force-pushed from 6840ac2 to 353122f
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/system/db_reconciler.go (1)
502-502: Fix typo in variable name.

`avaiallableSnapshots` should be `availableSnapshots`.

```diff
-	avaiallableSnapshots := []string{}
+	availableSnapshots := []string{}
```

Also update the reference on line 506.
♻️ Duplicate comments (2)
pkg/system/db_reconciler.go (2)
471-471: Fix typo in variable name.

`avaialableItems` should be `availableItems`.

```diff
-	avaialableItems := snapshots
+	availableItems := snapshots
```

Also update all references to this variable (lines 474, 480, 503).
478-481: Recovery snapshot protected indefinitely beyond configured retention.

The recovery snapshot is excluded from deletion whenever `DBRecovery` is configured, regardless of whether recovery has completed. After a successful recovery, the `DBRecovery` config typically remains in the spec, causing the snapshot to be retained indefinitely and preventing `MaxSnapshots` enforcement from working correctly.

Consider checking the recovery status to only protect snapshots during active recovery:

```diff
-	if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+	// Only protect the recovery snapshot if recovery is still in progress
+	recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+	isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+	if isRecoveryActive && r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
 		r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
 		availableItems = append(availableItems, snapshot)
 		continue
 	}
```
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)
361-361: Follow Go naming conventions for unexported variables.

The variable `VolSnapshotAPIGroup` should start with a lowercase letter since it's an unexported (local) variable.

```diff
-	VolSnapshotAPIGroup := storagesnapshotv1.GroupName
+	volSnapshotAPIGroup := storagesnapshotv1.GroupName
```

And update its usage on line 371:

```diff
-	APIGroup: &VolSnapshotAPIGroup,
+	APIGroup: &volSnapshotAPIGroup,
```
📒 Files selected for processing (2)
- pkg/cnpg/cnpg.go (1 hunk)
- pkg/system/db_reconciler.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cnpg/cnpg.go
🔇 Additional comments (5)
pkg/system/db_reconciler.go (5)
79-81: LGTM! The logic correctly transitions recovery status to Completed when the cluster becomes ready, with proper nil-checking.

150-171: LGTM! The cluster creation logic properly handles both fresh installation and recovery scenarios with appropriate nil-checking for DBRecovery.

182-198: LGTM! The defensive nil-checking pattern correctly handles edge cases where Bootstrap.Recovery might be set from a previous reconciliation while DBRecovery is cleared, preventing nil dereferences.

325-336: LGTM! The method correctly implements the cleanup logic with appropriate error handling, ignoring NotFound errors as expected.

511-543: LGTM! The method correctly lists, filters, and sorts VolumeSnapshots by creation time, using appropriate label selectors and name prefix filtering.
Force-pushed from 353122f to 0c92392
Actionable comments posted: 0
♻️ Duplicate comments (2)
pkg/system/db_reconciler.go (2)
502-502: Fix typo in variable name.

The variable name `avaiallableSnapshots` contains a typo. This was flagged in previous reviews but remains unfixed.

```diff
-	avaiallableSnapshots := []string{}
+	availableSnapshots := []string{}
 	for _, item := range availableItems {
-		avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
+		availableSnapshots = append(availableSnapshots, item.Name)
 	}
-	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots
```
478-481: Recovery snapshot protected indefinitely after completion.

The recovery snapshot is currently protected whenever `DBRecovery` is configured, but this doesn't check whether the recovery has already completed. After a successful recovery, the snapshot will be retained indefinitely, preventing `MaxSnapshots` enforcement from working correctly.

As noted in previous reviews, the protection should only apply while recovery is active:

```diff
-	if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+	recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+	isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+	if isRecoveryActive && r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
 		r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
 		availableItems = append(availableItems, snapshot)
 		continue
 	}
```
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)
338-387: Recovery flow is sound, but minor naming convention issue.

The recovery orchestration correctly:
- Sets recovery status to pending
- Stops NooBaa pods and waits for termination
- Configures Bootstrap.Recovery with the VolumeSnapshot
- Cleans up scheduled backups to prevent conflicts

However, the local variable at line 361 should follow Go naming conventions:

```diff
-	VolSnapshotAPIGroup := storagesnapshotv1.GroupName
+	volSnapshotAPIGroup := storagesnapshotv1.GroupName
 	//setup once the pods are terminated, set recovery in the bootstrap configuration and continue to create the cluster
 	r.CNPGCluster.Spec.Bootstrap = &cnpgv1.BootstrapConfiguration{
 		Recovery: &cnpgv1.BootstrapRecovery{
 			Database: noobaaDBName,
 			Owner:    noobaaDBUser,
 			VolumeSnapshots: &cnpgv1.DataSource{
 				Storage: corev1.TypedLocalObjectReference{
 					Kind:     "VolumeSnapshot",
 					Name:     r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName,
-					APIGroup: &VolSnapshotAPIGroup,
+					APIGroup: &volSnapshotAPIGroup,
 				},
 			},
 		},
 	}
```
📒 Files selected for processing (2)
- pkg/cnpg/cnpg.go (1 hunk)
- pkg/system/db_reconciler.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cnpg/cnpg.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/db_reconciler.go (2)
- pkg/cnpg/cnpg.go (2): `GetCnpgScheduledBackupObj` (598-611), `GetCnpgBackupObj` (613-625)
- pkg/util/util.go (1): `KubeCheckQuiet` (620-633)
🔇 Additional comments (8)
pkg/system/db_reconciler.go (8)
8-8: LGTM: Necessary imports added. The `time` and `labels` imports are properly used for recovery timestamp tracking (line 196) and VolumeSnapshot label filtering (line 523).

Also applies to: 21-21
79-81: LGTM: Recovery completion properly tracked. The recovery status is correctly updated to `Completed` when the cluster becomes ready, with an appropriate nil guard.
150-171: LGTM: Cluster creation paths properly separated. The logic correctly distinguishes between fresh DB initialization and recovery from backup based on the presence of the `DBRecovery` configuration.
182-198: LGTM: Defensive status initialization implemented. The status update logic now properly handles edge cases where `DBRecovery` might be cleared between reconciliation attempts, falling back to previously recorded snapshot names. This addresses the concern raised in previous reviews.
307-311: LGTM: Safe initialization of backup configuration. The VolumeSnapshot configuration is properly initialized with nil checks before setting properties.
325-336: LGTM: Missing method now implemented. The `cleanupDBBackup` method is now properly implemented, addressing the build failure flagged in previous reviews. The implementation correctly uses `KubeCheckQuiet` and handles `NotFound` errors appropriately.
427-427: LGTM: Resource naming aligned with ScheduledBackup. The backup resource name change from `-backup` to `-scheduled-backup` correctly reflects the use of CNPG's ScheduledBackup resource.
511-543: LGTM: VolumeSnapshot listing properly implemented. The method correctly:
- Uses CNPG's standard cluster label for filtering
- Filters by scheduled backup name prefix to avoid unrelated snapshots
- Sorts chronologically by creation timestamp
This provides the foundation for retention policy enforcement.
- When `dbRecovery` is configured with a snapshot name, a recovery from backup is pending
- The recovery is triggered by a manual deletion of the Cluster resource by the user
- The recovered volume is excluded from the automatic volume cleanup

Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
Force-pushed from 0c92392 to 72fb2eb
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)
464-481: Recovery snapshot retention and typo in `avaiallableSnapshots`

- The retention logic always skips deletion when `DBRecovery` is non-nil and the snapshot name matches, regardless of whether recovery has already completed:

```go
if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name { ... }
```

That means as long as the user leaves `DBRecovery` set, this snapshot will never be aged out, even after `RecoveryStatus.Status` becomes `Completed`. If the intent is to protect only while recovery is pending/running, consider tightening this to something like:

```diff
-	if r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
-		r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+	recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+	isRecoveryActive := recoveryStatus != nil &&
+		recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+	if isRecoveryActive &&
+		r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
+		r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
 		r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
 		availableItems = append(availableItems, snapshot)
 		continue
 	}
```

This keeps the snapshot during active recovery but lets normal MaxSnapshots enforcement apply once recovery is done. (This is the same concern raised in an earlier review.)

- Minor naming nit: `avaiallableSnapshots` is misspelled and slightly confusing:

```go
avaiallableSnapshots := []string{}
for _, item := range availableItems {
	avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
}
r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
```

Consider renaming to `availableSnapshots`:

```diff
-	avaiallableSnapshots := []string{}
-	for _, item := range availableItems {
-		avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
-	}
-	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+	availableSnapshots := make([]string, 0, len(availableItems))
+	for _, item := range availableItems {
+		availableSnapshots = append(availableSnapshots, item.Name)
+	}
+	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots
```

Also applies to: 502-507
🧹 Nitpick comments (3)
pkg/system/db_reconciler.go (3)
153-171: Recovery bootstrap path is clean, but consider gating on "active" recovery, not just `DBRecovery` presence

The split between:
- "no `DBRecovery` → InitDB bootstrap", and
- "`DBRecovery` present → `reconcileClusterRecovery` + `Bootstrap.Recovery` + `DBClusterStatusRecovering` / `RecoveryStatus` (Pending → Running → Completed)"

is clear and avoids nil derefs on `DBRecovery`.

One semantic concern: as coded, any future deletion of the CNPG Cluster while `Spec.DBSpec.DBRecovery` is still set will trigger another recovery from the same snapshot, potentially discarding newer data even after a successful recovery.

You may want to gate the recovery bootstrap on an "active" recovery condition (e.g. `RecoveryStatus == nil` or `Status != Completed`) instead of just `DBRecovery != nil`, so a completed recovery isn't re-triggered accidentally. For example:

```diff
-	if r.NooBaa.Spec.DBSpec.DBRecovery == nil {
+	recoveryCfg := r.NooBaa.Spec.DBSpec.DBRecovery
+	recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+	isRecoveryActive := recoveryCfg != nil &&
+		(recoveryStatus == nil || recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted)
+
+	if !isRecoveryActive {
 		// InitDB bootstrap
 		...
 	} else {
 		// Recovery bootstrap
 		if err := r.reconcileClusterRecovery(); err != nil {
 			...
 		}
 	}
```

This keeps the current flow but makes "do another destructive recovery" an explicit action (e.g. by clearing/adjusting recovery status/spec) instead of happening on every cluster deletion while the spec field lingers.

Also applies to: 181-197, 338-347
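The "active recovery" gate suggested here reduces to a small predicate over two nilable fields. The sketch below models it with stand-in types (the real types live in the noobaa-operator API package; names here are illustrative).

```go
package main

import "fmt"

// Minimal stand-ins for the NooBaa spec/status fields involved.
type dbRecoverySpec struct{ VolumeSnapshotName string }

type recoveryStatus struct{ Status string }

const statusCompleted = "Completed"

// isRecoveryActive sketches the suggested gate: a recovery bootstrap should
// run only when DBRecovery is configured AND the recovery has not already
// completed, so deleting the Cluster after a successful recovery does not
// silently restore the old snapshot again.
func isRecoveryActive(spec *dbRecoverySpec, status *recoveryStatus) bool {
	if spec == nil {
		return false
	}
	return status == nil || status.Status != statusCompleted
}

func main() {
	spec := &dbRecoverySpec{VolumeSnapshotName: "snap-1"}
	fmt.Println(isRecoveryActive(spec, nil))                              // true: recovery pending
	fmt.Println(isRecoveryActive(spec, &recoveryStatus{"Running"}))       // true: in progress
	fmt.Println(isRecoveryActive(spec, &recoveryStatus{statusCompleted})) // false: done, don't re-trigger
	fmt.Println(isRecoveryActive(nil, nil))                               // false: no recovery requested
}
```

The same predicate could also drive the retention-time protection of the recovery snapshot, keeping both behaviors consistent.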
293-313: VolumeSnapshot backup configuration guards are sound; minor nit on `offlineBackup` naming

The additional checks ensuring `DBBackup.VolumeSnapshot` is non-nil and that `Spec.Backup.VolumeSnapshot` is initialized before assigning `ClassName` and `Online` look correct and consistent with `reconcileDBBackup`.

Minor readability nit: `offlineBackup := false` then passed as `&offlineBackup` to the `Online` field inverts semantics in the variable name. Renaming the local to `online := false` (or similar) would better reflect what's stored in the spec and avoid confusion when revisiting this code.
511-543: VolumeSnapshot listing logic is good; TypeMeta APIVersion is slightly off and unnecessary

The `Client.List` call with a namespace and `LabelSelectorFromValidatedSet({"cnpg.io/cluster": <name>})` followed by a name-prefix filter is a reasonable way to scope snapshots to this cluster's scheduled backups.

Two minor nits around the list struct:

```go
volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{
	TypeMeta: metav1.TypeMeta{
		APIVersion: storagesnapshotv1.GroupName,
		Kind:       "VolumeSnapshotList",
	},
}
```

- `storagesnapshotv1.GroupName` is just the group (e.g. `"snapshot.storage.k8s.io"`), not the full `"snapshot.storage.k8s.io/v1"` APIVersion.
- For controller-runtime `Client.List`, setting `TypeMeta` is not required; the scheme already knows the correct GVK for `VolumeSnapshotList`.

You can simplify and avoid any APIVersion confusion by dropping the `TypeMeta` entirely:

```diff
-func (r *Reconciler) listVolumeSnapshotsOrderByCreate() ([]storagesnapshotv1.VolumeSnapshot, error) {
-	volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{
-		TypeMeta: metav1.TypeMeta{
-			APIVersion: storagesnapshotv1.GroupName,
-			Kind:       "VolumeSnapshotList",
-		},
-	}
+func (r *Reconciler) listVolumeSnapshotsOrderByCreate() ([]storagesnapshotv1.VolumeSnapshot, error) {
+	volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{}
```

The rest of the function (filtering by prefix and sorting by `CreationTimestamp`) looks good.
📒 Files selected for processing (2)
- pkg/cnpg/cnpg.go (1 hunk)
- pkg/system/db_reconciler.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/cnpg/cnpg.go
🔇 Additional comments (2)
pkg/system/db_reconciler.go (2)
73-81: RecoveryStatus transition to `Completed` on ready cluster looks consistent

Marking `DBStatus.RecoveryStatus.Status = DBRecoveryStatusCompleted` once `isClusterReady` succeeds keeps recovery state in sync with the actual cluster readiness and avoids a permanently "Running/Pending" recovery flag. No issues here.
325-336: Support both legacy and new backup resource names during upgrades to prevent orphaned resources

The concern that renaming the backup resource suffix from `"-backup"` to `"-scheduled-backup"` could break upgrades is valid. Existing `ScheduledBackup` objects with the old name will not be cleaned up by `cleanupDBBackup()`, retention filtering will skip snapshots created by the legacy schedule, and a second schedule under the new name may be created, doubling backup operations.

This requires validating:
- Whether the naming change from `"-backup"` to `"-scheduled-backup"` has actually been deployed
- Whether `reconcileScheduledBackup()` creates a new resource if the old one exists
- Whether snapshot retention filtering at the mentioned line ranges is vulnerable to orphaned legacy resources

If confirmed, implement backward compatibility by:
- Supporting both naming conventions in `cleanupDBBackup()`, `listVolumeSnapshotsOrderByCreate()`, and any other filtering logic
- Ensuring at least one release cycle handles both old and new names before fully migrating
- Adding migration logic or clear upgrade documentation if this is already in a released version
alphaprinz left a comment

LGTM.
Had a few comments/questions.
```go
// cnpg will perform the import using the PV of the new DB instance. We need to make sure the PV is large enough.
// If the used percentage is over the threshold (currently 33%), double the size of the PV.
usedSpaceThreshold := 33
```
Volume capacity usage is no longer needed for import?
Or did you decide it's not needed regardless of this change?
In 4.20 to 4.21 upgrade, we don't have an import anymore (both are CNPG)
```go
	return err
}
if r.NooBaa.Status.DBStatus.RecoveryStatus != nil {
	r.NooBaa.Status.DBStatus.RecoveryStatus.Status = nbv1.DBRecoveryStatusCompleted
```
Does this come with a timestamp? If not, it can be confusing.
The RecoveryTime is set once when the recovery starts. After that, only the status is modified
```go
	}
}
} else {
	// Recovery configuration found, set bootstrap configuration to recover from the snapshot
```
This is because user explicitly deleted the cluster, right?
If so I would add it to the comment.
sure. I will add a comment in the next PR
Explain the changes

- When `dbRecovery` is configured with a snapshot name, a recovery from backup is pending

Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
Summary by CodeRabbit
New Features
Bug Fixes / Changes