
Added logic for DB recovery from backup #1737

Merged
dannyzaken merged 1 commit into noobaa:master from dannyzaken:danny-db-backup
Dec 2, 2025

Conversation

@dannyzaken
Member

@dannyzaken dannyzaken commented Nov 30, 2025

Explain the changes

  • When dbRecovery is configured with a snapshot name, a recovery from backup is pending
    • The recovery is triggered by a manual deletion of the Cluster resource by the user
    • The recovered volume is excluded from the automatic volume cleanup

Issues: Fixed #xxx / Gap #xxx

  1. https://issues.redhat.com/browse/RHSTOR-8125

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Recovery-first cluster bootstrap using a selected volume snapshot.
    • Visible recovery status in DB cluster status (Pending → Completed) with snapshot and timing details.
    • Snapshot-based scheduled backups and retention: backups ordered by snapshot creation; Next/Last times reflect snapshot status.
    • Automatic cleanup of conflicting scheduled backups during recovery.
  • Bug Fixes / Changes

    • Scheduled backup naming and metadata updated to reduce conflicts (uses "-scheduled-backup" and includes name/namespace).
    • Backup objects now include explicit name and namespace metadata.


@coderabbitai

coderabbitai bot commented Nov 30, 2025

Walkthrough

Replaces Backup CR usage with VolumeSnapshot-based scheduled backups and retention. Adds recovery-first bootstrap path when DBRecovery exists, initializing and updating RecoveryStatus, stopping NooBaa pods as needed, and cleaning scheduled-backup resources. Adds helpers for snapshot listing and scheduled-backup naming; adds ObjectMeta to CNPG Backup object construction.

Changes

  • pkg/cnpg/cnpg.go (backup metadata): Adds an ObjectMeta (Name, Namespace) to the Backup object literal returned by GetCnpgBackupObj.
  • pkg/system/db_reconciler.go (DB reconciler & snapshot-based backups): Introduces a recovery-first flow and new methods: reconcileClusterRecovery(), cleanupDBBackup(), listVolumeSnapshotsOrderByCreate(); changes getBackupResourceName() to return "<cluster>-scheduled-backup". Switches scheduled backup and retention logic to use VolumeSnapshot objects (label-based filtering, creation-time ordering), initializes RecoveryStatus, configures BootstrapConfiguration.Recovery to reference snapshots, and deletes existing scheduled-backup resources during recovery.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Reconciler as Controller
    participant K8sAPI as Kubernetes API (CRs/Status)
    participant VS as VolumeSnapshot API
    participant CNPG as CNPG Cluster Runtime

    Note over Reconciler,K8sAPI: Reconciliation loop for CNPG cluster
    Reconciler->>K8sAPI: fetch CNPG cluster and DBRecovery
    alt DBRecovery exists
        Reconciler->>VS: listVolumeSnapshotsOrderByCreate(labels)
        VS-->>Reconciler: ordered VolumeSnapshot list
        Reconciler->>K8sAPI: create/update BootstrapConfiguration.Recovery (ref snapshot)
        Reconciler->>K8sAPI: create/initialize RecoveryStatus (Pending → Running)
        Reconciler->>K8sAPI: cleanupDBBackup(getBackupResourceName)
        Reconciler->>CNPG: trigger bootstrap from VolumeSnapshot (recovery bootstrap)
        CNPG-->>K8sAPI: update cluster status (Recovering → Ready)
    else No DBRecovery
        Reconciler->>K8sAPI: proceed with fresh bootstrap configuration
        CNPG-->>K8sAPI: update cluster status (Ready)
    end
    Note over Reconciler,VS: Scheduled backup reconciliation and retention operate on VolumeSnapshot objects (Next/Last times, cleanup of oldest snapshots)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing careful review:
    • Recovery vs fresh bootstrap branching and lifecycle transitions for RecoveryStatus.
    • Correct selection/filtering and ordering of VolumeSnapshots by labels and creationTimestamp.
    • Cleanup semantics for scheduled-backup resources and skipping/deleting the recovery snapshot.
    • getBackupResourceName change and any callers relying on previous naming.

Suggested reviewers

  • alphaprinz
  • nimrod-becker

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title accurately describes the main change: adding logic for database recovery from backup, which matches the primary purpose of the pull request.
  • Description check — ✅ Passed: The description covers the main changes and references a relevant issue, but lacks detailed testing instructions and has unchecked documentation/testing checkboxes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7396ac6 and 4e78f18.

📒 Files selected for processing (2)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/cnpg/cnpg.go (2)
pkg/hac/hac.go (1)
  • Name (22-22)
pkg/options/options.go (1)
  • Namespace (57-57)
🪛 GitHub Actions: KMS Test - TLS Vault, K8S ServiceAccount Auth
pkg/system/db_reconciler.go

[error] 359-359: go build failed: undefined: r.cleanupDBBackup. Reconciler has no field or method cleanupDBBackup in pkg/system/db_reconciler.go. Command 'go build -o build/_output/bin/noobaa-operator-local -mod=vendor cmd/manager/main.go' exited with code 1.

🪛 GitHub Actions: KMS Test - TLS Vault, Token Auth
pkg/system/db_reconciler.go

[error] 359-359: r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)

🪛 GitHub Check: run-kms-tls-sa-test
pkg/system/db_reconciler.go

[failure] 359-359:
r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)

🪛 GitHub Check: run-kms-tls-token-test
pkg/system/db_reconciler.go

[failure] 359-359:
r.cleanupDBBackup undefined (type *Reconciler has no field or method cleanupDBBackup)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: run-operator-tests
  • GitHub Check: run-cli-tests
  • GitHub Check: run-core-config-map-tests
  • GitHub Check: cnpg-deployment-test
  • GitHub Check: run-admission-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: golangci-lint
  • GitHub Check: run-azure-vault-test
  • GitHub Check: run-kms-dev-test
  • GitHub Check: run-kms-kmip-test
  • GitHub Check: run-hac-test
🔇 Additional comments (7)
pkg/cnpg/cnpg.go (1)

619-622: LGTM!

The addition of ObjectMeta with Name and Namespace aligns with how other CNPG object factory functions in this file are structured (e.g., GetCnpgScheduledBackupObj, GetCnpgClusterObj). This ensures the Backup resource is properly identifiable when created or referenced.

pkg/system/db_reconciler.go (6)

151-191: Recovery-driven bootstrap flow looks well-structured.

The conditional logic correctly separates fresh DB initialization from recovery scenarios. The status tracking with DBClusterStatusCreating vs DBClusterStatusRecovering and the RecoveryStatus struct provide good observability.


319-356: Recovery setup logic is correct.

The approach of stopping NooBaa pods before initiating recovery, then configuring Bootstrap.Recovery with the VolumeSnapshots data source follows the CNPG recovery pattern correctly. The error handling for pod termination is appropriate—returning an error to retry on the next reconciliation loop.


492-524: LGTM!

The function correctly:

  • Uses label-based filtering for CNPG cluster snapshots
  • Applies name prefix filtering to isolate scheduled backup snapshots
  • Sorts by creation timestamp (oldest first) which aligns with the retention deletion logic

301-306: Good defensive nil check.

Adding the nil check for VolumeSnapshot before setting its properties prevents a potential nil pointer dereference when Backup exists but VolumeSnapshot hasn't been initialized.


79-81: LGTM!

Correctly updates the recovery status to completed when the CNPG cluster becomes ready, with proper nil-check protection.


407-409: Verify the naming convention aligns with CNPG snapshot naming.

The suffix changed from -backup to -scheduled-backup. Ensure this matches the naming pattern CNPG uses when creating VolumeSnapshots from ScheduledBackup resources, as listVolumeSnapshotsOrderByCreate filters snapshots by this prefix.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)

451-500: Retention logic: typos and recovery-snapshot protection still need tightening

Two earlier concerns still appear here:

  1. Typos in variable names: avaialableItems and avaiallableSnapshots are still misspelled; this is minor but noisy and easy to fix.

  2. Recovery snapshot protection and MaxSnapshots semantics – deletion is skipped whenever DBRecovery is configured and the snapshot name matches DBRecovery.VolumeSnapshotName, regardless of whether recovery has already completed:

if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
    // ...
}

Given you now track RecoveryStatus.Status (Pending/Running/Completed), this will keep the recovery snapshot indefinitely (and effectively allow MaxSnapshots+1 snapshots) even long after a successful recovery, as long as DBRecovery remains set in the spec.

A more precise condition would only protect the snapshot while recovery is actually pending/running, not once it’s completed. For example:

-        for _, snapshot := range snapshots[:numToDelete] {
-            if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+        recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+        isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+        for _, snapshot := range snapshots[:numToDelete] {
+            if isRecoveryActive &&
+                r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
+                r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
                 r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
                 avaialableItems = append(avaialableItems, snapshot)
                 continue
             }

And optionally clean up the DBRecovery spec once recovery is fully completed if you don’t want it to keep influencing retention.

Also, consider fixing the typos in the same area:

-    avaialableItems := snapshots
+    availableItems := snapshots
    ...
-        avaialableItems = avaialableItems[numToDelete:]
+        availableItems = availableItems[numToDelete:]
    ...
-                avaialableItems = append(avaialableItems, snapshot)
+                availableItems = append(availableItems, snapshot)
    ...
-    avaiallableSnapshots := []string{}
-    for _, item := range avaialableItems {
-        avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
-    }
-    r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+    availableSnapshots := []string{}
+    for _, item := range availableItems {
+        availableSnapshots = append(availableSnapshots, item.Name)
+    }
+    r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 4e78f18 and 6840ac2.

📒 Files selected for processing (2)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (10 hunks)
✅ Files skipped from review due to trivial changes (1)
  • pkg/cnpg/cnpg.go
🔇 Additional comments (6)
pkg/system/db_reconciler.go (6)

3-23: Recovery completion wiring and new imports look correct

The added time and labels imports are used appropriately, and updating RecoveryStatus.Status to DBRecoveryStatusCompleted when isClusterReady is true cleanly closes the recovery state machine without risking nil-pointer access (guarded by the nil-check).

Also applies to: 79-81


287-306: Backup spec reconciliation for VolumeSnapshot mode looks consistent

Ensuring Spec.Backup.VolumeSnapshot is non-nil and forcing Online to false via a single offlineBackup variable keeps the CNPG backup configuration in sync with the NooBaa spec and avoids nil-pointer issues when DBBackup is toggled on/off.


319-330: cleanupDBBackup implementation matches intended behavior

Defining cleanupDBBackup to KubeCheckQuiet the scheduled backup and delete it if present (ignoring NotFound) matches the usage sites and resolves the earlier build failures where this method was missing. This should be safe to call from both recovery and “backup disabled” flows.


332-380: Recovery bootstrap flow is generally solid; consider requeue semantics

The recovery setup sequence—marking RecoveryStatus as Pending, stopping core/endpoint pods, short‑circuiting until all pods are gone, then wiring Bootstrap.Recovery and cleaning up the scheduled backup—is coherent and re‑entrant. Once pods terminate, the reconcile proceeds to cluster creation with a consistent recovery bootstrap configuration. No functional issues spotted here.


505-537: VolumeSnapshot listing and ordering logic looks sound

listVolumeSnapshotsOrderByCreate() correctly:

  • Lists VolumeSnapshots in the cluster namespace, filtered by cnpg.io/cluster: <cluster-name>.
  • Narrows to those whose names start with getBackupResourceName() (aligning with your scheduled backup naming).
  • Sorts the filtered slice by CreationTimestamp before returning.

This should give reconcileBackupRetention a stable oldest→newest ordering over just the scheduled-backup snapshots.


420-422: Backup resource name change may affect upgrade/cleanup behavior

Switching getBackupResourceName() to "<cluster>-scheduled-backup" changes both the ScheduledBackup name and the VolumeSnapshot name prefix used elsewhere. If existing clusters previously used a different backup resource name, those old ScheduledBackup CRs and their snapshots will no longer be cleaned up by cleanupDBBackup or reconcileBackupRetention.

If backward compatibility is needed, consider:

  • Keeping the old name for existing clusters and only using the new suffix for fresh installs, or
  • Extending cleanupDBBackup / retention logic to handle both the old and new naming schemes during an upgrade window.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/system/db_reconciler.go (1)

502-502: Fix typo in variable name.

avaiallableSnapshots should be availableSnapshots.

-	avaiallableSnapshots := []string{}
+	availableSnapshots := []string{}

Also update the reference on line 506.

♻️ Duplicate comments (2)
pkg/system/db_reconciler.go (2)

471-471: Fix typo in variable name.

avaialableItems should be availableItems.

-	avaialableItems := snapshots
+	availableItems := snapshots

Also update all references to this variable (lines 474, 480, 503).


478-481: Recovery snapshot protected indefinitely beyond configured retention.

The recovery snapshot is excluded from deletion whenever DBRecovery is configured, regardless of whether recovery has completed. After a successful recovery, the DBRecovery config typically remains in the spec, causing the snapshot to be retained indefinitely and preventing MaxSnapshots enforcement from working correctly.

Consider checking the recovery status to only protect snapshots during active recovery:

-		if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
-			r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
-			avaialableItems = append(avaialableItems, snapshot)
-			continue
+		// Only protect the recovery snapshot if recovery is still in progress
+		recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+		isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+		if isRecoveryActive && r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+			r.cnpgLog("skipping deletion of recovery snapshot %s (recovery in progress)", snapshot.Name)
+			availableItems = append(availableItems, snapshot)
+			continue
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)

361-361: Follow Go naming conventions for unexported variables.

The variable VolSnapshotAPIGroup should start with a lowercase letter since it's an unexported (local) variable.

-	VolSnapshotAPIGroup := storagesnapshotv1.GroupName
+	volSnapshotAPIGroup := storagesnapshotv1.GroupName

And update its usage on line 371:

-			APIGroup: &VolSnapshotAPIGroup,
+			APIGroup: &volSnapshotAPIGroup,
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 6840ac2 and 353122f.

📒 Files selected for processing (2)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (10 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cnpg/cnpg.go
🔇 Additional comments (5)
pkg/system/db_reconciler.go (5)

79-81: LGTM!

The logic correctly transitions recovery status to Completed when the cluster becomes ready, with proper nil-checking.


150-171: LGTM!

The cluster creation logic properly handles both fresh installation and recovery scenarios with appropriate nil-checking for DBRecovery.


182-198: LGTM!

The defensive nil-checking pattern correctly handles edge cases where Bootstrap.Recovery might be set from a previous reconciliation while DBRecovery is cleared, preventing nil dereferences.


325-336: LGTM!

The method correctly implements the cleanup logic with appropriate error handling, ignoring NotFound errors as expected.


511-543: LGTM!

The method correctly lists, filters, and sorts VolumeSnapshots by creation time, using appropriate label selectors and name prefix filtering.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
pkg/system/db_reconciler.go (2)

502-502: Fix typo in variable name.

The variable name avaiallableSnapshots contains a typo. This was flagged in previous reviews but remains unfixed.

-	avaiallableSnapshots := []string{}
+	availableSnapshots := []string{}
 	for _, item := range availableItems {
-		avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
+		availableSnapshots = append(availableSnapshots, item.Name)
 	}
-	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+	r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots

478-481: Recovery snapshot protected indefinitely after completion.

The recovery snapshot is currently protected whenever DBRecovery is configured, but this doesn't check whether the recovery has already completed. After a successful recovery, the snapshot will be retained indefinitely, preventing MaxSnapshots enforcement from working correctly.

As noted in previous reviews, the protection should only apply while recovery is active:

-			if r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+			recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+			isRecoveryActive := recoveryStatus != nil && recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+			if isRecoveryActive && r.NooBaa.Spec.DBSpec.DBRecovery != nil && r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
 				r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
 				availableItems = append(availableItems, snapshot)
 				continue
 			}
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)

338-387: Recovery flow is sound, but minor naming convention issue.

The recovery orchestration correctly:

  1. Sets recovery status to pending
  2. Stops NooBaa pods and waits for termination
  3. Configures Bootstrap.Recovery with the VolumeSnapshot
  4. Cleans up scheduled backups to prevent conflicts

However, the local variable at line 361 should follow Go naming conventions:

-	VolSnapshotAPIGroup := storagesnapshotv1.GroupName
+	volSnapshotAPIGroup := storagesnapshotv1.GroupName
 	//setup once the pods are terminated, set recovery in the bootstrap configuration and continue to create the cluster
 	r.CNPGCluster.Spec.Bootstrap = &cnpgv1.BootstrapConfiguration{
 		Recovery: &cnpgv1.BootstrapRecovery{
 			Database: noobaaDBName,
 			Owner:    noobaaDBUser,
 			VolumeSnapshots: &cnpgv1.DataSource{
 				Storage: corev1.TypedLocalObjectReference{
 					Kind:     "VolumeSnapshot",
 					Name:     r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName,
-					APIGroup: &VolSnapshotAPIGroup,
+					APIGroup: &volSnapshotAPIGroup,
 				},
 			},
 		},
 	}
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 353122f and 0c92392.

📒 Files selected for processing (2)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cnpg/cnpg.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/db_reconciler.go (2)
pkg/cnpg/cnpg.go (2)
  • GetCnpgScheduledBackupObj (598-611)
  • GetCnpgBackupObj (613-625)
pkg/util/util.go (1)
  • KubeCheckQuiet (620-633)
🔇 Additional comments (8)
pkg/system/db_reconciler.go (8)

8-8: LGTM: Necessary imports added.

The time and labels imports are properly used for recovery timestamp tracking (line 196) and VolumeSnapshot label filtering (line 523).

Also applies to: 21-21


79-81: LGTM: Recovery completion properly tracked.

The recovery status is correctly updated to Completed when the cluster becomes ready, with appropriate nil guard.


150-171: LGTM: Cluster creation paths properly separated.

The logic correctly distinguishes between fresh DB initialization and recovery from backup based on the presence of DBRecovery configuration.


182-198: LGTM: Defensive status initialization implemented.

The status update logic now properly handles edge cases where DBRecovery might be cleared between reconciliation attempts, falling back to previously recorded snapshot names. This addresses the concern raised in previous reviews.


307-311: LGTM: Safe initialization of backup configuration.

The VolumeSnapshot configuration is properly initialized with nil checks before setting properties.


325-336: LGTM: Missing method now implemented.

The cleanupDBBackup method is now properly implemented, addressing the build failure flagged in previous reviews. The implementation correctly uses KubeCheckQuiet and handles NotFound errors appropriately.


427-427: LGTM: Resource naming aligned with ScheduledBackup.

The backup resource name change from -backup to -scheduled-backup correctly reflects the use of CNPG's ScheduledBackup resource.


511-543: LGTM: VolumeSnapshot listing properly implemented.

The method correctly:

  • Uses CNPG's standard cluster label for filtering
  • Filters by scheduled backup name prefix to avoid unrelated snapshots
  • Sorts chronologically by creation timestamp

This provides the foundation for retention policy enforcement.

- When `dbRecovery` is configured with a snapshot name, a recovery from backup is pending
  - The recovery is triggered by a manual deletion of the Cluster resource by the user
  - The recovered volume is excluded from the automatic volume cleanup

Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)

464-481: Recovery snapshot retention and typo in avaiallableSnapshots

  1. The retention logic always skips deletion when DBRecovery is non-nil and the snapshot name matches, regardless of whether recovery has already completed:
if r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
   r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
    ...
}

That means as long as the user leaves DBRecovery set, this snapshot will never be aged out, even after RecoveryStatus.Status becomes Completed. If the intent is to protect only while recovery is pending/running, consider tightening this to something like:

-        if r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
-            r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
+        recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+        isRecoveryActive := recoveryStatus != nil &&
+            recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted
+        if isRecoveryActive &&
+            r.NooBaa.Spec.DBSpec.DBRecovery != nil &&
+            r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName == snapshot.Name {
             r.cnpgLog("skipping deletion of recovery snapshot %s", snapshot.Name)
             availableItems = append(availableItems, snapshot)
             continue
         }

This keeps the snapshot during active recovery but lets normal MaxSnapshots enforcement apply once recovery is done. (This is the same concern raised in an earlier review.)

  1. Minor naming nit: avaiallableSnapshots is misspelled and slightly confusing:
avaiallableSnapshots := []string{}
for _, item := range availableItems {
    avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
}
r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots

Consider renaming to availableSnapshots:

-    avaiallableSnapshots := []string{}
-    for _, item := range availableItems {
-        avaiallableSnapshots = append(avaiallableSnapshots, item.Name)
-    }
-    r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = avaiallableSnapshots
+    availableSnapshots := make([]string, 0, len(availableItems))
+    for _, item := range availableItems {
+        availableSnapshots = append(availableSnapshots, item.Name)
+    }
+    r.NooBaa.Status.DBStatus.BackupStatus.AvailableSnapshots = availableSnapshots

Also applies to: 502-507

🧹 Nitpick comments (3)
pkg/system/db_reconciler.go (3)

153-171: Recovery bootstrap path is clean, but consider gating on “active” recovery, not just DBRecovery presence

The split between:

  • “no DBRecovery → InitDB bootstrap”, and
  • DBRecovery present → reconcileClusterRecovery + Bootstrap.Recovery + DBClusterStatusRecovering / RecoveryStatus{Pending→Running→Completed}

is clear and avoids nil derefs on DBRecovery.

One semantic concern: as coded, any future deletion of the CNPG Cluster while Spec.DBSpec.DBRecovery is still set will trigger another recovery from the same snapshot, potentially discarding newer data even after a successful recovery.

You may want to gate the recovery bootstrap on an “active” recovery condition (e.g. RecoveryStatus == nil or Status != Completed) instead of just DBRecovery != nil, so a completed recovery isn’t re-triggered accidentally. For example:

-    if r.NooBaa.Spec.DBSpec.DBRecovery == nil {
+    recoveryCfg := r.NooBaa.Spec.DBSpec.DBRecovery
+    recoveryStatus := r.NooBaa.Status.DBStatus.RecoveryStatus
+    isRecoveryActive := recoveryCfg != nil &&
+        (recoveryStatus == nil || recoveryStatus.Status != nbv1.DBRecoveryStatusCompleted)
+
+    if !isRecoveryActive {
         // InitDB bootstrap
         ...
-    } else {
+    } else {
         // Recovery bootstrap
         if err := r.reconcileClusterRecovery(); err != nil {
             ...
         }
     }

This keeps the current flow but makes “do another destructive recovery” an explicit action (e.g. by clearing/adjusting recovery status/spec) instead of happening on every cluster deletion while the spec field lingers.

Also applies to: 181-197, 338-347


293-313: VolumeSnapshot backup configuration guards are sound; minor nit on offlineBackup naming

The additional checks ensuring DBBackup.VolumeSnapshot is non-nil and that Spec.Backup.VolumeSnapshot is initialized before assigning ClassName and Online look correct and consistent with reconcileDBBackup.

Minor readability nit: offlineBackup := false then passed as &offlineBackup to the Online field inverts semantics in the variable name. Renaming the local to online := false (or similar) would better reflect what’s stored in the spec and avoid confusion when revisiting this code.


511-543: VolumeSnapshot listing logic is good; TypeMeta APIVersion is slightly off and unnecessary

The Client.List call with a namespace and LabelSelectorFromValidatedSet({"cnpg.io/cluster": <name>}) followed by a name-prefix filter is a reasonable way to scope snapshots to this cluster’s scheduled backups.

Two minor nits around the list struct:

volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{
    TypeMeta: metav1.TypeMeta{
        APIVersion: storagesnapshotv1.GroupName,
        Kind:       "VolumeSnapshotList",
    },
}
  • storagesnapshotv1.GroupName is just the group (e.g. "snapshot.storage.k8s.io"), not the full "snapshot.storage.k8s.io/v1" APIVersion.
  • For controller-runtime Client.List, setting TypeMeta is not required; the scheme already knows the correct GVK for VolumeSnapshotList.

You can simplify and avoid any APIVersion confusion by dropping the TypeMeta entirely:

-func (r *Reconciler) listVolumeSnapshotsOrderByCreate() ([]storagesnapshotv1.VolumeSnapshot, error) {
-    volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{
-        TypeMeta: metav1.TypeMeta{
-            APIVersion: storagesnapshotv1.GroupName,
-            Kind:       "VolumeSnapshotList",
-        },
-    }
+func (r *Reconciler) listVolumeSnapshotsOrderByCreate() ([]storagesnapshotv1.VolumeSnapshot, error) {
+    volumeSnapshots := storagesnapshotv1.VolumeSnapshotList{}

The rest of the function (filtering by prefix and sorting by CreationTimestamp) looks good.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 0c92392 and 72fb2eb.

📒 Files selected for processing (2)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cnpg/cnpg.go
🔇 Additional comments (2)
pkg/system/db_reconciler.go (2)

73-81: RecoveryStatus transition to Completed on ready cluster looks consistent

Marking DBStatus.RecoveryStatus.Status = DBRecoveryStatusCompleted once isClusterReady succeeds keeps recovery state in sync with the actual cluster readiness and avoids a permanently “Running/Pending” recovery flag. No issues here.


325-336: Support both legacy and new backup resource names during upgrades to prevent orphaned resources

The concern that renaming the backup resource suffix from "-backup" to "-scheduled-backup" could break upgrades is valid. Existing ScheduledBackup objects with the old name will not be cleaned up by cleanupDBBackup(), retention filtering will skip snapshots created by the legacy schedule, and a second schedule under the new name may be created, doubling backup operations.

This requires validating:

  • Whether the naming change from "-backup" to "-scheduled-backup" has actually been deployed
  • Whether reconcileScheduledBackup() creates a new resource if the old one exists
  • Whether snapshot retention filtering at the mentioned line ranges is vulnerable to orphaned legacy resources

If confirmed, implement backward compatibility by:

  1. Supporting both naming conventions in cleanupDBBackup(), listVolumeSnapshotsOrderByCreate(), and any other filtering logic
  2. Ensuring at least one release cycle handles both old and new names before fully migrating
  3. Adding migration logic or clear upgrade documentation if this is already in a released version
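
One way to implement point 1 is a single name-matching helper shared by cleanupDBBackup(), listVolumeSnapshotsOrderByCreate(), and any other filtering logic. The sketch below is an assumption about how such a helper could look, with hypothetical names; it is not code from the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Legacy and current scheduled-backup name suffixes from the review
// discussion; matching both keeps pre-rename resources eligible for
// cleanup and retention filtering across an upgrade.
var backupSuffixes = []string{"-scheduled-backup", "-backup"}

// isScheduledBackupName reports whether name belongs to the cluster's
// scheduled backups under either naming convention.
func isScheduledBackupName(name, cluster string) bool {
	for _, suffix := range backupSuffixes {
		if strings.HasPrefix(name, cluster+suffix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isScheduledBackupName("noobaa-db-backup-20251201", "noobaa-db"))           // legacy name
	fmt.Println(isScheduledBackupName("noobaa-db-scheduled-backup-20251202", "noobaa-db")) // new name
	fmt.Println(isScheduledBackupName("unrelated-snap", "noobaa-db"))                      // unrelated
}
```

Once a release cycle has passed with both suffixes recognized, the legacy entry can be dropped from the slice.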

Contributor

@alphaprinz left a comment

LGTM.
Had a few comments/questions.


// cnpg will perform the import using the PV of the new DB instance. We need to make sure the PV is large enough.
// If the used percentage is over the threshold (currently 33%), double the size of the PV.
usedSpaceThreshold := 33
Contributor
Volume capacity usage is no longer needed for import?
Or did you decide it's not needed regardless of this change?

Member Author
In the 4.20 to 4.21 upgrade, we don't have an import anymore (both versions are CNPG).

		return err
	}
	if r.NooBaa.Status.DBStatus.RecoveryStatus != nil {
		r.NooBaa.Status.DBStatus.RecoveryStatus.Status = nbv1.DBRecoveryStatusCompleted
Contributor
Does this come with a timestamp? If not, it can be confusing.

Member Author
The RecoveryTime is set once when the recovery starts. After that, only the status is modified

		}
	}
} else {
	// Recovery configuration found, set bootstrap configuration to recover from the snapshot
Contributor
This is because the user explicitly deleted the cluster, right?
If so, I would add it to the comment.

Member Author
Sure, I will add a comment in the next PR.

@dannyzaken dannyzaken merged commit da00325 into noobaa:master Dec 2, 2025
14 of 17 checks passed
