
Added a CLI command to take a single DB backup #1740

Merged
dannyzaken merged 1 commit into noobaa:master from dannyzaken:danny-backup-cli
Dec 2, 2025

Conversation


@dannyzaken dannyzaken commented Dec 2, 2025

Depends on #1737

Explain the changes

  1. Added a CLI command to take a single on-demand DB backup
  2. Accepts a --name flag to pass a custom name. The default name is noobaa-db-pg-cluster-backup-[date and time], e.g. noobaa-db-pg-cluster-backup-20251201181523

Issues: Fixed #xxx / Gap #xxx

  1. https://issues.redhat.com/browse/RHSTOR-8126

Testing Instructions:

  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • CLI command to create CNPG backups as volume snapshots with progress and monitoring hints.
  • Improvements

    • Recovery-aware bootstrap flow to initialize clusters from snapshots when recovery is configured.
    • Scheduled backups now target standby and existing backups are cleaned up to avoid snapshot conflicts.
    • Clearer lifecycle tracking and status transitions for creation vs. recovery.
  • Bug Fixes

    • Backup objects now initialize their spec reliably so backups are created consistently.



coderabbitai bot commented Dec 2, 2025

Walkthrough

Initializes CNPG Backup objects’ Spec; adds a CLI subcommand to create CNPG backups via VolumeSnapshots; and implements recovery-aware DB reconciler branching to bootstrap clusters from VolumeSnapshots, including snapshot listing and cleanup during recovery.

Changes

Cohort / File(s) Summary
CNPG Backup Object Initialization
pkg/cnpg/cnpg.go
GetCnpgBackupObj now sets ObjectMeta (Name, Namespace) and initializes Spec with an empty BackupSpec in the returned Backup object.
DB Reconciliation & Recovery Workflow
pkg/system/db_reconciler.go
Adds recovery-aware bootstrap branching when DBRecovery is present (reconcileClusterRecovery), wires bootstrap.Recovery from snapshots, updates creation status handling to include Recovering, changes in-progress gating to exclude Recovering, adds scheduled-backup Target = Standby, cleans up backups during recovery, lists/filters snapshots and records RecoveryStatus metadata.
CLI Database Backup Subcommand
pkg/system/system.go
Adds CmdDBBackup() and RunDBBackup to create CNPG Backup CRs targeting standby via VolumeSnapshot, validates system/config and snapshot class, generates backup names, creates the Backup CR, polls for the resulting VolumeSnapshot, and prints monitoring/kubectl hints.

Sequence Diagram(s)

sequenceDiagram
    participant Admin as Admin CLI
    participant System as system.RunDBBackup
    participant CNPG as CNPG Backup CR (controller)
    participant Reconciler as DB Reconciler
    participant VolumeSnap as VolumeSnapshot API
    participant NooBaa as NooBaa pods
    participant Status as Cluster Status

    Admin->>System: run db-backup
    System->>CNPG: create Backup CR (target: Standby, method: VolumeSnapshot)
    CNPG->>VolumeSnap: request/create VolumeSnapshot (async)
    VolumeSnap-->>CNPG: snapshot created / ready
    CNPG-->>System: Backup status updated with snapshot ref
    System->>Admin: poll/report snapshot readiness + kubectl tips

    Note over Reconciler,Status: Recovery-aware bootstrap flow
    Reconciler->>Status: detect DBRecovery present?
    alt Recovery configured
        Reconciler->>NooBaa: ensure pods stopped for restore
        NooBaa-->>Reconciler: pods stopped
        Reconciler->>VolumeSnap: list/filter snapshots (by cluster/age)
        Reconciler->>Reconciler: cleanup scheduled backups if conflicting
        Reconciler->>CNPG: set Bootstrap.Recovery (snapshot source)
        Reconciler->>Status: set RecoveryStatus (snapshot name/time, Pending)
    else Fresh init
        Reconciler->>CNPG: set Bootstrap.InitDB/import flow
        Reconciler->>Status: set Creating
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • pkg/system/db_reconciler.go — recovery vs init branching, state transitions for RecoveryStatus, gating logic around Creating/Recovering.
    • Snapshot listing/filtering and scheduled-backup cleanup to avoid accidental data loss or conflicting backups.
    • pkg/system/system.go — polling/timeouts and VolumeSnapshot readiness detection.
    • pkg/cnpg/cnpg.go — correctness of Backup CR initialization (metadata and empty Spec).


Suggested reviewers

  • jackyalbo
  • alphaprinz

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Added a CLI command to take a single DB backup' accurately summarizes the main change: introducing a new CLI subcommand for on-demand database backups.
Description check ✅ Passed The description addresses the template sections with explanation of changes, issue reference, and testing checklist items, though testing instructions remain empty and doc/test checkboxes are unchecked.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.


@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 4

🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)

182-198: Unreachable fallback code for snapshot name.

At line 187, the condition r.NooBaa.Spec.DBSpec.DBRecovery != nil is always true in this branch because we're inside the else block that starts at line 165, which is only entered when r.NooBaa.Spec.DBSpec.DBRecovery != nil (the inverse of line 154's condition). Therefore, lines 189-192 are unreachable.

Consider simplifying:

-		snapshotName := ""
-		if r.NooBaa.Spec.DBSpec.DBRecovery != nil {
-			snapshotName = r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName
-		} else if r.NooBaa.Status.DBStatus.RecoveryStatus != nil {
-			// fall back to previously recorded snapshot name, if any
-			snapshotName = r.NooBaa.Status.DBStatus.RecoveryStatus.SnapshotName
-		}
+		snapshotName := r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName
 		r.NooBaa.Status.DBStatus.RecoveryStatus = &nbv1.DBRecoveryStatus{
 			Status:       nbv1.DBRecoveryStatusRunning,
 			SnapshotName: snapshotName,
 			RecoveryTime: &metav1.Time{Time: time.Now()},
 		}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 078a893 and 53939e9.

📒 Files selected for processing (3)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (10 hunks)
  • pkg/system/system.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/cnpg/cnpg.go (2)
pkg/hac/hac.go (1)
  • Name (22-22)
pkg/options/options.go (1)
  • Namespace (57-57)
pkg/system/db_reconciler.go (1)
pkg/cnpg/cnpg.go (2)
  • GetCnpgScheduledBackupObj (598-611)
  • GetCnpgBackupObj (613-626)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: run-operator-tests
  • GitHub Check: run-core-config-map-tests
  • GitHub Check: run-cli-tests
  • GitHub Check: cnpg-deployment-test
  • GitHub Check: run-kms-dev-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: run-admission-test
  • GitHub Check: run-kms-tls-token-test
  • GitHub Check: run-kms-kmip-test
  • GitHub Check: run-azure-vault-test
  • GitHub Check: golangci-lint
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-hac-test
🔇 Additional comments (8)
pkg/cnpg/cnpg.go (1)

613-626: LGTM!

The initialization of ObjectMeta with Name and Namespace and the empty BackupSpec{} aligns with how similar factory functions in this file are implemented (e.g., GetCnpgScheduledBackupObj, GetCnpgClusterObj). This ensures the backup object is properly initialized for downstream usage.
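The pattern the reviewer approves of — setting metadata explicitly and initializing Spec to an empty value — can be shown with simplified stand-in types (the real code uses cnpgv1.Backup with metav1.ObjectMeta; these structs are illustrative only):

```go
package main

import "fmt"

// Simplified stand-ins for the CNPG API shapes.
type ObjectMeta struct{ Name, Namespace string }
type BackupSpec struct{ Method, Target string }
type Backup struct {
	ObjectMeta ObjectMeta
	Spec       BackupSpec
}

// getBackupObj mirrors the factory pattern: metadata is set up front and
// Spec starts as an explicit empty value, so callers fill fields directly.
func getBackupObj(name, namespace string) *Backup {
	return &Backup{
		ObjectMeta: ObjectMeta{Name: name, Namespace: namespace},
		Spec:       BackupSpec{},
	}
}

func main() {
	b := getBackupObj("noobaa-db-pg-cluster-backup-20251201181523", "noobaa")
	b.Spec.Method = "volumeSnapshot" // Spec is already initialized
	fmt.Println(b.ObjectMeta.Namespace, b.Spec.Method)
}
```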

pkg/system/system.go (1)

124-134: LGTM!

The command definition follows the established patterns in this file and provides a clear interface for creating database backups with an optional custom name.

pkg/system/db_reconciler.go (6)

79-81: LGTM!

The recovery status is correctly marked as completed only when the cluster is ready and a recovery was in progress.


307-312: LGTM!

Good addition of the nil check for VolumeSnapshot before accessing its fields, preventing potential nil pointer dereference.


325-336: LGTM!

The cleanup function properly handles the case where the ScheduledBackup doesn't exist by checking with KubeCheckQuiet before deletion and ignoring NotFound errors.


338-386: LGTM!

The recovery flow is well-structured:

  1. Sets initial recovery status
  2. Ensures dependent pods are stopped before recovery
  3. Configures the bootstrap with VolumeSnapshot source
  4. Cleans up scheduled backups to prevent conflicts during recovery

426-448: LGTM!

The naming change to -scheduled-backup suffix and targeting standby replicas for backups are sensible choices that minimize impact on the primary database instance.


512-544: LGTM!

The function correctly:

  1. Lists VolumeSnapshots using an efficient label selector
  2. Filters to only include scheduled backup snapshots
  3. Sorts by creation timestamp for proper retention ordering

@dannyzaken dannyzaken changed the title from "Danny backup cli" to "Added a CLI command to take a single DB backup" on Dec 2, 2025
@dannyzaken dannyzaken force-pushed the danny-backup-cli branch 2 times, most recently from e356914 to 70def4a on December 2, 2025 08:36
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)

474-499: Past review comment still applies.

The concern about the log message at line 477 potentially overstating the number of snapshots to be deleted (when the recovery snapshot is among them) remains valid. The actual functionality is correct, but the log message could be misleading.

🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)

209-214: Minor: Error message could be more accurate.

The error message says "cluster creation is still in progress" but the condition also covers the Recovering status. Consider updating for clarity:

 			if currentDBClusterStatus == nbv1.DBClusterStatusCreating || currentDBClusterStatus == nbv1.DBClusterStatusRecovering {
-				r.cnpgLog("the cluster spec was changed but the cluster creation is still in progress, skipping update")
-				return fmt.Errorf("cluster creation is still in progress, skipping update")
+				r.cnpgLog("the cluster spec was changed but the cluster %s is still in progress, skipping update", currentDBClusterStatus)
+				return fmt.Errorf("cluster %s is still in progress, skipping update", currentDBClusterStatus)
 			}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e356914 and 70def4a.

📒 Files selected for processing (3)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (4 hunks)
  • pkg/system/system.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/system/system.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/db_reconciler.go (1)
pkg/apis/noobaa/v1alpha1/noobaa_types.go (2)
  • DBClusterStatusCreating (557-557)
  • DBClusterStatusRecovering (566-566)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: run-operator-tests
  • GitHub Check: run-admission-test
  • GitHub Check: golangci-lint
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: run-azure-vault-test
  • GitHub Check: run-kms-kmip-test
  • GitHub Check: run-hac-test
🔇 Additional comments (3)
pkg/cnpg/cnpg.go (1)

613-626: LGTM!

The explicit initialization of Spec: cnpgv1.BackupSpec{} is consistent with the pattern used in GetCnpgScheduledBackupObj (line 608) and provides a clear, defensive initialization.

pkg/system/db_reconciler.go (2)

339-387: LGTM!

The recovery flow is well-structured:

  1. Sets recovery status to pending
  2. Ensures core/endpoint pods are stopped before proceeding
  3. Configures bootstrap recovery from the specified VolumeSnapshot
  4. Cleans up scheduled backups to prevent conflicts during recovery

The use of storagesnapshotv1.GroupName for the API group is correct.


446-448: Cold volume-snapshot backups on the primary will require stopping it.

CloudNativePG safely falls back to the primary instance when BackupTargetStandby is specified but no standby exists. However, setting Method = VolumeSnapshot with Online = false (cold backup) means that volume snapshots will stop the primary during backup operations, impacting database availability and writes—even for single-instance clusters.

Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
pkg/system/system.go (1)

955-957: Avoid hardcoding CNPG backup API group in kubectl hint

The example command:

log.Printf("kubectl -n %s get backups.postgresql.cnpg.noobaa.io %s", sys.Namespace, backupName)

hardcodes backups.postgresql.cnpg.noobaa.io. This can diverge from the actual API group used at runtime (e.g., when options.UseCnpgApiGroup is set and the group is postgresql.cnpg.io instead), while the Backup object itself relies on the CNPG helper constants.

Consider deriving the group from the CNPG package so the hint always matches the real CRD:

-	log.Printf("kubectl -n %s get backups.postgresql.cnpg.noobaa.io %s", sys.Namespace, backupName)
+	log.Printf("kubectl -n %s get backups.%s %s", sys.Namespace, cnpg.CnpgAPIGroup, backupName)

Please confirm that cnpg.CnpgAPIGroup (or the equivalent helper in your codebase) already accounts for options.UseCnpgApiGroup.

🧹 Nitpick comments (2)
pkg/system/system.go (2)

124-134: CmdDBBackup factory is consistent with existing CLI patterns

The command definition (Use, Short, Run, Args, and --name flag) matches the style of other system subcommands and should integrate smoothly.

If you want to polish UX later, consider adding a Long description with example usage and the default backup-name pattern.


913-952: RunDBBackup core flow looks sound; minor readability nit

The handler correctly:

  • Verifies the NooBaa CR exists.
  • Ensures CNPG DBSpec and DBBackup.VolumeSnapshot (with non-empty class) are configured.
  • Builds a cnpgv1.Backup targeting the CNPG cluster with method=volumeSnapshot, target=standby, and online=false via Online *bool.
  • Uses KubeCreateFailExisting and fails fast on create errors.

Only minor nit: offlineBackup is set to false and then assigned to backup.Spec.Online. This is logically correct (it means “offline/cold backup”), but the variable name is a bit confusing.

You might consider renaming offlineBackup to something like onlineBackup or online := false to better reflect the field it’s bound to. Please double‑check against the CNPG BackupSpec.Online semantics in the version you vendor.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 70def4a and edef022.

📒 Files selected for processing (3)
  • pkg/cnpg/cnpg.go (1 hunks)
  • pkg/system/db_reconciler.go (4 hunks)
  • pkg/system/system.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/cnpg/cnpg.go
  • pkg/system/db_reconciler.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/system.go (4)
pkg/cli/cli.go (1)
  • Run (61-66)
pkg/util/util.go (5)
  • Logger (905-907)
  • KubeCheck (571-592)
  • KubeCreateFailExisting (390-392)
  • Context (910-912)
  • KubeClient (271-282)
pkg/options/options.go (2)
  • SystemName (50-50)
  • Namespace (57-57)
pkg/cnpg/cnpg.go (1)
  • GetCnpgBackupObj (613-626)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: run-cli-tests
  • GitHub Check: run-kms-dev-test
  • GitHub Check: cnpg-deployment-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: run-admission-test
  • GitHub Check: run-core-config-map-tests
  • GitHub Check: run-operator-tests
  • GitHub Check: run-azure-vault-test
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-kms-kmip-test
  • GitHub Check: golangci-lint
  • GitHub Check: run-kms-tls-token-test
  • GitHub Check: run-hac-test
🔇 Additional comments (2)
pkg/system/system.go (2)

52-57: db-backup subcommand is wired correctly into system

Adding CmdDBBackup() to Cmd()'s subcommands cleanly exposes noobaa system db-backup without altering other flows.


963-985: Use label-based lookup for VolumeSnapshots instead of name-based polling, and propagate poll context to the client call

Two issues with the current polling approach:

  1. Name-based lookup is fragile

The code currently queries for a single VolumeSnapshot by name matching backupName. However, according to CNPG documentation, VolumeSnapshots are identified by labels like cnpg.io/backupName and cnpg.io/cluster, not by a predictable naming scheme. Polling by exact name risks timing out even when the backup succeeds.

  2. Poll context isn't propagated to the client

The ctx parameter from the poll function callback is not used; the Get call still uses util.Context() instead, so HTTP requests aren't tied to the polling timeout.

Replace the name-based Get with a label-based List:

-	err := wait.PollUntilContextCancel(pollCtx, interval*time.Second, true, func(ctx context.Context) (bool, error) {
-		volumeSnapshot := &storagesnapshotv1.VolumeSnapshot{}
-		volumeSnapshotErr := util.KubeClient().Get(util.Context(),
-			client.ObjectKey{Namespace: sys.Namespace, Name: backupName},
-			volumeSnapshot)
-		if errors.IsNotFound(volumeSnapshotErr) {
-			log.Printf("⏳ Volume snapshot %s not found yet", backupName)
-			return false, nil
-		}
-		if volumeSnapshotErr != nil {
-			return false, volumeSnapshotErr
-		}
-		return true, nil
-	})
+	err := wait.PollUntilContextCancel(pollCtx, interval*time.Second, true, func(ctx context.Context) (bool, error) {
+		volumeSnapshots := &storagesnapshotv1.VolumeSnapshotList{}
+		selector, _ := labels.Parse(
+			fmt.Sprintf("cnpg.io/cluster=%s,cnpg.io/backupName=%s", sys.Name+pgClusterSuffix, backupName),
+		)
+		volumeSnapshotErr := util.KubeClient().List(ctx, volumeSnapshots, &client.ListOptions{
+			Namespace:     sys.Namespace,
+			LabelSelector: selector,
+		})
+		if volumeSnapshotErr != nil {
+			return false, volumeSnapshotErr
+		}
+		if len(volumeSnapshots.Items) == 0 {
+			log.Printf("⏳ Volume snapshots for backup %s not found yet", backupName)
+			return false, nil
+		}
+		return true, nil
+	})

Also update line 985's user-facing hint to use label-based selection:

log.Printf(
  "kubectl -n %s get volumesnapshots.snapshot.storage.k8s.io -l 'cnpg.io/cluster=%s,cnpg.io/backupName=%s'",
  sys.Namespace, sys.Name+pgClusterSuffix, backupName,
)

Confirm that pgClusterSuffix is defined in your codebase and that label names match the CNPG version you are vendoring.

@jackyalbo jackyalbo (Contributor) left a comment

LGTM, should we add a test to the cli tests?

@dannyzaken dannyzaken merged commit 1a2ab94 into noobaa:master Dec 2, 2025
13 of 17 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Jan 13, 2026
