Added a CLI command to take a single DB backup #1740
Conversation
Walkthrough
Initializes CNPG Backup objects' `Spec`; adds a CLI subcommand to create CNPG backups via VolumeSnapshots; and implements recovery-aware DB reconciler branching to bootstrap clusters from VolumeSnapshots, including snapshot listing and cleanup during recovery.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Admin as Admin CLI
    participant System as system.RunDBBackup
    participant CNPG as CNPG Backup CR (controller)
    participant Reconciler as DB Reconciler
    participant VolumeSnap as VolumeSnapshot API
    participant NooBaa as NooBaa pods
    participant Status as Cluster Status
    Admin->>System: run db-backup
    System->>CNPG: create Backup CR (target: Standby, method: VolumeSnapshot)
    CNPG->>VolumeSnap: request/create VolumeSnapshot (async)
    VolumeSnap-->>CNPG: snapshot created / ready
    CNPG-->>System: Backup status updated with snapshot ref
    System->>Admin: poll/report snapshot readiness + kubectl tips
    Note over Reconciler,Status: Recovery-aware bootstrap flow
    Reconciler->>Status: detect DBRecovery present?
    alt Recovery configured
        Reconciler->>NooBaa: ensure pods stopped for restore
        NooBaa-->>Reconciler: pods stopped
        Reconciler->>VolumeSnap: list/filter snapshots (by cluster/age)
        Reconciler->>Reconciler: cleanup scheduled backups if conflicting
        Reconciler->>CNPG: set Bootstrap.Recovery (snapshot source)
        Reconciler->>Status: set RecoveryStatus (snapshot name/time, Pending)
    else Fresh init
        Reconciler->>CNPG: set Bootstrap.InitDB/import flow
        Reconciler->>Status: set Creating
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Force-pushed from baba9e2 to 53939e9
Actionable comments posted: 4
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)
182-198: Unreachable fallback code for snapshot name.
At line 187, the condition `r.NooBaa.Spec.DBSpec.DBRecovery != nil` is always true in this branch because we're inside the `else` block that starts at line 165, which is only entered when `r.NooBaa.Spec.DBSpec.DBRecovery != nil` (the inverse of line 154's condition). Therefore, lines 189-192 are unreachable.
Consider simplifying:

```diff
-	snapshotName := ""
-	if r.NooBaa.Spec.DBSpec.DBRecovery != nil {
-		snapshotName = r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName
-	} else if r.NooBaa.Status.DBStatus.RecoveryStatus != nil {
-		// fall back to previously recorded snapshot name, if any
-		snapshotName = r.NooBaa.Status.DBStatus.RecoveryStatus.SnapshotName
-	}
+	snapshotName := r.NooBaa.Spec.DBSpec.DBRecovery.VolumeSnapshotName
 	r.NooBaa.Status.DBStatus.RecoveryStatus = &nbv1.DBRecoveryStatus{
 		Status:       nbv1.DBRecoveryStatusRunning,
 		SnapshotName: snapshotName,
 		RecoveryTime: &metav1.Time{Time: time.Now()},
 	}
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- pkg/cnpg/cnpg.go (1 hunks)
- pkg/system/db_reconciler.go (10 hunks)
- pkg/system/system.go (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
pkg/cnpg/cnpg.go (2)
pkg/hac/hac.go (1)
- Name (22-22)
pkg/options/options.go (1)
- Namespace (57-57)
pkg/system/db_reconciler.go (1)
pkg/cnpg/cnpg.go (2)
- GetCnpgScheduledBackupObj (598-611)
- GetCnpgBackupObj (613-626)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: run-operator-tests
- GitHub Check: run-core-config-map-tests
- GitHub Check: run-cli-tests
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-kms-dev-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: run-admission-test
- GitHub Check: run-kms-tls-token-test
- GitHub Check: run-kms-kmip-test
- GitHub Check: run-azure-vault-test
- GitHub Check: golangci-lint
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-hac-test
🔇 Additional comments (8)
pkg/cnpg/cnpg.go (1)
613-626: LGTM!
The initialization of `ObjectMeta` with `Name` and `Namespace` and the empty `BackupSpec{}` aligns with how similar factory functions in this file are implemented (e.g., `GetCnpgScheduledBackupObj`, `GetCnpgClusterObj`). This ensures the backup object is properly initialized for downstream usage.
pkg/system/system.go (1)
124-134: LGTM!
The command definition follows the established patterns in this file and provides a clear interface for creating database backups with an optional custom name.
pkg/system/db_reconciler.go (6)
79-81: LGTM!
The recovery status is correctly marked as completed only when the cluster is ready and a recovery was in progress.
307-312: LGTM!
Good addition of the nil check for `VolumeSnapshot` before accessing its fields, preventing a potential nil pointer dereference.
325-336: LGTM!
The cleanup function properly handles the case where the `ScheduledBackup` doesn't exist by checking with `KubeCheckQuiet` before deletion and ignoring `NotFound` errors.
338-386: LGTM!
The recovery flow is well-structured:
- Sets initial recovery status
- Ensures dependent pods are stopped before recovery
- Configures the bootstrap with VolumeSnapshot source
- Cleans up scheduled backups to prevent conflicts during recovery
426-448: LGTM!
The naming change to the `-scheduled-backup` suffix and targeting standby replicas for backups are sensible choices that minimize impact on the primary database instance.
512-544: LGTM!
The function correctly:
- Lists VolumeSnapshots using an efficient label selector
- Filters to only include scheduled backup snapshots
- Sorts by creation timestamp for proper retention ordering
Force-pushed from e356914 to 70def4a
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/system/db_reconciler.go (1)
474-499: Past review comment still applies.
The concern about the log message at line 477 potentially overstating the number of snapshots to be deleted (when the recovery snapshot is among them) remains valid. The actual functionality is correct, but the log message could be misleading.
🧹 Nitpick comments (1)
pkg/system/db_reconciler.go (1)
209-214: Minor: Error message could be more accurate.
The error message says "cluster creation is still in progress" but the condition also covers the `Recovering` status. Consider updating for clarity:

```diff
 if currentDBClusterStatus == nbv1.DBClusterStatusCreating ||
 	currentDBClusterStatus == nbv1.DBClusterStatusRecovering {
-	r.cnpgLog("the cluster spec was changed but the cluster creation is still in progress, skipping update")
-	return fmt.Errorf("cluster creation is still in progress, skipping update")
+	r.cnpgLog("the cluster spec was changed but the cluster %s is still in progress, skipping update", currentDBClusterStatus)
+	return fmt.Errorf("cluster %s is still in progress, skipping update", currentDBClusterStatus)
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- pkg/cnpg/cnpg.go (1 hunks)
- pkg/system/db_reconciler.go (4 hunks)
- pkg/system/system.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/system/system.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/db_reconciler.go (1)
pkg/apis/noobaa/v1alpha1/noobaa_types.go (2)
- DBClusterStatusCreating (557-557)
- DBClusterStatusRecovering (566-566)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: run-operator-tests
- GitHub Check: run-admission-test
- GitHub Check: golangci-lint
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: run-azure-vault-test
- GitHub Check: run-kms-kmip-test
- GitHub Check: run-hac-test
🔇 Additional comments (3)
pkg/cnpg/cnpg.go (1)
613-626: LGTM!
The explicit initialization of `Spec: cnpgv1.BackupSpec{}` is consistent with the pattern used in `GetCnpgScheduledBackupObj` (line 608) and provides a clear, defensive initialization.
pkg/system/db_reconciler.go (2)
339-387: LGTM!
The recovery flow is well-structured:
- Sets recovery status to pending
- Ensures core/endpoint pods are stopped before proceeding
- Configures bootstrap recovery from the specified VolumeSnapshot
- Cleans up scheduled backups to prevent conflicts during recovery
The use of `storagesnapshotv1.GroupName` for the API group is correct.
446-448: Cold volume-snapshot backups on the primary will require stopping it.
CloudNativePG safely falls back to the primary instance when `BackupTargetStandby` is specified but no standby exists. However, setting `Method = VolumeSnapshot` with `Online = false` (cold backup) means that volume snapshots will stop the primary during backup operations, impacting database availability and writes, even for single-instance clusters.
Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
Force-pushed from 70def4a to edef022
Actionable comments posted: 0
♻️ Duplicate comments (1)
pkg/system/system.go (1)
955-957: Avoid hardcoding the CNPG backup API group in the kubectl hint.
The example command:

```go
log.Printf("kubectl -n %s get backups.postgresql.cnpg.noobaa.io %s", sys.Namespace, backupName)
```

hardcodes `backups.postgresql.cnpg.noobaa.io`. This can diverge from the actual API group used at runtime (e.g., when `options.UseCnpgApiGroup` is set and the group is `postgresql.cnpg.io` instead), while the Backup object itself relies on the CNPG helper constants.
Consider deriving the group from the CNPG package so the hint always matches the real CRD:

```diff
-	log.Printf("kubectl -n %s get backups.postgresql.cnpg.noobaa.io %s", sys.Namespace, backupName)
+	log.Printf("kubectl -n %s get backups.%s %s", sys.Namespace, cnpg.CnpgAPIGroup, backupName)
```

Please confirm that `cnpg.CnpgAPIGroup` (or the equivalent helper in your codebase) already accounts for `options.UseCnpgApiGroup`.
🧹 Nitpick comments (2)
pkg/system/system.go (2)
124-134: CmdDBBackup factory is consistent with existing CLI patterns.
The command definition (`Use`, `Short`, `Run`, `Args`, and the `--name` flag) matches the style of other system subcommands and should integrate smoothly.
If you want to polish UX later, consider adding a `Long` description with example usage and the default backup-name pattern.
913-952: RunDBBackup core flow looks sound; minor readability nit.
The handler correctly:
- Verifies the NooBaa CR exists.
- Ensures CNPG `DBSpec` and `DBBackup.VolumeSnapshot` (with a non-empty class) are configured.
- Builds a `cnpgv1.Backup` targeting the CNPG cluster with `method=volumeSnapshot`, `target=standby`, and `online=false` via `Online *bool`. (pkg.go.dev)
- Uses `KubeCreateFailExisting` and fails fast on create errors.

Only minor nit: `offlineBackup` is set to `false` and then assigned to `backup.Spec.Online`. This is logically correct (it means "offline/cold backup"), but the variable name is a bit confusing.
You might consider renaming `offlineBackup` to something like `onlineBackup` or `online := false` to better reflect the field it's bound to. Please double-check against the CNPG `BackupSpec.Online` semantics in the version you vendor.
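The naming nit can be seen in a minimal stand-in; the `backupSpec` type below is hypothetical (the real field is `cnpgv1.BackupSpec.Online`, a `*bool`):

```go
package main

import "fmt"

// backupSpec is a hypothetical stand-in for cnpgv1.BackupSpec,
// whose Online field is a *bool in the real API.
type backupSpec struct {
	Online *bool
}

func main() {
	// Reviewed pattern: the variable is named for the opposite of the
	// field it populates, which reads as a double negative.
	offlineBackup := false
	spec := backupSpec{Online: &offlineBackup}
	fmt.Println("online:", *spec.Online)

	// Suggested rename: bind the variable name to the field's meaning.
	// Online=false still means a cold (offline) backup.
	online := false
	spec = backupSpec{Online: &online}
	fmt.Println("online:", *spec.Online)
}
```

Both variants produce the same spec; the rename only removes the mismatch between the variable name and the field it feeds.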
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
pkg/cnpg/cnpg.go(1 hunks)pkg/system/db_reconciler.go(4 hunks)pkg/system/system.go(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/cnpg/cnpg.go
- pkg/system/db_reconciler.go
🧰 Additional context used
🧬 Code graph analysis (1)
pkg/system/system.go (4)
pkg/cli/cli.go (1)
- Run (61-66)
pkg/util/util.go (5)
- Logger (905-907)
- KubeCheck (571-592)
- KubeCreateFailExisting (390-392)
- Context (910-912)
- KubeClient (271-282)
pkg/options/options.go (2)
- SystemName (50-50)
- Namespace (57-57)
pkg/cnpg/cnpg.go (1)
- GetCnpgBackupObj (613-626)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: run-cli-tests
- GitHub Check: run-kms-dev-test
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: run-admission-test
- GitHub Check: run-core-config-map-tests
- GitHub Check: run-operator-tests
- GitHub Check: run-azure-vault-test
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-kms-kmip-test
- GitHub Check: golangci-lint
- GitHub Check: run-kms-tls-token-test
- GitHub Check: run-hac-test
🔇 Additional comments (2)
pkg/system/system.go (2)
52-57: db-backup subcommand is wired correctly into `system`.
Adding `CmdDBBackup()` to `Cmd()`'s subcommands cleanly exposes `noobaa system db-backup` without altering other flows.
963-985: Use label-based lookup for VolumeSnapshots instead of name-based polling, and propagate the poll context to the client call.
Two issues with the current polling approach:

- Name-based lookup is fragile
The code currently queries for a single VolumeSnapshot by name matching `backupName`. However, according to CNPG documentation, VolumeSnapshots are identified by labels like `cnpg.io/backupName` and `cnpg.io/cluster`, not by a predictable naming scheme. Polling by exact name risks timing out even when the backup succeeds.

- Poll context isn't propagated to the client
The `ctx` parameter from the poll function callback is not used; the `Get` call still uses `util.Context()` instead, so HTTP requests aren't tied to the polling timeout.

Replace the name-based `Get` with a label-based `List`:

```diff
-	err := wait.PollUntilContextCancel(pollCtx, interval*time.Second, true, func(ctx context.Context) (bool, error) {
-		volumeSnapshot := &storagesnapshotv1.VolumeSnapshot{}
-		volumeSnapshotErr := util.KubeClient().Get(util.Context(),
-			client.ObjectKey{Namespace: sys.Namespace, Name: backupName},
-			volumeSnapshot)
-		if errors.IsNotFound(volumeSnapshotErr) {
-			log.Printf("⏳ Volume snapshot %s not found yet", backupName)
-			return false, nil
-		}
-		if volumeSnapshotErr != nil {
-			return false, volumeSnapshotErr
-		}
-		return true, nil
-	})
+	err := wait.PollUntilContextCancel(pollCtx, interval*time.Second, true, func(ctx context.Context) (bool, error) {
+		volumeSnapshots := &storagesnapshotv1.VolumeSnapshotList{}
+		selector, _ := labels.Parse(
+			fmt.Sprintf("cnpg.io/cluster=%s,cnpg.io/backupName=%s", sys.Name+pgClusterSuffix, backupName),
+		)
+		volumeSnapshotErr := util.KubeClient().List(ctx, volumeSnapshots, &client.ListOptions{
+			Namespace:     sys.Namespace,
+			LabelSelector: selector,
+		})
+		if volumeSnapshotErr != nil {
+			return false, volumeSnapshotErr
+		}
+		if len(volumeSnapshots.Items) == 0 {
+			log.Printf("⏳ Volume snapshots for backup %s not found yet", backupName)
+			return false, nil
+		}
+		return true, nil
+	})
```

Also update line 985's user-facing hint to use label-based selection:

```go
log.Printf(
	"kubectl -n %s get volumesnapshots.snapshot.storage.k8s.io -l 'cnpg.io/cluster=%s,cnpg.io/backupName=%s'",
	sys.Namespace, sys.Name+pgClusterSuffix, backupName,
)
```

Confirm that `pgClusterSuffix` is defined in your codebase and that the label names match the CNPG version you are vendoring.
jackyalbo
left a comment
LGTM, should we add a test to the cli tests?
Depends on #1737
Explain the changes
- `--name` flag to pass a custom name. The default name is `noobaa-db-pg-cluster-backup-[date and time]`, e.g. `noobaa-db-pg-cluster-backup-20251201181523`
Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
Summary by CodeRabbit
New Features
Improvements
Bug Fixes