Skip to content

Improve upgrade handling to sync pods restarts#1643

Merged
dannyzaken merged 1 commit intonoobaa:masterfrom
dannyzaken:danny-sync-upgrade
Jul 2, 2025
Merged

Improve upgrade handling to sync pods restarts#1643
dannyzaken merged 1 commit intonoobaa:masterfrom
dannyzaken:danny-sync-upgrade

Conversation

@dannyzaken
Copy link
Member

@dannyzaken dannyzaken commented Jul 1, 2025

Explain the changes

  • added a readiness probe to the core container in the core statefulset. the core will be ready when /version route in the webserver returns success code.
  • in phase 2 of the system reconciliation - if the noobaa image was changed, stop core and endpoint pods
  • after the pods stop, reconcile the core sts to start the pod with the new image. the core pod will run the upgrade_manager without interruptions.
  • in phase 3, wait for the core readiness before proceeding

Issues: Fixed #xxx / Gap #xxx

  1. operator side of https://issues.redhat.com/browse/DFBUGS-2457

Testing Instructions:

  1. update the core image
  2. inspect the core and endpoints pods.
  3. all pods should be terminated first. noobaa core should start first and only after the upgrade process runs, the core pod becomes ready and the endpoint pods start.
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • New Features

    • Added a readiness check for the core application, ensuring pods are fully ready before proceeding with further operations.
    • Enhanced upgrade handling by stopping core and endpoint pods before applying updates, with clear status feedback if pods are still terminating.
    • Introduced a readiness probe on the core container to monitor its availability via the /version endpoint.
  • Bug Fixes

    • Improved consistency in pod readiness and upgrade workflows to prevent premature actions during upgrades.
  • Chores

    • Updated internal configuration and manifest checksums for improved reliability.

@coderabbitai
Copy link

coderabbitai bot commented Jul 1, 2025

Walkthrough

A readiness probe was added to the NooBaa core StatefulSet manifest. The reconciliation logic was updated to handle upgrades by stopping core and endpoint pods before applying changes, using a new method for scaling down and counting running pods. Readiness checks were also added before client initialization.

Changes

File(s) Change Summary
deploy/internal/statefulset-core.yaml Added a readinessProbe to the core container spec (HTTP GET /version on port 8080).
pkg/bundle/deploy.go Updated SHA256 checksum constant for the core StatefulSet manifest.
pkg/system/phase2_creating.go Added pre-upgrade logic: checks for image changes, stops core/endpoint pods before reconciling core app.
pkg/system/reconciler.go Added method to scale down core and endpoint pods and count running pods.
pkg/system/db_reconciler.go Removed method for stopping pods and counting running pods; updated condition for running pods check.
pkg/system/phase3_connecting.go Added readiness check: verifies all core StatefulSet pods are ready before initializing the client.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Reconciler
    participant CoreApp (StatefulSet)
    participant Endpoint (Deployment)
    participant K8s API

    User->>Reconciler: Trigger reconciliation
    Reconciler->>CoreApp (StatefulSet): Check desired vs actual image
    alt Image changed (upgrade needed)
        Reconciler->>CoreApp (StatefulSet): Scale replicas to 0
        Reconciler->>Endpoint (Deployment): Scale replicas to 0
        Reconciler->>K8s API: List core and endpoint pods
        K8s API-->>Reconciler: Return running pods count
        alt Pods still running
            Reconciler-->>User: Return error (waiting for pods to terminate)
        else No pods running
            Reconciler->>CoreApp (StatefulSet): Proceed with upgrade
        end
    else No upgrade needed
        Reconciler->>CoreApp (StatefulSet): Reconcile as usual
    end
Loading
sequenceDiagram
    participant Reconciler
    participant CoreApp (StatefulSet)
    participant User

    Reconciler->>CoreApp (StatefulSet): Check ReadyReplicas vs Spec.Replicas
    alt Not all pods ready
        Reconciler-->>User: Return error (waiting for core pods to be ready)
    else All pods ready
        Reconciler->>Reconciler: InitNBClient()
    end
Loading

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ef0384 and 81c325e.

📒 Files selected for processing (6)
  • deploy/internal/statefulset-core.yaml (1 hunks)
  • pkg/bundle/deploy.go (2 hunks)
  • pkg/system/db_reconciler.go (1 hunks)
  • pkg/system/phase2_creating.go (2 hunks)
  • pkg/system/phase3_connecting.go (1 hunks)
  • pkg/system/reconciler.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • deploy/internal/statefulset-core.yaml
  • pkg/system/phase3_connecting.go
  • pkg/system/db_reconciler.go
  • pkg/system/phase2_creating.go
  • pkg/system/reconciler.go
  • pkg/bundle/deploy.go
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: run-operator-tests
  • GitHub Check: run-core-config-map-tests
  • GitHub Check: run-azure-vault-test
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: run-hac-test
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/bundle/deploy.go (1)

5102-5105: Consider adding probe timing configurations for production robustness.

The readiness probe configuration is functional but minimal. For production deployments, consider adding timing configurations to handle varying startup times and network conditions.

  readinessProbe:
    httpGet:
      path: /version
      port: 8080
+   initialDelaySeconds: 30
+   periodSeconds: 10
+   timeoutSeconds: 5
+   failureThreshold: 3

This would provide:

  • initialDelaySeconds: Grace period before first probe
  • periodSeconds: Frequency of probe checks
  • timeoutSeconds: Timeout for each probe request
  • failureThreshold: Number of failures before marking as not ready
pkg/system/reconciler.go (1)

710-717: Use appropriate logging method for non-CNPG operations.

The method uses r.cnpgLogError for logging, but this function handles general NooBaa pod operations, not specifically CNPG cluster operations.

-		r.cnpgLogError("got error stopping noobaa-core pods. error: %v", err)
+		r.Logger.Errorf("got error stopping noobaa-core pods. error: %v", err)
-		r.cnpgLogError("got error stopping noobaa-endpoints pods. error: %v", err)
+		r.Logger.Errorf("got error stopping noobaa-endpoints pods. error: %v", err)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 975f7ef and ce216e5.

📒 Files selected for processing (6)
  • deploy/internal/statefulset-core.yaml (1 hunks)
  • pkg/bundle/deploy.go (2 hunks)
  • pkg/system/db_reconciler.go (1 hunks)
  • pkg/system/phase2_creating.go (1 hunks)
  • pkg/system/phase3_connecting.go (1 hunks)
  • pkg/system/reconciler.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
pkg/system/phase3_connecting.go (1)
Learnt from: Neon-White
PR: noobaa/noobaa-operator#1636
File: pkg/backingstore/reconciler.go:718-724
Timestamp: 2025-06-25T15:28:44.080Z
Learning: In pkg/backingstore/reconciler.go, the switch refactor for S3Compatible.SignatureVersion and IBMCos.SignatureVersion maintains the same behavior as the original if-else if logic - when SignatureVersion is empty or unspecified, conn.AuthMethod is left unset. There was no default to V4 in the original code.
pkg/bundle/deploy.go (1)

undefined

<retrieved_learning>
Learnt from: Neon-White
PR: #1636
File: pkg/backingstore/reconciler.go:718-724
Timestamp: 2025-06-25T15:28:44.080Z
Learning: In pkg/backingstore/reconciler.go, the switch refactor for S3Compatible.SignatureVersion and IBMCos.SignatureVersion maintains the same behavior as the original if-else if logic - when SignatureVersion is empty or unspecified, conn.AuthMethod is left unset. There was no default to V4 in the original code.
</retrieved_learning>

🧬 Code Graph Analysis (1)
pkg/system/reconciler.go (2)
pkg/util/util.go (1)
  • KubeList (639-660)
pkg/options/options.go (1)
  • Namespace (60-60)
⏰ Context from checks skipped due to timeout of 90000ms (13)
  • GitHub Check: run-cli-tests
  • GitHub Check: run-kms-tls-token-test
  • GitHub Check: golangci-lint
  • GitHub Check: cnpg-deployment-test
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: run-core-config-map-tests
  • GitHub Check: run-admission-test
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-operator-tests
  • GitHub Check: run-azure-vault-test
  • GitHub Check: run-hac-test
  • GitHub Check: run-kms-dev-test
  • GitHub Check: run-kms-kmip-test
🔇 Additional comments (4)
pkg/bundle/deploy.go (1)

5047-5047: LGTM: SHA256 checksum updated correctly.

The checksum update reflects the addition of the readiness probe to the StatefulSet manifest, which is necessary to maintain integrity validation.

pkg/system/phase2_creating.go (1)

179-192: Excellent upgrade handling implementation.

The pre-upgrade logic properly detects image changes and ensures pods are terminated before proceeding with the upgrade. This aligns well with the PR objective of improving upgrade handling to sync pod restarts.

pkg/system/db_reconciler.go (1)

299-299: Improved condition for robust error handling.

Changing from > 0 to != 0 is more explicit and handles the error case where stopNoobaaPodsAndGetNumRunningPods returns -1, ensuring the system properly waits when pod counting fails.

pkg/system/reconciler.go (1)

720-728: Label selectors are consistent with StatefulSet and Deployment specs

The selectors noobaa-core: noobaa and noobaa-s3: noobaa exactly match the labels defined in deploy/internal/statefulset-core.yaml and deploy/internal/deployment-endpoint.yaml. No changes required.

@dannyzaken dannyzaken force-pushed the danny-sync-upgrade branch from ce216e5 to 11a21a2 Compare July 1, 2025 18:00
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jul 1, 2025
@dannyzaken dannyzaken force-pushed the danny-sync-upgrade branch 5 times, most recently from 915bd3e to 5ef0384 Compare July 2, 2025 07:49
- added a readiness probe to the core container in the core statefulset. the core will be ready when `/version` route in the webserver returns success code.
- in phase 2 of the system reconciliation - if the noobaa image was changed, stop core and endpoint pods
- after the pods stop, reconcile the core sts to start the pod with the new image. the core pod will run the upgrade_manager without interruptions.
- in phase 3, wait for the core readiness before proceeding

Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
@dannyzaken dannyzaken force-pushed the danny-sync-upgrade branch from 5ef0384 to 81c325e Compare July 2, 2025 07:53
@dannyzaken dannyzaken requested a review from jackyalbo July 2, 2025 09:01
@dannyzaken dannyzaken merged commit fc22ee8 into noobaa:master Jul 2, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants