Improve upgrade handling to sync pods restarts#1643
Conversation
WalkthroughA readiness probe was added to the NooBaa core StatefulSet manifest. The reconciliation logic was updated to handle upgrades by stopping core and endpoint pods before applying changes, using a new method for scaling down and counting running pods. Readiness checks were also added before client initialization. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Reconciler
participant CoreApp (StatefulSet)
participant Endpoint (Deployment)
participant K8s API
User->>Reconciler: Trigger reconciliation
Reconciler->>CoreApp (StatefulSet): Check desired vs actual image
alt Image changed (upgrade needed)
Reconciler->>CoreApp (StatefulSet): Scale replicas to 0
Reconciler->>Endpoint (Deployment): Scale replicas to 0
Reconciler->>K8s API: List core and endpoint pods
K8s API-->>Reconciler: Return running pods count
alt Pods still running
Reconciler-->>User: Return error (waiting for pods to terminate)
else No pods running
Reconciler->>CoreApp (StatefulSet): Proceed with upgrade
end
else No upgrade needed
Reconciler->>CoreApp (StatefulSet): Reconcile as usual
end
sequenceDiagram
participant Reconciler
participant CoreApp (StatefulSet)
participant User
Reconciler->>CoreApp (StatefulSet): Check ReadyReplicas vs Spec.Replicas
alt Not all pods ready
Reconciler-->>User: Return error (waiting for core pods to be ready)
else All pods ready
Reconciler->>Reconciler: InitNBClient()
end
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (6)
🚧 Files skipped from review as they are similar to previous changes (6)
⏰ Context from checks skipped due to timeout of 90000ms (6)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/bundle/deploy.go (1)
5102-5105: Consider adding probe timing configurations for production robustness.The readiness probe configuration is functional but minimal. For production deployments, consider adding timing configurations to handle varying startup times and network conditions.
readinessProbe: httpGet: path: /version port: 8080 + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 3This would provide:
initialDelaySeconds: Grace period before first probeperiodSeconds: Frequency of probe checkstimeoutSeconds: Timeout for each probe requestfailureThreshold: Number of failures before marking as not readypkg/system/reconciler.go (1)
710-717: Use appropriate logging method for non-CNPG operations.The method uses
r.cnpgLogErrorfor logging, but this function handles general NooBaa pod operations, not specifically CNPG cluster operations.- r.cnpgLogError("got error stopping noobaa-core pods. error: %v", err) + r.Logger.Errorf("got error stopping noobaa-core pods. error: %v", err)- r.cnpgLogError("got error stopping noobaa-endpoints pods. error: %v", err) + r.Logger.Errorf("got error stopping noobaa-endpoints pods. error: %v", err)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
deploy/internal/statefulset-core.yaml(1 hunks)pkg/bundle/deploy.go(2 hunks)pkg/system/db_reconciler.go(1 hunks)pkg/system/phase2_creating.go(1 hunks)pkg/system/phase3_connecting.go(1 hunks)pkg/system/reconciler.go(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
pkg/system/phase3_connecting.go (1)
Learnt from: Neon-White
PR: noobaa/noobaa-operator#1636
File: pkg/backingstore/reconciler.go:718-724
Timestamp: 2025-06-25T15:28:44.080Z
Learning: In pkg/backingstore/reconciler.go, the switch refactor for S3Compatible.SignatureVersion and IBMCos.SignatureVersion maintains the same behavior as the original if-else if logic - when SignatureVersion is empty or unspecified, conn.AuthMethod is left unset. There was no default to V4 in the original code.
pkg/bundle/deploy.go (1)
undefined
<retrieved_learning>
Learnt from: Neon-White
PR: #1636
File: pkg/backingstore/reconciler.go:718-724
Timestamp: 2025-06-25T15:28:44.080Z
Learning: In pkg/backingstore/reconciler.go, the switch refactor for S3Compatible.SignatureVersion and IBMCos.SignatureVersion maintains the same behavior as the original if-else if logic - when SignatureVersion is empty or unspecified, conn.AuthMethod is left unset. There was no default to V4 in the original code.
</retrieved_learning>
🧬 Code Graph Analysis (1)
pkg/system/reconciler.go (2)
pkg/util/util.go (1)
KubeList(639-660)pkg/options/options.go (1)
Namespace(60-60)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: run-cli-tests
- GitHub Check: run-kms-tls-token-test
- GitHub Check: golangci-lint
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: run-core-config-map-tests
- GitHub Check: run-admission-test
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-operator-tests
- GitHub Check: run-azure-vault-test
- GitHub Check: run-hac-test
- GitHub Check: run-kms-dev-test
- GitHub Check: run-kms-kmip-test
🔇 Additional comments (4)
pkg/bundle/deploy.go (1)
5047-5047: LGTM: SHA256 checksum updated correctly.The checksum update reflects the addition of the readiness probe to the StatefulSet manifest, which is necessary to maintain integrity validation.
pkg/system/phase2_creating.go (1)
179-192: Excellent upgrade handling implementation.The pre-upgrade logic properly detects image changes and ensures pods are terminated before proceeding with the upgrade. This aligns well with the PR objective of improving upgrade handling to sync pod restarts.
pkg/system/db_reconciler.go (1)
299-299: Improved condition for robust error handling.Changing from
> 0to!= 0is more explicit and handles the error case wherestopNoobaaPodsAndGetNumRunningPodsreturns -1, ensuring the system properly waits when pod counting fails.pkg/system/reconciler.go (1)
720-728: Label selectors are consistent with StatefulSet and Deployment specsThe selectors
noobaa-core: noobaaandnoobaa-s3: noobaaexactly match the labels defined indeploy/internal/statefulset-core.yamlanddeploy/internal/deployment-endpoint.yaml. No changes required.
ce216e5 to
11a21a2
Compare
915bd3e to
5ef0384
Compare
- added a readiness probe to the core container in the core statefulset. the core will be ready when `/version` route in the webserver returns success code. - in phase 2 of the system reconciliation - if the noobaa image was changed, stop core and endpoint pods - after the pods stop, reconcile the core sts to start the pod with the new image. the core pod will run the upgrade_manager without interruptions. - in phase 3, wait for the core readiness before proceeding Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
5ef0384 to
81c325e
Compare
Explain the changes
/versionroute in the webserver returns success code.Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
Summary by CodeRabbit
New Features
/versionendpoint.Bug Fixes
Chores