Fix upgrade failure due to readiness probe#1731
Conversation
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
WalkthroughModifies operator deployment configuration by changing the pod update strategy to Recreate mode, updates the deployment bundle manifest with the new strategy configuration, and changes the health probe binding address to listen on all network interfaces instead of localhost only. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
pkg/operator/manager.go (1)
56-60: Consider removing leader election since Recreate strategy enforces single instance.The PR description states that the Recreate deployment strategy "replaces prior leader-election-before-server-start behavior." However, the leader election code is still present here. With the Recreate strategy, Kubernetes guarantees only one operator pod exists at any time, making leader election redundant and adding unnecessary startup delay during upgrades.
If you intend to rely on the Recreate strategy for single-instance enforcement, consider removing this leader election block:
- // Become the leader before proceeding - err := leader.Become(util.Context(), "noobaa-operator-lock") - if err != nil { - log.Fatalf("Failed to become leader: %s", err) - } - // Create a new Cmd to provide shared dependencies and start componentsAlternatively, if you want to keep leader election as a defense-in-depth measure, please clarify the rationale in the code comments.
deploy/operator.yaml (1)
67-74: Consider using /healthz for the liveness probe path.The liveness probe currently uses
/readyz, but per best practices for controller-runtime operators, the liveness probe should use/healthzwhile readiness and startup probes use/readyz. This separation helps avoid masking unhealthy-but-ready states.Based on learnings, for controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz.
Apply this diff to align with best practices:
livenessProbe: httpGet: - path: /readyz + path: /healthz port: readyz initialDelaySeconds: 15Note: You may also want to consider renaming the port from
readyzto something more generic likehealthorprobessince it now serves multiple probe types, though this is optional.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
deploy/operator.yaml(1 hunks)pkg/bundle/deploy.go(1 hunks)pkg/operator/manager.go(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-26T14:17:08.013Z
Learnt from: OdedViner
Repo: noobaa/noobaa-operator PR: 1684
File: pkg/bundle/deploy.go:6259-6266
Timestamp: 2025-10-26T14:17:08.013Z
Learning: For controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz (port :8081) to avoid masking unhealthy-but-ready states.
Applied to files:
pkg/operator/manager.go
📚 Learning: 2025-07-14T10:59:21.959Z
Learnt from: Neon-White
Repo: noobaa/noobaa-operator PR: 1587
File: pkg/system/reconciler.go:287-291
Timestamp: 2025-07-14T10:59:21.959Z
Learning: NooBaa operator design constraint: Multiple NooBaa instances are not supported in the same namespace, only across different namespaces. This means hard-coded resource names like ConfigMaps are acceptable within a namespace scope.
Applied to files:
deploy/operator.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: run-kms-kmip-test
- GitHub Check: auto-update
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-cli-tests
- GitHub Check: run-operator-tests
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-azure-vault-test
- GitHub Check: golangci-lint
- GitHub Check: run-hac-test
- GitHub Check: run-kms-dev-test
- GitHub Check: run-kms-tls-token-test
🔇 Additional comments (4)
deploy/operator.yaml (1)
6-8: Good change to ensure single operator instance during upgrades.The Recreate strategy ensures that the old operator pod is terminated before the new one starts, eliminating the possibility of concurrent operator instances. This is appropriate for a singleton operator design.
Note that with this strategy, correct health probe configuration becomes critical—if the new pod fails its readiness checks, the upgrade will fail with downtime since the old pod has already been terminated.
pkg/operator/manager.go (1)
48-52: The binding address change is correct and necessary — the original review comment's premise is inaccurate.The web search reveals that
:8081and0.0.0.0:8081are not functionally equivalent in controller-runtime. The:8081syntax causes controller-runtime to bind to the IPv6 wildcard[::]:8081, which on some systems fails to properly dual-stack IPv4 connections. Since Kubernetes kubelet probes target the Pod IP (IPv4 in most clusters), an IPv6-only or misconfigured dual-stack binding causes probe failures and operator unhealthiness.Explicitly binding to
0.0.0.0:8081ensures the health probe is reachable from the kubelet via the Pod's IPv4 address across all environments, making this a valid and necessary fix—not just explicit documentation as the review suggested.pkg/bundle/deploy.go (2)
6394-6394: Sha256 update looks consistent with the inlined operator Deployment manifestThe updated
Sha256_deploy_operator_yamlcorrectly reflects thatFile_deploy_operator_yamlhas changed (strategy block). Nothing else to flag here; just keep in mind that any future edits todeploy/operator.yamlmust be mirrored in this embedded copy and its hash to avoid mismatches in bundle validation paths.
6396-6457: Switching operator Deployment strategy toRecreateis reasonable for enforcing singleton behaviorChanging the operator Deployment to:
spec: replicas: 1 strategy: type: Recreateensures the old operator pod is fully terminated before a new one is brought up during upgrades, which matches the PR goal of preventing overlapping instances when readiness/leadership can’t be fully trusted. The trade‑off is a brief period with no operator pod during rollouts, which is typically acceptable for this kind of control-plane component.
Explain the changes
This PR attempts to fix upgrade failure due to readiness probe. This PR attempts 2 address to reasons:
Recreateto ensure only 1 instance is created by k8s.:8081which binded to local interface only hence the probes would fail. This PR binds it to all the interfaces hence fixing the probe.Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.