Fix upgrade failure due to readiness probe by tangledbytes · Pull Request #1731 · noobaa/noobaa-operator

tangledbytes · 2025-11-26T08:59:49Z

Explain the changes

This PR attempts to fix upgrade failure due to readiness probe. This PR attempts 2 address to reasons:

NooBaa operator attempts to become a leader before starting the operator server this ensures that only 1 instance of operator can be running in the namespace. To fix this, the deployment strategy was changed to Recreate to ensure only 1 instance is created by k8s.
The readiness probe port started with :8081 which binded to local interface only hence the probes would fail. This PR binds it to all the interfaces hence fixing the probe.

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

Deploy 5.20.
Deploy 5.21 and see it succeeds.

Doc added/updated
Tests added

Summary by CodeRabbit

Release Notes

Improvements
- Updated deployment rollout strategy to recreate pods during updates for more predictable deployments.
- Modified health probe binding to listen on all network interfaces, improving accessibility of operator health checks.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

coderabbitai · 2025-11-26T09:00:00Z

Walkthrough

Modifies operator deployment configuration by changing the pod update strategy to Recreate mode, updates the deployment bundle manifest with the new strategy configuration, and changes the health probe binding address to listen on all network interfaces instead of localhost only.

Changes

Cohort / File(s)	Change Summary
Operator manifest and bundle updates `deploy/operator.yaml`, `pkg/bundle/deploy.go`	Added `strategy: {type: Recreate}` to the Deployment spec in the operator manifest. Updated the corresponding SHA256 hash and inlined deployment YAML in the bundle to reflect the new strategy configuration.
Health probe configuration `pkg/operator/manager.go`	Changed default health probe binding address from `:8081` (localhost) to `0.0.0.0:8081` (all interfaces) when the HEALTH_PROBE_BIND_ADDRESS environment variable is not set.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Verify the SHA256 hash in pkg/bundle/deploy.go matches the updated operator manifest
Review the operational implications of changing health probe binding from :8081 to 0.0.0.0:8081 (now exposed on all network interfaces)
Confirm the Recreate deployment strategy aligns with operational requirements

Suggested labels

size/L

Suggested reviewers

dannyzaken
liranmauda
jackyalbo

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly relates to the main change: fixing upgrade failure caused by readiness probe configuration issues.
Description check	✅ Passed	The description follows the template structure with all required sections: explanation of changes, issues (though placeholder), and testing instructions. Main change explanations are clear.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

Provide your own instructions using the high_level_summary_instructions setting.
Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

📝 Description — Summarize the main change in 50–60 words, explaining what was done.

📓 References — List relevant issues, discussions, documentation, or related PRs.

📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.

📊 Contributor Summary — Include a Markdown table showing contributions:
| Contributor | Lines Added | Lines Removed | Files Changed |

✔️ Additional Notes — Add any extra reviewer context.
Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

pkg/operator/manager.go (1)
56-60: Consider removing leader election since Recreate strategy enforces single instance.

The PR description states that the Recreate deployment strategy "replaces prior leader-election-before-server-start behavior." However, the leader election code is still present here. With the Recreate strategy, Kubernetes guarantees only one operator pod exists at any time, making leader election redundant and adding unnecessary startup delay during upgrades.

If you intend to rely on the Recreate strategy for single-instance enforcement, consider removing this leader election block:
-	// Become the leader before proceeding
-	err := leader.Become(util.Context(), "noobaa-operator-lock")
-	if err != nil {
-		log.Fatalf("Failed to become leader: %s", err)
-	}
-
 	// Create a new Cmd to provide shared dependencies and start components
Alternatively, if you want to keep leader election as a defense-in-depth measure, please clarify the rationale in the code comments.
deploy/operator.yaml (1)
67-74: Consider using /healthz for the liveness probe path.

The liveness probe currently uses /readyz, but per best practices for controller-runtime operators, the liveness probe should use /healthz while readiness and startup probes use /readyz. This separation helps avoid masking unhealthy-but-ready states.

Based on learnings, for controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz.

Apply this diff to align with best practices:
 livenessProbe:
   httpGet:
-    path: /readyz
+    path: /healthz
     port: readyz
   initialDelaySeconds: 15
Note: You may also want to consider renaming the port from readyz to something more generic like health or probes since it now serves multiple probe types, though this is optional.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef270ff and ec61475.

📒 Files selected for processing (3)

deploy/operator.yaml (1 hunks)
pkg/bundle/deploy.go (1 hunks)
pkg/operator/manager.go (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-10-26T14:17:08.013Z

Learnt from: OdedViner
Repo: noobaa/noobaa-operator PR: 1684
File: pkg/bundle/deploy.go:6259-6266
Timestamp: 2025-10-26T14:17:08.013Z
Learning: For controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz (port :8081) to avoid masking unhealthy-but-ready states.

Applied to files:

pkg/operator/manager.go

📚 Learning: 2025-07-14T10:59:21.959Z

Learnt from: Neon-White
Repo: noobaa/noobaa-operator PR: 1587
File: pkg/system/reconciler.go:287-291
Timestamp: 2025-07-14T10:59:21.959Z
Learning: NooBaa operator design constraint: Multiple NooBaa instances are not supported in the same namespace, only across different namespaces. This means hard-coded resource names like ConfigMaps are acceptable within a namespace scope.

Applied to files:

deploy/operator.yaml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: run-kms-kmip-test
GitHub Check: auto-update
GitHub Check: run-kms-tls-sa-test
GitHub Check: run-cli-tests
GitHub Check: run-operator-tests
GitHub Check: run-kms-key-rotate-test
GitHub Check: cnpg-deployment-test
GitHub Check: run-azure-vault-test
GitHub Check: golangci-lint
GitHub Check: run-hac-test
GitHub Check: run-kms-dev-test
GitHub Check: run-kms-tls-token-test

🔇 Additional comments (4)

deploy/operator.yaml (1)

6-8: Good change to ensure single operator instance during upgrades.

The Recreate strategy ensures that the old operator pod is terminated before the new one starts, eliminating the possibility of concurrent operator instances. This is appropriate for a singleton operator design.

Note that with this strategy, correct health probe configuration becomes critical—if the new pod fails its readiness checks, the upgrade will fail with downtime since the old pod has already been terminated.

pkg/operator/manager.go (1)

48-52: The binding address change is correct and necessary — the original review comment's premise is inaccurate.

The web search reveals that :8081 and 0.0.0.0:8081 are not functionally equivalent in controller-runtime. The :8081 syntax causes controller-runtime to bind to the IPv6 wildcard [::]:8081, which on some systems fails to properly dual-stack IPv4 connections. Since Kubernetes kubelet probes target the Pod IP (IPv4 in most clusters), an IPv6-only or misconfigured dual-stack binding causes probe failures and operator unhealthiness.

Explicitly binding to 0.0.0.0:8081 ensures the health probe is reachable from the kubelet via the Pod's IPv4 address across all environments, making this a valid and necessary fix—not just explicit documentation as the review suggested.
pkg/bundle/deploy.go (2)
6394-6394: Sha256 update looks consistent with the inlined operator Deployment manifest

The updated Sha256_deploy_operator_yaml correctly reflects that File_deploy_operator_yaml has changed (strategy block). Nothing else to flag here; just keep in mind that any future edits to deploy/operator.yaml must be mirrored in this embedded copy and its hash to avoid mismatches in bundle validation paths.

6396-6457: Switching operator Deployment strategy to Recreate is reasonable for enforcing singleton behavior

Changing the operator Deployment to:
spec:
  replicas: 1
  strategy:
    type: Recreate
ensures the old operator pod is fully terminated before a new one is brought up during upgrades, which matches the PR goal of preventing overlapping instances when readiness/leadership can’t be fully trusted. The trade‑off is a brief period with no operator pod during rollouts, which is typically acceptable for this kind of control-plane component.

Fix upgrade failure due to readiness probe

ec61475

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

pull-request-size bot added the size/XS label Nov 26, 2025

tangledbytes requested a review from dannyzaken November 26, 2025 09:00

coderabbitai bot reviewed Nov 26, 2025

View reviewed changes

dannyzaken approved these changes Nov 26, 2025

View reviewed changes

tangledbytes merged commit 6de7e1a into noobaa:master Nov 26, 2025
16 of 17 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix upgrade failure due to readiness probe#1731

Fix upgrade failure due to readiness probe#1731
tangledbytes merged 1 commit intonoobaa:masterfrom
tangledbytes:utkarsh/fix/probe-upgrade-fail

tangledbytes commented Nov 26, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 26, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tangledbytes commented Nov 26, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Explain the changes

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tangledbytes commented Nov 26, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 26, 2025 •

edited

Loading