Skip to content

Fix upgrade failure due to readiness probe#1731

Merged
tangledbytes merged 1 commit intonoobaa:masterfrom
tangledbytes:utkarsh/fix/probe-upgrade-fail
Nov 26, 2025
Merged

Fix upgrade failure due to readiness probe#1731
tangledbytes merged 1 commit intonoobaa:masterfrom
tangledbytes:utkarsh/fix/probe-upgrade-fail

Conversation

@tangledbytes
Copy link
Member

@tangledbytes tangledbytes commented Nov 26, 2025

Explain the changes

This PR attempts to fix upgrade failure due to readiness probe. This PR attempts 2 address to reasons:

  1. NooBaa operator attempts to become a leader before starting the operator server this ensures that only 1 instance of operator can be running in the namespace. To fix this, the deployment strategy was changed to Recreate to ensure only 1 instance is created by k8s.
  2. The readiness probe port started with :8081 which binded to local interface only hence the probes would fail. This PR binds it to all the interfaces hence fixing the probe.

Issues: Fixed #xxx / Gap #xxx

Testing Instructions:

  1. Deploy 5.20.
  2. Deploy 5.21 and see it succeeds.
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

Release Notes

  • Improvements
    • Updated deployment rollout strategy to recreate pods during updates for more predictable deployments.
    • Modified health probe binding to listen on all network interfaces, improving accessibility of operator health checks.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
@coderabbitai
Copy link

coderabbitai bot commented Nov 26, 2025

Walkthrough

Modifies operator deployment configuration by changing the pod update strategy to Recreate mode, updates the deployment bundle manifest with the new strategy configuration, and changes the health probe binding address to listen on all network interfaces instead of localhost only.

Changes

Cohort / File(s) Change Summary
Operator manifest and bundle updates
deploy/operator.yaml, pkg/bundle/deploy.go
Added strategy: {type: Recreate} to the Deployment spec in the operator manifest. Updated the corresponding SHA256 hash and inlined deployment YAML in the bundle to reflect the new strategy configuration.
Health probe configuration
pkg/operator/manager.go
Changed default health probe binding address from :8081 (localhost) to 0.0.0.0:8081 (all interfaces) when the HEALTH_PROBE_BIND_ADDRESS environment variable is not set.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

  • Verify the SHA256 hash in pkg/bundle/deploy.go matches the updated operator manifest
  • Review the operational implications of changing health probe binding from :8081 to 0.0.0.0:8081 (now exposed on all network interfaces)
  • Confirm the Recreate deployment strategy aligns with operational requirements

Suggested labels

size/L

Suggested reviewers

  • dannyzaken
  • liranmauda
  • jackyalbo

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly relates to the main change: fixing upgrade failure caused by readiness probe configuration issues.
Description check ✅ Passed The description follows the template structure with all required sections: explanation of changes, issues (though placeholder), and testing instructions. Main change explanations are clear.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
pkg/operator/manager.go (1)

56-60: Consider removing leader election since Recreate strategy enforces single instance.

The PR description states that the Recreate deployment strategy "replaces prior leader-election-before-server-start behavior." However, the leader election code is still present here. With the Recreate strategy, Kubernetes guarantees only one operator pod exists at any time, making leader election redundant and adding unnecessary startup delay during upgrades.

If you intend to rely on the Recreate strategy for single-instance enforcement, consider removing this leader election block:

-	// Become the leader before proceeding
-	err := leader.Become(util.Context(), "noobaa-operator-lock")
-	if err != nil {
-		log.Fatalf("Failed to become leader: %s", err)
-	}
-
 	// Create a new Cmd to provide shared dependencies and start components

Alternatively, if you want to keep leader election as a defense-in-depth measure, please clarify the rationale in the code comments.

deploy/operator.yaml (1)

67-74: Consider using /healthz for the liveness probe path.

The liveness probe currently uses /readyz, but per best practices for controller-runtime operators, the liveness probe should use /healthz while readiness and startup probes use /readyz. This separation helps avoid masking unhealthy-but-ready states.

Based on learnings, for controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz.

Apply this diff to align with best practices:

 livenessProbe:
   httpGet:
-    path: /readyz
+    path: /healthz
     port: readyz
   initialDelaySeconds: 15

Note: You may also want to consider renaming the port from readyz to something more generic like health or probes since it now serves multiple probe types, though this is optional.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef270ff and ec61475.

📒 Files selected for processing (3)
  • deploy/operator.yaml (1 hunks)
  • pkg/bundle/deploy.go (1 hunks)
  • pkg/operator/manager.go (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-26T14:17:08.013Z
Learnt from: OdedViner
Repo: noobaa/noobaa-operator PR: 1684
File: pkg/bundle/deploy.go:6259-6266
Timestamp: 2025-10-26T14:17:08.013Z
Learning: For controller-runtime operators, use livenessProbe path /healthz and readiness/startup path /readyz (port :8081) to avoid masking unhealthy-but-ready states.

Applied to files:

  • pkg/operator/manager.go
📚 Learning: 2025-07-14T10:59:21.959Z
Learnt from: Neon-White
Repo: noobaa/noobaa-operator PR: 1587
File: pkg/system/reconciler.go:287-291
Timestamp: 2025-07-14T10:59:21.959Z
Learning: NooBaa operator design constraint: Multiple NooBaa instances are not supported in the same namespace, only across different namespaces. This means hard-coded resource names like ConfigMaps are acceptable within a namespace scope.

Applied to files:

  • deploy/operator.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: run-kms-kmip-test
  • GitHub Check: auto-update
  • GitHub Check: run-kms-tls-sa-test
  • GitHub Check: run-cli-tests
  • GitHub Check: run-operator-tests
  • GitHub Check: run-kms-key-rotate-test
  • GitHub Check: cnpg-deployment-test
  • GitHub Check: run-azure-vault-test
  • GitHub Check: golangci-lint
  • GitHub Check: run-hac-test
  • GitHub Check: run-kms-dev-test
  • GitHub Check: run-kms-tls-token-test
🔇 Additional comments (4)
deploy/operator.yaml (1)

6-8: Good change to ensure single operator instance during upgrades.

The Recreate strategy ensures that the old operator pod is terminated before the new one starts, eliminating the possibility of concurrent operator instances. This is appropriate for a singleton operator design.

Note that with this strategy, correct health probe configuration becomes critical—if the new pod fails its readiness checks, the upgrade will fail with downtime since the old pod has already been terminated.

pkg/operator/manager.go (1)

48-52: The binding address change is correct and necessary — the original review comment's premise is inaccurate.

The web search reveals that :8081 and 0.0.0.0:8081 are not functionally equivalent in controller-runtime. The :8081 syntax causes controller-runtime to bind to the IPv6 wildcard [::]:8081, which on some systems fails to properly dual-stack IPv4 connections. Since Kubernetes kubelet probes target the Pod IP (IPv4 in most clusters), an IPv6-only or misconfigured dual-stack binding causes probe failures and operator unhealthiness.

Explicitly binding to 0.0.0.0:8081 ensures the health probe is reachable from the kubelet via the Pod's IPv4 address across all environments, making this a valid and necessary fix—not just explicit documentation as the review suggested.

pkg/bundle/deploy.go (2)

6394-6394: Sha256 update looks consistent with the inlined operator Deployment manifest

The updated Sha256_deploy_operator_yaml correctly reflects that File_deploy_operator_yaml has changed (strategy block). Nothing else to flag here; just keep in mind that any future edits to deploy/operator.yaml must be mirrored in this embedded copy and its hash to avoid mismatches in bundle validation paths.


6396-6457: Switching operator Deployment strategy to Recreate is reasonable for enforcing singleton behavior

Changing the operator Deployment to:

spec:
  replicas: 1
  strategy:
    type: Recreate

ensures the old operator pod is fully terminated before a new one is brought up during upgrades, which matches the PR goal of preventing overlapping instances when readiness/leadership can’t be fully trusted. The trade‑off is a brief period with no operator pod during rollouts, which is typically acceptable for this kind of control-plane component.

@tangledbytes tangledbytes merged commit 6de7e1a into noobaa:master Nov 26, 2025
16 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants