Re-enable HAC for components other than DB #1810
📝 Walkthrough

HAC initialization now enables HAC unless the environment variable HAC_ENABLED is explicitly set to "false". Pod-deletion logic was simplified to always use a 0-second grace period and to skip deleting pods whose names start with `noobaa-db-pg-`.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/hac/hac.go (1)
57-60: Early `return err` skips remaining pods when any single deletion fails.

If deleting the first pod in the list fails (e.g., a transient API server hiccup), all subsequent pods are left on the failed node for the entire reconciliation round. While the reconciler will retry on the returned error, a more resilient pattern is to continue iterating, accumulate errors, and return a combined error afterward, ensuring a best-effort attempt to delete every eligible pod in one pass.
♻️ Proposed refactor to continue-on-error
```diff
+	var errs []error
 	for _, pod := range podList.Items {
 		if strings.HasPrefix(pod.Name, "noobaa-db-pg-") {
-			log.Warningln("HAC skipping noobaa-db-pg-* pod")
+			log.Warningf("HAC skipping DB pod %v/%v: force deletion of DB pods can cause data corruption", pod.Namespace, pod.Name)
 			continue
 		}
 		if err := pd.Client.Delete(context.Background(), &pod, &deleteOpts); err != nil {
 			log.Warningf("❌ pod %v/%v, node %v: deletion failed: %v", pod.Namespace, pod.Name, pd.NodeName, err)
-			return err
+			errs = append(errs, err)
+			continue
 		}
 		log.Infof("✅ pod %v/%v, node %v: deletion succeeded", pod.Namespace, pod.Name, pd.NodeName)
 	}
+
+	return errors.Errorf("%v", errs) // or use a multi-error combiner
```

Alternatively, use `k8s.io/apimachinery/pkg/util/errors` (`utilerrors.NewAggregate`) to combine errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/hac/hac.go` around lines 57 - 60, The current loop returns early on the first failure from pd.Client.Delete which stops attempting deletion of remaining pods; change the loop that calls pd.Client.Delete (referencing pod, deleteOpts, pd.Client and pd.NodeName) to instead log each per-pod failure and append the error to a slice (or use k8s.io/apimachinery/pkg/util/errors utilerrors.NewAggregate) so the loop continues trying all pods, and after the loop return either nil if no errors or the aggregated error; ensure logging still includes pod.Namespace, pod.Name and pd.NodeName for each failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/add_hac.go`:
- Around line 11-14: The PR flips HAC_ENABLED semantics causing unset → enabled;
update documentation, release notes, and runtime logging to avoid surprise:
modify doc/high-availability-controller.md to describe the new opt-out behavior
and explicitly state that operators must set HAC_ENABLED=false to disable HAC;
add an upgrade/release note describing the behavioral change and migration steps
for clusters that relied on the previous default; and add a startup log line in
the manager initialization where AddToManagerFuncs is appended (referencing
AddToManagerFuncs and hac.Add) that logs the observed HAC_ENABLED value and
whether the HAC controller was enabled so operators can easily see the active
setting on boot.
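The startup log line the comment asks for can be sketched as below. `hacStartupMessage` is a hypothetical helper; the real wiring (`AddToManagerFuncs`, `hac.Add`) lives in the operator's manager initialization and is only referenced in a comment here.

```go
package main

import (
	"fmt"
	"os"
)

// hacStartupMessage records the observed HAC_ENABLED value and the
// resulting decision, so operators can see the active setting on boot.
func hacStartupMessage(val string) string {
	return fmt.Sprintf("HAC_ENABLED=%q, HAC controller enabled: %v", val, val != "false")
}

func main() {
	fmt.Println(hacStartupMessage(os.Getenv("HAC_ENABLED")))
	// In the operator, the registration would follow the same check, e.g.:
	// if os.Getenv("HAC_ENABLED") != "false" {
	//     AddToManagerFuncs = append(AddToManagerFuncs, hac.Add)
	// }
}
```

Logging the raw value with %q (rather than just the boolean) makes misconfigurations like `HAC_ENABLED=False` immediately visible.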
In `@pkg/hac/hac.go`:
- Around line 52-56: The warning log and pod-filtering are fragile: replace the
prefix check strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based
check that reads the pod's labels (e.g., pod.Labels["noobaa-db"] == "postgres")
and skip pods matching that label, and update the log.Warningln call to include
the actual pod name (pod.Name) so the message identifies which pod was skipped;
change the filter logic and the log site where strings.HasPrefix and pod.Name
are used to reference pod.Labels and include pod.Name in the warning text.
---
Nitpick comments:
In `@pkg/hac/hac.go`:
- Around line 57-60: The current loop returns early on the first failure from
pd.Client.Delete which stops attempting deletion of remaining pods; change the
loop that calls pd.Client.Delete (referencing pod, deleteOpts, pd.Client and
pd.NodeName) to instead log each per-pod failure and append the error to a slice
(or use k8s.io/apimachinery/pkg/util/errors utilerrors.NewAggregate) so the loop
continues trying all pods, and after the loop return either nil if no errors or
the aggregated error; ensure logging still includes pod.Namespace, pod.Name and
pd.NodeName for each failure.
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
Force-pushed from f56f026 to fc6500e.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/hac/hac.go`:
- Around line 52-56: Replace the fragile prefix check that uses
strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based check on
pod.Labels (e.g., if v, ok := pod.Labels["noobaa-db"]; ok && v == "postgres" {
... }) and ensure you nil-check the Labels map before reading; also change the
log call from log.Infof to the appropriate warning-level logger (e.g.,
log.Warningf or log.Warningln) while keeping the pod.Name in the message so
skipping DB pods is clearly visible to operators.
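The label-based check suggested above can be sketched as a small predicate. The label key/value (`noobaa-db` = `postgres`) are taken from the review comment; `isDBPod` is a hypothetical name. Note that reading from a nil map in Go is safe and yields the zero value, so the comma-ok form below mainly keeps the intent explicit.

```go
package main

import "fmt"

// isDBPod reports whether a pod's labels mark it as the NooBaa Postgres DB,
// replacing the fragile noobaa-db-pg- name-prefix check.
func isDBPod(labels map[string]string) bool {
	v, ok := labels["noobaa-db"]
	return ok && v == "postgres"
}

func main() {
	fmt.Println(isDBPod(map[string]string{"noobaa-db": "postgres"})) // true
	fmt.Println(isDBPod(nil))                                        // false: nil map reads are safe
}
```

In the controller loop this would be `if isDBPod(pod.Labels) { log.Warningf("HAC skipping DB pod %v", pod.Name); continue }`.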
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com> (cherry picked from commit 5af50ed)
Explain the changes
When HAC is disabled (the latest default behaviour) and a node is suddenly lost for any reason, noobaa pods that have an RWO volume mounted (a StatefulSet, or a Deployment plus a volume) stay stuck in the "Terminating" state. This is in line with standard Kubernetes behaviour, but it creates a problem: some NooBaa components (such as noobaa-core or noobaa-endpoint) do not need that safeguard.
Issues: Fixed #xxx / Gap #xxx
Testing Instructions: