
Re-enable HAC for components other than DB #1810

Merged
nimrod-becker merged 1 commit into noobaa:master from tangledbytes:utkarsh/fix/hac-db
Feb 19, 2026

Conversation

@tangledbytes
Member

@tangledbytes tangledbytes commented Feb 19, 2026

Explain the changes

When HAC is disabled (the current default behaviour), if a node is suddenly lost for any reason, any NooBaa pod that has an RWO volume mounted to it (a StatefulSet, or a Deployment plus a volume) stays stuck in the "Terminating" state. This is in line with k8s behaviour. However, it creates a problem, because some NooBaa components (like noobaa-core or noobaa-endpoint) do not need that safeguard.

Issues: Fixed #xxx / Gap #xxx

  • This PR enables HAC by default but does not touch the DB pods, as we have noticed in the past that forceful deletion of the DB pod can lead to corruption.
  • This PR reverts the grace period back to 0, as it is OK to immediately terminate the core (also, if the grace period is > 0 then k8s still doesn't relocate the pod).

Testing Instructions:

  1. Create a multi-node cluster (I used kind)
  2. Install noobaa
  3. Shut down one of the nodes on which noobaa-core is running
  4. Expect that the pods other than the DB pod that were running on the affected node are relocated to another node by k8s.
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Added protection to prevent accidental deletion of database-related pods during cleanup operations, reducing risk of data loss.
  • Changes

    • HAC controller now enabled by default unless explicitly disabled via configuration, improving feature accessibility.
    • Pod termination grace period updated for improved system responsiveness.

@coderabbitai

coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Walkthrough

HAC initialization now enables HAC unless the environment variable HAC_ENABLED is explicitly "false". Pod-deletion logic was simplified to always use a 0-second grace period and skips deleting pods whose names start with noobaa-db-pg-.

Changes

  • pkg/controller/add_hac.go (HAC Controller Initialization): Changed the HAC enablement check: previously enabled only when HAC_ENABLED == "true"; now enabled unless HAC_ENABLED == "false". Appends hac.Add more permissively.
  • pkg/hac/hac.go (HAC Pod Deletion Logic): Removed the os import; switched to strings. Removed the conditional grace-period logic and environment toggle; the grace period is fixed at 0s. Added an exclusion to skip pods whose names start with noobaa-db-pg-. Logs per-pod skip/success/error messages.
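The deletion filter described above can be sketched without any Kubernetes dependencies (a minimal sketch; `deletable` is a hypothetical helper name, and the real logic in pkg/hac/hac.go iterates a PodList and calls the client's Delete):

```go
package main

import (
	"fmt"
	"strings"
)

// deletable reports whether the HAC controller should force-delete a pod.
// Pods whose names start with "noobaa-db-pg-" are skipped, because
// forceful deletion of the DB pod has been observed to cause corruption.
func deletable(podName string) bool {
	return !strings.HasPrefix(podName, "noobaa-db-pg-")
}

func main() {
	// Hypothetical pod names for illustration.
	pods := []string{"noobaa-core-0", "noobaa-db-pg-0", "noobaa-endpoint-7f9c"}
	var grace int64 = 0 // grace period fixed at 0s for immediate termination
	for _, p := range pods {
		if !deletable(p) {
			fmt.Printf("skip %s (DB pod)\n", p)
			continue
		}
		fmt.Printf("delete %s with gracePeriodSeconds=%d\n", p, grace)
	}
}
```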

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Manager
    participant HAC as HAC_Controller
    participant K8sAPI as Kubernetes_API
    Manager->>HAC: register Add function (enabled unless HAC_ENABLED == "false")
    HAC->>K8sAPI: list pods
    loop per pod
        HAC->>HAC: check pod name prefix
        alt name starts with "noobaa-db-pg-"
            HAC->>HAC: log skip
        else
            HAC->>K8sAPI: delete pod (gracePeriodSeconds=0)
            K8sAPI-->>HAC: deletion success/error
            HAC->>HAC: log result
        end
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • alphaprinz
  • nimrod-becker
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title accurately reflects the main change: modifying HAC enablement logic and excluding database-related pods from deletion.
  • Description check ✅ Passed: the pull request description provides clear explanations of changes, specific testing instructions, and references issues, though the documentation and tests checkboxes are not marked.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/hac/hac.go (1)

57-60: An early `return err` skips the remaining pods when any single deletion fails.

If deleting the first pod in the list fails (e.g., transient API server hiccup), all subsequent pods are left on the failed node for the entire reconciliation round. While the reconciler will retry on the returned error, a more resilient pattern is to continue iterating, accumulate errors, and return a combined error afterward — ensuring a best-effort attempt to delete every eligible pod in one pass.

♻️ Proposed refactor to continue-on-error

```diff
+	var errs []error
 	for _, pod := range podList.Items {
 		if strings.HasPrefix(pod.Name, "noobaa-db-pg-") {
-			log.Warningln("HAC skipping noobaa-db-pg-* pod")
+			log.Warningf("HAC skipping DB pod %v/%v — force deletion of DB pods can cause data corruption", pod.Namespace, pod.Name)
 			continue
 		}
 		if err := pd.Client.Delete(context.Background(), &pod, &deleteOpts); err != nil {
 			log.Warningf("❌ pod %v/%v, node %v: deletion failed: %v", pod.Namespace, pod.Name, pd.NodeName, err)
-			return err
+			errs = append(errs, err)
+			continue
 		}
 		log.Infof("✅ pod %v/%v, node %v: deletion succeeded", pod.Namespace, pod.Name, pd.NodeName)
 	}
+
+	return utilerrors.NewAggregate(errs) // returns nil when errs is empty
```

Alternatively, use k8s.io/apimachinery/pkg/util/errors (utilerrors.NewAggregate) to combine errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/hac/hac.go` around lines 57 - 60, The current loop returns early on the
first failure from pd.Client.Delete which stops attempting deletion of remaining
pods; change the loop that calls pd.Client.Delete (referencing pod, deleteOpts,
pd.Client and pd.NodeName) to instead log each per-pod failure and append the
error to a slice (or use k8s.io/apimachinery/pkg/util/errors
utilerrors.NewAggregate) so the loop continues trying all pods, and after the
loop return either nil if no errors or the aggregated error; ensure logging
still includes pod.Namespace, pod.Name and pd.NodeName for each failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/controller/add_hac.go`:
- Around line 11-14: The PR flips HAC_ENABLED semantics causing unset → enabled;
update documentation, release notes, and runtime logging to avoid surprise:
modify doc/high-availability-controller.md to describe the new opt-out behavior
and explicitly state that operators must set HAC_ENABLED=false to disable HAC;
add an upgrade/release note describing the behavioral change and migration steps
for clusters that relied on the previous default; and add a startup log line in
the manager initialization where AddToManagerFuncs is appended (referencing
AddToManagerFuncs and hac.Add) that logs the observed HAC_ENABLED value and
whether the HAC controller was enabled so operators can easily see the active
setting on boot.

In `@pkg/hac/hac.go`:
- Around line 52-56: The warning log and pod-filtering are fragile: replace the
prefix check strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based
check that reads the pod's labels (e.g., pod.Labels["noobaa-db"] == "postgres")
and skip pods matching that label, and update the log.Warningln call to include
the actual pod name (pod.Name) so the message identifies which pod was skipped;
change the filter logic and the log site where strings.HasPrefix and pod.Name
are used to reference pod.Labels and include pod.Name in the warning text.


Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>

@coderabbitai coderabbitai bot left a comment


🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/hac/hac.go`:
- Around line 52-56: Replace the fragile prefix check that uses
strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based check on
pod.Labels (e.g., if v, ok := pod.Labels["noobaa-db"]; ok && v == "postgres" {
... }) and ensure you nil-check the Labels map before reading; also change the
log call from log.Infof to the appropriate warning-level logger (e.g.,
log.Warningf or log.Warningln) while keeping the pod.Name in the message so
skipping DB pods is clearly visible to operators.
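The label-based check the review suggests can be sketched as follows (the "noobaa-db"="postgres" label is an assumption taken from the review comment, not a verified label on the actual pods; note that reading from a nil map is safe in Go, so the comma-ok idiom alone suffices):

```go
package main

import "fmt"

// isDBPod identifies DB pods by label rather than by a fragile name
// prefix. A nil Labels map simply yields the zero value, so no
// explicit nil check is required before the read.
func isDBPod(labels map[string]string) bool {
	v, ok := labels["noobaa-db"]
	return ok && v == "postgres"
}

func main() {
	fmt.Println(isDBPod(map[string]string{"noobaa-db": "postgres"})) // matches
	fmt.Println(isDBPod(map[string]string{"app": "noobaa"}))         // no db label
	fmt.Println(isDBPod(nil))                                        // nil map read is safe
}
```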

@nimrod-becker nimrod-becker merged commit 5af50ed into noobaa:master Feb 19, 2026
15 checks passed
tangledbytes added a commit to tangledbytes/noobaa-operator that referenced this pull request Feb 19, 2026
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
(cherry picked from commit 5af50ed)
(plus five more identical cherry-pick commits by tangledbytes referencing this pull request on Feb 19, 2026)
nimrod-becker pushed a commit that referenced this pull request Feb 19, 2026
(cherry picked from commit 5af50ed)

Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
(plus five more identical cherry-picked commits pushed by nimrod-becker referencing this pull request on Feb 19, 2026)