Re-enable HAC for components other than DB #1810
📝 Walkthrough

HAC initialization now enables HAC unless the environment variable HAC_ENABLED is explicitly set to "false". Pod-deletion logic was simplified to always use a 0-second grace period and to skip deleting pods whose names start with `noobaa-db-pg-`.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/hac/hac.go (1)
57-60: Early `return err` skips remaining pods when any single deletion fails.

If deleting the first pod in the list fails (e.g., a transient API server hiccup), all subsequent pods are left on the failed node for the entire reconciliation round. While the reconciler will retry on the returned error, a more resilient pattern is to continue iterating, accumulate errors, and return a combined error afterward, ensuring a best-effort attempt to delete every eligible pod in one pass.
♻️ Proposed refactor to continue-on-error
```diff
+	var errs []error
 	for _, pod := range podList.Items {
 		if strings.HasPrefix(pod.Name, "noobaa-db-pg-") {
-			log.Warningln("HAC skipping noobaa-db-pg-* pod")
+			log.Warningf("HAC skipping DB pod %v/%v: force deletion of DB pods can cause data corruption", pod.Namespace, pod.Name)
 			continue
 		}
 		if err := pd.Client.Delete(context.Background(), &pod, &deleteOpts); err != nil {
 			log.Warningf("❌ pod %v/%v, node %v: deletion failed: %v", pod.Namespace, pod.Name, pd.NodeName, err)
-			return err
+			errs = append(errs, err)
+			continue
 		}
 		log.Infof("✅ pod %v/%v, node %v: deletion succeeded", pod.Namespace, pod.Name, pd.NodeName)
 	}
+
+	return errors.Errorf("%v", errs) // or use a multi-error combiner
```

Alternatively, use `k8s.io/apimachinery/pkg/util/errors` (`utilerrors.NewAggregate`) to combine errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/hac/hac.go` around lines 57 - 60, The current loop returns early on the first failure from pd.Client.Delete which stops attempting deletion of remaining pods; change the loop that calls pd.Client.Delete (referencing pod, deleteOpts, pd.Client and pd.NodeName) to instead log each per-pod failure and append the error to a slice (or use k8s.io/apimachinery/pkg/util/errors utilerrors.NewAggregate) so the loop continues trying all pods, and after the loop return either nil if no errors or the aggregated error; ensure logging still includes pod.Namespace, pod.Name and pd.NodeName for each failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/add_hac.go`:
- Around line 11-14: The PR flips HAC_ENABLED semantics causing unset → enabled;
update documentation, release notes, and runtime logging to avoid surprise:
modify doc/high-availability-controller.md to describe the new opt-out behavior
and explicitly state that operators must set HAC_ENABLED=false to disable HAC;
add an upgrade/release note describing the behavioral change and migration steps
for clusters that relied on the previous default; and add a startup log line in
the manager initialization where AddToManagerFuncs is appended (referencing
AddToManagerFuncs and hac.Add) that logs the observed HAC_ENABLED value and
whether the HAC controller was enabled so operators can easily see the active
setting on boot.
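The startup log line the comment asks for can be sketched as below. `hacStartupMessage` is a hypothetical helper; the real wiring (`AddToManagerFuncs`, `hac.Add`) lives in the operator's manager initialization and is only referenced in a comment here.

```go
package main

import (
	"fmt"
	"os"
)

// hacStartupMessage records the observed HAC_ENABLED value and the
// resulting decision, so operators can see the active setting on boot.
func hacStartupMessage(val string) string {
	return fmt.Sprintf("HAC_ENABLED=%q, HAC controller enabled: %v", val, val != "false")
}

func main() {
	fmt.Println(hacStartupMessage(os.Getenv("HAC_ENABLED")))
	// In the operator, the registration would follow the same check, e.g.:
	// if os.Getenv("HAC_ENABLED") != "false" {
	//     AddToManagerFuncs = append(AddToManagerFuncs, hac.Add)
	// }
}
```

Logging the raw value with %q (rather than just the boolean) makes misconfigurations like `HAC_ENABLED=False` immediately visible.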
In `@pkg/hac/hac.go`:
- Around line 52-56: The warning log and pod-filtering are fragile: replace the
prefix check strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based
check that reads the pod's labels (e.g., pod.Labels["noobaa-db"] == "postgres")
and skip pods matching that label, and update the log.Warningln call to include
the actual pod name (pod.Name) so the message identifies which pod was skipped;
change the filter logic and the log site where strings.HasPrefix and pod.Name
are used to reference pod.Labels and include pod.Name in the warning text.
---
Nitpick comments:
In `@pkg/hac/hac.go`:
- Around line 57-60: The current loop returns early on the first failure from
pd.Client.Delete which stops attempting deletion of remaining pods; change the
loop that calls pd.Client.Delete (referencing pod, deleteOpts, pd.Client and
pd.NodeName) to instead log each per-pod failure and append the error to a slice
(or use k8s.io/apimachinery/pkg/util/errors utilerrors.NewAggregate) so the loop
continues trying all pods, and after the loop return either nil if no errors or
the aggregated error; ensure logging still includes pod.Namespace, pod.Name and
pd.NodeName for each failure.
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com>
Force-pushed from f56f026 to fc6500e.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/hac/hac.go`:
- Around line 52-56: Replace the fragile prefix check that uses
strings.HasPrefix(pod.Name, "noobaa-db-pg-") with a label-based check on
pod.Labels (e.g., if v, ok := pod.Labels["noobaa-db"]; ok && v == "postgres" {
... }) and ensure you nil-check the Labels map before reading; also change the
log call from log.Infof to the appropriate warning-level logger (e.g.,
log.Warningf or log.Warningln) while keeping the pod.Name in the message so
skipping DB pods is clearly visible to operators.
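The label-based check suggested above can be sketched as a small predicate. The label key/value (`noobaa-db` = `postgres`) are taken from the review comment; `isDBPod` is a hypothetical name. Note that reading from a nil map in Go is safe and yields the zero value, so the comma-ok form below mainly keeps the intent explicit.

```go
package main

import "fmt"

// isDBPod reports whether a pod's labels mark it as the NooBaa Postgres DB,
// replacing the fragile noobaa-db-pg- name-prefix check.
func isDBPod(labels map[string]string) bool {
	v, ok := labels["noobaa-db"]
	return ok && v == "postgres"
}

func main() {
	fmt.Println(isDBPod(map[string]string{"noobaa-db": "postgres"})) // true
	fmt.Println(isDBPod(nil))                                        // false: nil map reads are safe
}
```

In the controller loop this would be `if isDBPod(pod.Labels) { log.Warningf("HAC skipping DB pod %v", pod.Name); continue }`.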
Signed-off-by: Utkarsh Srivastava <srivastavautkarsh8097@gmail.com> (cherry picked from commit 5af50ed)
Explain the changes
When HAC is disabled (the latest default behaviour) and a node is suddenly lost for any reason, noobaa pods that have an RWO volume mounted (a StatefulSet, or a Deployment plus a volume) stay stuck in the "Terminating" state. This is in line with standard Kubernetes behaviour, but it creates a problem: some NooBaa components (such as noobaa-core or noobaa-endpoint) do not need that safeguard.
Issues: Fixed #xxx / Gap #xxx
Testing Instructions: