Conversation
Signed-off-by: Aayush <aayush@li-2392f9cc-287a-11b2-a85c-9fcc04b05da6.ibm.com> (cherry picked from commit c9ed6a2)
Signed-off-by: Aayush Chouhan <achouhan@redhat.com> (cherry picked from commit ab5864b)
Change the kubecheck of standaloneDBPod to be quiet Signed-off-by: liranmauda <liran.mauda@gmail.com> (cherry picked from commit 589e89c)
- changed the alert rules to take into account the WAL size in addition to the database size.
- refactored the alerts to be more generic instead of an alert per DB pod.
- changed the description and messages of the alerts.

Signed-off-by: Danny Zaken <dannyzaken@gmail.com> (cherry picked from commit 9d98e31)
Walkthrough
Consolidates Prometheus DB capacity alerts to pod-aggregated rules and updates the embedded bundle/hash. Automates the core-tag update workflow with scheduled runs, auto-generated tags, branch/PR creation, and cleanup. Adjusts a readiness check in the DB reconciler to use a quiet kube check.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Scheduler as GitHub Scheduler
participant WF as Workflow: update-noobaa-core-tag
participant Gen as Generate Tag
participant Git as Git Ops
participant PR as PR Creation
participant Gate as Check for Changes
participant Cleanup as Cleanup (on failure)
Scheduler->>WF: Trigger (monthly)
WF->>Gen: Generate master-YYYYMMDD (yesterday)
Gen-->>WF: container_image_tag output
WF->>Gate: Detect diffs (pkg/options/options.go)
alt No changes
Gate-->>WF: No diffs
WF-->>Scheduler: Cancel run
else Changes detected
WF->>Git: Create/force branch update-core-tag-<tag>\ncommit with -s
Git-->>WF: Pushed branch
WF->>PR: Create PR (base: master, head: update-core-tag-<tag>)
PR-->>WF: PR URL
note over WF,PR: Downstream wait-for-PR-checks/merge (unchanged)
end
opt Failure at any step
WF->>Cleanup: Delete branch update-core-tag-<tag> if exists
Cleanup-->>WF: Branch removed
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/update-noobaa-core-tag.yml (1)
49-57: Exit early after cancel to avoid running subsequent steps when no changes
After calling gh run cancel in the “Check for changes” step, this job may still continue executing until GitHub cancels it, causing follow-up steps (commit/push/PR) to run and sometimes fail. Explicitly exit 0 right after cancel.
Suggested update to the “Check for changes” step (outside the selected range):
```yaml
- name: Check for changes
  run: |
    if git diff --quiet pkg/options/options.go; then
      echo "Noobaa Core image tag is already up to date"
      echo "No changes needed - canceling workflow"
      gh run cancel ${{ github.run_id }}
      exit 0
    fi
```
🧹 Nitpick comments (4)
.github/workflows/update-noobaa-core-tag.yml (2)
18-18: Prefer GH_TOKEN for gh CLI to avoid shadowing Actions’ GITHUB_TOKEN
Using GITHUB_TOKEN for a custom PAT can be confusing and may interfere with other actions’ expectations. gh respects GH_TOKEN as well. Recommend switching to GH_TOKEN for clarity.
Apply this diff:
```diff
-  GITHUB_TOKEN: ${{ secrets.GHACTION_TOKEN_PR_PERMS }}
+  GH_TOKEN: ${{ secrets.GHACTION_TOKEN_PR_PERMS }}
```
26-32: Make tag generation timezone-stable (UTC) to avoid edge cases
Using the runner’s default timezone can produce different tags around midnight. Force UTC so “yesterday” is deterministic.
- tag="master-$(date -d 'yesterday' +%Y%m%d)" # using yesterday's date to ensure the image is available + tag="master-$(date -u -d 'yesterday' +%Y%m%d)" # use UTC to avoid TZ edge casesdeploy/internal/prometheus-rules.yaml (1)
265-265: Trailing whitespace violates yamllint
There are trailing spaces after the value.
```diff
-    severity: warning 
+    severity: warning
```
pkg/bundle/deploy.go (1)
4806-4806: Nit: trailing spaces in YAML value.
Line has extra trailing spaces after “warning”. Clean it to avoid noisy diffs/lint.
```diff
-        severity: warning 
+        severity: warning
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- .github/workflows/update-noobaa-core-tag.yml (4 hunks)
- deploy/internal/prometheus-rules.yaml (1 hunks)
- pkg/bundle/deploy.go (2 hunks)
- pkg/system/db_reconciler.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pkg/system/db_reconciler.go (1)
pkg/util/util.go (1)
KubeCheckQuiet(617-630)
🪛 YAMLlint (1.37.1)
deploy/internal/prometheus-rules.yaml
[error] 265-265: trailing spaces
(trailing-spaces)
🔇 Additional comments (4)
pkg/system/db_reconciler.go (1)
80-91: Switching to KubeCheckQuiet reduces log noise; verify the panic-on-transient-error trade-off
util.KubeCheckQuiet returns false on common "not found" cases (good) but panics on other errors (e.g., API timeouts). That can crash reconciliation. If the intent is simply "scale down only if the pod exists", this is fine, but please confirm we're comfortable panicking on transient client errors here.
If we want to avoid a crash on transient errors while keeping logs quiet, consider a helper that returns (bool, error) and treats transient errors as a soft false instead of Panic.
.github/workflows/update-noobaa-core-tag.yml (1)
79-79: Confirm permissions for --admin merge and workflow cancellation
gh pr merge --admin and gh run cancel require appropriate scopes/permissions on the token in use. With the switch to a custom secret token, ensure it has:
- repo (to create branches/PRs and merge)
- workflow (to cancel runs)
If you keep using GITHUB_TOKEN (the default Actions token) for git pushes, ensure repository settings allow it to merge PRs per branch protection rules (or keep using the PAT everywhere consistently).
pkg/bundle/deploy.go (2)
4797-4799: Double-check CNPG metric names and labels.
Confirm cnpg_collector_pg_wal{value="size"} and cnpg_pg_database_size_bytes{datname="nbcore"} exist in your CNPG version and are labeled by pod. Some CNPG exporters expose WAL metrics under different time series or labels.
Run these example checks in your Prometheus to validate series and labels:
- cnpg_collector_pg_wal{value="size"}
- sum by(pod)(cnpg_pg_database_size_bytes)
- kube_pod_spec_volumes_persistentvolumeclaims_info
4540-4540: Ensure the embedded SHA matches the updated rules content.
The Sha256_deploy_internal_prometheus_rules_yaml changed. Make sure the computed SHA matches the final YAML to avoid bundle integrity mismatches at runtime.
If you generate these files via a script, please re-run the generator and commit both the content and the new digest in the same change.
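For reference, the integrity check boils down to recomputing SHA-256 over the embedded YAML bytes and comparing it to the generated constant; a minimal sketch (the content string and function name below are placeholders, not the real generated code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// computeDigest returns the hex SHA-256 of the given content, which is
// the kind of value the embedded Sha256_* constants are compared against.
func computeDigest(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

func main() {
	embeddedYAML := "groups:\n- name: noobaa-db-alerts\n" // stand-in for the real file
	fmt.Println(computeDigest(embeddedYAML))
}
```

If the generator is re-run, both the YAML content and this digest must land in the same commit or the comparison fails.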
```yaml
sum by (pod) (
  label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
))) * 100 > 80
for: 5m
```
Fix label_replace: PVC name ≠ pod name; division won’t match series and alerts won’t fire
label_replace is currently copying the entire PVC name into the pod label. StatefulSet PVCs are typically named "<volumeClaimTemplate>-<podname>", so this won’t match the numerator’s pod label and the division will produce no results.
Strip the prefix before the first dash to recover the pod name:
```diff
-    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
+    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "^[^-]+-(.*)"
  ))) * 100 > 80
```
🤖 Prompt for AI Agents
In deploy/internal/prometheus-rules.yaml around lines 260 to 263, the
label_replace is copying the full PVC name into the pod label which breaks the
division because PVC names are "<volumeClaimTemplate>-<podname>"; update the
label_replace regex to strip the prefix up to the first dash and capture only
the pod name (e.g. use a pattern that captures the part after the first '-' and
use that capture as the replacement), keeping the target label "pod" and other
arguments unchanged so the numerator and denominator pod labels match.
```yaml
sum by (pod) (
  label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
))) * 100 > 90
for: 1m
```
Apply the same label_replace fix to the critical alert
Mirror the regex correction so the denominator aligns with the numerator by pod:
```diff
-    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
+    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "^[^-]+-(.*)"
  ))) * 100 > 90
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
sum by (pod) (
  label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "^[^-]+-(.*)"
))) * 100 > 90
for: 1m
```
🤖 Prompt for AI Agents
In deploy/internal/prometheus-rules.yaml around lines 276 to 279, the critical
alert's denominator is not grouped by pod and needs the same label_replace
transformation as the warning alert; update the critical alert's expression to
wrap
kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
with label_replace(..., "pod", "$1", "persistentvolumeclaim", "(.*)") so the
generated "pod" label matches the numerator, ensuring both numerator and
denominator are summed "by (pod)" and the percentage calculation is correct.
```yaml
expr: |
  ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
  + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
  /
  sum by (pod) (
    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
  ))) * 100 > 80
for: 5m
```
🛠️ Refactor suggestion
Alert math will silently break: denominator is per-PVC but is being (mis)labeled as per-pod.
label_replace sets pod="$1" from the PVC name, which won’t match the CNPG metrics’ pod label (e.g., pod=noobaa-db-pg-0 vs PVC=db-noobaa-db-pg-0). The vector division will have no label match, yielding no data and the alert never firing.
Fix by joining PVC requests to pods with kube-state-metrics’ PVC→Pod mapping metric instead of inventing a pod label:
- Remove the hard-coded namespace.
- Join on (namespace,persistentvolumeclaim) and group_left(pod).
- Sum by (pod) after the join so numerator/denominator labels are aligned.
Apply this diff to the first rule’s expr:
```diff
-  ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
-  + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
-  /
-  sum by (pod) (
-    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-  ))) * 100 > 80
+  (
+    sum by (pod) (cnpg_collector_pg_wal{value="size"})
+    + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"})
+  )
+  /
+  sum by (pod) (
+    kube_persistentvolumeclaim_resource_requests_storage_bytes
+    * on (namespace, persistentvolumeclaim)
+    group_left(pod)
+    kube_pod_spec_volumes_persistentvolumeclaims_info
+  )
+  * 100 > 80
```
Note: verify kube_pod_spec_volumes_persistentvolumeclaims_info exists in your kube-state-metrics. If your metric name differs, adapt accordingly.
📝 Committable suggestion
```yaml
expr: |
  (
    sum by (pod) (cnpg_collector_pg_wal{value="size"})
    + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"})
  )
  /
  sum by (pod) (
    kube_persistentvolumeclaim_resource_requests_storage_bytes
    * on (namespace, persistentvolumeclaim)
    group_left(pod)
    kube_pod_spec_volumes_persistentvolumeclaims_info
  )
  * 100 > 80
for: 5m
```
🤖 Prompt for AI Agents
In pkg/bundle/deploy.go around lines 4797 to 4804 the alert expression mislabels
PVC requests as pod-level by using label_replace, causing no label match with
CNPG pod metrics and the alert to never fire; replace the label_replace approach
by joining kube_pod_spec_volumes_persistentvolumeclaims_info (or your
kube-state-metrics equivalent) to map (namespace,persistentvolumeclaim) → pod
using a vector join with group_left(pod) (do not hard-code the namespace), then
sum the joined denominator by(pod) so the numerator and denominator share the
same pod label before performing the division and percentage comparison, and
verify/adjust the kube-state-metrics metric name if necessary.
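The join semantics can be mimicked outside PromQL. This toy Go model (all metric names and values are illustrative) shows what `* on (namespace, persistentvolumeclaim) group_left(pod)` does: the info metric maps each (namespace, PVC) pair to its pod, the pod label is copied onto the size series, and the result can then be summed by pod so it divides cleanly against the CNPG per-pod metrics:

```go
package main

import "fmt"

// key identifies a PVC series by its matching labels.
type key struct{ namespace, pvc string }

// joinPVCToPod performs an inner join on (namespace, pvc), copies the pod
// label from the info mapping onto each size sample, and sums by pod.
func joinPVCToPod(requests map[key]float64, info map[key]string) map[string]float64 {
	byPod := map[string]float64{}
	for k, bytes := range requests {
		if pod, ok := info[k]; ok { // join: only matched series survive
			byPod[pod] += bytes // sum by (pod)
		}
	}
	return byPod
}

func main() {
	requests := map[key]float64{
		{"noobaa", "db-noobaa-db-pg-0"}: 50e9, // PVC requested bytes
	}
	info := map[key]string{
		{"noobaa", "db-noobaa-db-pg-0"}: "noobaa-db-pg-0", // PVC→pod mapping
	}
	fmt.Println(joinPVCToPod(requests, info))
}
```

As in PromQL, PVCs without a matching info series silently drop out of the result, which is why the comment recommends verifying the info metric exists.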
```yaml
expr: |
  ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
  + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
  /
  sum by (pod) (
    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
  ))) * 100 > 80
```
💡 Verification agent
🧩 Analysis chain
Remove hard-coded namespace; alerts won’t work outside “openshift-storage”.
Hard-coding namespace="openshift-storage" in the denominator limits the alert to that namespace and breaks clusters that install NooBaa elsewhere (most do). The join proposed in the other comment already aligns on namespace, so the filter is unnecessary.
Would you like me to update the rule to remove that namespace filter across both alerts and keep the namespace alignment via the join?
🏁 Script executed:
```shell
#!/bin/bash
set -e
echo "Context around lines 4780-4820 in pkg/bundle/deploy.go:"
sed -n '4780,4820p' pkg/bundle/deploy.go || true
echo -e "\nSearching for PromQL join usage in pkg/bundle/deploy.go:"
rg -nP --type=go 'join\(' -C3 pkg/bundle/deploy.go || echo "No 'join(' found"
```
Length of output: 2347
Remove hard-coded namespace filter from NooBaa PVC alerts
Hard-coding namespace="openshift-storage" in the denominator restricts these alerts to that namespace and will break clusters where NooBaa PVCs live elsewhere. Since both numerator and denominator are aggregated by pod, they already align on the pod label—no explicit namespace filter (or join) is needed.
Please update both rules in pkg/bundle/deploy.go:
• Lines 4797–4803 (NooBaaDatabaseReachingCapacity)
• Lines 4809–4815 (NooBaaDatabaseStorageFull)
Suggested diff:
```diff
-  sum by (pod) (
-    label_replace(
-      kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"},
-      "pod", "$1", "persistentvolumeclaim", "(.*)"
-    )
-  )
+  sum by (pod) (
+    label_replace(
+      kube_persistentvolumeclaim_resource_requests_storage_bytes,
+      "pod", "$1", "persistentvolumeclaim", "(.*)"
+    )
+  )
```
```yaml
((sum by (pod) (cnpg_collector_pg_wal{value="size"})
+ sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
/
sum by (pod) (
  label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
))) * 100 > 90
for: 1m
```
🛠️ Refactor suggestion
Same denominator-label mismatch affects the critical alert.
The storage-full rule repeats the same label_replace issue, so it will also never match the numerator.
Apply this diff to the second rule’s expr:
```diff
-  ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
-  + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
-  /
-  sum by (pod) (
-    label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-  ))) * 100 > 90
+  (
+    sum by (pod) (cnpg_collector_pg_wal{value="size"})
+    + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"})
+  )
+  /
+  sum by (pod) (
+    kube_persistentvolumeclaim_resource_requests_storage_bytes
+    * on (namespace, persistentvolumeclaim)
+    group_left(pod)
+    kube_pod_spec_volumes_persistentvolumeclaims_info
+  )
+  * 100 > 90
```
📝 Committable suggestion
```yaml
(
  sum by (pod) (cnpg_collector_pg_wal{value="size"})
  + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"})
)
/
sum by (pod) (
  kube_persistentvolumeclaim_resource_requests_storage_bytes
  * on (namespace, persistentvolumeclaim)
  group_left(pod)
  kube_pod_spec_volumes_persistentvolumeclaims_info
)
* 100 > 90
for: 1m
```
🤖 Prompt for AI Agents
In pkg/bundle/deploy.go around lines 4814 to 4820, the second (critical)
storage-full alert has the same denominator label mismatch as the warning rule,
so its numerator and denominator labels never align; update the critical rule's
expr to use the same PVC→pod join suggested for the warning rule (multiply
kube_persistentvolumeclaim_resource_requests_storage_bytes by
kube_pod_spec_volumes_persistentvolumeclaims_info on (namespace,
persistentvolumeclaim) with group_left(pod)) so the denominator carries a pod
label matching the numerator, ensuring the division can evaluate and the alert
can fire.
Explain the changes
Issues: Fixed #xxx / Gap #xxx
Testing Instructions:
Summary by CodeRabbit
New Features
Chores