Walkthrough

Replaced four cluster-specific NooBaa DB alerts with two generalized, pod-scoped alerts; updated deploy/internal/prometheus-rules.yaml and pkg/bundle/deploy.go to compute per-pod DB usage from WAL+db size vs. pod PVC requests; adjusted thresholds (80%/90%) and evaluation windows (5m/1m); and templated annotations with {{ $labels.pod }}.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
🧹 Nitpick comments (5)
deploy/internal/prometheus-rules.yaml (5)
265-265: Trailing spaces flagged by yamllint. Remove the trailing spaces after the “warning” value.

Apply this diff:

```diff
-        severity: warning 
+        severity: warning
```
252-255: Annotation copy tweaks for clarity and consistency. Texts look good. Consider adding the current utilization percentage to the message for on-call clarity.

For example:

```diff
-        message: The NooBaa database on pod {{ $labels.pod }} is consuming 80% of its PVC capacity. Plan to increase the PVC size soon to prevent service impact.
+        message: The NooBaa database on pod {{ $labels.pod }} is consuming {{ printf "%0.1f" $value }}% of its PVC capacity. Plan to increase the PVC size soon to prevent service impact.
```
268-271: Minor grammar/punctuation and include the actual percent used. Add a period at the end of the description and include the computed percentage in the message for better context.

Apply this diff:

```diff
-        description: The NooBaa database on pod {{ $labels.pod }} has exceeded 90% of its PVC capacity. Immediate action is required
+        description: The NooBaa database on pod {{ $labels.pod }} has exceeded 90% of its PVC capacity. Immediate action is required.
-        message: The NooBaa database on pod {{ $labels.pod }} has exceeded 90% of its PVC capacity. Expand the PVC size now to avoid imminent service disruption.
+        message: The NooBaa database on pod {{ $labels.pod }} has exceeded {{ printf "%0.1f" $value }}% of its PVC capacity. Expand the PVC size now to avoid imminent service disruption.
```
261-261: Avoid hard-coding the namespace if possible.

Hard-coding `namespace="openshift-storage"` limits portability (e.g., non-ODF installs or custom namespaces). Prefer deriving the namespace from the pod metrics and joining through to PVC metrics.
One approach:
- Compute per-pod namespace from a pod metric (e.g., kube_pod_info or directly from cnpg metrics if they include namespace).
- Join to kube_pod_spec_volumes_persistentvolumeclaims_info on (namespace,pod), then to PVC metrics on (namespace,persistentvolumeclaim).
I can propose a full expression using kube_pod_info if you confirm which pod/namespace metrics are available in your Prometheus.
Also applies to: 277-277
257-263: Consider robustness to transient spikes and absent metrics.

WAL can spike briefly during checkpoints/replication catch-up. If you see alert flapping:
- Smooth numerator with avg_over_time over 5–10m.
- Or increase the “for” duration slightly.
Also consider guarding for absent WAL metrics on certain pods:
- Wrap each term with “or on(pod) vector(0)” to avoid dropping the whole vector when one component is absent.
Example for the warning alert numerator:
```promql
(
  (avg_over_time((sum by (pod) (cnpg_collector_pg_wal{value="size"}))[10m:]) or on(pod) vector(0))
  +
  (avg_over_time((sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))[10m:]) or on(pod) vector(0))
)
```

Note that applying `avg_over_time` to an expression (rather than a plain selector) requires subquery syntax, i.e. `[10m:]` with a trailing colon.

Please confirm whether `cnpg_collector_pg_wal` is present on all pods (primary/replicas) in your clusters; if not, the guard is advisable.
Also applies to: 273-279
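Beyond eyeballing the expressions, the joins and thresholds can be exercised offline with `promtool test rules`. Below is a minimal sketch of such a unit test, assuming the suggested PVC→Pod join is in place; the alert name (`NooBaaDatabaseCapacityWarning`), file names, and all series values are illustrative assumptions, not taken from the repository:

```yaml
# db-alerts_test.yaml -- run with: promtool test rules db-alerts_test.yaml
# Alert name, file paths, and series values are illustrative assumptions.
rule_files:
  - prometheus-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 2 GiB WAL + 7 GiB database = 9 GiB used ...
      - series: 'cnpg_collector_pg_wal{pod="noobaa-db-pg-0", value="size"}'
        values: '2147483648x10'
      - series: 'cnpg_pg_database_size_bytes{pod="noobaa-db-pg-0", datname="nbcore"}'
        values: '7516192768x10'
      # ... against a 10 GiB PVC request = 90% utilization (> 80%, not > 90%).
      - series: 'kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage", persistentvolumeclaim="db-noobaa-db-pg-0"}'
        values: '10737418240x10'
      # PVC -> pod mapping series required by the suggested join.
      - series: 'kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage", persistentvolumeclaim="db-noobaa-db-pg-0", pod="noobaa-db-pg-0"}'
        values: '1x10'
    alert_rule_test:
      - eval_time: 6m            # past the warning alert's 5m "for" window
        alertname: NooBaaDatabaseCapacityWarning
        exp_alerts:
          - exp_labels:
              severity: warning
              pod: noobaa-db-pg-0
```

A test like this would catch the broken `label_replace` denominator immediately: with the old expression the join produces no series, so the expected alert never fires and `promtool` reports a failure.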
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- deploy/internal/prometheus-rules.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/internal/prometheus-rules.yaml
[error] 265-265: trailing spaces
(trailing-spaces)
🔇 Additional comments (1)
deploy/internal/prometheus-rules.yaml (1)
257-263: Fix denominator join to use real pod labels
The rule’s denominator currently copies the PVC name into a `pod` label via `label_replace`, but StatefulSet PVCs (`pgdata-<podName>`) won’t match the real `pod` label on your CNPG metrics, so vector matching fails and the alerts never fire. Use `kube_pod_spec_volumes_persistentvolumeclaims_info` to map PVCs to pods before summing.

Apply this diff:

```diff
-      ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
-      + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
-      /
-      sum by (pod) (
-        label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-      ))) * 100 > 80
+      (
+        (sum by (pod) (cnpg_collector_pg_wal{value="size"})
+        + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
+        /
+        sum by (pod) (
+          max by (namespace, persistentvolumeclaim, pod) (
+            kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
+            * on (namespace, persistentvolumeclaim)
+            group_left(pod)
+            kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
+          )
+        )
+      ) * 100 > 80
```

Optional capacity-based variant to avoid estimation drift:

```promql
sum by (pod) (
  max by (namespace, persistentvolumeclaim, pod) (
    kubelet_volume_stats_capacity_bytes{namespace="openshift-storage"}
    * on (namespace, persistentvolumeclaim)
    group_left(pod)
    kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
  )
)
```

Please confirm:

- `cnpg_collector_pg_wal` exposes a `value=~"size|size_bytes"` series in bytes.
- `kube_pod_spec_volumes_persistentvolumeclaims_info` is present in your kube-state-metrics.
- The joined time series exist for all NooBaa-DB pods.
```yaml
          ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
          + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
          /
          sum by (pod) (
            label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
          ))) * 100 > 90
        for: 1m
```
💡 Verification agent
🧩 Analysis chain
Same vector-matching issue in the critical alert — fix the denominator join here too.
This alert repeats the same label_replace problem; apply the same join pattern to ensure proper pod-keyed matching.
Apply this diff:

```diff
-      ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
-      + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
-      /
-      sum by (pod) (
-        label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-      ))) * 100 > 90
+      (
+        (sum by (pod) (cnpg_collector_pg_wal{value="size"})
+        + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
+        /
+        sum by (pod) (
+          max by (namespace, persistentvolumeclaim, pod) (
+            kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
+            * on (namespace, persistentvolumeclaim)
+            group_left(pod)
+            kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
+          )
+        )
+      ) * 100 > 90
```

If available, prefer `kubelet_volume_stats_capacity_bytes` as noted in the previous comment.
🏁 Script executed:

```shell
#!/bin/bash
set -e
echo "== cnpg_collector_pg_wal occurrences =="
rg -n "cnpg_collector_pg_wal" deploy/internal/prometheus-rules.yaml
echo
echo "== max by (namespace, persistentvolumeclaim, pod) occurrences =="
rg -n "max by \(namespace, persistentvolumeclaim, pod\)" deploy/internal/prometheus-rules.yaml
echo
echo "== label_replace on PVC requests occurrences =="
rg -n "label_replace\(kube_persistentvolumeclaim_resource_requests_storage_bytes" deploy/internal/prometheus-rules.yaml
```

Length of output: 539
Ensure Proper Pod-PVC Join in Critical Alert Denominator
The denominator currently uses label_replace on PVC requests without joining PVCs back to pods, which can misalign metrics. Replace it with a pod-aware join:
```diff
-  sum by (pod) (
-    label_replace(
-      kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"},
-      "pod", "$1", "persistentvolumeclaim", "(.*)"
-    )
-  )
+  sum by (pod) (
+    max by (namespace, persistentvolumeclaim, pod) (
+      kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
+      * on (namespace, persistentvolumeclaim)
+      group_left(pod)
+      kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
+    )
+  )
```

If available, prefer using `kubelet_volume_stats_capacity_bytes` for capacity metrics.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```yaml
          (
            (sum by (pod) (cnpg_collector_pg_wal{value="size"})
            + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
            /
            sum by (pod) (
              max by (namespace, persistentvolumeclaim, pod) (
                kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
                * on (namespace, persistentvolumeclaim)
                group_left(pod)
                kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
              )
            )
          ) * 100 > 90
        for: 1m
```
🤖 Prompt for AI Agents
In deploy/internal/prometheus-rules.yaml around lines 273-279, the alert
denominator sums PVC request bytes without joining those PVC metrics back to
pods, which can misalign with the pod-scoped numerator; fix by mapping PVCs to
their owning pods (e.g., use kube_persistentvolumeclaim_info or kube_pod_info
and label_replace/label_join to add a "pod" label or derive pod ownership via
ownerReferences) and then sum the capacity/requests by(pod) so the denominator
aligns with the numerator; if your cluster exposes
kubelet_volume_stats_capacity_bytes, prefer using that metric aggregated by pod
instead of PVC request bytes.
- changed the alert rules to take into account the WAL size in addition to the database size.
- refactored the alerts to be more generic instead of alert per DB pod.
- changed the description and messages of the alerts.

Signed-off-by: Danny Zaken <dannyzaken@gmail.com>
029a308 to 9d98e31
Actionable comments posted: 0
♻️ Duplicate comments (2)
deploy/internal/prometheus-rules.yaml (2)
273-279: Same broken denominator join as the warning alert; fix here too

This repeats the incorrect `label_replace` approach. Use the same PVC→Pod join fix to ensure proper pod-keyed matching.

Apply this diff:

```diff
-      sum by (pod) (
-        label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-      ))) * 100 > 90
+      sum by (pod) (
+        max by (namespace, persistentvolumeclaim, pod) (
+          kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
+          * on (namespace, persistentvolumeclaim)
+          group_left(pod)
+          kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
+        )
+      )) * 100 > 90
```
257-263: Broken pod-PVC matching in denominator; alerts likely won't fire

The denominator manufactures a `pod` label via `label_replace` from the PVC name, which does not equal the real pod label. This misaligns labels between numerator and denominator, causing one-to-one vector matching to fail. Replace with a proper join from PVC → Pod using `kube_pod_spec_volumes_persistentvolumeclaims_info`.

Apply this diff to fix the join:

```diff
-      sum by (pod) (
-        label_replace(kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}, "pod", "$1", "persistentvolumeclaim", "(.*)"
-      ))) * 100 > 80
+      sum by (pod) (
+        max by (namespace, persistentvolumeclaim, pod) (
+          kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"}
+          * on (namespace, persistentvolumeclaim)
+          group_left(pod)
+          kube_pod_spec_volumes_persistentvolumeclaims_info{namespace="openshift-storage"}
+        )
+      )) * 100 > 80
```
🧹 Nitpick comments (4)
deploy/internal/prometheus-rules.yaml (4)
268-269: Polish alert text for accuracy and punctuation
- The threshold is “> 80%”, not exactly “80%”. Suggest “exceeded 80%” / “more than 80%”.
- End the critical description with a period for consistency.
Apply this text-only tweak:
```diff
-        description: The NooBaa database on pod {{ $labels.pod }} has reached 80% of its PVC capacity.
-        message: The NooBaa database on pod {{ $labels.pod }} is consuming 80% of its PVC capacity. Plan to increase the PVC size soon to prevent service impact.
+        description: The NooBaa database on pod {{ $labels.pod }} has exceeded 80% of its PVC capacity.
+        message: The NooBaa database on pod {{ $labels.pod }} is consuming more than 80% of its PVC capacity. Plan to increase the PVC size soon to prevent service impact.
```

```diff
-        description: The NooBaa database on pod {{ $labels.pod }} has exceeded 90% of its PVC capacity. Immediate action is required
+        description: The NooBaa database on pod {{ $labels.pod }} has exceeded 90% of its PVC capacity. Immediate action is required.
```
265-265: Trim trailing whitespace (yamllint error)

YAMLlint flagged trailing spaces on this line.

Apply this diff:

```diff
-        severity: warning 
+        severity: warning
```
257-263: Consider actual PVC capacity over requests, and filter to the DB volume

Requests can drift from actual capacity post-expansion. If available in your cluster, prefer `kubelet_volume_stats_capacity_bytes` joined to pods, and optionally filter to the specific DB PVC/volume to avoid summing unrelated mounts.
Example replacement for the denominator (both alerts), using capacity:
```promql
sum by (pod) (
  max by (namespace, persistentvolumeclaim, pod) (
    kubelet_volume_stats_capacity_bytes
    * on (namespace, persistentvolumeclaim)
    group_left(pod)
    kube_pod_spec_volumes_persistentvolumeclaims_info
  )
)
```

If the DB PVC can be uniquely identified (e.g., by claim name pattern or volume label), add a selector to restrict the join (e.g., `kube_pod_spec_volumes_persistentvolumeclaims_info{volumename="pgdata"}`).
Also applies to: 273-279
257-262: Resilience: avoid numerator dropouts when one metric is temporarily missing

Adding two separately aggregated series requires both to be present per pod; if either metric is missing, the entire numerator for that pod drops. You can wrap each with a zero-fallback to reduce flapping.
Example pattern:
```promql
(
  (sum by (pod) (cnpg_collector_pg_wal{value="size"}) or on(pod) vector(0))
  +
  (sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}) or on(pod) vector(0))
)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- deploy/internal/prometheus-rules.yaml (1 hunks)
- pkg/bundle/deploy.go (2 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
deploy/internal/prometheus-rules.yaml
[error] 265-265: trailing spaces
(trailing-spaces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: run-kms-tls-sa-test
- GitHub Check: run-cli-tests
- GitHub Check: run-azure-vault-test
- GitHub Check: run-hac-test
- GitHub Check: run-operator-tests
- GitHub Check: run-core-config-map-tests
- GitHub Check: cnpg-deployment-test
- GitHub Check: run-kms-tls-token-test
- GitHub Check: run-kms-key-rotate-test
- GitHub Check: run-kms-dev-test
- GitHub Check: run-kms-kmip-test
- GitHub Check: run-admission-test
🔇 Additional comments (2)
deploy/internal/prometheus-rules.yaml (1)
261-262: Namespace hardcoding may miss metrics outside “openshift-storage”

Both denominators filter to `namespace="openshift-storage"`. If NooBaa is deployed to a different namespace, these alerts will silently fail. Confirm this is always the operator’s namespace or parameterize in the bundle generation.
Would you like me to generate a follow-up commit that derives the namespace dynamically in pkg/bundle/deploy.go (e.g., text/template with the operator namespace) and embeds the rendered rules?
Also applies to: 277-278
pkg/bundle/deploy.go (1)
4791-4822: DB capacity alerts: fix PVC→Pod mapping and remove hardcoded namespace

The current denominator uses

```promql
label_replace(
  kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"},
  "pod", "$1", "persistentvolumeclaim", "(.*)"
)
```

This causes the `pod` label to be the entire PVC name (e.g. `db-noobaa-db-pg-0`), which will never match the real pod label (`noobaa-db-pg-0`), yielding empty results. Hard-coding `namespace="openshift-storage"` also breaks portability if your DB runs elsewhere.

Recommended refactor: use a Prometheus join on the PVC→pod mapping via `kube_pod_spec_volumes_persistentvolumeclaim_info` (ensure your kube-state-metrics version exposes it with `namespace`, `persistentvolumeclaim`, `pod`):

```diff
-        expr: |
-          ((sum by (pod) (cnpg_collector_pg_wal{value="size"})
-          + sum by (pod) (cnpg_pg_database_size_bytes{datname="nbcore"}))
-          /
-          sum by (pod) (
-            label_replace(
-              kube_persistentvolumeclaim_resource_requests_storage_bytes{namespace="openshift-storage"},
-              "pod","$1","persistentvolumeclaim","(.*)"
-            )
-          )
-          ) * 100 > 80
+        expr: |
+          (
+            (
+              sum by (namespace,pod)(cnpg_collector_pg_wal{value="size"})
+              + sum by (namespace,pod)(cnpg_pg_database_size_bytes{datname="nbcore"})
+            )
+            /
+            sum by (namespace,pod) (
+              kube_persistentvolumeclaim_resource_requests_storage_bytes
+              * on(namespace,persistentvolumeclaim) group_left(pod)
+              kube_pod_spec_volumes_persistentvolumeclaim_info
+            )
+          ) * 100 > 80
```

Repeat for the 90% alert.
If you can’t rely on the join metric:
- Change the regex to `".*-(.*)"` so it extracts just the pod suffix.
- Remove or parameterize the hard-coded namespace filter.
Locations to update:
- pkg/bundle/deploy.go lines 4802–4822
Please verify in your Prometheus console that:
- `kube_pod_spec_volumes_persistentvolumeclaim_info` exists with the correct labels.
- The refactored expressions yield a single percentage series per DB pod and correctly track WAL/DB growth.
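The label mismatch the review keeps flagging is easy to see in a toy model of Prometheus one-to-one vector matching. The sketch below is illustrative only: the pod and PVC names follow the StatefulSet pattern described above (`db-noobaa-db-pg-0` vs. `noobaa-db-pg-0`) and the helper `divide_on_pod` is a hypothetical stand-in for PromQL's `/` with matching on the `pod` label.

```python
# Toy model of Prometheus one-to-one vector matching on the "pod" label.
# Illustrates why label_replace(PVC name -> "pod") never matches the real pod
# label, while a PVC->pod join does. Names are illustrative assumptions.

def divide_on_pod(numerator, denominator):
    """Divide two label-keyed vectors, matching one-to-one on 'pod'.

    Entries whose 'pod' key has no partner on the other side are dropped,
    mirroring PromQL binary-operator vector matching.
    """
    return {pod: numerator[pod] / denominator[pod]
            for pod in numerator if pod in denominator}

# Numerator: WAL + DB size, keyed by the real pod label.
usage = {"noobaa-db-pg-0": 8 * 2**30}               # 8 GiB used

# Broken denominator: label_replace copies the whole PVC name into "pod".
broken_requests = {"db-noobaa-db-pg-0": 10 * 2**30}  # 10 GiB requested
assert divide_on_pod(usage, broken_requests) == {}   # no match -> alert never fires

# Fixed denominator: joined through the PVC->pod info metric, real pod key.
joined_requests = {"noobaa-db-pg-0": 10 * 2**30}
assert divide_on_pod(usage, joined_requests) == {"noobaa-db-pg-0": 0.8}
```

The empty result in the broken case is exactly why the alerts silently never fire: the threshold comparison operates on an empty vector, not on a wrong number.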