Skip to content

fix: use update-in-place for PrometheusRule ConfigMaps#8290

Merged
simonpasquier merged 1 commit intoprometheus-operator:mainfrom
Sanchit2662:fix/prometheusrule-configmap-sync
Jan 19, 2026
Merged

fix: use update-in-place for PrometheusRule ConfigMaps#8290
simonpasquier merged 1 commit intoprometheus-operator:mainfrom
Sanchit2662:fix/prometheusrule-configmap-sync

Conversation

@Sanchit2662
Copy link
Contributor

Description

This PR fixes a non-atomic reconciliation pattern in PrometheusRuleSyncer.Sync() that could cause temporary and silent loss of alerting and recording rules during PrometheusRule updates.

Previously, the operator deleted all existing rule ConfigMaps before recreating them. This introduced a time window where no rule ConfigMaps existed, allowing the config-reloader sidecar to trigger a Prometheus reload with missing or partial rules. The issue is triggered during normal reconciliation and can impact production alerting without any visible errors.

This change makes PrometheusRule reconciliation continuous and safe by switching to update-in-place semantics.


Type of change

  • BUGFIX
  • ENHANCEMENT

What was wrong (root cause)

PrometheusRuleSyncer.Sync() used a delete-then-create pattern for rule ConfigMaps:

// Delete and recreate the configmap(s).
for _, cm := range current {
    err := prs.cmClient.Delete(ctx, cm.Name, metav1.DeleteOptions{})
    if err != nil {
        return nil, err
    }
}

for _, cm := range configMaps {
    _, err := prs.cmClient.Create(ctx, &cm, metav1.CreateOptions{})
    if err != nil {
        return nil, err
    }
}

During this deletion window:

  • ConfigMaps briefly do not exist
  • The config-reloader detects the change and reloads Prometheus
  • Prometheus reloads with missing alerting / recording rules
  • No error is surfaced, resulting in a silent monitoring gap

This happens during any PrometheusRule update, including label or selector changes.


Impact

Before this fix:

  • Alerting rules can temporarily disappear → alerts cannot fire
  • Recording rules can disappear → dashboards and derived metrics break
  • Failures are silent (no logs, no reconciliation errors)
  • HA setups do not help (replicas reconcile simultaneously)

After this fix:

  • Rules remain continuously available during reconciliation
  • Config reloads always see a valid rule set
  • Partial failures no longer result in empty rules
  • Alerting behavior is predictable and safe

Fix summary

1. Added CreateOrUpdateConfigMap helper

A new utility function was added to atomically create or update ConfigMaps:

func CreateOrUpdateConfigMap(
    ctx context.Context,
    cmClient clientv1.ConfigMapInterface,
    desired *v1.ConfigMap,
) error

Key properties:

  • Uses retry.RetryOnConflict
  • Creates the ConfigMap if it does not exist
  • Updates in place if it already exists
  • Preserves external labels and annotations

2. Rewrote PrometheusRuleSyncer.Sync()

The reconciliation order was changed to avoid rule disappearance:

Before

  1. Delete all ConfigMaps
  2. Create new ConfigMaps
  3. ❌ Rules temporarily missing

After

  1. Create or update desired ConfigMaps
  2. Delete only excess ConfigMaps that are no longer needed
  3. ✅ Rules always present

This mirrors the existing CreateOrUpdateSecret pattern used for Alertmanager configuration.


Verification

  • ✅ Code builds successfully
  • ✅ Existing PrometheusRule sync tests pass
  • ✅ No behavior change outside rule reconciliation logic

Signed-off-by: Sanchit2662 <sanchit2662@gmail.com>
@Sanchit2662 Sanchit2662 requested a review from a team as a code owner January 17, 2026 16:36
@Sanchit2662
Copy link
Contributor Author

Hi @simonpasquier ,
This PR fixes a non-atomic delete-then-create pattern in PrometheusRule syncing that can briefly cause alerting and recording rules to disappear during reconciliation.
I’d really appreciate it if you could take some time to review it when you get a chance. Thanks a lot!

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

The process isn't 100% atomic and safe (e.g. the config-reloader might see a in-progress update of the configmaps or a to-be-deleted configmap) but I don't see a (simple) way to make it better.

@simonpasquier simonpasquier merged commit 4fa9d25 into prometheus-operator:main Jan 19, 2026
22 checks passed
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Feb 6, 2026
…r to v0.89.0 (#3775)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator) | minor | `v0.88.1` → `v0.89.0` |

---

### Release Notes

<details>
<summary>prometheus-operator/prometheus-operator (prometheus-operator/prometheus-operator)</summary>

### [`v0.89.0`](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.89.0): 0.89.0 / 2026-02-05

[Compare Source](prometheus-operator/prometheus-operator@v0.88.1...v0.89.0)

- \[ENHANCEMENT] Add `hostNetwork` field to the `Alertmanager` CRD. [#&#8203;8281](prometheus-operator/prometheus-operator#8281)
- \[ENHANCEMENT] Add the `crds` and `full-crds` commands to the operator's binary. [#&#8203;8251](prometheus-operator/prometheus-operator#8251)
- \[ENHANCEMENT] Report deprecated field usage in the `Reconciled` condition type. [#&#8203;8236](prometheus-operator/prometheus-operator#8236)
- \[ENHANCEMENT] Avoid unnecessary reconciliation upon creation of the `ThanosRuler` StatefulSet. [#&#8203;8347](prometheus-operator/prometheus-operator#8347)
- \[ENHANCEMENT] Add `bodySizeLimit` to the ScrapeConfig CRD. [#&#8203;8348](prometheus-operator/prometheus-operator#8348)
- \[ENHANCEMENT] Support `http_headers` field in the Alertmanager Secret. [#&#8203;8357](prometheus-operator/prometheus-operator#8357)
- \[ENHANCEMENT] Add the `-kubelet-http-metrics` flag to enable/disable the HTTP metrics port in the Kubelet endpoint (default=enabled). [#&#8203;8350](prometheus-operator/prometheus-operator#8350)
- \[ENHANCEMENT] Include `operator.prometheus.io/version` annotation in the full version of CRDs. [#&#8203;8279](prometheus-operator/prometheus-operator#8279)
- \[BUGFIX] Validate VictorOps global configuration in the `Alertmanager` CRD. [#&#8203;8020](prometheus-operator/prometheus-operator#8020)
- \[BUGFIX] Validate Jira global configuration in the `Alertmanager` CRD. [#&#8203;8265](prometheus-operator/prometheus-operator#8265)
- \[BUGFIX] Validate VictorOps receiver's URL in the `AlertmanagerConfig` CRD. [#&#8203;8258](prometheus-operator/prometheus-operator#8258)
- \[BUGFIX] Validate Webex receiver's URL in the `AlertmanagerConfig` CRD. [#&#8203;8255](prometheus-operator/prometheus-operator#8255)
- \[BUGFIX] Validate Jira receiver's URL configuration in the `AlertmanagerConfig` CRD. [#&#8203;8230](prometheus-operator/prometheus-operator#8230)
- \[BUGFIX] Validate OpsGenie receiver configuration in the `AlertmanagerConfig` CRD. [#&#8203;8267](prometheus-operator/prometheus-operator#8267)
- \[BUGFIX] Validate WeChat receiver configuration in the `AlertmanagerConfig` CRD. [#&#8203;8271](prometheus-operator/prometheus-operator#8271)
- \[BUGFIX] Validate SNS receiver configuration in the `AlertmanagerConfig` CRD. [#&#8203;8217](prometheus-operator/prometheus-operator#8217)
- \[BUGFIX] Validate Webex global configuration in the `Alertmanager` CRD. [#&#8203;7979](prometheus-operator/prometheus-operator#7979)
- \[BUGFIX] Validate Telegram global configuration in the `Alertmanager` CRD. [#&#8203;8268](prometheus-operator/prometheus-operator#8268)
- \[BUGFIX] Restore statefulset's labels if the creation fails with AlreadyExists. [#&#8203;8343](prometheus-operator/prometheus-operator#8343)
- \[BUGFIX] Fix potential panic due to informer cache races. [#&#8203;8310](prometheus-operator/prometheus-operator#8310)
- \[BUGFIX] Support probers defined with IPv6 addresses in the `Probe` CRD. [#&#8203;8354](prometheus-operator/prometheus-operator#8354)
- \[BUGFIX] Prevent group and repeat intervals with zero duration from breaking Alertmanager. [#&#8203;8126](prometheus-operator/prometheus-operator#8126)
- \[BUGFIX] Propagate all supported RocketChat attributes for `AlertmanagerConfig` CRD. [#&#8203;8016](prometheus-operator/prometheus-operator#8016)
- \[BUGFIX] Add URL validation for WeChat receiver. [#&#8203;8256](prometheus-operator/prometheus-operator#8256)
- \[BUGFIX] Add URL validation for SNS receiver. [#&#8203;8259](prometheus-operator/prometheus-operator#8259)
- \[BUGFIX] Fix GCE service discovery for the `ScrapeConfig` CRD. [#&#8203;8284](prometheus-operator/prometheus-operator#8284)
- \[BUGFIX] Avoid stale conditions in `Alertmanager`, `ThanosRuler`, `Prometheus` and `PrometheusAgent` resources. [#&#8203;8304](prometheus-operator/prometheus-operator#8304)
- \[BUGFIX] Fix race condition when updating rule ConfigMaps. [#&#8203;8290](prometheus-operator/prometheus-operator#8290)
- \[BUGFIX] Fix race condition when patching finalizers. [#&#8203;8323](prometheus-operator/prometheus-operator#8323)
- \[BUGFIX] Reconcile `ScrapeConfig` resources when namespace selection changes. [#&#8203;8334](prometheus-operator/prometheus-operator#8334)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4zLjYiLCJ1cGRhdGVkSW5WZXIiOiI0My4zLjYiLCJ0YXJnZXRCcmFuY2giOiJtYWluIiwibGFiZWxzIjpbImltYWdlIl19-->

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/3775
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
renovate bot added a commit to sdwilsh/ansible-playbooks that referenced this pull request Feb 21, 2026
…r to v0.89.0

##### [\`v0.89.0\`](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.89.0)

- \[ENHANCEMENT] Add `hostNetwork` field to the `Alertmanager` CRD. [#8281](prometheus-operator/prometheus-operator#8281)
- \[ENHANCEMENT] Add the `crds` and `full-crds` commands to the operator's binary. [#8251](prometheus-operator/prometheus-operator#8251)
- \[ENHANCEMENT] Report deprecated field usage in the `Reconciled` condition type. [#8236](prometheus-operator/prometheus-operator#8236)
- \[ENHANCEMENT] Avoid unnecessary reconciliation upon creation of the `ThanosRuler` StatefulSet. [#8347](prometheus-operator/prometheus-operator#8347)
- \[ENHANCEMENT] Add `bodySizeLimit` to the ScrapeConfig CRD. [#8348](prometheus-operator/prometheus-operator#8348)
- \[ENHANCEMENT] Support `http_headers` field in the Alertmanager Secret. [#8357](prometheus-operator/prometheus-operator#8357)
- \[ENHANCEMENT] Add the `-kubelet-http-metrics` flag to enable/disable the HTTP metrics port in the Kubelet endpoint (default=enabled). [#8350](prometheus-operator/prometheus-operator#8350)
- \[ENHANCEMENT] Include `operator.prometheus.io/version` annotation in the full version of CRDs. [#8279](prometheus-operator/prometheus-operator#8279)
- \[BUGFIX] Validate VictorOps global configuration in the `Alertmanager` CRD. [#8020](prometheus-operator/prometheus-operator#8020)
- \[BUGFIX] Validate Jira global configuration in the `Alertmanager` CRD. [#8265](prometheus-operator/prometheus-operator#8265)
- \[BUGFIX] Validate VictorOps receiver's URL in the `AlertmanagerConfig` CRD. [#8258](prometheus-operator/prometheus-operator#8258)
- \[BUGFIX] Validate Webex receiver's URL in the `AlertmanagerConfig` CRD. [#8255](prometheus-operator/prometheus-operator#8255)
- \[BUGFIX] Validate Jira receiver's URL configuration in the `AlertmanagerConfig` CRD. [#8230](prometheus-operator/prometheus-operator#8230)
- \[BUGFIX] Validate OpsGenie receiver configuration in the `AlertmanagerConfig` CRD. [#8267](prometheus-operator/prometheus-operator#8267)
- \[BUGFIX] Validate WeChat receiver configuration in the `AlertmanagerConfig` CRD. [#8271](prometheus-operator/prometheus-operator#8271)
- \[BUGFIX] Validate SNS receiver configuration in the `AlertmanagerConfig` CRD. [#8217](prometheus-operator/prometheus-operator#8217)
- \[BUGFIX] Validate Webex global configuration in the `Alertmanager` CRD. [#7979](prometheus-operator/prometheus-operator#7979)
- \[BUGFIX] Validate Telegram global configuration in the `Alertmanager` CRD. [#8268](prometheus-operator/prometheus-operator#8268)
- \[BUGFIX] Restore statefulset's labels if the creation fails with AlreadyExists. [#8343](prometheus-operator/prometheus-operator#8343)
- \[BUGFIX] Fix potential panic due to informer cache races. [#8310](prometheus-operator/prometheus-operator#8310)
- \[BUGFIX] Support probers defined with IPv6 addresses in the `Probe` CRD. [#8354](prometheus-operator/prometheus-operator#8354)
- \[BUGFIX] Prevent group and repeat intervals with zero duration from breaking Alertmanager. [#8126](prometheus-operator/prometheus-operator#8126)
- \[BUGFIX] Propagate all supported RocketChat attributes for `AlertmanagerConfig` CRD. [#8016](prometheus-operator/prometheus-operator#8016)
- \[BUGFIX] Add URL validation for WeChat receiver. [#8256](prometheus-operator/prometheus-operator#8256)
- \[BUGFIX] Add URL validation for SNS receiver. [#8259](prometheus-operator/prometheus-operator#8259)
- \[BUGFIX] Fix GCE service discovery for the `ScrapeConfig` CRD. [#8284](prometheus-operator/prometheus-operator#8284)
- \[BUGFIX] Avoid stale conditions in `Alertmanager`, `ThanosRuler`, `Prometheus` and `PrometheusAgent` resources. [#8304](prometheus-operator/prometheus-operator#8304)
- \[BUGFIX] Fix race condition when updating rule ConfigMaps. [#8290](prometheus-operator/prometheus-operator#8290)
- \[BUGFIX] Fix race condition when patching finalizers. [#8323](prometheus-operator/prometheus-operator#8323)
- \[BUGFIX] Reconcile `ScrapeConfig` resources when namespace selection changes. [#8334](prometheus-operator/prometheus-operator#8334)

---
##### [\`v0.88.1\`](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.88.1)

- \[BUGFIX] Validate `webhookURL` secret for `MSTeams` receiver in `AlertmanagerConfig` CRD. [#8294](prometheus-operator/prometheus-operator#8294)
- \[BUGFIX] Revert maximum version check for `EC2/Lightsail` SD in `ScrapeConfig` CRD. [#8308](prometheus-operator/prometheus-operator#8308)
- \[BUGFIX] Relax URL validation in `Slack` receiver in AlertmanagerConfig CRD to support Go templates. [#8299](prometheus-operator/prometheus-operator#8299) [#8331](prometheus-operator/prometheus-operator#8331)
- \[BUGFIX] Relax URL validation in `PagerDuty` in AlertmanagerConfig CRD to support Go templates. [#8319](prometheus-operator/prometheus-operator#8319)
- \[BUGFIX] Relax URL validation in `WebhookConfig` in AlertmanagerConfig CRD to support Go templates. [#8307](prometheus-operator/prometheus-operator#8307) [#8317](prometheus-operator/prometheus-operator#8317)
- \[BUGFIX] Relax URL validation in `RocketChat` receiver in AlertmanagerConfig CRD to support Go templates. [#8318](prometheus-operator/prometheus-operator#8318)
- \[BUGFIX] Relax URL validation in `Pushover` receiver in AlertmanagerConfig CRD to support Go templates. [#8307](prometheus-operator/prometheus-operator#8307) [#8316](prometheus-operator/prometheus-operator#8316)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants