feat: implement repair policy for StatefulSets#8443

Merged
simonpasquier merged 5 commits into prometheus-operator:main from simonpasquier:unclog-sts
Mar 17, 2026

Conversation

@simonpasquier
Contributor

@simonpasquier simonpasquier commented Mar 11, 2026

Description

This commit adds a new CLI argument (--repair-policy-for-statefulsets)
to the operator binary. The argument defines the policy to apply when the
operator detects StatefulSet pods that are stuck at a bad revision and
require manual intervention to unblock the rollout [1].

It supports 3 values:

  • none, in which case the operator only logs a warning if it detects
    stuck pods.
  • delete, in which case the operator deletes the pod with the
    highest ordinal that is at an incorrect revision.
  • evict, in which case the operator evicts the pod with the
    highest ordinal that is at an incorrect revision.
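
As a rough illustration of the selection rule shared by delete and evict, here is a minimal Go sketch. It is not the operator's code: the `pod` struct and both function names are hypothetical, and it assumes that "incorrect revision" means the pod's `controller-revision-hash` label differs from the StatefulSet's `status.updateRevision`. The operator's real detection logic may apply further conditions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pod is a hypothetical, simplified stand-in for a Kubernetes Pod. Revision
// is assumed to hold the value of the "controller-revision-hash" label.
type pod struct {
	Name     string
	Revision string
}

// ordinal extracts the StatefulSet ordinal from a pod name of the form
// "<statefulset-name>-<ordinal>".
func ordinal(name string) (int, bool) {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return 0, false
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return 0, false
	}
	return n, true
}

// highestOrdinalAtWrongRevision returns the pod with the highest ordinal whose
// revision differs from updateRevision. The boolean is false when every pod
// is already at updateRevision (or no pod name carries an ordinal).
func highestOrdinalAtWrongRevision(pods []pod, updateRevision string) (pod, bool) {
	var best pod
	bestOrd := -1
	for _, p := range pods {
		if p.Revision == updateRevision {
			continue // already at the desired revision
		}
		if n, ok := ordinal(p.Name); ok && n > bestOrd {
			best, bestOrd = p, n
		}
	}
	return best, bestOrd >= 0
}

func main() {
	pods := []pod{
		{Name: "prometheus-k8s-0", Revision: "rev-2"},
		{Name: "prometheus-k8s-1", Revision: "rev-1"},
		{Name: "prometheus-k8s-2", Revision: "rev-1"},
	}
	if p, ok := highestOrdinalAtWrongRevision(pods, "rev-2"); ok {
		fmt.Println("repair candidate:", p.Name)
	}
}
```

Acting on one pod at a time, starting from the highest ordinal, matches the order in which a StatefulSet rolling update proceeds.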

The policy applies globally to all workload resources which are based
on StatefulSets: Alertmanager, ThanosRuler, Prometheus and PrometheusAgent.

For context, setting .spec.podManagementPolicy: Parallel at the
StatefulSet level used to be a workaround that avoided the manual
intervention, but it no longer works as of Kubernetes 1.35.

To avoid surprises, the default policy is none, but users on Kubernetes
v1.35 are encouraged to use delete or evict (the latter is "safer"
because it takes the pod disruption budget into account).
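
For illustration only, the three-value flag with a none default could be modelled with the standard library's `flag.Value` interface, as sketched below. The type name `repairPolicy`, the constants, and the `parsePolicy` helper are hypothetical; only the flag name and its three values come from this description, and the operator's actual flag wiring may look different.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// repairPolicy is a hypothetical type; the flag name and the three accepted
// values are the only parts taken from the pull request description.
type repairPolicy string

const (
	repairNone   repairPolicy = "none"
	repairDelete repairPolicy = "delete"
	repairEvict  repairPolicy = "evict"
)

// String and Set make *repairPolicy satisfy the standard flag.Value interface.
func (p *repairPolicy) String() string {
	if p == nil {
		return ""
	}
	return string(*p)
}

// Set rejects anything other than the three documented values.
func (p *repairPolicy) Set(v string) error {
	switch repairPolicy(v) {
	case repairNone, repairDelete, repairEvict:
		*p = repairPolicy(v)
		return nil
	}
	return fmt.Errorf("unsupported repair policy %q (want none, delete or evict)", v)
}

// parsePolicy parses args with a dedicated FlagSet so that it can be called
// repeatedly, for example from tests.
func parsePolicy(args []string) (repairPolicy, error) {
	policy := repairNone // default: only log a warning
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the sketch's output
	fs.Var(&policy, "repair-policy-for-statefulsets", "one of: none, delete, evict")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return policy, nil
}

func main() {
	policy, err := parsePolicy([]string{"--repair-policy-for-statefulsets=evict"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("repair policy:", policy)
}
```

Validating the value at parse time means a typo in the policy name fails at startup instead of silently falling back to none.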

[1] https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback

Closes #8205

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Verification

Please check the Prometheus-Operator testing guidelines for recommendations about automated tests.

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.


@simonpasquier simonpasquier force-pushed the unclog-sts branch 2 times, most recently from 3db8a1b to 5178620 Compare March 12, 2026 09:36
@simonpasquier simonpasquier changed the title Unclog sts feat: implement repair policy for StatefulSets Mar 12, 2026
@simonpasquier simonpasquier force-pushed the unclog-sts branch 10 times, most recently from 908d136 to c1e7b64 Compare March 13, 2026 15:27
@simonpasquier simonpasquier requested a review from slashpai March 13, 2026 19:33
@simonpasquier simonpasquier marked this pull request as ready for review March 13, 2026 19:34
@simonpasquier simonpasquier requested a review from a team as a code owner March 13, 2026 19:34
simonpasquier and others added 4 commits March 16, 2026 09:52
This commit adds a new CLI argument (`--repair-policy-for-statefulsets`)
to the operator binary. The argument defines the policy to apply when the
operator detects StatefulSet pods that are stuck at a bad revision and
require manual intervention to unblock the rollout [1].

It supports 3 values:
- `none` in which case the operator only logs a warning if it detects
  stuck pods.
- `delete` in which case the operator will delete the pod with the
  highest-ordinal which is at an incorrect revision.
- `evict` in which case the operator will evict the pod with the
  highest-ordinal which is at an incorrect revision.

The policy applies globally to all workload resources which are based
on StatefulSets: Alertmanager, ThanosRuler, Prometheus and PrometheusAgent.

For context, setting `.spec.podManagementPolicy: Parallel` at the
StatefulSet level used to be a workaround that avoided the manual
intervention, but it no longer works as of Kubernetes 1.35.

To avoid surprises, the default policy is `none` but users on Kubernetes
v1.35 are encouraged to use `delete` or `evict` (the latter is "safer"
because it will take into account the pod disruption budget).

[1] https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback

Closes prometheus-operator#8205

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Co-authored-by: Tomlin7 <billydevbusiness@gmail.com>
Contributor

@slashpai slashpai left a comment

just a couple of nits
lgtm

also, do we need to update the docs for the new flag?

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier simonpasquier enabled auto-merge March 17, 2026 09:55
@simonpasquier simonpasquier merged commit 230b0ab into prometheus-operator:main Mar 17, 2026
22 checks passed
alexlebens pushed a commit to alexlebens/infrastructure that referenced this pull request Mar 20, 2026
…r to v0.90.0 (#4885)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator) | minor | `v0.89.0` → `v0.90.0` |

---

> ⚠️ **Warning**
>
> Some dependencies could not be looked up. Check the [Dependency Dashboard](issues/2) for more information.

---

### Release Notes

<details>
<summary>prometheus-operator/prometheus-operator (prometheus-operator/prometheus-operator)</summary>

### [`v0.90.0`](https://github.com/prometheus-operator/prometheus-operator/releases/tag/v0.90.0): 0.90.0 / 2026-03-19

[Compare Source](prometheus-operator/prometheus-operator@v0.89.0...v0.90.0)

- \[CHANGE/BUGFIX] Validate that the remote-write URL scheme is either `http` or `https`. [#8455](prometheus-operator/prometheus-operator#8455)
- \[FEATURE] Add `--repair-policy-for-statefulsets` CLI argument to the operator. It defines how the operator manages StatefulSet's pods stuck at an incorrect revision. Users running Kubernetes v1.35+ are encouraged to enable this feature (see [troubleshooting guide](https://prometheus-operator.dev/docs/platform/troubleshooting/#statefulset-rollout-stuck-after-a-bad-update)). [#8443](prometheus-operator/prometheus-operator#8443)
- \[FEATURE] Add `schedulerName` support to the `Prometheus`, `PrometheusAgent`, `Alertmanager` and `ThanosRuler` CRDs. [#8451](prometheus-operator/prometheus-operator#8451)
- \[ENHANCEMENT] Add `--web.tls-curves` CLI argument to the operator and admission-webhook binaries. [#8385](prometheus-operator/prometheus-operator#8385)
- \[ENHANCEMENT] Support minimum TLS version for Thanos gRPC servers. [#8438](prometheus-operator/prometheus-operator#8438)
- \[ENHANCEMENT] Add version label to `ThanosRuler` pods. [#8441](prometheus-operator/prometheus-operator#8441)
- \[ENHANCEMENT] Add `messageText` support for Slack receiver in `AlertmanagerConfig` CRD. [#8374](prometheus-operator/prometheus-operator#8374)
- \[ENHANCEMENT] Add `messageText` support for Slack receiver in Alertmanager secret config. [#8375](prometheus-operator/prometheus-operator#8375)
- \[ENHANCEMENT] Add `forceImplicitTLS` support for SMTP email config in Alertmanager secret config. [#8384](prometheus-operator/prometheus-operator#8384) [#8404](prometheus-operator/prometheus-operator#8404)
- \[ENHANCEMENT] Add `forceImplicitTLS` support for SMTP email config in `AlertmanagerConfig` CRD. [#8386](prometheus-operator/prometheus-operator#8386)
- \[ENHANCEMENT] Add `forceImplicitTLS` support for SMTP global config in Alertmanager secret config. [#8405](prometheus-operator/prometheus-operator#8405)
- \[ENHANCEMENT] Add `forceImplicitTLS` support for SMTP global config in `Alertmanager` CRD. [#8406](prometheus-operator/prometheus-operator#8406)
- \[ENHANCEMENT] Add support for global Telegram bot token in `Alertmanager` CRD. [#8372](prometheus-operator/prometheus-operator#8372)
- \[ENHANCEMENT] Add `chatIDFile` support for Telegram receiver in Alertmanager secret config. [#8376](prometheus-operator/prometheus-operator#8376)
- \[ENHANCEMENT] Add `wechatAPISecretFile` support in Alertmanager global config. [#8377](prometheus-operator/prometheus-operator#8377)
- \[ENHANCEMENT] Add `authSecretFile` support for email config in Alertmanager secret config. [#8396](prometheus-operator/prometheus-operator#8396)
- \[ENHANCEMENT] Add nested field support for PagerDuty description in Alertmanager secret config. [#8402](prometheus-operator/prometheus-operator#8402)
- \[ENHANCEMENT] Add email threading support in Alertmanager secret config. [#8388](prometheus-operator/prometheus-operator#8388)
- \[ENHANCEMENT] Add field and label selectors for ConfigMap watches. [#8368](prometheus-operator/prometheus-operator#8368)
- \[ENHANCEMENT] Improve ScrapeConfig API consistency and validation. [#8422](prometheus-operator/prometheus-operator#8422)
- \[BUGFIX] Fix `ThanosRuler` config resource status not being updated on initial StatefulSet creation. [#8358](prometheus-operator/prometheus-operator#8358)
- \[BUGFIX] Preserve `LastTransitionTime` in Prometheus status conditions. [#8346](prometheus-operator/prometheus-operator#8346)
- \[BUGFIX] Make Mattermost `text` field optional in `AlertmanagerConfig` CRD. [#8363](prometheus-operator/prometheus-operator#8363)
- \[BUGFIX] Remove nil error wrapping in v1alpha1 duplicate receiver validation. [#8379](prometheus-operator/prometheus-operator#8379)
- \[BUGFIX] Aggregate `Available` condition across Prometheus shards. [#8434](prometheus-operator/prometheus-operator#8434)
- \[BUGFIX] Reconcile resources with inconsistent status. [#8397](prometheus-operator/prometheus-operator#8397)
- \[BUGFIX] Fix namespace lister/watcher compatibility with Kubernetes v1.35 client-go. [#8431](prometheus-operator/prometheus-operator#8431)
- \[BUGFIX] Fix missing OAuth2 field in IonosSDConfig generation. [#8433](prometheus-operator/prometheus-operator#8433)
- \[BUGFIX] Fix missing fields in AzureSDConfig. [#8444](prometheus-operator/prometheus-operator#8444)
- \[BUGFIX] Validate Microsoft Teams V2 URL in `AlertmanagerConfig` CRD. [#8227](prometheus-operator/prometheus-operator#8227)
- \[BUGFIX] Fix `labelmap` relabel action rejecting valid replacement values with template variables for Prometheus 2.x. [#8337](prometheus-operator/prometheus-operator#8337)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://gitea.alexlebens.dev/alexlebens/infrastructure/pulls/4885
Co-authored-by: Renovate Bot <renovate-bot@alexlebens.net>
Co-committed-by: Renovate Bot <renovate-bot@alexlebens.net>
Development

Successfully merging this pull request may close these issues.

Implement custom update strategy for statefulset to avoid stuck rollouts