Implement custom update strategy for statefulset to avoid stuck rollouts #8205

### Component(s)

Prometheus, PrometheusAgent, AlertManager, ThanosRuler

### What is missing? Please describe.

It's a long story which boils down to https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback and https://github.com/kubernetes/kubernetes/issues/67250.

In short, if a statefulset rollout gets stuck (e.g. a pod crashlooping because of an unsupported CLI argument) and the user updates the resource definition to resolve the issue (e.g. by removing the invalid argument from `.spec.additionalArgs`), the statefulset controller won't apply the latest changes. Instead, it waits for the pod to be deleted before proceeding. In Kubernetes < 1.35, the "[workaround](https://github.com/prometheus-operator/prometheus-operator/pull/2676)" was to set `.spec.podManagementPolicy: Parallel`. However, when the `MaxUnavailableStatefulSet` feature is enabled (which is the default starting with Kubernetes 1.35), the statefulset controller no longer deletes the stuck pod(s) after a configuration rollback, and upstream doesn't seem willing to change this.

To alleviate the issue, we can extend the statefulset's update strategy and let the Prometheus operator emulate what the end-user would otherwise do manually: when a pod isn't ready and its spec definition doesn't match the statefulset's revision, the operator would delete (or rather evict) the pod to unblock the rollout.

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate: { ... }
    # Defines how to deal with stuck rollouts.
    # Default is None.
    repairPolicy: EvictNotReadyPods|DeleteNotReadyPods|None
```
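The decision the operator would make for each pod can be sketched in Go as follows. This is only an illustration of the proposal, not an existing operator API: the `RepairPolicy` values come from the YAML above, while `RepairAction`, `podState` and `repairActionFor` are hypothetical names. It assumes the pod's revision is read from its `controller-revision-hash` label and compared with the statefulset's `.status.updateRevision`.

```go
package main

// RepairPolicy mirrors the proposed `repairPolicy` field.
type RepairPolicy string

const (
	RepairPolicyNone               RepairPolicy = "None"
	RepairPolicyEvictNotReadyPods  RepairPolicy = "EvictNotReadyPods"
	RepairPolicyDeleteNotReadyPods RepairPolicy = "DeleteNotReadyPods"
)

// RepairAction is a hypothetical type describing what the operator would do
// with a given pod.
type RepairAction string

const (
	ActionNone   RepairAction = "none"
	ActionEvict  RepairAction = "evict"
	ActionDelete RepairAction = "delete"
)

// podState is a hypothetical, simplified view of a statefulset pod.
type podState struct {
	Name     string
	Ready    bool
	Revision string // assumed to come from the controller-revision-hash label
}

// repairActionFor returns the action for a single pod. A pod is only touched
// when it is not ready AND still runs a revision other than the one the
// statefulset is rolling out.
func repairActionFor(policy RepairPolicy, pod podState, updateRevision string) RepairAction {
	if pod.Ready || pod.Revision == updateRevision {
		return ActionNone
	}
	switch policy {
	case RepairPolicyEvictNotReadyPods:
		return ActionEvict
	case RepairPolicyDeleteNotReadyPods:
		return ActionDelete
	default:
		// Covers "None" and the unset field, matching the proposed default.
		return ActionNone
	}
}
```

With `EvictNotReadyPods`, a not-ready pod on an outdated revision yields `ActionEvict`. Ready pods and pods already on the update revision are left to the statefulset controller.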

Ideally, the resource's conditions should surface that some of the statefulset's pods are stuck on a bad revision (probably via the existing Available condition with a custom Reason), in case someone prefers to delete pods manually in such situations.
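As a rough sketch of how such a Reason and message could be derived, assuming the same "not ready and on an outdated revision" test: the Reason value `PodsStuckOnOldRevision`, the `podStatus` type and the `stuckRolloutReason` function are all hypothetical, since the issue doesn't name any of them.

```go
package main

import (
	"fmt"
	"strings"
)

// reasonPodsStuckOnOldRevision is a hypothetical Reason value; the proposal
// does not define the actual string.
const reasonPodsStuckOnOldRevision = "PodsStuckOnOldRevision"

// podStatus is a hypothetical, simplified view of a statefulset pod.
type podStatus struct {
	Name     string
	Ready    bool
	Revision string
}

// stuckRolloutReason returns a Reason and a human-readable message when at
// least one pod is not ready on a revision other than updateRevision.
// It returns empty strings when no pod is stuck.
func stuckRolloutReason(pods []podStatus, updateRevision string) (reason, message string) {
	var stuck []string
	for _, p := range pods {
		if !p.Ready && p.Revision != updateRevision {
			stuck = append(stuck, p.Name)
		}
	}
	if len(stuck) == 0 {
		return "", ""
	}
	return reasonPodsStuckOnOldRevision,
		fmt.Sprintf("%d pod(s) not ready on an outdated revision: %s", len(stuck), strings.Join(stuck, ", "))
}
```

The returned values would only populate the Reason and Message of the condition. Which Status the Available condition should take in that case is left open here, as it is in the proposal.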

### Describe alternatives you've considered.

The alternative is to delegate the resolution to the end-user, which isn't great UX.

### Environment Information.

## Environment
Kubernetes Version: 1.35
Prometheus-Operator Version: N/A

