Component(s)
Prometheus, PrometheusAgent, AlertManager, ThanosRuler
What is missing? Please describe.
It's a long story which boils down to https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback and kubernetes/kubernetes#67250.
In short, if a statefulset rollout gets stuck (e.g. a pod crashlooping because of an unsupported CLI argument) and the user updates the resource definition to resolve the issue (e.g. removes the invalid argument from `.spec.additionalArgs`), the statefulset controller won't apply the latest changes. Instead it waits for the stuck pod to be deleted before proceeding. In Kubernetes < 1.35, the "workaround" was to set `.spec.podManagementPolicy: Parallel`, but when the `MaxUnavailableStatefulSet` feature is enabled (which is the default starting with Kubernetes 1.35), the statefulset controller no longer deletes the stuck pod(s) after a configuration rollback, and it doesn't seem that upstream is willing to change this.
To alleviate the issue, we could extend the statefulset's update strategy and let the Prometheus operator emulate the end-user's manual fix: when a pod isn't ready and its spec doesn't match the statefulset's update revision, the operator would delete (or rather evict) the pod to unblock the rollout.
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate: { ... }
    # Defines how to deal with stuck rollouts.
    # Default is None.
    repairPolicy: EvictNotReadyPods|DeleteNotReadyPods|None
```
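To make the proposal concrete, here is a minimal sketch of the selection logic the operator could run on each reconciliation. All names (`RepairPolicy`, `podsToRepair`, the `Pod` struct) are illustrative assumptions, not the operator's actual API; in practice the pod's revision comes from its `controller-revision-hash` label and the target revision from the statefulset's `.status.updateRevision`.

```go
package main

import "fmt"

// RepairPolicy mirrors the proposed spec field (names are hypothetical).
type RepairPolicy string

const (
	RepairPolicyNone   RepairPolicy = "None"
	RepairPolicyEvict  RepairPolicy = "EvictNotReadyPods"
	RepairPolicyDelete RepairPolicy = "DeleteNotReadyPods"
)

// Pod captures the two facts the operator needs: which revision the pod
// was created from and whether it is ready.
type Pod struct {
	Name     string
	Revision string
	Ready    bool
}

// podsToRepair returns the pods blocking a rollout: not ready and running
// a revision that differs from the statefulset's update revision.
func podsToRepair(policy RepairPolicy, updateRevision string, pods []Pod) []Pod {
	if policy == RepairPolicyNone {
		return nil
	}
	var stuck []Pod
	for _, p := range pods {
		if !p.Ready && p.Revision != updateRevision {
			stuck = append(stuck, p)
		}
	}
	return stuck
}

func main() {
	pods := []Pod{
		{Name: "prometheus-main-0", Revision: "rev-1", Ready: false}, // crashlooping on the bad revision
		{Name: "prometheus-main-1", Revision: "rev-2", Ready: true},
	}
	for _, p := range podsToRepair(RepairPolicyEvict, "rev-2", pods) {
		fmt.Printf("evicting %s (revision %s)\n", p.Name, p.Revision)
	}
}
```

With `EvictNotReadyPods`, the operator would go through the Eviction API so that PodDisruptionBudgets are honored; `DeleteNotReadyPods` would delete directly, trading safety for guaranteed progress.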
Ideally the resource's conditions should surface that some of the statefulset's pods are stuck on a bad revision (probably via the existing `Available` condition with a custom `Reason`), in case someone prefers to delete the pods manually in such situations.
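As a sketch only (the reason and message below are illustrative, not an agreed API), the surfaced status could look like:

```yaml
status:
  conditions:
  - type: Available
    status: Degraded
    reason: StuckOnOutdatedRevision  # hypothetical reason name
    message: "pod prometheus-main-0 is not ready and runs an outdated revision"
```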
Describe alternatives you've considered.
The alternative is to delegate the resolution to the end-user, which isn't great UX.
Environment Information.
Environment
Kubernetes Version: 1.35
Prometheus-Operator Version: N/A