pkg/alertmanager: change podManagementPolicy to parallel to prevent statefulset reconciliation from hanging#2676
Conversation
|
Very good catch! We should probably, i.e. in the commit message, reference some context to make the rationale of this change more clear: |
|
I think this could even be added as source code comments in your PR as well as in https://github.com/coreos/prometheus-operator/blob/7c448d90c1bbf8740b5f524937209d4ec56958ca/pkg/prometheus/statefulset.go#L802 |
30301ad to
7e912ae
Compare
|
could you also add the same comment in the prometheus statefulset? |
|
but generally very lgtm, thanks for catching! |
…statefulset reconciliation from hanging When using default podManagementPolicy it is possible to create a situation where alertmanager pods objects won't be reconciled with a statefulset and thus preventing am from being deployed. One of such cases is when am was deployed and afterwards admin applied taints to all nodes causing pod eviction. Next tolerations were applied however due to OrderedReady policy one am pod was still left in Pending state preventing reconciliation. This is needed to provide a workaround for a bug in kubernetes detailed in kubernetes/kubernetes#60164. It is also one of the knows limitations of StetafulSets mentioned in docs https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#limitations
7e912ae to
ac7626c
Compare
|
Added comment to prometheus StatefulSet. |
|
I am not totally sure we want to have this in parallel. |
This is correct. We intentionally did not introduce the parallel pod management policy here because we want the pods to be rolled out sequentially so that shutting down members in a cluster can make sure to sync the cluster with the state they have and there being enough time that a new pod can sync the state from the cluster before all other pods are rolled. I think in general we can change our messaging around this and say "you must have a PV for alertmanager" and be done with it. 🤷♂️ What do you think? |
|
So we basically have two options how we can proceed and both are ok by me. Parallel pod management policyWe would need to notify that you always need PV for alertmanager. This in turn can affect how we are creating cluster as 3rd alertmanager won't be needed anymore for HA setup. OrderedReady pod management policyWe would need to notify users that in case of stuck alertmanager rollout they need to remove all am pods as this is just how StatefulSets are (not) working. Or maybe we could somehow add this logic into prometheus-operator? This way we could still have a setup where PVs are not needed. In either way we need to acknowledge that this can happen because StatefulSets work this way (maybe in next release notes?). Of course there is 3rd option: fix StatefulSets in k8s |
|
I say let's go for the parallel management policy. Feel free to merge whenever you're ready. lgtm 👍 |
|
OK, let's proceed with "Parallel" management policy. |
When using default podManagementPolicy (
OrderedReady) it is possible to create asituation where alertmanager pod objects won't be reconciled with a
statefulset and thus preventing alertmanager from being deployed.
One of such cases is when alertmanager was deployed and afterwards admin
applied taints to all nodes causing pod eviction. Next tolerations were
applied however due to OrderedReady policy one alertmanager pod was still left in
Pending state preventing reconciliation.
While testing manually with
podManagementPolicy: Parallelalertmanager pods were scheduled correctly.We are already using
podManagementPolicy: Parallelfor prometheus statefulset (https://github.com/coreos/prometheus-operator/blob/master/pkg/prometheus/statefulset.go#L805), but this wasn't propagated to alertmanager for reasons unknown to me./cc @brancz @s-urbaniak