Make termination grace seconds configurable#4681
yeya24 wants to merge 4 commits into prometheus-operator:main from
Conversation
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
pkg/apis/monitoring/v1/types.go
Outdated
// Set this value longer than the expected cleanup time for your process.
// Defaults to 600 seconds.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
Would it be better to use uint to avoid negative values?
Also add a default value using a validation marker so we can remove the conditional from the statefulset code:
// +kubebuilder:default:="600"
Done. Thanks for the review.
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
slashpai left a comment
Can you also add tests?
pkg/apis/monitoring/v1/types.go
Outdated
// Defaults to 600 seconds.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
// +kubebuilder:default:="600"
Since it's an int type we don't need the quotes:
// +kubebuilder:default:=600
var minReadySeconds int32
var (
	minReadySeconds int32
	terminationGracePeriod int64
No, it is cast from uint64. The pod spec needs int64, not uint64.
I have added tests. Can you please help take a look? Thanks!

👋 @yeya24 can you share more details about your use case?

As I mentioned in thanos-io/thanos#5255, the issue is on our CNI side. It enables graceful termination on the IPVS side, so in the Prometheus Operator case the connection remains for 10 minutes (the termination grace period is hardcoded to 600s). If Prometheus itself goes down somehow, the Thanos sidecar is still reachable, causing a lot of partial query errors from our Thanos Query for 10 minutes.

@yeya24 sorry for the late follow-up, but I'm not sure I understand the scenario exactly. You want to configure a shorter termination grace period in case Prometheus is stuck?

Yes. If it is stuck because the backend storage like Ceph is not responsive, it doesn't make sense to wait 10 minutes to ensure data is written successfully, because the storage is down.
If the prometheus itself is down somehow, then thanos sidecar is still available
That sounds like a Thanos sidecar issue, not a prometheus-operator or prometheus one.
I would be against a configurable terminationGracePeriodSeconds. In the majority of cases tweaking it can lead to unexpected data loss. I also understand the case that @yeya24 is making about the CNI and CSI being unresponsive, and that in those cases fast termination is beneficial. However, those look like edge cases that happen in particular critical scenarios, which are most likely handled directly by users via kubectl. And if that is the case, then `kubectl delete pod <prometheus> --grace-period=0` is likely a better way.
If data loss is fine and users want a termination time shorter than 10 minutes, I think this use case is still valid, since 10 minutes might not fit all users. In our case we want a smaller duration like 5 minutes. What about other use cases where 10 minutes is too short and users want a longer duration like 1 hour? My point is that an operator should provide some way to allow users to configure k8s-native fields like `terminationGracePeriodSeconds`.

Hi, team; it's been a while since the last messages in this PR. I'm interested in the feature, but my use case is different: we want to increase the grace period beyond the default 10 minutes because, in some cases, that is not enough for a graceful shutdown if you've enabled the feature flag for snapshotting in-memory chunks. See details in prometheus/prometheus#7229. Our setup required more than 10 minutes to complete the chunk snapshots successfully. /cc @yeya24 I'm happy to collaborate to get this PR in good shape again.
@rnaveiras more than 10 minutes for the chunks snapshot seems a bit extreme. Do you have an explanation for why it takes so long? Have you tried reporting the issue to prometheus/prometheus? That said, we have had many requests in the past to customize the termination grace period, and though the justification could sometimes be challenged, we also agreed in #4691 that we shouldn't block such customization if there was high demand from the community and no alternative existed.
It makes the pod.spec TerminationGracePeriodSeconds configurable via the CRDs for prometheus and prometheusagent.

Fixes prometheus-operator#3433
Closes prometheus-operator#4681

Co-authored-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Raul Naveiras <me@raulnaveiras.com>
This pull request is being closed because it had no activity in the last 180 days. This is not a signal from the maintainers that the PR has no value. We appreciate the time and effort that you put into this work. If you're willing to re-open it, the maintainers will do their best to review it.
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Description
We have a use case for configuring the termination grace period of the Prometheus statefulset. Currently the value is hardcoded to 10 minutes (600 seconds).
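With the change in place, the value would be settable directly on the custom resource, roughly like this (a sketch; the metadata and surrounding spec fields are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  replicas: 2
  # Override the previously hardcoded 600s grace period.
  terminationGracePeriodSeconds: 300
```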
Type of change
What type of changes does your code introduce to the Prometheus operator? Put an `x` in the box that apply.

- [ ] `CHANGE` (fix or feature that would cause existing functionality to not work as expected)
- [ ] `FEATURE` (non-breaking change which adds functionality)
- [ ] `BUGFIX` (non-breaking change which fixes an issue)
- [ ] `ENHANCEMENT` (non-breaking change which improves existing functionality)
- [ ] `NONE` (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry
Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.