Make termination grace seconds configurable#4681
yeya24 wants to merge 4 commits into prometheus-operator:main from
Conversation
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
pkg/apis/monitoring/v1/types.go
Outdated
// Set this value longer than the expected cleanup time for your process.
// Defaults to 600 seconds.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
Would it be better to use uint to avoid negative values?
Also add a default value using a validation marker so we can remove the conditional from the statefulset code:
// +kubebuilder:default:="600"
Done. Thanks for the review.
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
slashpai left a comment
Can you also add tests?
pkg/apis/monitoring/v1/types.go
Outdated
// Defaults to 600 seconds.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty"`
// +kubebuilder:default:="600"
Since it's an int type we don't need the quotes:
// +kubebuilder:default:=600
var minReadySeconds int32
var (
	minReadySeconds int32
	terminationGracePeriod int64
No, it is cast from uint64. The pod spec needs int64, not uint64.
I have added tests. Can you please help take a look? Thanks!

👋 @yeya24 can you share more details about your use case?

As I mentioned in thanos-io/thanos#5255, the issue is on our CNI side. It enables graceful termination on the IPVS side, so in the Prometheus Operator case the connection remains for 10 minutes (the termination grace period is hardcoded to 600s). If Prometheus itself goes down somehow, the Thanos sidecar is still reachable, causing a lot of partial query errors from our Thanos Query for 10 minutes.

@yeya24 sorry for the late follow-up, but I'm not sure I understand the scenario exactly. You want to configure a shorter termination grace period in case Prometheus is stuck?

Yes. If it is stuck because the backend storage like Ceph is not responsive, it doesn't make sense to wait 10 minutes to ensure data is written successfully, because the storage is down.
If the prometheus itself is down somehow, then thanos sidecar is still available
That sounds like a Thanos sidecar issue, not a prometheus-operator or prometheus one.
I would be against a configurable terminationGracePeriodSeconds. In the majority of cases tweaking it can lead to unexpected data loss. I also understand the case that @yeya24 is making about the CNI and CSI being unresponsive, and that in those cases fast termination is beneficial. However, those look like edge cases that happen in particular critical scenarios, which are most likely handled directly by users via kubectl. And if that is the case, then `kubectl delete pod <prometheus> --grace-period=0` is likely a better way.
If data loss is fine and users want a termination time shorter than 10 minutes, I think this use case is still valid, since 10 minutes might not fit all users. In our case we want a smaller duration like 5 minutes. What about other use cases where 10 minutes is too short and users want a longer duration like 1 hour? My point is that an operator should provide some way to allow users to configure k8s-native fields like `terminationGracePeriodSeconds`.

Hi, team; it's been a while since the last messages in this PR. I'm interested in the feature, but my use case is different: we want to increase the grace period beyond the default 10 minutes because, in some cases, that is not enough for a graceful shutdown if you've enabled the feature flag for snapshotting in-memory chunks. See details in prometheus/prometheus#7229. Our setup required more than 10 minutes to complete the chunk snapshots successfully. /cc @yeya24 I'm happy to collaborate to get this PR in good shape again.
@rnaveiras more than 10 minutes for the chunks snapshot seems a bit extreme. Do you have an explanation for why it takes so long? Have you tried reporting the issue to prometheus/prometheus? That said, we have had many requests in the past to customize the termination grace period, and though the justification could sometimes be challenged, we also agreed in #4691 that we shouldn't block such customization if there was high demand from the community and no alternative existed.
It makes the pod.spec TerminationGracePeriodSeconds configurable via the CRDs for prometheus and prometheusagent.

Fixes prometheus-operator#3433
Closes prometheus-operator#4681

Co-authored-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Raul Naveiras <me@raulnaveiras.com>
This pull request is being closed because it had no activity in the last 180 days. This is not a signal from the maintainers that the PR has no value. We appreciate the time and effort that you put into this work. If you're willing to re-open it, the maintainers will do their best to review it.
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Description
We have a use case for configuring the termination grace period of the Prometheus statefulset. Currently the value is hardcoded to 10 minutes (600 seconds).
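With the change in place, the value would be settable directly on the custom resource, roughly like this (a sketch; the metadata and surrounding spec fields are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  replicas: 2
  # Override the previously hardcoded 600s grace period.
  terminationGracePeriodSeconds: 300
```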
Type of change
What type of changes does your code introduce to the Prometheus operator? Put an `x` in the box that apply.

- [ ] `CHANGE` (fix or feature that would cause existing functionality to not work as expected)
- [ ] `FEATURE` (non-breaking change which adds functionality)
- [ ] `BUGFIX` (non-breaking change which fixes an issue)
- [ ] `ENHANCEMENT` (non-breaking change which improves existing functionality)
- [ ] `NONE` (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Changelog entry
Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.