Maximum number of failures or failure backoff policy for Jobs #30243
Description
A max number of failures or failure backoff policy for Jobs would be useful.
Imagine you have an ETL job in production that's failing due to some pathological input.
By the time you spot it, it's already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever -- and crashed the Kubernetes dashboard in the process.
Today, "restartPolicy" is not respected by Jobs because the goal is to achieve successful completion. (Strangely, "restartPolicy: Never" is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they've created. Deletes are rate limited, and in v1.3+ the verbose "you're being throttled" messages are hidden when you run the kubectl command to delete a job, so it just looks like it's taking forever! This is not a pleasant UX if you have a runaway job in prod, or if you're testing out a new job and it takes minutes to clean up after a broken test.
What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?
(cc @thockin , who suggested I file this feature request)
Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy, which restarts the failed containers in the same pods rather than creating new pods -- i.e. this prevents the explosion in the number of pods. And deadlines can help weed out failures after a certain amount of time.
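For anyone else hitting this, here is a rough sketch of what that workaround might look like in a Job manifest. The job name, container name, image, and the two-hour deadline are all made up for illustration; only `restartPolicy: OnFailure` and `activeDeadlineSeconds` are the pieces discussed above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: etl-job                       # hypothetical name
spec:
  # Illustrative value: the Job is terminated once it has been
  # active for this long, whether or not it ever succeeded.
  activeDeadlineSeconds: 7200
  template:
    spec:
      # Restart the failed container in the same pod instead of
      # having the Job create a fresh pod for every failure.
      restartPolicy: OnFailure
      containers:
      - name: etl                     # hypothetical container name
        image: example.com/etl:latest # hypothetical image
```

Note that the deadline has to sit above the job's normal runtime, so a container that fails within seconds would still be restarted many times before the deadline kicks in.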
However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I'd rather specify a maximum number of retries than a high deadline.