Skip to content

Maximum number of failures or failure backoff policy for Jobs #30243

@maximz

Description

@maximz

A max number of failures or failure backoff policy for Jobs would be useful.

Imagine you have an ETL job in production that's failing due to some pathological input.
By the time you spot it, it's already been rescheduled thousands of times. As an example, I had 20 broken jobs that kept getting rescheduled; killing them took forever -- and crashed the Kubernetes dashboard in the process.

Today, "restartPolicy" is not respected by Jobs because the goal is to achieve successful completion. (Strangely, "restartPolicy: Never" is still valid YAML.) This means failed jobs keep getting rescheduled. When you go to delete them, you have to delete all the pods they've been scheduled on. Deletes are rate limited, and in v1.3+, the verbose "you're being throttled" messages are hidden from you when you run the kubectl command to delete a job. So it just looks like it's taking forever! This is not a pleasant UX if you have a runaway job in prod or if you're testing out a new job and it takes minutes to clean up after a broken test.

What are your thoughts on specifying a maximum number of failures/retries or adding a failure backoff restart policy?

(cc @thockin , who suggested I file this feature request)


Ninja edit: per this SO thread, the throttling issue is avoidable by using the OnFailure restart policy to keep rescheduling to the same pods, rather than to new pods -- i.e. this prevents the explosion of the number of pods. And deadlines can help weed out failures after a certain amount of time.

However, suppose my ETL job takes an hour to run properly but may fail within seconds if the input data is bad. I'd rather specify a maximum number of retries than a high deadline.

Metadata

Metadata

Assignees

Labels

area/batchsig/appsCategorizes an issue or PR as relevant to SIG Apps.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions