jobs: perform exponential backoff to reduce impact of jobs that cause panics

**Is your feature request related to a problem? Please describe.**
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: https://github.com/cockroachdb/cockroach/issues/44252.

The impact can be very high. See this graph of the SQL prober error rate:

![image](https://user-images.githubusercontent.com/1443719/73546451-415a7e00-440b-11ea-8027-88d2d8e917e6.png)

50-100% error rate for 1hr!

The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (this is why it takes around 1hr to mitigate).

How can we reduce impact / make it easier to mitigate this issue?

1. If a job fails, the job system could do an exponential backoff.
2. If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it.
3. If an SRE passes a command line flag to CRDB, the job system could decline to pick up any jobs at all.
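To make option 2 a bit more concrete, here is a minimal sketch of a "job of death" counter. Everything in it is hypothetical: `jobAttempts`, `record`, and the threshold `maxNodeDeaths` are names invented for illustration and are not part of the CRDB jobs package. It also assumes the job system can tell that an attempt ended because the node died, and that the counter is stored somewhere that survives the crash.

```go
package main

import "fmt"

// maxNodeDeaths is a hypothetical threshold: after this many consecutive
// attempts that ended with the node dying, the job is no longer retried.
const maxNodeDeaths = 3

// jobAttempts is a hypothetical per-job record. For this to work in practice
// it would have to be persisted, since the node running the job crashes.
type jobAttempts struct {
	nodeDeaths int // consecutive attempts that ended with a dead node
}

// record notes the outcome of one attempt and reports whether the job
// should be retried. An attempt that did not kill the node resets the count.
func (a *jobAttempts) record(nodeDied bool) (retry bool) {
	if !nodeDied {
		a.nodeDeaths = 0
		return true
	}
	a.nodeDeaths++
	return a.nodeDeaths < maxNodeDeaths
}

func main() {
	a := &jobAttempts{}
	for i := 1; i <= maxNodeDeaths; i++ {
		fmt.Printf("crash %d: retry=%v\n", i, a.record(true))
	}
}
```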

This issue tracks option 1 only.

I'm suggesting concrete solutions to get a conversation started, but I am more interested in reducing the very high impact than in any particular fix!

**Describe the solution you'd like**
If a job fails, the job system could do an exponential backoff. This would reduce the impact of a job that causes panics. The amount of time between panics would increase over time. This would also make it easier for an operator to cancel the job.

I don't know whether the job system is ALREADY doing this. If so, my bad! I do see the cluster setting `jobs.registry.leniency`. The description for this cluster setting reads "the amount of time to defer any attempts to reschedule a job". That doesn't sound like an exponential backoff.

On the CC side, we should set this cluster setting so as to reduce impact of jobs that cause panics, IMHO.

**Describe alternatives you've considered**
See 1, 2, and 3 from the above list.

@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy 

Epic: CRDB-7912

