
sql,jobs: automatically fail stats jobs running for too long #118983

Closed
rytaft wants to merge 1 commit into cockroachdb:master from rytaft:cancel-stats

Conversation

@rytaft
Collaborator

@rytaft rytaft commented Feb 8, 2024

Fixes #118584

Release note (sql change): Added a new cluster setting,
sql.stats.job_replaceable_after_duration, which indicates the duration
after which a stats job (either automatic or manual from CREATE STATISTICS /
ANALYZE) can be automatically canceled and replaced by
another stats job. Since only one automatic stats job is allowed to
run at a time in a given cluster, this setting prevents a stuck job from
blocking all other stats jobs indefinitely.

@rytaft rytaft requested review from michae2 and yuzefovich February 8, 2024 21:58
@rytaft rytaft requested review from a team as code owners February 8, 2024 21:58
@rytaft rytaft requested review from dt and removed request for a team February 8, 2024 21:58
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment


Nice! I have a couple of comments, and I think you're missing ./dev gen docs.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)


pkg/sql/create_stats.go line 79 at r1 (raw file):

	"sql.stats.max_job_duration",
	"maximum duration of a stats job before it automatically fails",
	time.Hour*24*7, // one week

We should add a validation function like NonNegativeDuration. Then a 0 value can be treated as infinity, i.e. never cancel the long-running stats job.


pkg/sql/stats/create_stats_job_test.go line 344 at r1 (raw file):

	ctx := context.Background()
	const nodes = 1
	tc := testcluster.StartTestCluster(t, nodes, params)

nit: do we need the full test cluster with 1 node? I.e. can we use StartServer?

@dt
Contributor

dt commented Feb 8, 2024

Would it work equally well to make the jobs fail themselves rather than look for other jobs that have run too long?

@dt
Contributor

dt commented Feb 8, 2024

concretely I’m wondering if we could just put a context deadline at the top of the resumer (or fire off a little goroutine that polls the setting and cancels a shared ctx if we want updates mid-job to be respected) to make this change fully contained within the stats resumer?

Member

@yuzefovich yuzefovich left a comment


The benefit of the current approach is that if there are no attempts to create a new stats job, then the existing long-running one will keep on running, and - assuming it's not stuck but is making progress - this seems like a good thing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)

@dt
Contributor

dt commented Feb 8, 2024

That isn’t quite the behavior described in the PR text or setting name: right now it is more of “replaceable_after_duration” than “max_duration”.

I guess I’m wondering how much this behavior difference matters? If the job actually needs X hours to complete, then getting lucky and happening to not get aborted by a new one doesn’t seem like a reliable strategy to count on, compared to adjusting the duration allowance to reliably give it that time?

I feel like it is generally easier to operate with a predictable behavior of consistently running in <x or failing if it exceeds x, than one that sometimes fails at x and sometimes succeeds.

Collaborator

@michae2 michae2 left a comment


I have no strong feelings on the current approach vs @dt's suggestion.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @rytaft)


pkg/sql/create_stats.go line 752 at r1 (raw file):

			canceled, innerErr := checkJobRunningForTooLong(ctx, txn, jobID, created, p)
			if innerErr != nil {
				log.Warningf(ctx, "failed to fail job running for too long: %v (jobID: %v, created: %v)",

nit: Instead of "failed to fail" maybe something like "could not mark stats collection job running for too long as failed"?


pkg/sql/create_stats.go line 782 at r1 (raw file):

	// The job has been running for too long, so mark it as failed.
	err = p.ExecCfg().JobRegistry.Failed(ctx, txn, jobID, errors.Newf("stats job has been running for too long"))
	return err == nil, err

Maybe we should return true even if we could not mark the old job as failed. Seems like the old job is stuck for some reason in that case, so we might as well go ahead and resume stats collections.

Contributor


I'm not sure this method should actually be exported at all; this is moving the job directly into the failed state even while it is still running, whereas callers should be requesting cancellation, which would move it to the cancel-requested state; then, when the job notices the cancellation request and actually exits Resume, it would move into cancelling, and when onFailOrCancel completes it would move to cancelled. Essentially the job system relies on cooperative scheduling, where the caller requests a state and the job moves to it, rather than the caller moving the job. Moving the job directly from running to failed may well violate some job system assumptions, so I'm a little wary of this change. We also just had an issue due to stats jobs directly messing with the jobs system in bespoke ways (partially deleting aborted jobs), so I'm also a little paranoid about this today.

(this is why I was suggesting a simpler approach of simply having the resumer limit its own execution time, rather than trying to reach into one job from another and change its state).

@rytaft rytaft force-pushed the cancel-stats branch 4 times, most recently from 004a0c1 to e552278 February 9, 2024 00:25
@rytaft rytaft requested review from dt, michae2 and yuzefovich February 9, 2024 00:27
Collaborator Author

@rytaft rytaft left a comment


TFTRs!

right now it is more of “replaceable_after_duration” than “max_duration”.

Good point, I changed the name of the setting and the description. Also responded to your comment below with more discussion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @yuzefovich)


pkg/sql/create_stats.go line 79 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

We should add a validation function like NonNegativeDuration. Then a 0 value can be treated as infinity, i.e. never cancel the long-running stats job.

Done.


pkg/sql/create_stats.go line 752 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Instead of "failed to fail" maybe something like "could not mark stats collection job running for too long as failed"?

Done.


pkg/sql/create_stats.go line 781 at r1 (raw file):

simpler approach of simply having the resumer limit its own execution time

But would that have avoided the problem we saw recently where the stats job never even started? It was stuck in the created state.

What I like about the current approach is that there is no additional overhead (e.g., no extra goroutine), and the job is reliably canceled when it needs to be (so another job can run). Is there something I should call instead of this Failed method?

If you think this is very dangerous I can change to use your suggested approach with the goroutine, but I think there are some benefits to this approach.


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Maybe we should return true even if we could not mark the old job as failed. Seems like the old job is stuck for some reason in that case, so we might as well go ahead and resume stats collections.

Done.


pkg/sql/stats/create_stats_job_test.go line 344 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: do we need the full test cluster with 1 node? I.e. can we use StartServer?

Done.

Member

@yuzefovich yuzefovich left a comment


nit: update the PR description too.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @michae2)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Done.

Wouldn't this mean that we can have multiple concurrent stats jobs running (even if one of them is stuck)? Would that violate assumptions somewhere?

@rytaft rytaft requested a review from yuzefovich February 9, 2024 02:22
Collaborator Author

@rytaft rytaft left a comment


nit: update the PR description too.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Wouldn't this mean that we can have multiple concurrent stats jobs running (even if one of them is stuck)? Would that violate assumptions somewhere?

You're right, I guess that's why we did it this way in the first place. Users can see if there's an error in the logs and manually cancel the job if need be. I put it back to how it was.

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You're right, I guess that's why we did it this way in the first place. Users can see if there's an error in the logs and manually cancel the job if need be. I put it back to how it was.

This thread worries me a little: if it is important that the other job actually not be running then it doesn't actually matter what Failed() returned: the other job can happily keep executing for an unbounded amount of time after Failed() returns, since all Failed is doing is changing the job row; it is up to the job to notice that it has been marked failed and stop executing, so you can't really use Failed returning or not to determine anything.

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

This thread worries me a little: if it is important that the other job actually not be running then it doesn't actually matter what Failed() returned: the other job can happily keep executing for an unbounded amount of time after Failed() returns, since all Failed is doing is changing the job row; it is up to the job to notice that it has been marked failed and stop executing, so you can't really use Failed returning or not to determine anything.

Essentially either it was safe to adopt the optimization @michae2 suggested -- and just be okay with >1 stats jobs executing at once -- or it isn't. If it isn't, then we need to switch up the approach in this PR to either a) a voluntary timeout in the first job, so it just gets out of the way of subsequent jobs on its own, or b) the second job needs to do the cancel-request / wait-for-the-polling-loop-to-notice / cancellation-hooks-run / cancelled-state-reached sequence before continuing. If we don't do one of these two, then we're certainly open to >1 stats jobs executing at once, but the fact that the suggestion up-thread was walked back seems to imply we aren't?

@rytaft
Collaborator Author

rytaft commented Feb 12, 2024

pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Essentially either it was safe to adopt the optimization @michae2 suggested -- and just be okay with >1 stats jobs executing at once -- or it isn't. If it isn't, then we need to switch up the approach in this PR to either a) a voluntary timeout in the first job, so it just gets out of the way of subsequent jobs on its own, or b) the second job needs to do the cancel-request / wait-for-the-polling-loop-to-notice / cancellation-hooks-run / cancelled-state-reached sequence before continuing. If we don't do one of these two, then we're certainly open to >1 stats jobs executing at once, but the fact that the suggestion up-thread was walked back seems to imply we aren't?

Good point, in case the job is actually executing (and not just in the created phase), we'd like to cancel it. I think we should do your suggested option (b) to fit with the other goals of this PR. Do you know off-hand of a place we're already doing something similar that I can emulate? If not I can poke around...

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rytaft, and @yuzefovich)


-- commits line 7 at r4:
Bikeshedding: could we use the word "timeout" in the name of this setting?

@rytaft
Collaborator Author

rytaft commented Feb 12, 2024

-- commits line 7 at r4:

Previously, michae2 (Michael Erickson) wrote…

Bikeshedding: could we use the word "timeout" in the name of this setting?

Sure, what would you change the whole name to?

@dt dt requested a review from michae2 February 13, 2024 14:59
Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 781 at r1 (raw file):

and the job reliably is canceled

I'm not actually sure there is any difference in the reliability? Without knowing the exact details of the "stuck in the created state" failure (and AFAIK we don't have stacks/debug.zip from when it was stuck that way to dig more), I'm not actually sure if a cancellation request would have unstuck it, or if we'd just end up blocked waiting for the cancellation request that was in turn waiting on the stuck start.


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Good point, in case the job is actually executing (and not just in the created phase), we'd like to cancel it. I think we should do your suggested option (b) to fit with the other goals of this PR. Do you know off-hand of a place we're already doing something similar that I can emulate? If not I can poke around...

I'd probably just use the SQL syntax, via an internal executor, and run CANCEL JOB $1, which will request cancellation, then SHOW JOB WHEN COMPLETE $1, which will block, polling the status with backoff until it reaches a terminal state (likely cancelled), rather than hitting the internal registry methods.

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'd probably just use the SQL syntax, via an internal executor, and run CANCEL JOB $1, which will request cancellation, then SHOW JOB WHEN COMPLETE $1, which will block, polling the status with backoff until it reaches a terminal state (likely cancelled), rather than hitting the internal registry methods.

To add some color, note that in https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1707786198918579 we discuss a case where CANCEL JOB failed. Maybe if the cancel fails, we need to alert the customer to contact support?

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

To add some color, note that in https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1707786198918579 we discuss a case where CANCEL JOB failed. Maybe if the cancel fails, we need to alert the customer to contact support?

Right, that is sorta why above I'm wondering if this more complex approach, of having one job try to cancel another and blocking until it does, actually is any more reliable than just a simple context.WithDeadline at the top of the resumer?

@rytaft
Collaborator Author

rytaft commented Feb 13, 2024

pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Right, that is sorta why above I'm wondering if this more complex approach, of having one job try to cancel another and blocking until it does, actually is any more reliable than just a simple context.WithDeadline at the top of the resumer?

With this approach at least we can log a message before attempting to cancel, and possibly wait for a pre-defined amount of time before doing something else (a louder log message?). Not sure how we'd detect such a problem using the resumer approach.

@rafiss rafiss removed the request for review from a team April 17, 2024 14:01
@rytaft
Collaborator Author

rytaft commented May 4, 2024

I think this PR is not a clear improvement, and I won't be able to work on it any time soon, so I'm going to close it.



Development

Successfully merging this pull request may close these issues.

sql: automatically cancel stats jobs that are running for too long

5 participants