
sql,jobs: automatically fail stats jobs running for too long #118983

Closed
rytaft wants to merge 1 commit into cockroachdb:master from rytaft:cancel-stats

Conversation

@rytaft
Collaborator

@rytaft rytaft commented Feb 8, 2024

Fixes #118584

Release note (sql change): Added a new cluster setting,
sql.stats.job_replaceable_after_duration, which indicates the duration
after which a stats job (either automatic or manual from CREATE STATISTICS /
ANALYZE) can be automatically canceled and replaced by
another stats job. Since only one automatic stats job is allowed to
run at a time in a given cluster, this setting prevents a stuck job from
blocking all other stats jobs indefinitely.

@rytaft rytaft requested review from michae2 and yuzefovich February 8, 2024 21:58
@rytaft rytaft requested review from a team as code owners February 8, 2024 21:58
@rytaft rytaft requested review from dt and removed request for a team February 8, 2024 21:58
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment


Nice! I have a couple of comments, and I think you're missing ./dev gen docs.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)


pkg/sql/create_stats.go line 79 at r1 (raw file):

	"sql.stats.max_job_duration",
	"maximum duration of a stats job before it automatically fails",
	time.Hour*24*7, // one week

We should add a validation function like NonNegativeDuration. Then a 0 value can be treated as infinity, i.e. never cancel the long-running stats job.


pkg/sql/stats/create_stats_job_test.go line 344 at r1 (raw file):

	ctx := context.Background()
	const nodes = 1
	tc := testcluster.StartTestCluster(t, nodes, params)

nit: do we need the full test cluster with 1 node? I.e. can we use StartServer?

@dt
Contributor

dt commented Feb 8, 2024

Would it work equally well to make the jobs fail themselves rather than look for other jobs that have run too long?

@dt
Contributor

dt commented Feb 8, 2024

concretely I’m wondering if we could just put a context deadline at the top of the resumer (or fire off a little goroutine that polls the setting and cancels a shared ctx if we want updates mid-job to be respected) to make this change fully contained within the stats resumer?

Member

@yuzefovich yuzefovich left a comment


The benefit of the current approach is that if there are no attempts to create a new stats job, then the existing long-running one will keep on running, and - assuming it's not stuck but is making progress - this seems like a good thing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)

@dt
Contributor

dt commented Feb 8, 2024

That isn’t quite the behavior described in the PR text or setting name: right now it is more of “replaceable_after_duration” than “max_duration”.

I guess I’m wondering how much this behavior difference matters? If the job actually needs X hours to complete, then getting lucky and happening to not get aborted by a new one doesn’t seem like a reliable strategy to count on, compared to adjusting the duration allowance to reliably give it that time?

I feel like it is generally easier to operate with a predictable behavior of consistently running in <x or failing if it exceeds x, than one that sometimes fails at x and sometimes succeeds.

Collaborator

@michae2 michae2 left a comment


I have no strong feelings on the current approach vs @dt's suggestion.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @rytaft)


pkg/sql/create_stats.go line 752 at r1 (raw file):

			canceled, innerErr := checkJobRunningForTooLong(ctx, txn, jobID, created, p)
			if innerErr != nil {
				log.Warningf(ctx, "failed to fail job running for too long: %v (jobID: %v, created: %v)",

nit: Instead of "failed to fail" maybe something like "could not mark stats collection job running for too long as failed"?


pkg/sql/create_stats.go line 782 at r1 (raw file):

	// The job has been running for too long, so mark it as failed.
	err = p.ExecCfg().JobRegistry.Failed(ctx, txn, jobID, errors.Newf("stats job has been running for too long"))
	return err == nil, err

Maybe we should return true even if we could not mark the old job as failed. Seems like the old job is stuck for some reason in that case, so we might as well go ahead and resume stats collections.

Contributor


I'm not sure this method should actually be exported at all; this is moving the job directly into the failed state even while it is still running, whereas callers should be requesting cancellation, which would move it to the cancel-requested state; then, when the job notices the cancellation request and actually exits Resume, it would move into cancelling, and when onFailOrCancel completes it would move to cancelled. Essentially the job system relies on cooperative scheduling, where the caller requests a state and the job moves to it, rather than the caller moving the job. Moving the job directly from running to failed may well violate some job system assumptions, so I'm a little wary of this change. We also just had an issue due to stats jobs directly messing with the jobs system in bespoke ways (partially deleting aborted jobs), so I'm also a little paranoid about this today.

(this is why I was suggesting a simpler approach of simply having the resumer limit its own execution time, rather than trying to reach into one job from another and change its state).

@rytaft rytaft force-pushed the cancel-stats branch 4 times, most recently from 004a0c1 to e552278 February 9, 2024 00:25
@rytaft rytaft requested review from dt, michae2 and yuzefovich February 9, 2024 00:27
Collaborator Author

@rytaft rytaft left a comment


TFTRs!

right now it is more of “replaceable_after_duration” than “max_duration”.

Good point, I changed the name of the setting and the description. Also responded to your comment below with more discussion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @yuzefovich)


pkg/sql/create_stats.go line 79 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

We should add a validation function like NonNegativeDuration. Then a 0 value can be treated as infinity, i.e. never cancel the long-running stats job.

Done.


pkg/sql/create_stats.go line 752 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Instead of "failed to fail" maybe something like "could not mark stats collection job running for too long as failed"?

Done.


pkg/sql/create_stats.go line 781 at r1 (raw file):

simpler approach of simply having the resumer limit its own execution time

But would that have avoided the problem we saw recently where the stats job never even started? It was stuck in the created state.

What I like about the current approach is that there is no additional overhead (e.g., no extra goroutine), and the job is reliably canceled when it needs to be (so another job can run). Is there something I should call instead of this Failed method?

If you think this is very dangerous I can change to use your suggested approach with the goroutine, but I think there are some benefits to this approach.


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Maybe we should return true even if we could not mark the old job as failed. Seems like the old job is stuck for some reason in that case, so we might as well go ahead and resume stats collections.

Done.


pkg/sql/stats/create_stats_job_test.go line 344 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: do we need the full test cluster with 1 node? I.e. can we use StartServer?

Done.

Member

@yuzefovich yuzefovich left a comment


nit: update the PR description too.

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @michae2)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Done.

Wouldn't this mean that we can have multiple concurrent stats jobs running (even if one of them is stuck)? Would that violate assumptions somewhere?

@rytaft rytaft requested a review from yuzefovich February 9, 2024 02:22
Collaborator Author

@rytaft rytaft left a comment


nit: update the PR description too.

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Wouldn't this mean that we can have multiple concurrent stats jobs running (even if one of them is stuck)? Would that violate assumptions somewhere?

You're right, I guess that's why we did it this way in the first place. Users can see if there's an error in the logs and manually cancel the job if need be. I put it back to how it was.

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

You're right, I guess that's why we did it this way in the first place. Users can see if there's an error in the logs and manually cancel the job if need be. I put it back to how it was.

This thread worries me a little: if it is important that the other job actually not be running then it doesn't actually matter what Failed() returned: the other job can happily keep executing for an unbounded amount of time after Failed() returns, since all Failed is doing is changing the job row; it is up to the job to notice that it has been marked failed and stop executing, so you can't really use Failed returning or not to determine anything.

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

This thread worries me a little: if it is important that the other job actually not be running then it doesn't actually matter what Failed() returned: the other job can happily keep executing for an unbounded amount of time after Failed() returns, since all Failed is doing is changing the job row; it is up to the job to notice that it has been marked failed and stop executing, so you can't really use Failed returning or not to determine anything.

Essentially either it was safe to adopt the optimization @michae2 suggested -- and just be okay with >1 stats jobs executing at once -- or it isn't. If it isn't, then we need to switch up the approach in this PR to either a) a voluntary timeout in the first job, so it just gets out of the way of subsequent jobs on its own, or b) the second job needs to do the cancel-request / wait-for-the-polling-loop-to-notice / cancellation-hooks-run / cancelled-state-reached sequence before continuing. If we don't do one of these two, then we're certainly open to >1 stats jobs executing at once, but the fact that the suggestion up-thread was walked back seems to imply we aren't?

@rytaft
Collaborator Author

rytaft commented Feb 12, 2024

pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Essentially either it was safe to adopt the optimization @michae2 suggested -- and just be okay with >1 stats jobs executing at once -- or it isn't. If it isn't, then we need to switch up the approach in this PR to either a) a voluntary timeout in the first job, so it just gets out of the way of subsequent jobs on its own, or b) the second job needs to do the cancel-request / wait-for-the-polling-loop-to-notice / cancellation-hooks-run / cancelled-state-reached sequence before continuing. If we don't do one of these two, then we're certainly open to >1 stats jobs executing at once, but the fact that the suggestion up-thread was walked back seems to imply we aren't?

Good point, in case the job is actually executing (and not just in the created phase), we'd like to cancel it. I think we should do your suggested option (b) to fit with the other goals of this PR. Do you know off-hand of a place we're already doing something similar that I can emulate? If not I can poke around...

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rytaft, and @yuzefovich)


-- commits line 7 at r4:
Bikeshedding: could we use the word "timeout" in the name of this setting?

@rytaft
Collaborator Author

rytaft commented Feb 12, 2024

-- commits line 7 at r4:

Previously, michae2 (Michael Erickson) wrote…

Bikeshedding: could we use the word "timeout" in the name of this setting?

Sure, what would you change the whole name to?

@dt dt requested a review from michae2 February 13, 2024 14:59
Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 781 at r1 (raw file):

and the job reliably is canceled

I'm not actually sure there is any difference in the reliability? Without knowing the exact details of the "stuck in the created state" failure (and AFAIK we don't have stacks/debug.zip from when it was stuck that way to dig more), I'm not actually sure if a cancellation request would have unstuck it, or if we'd just end up blocked waiting for the cancellation request that was in turn waiting on the stuck start.


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Good point, in case the job is actually executing (and not just in the created phase), we'd like to cancel it. I think we should do your suggested option (b) to fit with the other goals of this PR. Do you know off-hand of a place we're already doing something similar that I can emulate? If not I can poke around...

I'd probably just use the SQL syntax, via an internal executor, and run CANCEL JOB $1, which will request cancellation, then SHOW JOB WHEN COMPLETE $1, which will block, polling the status with backoff until it reaches a terminal state (likely cancelled), rather than hitting the internal registry methods.

Collaborator

@michae2 michae2 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'd probably just use the SQL syntax, via an internal executor, and run CANCEL JOB $1, which will request cancellation, then SHOW JOB WHEN COMPLETE $1, which will block, polling the status with backoff until it reaches a terminal state (likely cancelled), rather than hitting the internal registry methods.

To add some color, note that in https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1707786198918579 we discuss a case where CANCEL JOB failed. Maybe if the cancel fails, we need to alert the customer to contact support?

Contributor

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

To add some color, note that in https://cockroachlabs.slack.com/archives/C02DSDS9TM1/p1707786198918579 we discuss a case where CANCEL JOB failed. Maybe if the cancel fails, we need to alert the customer to contact support?

Right, that is sorta why above I'm wondering if this more complex approach, of having one job try to cancel another and blocking until it does, actually is any more reliable than just a simple context.WithDeadline at the top of the resumer?

@rytaft
Collaborator Author

rytaft commented Feb 13, 2024

pkg/sql/create_stats.go line 782 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Right, that is sorta why above I'm wondering if this more complex approach, of having one job try to cancel another and blocking until it does, actually is any more reliable than just a simple context.WithDeadline at the top of the resumer?

With this approach at least we can log a message before attempting to cancel, and possibly wait for a pre-defined amount of time before doing something else (a louder log message?). Not sure how we'd detect such a problem using the resumer approach.

@rafiss rafiss removed the request for review from a team April 17, 2024 14:01
@rytaft
Collaborator Author

rytaft commented May 4, 2024

I think this PR is not a clear improvement, and I won't be able to work on it any time soon, so I'm going to close it.



Development

Successfully merging this pull request may close these issues.

sql: automatically cancel stats jobs that are running for too long

5 participants