Skip to content

jobs: improve job adoption#53589

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
ajwerner:ajwerner/fix-jobs-badness
Sep 1, 2020
Merged

jobs: improve job adoption#53589
craig[bot] merged 4 commits intocockroachdb:masterfrom
ajwerner:ajwerner/fix-jobs-badness

Conversation

@ajwerner
Copy link
Copy Markdown
Contributor

jobs: don't hold mutex during adoption, launch in parallel

jobs: break up new stages of job lifecycle movement

In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into the same stage and we invoked that stage on each adoption
interval and on each sent to the adoption channel.

These stages are:

  • Cancel jobs
  • Serve pause and cancel requests
  • Delete claims due to dead sessions
  • Claim jobs
  • Process claimed jobs

This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.

After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop as it was exceedingly frequent at 1s for no obvious reason.

Much of the testing changes are due to this cancelation loop duration
change. The tests in this package now run 3x faster (10s vs 30s).

Then, upon sends on the adoption channel, we just process claimed jobs.
When the adoption interval rolls around, then we attempt to both claim
and process jobs.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/fix-jobs-badness branch 2 times, most recently from be28a4b to cf3b6e7 Compare August 28, 2020 14:28
In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into the same stage and we invoked that stage on each adoption
interval and on each sent to the adoption channel.

These stages are:

 * Cancel jobs
 * Serve pause and cancel requests
 * Delete claims due to dead sessions
 * Claim jobs
 * Process claimed jobs

This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.

After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop as it was exceedingly frequent at 1s for no obvious reason.

Much of the testing changes are due to this cancelation loop duration
change. The tests in this package now run 3x faster (10s vs 30s).

Then, upon sends on the adoption channel, we just process claimed jobs.
When the adoption interval rolls around, then we attempt to both claim
and process jobs.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
@ajwerner ajwerner force-pushed the ajwerner/fix-jobs-badness branch from cf3b6e7 to 8020c95 Compare August 28, 2020 15:02
@ajwerner
Copy link
Copy Markdown
Contributor Author

I've roachprod-stressraced this for 16 minutes so far.

@ajwerner ajwerner requested a review from spaskob August 28, 2020 15:03
@ajwerner ajwerner marked this pull request as ready for review August 28, 2020 15:03
@ajwerner ajwerner requested a review from a team August 28, 2020 15:03
@ajwerner ajwerner requested a review from a team as a code owner August 28, 2020 15:03
@ajwerner ajwerner requested review from adityamaru and removed request for a team August 28, 2020 15:03
@ajwerner
Copy link
Copy Markdown
Contributor Author

On master the typeorm test takes:

--- PASS: typeorm (2425.84s)
--- PASS: typeorm (2510.47s)
--- PASS: typeorm (2554.91s)
PASS

With this PR it is:

--- PASS: typeorm (1774.17s)
--- PASS: typeorm (1786.25s)
--- PASS: typeorm (1821.85s)
PASS

Better but not amazing. cc @rafiss

@ajwerner
Copy link
Copy Markdown
Contributor Author

Actually setting the GC TTL and the merge queue setting gets us to:

--- PASS: typeorm (1444.19s)
--- PASS: typeorm (1463.58s)
--- PASS: typeorm (1530.58s)

The CPU profiles are rather hilarious still. I'm working on an RFC to deal with the root cause of that pain. I'll post the basic finding which I should have very soon.

@ajwerner
Copy link
Copy Markdown
Contributor Author

A couple more tweaks to the KV layer gets us to:

=== RUN   typeorm
=== RUN   typeorm
=== RUN   typeorm
--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

@ajwerner
Copy link
Copy Markdown
Contributor Author

In sum this + one of #53605 or #53603 + #53606 gets us to:

--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

if r.adoptionDisabled(ctx) {
r.deprecatedCancelAll(ctx)
return
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused what is happening here - this is the old adoption logic. Why are we changing it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow this comment. This loop was both the old and the new logic. I've now separated them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new logic is in claimAndProcessJobs

return stop.ErrUnavailable
case <-ctx.Done():
return ctx.Err()
if !usingSQLLiveness || i == 0 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment explaining the if condition

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Andrew Werner added 2 commits August 31, 2020 08:50
This commit turns out to make a reasonably big difference.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
Release justification: low risk, high benefit changes to existing functionality
Release note: None
@ajwerner ajwerner force-pushed the ajwerner/fix-jobs-badness branch from 8020c95 to df0e881 Compare August 31, 2020 12:50
Copy link
Copy Markdown
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL
ping me directly if it is easier

if r.adoptionDisabled(ctx) {
r.deprecatedCancelAll(ctx)
return
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new logic is in claimAndProcessJobs

Copy link
Copy Markdown
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 15 files at r2, 2 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

// resumeClaimedJobs invokes r.resumeJob for each job in claimedToResume. It
// does so concurrently.
func (r *Registry) resumeClaimedJobs(

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.

Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r=spaskob

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.

In a world where the interfaces were different, I think you're right. I'd love to inject the set of currently running jobs under the sql query via a builtin or virtual table and then pull the job payload directly but that feels more disruptive at this point. I think this will work nicely for this release.


pkg/jobs/registry.go, line 594 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

The new logic is in claimAndProcessJobs

Done.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 31, 2020

Build failed (retrying...):

@ajwerner
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 31, 2020

Already running a review

@otan
Copy link
Copy Markdown
Contributor

otan commented Aug 31, 2020

is this causing TestTruncateCompletion to fail in https://teamcity.cockroachdb.com/viewLog.html?buildId=2236228&buildTypeId=Cockroach_UnitTests?

@ajwerner
Copy link
Copy Markdown
Contributor Author

Maybe.

bors r-

while I investigate

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 31, 2020

Canceled.

ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Aug 31, 2020
The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flakey before. The work in cockroachdb#53589 to generally speed up job
adoption seems to have revealed the flake that I think existed before the
change. This test used to fail under stress 2 minutes. I've run it for 6
now and so far so good.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this pull request Sep 1, 2020
53711: sql: fix TestTruncateCompletion r=rohany a=ajwerner

The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flakey before. The work in #53589 to generally speed up job
adoption seems to have revealed the flake that I think existed before the
change. This test used to fail under stress 2 minutes. I've run it for 6
now and so far so good.

Release justification: non-production code changes

Release note: None

53714: roachtest: add expected passes to ORM tests, stabilize sqlalchemy r=rafiss a=rafiss

fixes #53598 

Release note: None
Release justification: test-only change

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@ajwerner
Copy link
Copy Markdown
Contributor Author

ajwerner commented Sep 1, 2020

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 1, 2020

Build succeeded:

@craig craig bot merged commit d331642 into cockroachdb:master Sep 1, 2020
@rafiss
Copy link
Copy Markdown
Collaborator

rafiss commented Sep 1, 2020

touches #52556

ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
This was broken in cockroachdb#53589. Having this broken made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
craig bot pushed a commit that referenced this pull request Sep 3, 2020
53872: jobs: fix TestingNudgeAdoptionQueue r=ajwerner a=ajwerner

This was broken in #53589. Having this broken made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
ajwerner pushed a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
These were missed in cockroachdb#53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 8, 2020
53898: sql,importccl: adopt jobs.TestingSetAdoptAndCancelIntervals r=ajwerner a=ajwerner

These were missed in #53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants