Skip to content

jobs: enable marking jobs as idle for metrics#75897

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
samiskin:mark-jobs-idle
Feb 10, 2022
Merged

jobs: enable marking jobs as idle for metrics#75897
craig[bot] merged 1 commit intocockroachdb:masterfrom
samiskin:mark-jobs-idle

Conversation

@samiskin
Copy link
Copy Markdown
Contributor

@samiskin samiskin commented Feb 2, 2022

In order to allow long-running / indefinite jobs to not continually
block a tenant from being treated as inactive in a serverless
environment, this change adds a MarkIdle API to the jobs registry which
increases / decreases a counter of currently idle jobs. An idle job
should be able to be shut down at any moment. Through the CurrentlyIdle
and CurrentlyRunning jobs numbers serverless can determine if all jobs
are idle and the tenant can be safely shut down. For now this is only
saved in-memory in the registry object, however we may transition to
saving in persisted job state as well in
the future.

Fixes #74747

Release note: None

@samiskin samiskin requested a review from a team as a code owner February 2, 2022 21:32
@samiskin samiskin requested review from shermanCRL and removed request for a team February 2, 2022 21:32
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@samiskin
Copy link
Copy Markdown
Contributor Author

samiskin commented Feb 2, 2022

Putting this in a separate PR from the rest of the Changefeeds side of using this API so that this can be merged without being delayed by any changefeed-specific feedback. Time matters since some work on the KV team is blocked by this.

@samiskin samiskin requested a review from miretskiy February 2, 2022 21:35
@samiskin
Copy link
Copy Markdown
Contributor Author

samiskin commented Feb 2, 2022

cc @irfansharif

@samiskin samiskin force-pushed the mark-jobs-idle branch 2 times, most recently from 475ed7e to d817228 Compare February 2, 2022 22:03
@samiskin samiskin requested a review from dt February 2, 2022 22:04
err = resumer.Resume(resumeCtx, execCtx)
}()

r.MarkIdle(job, false)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not needed since false is the default anyway?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry forgot to respond to this last time, this is to unmark its idleness after its been potentially marked idle during its resumer.Resume(...) lifetime.

}

// A single job should not toggle its idleness more than twice per-minute as it
// is logged and may write to persisted job state in the future
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know; I wouldn't mention anything about persistence here or in the PR description.
I would however mention that MarkIdle should not be called too frequently (w/out prescribing 2x/minute)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was from the discussion in #74747, I don't mind removing the persistence/freq specifics since the main point is just the infrequent requirement

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APIs have contracts that are just in comments, and maybe not enforced in code but could cause misbehavior if violated, all the time like "caller must hold lock xyz" or "is not threadsafe". I think "should not be called more than x times per minute" fits that bill too (and is how we justify logging below)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment should end with a period.

Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @samiskin, and @shermanCRL)


pkg/jobs/registry.go, line 1290 at r1 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

This was from the discussion in #74747, I don't mind removing the persistence/freq specifics since the main point is just the infrequent requirement

This is an API. There is nothing enforcing any sort of limits on that. It's just a recommendation.


pkg/jobs/registry.go, line 1299 at r1 (raw file):

Previously, dt (David Taylor) wrote…

So I think I argued for logging these state transitions in the issue. We log, today, when a job starts running ("stepping though state running..."). This is useful since you can now look at a log and go "aha, the job started running at 123, and as 124 we see these things explode". If a job is moving from idle/parked into active work, that is potentially equally notable to when it initially started running, if idleness is defined on a long enough time-scale; in conjunction with the contract above that says you can only change state twice a minute, I think you can argue this logline isn't too spammy.

Ack. I'm okay here; was a bit worried if this is called 100s of times/sec.. But I guess seeing "broken" code like this is also okay.


pkg/jobs/registry.go, line 1300 at r1 (raw file):

Previously, samiskin (Shiranka Miskin) wrote…

This should already function that way since it'll only inc/dec if aj.isIdle != isIdle, so multiple MarkIdle(true) fail that. I'll add the below-0 error and a test around this 👍

Ahh Yes, nice. But please do add the tests.

test.update(job.ID())

testutils.SucceedsSoon(t, func() error {
if r.IsJobIdle(job.ID()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you cancel the job, isn't it possible that the registry will drop that job from it's in-memory structure?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that is possible, but right now I have IsJobIdle(...) simply returning false if the job isn't currently adopted in the registry (this is mentioned in the comment above the function). I could also make IsJobIdle return a bool + error for if the job ID isn't found.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... Perhaps this is a test only function? If it were renamed to TestingIsJobIdle, I'd be more okay w/ that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make it TestingIsJobIdle for now and then see if it ends up being necessary for the changefeeds code in that PR. I was initially using it when manually deciding to place down a protected timestamp record but that'l no longer be necessary.

In order to allow long-running / indefinite jobs to not continually
block a tenant from being treated as inactive in a serverless
environment, this change adds a MarkIdle API to the jobs registry which
increases / decreases a counter of currently idle jobs. An idle job
should be able to be shut down at any moment.  Through the CurrentlyIdle
and CurrentlyRunning jobs numbers serverless can determine if all jobs
are idle and the tenant can be safely shut down.  For now this is only
saved in-memory in the registry object, however we may transition to
saving in persisted job state as well in
the future.

Fixes cockroachdb#74747

Release note: None
@samiskin
Copy link
Copy Markdown
Contributor Author

samiskin commented Feb 9, 2022

bors r+

@craig craig bot merged commit 0fdf716 into cockroachdb:master Feb 10, 2022
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 10, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

jobs: Add MarkIdle API

4 participants