jobs: enable marking jobs as idle for metrics#75897
jobs: enable marking jobs as idle for metrics#75897craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
Putting this in a separate PR from the rest of the Changefeeds side of using this API so that this can be merged without being delayed by any changefeed-specific feedback. Time matters since some work on the KV team is blocked by this. |
|
cc @irfansharif |
475ed7e to
d817228
Compare
| err = resumer.Resume(resumeCtx, execCtx) | ||
| }() | ||
|
|
||
| r.MarkIdle(job, false) |
There was a problem hiding this comment.
probably not needed since false is the default anyway?
There was a problem hiding this comment.
Oh sorry forgot to respond to this last time, this is to unmark its idleness after its been potentially marked idle during its resumer.Resume(...) lifetime.
pkg/jobs/registry.go
Outdated
| } | ||
|
|
||
| // A single job should not toggle its idleness more than twice per-minute as it | ||
| // is logged and may write to persisted job state in the future |
There was a problem hiding this comment.
I don't know; I wouldn't mention anything about persistence here or in the PR description.
I would however mention that MarkIdle should not be called too frequently (w/out prescribing 2x/minute)
There was a problem hiding this comment.
This was from the discussion in #74747, I don't mind removing the persistence/freq specifics since the main point is just the infrequent requirement
There was a problem hiding this comment.
APIs have contracts that are just in comments, and maybe not enforced in code but could cause misbehavior if violated, all the time like "caller must hold lock xyz" or "is not threadsafe". I think "should not be called more than x times per minute" fits that bill too (and is how we justify logging below)
There was a problem hiding this comment.
nit: comment should end with a period.
miretskiy
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @samiskin, and @shermanCRL)
pkg/jobs/registry.go, line 1290 at r1 (raw file):
Previously, samiskin (Shiranka Miskin) wrote…
This was from the discussion in #74747, I don't mind removing the persistence/freq specifics since the main point is just the infrequent requirement
This is an API. There is nothing enforcing any sort of limits on that. It's just a recommendation.
pkg/jobs/registry.go, line 1299 at r1 (raw file):
Previously, dt (David Taylor) wrote…
So I think I argued for logging these state transitions in the issue. We log, today, when a job starts running ("stepping though state running..."). This is useful since you can now look at a log and go "aha, the job started running at 123, and as 124 we see these things explode". If a job is moving from idle/parked into active work, that is potentially equally notable to when it initially started running, if idleness is defined on a long enough time-scale; in conjunction with the contract above that says you can only change state twice a minute, I think you can argue this logline isn't too spammy.
Ack. I'm okay here; was a bit worried if this is called 100s of times/sec.. But I guess seeing "broken" code like this is also okay.
pkg/jobs/registry.go, line 1300 at r1 (raw file):
Previously, samiskin (Shiranka Miskin) wrote…
This should already function that way since it'll only inc/dec if
aj.isIdle != isIdle, so multipleMarkIdle(true)fail that. I'll add the below-0 error and a test around this 👍
Ahh Yes, nice. But please do add the tests.
633306c to
9a612b8
Compare
pkg/jobs/registry_test.go
Outdated
| test.update(job.ID()) | ||
|
|
||
| testutils.SucceedsSoon(t, func() error { | ||
| if r.IsJobIdle(job.ID()) { |
There was a problem hiding this comment.
if you cancel the job, isn't it possible that the registry will drop that job from it's in-memory structure?
There was a problem hiding this comment.
Yep that is possible, but right now I have IsJobIdle(...) simply returning false if the job isn't currently adopted in the registry (this is mentioned in the comment above the function). I could also make IsJobIdle return a bool + error for if the job ID isn't found.
There was a problem hiding this comment.
Hm... Perhaps this is a test only function? If it were renamed to TestingIsJobIdle, I'd be more okay w/ that.
There was a problem hiding this comment.
I'll make it TestingIsJobIdle for now and then see if it ends up being necessary for the changefeeds code in that PR. I was initially using it when manually deciding to place down a protected timestamp record but that'l no longer be necessary.
In order to allow long-running / indefinite jobs to not continually block a tenant from being treated as inactive in a serverless environment, this change adds a MarkIdle API to the jobs registry which increases / decreases a counter of currently idle jobs. An idle job should be able to be shut down at any moment. Through the CurrentlyIdle and CurrentlyRunning jobs numbers serverless can determine if all jobs are idle and the tenant can be safely shut down. For now this is only saved in-memory in the registry object, however we may transition to saving in persisted job state as well in the future. Fixes cockroachdb#74747 Release note: None
9a612b8 to
5c92d71
Compare
|
bors r+ |
|
Build succeeded: |
In order to allow long-running / indefinite jobs to not continually
block a tenant from being treated as inactive in a serverless
environment, this change adds a MarkIdle API to the jobs registry which
increases / decreases a counter of currently idle jobs. An idle job
should be able to be shut down at any moment. Through the CurrentlyIdle
and CurrentlyRunning jobs numbers serverless can determine if all jobs
are idle and the tenant can be safely shut down. For now this is only
saved in-memory in the registry object, however we may transition to
saving in persisted job state as well in
the future.
Fixes #74747
Release note: None