jobs: Rewrite job scheduler query logic. by miretskiy · Pull Request #78564 · cockroachdb/cockroach

miretskiy · 2022-03-26T19:49:01Z

Prior to this change, the job scheduler would look up a set of
schedules to execute, and it would lock each schedule via a FOR UPDATE
clause. The query was a complex query that also performed joins
on the system.jobs table. This resulted in a larger read set of rows
being locked, and it made transaction restarts more expensive.

This PR modifies the querying logic so that the scheduler first
obtains a set of potential schedules to execute. Then, each schedule
executes under its own transaction, where only a single schedule is
locked for update (to guarantee only one scheduler executes this schedule).

Release Notes (enterprise change): Job scheduler is more efficient
and should no longer lock up the jobs and scheduled jobs tables.

Release Justification: Stability improvement for scheduled jobs system.

nvb · 2022-03-28T20:46:16Z

@miretskiy It would be very helpful if we included the EXPLAIN plan in a comment above each SQL query. As we've seen, the potential for transaction contention is closely related to how constrained the underlying index scans are on these tables.

miretskiy · 2022-03-28T21:00:30Z

Good idea, @nvanbenschoten . Will do once I rebase/cleanup and get this ready for review.

miretskiy · 2022-03-29T12:17:09Z

executeSchedules runs a simple query now under nil txn; the plan uses correct next_run index:

```
demo@127.0.0.1:26257/movr> explain SELECT schedule_id FROM system.scheduled_jobs WHERE next_run < now() ORDER BY random() LIMIT 10;
                               info
------------------------------------------------------------------
  distribution: local
  vectorized: true

  • top-k
  │ order: +column13
  │ k: 10
  │
  └── • render
      │
      └── • scan
            missing stats
            table: scheduled_jobs@next_run_idx
            spans: (/NULL - /'2022-03-29 12:07:30.689322+00:00']
(13 rows)
```

After that query, for each candidate row, we execute:

```
demo@127.0.0.1:26257/movr> explain SELECT * FROM system.scheduled_jobs WHERE schedule_id=123 AND next_run < now() FOR UPDATE;
                                                                                               info
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • filter
  │ filter: next_run < '2022-03-29 12:09:36.471592+00:00'
  │
  └── • scan
        missing stats
        table: scheduled_jobs@primary
        spans: [/123 - /123]
        locking strength: for update

  index recommendations: 1
  1. type: index creation
     SQL command: CREATE INDEX ON scheduled_jobs (schedule_id, next_run) STORING (schedule_name, created, owner, schedule_state, schedule_expr, schedule_details, executor_type, execution_args);
(15 rows)
```

I'm not sure why it recommends creating an index on (schedule_id, next_run) -- it's still a point lookup. I could drop "WHERE next_run<now()" and just parse/compare against the current time. Not sure it's worth it. This query runs under the txn.

Once we get a non-nil row from the query above (i.e. the schedule is ready to run), we count how many jobs are running, using a nil txn.

```
demo@127.0.0.1:26257/movr> explain SELECT count(*) FROM system.jobs WHERE created_by_type = 'xxx' AND created_by_id = 123 AND status IN ('blah', 'blah');
                                           info
-------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • group (scalar)
  │
  └── • filter
      │ filter: status = 'blah'
      │
      └── • scan
            missing stats
            table: jobs@jobs_created_by_type_created_by_id_idx
            spans: [/'xxx'/123 - /'xxx'/123]

  index recommendations: 1
  1. type: index creation
     SQL command: CREATE INDEX ON jobs (status) STORING (created_by_type, created_by_id);
(16 rows)
```

Again, not sure why we get an index creation suggestion: jobs@jobs_created_by_type_created_by_id_idx already stores status... so, why?

Subsequent logic runs under the txn used to lock a single row for update:
1. Savepoint
2. ProcessSchedule
3. Rollback or commit based on ProcessSchedule result

We are using a savepoint so that failures in schedule execution can be rolled back, and those failures can be handled
as specified by the schedule policy (reschedule, retry, etc). I think that's fine.

The big change, of course, is executing the first query w/out any explicit txn and w/out FOR UPDATE. Also, the lookup against the jobs table doesn't use the txn -- thus no increase in the read set. Just 1 row ought to be locked for update. Of course, executing a schedule could do something silly (like read the entire jobs table under the txn) -- but we have a separate issue for that.

ajwerner · 2022-03-29T13:32:17Z

I think you meant to tag @nvanbenschoten

miretskiy · 2022-03-29T14:52:39Z

> I think you meant to tag @nvanbenschoten

Yeah... Completion failed me.

shermanCRL · 2022-03-29T20:42:43Z

pkg/jobs/job_scheduler.go

```diff
-	if n, ok := row[0].(*tree.DInt); ok {
-		return j, int64(*n), nil
+	if row == nil {
+		return nil, errScheduleNotRunnable
```

Should we add (wrap) a little information here, such as the scheduleID? Just in case this happens more frequently than we expect.

HonoreDB · 2022-03-29T21:04:44Z

> Again, not sure why we get index creation suggestion

Probably on your table status has better specificity...lots of old job records not running any more with the same created_by. Might actually be realistic.

miretskiy · 2022-03-29T22:12:10Z

> Probably on your table status has better specificity...lots of old job records not running any more with the same created_by. Might actually be realistic.

But created_by_index also stores status....

nvb · 2022-03-30T22:24:30Z

This all looks much better. Executing each candidate schedule in its own txn is a major improvement.

> SELECT schedule_id FROM system.scheduled_jobs WHERE next_run < now() ORDER BY random() LIMIT 10;

Why the ORDER BY random()? The effect of this is that we can't push the limit into the scan. But we do end up scanning scheduled_jobs@next_run_idx from (/NULL - /'2022-03-29 12:07:30.689322+00:00'], so this transaction will contend with any other that writes into this range. However, you're now running this scan outside of the read-write scheduling transaction, so this is probably ok.

If we could, would we like this to be?

SELECT schedule_id FROM system.scheduled_jobs WHERE next_run < now() LIMIT 10 FOR UPDATE SKIP LOCKED;

miretskiy · 2022-03-30T22:59:34Z

Going to add a TODO to use SKIP LOCKED when it's ready.

dt · 2022-03-31T18:45:25Z

pkg/jobs/job_scheduler.go

```diff
+	}
+
+	timeout := schedulerScheduleExecutionTimeout.Get(&s.Settings.SV)
+	if processErr := withSavePoint(ctx, txn, func() error {
```

Do we need this savepoint anymore? Can we just do this whole schedule over?

miretskiy (reply) · 2022-03-31

I was a bit worried about dropping this. We need to update the schedule when ExecuteJob returns. If it returns an error, I don't know if it did any mutations. I think if we do something like #78466, then we would be able to remove withSavePoint.

miretskiy · 2022-03-31T20:29:38Z

bors r+

craig · 2022-03-31T22:50:28Z

Build succeeded:

GitHub CI (Cockroach)

79134: kv: support FOR {UPDATE,SHARE} SKIP LOCKED r=arulajmani a=nvanbenschoten

KV portion of #40476. Assists #62734. Assists #72407. Assists #78564.

**NOTE: the SQL changes here were extracted from this PR and moved to #83627. This allows us to land the KV portion of this change without exposing it yet.**

```sql
CREATE TABLE kv (k INT PRIMARY KEY, v INT)
INSERT INTO kv VALUES (1, 1), (2, 2), (3, 3)

-- in session 1
BEGIN; UPDATE kv SET v = 0 WHERE k = 1 RETURNING *
  k | v
----+----
  1 | 0

-- in session 2
BEGIN; SELECT * FROM kv ORDER BY k LIMIT 1 FOR UPDATE SKIP LOCKED
  k | v
----+----
  2 | 2

-- in session 3
BEGIN; SELECT * FROM kv FOR UPDATE SKIP LOCKED
  k | v
----+----
  3 | 3
```

These semantics closely match those of FOR {UPDATE,SHARE} SKIP LOCKED in PostgreSQL. With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.

[Here](https://www.pgcasts.com/episodes/the-skip-locked-feature-in-postgres-9-5) is a short video that explains why users might want to use SKIP LOCKED in Postgres. The same motivation applies to CockroachDB. However, SKIP LOCKED is not a complete solution to queues, as MVCC garbage will still become a major problem with sufficiently high consumer throughput. Even with a very low gc.ttl, CockroachDB does not garbage collect MVCC garbage fast enough to avoid slowing down consumers that scan from the head of a queue over MVCC tombstones of previously consumed queue entries.

----

### Implementation

Skip locked has a number of touchpoints in Storage and KV. To understand these, we first need to understand the isolation model of skip-locked. When a request is using a SkipLocked wait policy, it behaves as if run at a weaker isolation level for any keys that it skips over. If the read request does not return a key, it does not make a claim about whether that key does or does not exist or what the key's value was at the read's MVCC timestamp. Instead, it only makes a claim about the set of keys that are returned. For those keys which were not skipped and were returned (and often locked, if combined with a locking strength, though this is not required), serializable isolation is enforced.

When the `pebbleMVCCScanner` is configured with the skipLocked option, it does not include locked keys in the result set. To support this, the MVCC layer needs to be provided access to the in-memory lock table, so that it can determine whether keys are locked with an unreplicated lock. Replicated locks are represented as intents, which will be skipped over in getAndAdvance.

Requests using the SkipLocked wait policy acquire the same latches as before and wait on all latches ahead of them in line. However, if a request is using a SkipLocked wait policy, we always perform optimistic evaluation. In Replica.collectSpansRead, SkipLocked reads are able to constrain their read spans down to point reads on just those keys that were returned and were not already locked. This means that there is a good chance that some or all of the write latches that the SkipLocked read would have blocked on won't overlap with the keys that the request ends up returning, so they won't conflict when checking for optimistic conflicts.

Skip locked requests do not scan the lock table when initially sequencing. Instead, they capture a snapshot of the in-memory lock table while sequencing and scan the lock table as they perform their MVCC scan using the btree snapshot stored in the concurrency guard. MVCC was taught about skip locked in the previous commit.

Skip locked requests add point reads for each of the keys returned to the timestamp cache, instead of adding a single ranged read. This satisfies the weaker isolation level of skip locked. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to prevent writes below its read timestamp to keys that were skipped.

Similarly, skip locked requests only record refresh spans for the individual keys returned, instead of recording a refresh span across the entire read span. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to validate that they have not changed if the transaction ever needs to refresh.

----

### Benchmarking

I haven't done any serious benchmarking with this SKIP LOCKED yet, though I'd like to. At some point, I would like to build a simple queue-like workload into the `workload` tool and experiment with various consumer access patterns (non-locking reads, locking reads, skip-locked reads), indexing schemes, concurrency levels (for producers and consumers), and batch sizes.

82915: sql: add locality to system.sql_instances table r=rharding6373 a=rharding6373

This PR adds the column `locality` to the `system.sql_instances` table that contains the locality (e.g., region) of a SQL instance. The encoded locality is a string representing the `roachpb.Locality` that may have been provided when the instance was created.

This change also pipes the locality through `InstanceInfo`. This will allow us to determine and use locality information of other SQL instances, e.g. in DistSQL for multi-tenant locality-awareness distribution planning.

Informs: #80678

Release note (sql change): Table `system.sql_instances` has a new column, `locality`, that stores the locality of a SQL instance if it was provided when the instance was started. This exposes a SQL instance's locality to other instances in the cluster for query planning.

83418: loopvarcapture: do not flag `defer` within local closure r=srosenberg,dhartunian a=renatolabs

Previously, handling of `defer` statements in the `loopvarcapture` linter was naive: whenever a `defer` statement in the body of a loop referenced a loop variable, the linter would flag it as an invalid reference. However, that can be overly restrictive, as a relatively common idiom is to create literal functions and immediately call them so as to take advantage of `defer` semantics, as in the example below:

```go
for _, n := range numbers {
	// ...
	func() {
		// ...
		defer func() { doSomething(n) }() // always safe
		// ...
	}()
}
```

The above reference is valid because it is guaranteed to be called with the correct value for the loop variable. A similar scenario occurs when a closure is assigned to a local variable for use within the loop:

```go
for _, n := range numbers {
	// ...
	helper := func() {
		// ...
		defer func() { doSomething(n) }()
		// ...
	}
	// ...
	helper() // always safe
}
```

In the snippet above, calling the `helper` function is also always safe because the `defer` statement is scoped to the closure containing it. However, it is still *not* safe to call the helper function within a Go routine.

This commit updates the `loopvarcapture` linter to recognize when a `defer` statement is safe because it is contained in a local closure. The two cases illustrated above will no longer be flagged, allowing for that idiom to be used freely.

Release note: None.

83545: sql/schemachanger: move end to end testing to one test per-file r=fqazi a=fqazi

Previously, we allowed multiple tests per-file for end-to-end testing inside the declarative schema changer. This was inadequate because we plan on extending the end-to-end testing to start injecting additional read/write operations at different stages, which would make it difficult. To address this, this patch will split tests into individual files, with one test per file. Additionally, it extends support to allow multiple statements per-test statement, for transaction support testing (this is currently unused).

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>

miretskiy force-pushed the scheduler_rewrite branch from 4f24b5b to 7373f03 Compare March 26, 2022 19:54

miretskiy mentioned this pull request Mar 28, 2022

jobs: remove batching from job scheduler #78661

Open

miretskiy force-pushed the scheduler_rewrite branch 2 times, most recently from 74882db to f131804 Compare March 29, 2022 12:04

miretskiy force-pushed the scheduler_rewrite branch from f131804 to 679c731 Compare March 29, 2022 14:54

miretskiy marked this pull request as ready for review March 29, 2022 20:22

miretskiy requested a review from a team as a code owner March 29, 2022 20:22

miretskiy requested review from a team, HonoreDB, nvb and stevendanna and removed request for a team March 29, 2022 20:22

shermanCRL reviewed Mar 29, 2022

miretskiy requested a review from dt March 30, 2022 12:22

nvb mentioned this pull request Mar 31, 2022

kv: support FOR {UPDATE,SHARE} SKIP LOCKED #79134

Merged

dt reviewed Mar 31, 2022

dt approved these changes Mar 31, 2022

craig bot merged commit eec4ccc into cockroachdb:master Mar 31, 2022

miretskiy added the backport-22.1.x label Apr 1, 2022

miretskiy mentioned this pull request Apr 4, 2022

release-22.1: jobs: Rewrite job scheduler query logic #79328

Merged

nvb mentioned this pull request Jun 29, 2022

sql: support FOR {UPDATE,SHARE} SKIP LOCKED #83627

Closed

7 participants