jobs: clear job claim after execution #91563
Conversation
pkg/jobs/adopt.go
Outdated
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
I was mostly following what we do in other lifecycle queries in this package (claimJobs and servePauseAndCancelRequests). However, those happen constantly on all nodes so are more likely to cause repeated contention, whereas one hopes this is relatively infrequently called, so perhaps we can do without it.
Yeah, the other point queries in this package don't do this. The idea to use low priority for the scans was to not starve user queries or point operations.
pkg/jobs/adopt.go
Outdated
// We use the serverCtx here rather than the context from the
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
I think we'd be better off not using a txn loop if we can avoid it. This is a txn which can use parallel commits. I'd rather that it did that by using a nil txn passed to the ie. I think if you want low priority you maybe can do that through session data.
I've updated this and decided to go forward without a low priority query for now.
Do you happen to know if there is a good way to verify that the query is in fact using parallel commit?
A KV trace would show you. Not exactly sure how to get that in this context. I think maybe you could install something into sessiondata and then the context?
Like, if you did:
root@localhost:26257/defaultdb> create table foo (i int primary key, j int, index(j));
CREATE TABLE
Time: 897ms total (execution 897ms / network 0ms)
root@localhost:26257/defaultdb> set tracing = kv;
SET TRACING
Time: 0ms total (execution 0ms / network 0ms)
root@localhost:26257/defaultdb> insert into foo values (5, 1);
INSERT 0 1
Time: 45ms total (execution 44ms / network 0ms)
root@localhost:26257/defaultdb> show kv trace for session;
timestamp | age | message | tag | location | operation | span
--------------------------------+-----------------+--------------------------------------------------+----------------------------------------------------+-----------------------------------------+--------------------------------------+-------
2022-11-10 15:35:54.82937+00 | 00:00:00.003108 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 2
2022-11-10 15:35:59.56418+00 | 00:00:04.737918 | CPut /Table/105/1/5/0 -> /TUPLE/2:2:Int/1 | [n1,client=127.0.0.1:54700,user=root] | sql/row/writer.go:215 | count | 18
2022-11-10 15:35:59.564228+00 | 00:00:04.737966 | InitPut /Table/105/2/1/5/0 -> /BYTES/ | [n1,client=127.0.0.1:54700,user=root] | sql/row/inserter.go:183 | count | 18
2022-11-10 15:35:59.564313+00 | 00:00:04.738051 | querying next range at /Table/105/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564371+00 | 00:00:04.738109 | querying next range at /Table/105/2/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564409+00 | 00:00:04.738147 | r58: sending batch 1 InitPut to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | dist sender send | 20
2022-11-10 15:35:59.564476+00 | 00:00:04.738214 | r57: sending batch 1 CPut, 1 EndTxn to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | kv.DistSender: sending partial batch | 21
2022-11-10 15:35:59.606589+00 | 00:00:04.780327 | fast path completed | [n1,client=127.0.0.1:54700,user=root] | sql/plan_node_to_row_source.go:176 | count | 18
2022-11-10 15:35:59.607296+00 | 00:00:04.781034 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 12
2022-11-10 15:35:59.610933+00 | 00:00:04.784671 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 27
(10 rows)
That's definitely a parallel commit.
b5149f1 to ebf9b97
ajwerner left a comment
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)
Previously, if the ALTER CHANGEFEED txn retried, we would increment these counters again. In cockroachdb#91563 we made a change that makes it more likely that this transaction will retry during one of the tests, revealing the issue. Release note: None
The test failure is #91812
Since cockroachdb#89014 the job system has reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context and does not wait for the resumer to exit. Once the claim is clear, another node can adopt the job and start running its OnFailOrCancel callback. As a result, clearing the claim makes it more likely that OnFailOrCancel executions will overlap with Resume executions.

In general, jobs need to assume that Resume may still be running while OnFailOrCancel is called. But making it more likely isn't in our interest.

Here, we only clear the lease when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned.

Release note: None
Release note: None
ebf9b97 to deb0896
bors r=ajwerner
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 2ba983d to blathers/backport-release-22.1-91563: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None
92005: jobs: deflake TestRegistryLifecycle r=ajwerner a=stevendanna

The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that #91563 made transaction retries more likely.

Fixes #92001

Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>