sql: don't use streamer for local flows and non-default key locking #94399

Merged
craig[bot] merged 1 commit into cockroachdb:master from yuzefovich:streamer-no-locking
Dec 29, 2022

Conversation

@yuzefovich
Member

@yuzefovich yuzefovich commented Dec 29, 2022

This commit makes it so that we don't use the streamer API when running
fully-local plans if some processors in that flow require non-default
key locking. This change allows us to fix a regression with SELECT FOR UPDATE where the locks acquired by that statement would not be properly
cleaned up on txn commit (because we currently don't propagate lock
spans for leaf txns, and the streamer requires us to use the leaf txns).
The initial attempt to fix this propagation exposed an issue with
multi-tenant setups, so for now we choose to simply not use the streamer
in such cases.

Additionally, this commit disables the parallelization of local scans
when non-default key locking strength is found. The parallelization of
local scans also requires us to use the leaf txns, and it was introduced
in version 21.2; however, we haven't had any user reports there. Still, it
seems reasonable to update that code path with these recent learnings.
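
As a rough sketch of the detection described above (the types and names here are simplified stand-ins for the real processor specs, not cockroach's actual code), the idea is to walk every processor in the local flow and fall back to the RootTxn when any non-default locking strength is found:

```go
package main

import "fmt"

// LockingStrength is a stand-in for descpb.ScanLockingStrength;
// ForNone is the default (no explicit row-level locks).
type LockingStrength int

const (
	ForNone LockingStrength = iota
	ForShare
	ForUpdate
)

// ProcessorSpec is a simplified stand-in for a processor spec that
// carries the locking strength of its scan/join core.
type ProcessorSpec struct {
	Locking LockingStrength
}

// hasNonDefaultKeyLocking reports whether any processor in the local
// flow requests a non-default locking strength. When it returns true,
// the plan must avoid the streamer and parallel local scans (i.e. run
// with the RootTxn), because lock spans acquired by leaf txns are not
// propagated back on commit.
func hasNonDefaultKeyLocking(procs []ProcessorSpec) bool {
	for _, p := range procs {
		if p.Locking != ForNone {
			return true
		}
	}
	return false
}

func main() {
	plain := []ProcessorSpec{{Locking: ForNone}}
	sfu := []ProcessorSpec{{Locking: ForNone}, {Locking: ForUpdate}}
	fmt.Println(hasNonDefaultKeyLocking(plain)) // plain scan: streamer allowed
	fmt.Println(hasNonDefaultKeyLocking(sfu))   // FOR UPDATE: fall back to RootTxn
}
```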

Addresses: #94290.
Addresses: #94400.

Epic: None

Release note (bug fix): Previously, CockroachDB could delay the release
of the locks acquired when evaluating SELECT FOR UPDATE statements in
some cases. This delay (up to 5s) could then block future readers.
The bug was introduced in 22.2.0, and the temporary workaround, without
upgrading to a release with this fix, is to set the undocumented cluster
setting sql.distsql.use_streamer.enabled to false.

@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich
Member Author

Original attempt at fixing this properly was in #94388.

Collaborator

@michae2 michae2 left a comment

:lgtm: Some questions but feel free to ship this to unblock the release.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, and @yuzefovich)


pkg/sql/distsql_running.go line 683 at r1 (raw file):

			// will need to use the Leaf txn.
			for _, flow := range flows {
				localState.HasConcurrency = localState.HasConcurrency || execinfra.HasParallelProcessors(flow)

This might be an ignorant question, but since we can have local parallelism using leaf txns for other reasons, why do we not see this problem in those cases? Why only with the streamer?


pkg/sql/distsql_running.go line 712 at r1 (raw file):

			// cleaned up properly when the txn commits).
			// TODO(yuzefovich): fix the propagation of the lock spans with the
			// leaf txns and remove this check.

nit: Could we please include the issue number in this TODO?


pkg/sql/distsql_running.go line 730 at r1 (raw file):

						foundNonDefaultKeyLocking = foundNonDefaultKeyLocking ||
							proc.Spec.Core.InvertedJoiner.LockingStrength != descpb.ScanLockingStrength_FOR_NONE
					}

nit: This is fine for now but it would be nicer if this were determined at optimization time and passed as a planFlag like planFlagContainsMutation.
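
As a toy illustration of this suggestion (the bitmask shape follows Go idiom, but `planFlagContainsNonDefaultLocking` and the method here are hypothetical; only `planFlagContainsMutation` is named in the comment above):

```go
package main

import "fmt"

// planFlags is a bitmask of properties discovered once at optimization
// time, in the spirit of sql's planFlags. The constants below are
// illustrative, not the actual cockroach definitions.
type planFlags uint32

const (
	planFlagContainsMutation planFlags = 1 << iota
	planFlagContainsNonDefaultLocking
)

// IsSet reports whether the given flag bit is set.
func (f planFlags) IsSet(flag planFlags) bool { return f&flag != 0 }

func main() {
	var f planFlags
	// The optimizer would set this once when it builds a scan with
	// FOR UPDATE / FOR SHARE strength, so execution doesn't need to
	// re-walk all processor specs to make the streamer decision.
	f |= planFlagContainsNonDefaultLocking
	fmt.Println(f.IsSet(planFlagContainsNonDefaultLocking))
	fmt.Println(f.IsSet(planFlagContainsMutation))
}
```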


@yuzefovich yuzefovich left a comment

I implemented your suggestions. There are still some questions but will merge in order to get a nightly run since we know it's not the final solution anyway.

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @mgartner, and @michae2)


pkg/sql/distsql_running.go line 683 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This might be an ignorant question, but since we can have local parallelism using leaf txns for other reasons, why do we not see this problem in those cases? Why only with the streamer?

That's actually a very good question. We might have the same problem there (#94400), so I adjusted this PR to disable the parallelization of local scans with non-default key locking too.

However, even after that I still see the delayed lock cleanup even though we do use the root txn now (by not parallelizing the scan), so I didn't add a regression test for this.


pkg/sql/distsql_running.go line 712 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Could we please include the issue number in this TODO?

Done.


pkg/sql/distsql_running.go line 730 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: This is fine for now but it would be nicer if this were determined at optimization time and passed as a planFlag like planFlagContainsMutation.

Done.

@craig
Contributor

craig bot commented Dec 29, 2022

Build succeeded:

@nvb
Contributor

nvb commented Dec 29, 2022

we currently don't propagate lock spans for leaf txns, and the streamer requires us to use the leaf txns

The other complication that we would have run into with this approach is that only the root txn coordinator runs a heartbeat loop to keep its locks alive. If we only acquired locks in leaf txns, we would never have started the txn heartbeat and the txn could have been aborted after 5s. We could have launched the heartbeat when merging the LeafTxnFinalState, but that still only works if we call UpdateRootWithLeafFinalState within 5s of the initial lock being acquired.

I think this again hints at our implementation of explicit row-level locking and pushing that locking directly into scans (in all cases) being a little funky (see #75457 and #57031). If we had a separate RowLock execution node that lived above a scan/filter/limit/etc. then we could run that using the RootTxn and distribute the rest of the query as much as we'd like. We're going to have to do part of that work for read-committed.

miraradeva added a commit to miraradeva/cockroach that referenced this pull request Mar 21, 2023
Previously, as noted in cockroachdb#94290, it was possible for a LeafTxn to issue locking requests as part of SELECT FOR UPDATE. This behavior was unexpected and the RootTxn wasn't properly cleaning up the locks, resulting in others waiting for those locks to be released. The issue was resolved, in cockroachdb#94399, by ensuring non-default locking strength transactions don't use the streamer API and always run as RootTxn.

This patch adds an assertion on the kv side to prevent other existing or future attempts of LeafTxn issuing locking requests. We don't expect that there are such existing cases, so we don't expect this assertion to fail, but will keep an eye on the nightly tests to make sure.

Fixes: cockroachdb#97817
Release note: None
craig bot pushed a commit that referenced this pull request Mar 22, 2023
98741: ci: update bazel builder image r=rickystewart a=cockroach-teamcity

Release note: None
Epic: None


98878: backupccl: fix occasional TestRestoreErrorPropagates flake r=stevendanna a=adityamaru

Very rarely, under stress race, another automatic job would race with the restore and increment the error count. This would result in the count being greater than our expected value of 1. This disables all the automatic jobs, eliminating the chance of this race.

Fixes: #98037

Release note: None

99099: kvserver: deflake TestReplicaTombstone r=andrewbaptist a=tbg

Like many other tests, this test could flake because we'd sometimes
catch a "cannot remove learner while snapshot is in flight" error.

I think the root cause is that sometimes there are errant Raft snapshots
in the system[^1] and these get mistaken for LEARNERs that are still
being caught up by the replicate queue. I tried to address this general
class of issues by making the check for in-flight learner snapshots not
care about *raft* snapshots.

I was able to stress TestReplicaTombstone for 30+ minutes without a
failure using that approach, whereas previously it usually failed within
a few minutes.

```
./dev test --stress pkg/kv/kvserver/ --filter TestReplicaTombstone 2>&1 | tee stress.log
[...]
2461 runs so far, 0 failures, over 35m45s
```

[^1]: #87553

Fixes #98883.

Epic: none
Release note: None


99126: kv: return error on locking request in LeafTxn r=nvanbenschoten a=miraradeva

Previously, as noted in #94290, it was possible for a LeafTxn to issue locking requests as part of SELECT FOR UPDATE. This behavior was unexpected and the RootTxn wasn't properly cleaning up the locks, resulting in others waiting for those locks to be released. The issue was resolved, in #94399, by ensuring non-default locking strength transactions don't use the streamer API and always run as RootTxn.

This patch adds an assertion on the kv side to prevent other existing or future attempts of LeafTxn issuing locking requests. We don't expect that there are such existing cases, so we don't expect this assertion to fail, but will keep an eye on the nightly tests to make sure.

Fixes: #97817
Release note: None

99150: backupccl: stop logging unsanitized backup stmt in schedule executor r=stevendanna a=msbutler

Informs #99145

Release note: None

Co-authored-by: cockroach-teamcity <teamcity@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Mira Radeva <mira@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Mar 23, 2023