sql: don't use streamer for local flows and non-default key locking #94399

craig[bot] merged 1 commit into cockroachdb:master
Conversation

Original attempt at fixing this properly was in #94388.
michae2 left a comment:

Some questions, but feel free to ship this to unblock the release.
Reviewed 2 of 2 files at r1, all commit messages.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @mgartner, and @yuzefovich)
pkg/sql/distsql_running.go line 683 at r1 (raw file):
```go
// will need to use the Leaf txn.
for _, flow := range flows {
	localState.HasConcurrency = localState.HasConcurrency ||
		execinfra.HasParallelProcessors(flow)
```
This might be an ignorant question, but since we can have local parallelism using leaf txns for other reasons, why do we not see this problem in those cases? Why only with the streamer?
pkg/sql/distsql_running.go line 712 at r1 (raw file):
```go
// cleaned up properly when the txn commits).
// TODO(yuzefovich): fix the propagation of the lock spans with the
// leaf txns and remove this check.
```
nit: Could we please include the issue number in this TODO?
pkg/sql/distsql_running.go line 730 at r1 (raw file):
```go
	foundNonDefaultKeyLocking = foundNonDefaultKeyLocking ||
		proc.Spec.Core.InvertedJoiner.LockingStrength != descpb.ScanLockingStrength_FOR_NONE
}
```
nit: This is fine for now but it would be nicer if this were determined at optimization time and passed as a planFlag like planFlagContainsMutation.
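For readers unfamiliar with the pattern being suggested here: plan flags in the SQL layer are bits in an integer set that record properties discovered during planning. A rough sketch of how a hypothetical `planFlagContainsNonDefaultLocking` could work — only `planFlagContainsMutation` is a real name; everything else below is an assumption, not CockroachDB's actual code:

```go
package main

import "fmt"

// A toy version of the suggestion: record "plan contains non-default key
// locking" once at optimization time as a bit in a flag set, instead of
// re-walking every processor spec at execution time. Only
// planFlagContainsMutation is a real name; the rest is hypothetical.
type planFlags uint32

const (
	planFlagContainsMutation planFlags = 1 << iota
	planFlagContainsNonDefaultLocking // hypothetical
)

func (pf planFlags) IsSet(f planFlags) bool { return pf&f != 0 }
func (pf *planFlags) Set(f planFlags)       { *pf |= f }

func main() {
	var flags planFlags
	// Suppose the builder sees a SELECT FOR UPDATE while constructing the plan.
	flags.Set(planFlagContainsNonDefaultLocking)
	fmt.Println(flags.IsSet(planFlagContainsNonDefaultLocking)) // true
	fmt.Println(flags.IsSet(planFlagContainsMutation))          // false
}
```

Execution-time code could then consult the flag in O(1) rather than scanning the flow specs.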
Force-pushed from fab73ba to 9c2c4f2.
yuzefovich
left a comment
I implemented your suggestions. There are still some open questions, but I'll merge in order to get a nightly run, since we know this isn't the final solution anyway.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @mgartner, and @michae2)
pkg/sql/distsql_running.go line 683 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
This might be an ignorant question, but since we can have local parallelism using leaf txns for other reasons, why do we not see this problem in those cases? Why only with the streamer?
That's actually a very good question. We might well have the same problem there (#94400), so I adjusted this PR to also disable the parallelization of local scans with non-default key locking.
However, even after that I still saw the delayed lock cleanup, even though we now use the root txn (by not parallelizing the scan), so I didn't add a regression test for this.
pkg/sql/distsql_running.go line 712 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Could we please include the issue number in this TODO?
Done.
pkg/sql/distsql_running.go line 730 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: This is fine for now but it would be nicer if this were determined at optimization time and passed as a `planFlag` like `planFlagContainsMutation`.
Done.
Build succeeded.
The other complication that we would have run into with this approach is that only the root txn coordinator runs a heartbeat loop to keep its locks alive. If we only acquired locks in leaf txns, we would never have started the txn heartbeat, and the txn could have been aborted after 5s. We could have launched the heartbeat when merging the …

I think this again hints at our implementation of explicit row-level locking, and pushing that locking directly into scans (in all cases), being a little funky (see #75457 and #57031). If we had a separate …
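To make the heartbeat point concrete, here is a minimal model (my own sketch, not kvcoord's actual API) of the rule being described: the loop that keeps the txn record, and therefore its locks, alive is only started by the root coordinator when it sends a locking request.

```go
package main

import "fmt"

// startsHeartbeat models the behavior described above: the txn heartbeat loop
// is launched only by the root txn coordinator upon its first locking request.
// If locks were only ever acquired through leaf txns, no heartbeat would run,
// and the txn record could be considered expired (and aborted) after ~5s.
func startsHeartbeat(isRootTxn, sendsLockingRequest bool) bool {
	return isRootTxn && sendsLockingRequest
}

func main() {
	fmt.Println(startsHeartbeat(true, true))  // true: root txn acquiring locks
	fmt.Println(startsHeartbeat(false, true)) // false: leaf txn, no heartbeat
	fmt.Println(startsHeartbeat(true, false)) // false: read-only, nothing to keep alive
}
```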
Previously, as noted in cockroachdb#94290, it was possible for a LeafTxn to issue locking requests as part of SELECT FOR UPDATE. This behavior was unexpected, and the RootTxn wasn't properly cleaning up the locks, resulting in others waiting for those locks to be released. The issue was resolved in cockroachdb#94399 by ensuring that transactions with non-default locking strength don't use the streamer API and always run as RootTxn.

This patch adds an assertion on the kv side to prevent other existing or future attempts of a LeafTxn issuing locking requests. We don't expect that there are such existing cases, so we don't expect this assertion to fail, but we will keep an eye on the nightly tests to make sure.

Fixes: cockroachdb#97817

Release note: None
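The shape of such a kv-side assertion might look like the following sketch. All types and names here are mine for illustration, not the actual kvcoord interceptor code: reject any batch containing a locking request when the sender is a leaf txn.

```go
package main

import (
	"errors"
	"fmt"
)

// A sketch (not the real kvcoord code) of the assertion described above:
// a batch that contains any locking request is rejected if issued by a
// leaf txn, since the root would never learn about (or clean up) the locks.
type request struct {
	name      string
	isLocking bool // e.g. a scan with FOR UPDATE strength, or a write
}

var errLeafLocking = errors.New("LeafTxn is not allowed to issue locking requests")

func checkBatch(isLeaf bool, reqs []request) error {
	if !isLeaf {
		return nil // root txns may lock freely
	}
	for _, r := range reqs {
		if r.isLocking {
			return fmt.Errorf("%w: %s", errLeafLocking, r.name)
		}
	}
	return nil
}

func main() {
	batch := []request{{name: "Scan", isLocking: false}, {name: "ScanForUpdate", isLocking: true}}
	fmt.Println(checkBatch(false, batch)) // root txn: <nil>
	fmt.Println(checkBatch(true, batch))  // leaf txn: error naming the offending request
}
```

Returning an error (rather than crashing) matches the PR title, "kv: return error on locking request in LeafTxn".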
98741: ci: update bazel builder image r=rickystewart a=cockroach-teamcity

Release note: None
Epic: None

98878: backupccl: fix occasional TestRestoreErrorPropagates flake r=stevendanna a=adityamaru

Very rarely, under stress race, another automatic job would race with the restore and increment the error count. This would result in the count being greater than our expected value of 1. This disables all the automatic jobs, eliminating the chance of this race.

Fixes: #98037

Release note: None

99099: kvserver: deflake TestReplicaTombstone r=andrewbaptist a=tbg

Like many other tests, this test could flake because we'd sometimes catch a "cannot remove learner while snapshot is in flight" error. I think the root cause is that sometimes there are errant Raft snapshots in the system[^1] and these get mistaken for LEARNERs that are still being caught up by the replicate queue. I tried to address this general class of issues by making the check for in-flight learner snapshots not care about *raft* snapshots. I was able to stress TestReplicaTombstone for 30+ minutes without a failure using that approach, whereas previously it usually failed within a few minutes.

```
./dev test --stress pkg/kv/kvserver/ --filter TestReplicaTombstone 2>&1 | tee stress.log
[...]
2461 runs so far, 0 failures, over 35m45s
```

[^1]: #87553

Fixes #98883.

Epic: none
Release note: None

99126: kv: return error on locking request in LeafTxn r=nvanbenschoten a=miraradeva

Previously, as noted in #94290, it was possible for a LeafTxn to issue locking requests as part of SELECT FOR UPDATE. This behavior was unexpected, and the RootTxn wasn't properly cleaning up the locks, resulting in others waiting for those locks to be released. The issue was resolved in #94399 by ensuring that transactions with non-default locking strength don't use the streamer API and always run as RootTxn. This patch adds an assertion on the kv side to prevent other existing or future attempts of a LeafTxn issuing locking requests.

We don't expect that there are such existing cases, so we don't expect this assertion to fail, but we will keep an eye on the nightly tests to make sure.

Fixes: #97817

Release note: None

99150: backupccl: stop logging unsanitized backup stmt in schedule executor r=stevendanna a=msbutler

Informs #99145

Release note: None

Co-authored-by: cockroach-teamcity <teamcity@cockroachlabs.com>
Co-authored-by: adityamaru <adityamaru@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Mira Radeva <mira@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
This commit makes it so that we don't use the streamer API when running
fully-local plans if some processors in that flow require non-default
key locking. This change allows us to fix a regression with
`SELECT FOR UPDATE` where the locks acquired by that stmt would not be
properly cleaned up on the txn commit (because we currently don't propagate lock
spans for leaf txns, and the streamer requires us to use the leaf txns).
The initial attempt to fix this propagation exposed an issue with
multi-tenant setups, so for now we choose to simply not use the streamer
in such cases.
Additionally, this commit disables the parallelization of local scans
when non-default key locking strength is found. The parallelization of
local scans also requires us to use the leaf txns; it was introduced in
the 21.2 release, but we haven't had any user reports about it. Still, it
seems reasonable to update that code path with the recent learnings.
Addresses: #94290.
Addresses: #94400.
Epic: None
Release note (bug fix): Previously, CockroachDB could delay the release
of the locks acquired when evaluating SELECT FOR UPDATE statements in
some cases. This delay (up to 5s) could then block future readers.
The bug was introduced in 22.2.0, and the temporary workaround without
upgrading to a release with this fix is to set undocumented cluster
setting
`sql.distsql.use_streamer.enabled` to `false`.
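Putting the commit's two changes together, the decision amounts to something like the following sketch. The function and parameter names are assumptions for illustration; the real checks live in `pkg/sql/distsql_running.go`.

```go
package main

import "fmt"

// A simplified model of the gating this commit adds: the streamer (and
// parallelized local scans) force leaf txns, and leaf txns currently do not
// propagate lock spans back to the root, so any plan containing non-default
// key locking must stay on the root txn.
func canUseLeafTxnFeatures(streamerEnabled, fullyLocalPlan, hasNonDefaultKeyLocking bool) (useStreamer, parallelizeLocalScans bool) {
	if hasNonDefaultKeyLocking {
		// SELECT FOR UPDATE and friends: stay on the root txn so that lock
		// spans are tracked and the locks are released at commit.
		return false, false
	}
	return streamerEnabled && fullyLocalPlan, fullyLocalPlan
}

func main() {
	u, p := canUseLeafTxnFeatures(true, true, true)
	fmt.Println(u, p) // false false: FOR UPDATE falls back to the root txn
	u, p = canUseLeafTxnFeatures(true, true, false)
	fmt.Println(u, p) // true true: default locking may use leaf txn features
}
```

This also mirrors the workaround from the release note: setting `sql.distsql.use_streamer.enabled` to `false` forces the first return value to `false` for every plan.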