Skip to content

kvcoord: propagate lock spans in LeafTxnFinalState#94388

Closed
yuzefovich wants to merge 2 commits intocockroachdb:masterfrom
yuzefovich:lock-spans
Closed

kvcoord: propagate lock spans in LeafTxnFinalState#94388
yuzefovich wants to merge 2 commits intocockroachdb:masterfrom
yuzefovich:lock-spans

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented Dec 28, 2022

This commit makes it so that we propagate the lock footprint of the leaf txns in LeafTxnFinalState. This is needed in order to disable the read-only fast path when committing the txn that acquired locks while using the leaf txn (which can happen, for example, if we use the streamer API). Without this propagation, previously the locks would only be released after the txn expiration duration (5s).

Addresses: #94290.

Epic: None.

Release note (bug fix): Previously, CockroachDB could delay the release of the locks acquired when evaluating SELECT FOR UPDATE statements in some cases. This delay (up to 5s) could then block the future readers. The bug was introduced in 22.2.0, and the temporary workaround without upgrading to a release with this fix is to set undocumented cluster setting sql.distsql.use_streamer.enabled to false.

@yuzefovich yuzefovich requested review from knz and nvb December 28, 2022 20:24
@yuzefovich yuzefovich requested a review from a team as a code owner December 28, 2022 20:24
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice.

condensed := tp.lockFootprint.maybeCondense(ctx, tp.riGen, locksBudget)
if condensed && !alreadyCondensed {
if tp.condensedIntentsEveryN.ShouldLog() || log.ExpensiveLogEnabled(ctx, 2) {
var baDetails string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redact.StringBuilder, then use log.Warningf and include it with %s in the message.

// Copy mutable state so access is safe for the caller.
lockSpans := tp.lockFootprint.asSlice()
tfs.LockSpans = make([]roachpb.Span, len(lockSpans))
copy(tfs.LockSpans, lockSpans)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the LeafTxn gets reused/extended afterwards, so maybe you could avoid the copy?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If you know a copy is needed, maybe explain why in a comment)

@yuzefovich yuzefovich force-pushed the lock-spans branch 2 times, most recently from 9c8b83c to 5198f9e Compare December 28, 2022 22:08
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly-added test in select_for_update now fails on 3node-tenant config (before this change it wouldn't fail):

--- FAIL: TestLogic_tmp (2.04s)
    test_log_scope.go:161: test logs captured to: /Users/yuzefovich/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/967ce6772a6c5a2809661c88cfdfc502/logTestLogic_tmp362815130
    test_log_scope.go:79: use -show-logs to present logs inline
    logic.go:3198: 
        
        /private/var/tmp/_bazel_yahor/e311b1c2454b7c301a6b73bddff450e7/sandbox/darwin-sandbox/16694/execroot/com_github_cockroachdb_cockroach/bazel-out/darwin_arm64-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/_tmp:7: 
        expected success, but found
        (XXUUU) RangeIterator failed to seek to /Min: rpc error: code = Unauthenticated desc = kvtenant.Proxy does not have access to FirstRange
        range_iter.go:236: in Seek()
    logic.go:3990: 
        /private/var/tmp/_bazel_yahor/e311b1c2454b7c301a6b73bddff450e7/sandbox/darwin-sandbox/16694/execroot/com_github_cockroachdb_cockroach/bazel-out/darwin_arm64-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/_tmp:7: error while processing
    logic.go:3990: pq: RangeIterator failed to seek to /Min: rpc error: code = Unauthenticated desc = kvtenant.Proxy does not have access to FirstRange
    panic.go:522: -- test log scope end --
FAIL

Do you have an idea of what could be the issue?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go line 732 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

redact.StringBuilder, then use log.Warningf and include it with %s in the message.

Done.


pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go line 820 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

(If you know a copy is needed, maybe explain why in a comment)

I don't know whether it is needed - I simply copied what we have in txnSpanRefresher, and there we do make a copy. I was hoping that @nvanbenschoten would point out whether it is, in fact, needed here and the reason why. (The copying in the refresh spans was added in 51ade9f.)

Update: discussed on slack, will keep the copy out of caution.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly-added test in select_for_update now fails on 3node-tenant config (before this change it wouldn't fail):
Do you have an idea of what could be the issue?

Yes I believe it is a recent regression related to the work that @ecwall and myself are doing on the master branch.
My recommendation would be to:

  1. verify that your fix + this test succeed properly on v22.2 (i.e. this failure is specific to master)
  2. skip this test on master, with an issue filed for Evan to investigate further
  3. do the backport and ensure the test is not skipped on the 22.2 branch.

If the test still fails on 22.2, I'm out of ideas and I fear we won't fix this today.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

knz
knz previously approved these changes Dec 28, 2022
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo my previous comment.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@knz
Copy link
Copy Markdown
Contributor

knz commented Dec 28, 2022

Summary of what we encountered:

  • the test failures occurs on 22.2
  • we traced this to a RangeIter incorrectly attempting to scan from /Min in .maybeCondense()
  • which is because kvcoord.RangeIterator does not have its start key set to the start of the tenant logical keyspace

The call to .maybeCondense was bound to always fail in secondary tenants. I'm not sure why we didn't catch it earlier.

@knz knz dismissed their stale review December 28, 2022 22:38

Approach did not work.

@yuzefovich
Copy link
Copy Markdown
Member Author

I debugged this a bit further, and I no longer think maybeCondense is to blame. Here is the stack trace where we issue a lookup with nil key:

runtime/debug.Stack()
	GOROOT/src/runtime/debug/stack.go:24 +0x64
runtime/debug.PrintStack()
	GOROOT/src/runtime/debug/stack.go:16 +0x1c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangecache.(*RangeCache).lookupInternal(0x0?, {0x106573ef8, 0x14009c9ba40}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangecache/range_cache.go:716 +0x64
github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangecache.(*RangeCache).LookupWithEvictionToken(0x0?, {0x106573ef8?, 0x14009c9ba40?}, {0x0?, 0xffffffffffffffff?, 0x14004c6a378?}, {0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/rangecache/range_cache.go:641 +0x60
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).getRoutingInfo(0x106573ef8?, {0x106573ef8, 0x14009c9ba40}, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0, 0x0, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:638 +0x74
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*RangeIterator).Seek(0x1400c0cd040, {0x106573ef8, 0x14009c9ba40}, {0x0, 0x0, 0x0}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/range_iter.go:208 +0x3b8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0x1400a0a0500, {0x106573ef8, 0x14009c9ba40}, 0x14003752960, {{0x0, 0x0, 0x0}, {0x140088301a8, 0x1, 0x1}}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1234 +0x16c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0x1400a0a0500, {0x106573ef8, 0x14009c9ba10}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:861 +0x49c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(0x14009832538, {0x106573ef8, 0x14009c9ba10}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x188
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(0x14009832500, {0x106573ef8?, 0x14009c9ba10?}, 0x0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0xc4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(0x140098324d0, {0x106573ef8, 0x14009c9ba10}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:202 +0x270
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(0x140098323d0, {0x106573ef8, 0x14009c9ba10}, 0x14003752960, 0x5)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:225 +0x1a4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(0x140098323d0, {0x106573ef8, 0x14009c9ba10}, 0x14003752960?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:153 +0xac
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(0x140098322a0, {0x106573ef8, 0x14009c9ba10}, 0x0?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 +0xf4
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(0x14009832280?, {0x106573ef8?, 0x14009c9ba10?}, 0x14003752960?)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:104 +0x60
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(0x140098321d0, {0x106573ef8, 0x14009c9ba10}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:245 +0x358
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(0x14009832000, {0x106573ef8, 0x14009c9b9e0}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:526 +0x434
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(0x14002651710, {0x106573ef8, 0x14009c9b9e0}, 0x14003752960, {0x1333a30c8, 0x14009832000})
	github.com/cockroachdb/cockroach/pkg/kv/db.go:998 +0x98
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(0x14007956d10, {0x106573ef8, 0x14009c9b9e0}, 0x14003752960)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1057 +0x1f4
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).commit(0x14007956d10, {0x106573ef8, 0x14009c9b9e0})
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:683 +0x10c
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Commit(0x106573ef8?, {0x106573ef8?, 0x14009c9b9e0?})
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:698 +0x70
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).commitSQLTransactionInternal(0x14006813800, {0x106573ef8?, 0x1400834ea80?})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1065 +0x1d8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).commitSQLTransaction(0x14006813800, {0x106573ef8, 0x1400834ea80}, {0x10659d460, 0x14004e51450}, 0x1400c0ce6b8)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:958 +0xec
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).handleAutoCommit(0x14006813800, {0x106573ef8, 0x1400834ea80}, {0x10659d460, 0x14004e51450})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2289 +0xf8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState.func8({0x106573ef8?, 0x1400834ea80?})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:584 +0x194
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0x14006813800, {0x106573ef8?, 0x1400834e930?}, {{0x10659d460, 0x14004e51450}, {0x0, 0x0, 0x0}, {0x14006b98163, 0x2b}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:814 +0x1d90
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1({0x106573ef8?, 0x1400834e930?})
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:131 +0x8c
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling(0x106573ef8?, {0x106573ef8?, 0x1400834e930?}, {0x10659d460?, 0x14004e51450?}, 0x0?, 0x1093ec180?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2553 +0x27c
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0x14006813800, {0x106573ef8, 0x1400834e930}, {{0x10659d460, 0x14004e51450}, {0x0, 0x0, 0x0}, {0x14006b98163, 0x2b}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:130 +0x360
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1(0x14009959aa8, 0x14006813800, 0x140099597a0, 0x14009959858, 0x14009959898, 0x1400c0cf878, 0x1400c0cf868, 0x1400c0cf888)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1983 +0x2dc
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0x14006813800)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1988 +0xcbc
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0x14006813800, {0x106573e50, 0x1400516b900}, 0x101204c48?, 0x1400c539dc8?, 0x14007d9f660?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1906 +0x18c
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0x140036d9cc0?, {0x106573e50?, 0x1400516b900?}, {0x14006db4e00?}, 0x3?, 0x140008c63f0?)
	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:858 +0xb4
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:728 +0x270
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:639 +0x1e0

Not sure about other details yet.

@yuzefovich
Copy link
Copy Markdown
Member Author

So the problem is that when committing the txn we issue EndTxnRequest in which we don't set Key nor EndKey. Then, in DistSender.Send we construct rs which covers all requests in the BatchRequest, and we end up with something like rs = roachpb.RSpan{keys.RKeyMin, keys.RKeyMax} which encounters the error when Seeking to the start key since it's outside of the tenant's key space.

I tried some hacky things (like intersecting that rs with the tenant span), and they didn't work. I'm not sure what the correct solution here should be, so I'm going to put this approach on hold and will implement my original idea of simply not using the streamer when non-default key locking is present in order to include it into 22.2.2.

@knz
Copy link
Copy Markdown
Contributor

knz commented Jan 10, 2023

filed that last issue as #95003.

@yuzefovich yuzefovich requested a review from a team as a code owner March 7, 2023 23:56
@yuzefovich yuzefovich requested a review from mgartner March 7, 2023 23:56
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Mar 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

This commit makes it so that we propagate the lock footprint of the leaf
txns in `LeafTxnFinalState`. This is needed in order to disable the
read-only fast path when committing the txn that acquired locks while
using the leaf txn (which can happen, for example, if we use the
streamer API). Without this propagation, previously the locks would only
be released after the txn expiration duration (5s).

Release note (bug fix): Previously, CockroachDB could delay the release
of the locks acquired when evaluating SELECT FOR UPDATE statements in
some cases. This delay (up to 5s) could then block the future readers.
The bug was introduced in 22.2.0, and the temporary workaround without
upgrading to a release with this fix is to set undocumented cluster
setting `sql.distsql.use_streamer.enabled` to `false`.
…ocking"

This reverts commit 9c2c4f2 while
keeping the regression test.

Release note: None
@yuzefovich
Copy link
Copy Markdown
Member Author

The approach taken by this PR is flawed (see this comment from Nathan), so I'm going to close this since I don't have any intention on working on this. However, #94290 is still present, so the underlying problem is still being tracked.

@yuzefovich yuzefovich closed this Mar 21, 2023
@yuzefovich yuzefovich deleted the lock-spans branch March 21, 2023 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

X-noremind Bots won't notify about PRs with X-noremind

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants