
kvserver/concurrency: apply goroutine labels before waiting for push #71622

Merged
craig[bot] merged 1 commit into cockroachdb:master from dt:label-before-txn-wait on Dec 7, 2021

Conversation

@dt
Contributor

@dt dt commented Oct 15, 2021

Often I find a stack that ends in a goroutine waiting on a push, but I can't correlate that back
to which txns are involved. Ideally we'd pprof-label every single request we send, but the allocation
overhead of constructing the labels probably makes that a bad idea. Instead, maybe it is okay to just
label right before going into the waiting step of a push.

With this change, /debug/pprof/goroutine?debug=1 shows something like this:

# labels: {"pushee":"d507450c-18bc-45c7-9ee5-3cb3c6248fe7", "pusher":"65690ed6-9dcf-46b7-a975-e1635ab9947e"}
#	0x54a2b4e	pkg/kv/kvserver/txnwait.(*Queue).MaybeWaitForPush+0xf2e			pkg/kv/kvserver/txnwait/queue.go:513
#	0x54d5c32	pkg/kv/kvserver/concurrency.(*managerImpl).maybeInterceptReq.func1+0x52	pkg/kv/kvserver/concurrency/concurrency_manager.go:335
#	0x489fb42	runtime/pprof.Do+0xa2												/usr/local/opt/go/libexec/src/runtime/pprof/runtime.go:40
#	0x54d5afd	pkg/kv/kvserver/concurrency.(*managerImpl).maybeInterceptReq+0x47d		pkg/kv/kvserver/concurrency/concurrency_manager.go:334
#	0x54d4e35	pkg/kv/kvserver/concurrency.(*managerImpl).sequenceReqWithGuard+0x175		pkg/kv/kvserver/concurrency/concurrency_manager.go:237
#	0x54d4ba8	pkg/kv/kvserver/concurrency.(*managerImpl).SequenceReq+0x2c8			pkg/kv/kvserver/concurrency/concurrency_manager.go:211
#	0x5618a90	pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries+0x310		pkg/kv/kvserver/replica_send.go:409
#	0x5617ad6	pkg/kv/kvserver.(*Replica).sendWithRangeID+0x336				pkg/kv/kvserver/replica_send.go:159
#	0x56628a4	pkg/kv/kvserver.(*Replica).Send+0x7c4						pkg/kv/kvserver/replica_send.go:95
#	0x566280b	pkg/kv/kvserver.(*Store).Send+0x72b						pkg/kv/kvserver/store_send.go:190
#	0x566ec67	pkg/kv/kvserver.(*Stores).Send+0x1e7						pkg/kv/kvserver/stores.go:191
#	0x6e5cb0a	pkg/server.(*Node).batchInternal.func1+0x26a					pkg/server/node.go:928
#	0x4b3e25c	pkg/util/stop.(*Stopper).RunTaskWithErr+0xdc					pkg/util/stop/stopper.go:347
#	0x6e5c828	pkg/server.(*Node).batchInternal+0x148						pkg/server/node.go:916
#	0x6e5cee4	pkg/server.(*Node).Batch+0x104							pkg/server/node.go:962
#	0x5215573	pkg/rpc.internalClientAdapter.Batch+0x53					pkg/rpc/context.go:482
#	0x5381d9d	pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch+0x15d			pkg/kv/kvclient/kvcoord/transport.go:203
#	0x5381bd1	pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext+0x131			pkg/kv/kvclient/kvcoord/transport.go:185
#	0x5379b6d	pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas+0x11ed			pkg/kv/kvclient/kvcoord/dist_sender.go:1960
#	0x5376504	pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch+0xd84			pkg/kv/kvclient/kvcoord/dist_sender.go:1552
#	0x5373696	pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges+0x416		pkg/kv/kvclient/kvcoord/dist_sender.go:1170
#	0x53714f3	pkg/kv/kvclient/kvcoord.(*DistSender).Send+0x6d3				pkg/kv/kvclient/kvcoord/dist_sender.go:804
#	0x4f3b817	pkg/kv.(*CrossRangeTxnWrapperSender).Send+0xd7					pkg/kv/db.go:219
#	0x4f3f754	pkg/kv.(*DB).sendUsingSender+0x154						pkg/kv/db.go:858
#	0x4f4dab0	pkg/kv.(*DB).send+0x90								pkg/kv/db.go:841
#	0x4f3f118	pkg/kv.sendAndFill+0x118							pkg/kv/db.go:769
#	0x4f3f293	pkg/kv.(*DB).Run+0x73								pkg/kv/db.go:792
#	0x54a81dd	pkg/kv/kvserver/intentresolver.(*IntentResolver).MaybePushTransactions+0x6dd	pkg/kv/kvserver/intentresolver/intent_resolver.go:389
#	0x54a79c4	pkg/kv/kvserver/intentresolver.(*IntentResolver).PushTransaction+0x144		pkg/kv/kvserver/intentresolver/intent_resolver.go:297
#	0x54e628d	pkg/kv/kvserver/concurrency.(*lockTableWaiterImpl).pushLockTxn+0x50d		pkg/kv/kvserver/concurrency/lock_table_waiter.go:519

Release note: none.

@dt dt requested a review from nvb October 15, 2021 19:46
@dt dt requested a review from a team as a code owner October 15, 2021 19:46
@cockroach-teamcity
Member

This change is Reviewable

@dt
Contributor Author

dt commented Oct 15, 2021

@nvanbenschoten this was a random Friday musing, based on the experience of debugging a "stuck" job: we saw in the stack trace that one of the job's goroutines had been in the select in MaybeWaitForPush for thousands of minutes, so we knew it was stuck on a transaction, but we didn't know which one. I have no idea if this is the right way to solve that, just playing around with it.

Collaborator

@petermattis petermattis left a comment


This is really neat!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @nvanbenschoten)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 334 at r1 (raw file):

		var err *roachpb.Error
		labels := pprof.Labels("pushee", t.PusheeTxn.ID.String(), "pusher", t.PusherTxn.ID.String())
		pprof.Do(ctx, labels, func(ctx context.Context) {

I wonder how much faster it would be to use pprof.SetGoroutineLabels. Doing so would avoid the closure allocation. Look at the source of pprof.Do: it is a pretty small convenience function, and we're performance sensitive.

Setting the profile labels for a goroutine compiles down into a function call which does getg().labels = labels. So I think the primary overhead here is in the pprof.Labels call and the pprof.WithLabels call. Unfortunate allocations, but if used rarely on paths which are likely to be slow I think it makes a lot of sense.

Contributor

@nvb nvb left a comment


This is cool! It's a good idea for any cold code paths where goroutines can get stuck waiting for long periods of time.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @nvanbenschoten, and @petermattis)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 333 at r1 (raw file):

		var resp *roachpb.PushTxnResponse
		var err *roachpb.Error
		labels := pprof.Labels("pushee", t.PusheeTxn.ID.String(), "pusher", t.PusherTxn.ID.String())

Let's push this into MaybeWaitForPush. We can extract an inner Queue.waitForPush method that is called in the !ShouldPushImmediately case.

Then we can also handle the case where PusherTxn is empty better without cluttering this code, by giving it a label like "pusher": "non-txn".
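The non-txn case could look something like this minimal sketch (`pusherLabelValue` is a hypothetical helper name, not code from the PR):

```go
package main

import "fmt"

// pusherLabelValue returns the label value for the pusher: the transaction
// ID when there is one, and "non-txn" for non-transactional requests.
func pusherLabelValue(pusherTxnID string) string {
	if pusherTxnID == "" {
		return "non-txn"
	}
	return pusherTxnID
}

func main() {
	fmt.Println(pusherLabelValue(""))         // prints "non-txn"
	fmt.Println(pusherLabelValue("65690ed6")) // prints "65690ed6"
}
```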


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 334 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder how much faster it would be to use pprof.SetGoroutineLabels. Doing so would avoid the closure allocation. Look at the source of pprof.Do: it is a pretty small convenience function, and we're performance sensitive.

Setting the profile labels for a goroutine compiles down into a function call which does getg().labels = labels. So I think the primary overhead here is in the pprof.Labels call and the pprof.WithLabels call. Unfortunate allocations, but if used rarely on paths which are likely to be slow I think it makes a lot of sense.

This is a cold enough code path that I wouldn't be too worried about an extra allocation or two. With lock wait-queues, requests only end up here if they have been waiting for more than 50ms on a conflicting transaction.

@tbg
Member

tbg commented Oct 19, 2021

Nice! I had no idea labels were now included in the ?debug=1 output. Feels like this wasn't always true.

@dt
Contributor Author

dt commented Oct 19, 2021

They are unfortunately only in the debug=1 output, not the debug=2 output that we usually look at, collect in debug.zips, etc., but I figure once I find my stuck stack in the =2 output, I can switch over to figure out who it is stuck on.

@petermattis
Collaborator

Another useful goroutine to tag in this way would be the pgwire SQL execution goroutines. If my memory serves, those goroutines are long-lived. We could tag them with the session ID.

@tbg
Member

tbg commented Oct 19, 2021

They are unfortunately only in the debug=1 output, not the debug=2 output that we usually look at, collect in debug.zips, etc., but I figure once I find my stuck stack in the =2 output, I can switch over to figure out who it is stuck on.

Does debug.zip contain ?debug=1? If not, this is a good reason to add it.

@knz
Contributor

knz commented Oct 27, 2021

Another useful goroutine to tag in this way would be the pgwire SQL execution goroutines. If my memory serves, those goroutines are long-lived. We could tag them with the session ID.

david can you file an issue for that (unless you're picking it up in this pr) and assign to the server team, thanks

@dt dt force-pushed the label-before-txn-wait branch from 1117ccd to 2d0e95c Compare December 7, 2021 01:46
Contributor Author

@dt dt left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 333 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's push this into MaybeWaitForPush. We can extract an inner Queue.waitForPush method that is called in the !ShouldPushImmediately case.

Then we can also handle the case where PusherTxn is empty better without cluttering this code, by giving it a label like "pusher": "non-txn".

Done: I pulled the inner helper out a little further down, starting after the existing log message, so I could reuse the existing pusherStr string that already handles that case as suggested.

@dt dt requested a review from nvb December 7, 2021 18:57
Contributor

@nvb nvb left a comment


:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @petermattis)

@dt
Contributor Author

dt commented Dec 7, 2021

TFTRs!

bors r+

@craig
Contributor

craig bot commented Dec 7, 2021

Build succeeded.

@craig craig bot merged commit 65094ec into cockroachdb:master Dec 7, 2021
@dt dt deleted the label-before-txn-wait branch December 21, 2021 04:45
@dt
Contributor Author

dt commented Feb 14, 2022

blathers backport 21.2
