storage: Added txn wait queue related metrics by darinpp · Pull Request #35921 · cockroachdb/cockroach

darinpp · 2019-03-18T23:43:11Z

Resolves #31106

The new metrics are:

current count of pushees, pushers and queries
the total number of instances where a pusher had to wait more than 1 min on the queue
histograms for the time spent waiting on the queue for pushers and queries
total number of deadlocks

Here is a sample chart for the current count of pushees, pushers, queries as well as the total number of cases where a pusher waited more than 1 min

Here is a sample chart of the cumulative count of the deadlocks expressed as rate

Here is a sample chart of the pusher wait time breakdown

and a sample chart of the queue wait time breakdown

cockroach-teamcity · 2019-03-18T23:43:18Z

This change is

CLAassistant · 2019-03-18T23:43:18Z

All committers have signed the CLA.

darinpp · 2019-03-18T23:57:26Z

I haven't added any tests. Not sure how to test these. Suggestions are welcome.

andy-kimball

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/txnqueue.go, line 473 at r1 (raw file):

	excessiveWait := time.Minute
	excessiveWaitTimer := time.NewTimer(excessiveWait)

Is there any concern that we're spinning up a goroutine each time we get to this point in the code? Is this on a critical perf path?

tbg

I haven't added any tests. Not sure how to test these. Suggestions are welcome.

I would start by instrumenting the existing tests so that they verify the metrics at each step in the test. Additionally, there are certain invariants that should be true at the beginning end of all tests when a queue is empty. (i.e. all gauges should be zero, etc).

BTW, the existing tests are in pkg/storage/txn_wait_queue_test.go for historical reasons (tech debt).

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/metrics.go, line 1 at r1 (raw file):

// Copyright 2017 The Cockroach Authors.

2019 (the current year is used for new code)

pkg/storage/txnwait/metrics.go, line 37 at r1 (raw file):

		Pushees: metric.NewGauge(
			metric.Metadata{
				Name:        "txn_wait_queue.pushees",

We're wildly inconsistent in our naming of metrics, but I'd say

txnwaitqueue throughout
txnwaitqueue.{pushee,pusher,query}.waiting

pkg/storage/txnwait/metrics.go, line 64 at r1 (raw file):

		PusherExcessiveWaitTotal: metric.NewCounter(
			metric.Metadata{
				Name:        "txn_wait_queue.pusher.excessive_wait_total",

Taking

root@127.0.0.1:54192/defaultdb> select name from crdb_internal.node_metrics WHERE name LIKE '%slow%';
         name
+---------------------+
  requests.slow.lease
  requests.slow.latch
  requests.slow.raft
(3 rows)

as a precedent, make this txnwaitqueue.pusher.slow

Also, this should be a gauge. In normal operation, it's always zero. It's nonzero exactly while a request is slow. (This is how all of the other ones work).

pkg/storage/txnwait/metrics.go, line 74 at r1 (raw file):

Histogram of durations spent in queue by pushers.

pkg/storage/txnwait/metrics.go, line 86 at r1 (raw file):

			metric.Metadata{
				Name:        "txn_wait_queue.query.wait_time",
				Help:        "The breakdown of the query's wait time when they are removed from the queue",

Ditto.

pkg/storage/txnwait/metrics.go, line 98 at r1 (raw file):

Number of deadlocks detected by the txn wait queue.

pkg/storage/txnwait/txnqueue.go, line 473 at r1 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Is there any concern that we're spinning up a goroutine each time we get to this point in the code? Is this on a critical perf path?

Please use the same pattern from

cockroach/pkg/storage/replica_write.go

Lines 194 to 223 in 38e7961

    
           slowTimer := timeutil.NewTimer() 
        
           defer slowTimer.Stop() 
        
           slowTimer.Reset(base.SlowRequestThreshold) 
        
           tBegin := timeutil.Now() 
        
           for { 
        
           	select { 
        
           	case propResult := <-ch: 
        
           		// Semi-synchronously process any intents that need resolving here in 
        
           		// order to apply back pressure on the client which generated them. The 
        
           		// resolution is semi-synchronous in that there is a limited number of 
        
           		// outstanding asynchronous resolution tasks allowed after which 
        
           		// further calls will block. 
        
           		if len(propResult.Intents) > 0 { 
        
           			// TODO(peter): Re-proposed and canceled (but executed) commands can 
        
           			// both leave intents to GC that don't hit this code path. No good 
        
           			// solution presents itself at the moment and such intents will be 
        
           			// resolved on reads. 
        
           			if err := r.store.intentResolver.CleanupIntentsAsync(ctx, propResult.Intents, true /* allowSync */); err != nil { 
        
           				log.Warning(ctx, err) 
        
           			} 
        
           		} 
        
           		if len(propResult.EndTxns) > 0 { 
        
           			if err := r.store.intentResolver.CleanupTxnIntentsAsync(ctx, r.RangeID, propResult.EndTxns, true /* allowSync */); err != nil { 
        
           				log.Warning(ctx, err) 
        
           			} 
        
           		} 
        
           		return propResult.Reply, propResult.Err, propResult.ProposalRetry 
        
           	case <-slowTimer.C: 
        
           		slowTimer.Read = true

which is allocation optimized (and prevents deadlocks).

@andy-kimball if you enter the queue, a lot more work is already being done than this txn, so it shouldn't be on the critical perf path.

pkg/storage/txnwait/txnqueue.go, line 481 at r1 (raw file):

	// queue.
	pusheeTxnTimer.Reset(0)
	startWaiting := time.Now()

Instead of calling RecordValue for this in many different places, defer a function here that does it.

pkg/storage/txnwait/txnqueue.go, line 485 at r1 (raw file):

		select {
		case <- excessiveWaitTimer.C:
			log.Warningf(ctx, "pusher request longer than %s", excessiveWait)

Please also use the same have been waiting kind of wording and print the pusher and pushee. You also need to defer a decrement to the gauge. Basically just do exactly what that other code I refer to above does.

pkg/storage/txnwait/txnqueue.go, line 694 at r1 (raw file):

	select {
	case <-ctx.Done():
		metrics.QueryWaitTime.RecordValue(time.Since(startWaiting).Nanoseconds())

Ditto about defering this once instead.

bdarnell

Reviewed 2 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, and @nvanbenschoten)

pkg/storage/txnwait/metrics.go, line 56 at r1 (raw file):

			metric.Metadata{
				Name:        "txn_wait_queue.queries",
				Help:        "Number of queries on the txn wait queue",

Can we say more about what a "query" is here? "Pushee" and "pusher" make sense but without going back to the code I can't remember what a "query" is.

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/metrics.go, line 1 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

2019 (the current year is used for new code)

Done.

pkg/storage/txnwait/metrics.go, line 37 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We're wildly inconsistent in our naming of metrics, but I'd say

txnwaitqueue throughout

txnwaitqueue.{pushee,pusher,query}.waiting

Done.

pkg/storage/txnwait/metrics.go, line 56 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can we say more about what a "query" is here? "Pushee" and "pusher" make sense but without going back to the code I can't remember what a "query" is.

added some clarification based on my understanding.

pkg/storage/txnwait/metrics.go, line 64 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Taking
root@127.0.0.1:54192/defaultdb> select name from crdb_internal.node_metrics WHERE name LIKE '%slow%';
         name
+---------------------+
  requests.slow.lease
  requests.slow.latch
  requests.slow.raft
(3 rows)
as a precedent, make this txnwaitqueue.pusher.slow

Also, this should be a gauge. In normal operation, it's always zero. It's nonzero exactly while a request is slow. (This is how all of the other ones work).

Done.

pkg/storage/txnwait/metrics.go, line 74 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Histogram of durations spent in queue by pushers.

Done.

pkg/storage/txnwait/metrics.go, line 86 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto.

Done.

pkg/storage/txnwait/metrics.go, line 98 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Number of deadlocks detected by the txn wait queue.

Done.

pkg/storage/txnwait/txnqueue.go, line 473 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Please use the same pattern from

cockroach/pkg/storage/replica_write.go

Lines 194 to 223 in 38e7961

slowTimer := timeutil.NewTimer()

defer slowTimer.Stop()

slowTimer.Reset(base.SlowRequestThreshold)

tBegin := timeutil.Now()

for {

select {

case propResult := <-ch:

// Semi-synchronously process any intents that need resolving here in

// order to apply back pressure on the client which generated them. The

// resolution is semi-synchronous in that there is a limited number of

// outstanding asynchronous resolution tasks allowed after which

// further calls will block.

if len(propResult.Intents) > 0 {

// TODO(peter): Re-proposed and canceled (but executed) commands can

// both leave intents to GC that don't hit this code path. No good

// solution presents itself at the moment and such intents will be

// resolved on reads.

if err := r.store.intentResolver.CleanupIntentsAsync(ctx, propResult.Intents, true /* allowSync */); err != nil {

log.Warning(ctx, err)

}

}

if len(propResult.EndTxns) > 0 {

if err := r.store.intentResolver.CleanupTxnIntentsAsync(ctx, r.RangeID, propResult.EndTxns, true /* allowSync */); err != nil {

log.Warning(ctx, err)

}

}

return propResult.Reply, propResult.Err, propResult.ProposalRetry

case <-slowTimer.C:

slowTimer.Read = true

which is allocation optimized (and prevents deadlocks).

@andy-kimball if you enter the queue, a lot more work is already being done than this txn, so it shouldn't be on the critical perf path.

Done.

pkg/storage/txnwait/txnqueue.go, line 481 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Instead of calling RecordValue for this in many different places, defer a function here that does it.

Done.

pkg/storage/txnwait/txnqueue.go, line 485 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Please also use the same have been waiting kind of wording and print the pusher and pushee. You also need to defer a decrement to the gauge. Basically just do exactly what that other code I refer to above does.

Done.

pkg/storage/txnwait/txnqueue.go, line 694 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto about defering this once instead.

Done.

tbg

Sorry it took me so long to get back at this. Still mostly commenting on style, but looks like this is getting there. I'll want to give all of the metrics operations another good look to see if I can spot any way in which they could diverge.

Reviewed 5 of 5 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/store.go, line 4357 at r2 (raw file):

txnwait

(not txnqueue)

pkg/storage/txn_wait_queue_test.go, line 58 at r2 (raw file):

func assertAllGaugesAreZero(tc testContext) {
	testutils.SucceedsSoon(tc.TB, func() error {
		if !tc.store.GetTxnWaitMetrics().HasAllGaugesZero() {

HasAllGaugesZero should live in this file.

pkg/storage/txn_wait_queue_test.go, line 68 at r2 (raw file):

	testutils.SucceedsSoon(tc.TB, func() error {
		v := tc.store.GetTxnWaitMetrics().PusherWaiting.Value()
		if v != expectedCount {

You can use assert.Equal(t, expectedCount, v, "unexpected number of pushers") (here and everywhere else). Not mandatory.

pkg/storage/txn_wait_queue_test.go, line 131 at r2 (raw file):

		t.Errorf("expected push txn queue is enabled")
	}
	assertAllGaugesAreZero(tc)

It doesn't seem that this needs an internal SucceedsSoon.

pkg/storage/txn_wait_queue_test.go, line 137 at r2 (raw file):

		t.Fatalf("expected pendingTxn to be in txns map after enqueue")
	}
	assertPusheeCount(tc, 1)

Ditto

pkg/storage/txn_wait_queue_test.go, line 157 at r2 (raw file):

			return errors.Errorf("expected GetDependents %+v; got %+v", expDeps, deps)
		}
		assertPusheeCount(tc, 1)

Neither do these two, because it's in a SucceedsSoon already (but you need to return an error from assertPusherCount)

I think you would be even better off without all these helpers and just writing m := tc.store.TxnWaitMetrics() and then in the code

assert.Equal(t, exp, m.PusherWaiting.Count())

or if you're in a SucceedsSoon

if act := m.PusherWaiting.Count(); act != exp {
  return errors.Errorf("%d pushers, but want %d", act, exp)
}

pkg/storage/txn_wait_queue_test.go, line 221 at r2 (raw file):

	assertAllGaugesAreZero(tc)
	q.Enqueue(txn)
	assertPusheeCount(tc, 1)

this would then read assert.Equal(t, 1, m.PusheeCount.Value())

pkg/storage/txnwait/metrics.go, line 49 at r2 (raw file):

				Name:        "txnwaitqueue.pushee.waiting",
				Help:        "Number of pushees on the txn wait queue",
				Measurement: "PusheeWaiting",

Is that what measurement strings look like? I think this is what gets shown on the graph axis. Maybe "Waiting Pushees"? Ditto for the other metrics.

pkg/storage/txnwait/txnqueue.go, line 485 at r1 (raw file):

Previously, darinpp wrote…

Done.

nit: pusher %s: have been waiting ... rolls off the tongue better. I'd also unify the two cases (using the empty uuid when no pusher) simply because we'll easily see what's going on from the log message and also in practice there will always be a txn (we use nontransactional reqs for barely anything).

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/store.go, line 4357 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

txnwait

(not txnqueue)

renamed.

pkg/storage/txn_wait_queue_test.go, line 58 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

HasAllGaugesZero should live in this file.

moved.

pkg/storage/txn_wait_queue_test.go, line 68 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You can use assert.Equal(t, expectedCount, v, "unexpected number of pushers") (here and everywhere else). Not mandatory.

I changed it.

pkg/storage/txn_wait_queue_test.go, line 131 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It doesn't seem that this needs an internal SucceedsSoon.

I agree that it doesn't need. I changed to be explicit. But I do think that this is worse than the way it was before. What I want to test is that gauges are set in a certain way at the point of the test. Whether they are being set synchronously in a direct call or in a go routine or the modifications are deferred - isn't important. A switch from one to the other shouldn't require me to modify tests. As is right now - it will require.

pkg/storage/txn_wait_queue_test.go, line 137 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

fixed too.

pkg/storage/txn_wait_queue_test.go, line 157 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Neither do these two, because it's in a SucceedsSoon already (but you need to return an error from assertPusherCount)

I think you would be even better off without all these helpers and just writing m := tc.store.TxnWaitMetrics() and then in the code
assert.Equal(t, exp, m.PusherWaiting.Count())
or if you're in a SucceedsSoon
if act := m.PusherWaiting.Count(); act != exp {
  return errors.Errorf("%d pushers, but want %d", act, exp)
}

changed too.

pkg/storage/txn_wait_queue_test.go, line 221 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this would then read assert.Equal(t, 1, m.PusheeCount.Value())

changed.

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/metrics.go, line 49 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is that what measurement strings look like? I think this is what gets shown on the graph axis. Maybe "Waiting Pushees"? Ditto for the other metrics.

Good catch. I didn't see that this is used as labels on the graphs. It makes more sense that way you suggest.

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/txnqueue.go, line 485 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: pusher %s: have been waiting ... rolls off the tongue better. I'd also unify the two cases (using the empty uuid when no pusher) simply because we'll easily see what's going on from the log message and also in practice there will always be a txn (we use nontransactional reqs for barely anything).

Modified the way you are suggesting. Agreed that it is better.

tbg

You're going to want to add a category to the release note.

Something like

Release note (admin ui change): Added txn wait queue related metrics for debugging (available only via the custom chart page).

Reviewed 5 of 5 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, and @nvanbenschoten)

pkg/storage/txn_wait_queue_test.go, line 131 at r2 (raw file):

Previously, darinpp wrote…

I agree that it doesn't need. I changed to be explicit. But I do think that this is worse than the way it was before. What I want to test is that gauges are set in a certain way at the point of the test. Whether they are being set synchronously in a direct call or in a go routine or the modifications are deferred - isn't important. A switch from one to the other shouldn't require me to modify tests. As is right now - it will require.

By that same reasoning, should everything we assert through all of the unit tests in the code base tolerate asynchronicity? I think it is better to assume no asynchronicity by default and to adapt only when it is found to be necessary. We should also generally avoid SucceedsSoon and by that token, not use it unless actually necessary.

pkg/storage/txn_wait_queue_test.go, line 57 at r3 (raw file):

}

func hasAllGaugesZero(tc testContext) bool {

When this test fails, it'll be unclear which gauge exactly triggered it. I'd do this as

func hasAllGaugesZero(m *txnwait.Metrics) error {
   if act := m.PusheeWaiting.Value(); act != 0 { return errors.Errorf("PusheeWaiting=%d", act) }
  ...
   return nil
}

All of this is unfortunately a bit verbose, and can be done better, but let's keep it basic for now.

pkg/storage/txnwait/metrics.go, line 56 at r1 (raw file):
QueryTxn loads the pusher's own transaction (and I guess the pushee's too) on some timer to make sure they eventually get an up to date view. It also returns the txns waiting on the query-ee which I guess is used for deadlock detection. The txnwaitqueue also lets QueryTxn wait until the txn record updates relative to what the querier provided. I can't say that I understand it any better than that. In light of that maybe

Number of transaction status queries waiting for an updated transaction record

pkg/storage/txnwait/metrics.go, line 77 at r3 (raw file):

			metric.Metadata{
				Name:        "txnwaitqueue.pusher.wait_time",
				Help:        "Histogram of durations spent in queue by pushers.",

Figure out what the convention is (trailing dot or not?) and use the same throughout. I think it's no trailing dot based on what I see in storage/metrics.go.

pkg/storage/txnwait/txnqueue.go, line 501 at r3 (raw file):

				timeutil.Since(tBegin).Seconds(),
				req.PusheeTxn.ID.Short(),
			)

In other places, we print a corresponding message that gives the actual wait time. I'd do something similar here by defer'ing a closure that prints the time again when it fires.

See

cockroach/pkg/storage/replica_write.go

Lines 216 to 225 in cf449c6

    
           defer func() { 
        
           	r.store.metrics.SlowRaftRequests.Dec(1) 
        
           	log.Infof( 
        
           		ctx, 
        
           		"slow command %s finished after %.2fs with error %v", 
        
           		ba, 
        
           		timeutil.Since(tBegin).Seconds(), 
        
           		pErr, 
        
           	) 
        
           }()

darinpp

updated the release note category.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @andy-kimball, @bdarnell, @nvanbenschoten, and @tbg)

pkg/storage/txn_wait_queue_test.go, line 131 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

By that same reasoning, should everything we assert through all of the unit tests in the code base tolerate asynchronicity? I think it is better to assume no asynchronicity by default and to adapt only when it is found to be necessary. We should also generally avoid SucceedsSoon and by that token, not use it unless actually necessary.

No, I don't think everything should be tested in a way that tolerates asynchronicity. I could imagine many cases where deferring the result will be a bug so the test shouldn't allow for immediate failure and success after a wait and retry. However for the metrics case above this issue doesn't exist and all metrics reporting should ideally be deferred to not slow down the main code.

pkg/storage/txn_wait_queue_test.go, line 57 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

When this test fails, it'll be unclear which gauge exactly triggered it. I'd do this as
func hasAllGaugesZero(m *txnwait.Metrics) error {
   if act := m.PusheeWaiting.Value(); act != 0 { return errors.Errorf("PusheeWaiting=%d", act) }
  ...
   return nil
}
All of this is unfortunately a bit verbose, and can be done better, but let's keep it basic for now.

Refactored it. I agree that this is better.

pkg/storage/txnwait/metrics.go, line 56 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

QueryTxn loads the pusher's own transaction (and I guess the pushee's too) on some timer to make sure they eventually get an up to date view. It also returns the txns waiting on the query-ee which I guess is used for deadlock detection. The txnwaitqueue also lets QueryTxn wait until the txn record updates relative to what the querier provided. I can't say that I understand it any better than that. In light of that maybe

Number of transaction status queries waiting for an updated transaction record

Changed.

pkg/storage/txnwait/metrics.go, line 77 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Figure out what the convention is (trailing dot or not?) and use the same throughout. I think it's no trailing dot based on what I see in storage/metrics.go.

No dot seems to be the default. I fixed the few offenders.

pkg/storage/txnwait/txnqueue.go, line 501 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In other places, we print a corresponding message that gives the actual wait time. I'd do something similar here by defer'ing a closure that prints the time again when it fires.

See

cockroach/pkg/storage/replica_write.go

Lines 216 to 225 in cf449c6

defer func() {

r.store.metrics.SlowRaftRequests.Dec(1)

log.Infof(

ctx,

"slow command %s finished after %.2fs with error %v",

ba,

timeutil.Since(tBegin).Seconds(),

pErr,

)

}()

Changed.

tbg

Thanks. Almost ready!

Reviewed 3 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @andy-kimball, @bdarnell, @darinpp, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/txnqueue.go, line 500 at r4 (raw file):

			defer func() {
				metrics.PusherSlow.Dec(1)
				log.Warningf(ctx, "pusher %s finished waiting after %.2fs for %s pushee",

This message is in addition to the one after one minute. Also please update the wording for both to

pusher %s: .... for pushee %s.

Resolves cockroachdb#31106 The new metrics are: - current count of pushees, pushers and queries - the total number of instances where a pusher had to wait more than 1 min on the queue - histograms for the time spent waiting on the queue for pushers and queries - total number of deadlocks Release note (admin ui change): Added txn wait queue related metrics for debugging (available only via the custom chart page)

darinpp

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @andy-kimball, @bdarnell, @nvanbenschoten, and @tbg)

pkg/storage/txnwait/txnqueue.go, line 500 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This message is in addition to the one after one minute. Also please update the wording for both to

pusher %s: .... for pushee %s.

Ah..makes sense to log both. Updated.

tbg

Reviewed 1 of 1 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @andy-kimball, @bdarnell, and @nvanbenschoten)

tbg · 2019-03-26T09:12:27Z

bors r+

Thanks @darinpp! After this has merged, could you backport it to release-19.1? You can do so by running backport 35921 after having installed backport via go get github.com/benesch/backport.

35921: storage: Added txn wait queue related metrics r=tbg a=darinpp Resolves #31106 The new metrics are: - current count of pushees, pushers and queries - the total number of instances where a pusher had to wait more than 1 min on the queue - histograms for the time spent waiting on the queue for pushers and queries - total number of deadlocks Here is a sample chart for the current count of pushees, pushers, queries as well as the total number of cases where a pusher waited more than 1 min ![image](https://user-images.githubusercontent.com/7052693/54570571-3c7c0f80-499c-11e9-938e-e58cda791e17.png) Here is a sample chart of the cumulative count of the deadlocks expressed as rate ![image](https://user-images.githubusercontent.com/7052693/54570649-895fe600-499c-11e9-9e56-c24a3dad52b6.png) Here is a sample chart of the pusher wait time breakdown ![image](https://user-images.githubusercontent.com/7052693/54570704-bb714800-499c-11e9-9a8f-7da56f697997.png) and a sample chart of the queue wait time breakdown ![image](https://user-images.githubusercontent.com/7052693/54570731-d643bc80-499c-11e9-8fc5-3ed6f39344ea.png) Co-authored-by: Darin Peshev <darinp@gmail.com>

craig · 2019-03-26T09:36:11Z

Build succeeded

GitHub CI (Cockroach)

darinpp requested a review from a team March 18, 2019 23:43

darinpp requested review from andreimatei, andy-kimball, nvb and tbg and removed request for a team March 18, 2019 23:45

darinpp force-pushed the master branch from e39b0c7 to d0527fb Compare March 18, 2019 23:54

darinpp changed the title ~~added txn wait queue related metrics~~ storage: Added txn wait queue related metrics Mar 18, 2019

darinpp requested a review from bdarnell March 19, 2019 00:01

andy-kimball reviewed Mar 19, 2019

View reviewed changes

tbg requested changes Mar 19, 2019

View reviewed changes

bdarnell reviewed Mar 19, 2019

View reviewed changes

darinpp force-pushed the master branch 2 times, most recently from d5224bd to b3e5eed Compare March 22, 2019 05:07

darinpp commented Mar 22, 2019

View reviewed changes

darinpp requested a review from tbg March 22, 2019 05:16

tbg reviewed Mar 22, 2019

View reviewed changes

darinpp force-pushed the master branch from b3e5eed to dcf87e9 Compare March 23, 2019 23:38

darinpp commented Mar 23, 2019

View reviewed changes

tbg approved these changes Mar 25, 2019

View reviewed changes

darinpp force-pushed the master branch from dcf87e9 to 7cebefd Compare March 25, 2019 17:57

darinpp commented Mar 25, 2019

View reviewed changes

tbg requested changes Mar 25, 2019

View reviewed changes

darinpp force-pushed the master branch from 7cebefd to 5df3db4 Compare March 25, 2019 23:27

darinpp commented Mar 25, 2019

View reviewed changes

tbg approved these changes Mar 26, 2019

View reviewed changes

craig bot merged commit 5df3db4 into cockroachdb:master Mar 26, 2019

darinpp mentioned this pull request Mar 26, 2019

release-19.1: storage: Added txn wait queue related metrics #36166

Merged

tbg mentioned this pull request May 1, 2019

storage: propagate errors from contentionQueue, catch stalls in roachtest #37199

Merged

	slowTimer := timeutil.NewTimer()
	defer slowTimer.Stop()
	slowTimer.Reset(base.SlowRequestThreshold)
	tBegin := timeutil.Now()

	for {
	select {
	case propResult := <-ch:
	// Semi-synchronously process any intents that need resolving here in
	// order to apply back pressure on the client which generated them. The
	// resolution is semi-synchronous in that there is a limited number of
	// outstanding asynchronous resolution tasks allowed after which
	// further calls will block.
	if len(propResult.Intents) > 0 {
	// TODO(peter): Re-proposed and canceled (but executed) commands can
	// both leave intents to GC that don't hit this code path. No good
	// solution presents itself at the moment and such intents will be
	// resolved on reads.
	if err := r.store.intentResolver.CleanupIntentsAsync(ctx, propResult.Intents, true /* allowSync */); err != nil {
	log.Warning(ctx, err)
	}
	}
	if len(propResult.EndTxns) > 0 {
	if err := r.store.intentResolver.CleanupTxnIntentsAsync(ctx, r.RangeID, propResult.EndTxns, true /* allowSync */); err != nil {
	log.Warning(ctx, err)
	}
	}
	return propResult.Reply, propResult.Err, propResult.ProposalRetry
	case <-slowTimer.C:
	slowTimer.Read = true

	defer func() {
	r.store.metrics.SlowRaftRequests.Dec(1)
	log.Infof(
	ctx,
	"slow command %s finished after %.2fs with error %v",
	ba,
	timeutil.Since(tBegin).Seconds(),
	pErr,
	)
	}()

Conversation

darinpp commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Mar 18, 2019

Uh oh!

CLAassistant commented Mar 18, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

darinpp commented Mar 18, 2019

Uh oh!

andy-kimball left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

bdarnell left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

darinpp left a comment

Choose a reason for hiding this comment

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

tbg commented Mar 26, 2019

Uh oh!

craig bot commented Mar 26, 2019

Build succeeded

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

darinpp commented Mar 18, 2019 •

edited

Loading

CLAassistant commented Mar 18, 2019 •

edited

Loading