[DNM] tests: add qos roachtest by msirek · Pull Request #80112 · cockroachdb/cockroach

msirek · 2022-04-18T16:49:47Z

This commit adds a new roachtest which tests admission control QoS by
running TPC-C concurrently with TPC-H OLAP queries, once with the OLAP
session having session variable default_transaction_quality_of_service
set to background and once with it set to regular. More often than
not this test should show a slight improvement using background QoS,
but since results are variable, the test passes if the TPC-C
transactions per minute regresses no more than 5% when the OLAP session
has background QoS.

The test can be run with the commands:

    make bin/workload
    make bin/roachtest
    build/builder.sh mkrelease
    roachtest run qos

Example abbreviated output:

...
15:56:34 util.go:52: up-replication complete
15:56:34 tpcc.go:160: test status: loading fixture
15:57:02 tpcc.go:176: test status:
15:57:02 qos_tpcc_background_olap.go:135: test status: loading TPCH tables
15:58:12 qos_tpcc_background_olap.go:181: test worker status: running tpcc
15:58:12 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16 with background quality of service
16:02:15 cluster.go:659: test status: getting perf/stats_qos.json
16:02:15 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats_qos.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats_qos.json
16:02:17 cluster.go:659: test status:
16:02:17 qos_tpcc_background_olap.go:181: test worker status: running tpcc
16:02:17 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16
16:06:20 cluster.go:659: test status: getting perf/stats.json
16:06:20 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats.json
16:06:22 cluster.go:659: test status:
16:06:22 qos_tpcc_background_olap.go:231:
16:06:22 qos_tpcc_background_olap.go:232: TPCC results with OLAP queries running simultaneously
16:06:22 qos_tpcc_background_olap.go:233: -----------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 96.88, tpmC: 366.33
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6    939.5   2550.1   2818.6   3623.9   5368.7 delivery
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.1    570.4   1476.4   1811.9   2415.9   4026.5 newOrder
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.5    176.2    385.9    453.0    637.5    738.2 orderStatus
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.2    335.5    906.0   1208.0   1610.6   2550.1 payment
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6     96.5    419.4    536.9   1140.9   1811.9 stockLevel
16:06:22 qos_tpcc_background_olap.go:235:

16:06:22 qos_tpcc_background_olap.go:236: TPCC results with OLAP queries running simultaneously with background QoS
16:06:22 qos_tpcc_background_olap.go:237: -------------------------------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 98.11, tpmC: 371.00
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6   1476.4   3489.7   4026.5   5100.3   6710.9 delivery
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.2    805.3   1946.2   2281.7   3087.0   4295.0 newOrder
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.5    209.7    436.2    570.4    704.6    872.4 orderStatus
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.0    503.3   1275.1   1543.5   2080.4   4295.0 payment
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6    192.9    453.0    486.5   1476.4   2013.3 stockLevel
16:06:22 qos_tpcc_background_olap.go:239:

16:06:22 qos_tpcc_background_olap.go:96: tpmC_No_QoS:         366.33   Efficiency: 96.88
16:06:22 qos_tpcc_background_olap.go:98: tpmC_Background_QoS: 371.00   Efficiency: 98.11
+ 16:06:22 qos_tpcc_background_olap.go:119: SUCCESS: TPCC run in parallel with OLAP queries using background QoS
+                                                    had a tpmC score 1.3% higher than with regular QoS.
16:06:22 test_runner.go:886: tearing down after success; see teardown.log
 --- PASS: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (751.23s)
16:06:22 test_runner.go:628: [w0] test passed: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (run 1)
...

Release note: none

cockroach-teamcity · 2022-04-18T16:49:57Z

This change is

ajwerner

Isn't this result backwards?

It feels like this test should run 3 things:

TPCC with no concurrent load
TPCC with lots of concurrent load via tpch (to the point that it fails and we see efficiency in the 60% or worse range)
Same load as 2) but with the background QoS

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

msirek

I tried higher load, and can change the test to do this again. However, with background QoS I sometimes get much worse performance, so I would have to widen the allowed regression percentage so the test isn't failing all the time.. What would the first test show? Wouldn't it always be faster than tests 2 and 3?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

ajwerner · 2022-04-18T17:13:00Z

If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly

msirek

I don't think the result is backwards. Giving higher admission control priority to TPCC than TPCH should allow it to run faster.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

msirek

If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly

Yeah, so far I am having a hard time finding a test that shows precisely this point, so I'm using the test to record the current level of performance and make sure we don't regress any further.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

ajwerner

Ah, sorry, I was looking at the latency profiles. At the end of the day, 96.88 vs 98.11 feels in the noise for a 3 minute run. I have a feeling that load-balancing variability could be the culprit. Both are clearly passing runs. Do we know that any admission control throttling kicked in? It'd be nice to validate that the delay over the time interval was non-zero. We have some other roachtest which validate metrics

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

msirek

No worries, OK, I'll increase the load and runtime. Any pointers on the other roachtests you're referring to is appreciated. Thanks.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

ajwerner

cockroach/pkg/cmd/roachtest/tests/ts_util.go

Lines 160 to 162 in 21a286f

    
           response := mustGetMetrics(t, adminURL, start, end, []tsQuery{ 
        
           	{name: "cr.node.distsender.rangelookups", queryType: rate}, 
        
           })

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

msirek

I don't see a predefined metric.Metadata for admission control throttling. I guess a new one is needed.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

ajwerner

These are cockroach metrics. see like

cockroach/pkg/util/admission/work_queue.go

Lines 1457 to 1494 in ef29359

    
           var ( 
        
           	requestedMeta = metric.Metadata{ 
        
           		Name:        "admission.requested.", 
        
           		Help:        "Number of requests", 
        
           		Measurement: "Requests", 
        
           		Unit:        metric.Unit_COUNT, 
        
           	} 
        
           	admittedMeta = metric.Metadata{ 
        
           		Name:        "admission.admitted.", 
        
           		Help:        "Number of requests admitted", 
        
           		Measurement: "Requests", 
        
           		Unit:        metric.Unit_COUNT, 
        
           	} 
        
           	erroredMeta = metric.Metadata{ 
        
           		Name:        "admission.errored.", 
        
           		Help:        "Number of requests not admitted due to error", 
        
           		Measurement: "Requests", 
        
           		Unit:        metric.Unit_COUNT, 
        
           	} 
        
           	waitDurationSumMeta = metric.Metadata{ 
        
           		Name:        "admission.wait_sum.", 
        
           		Help:        "Total wait time in micros", 
        
           		Measurement: "Microseconds", 
        
           		Unit:        metric.Unit_COUNT, 
        
           	} 
        
           	waitDurationsMeta = metric.Metadata{ 
        
           		Name:        "admission.wait_durations.", 
        
           		Help:        "Wait time durations for requests that waited", 
        
           		Measurement: "Wait time Duration", 
        
           		Unit:        metric.Unit_NANOSECONDS, 
        
           	} 
        
           	waitQueueLengthMeta = metric.Metadata{ 
        
           		Name:        "admission.wait_queue_length.", 
        
           		Help:        "Length of wait queue", 
        
           		Measurement: "Requests", 
        
           		Unit:        metric.Unit_COUNT, 
        
           	} 
        
           )

You can look at /_status/vars to find metrics too. admission_wait_durations_kv_stores_sum may be interesting

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

cucaroach

The admission.wait* metrics will show whether overload is being achieved, the overload page in the admin is good too.

Reviewed 2 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

msirek

Thanks. The latest code push shows my attempt to look up these metrics (see printAdmissionMetrics), though I am not seeing any entries in the response, at least on a local cluster. I am running again now on a cloud cluster, but if you see anything I'm doing wrong in the code, please let me know.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

sumeerbhola · 2022-04-21T20:48:13Z

I'm glad to see such a test -- I fully expect that we will see something surprising.

I'd suggest first just eyeballing the overload page while this is running (and increasing the test duration) to be longer than 3min.
IIRC tpccbench stresses CPU more than the store. I don't know anything about TPCH.
Consider starting with a single node cluster -- the results will be easier to analyze. If admission control happens to do the right thing there, we can then try with 3 nodes.

yuzefovich

Nice work! I have quite a few nits (sorry) and a suggestion.

Reviewed 3 of 4 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 36 at r3 (raw file):

	Warehouses      int
	Concurrency     int
	OlapConcurrency int

nit: s/Olap/OLAP/ and s/Tpcc/TPCC/ throughout the file - I think we tend to either capitalize all letters in an acronym or none.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 42 at r3 (raw file):

}

// run sets up TPCC then runs both TPCC plus TPCH OLAP queries involving the

nit: probably s/involving the/involving them/.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 46 at r3 (raw file):

// quality of service, then with OLAP queries using `regular` transaction
// quality of service.  The results are logged and compared. The test fails if
// using `background` QoS doesn't result in a tpmC score from TPCC which is at

nit: s/which is at least/at least/.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 60 at r3 (raw file):

	noThrottleTpmC = make([]float64, 0, s.numRuns)
	noThrottleTpccEfficiency = make([]float64, 0, s.numRuns)
	var tpmCQos, tpmCNoQoS float64

nit: s/Qos/QoS/ throughout the file for consistency.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 61 at r3 (raw file):

	noThrottleTpccEfficiency = make([]float64, 0, s.numRuns)
	var tpmCQos, tpmCNoQoS float64
	var efficiencyCQos, efficiencyNoQoS float64

nit: s/CQos/QoS/.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 67 at r3 (raw file):

		crdbNodes, workloadNode := s.setupDatabases(ctx, t, c)

		s.runTpcc(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false)

nit: it's good to add an inline comment for constant arguments like false /* useBackgroundQoS */.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 135 at r3 (raw file):

	if percentImprovement < maxAllowedRegression {
		ttycolor.Stdout(ttycolor.Red)
		failMessage :=

nit: seems like this and the next lines should be squashed.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 207 at r3 (raw file):

	response := mustGetMetrics(t, adminURL, start, end, admissionMetrics)

	// Drop the first two minutes of datapoints as a "ramp-up" period.

nit: seems like a leftover?

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 256 at r3 (raw file):

) {
	m := c.NewMonitor(ctx, crdbNodes)
	// Kick off TPC-H with concurrency

nit: missing period.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 394 at r3 (raw file):

			CPUs:            4,
			Concurrency:     32,
			OlapConcurrency: 64,

This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false; - Q1 will be burning up CPU while staying within memory limits.

msirek

Thanks for the input. Right now I'm blocked a bit because printAdmissionMetrics is not able to retrieve metrics from the cluster. Any advice on what looks wrong in this code is appreciated.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 36 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Olap/OLAP/ and s/Tpcc/TPCC/ throughout the file - I think we tend to either capitalize all letters in an acronym or none.

Done

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 42 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably s/involving the/involving them/.

Removed involving the

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 46 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/which is at least/at least/.

It sounds better as is.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Qos/QoS/ throughout the file for consistency.

Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 61 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/CQos/QoS/.

Done

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 67 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it's good to add an inline comment for constant arguments like false /* useBackgroundQoS */.

Done

Code quote:

runTpcc(

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 135 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like this and the next lines should be squashed.

Done

Code quote:

ttycolor.Stdout(ttycolor.Red)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 207 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like a leftover?

I don't know if I need this code yet. I am having trouble getting any metrics back at all, so I don't know if I need this code. I'll leave it in commented-out for now.

Code quote:

Drop the first two minutes of datapoints as a

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 256 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: missing period.

Added

Code quote:

Kick off TPC-H with concurrency

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 394 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false; - Q1 will be burning up CPU while staying within memory limits.

Thanks for the advice, I'll try out different things. Right now the blocker for me is I'm unable to get any metrics at all in printAdmissionMetrics.

yuzefovich

Hm, I don't have any experience with those, so hopefully others can chime in.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, msirek (Mark Sirek) wrote…

Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.

I'm a big fan of consistency, and my nit came from the fact we had Qos and NoQoS previously - either Qos or QoS seems ok to me, but once picked, it should be used throughout the whole file.

msirek

Thanks. I'm talking with Tommy today, maybe he'll know the answer.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm a big fan of consistency, and my nit came from the fact we had Qos and NoQoS previously - either Qos or QoS seems ok to me, but once picked, it should be used throughout the whole file.

Good point, thanks.

cucaroach

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msirek, @sumeerbhola, and @yuzefovich)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 64 at r4 (raw file):

	var baselineTpmC, baselineEfficiency float64

	for i := 0; i < s.numRuns; i++ {

numRuns is possibly redundant with roachtest --count

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 84 at r4 (raw file):

		startNoQoS := timeutil.Now()
		s.runTPCCAndOLAPQueries(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false /* useBackgroundQoS */)

Do we want to run 3 workloads back to back? We could also run 3 separate tests and not worry about having a pass fail. The semantics of restarting a workload are unknown to me, like does it try to insert the same records on subsequent runs and get a bunch of collisions? Consider also just leaving tpcc running the whole time and do a couple minutes
w/o tpch, a couple minutes w/ noqos, a couple minutes w/ qos and then a washout period at the end again with no tpch. Not sure how to get instantaneous tpmc metrics but maybe they can be copied and analyzed while the test is running?

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 392 at r4 (raw file):

		{
			CPUs:            4,
			Concurrency:     32,

I think in order for admission control queuing to occur you need sustained goroutines per cpu to be over 40. Check the overload page in the admin. I'd guess concurrency needs to be higher.

msirek · 2022-05-24T22:20:00Z

Even when run with parameter --max-rate=0, this test doesn't seem to be reaching any level of overload:

yuzefovich · 2022-05-24T22:30:05Z

Try changing wait parameter of TPCC?

msirek · 2022-05-24T22:51:11Z

Try changing wait parameter of TPCC?

Thanks!

msirek · 2022-05-25T03:05:56Z

With the --wait=0 option, queries per second increased by an order of magnitude. When tpch was run with background QoS, tpcc had a lower admission work rate and lower tpmC score. I'm thinking the kv test may be a better test since it's more read-heavy.
Results with a 15-minute test:

01:38:44 qos_tpcc_background_olap.go:338: 
01:38:44 qos_tpcc_background_olap.go:339: TPCC results with OLAP queries running simultaneously
01:38:44 qos_tpcc_background_olap.go:340: -----------------------------------------------------
01:38:44 qos_tpcc_background_olap.go:350: Duration: 2m59., Warehouses: 30, Efficiency: 35.26, tpmC: 133.33
01:38:44 qos_tpcc_background_olap.go:352: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2   7516.2  31138.5  42949.7  73014.4  73014.4 delivery
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            2.2  33286.0  98784.2 103079.2 103079.2 103079.2 newOrder
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2    704.6   6710.9   8053.1  12348.0  12348.0 orderStatus
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            2.4  32212.3  73014.4  94489.3 103079.2 103079.2 payment
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2   5905.6  38654.7  51539.6  62277.0  62277.0 stockLevel
01:38:44 qos_tpcc_background_olap.go:342: 

01:38:44 qos_tpcc_background_olap.go:343: TPCC results with OLAP queries running simultaneously with background QoS
01:38:44 qos_tpcc_background_olap.go:344: -------------------------------------------------------------------------
01:38:44 qos_tpcc_background_olap.go:350: Duration: 3m0.9, Warehouses: 30, Efficiency: 30.16, tpmC: 114.03
01:38:44 qos_tpcc_background_olap.go:352: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2  10737.4  20401.1  23622.3  33286.0  33286.0 delivery
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            1.9  60129.5 103079.2 103079.2 103079.2 103079.2 newOrder
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2   1677.7   8589.9  13421.8  34359.7  34359.7 orderStatus
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            1.7  26843.5  85899.3 103079.2 103079.2 103079.2 payment
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2   1543.5  30064.8  51539.6  98784.2  98784.2 stockLevel

Perhaps it needs to run much longer to show extreme overload.

Admission graphs with a 3-minute test, first 3 minutes starting at 1:09 use background QoS for tpch:

func registerTPCCQoSBackgroundOLAP(r registry.Registry) {
	specs := []tpccQoSBackgroundOLAPSpec{
		{
			CPUs:            4,
			Concurrency:     200,
			OLAPConcurrency: 64,
			Nodes:           3,
			Warehouses:      30,
			Duration:        3 * time.Minute,
			Ramp:            1 * time.Minute,
			numRuns:         1,
		},
	}
	for _, s := range specs {
		registerTPCCQoSBackgroundOLAPSpec(r, s)
	}
}

This commit adds a new roachtest which tests admission control QoS by running TPC-C concurrently with TPC-H OLAP queries, once with the OLAP session having session variable `default_transaction_quality_of_service` set to `background` and once with it set to `regular`. More often than not this test should show a slight improvement using `background` QoS, but since results are variable, the test passes if the TPC-C transactions per minute regresses no more than 5% when the OLAP session has `background` QoS. The test can be run with the commands: ``` make bin/workload make bin/roachtest build/builder.sh mkrelease roachtest run qos ``` Example abbreviated output: ```diff ... 15:56:34 util.go:52: up-replication complete 15:56:34 tpcc.go:160: test status: loading fixture 15:57:02 tpcc.go:176: test status: 15:57:02 qos_tpcc_background_olap.go:135: test status: loading TPCH tables 15:58:12 qos_tpcc_background_olap.go:181: test worker status: running tpcc 15:58:12 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16 with background quality of service 16:02:15 cluster.go:659: test status: getting perf/stats_qos.json 16:02:15 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats_qos.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats_qos.json 16:02:17 cluster.go:659: test status: 16:02:17 qos_tpcc_background_olap.go:181: test worker status: running tpcc 16:02:17 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16 16:06:20 cluster.go:659: test status: getting perf/stats.json 16:06:20 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats.json 16:06:22 cluster.go:659: test status: 16:06:22 qos_tpcc_background_olap.go:231: 16:06:22 qos_tpcc_background_olap.go:232: TPCC results with OLAP queries running simultaneously 16:06:22 qos_tpcc_background_olap.go:233: ----------------------------------------------------- 16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 96.88, tpmC: 366.33 16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 939.5 2550.1 2818.6 3623.9 5368.7 delivery 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.1 570.4 1476.4 1811.9 2415.9 4026.5 newOrder 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.5 176.2 385.9 453.0 637.5 738.2 orderStatus 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.2 335.5 906.0 1208.0 1610.6 2550.1 payment 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 96.5 419.4 536.9 1140.9 1811.9 stockLevel 16:06:22 qos_tpcc_background_olap.go:235: 16:06:22 qos_tpcc_background_olap.go:236: TPCC results with OLAP queries running simultaneously with background QoS 16:06:22 qos_tpcc_background_olap.go:237: ------------------------------------------------------------------------- 16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 98.11, tpmC: 371.00 16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms) 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 1476.4 3489.7 4026.5 5100.3 6710.9 delivery 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.2 805.3 1946.2 2281.7 3087.0 4295.0 newOrder 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.5 209.7 436.2 570.4 704.6 872.4 orderStatus 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.0 503.3 1275.1 1543.5 2080.4 4295.0 payment 16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 192.9 453.0 486.5 1476.4 2013.3 stockLevel 16:06:22 qos_tpcc_background_olap.go:239: 16:06:22 qos_tpcc_background_olap.go:96: tpmC_No_QoS: 366.33 Efficiency: 96.88 16:06:22 qos_tpcc_background_olap.go:98: tpmC_Background_QoS: 371.00 Efficiency: 98.11 + 16:06:22 qos_tpcc_background_olap.go:119: SUCCESS: TPCC run in parallel with OLAP queries using background QoS + had a tpmC score 1.3% higher than with regular QoS. 16:06:22 test_runner.go:886: tearing down after success; see teardown.log --- PASS: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (751.23s) 16:06:22 test_runner.go:628: [w0] test passed: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (run 1) ... ``` Release note: none

msirek · 2022-09-08T00:56:23Z

Note: double-check metamorphic settings aren't in use

cockroach-teamcity · 2023-12-22T08:05:22Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Mark Sirek seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

msirek requested review from a team, cucaroach and yuzefovich April 18, 2022 16:49

ajwerner reviewed Apr 18, 2022

View reviewed changes

msirek commented Apr 18, 2022

View reviewed changes

ajwerner reviewed Apr 18, 2022

View reviewed changes

msirek commented Apr 18, 2022

View reviewed changes

ajwerner reviewed Apr 18, 2022

View reviewed changes

msirek commented Apr 18, 2022

View reviewed changes

msirek force-pushed the qos_roachtest branch from 8b04af2 to b182e4d Compare April 18, 2022 18:15

ajwerner reviewed Apr 18, 2022

View reviewed changes

cucaroach reviewed Apr 19, 2022

View reviewed changes

msirek force-pushed the qos_roachtest branch from b182e4d to 9b0bf34 Compare April 19, 2022 15:27

msirek commented Apr 19, 2022

View reviewed changes

msirek requested a review from sumeerbhola April 20, 2022 19:03

yuzefovich reviewed Apr 22, 2022

View reviewed changes

msirek force-pushed the qos_roachtest branch from 9b0bf34 to 62da937 Compare April 28, 2022 15:16

msirek commented Apr 28, 2022

View reviewed changes

yuzefovich reviewed Apr 29, 2022

View reviewed changes

msirek commented May 2, 2022

View reviewed changes

cucaroach reviewed May 3, 2022

View reviewed changes

msirek force-pushed the qos_roachtest branch from 62da937 to d755a69 Compare May 9, 2022 07:06

msirek force-pushed the qos_roachtest branch from d755a69 to 63f0bec Compare May 25, 2022 03:06

msirek changed the title ~~tests: add qos roachtest~~ [tests: add qos roachtest May 25, 2022

msirek changed the title ~~[tests: add qos roachtest~~ [DNM] tests: add qos roachtest May 25, 2022

msirek marked this pull request as draft May 25, 2022 18:17

irfansharif added the A-admission-control label Jun 20, 2022

irfansharif mentioned this pull request Oct 27, 2022

admission: add integration tests #89208

Open

21 tasks

msirek closed this Dec 22, 2023

	response := mustGetMetrics(t, adminURL, start, end, []tsQuery{
	{name: "cr.node.distsender.rangelookups", queryType: rate},
	})

	var (
	requestedMeta = metric.Metadata{
	Name: "admission.requested.",
	Help: "Number of requests",
	Measurement: "Requests",
	Unit: metric.Unit_COUNT,
	}
	admittedMeta = metric.Metadata{
	Name: "admission.admitted.",
	Help: "Number of requests admitted",
	Measurement: "Requests",
	Unit: metric.Unit_COUNT,
	}
	erroredMeta = metric.Metadata{
	Name: "admission.errored.",
	Help: "Number of requests not admitted due to error",
	Measurement: "Requests",
	Unit: metric.Unit_COUNT,
	}
	waitDurationSumMeta = metric.Metadata{
	Name: "admission.wait_sum.",
	Help: "Total wait time in micros",
	Measurement: "Microseconds",
	Unit: metric.Unit_COUNT,
	}
	waitDurationsMeta = metric.Metadata{
	Name: "admission.wait_durations.",
	Help: "Wait time durations for requests that waited",
	Measurement: "Wait time Duration",
	Unit: metric.Unit_NANOSECONDS,
	}
	waitQueueLengthMeta = metric.Metadata{
	Name: "admission.wait_queue_length.",
	Help: "Length of wait queue",
	Measurement: "Requests",
	Unit: metric.Unit_COUNT,
	}
	)

Conversation

msirek commented Apr 18, 2022

Uh oh!

cockroach-teamcity commented Apr 18, 2022

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner commented Apr 18, 2022

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

cucaroach left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

sumeerbhola commented Apr 21, 2022

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

msirek left a comment

Choose a reason for hiding this comment

Uh oh!

cucaroach left a comment

Choose a reason for hiding this comment

Uh oh!

msirek commented May 24, 2022

Uh oh!

yuzefovich commented May 24, 2022

Uh oh!

msirek commented May 24, 2022

Uh oh!

msirek commented May 25, 2022

Uh oh!

msirek commented Sep 8, 2022

Uh oh!

cockroach-teamcity commented Dec 22, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants