
[DNM] tests: add qos roachtest #80112

Closed

msirek wants to merge 1 commit into cockroachdb:master from msirek:qos_roachtest

Conversation

@msirek
Contributor

@msirek msirek commented Apr 18, 2022

This commit adds a new roachtest which tests admission control QoS by
running TPC-C concurrently with TPC-H OLAP queries, once with the OLAP
session's session variable `default_transaction_quality_of_service`
set to `background` and once with it set to `regular`. More often than
not this test should show a slight improvement using `background` QoS,
but since results are variable, the test passes if the TPC-C
transactions-per-minute score regresses by no more than 5% when the
OLAP session has `background` QoS.

The test can be run with the commands:

    make bin/workload
    make bin/roachtest
    build/builder.sh mkrelease
    roachtest run qos

Example abbreviated output:

...
15:56:34 util.go:52: up-replication complete
15:56:34 tpcc.go:160: test status: loading fixture
15:57:02 tpcc.go:176: test status:
15:57:02 qos_tpcc_background_olap.go:135: test status: loading TPCH tables
15:58:12 qos_tpcc_background_olap.go:181: test worker status: running tpcc
15:58:12 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16 with background quality of service
16:02:15 cluster.go:659: test status: getting perf/stats_qos.json
16:02:15 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats_qos.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats_qos.json
16:02:17 cluster.go:659: test status:
16:02:17 qos_tpcc_background_olap.go:181: test worker status: running tpcc
16:02:17 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16
16:06:20 cluster.go:659: test status: getting perf/stats.json
16:06:20 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats.json
16:06:22 cluster.go:659: test status:
16:06:22 qos_tpcc_background_olap.go:231:
16:06:22 qos_tpcc_background_olap.go:232: TPCC results with OLAP queries running simultaneously
16:06:22 qos_tpcc_background_olap.go:233: -----------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 96.88, tpmC: 366.33
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6    939.5   2550.1   2818.6   3623.9   5368.7 delivery
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.1    570.4   1476.4   1811.9   2415.9   4026.5 newOrder
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.5    176.2    385.9    453.0    637.5    738.2 orderStatus
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.2    335.5    906.0   1208.0   1610.6   2550.1 payment
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6     96.5    419.4    536.9   1140.9   1811.9 stockLevel
16:06:22 qos_tpcc_background_olap.go:235:

16:06:22 qos_tpcc_background_olap.go:236: TPCC results with OLAP queries running simultaneously with background QoS
16:06:22 qos_tpcc_background_olap.go:237: -------------------------------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 98.11, tpmC: 371.00
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6   1476.4   3489.7   4026.5   5100.3   6710.9 delivery
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.2    805.3   1946.2   2281.7   3087.0   4295.0 newOrder
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.5    209.7    436.2    570.4    704.6    872.4 orderStatus
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            6.0    503.3   1275.1   1543.5   2080.4   4295.0 payment
16:06:22 qos_tpcc_background_olap.go:254:   180.0s            0.6    192.9    453.0    486.5   1476.4   2013.3 stockLevel
16:06:22 qos_tpcc_background_olap.go:239:

16:06:22 qos_tpcc_background_olap.go:96: tpmC_No_QoS:         366.33   Efficiency: 96.88
16:06:22 qos_tpcc_background_olap.go:98: tpmC_Background_QoS: 371.00   Efficiency: 98.11
+ 16:06:22 qos_tpcc_background_olap.go:119: SUCCESS: TPCC run in parallel with OLAP queries using background QoS
+                                                    had a tpmC score 1.3% higher than with regular QoS.
16:06:22 test_runner.go:886: tearing down after success; see teardown.log
 --- PASS: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (751.23s)
16:06:22 test_runner.go:628: [w0] test passed: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (run 1)
...

Release note: none

@msirek msirek requested review from a team, cucaroach and yuzefovich April 18, 2022 16:49
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment


Isn't this result backwards?

It feels like this test should run 3 things:

  1. TPCC with no concurrent load
  2. TPCC with lots of concurrent load via tpch (to the point that it fails and we see efficiency in the 60% or worse range)
  3. Same load as 2) but with the background QoS

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor Author

@msirek msirek left a comment


I tried higher load, and can change the test to do this again. However, with background QoS I sometimes get much worse performance, so I would have to widen the allowed regression percentage so the test isn't failing all the time. What would the first test show? Wouldn't it always be faster than tests 2 and 3?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

@ajwerner
Contributor

If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly

Contributor Author

@msirek msirek left a comment


I don't think the result is backwards. Giving higher admission control priority to TPCC than TPCH should allow it to run faster.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor Author

@msirek msirek left a comment


If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly

Yeah, so far I am having a hard time finding a test that shows precisely this point, so I'm using the test to record the current level of performance and make sure we don't regress any further.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor

@ajwerner ajwerner left a comment


Ah, sorry, I was looking at the latency profiles. At the end of the day, 96.88 vs 98.11 feels in the noise for a 3-minute run. I have a feeling that load-balancing variability could be the culprit. Both are clearly passing runs. Do we know that any admission control throttling kicked in? It'd be nice to validate that the delay over the time interval was non-zero. We have some other roachtests which validate metrics.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor Author

@msirek msirek left a comment


No worries. OK, I'll increase the load and runtime. Any pointers to the other roachtests you're referring to are appreciated. Thanks.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor

@ajwerner ajwerner left a comment


response := mustGetMetrics(t, adminURL, start, end, []tsQuery{
	{name: "cr.node.distsender.rangelookups", queryType: rate},
})

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor Author

@msirek msirek left a comment


I don't see a predefined metric.Metadata for admission control throttling. I guess a new one is needed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor

@ajwerner ajwerner left a comment


These are cockroach metrics. See, for example:

var (
	requestedMeta = metric.Metadata{
		Name:        "admission.requested.",
		Help:        "Number of requests",
		Measurement: "Requests",
		Unit:        metric.Unit_COUNT,
	}
	admittedMeta = metric.Metadata{
		Name:        "admission.admitted.",
		Help:        "Number of requests admitted",
		Measurement: "Requests",
		Unit:        metric.Unit_COUNT,
	}
	erroredMeta = metric.Metadata{
		Name:        "admission.errored.",
		Help:        "Number of requests not admitted due to error",
		Measurement: "Requests",
		Unit:        metric.Unit_COUNT,
	}
	waitDurationSumMeta = metric.Metadata{
		Name:        "admission.wait_sum.",
		Help:        "Total wait time in micros",
		Measurement: "Microseconds",
		Unit:        metric.Unit_COUNT,
	}
	waitDurationsMeta = metric.Metadata{
		Name:        "admission.wait_durations.",
		Help:        "Wait time durations for requests that waited",
		Measurement: "Wait time Duration",
		Unit:        metric.Unit_NANOSECONDS,
	}
	waitQueueLengthMeta = metric.Metadata{
		Name:        "admission.wait_queue_length.",
		Help:        "Length of wait queue",
		Measurement: "Requests",
		Unit:        metric.Unit_COUNT,
	}
)

You can look at /_status/vars to find metrics too. admission_wait_durations_kv_stores_sum may be interesting.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

Contributor

@cucaroach cucaroach left a comment


The admission.wait* metrics will show whether overload is being achieved; the overload page in the admin UI is good too.

Reviewed 2 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

Contributor Author

@msirek msirek left a comment


Thanks. The latest code push shows my attempt to look up these metrics (see printAdmissionMetrics), though I am not seeing any entries in the response, at least on a local cluster. I am running again now on a cloud cluster, but if you see anything I'm doing wrong in the code, please let me know.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

@msirek msirek requested a review from sumeerbhola April 20, 2022 19:03
@sumeerbhola
Collaborator

I'm glad to see such a test -- I fully expect that we will see something surprising.

  • I'd suggest first just eyeballing the overload page while this is running (and increasing the test duration to be longer than 3 min).
  • IIRC tpccbench stresses CPU more than the store. I don't know anything about TPCH.
  • Consider starting with a single node cluster -- the results will be easier to analyze. If admission control happens to do the right thing there, we can then try with 3 nodes.

Member

@yuzefovich yuzefovich left a comment


Nice work! I have quite a few nits (sorry) and a suggestion.

Reviewed 3 of 4 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 36 at r3 (raw file):

	Warehouses      int
	Concurrency     int
	OlapConcurrency int

nit: s/Olap/OLAP/ and s/Tpcc/TPCC/ throughout the file - I think we tend to either capitalize all letters in an acronym or none.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 42 at r3 (raw file):

}

// run sets up TPCC then runs both TPCC plus TPCH OLAP queries involving the

nit: probably s/involving the/involving them/.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 46 at r3 (raw file):

// quality of service, then with OLAP queries using `regular` transaction
// quality of service.  The results are logged and compared. The test fails if
// using `background` QoS doesn't result in a tpmC score from TPCC which is at

nit: s/which is at least/at least/.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 60 at r3 (raw file):

	noThrottleTpmC = make([]float64, 0, s.numRuns)
	noThrottleTpccEfficiency = make([]float64, 0, s.numRuns)
	var tpmCQos, tpmCNoQoS float64

nit: s/Qos/QoS/ throughout the file for consistency.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 61 at r3 (raw file):

	noThrottleTpccEfficiency = make([]float64, 0, s.numRuns)
	var tpmCQos, tpmCNoQoS float64
	var efficiencyCQos, efficiencyNoQoS float64

nit: s/CQos/QoS/.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 67 at r3 (raw file):

		crdbNodes, workloadNode := s.setupDatabases(ctx, t, c)

		s.runTpcc(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false)

nit: it's good to add an inline comment for constant arguments like false /* useBackgroundQoS */.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 135 at r3 (raw file):

	if percentImprovement < maxAllowedRegression {
		ttycolor.Stdout(ttycolor.Red)
		failMessage :=

nit: seems like this and the next lines should be squashed.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 207 at r3 (raw file):

	response := mustGetMetrics(t, adminURL, start, end, admissionMetrics)

	// Drop the first two minutes of datapoints as a "ramp-up" period.

nit: seems like a leftover?


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 256 at r3 (raw file):

) {
	m := c.NewMonitor(ctx, crdbNodes)
	// Kick off TPC-H with concurrency

nit: missing period.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 394 at r3 (raw file):

			CPUs:            4,
			Concurrency:     32,
			OlapConcurrency: 64,

This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false; - Q1 will be burning up CPU while staying within memory limits.

Contributor Author

@msirek msirek left a comment


Thanks for the input. Right now I'm blocked a bit because printAdmissionMetrics is not able to retrieve metrics from the cluster. Any advice on what looks wrong in this code is appreciated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 36 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Olap/OLAP/ and s/Tpcc/TPCC/ throughout the file - I think we tend to either capitalize all letters in an acronym or none.

Done


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 42 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: probably s/involving the/involving them/.

Removed involving the


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 46 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/which is at least/at least/.

It sounds better as is.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/Qos/QoS/ throughout the file for consistency.

Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 61 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/CQos/QoS/.

Done


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 67 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it's good to add an inline comment for constant arguments like false /* useBackgroundQoS */.

Done

Code quote:

runTpcc(

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 135 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like this and the next lines should be squashed.

Done

Code quote:

ttycolor.Stdout(ttycolor.Red)

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 207 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: seems like a leftover?

I don't know if I need this code yet; I'm having trouble getting any metrics back at all. I'll leave it commented out for now.

Code quote:

Drop the first two minutes of datapoints as a

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 256 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: missing period.

Added

Code quote:

Kick off TPC-H with concurrency

pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 394 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false; - Q1 will be burning up CPU while staying within memory limits.

Thanks for the advice, I'll try out different things. Right now the blocker for me is I'm unable to get any metrics at all in printAdmissionMetrics.

Member

@yuzefovich yuzefovich left a comment


Hm, I don't have any experience with those, so hopefully others can chime in.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, msirek (Mark Sirek) wrote…

Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.

I'm a big fan of consistency, and my nit came from the fact we had Qos and NoQoS previously - either Qos or QoS seems ok to me, but once picked, it should be used throughout the whole file.

Contributor Author

@msirek msirek left a comment


Thanks. I'm talking with Tommy today, maybe he'll know the answer.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm a big fan of consistency, and my nit came from the fact we had Qos and NoQoS previously - either Qos or QoS seems ok to me, but once picked, it should be used throughout the whole file.

Good point, thanks.

Contributor

@cucaroach cucaroach left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @msirek, @sumeerbhola, and @yuzefovich)


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 64 at r4 (raw file):

	var baselineTpmC, baselineEfficiency float64

	for i := 0; i < s.numRuns; i++ {

numRuns is possibly redundant with roachtest --count


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 84 at r4 (raw file):

		startNoQoS := timeutil.Now()
		s.runTPCCAndOLAPQueries(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false /* useBackgroundQoS */)

Do we want to run 3 workloads back to back? We could also run 3 separate tests and not worry about having a pass/fail. The semantics of restarting a workload are unknown to me; does it try to insert the same records on subsequent runs and get a bunch of collisions? Consider also just leaving tpcc running the whole time: a couple minutes w/o tpch, a couple minutes w/ no QoS, a couple minutes w/ QoS, and then a washout period at the end again with no tpch. Not sure how to get instantaneous tpmC metrics, but maybe they can be copied and analyzed while the test is running?


pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 392 at r4 (raw file):

		{
			CPUs:            4,
			Concurrency:     32,

I think in order for admission control queuing to occur you need sustained goroutines per cpu to be over 40. Check the overload page in the admin. I'd guess concurrency needs to be higher.

@msirek
Contributor Author

msirek commented May 24, 2022

Even when run with parameter --max-rate=0, this test doesn't seem to be reaching any level of overload:

[admin UI screenshots omitted]

@yuzefovich
Member

Try changing wait parameter of TPCC?

@msirek
Contributor Author

msirek commented May 24, 2022

Try changing wait parameter of TPCC?

Thanks!

@msirek
Contributor Author

msirek commented May 25, 2022

With the --wait=0 option, queries per second increased by an order of magnitude. When tpch was run with background QoS, tpcc had a lower admission work rate and lower tpmC score. I'm thinking the kv test may be a better test since it's more read-heavy.
Results with a 15-minute test:

01:38:44 qos_tpcc_background_olap.go:338: 
01:38:44 qos_tpcc_background_olap.go:339: TPCC results with OLAP queries running simultaneously
01:38:44 qos_tpcc_background_olap.go:340: -----------------------------------------------------
01:38:44 qos_tpcc_background_olap.go:350: Duration: 2m59., Warehouses: 30, Efficiency: 35.26, tpmC: 133.33
01:38:44 qos_tpcc_background_olap.go:352: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2   7516.2  31138.5  42949.7  73014.4  73014.4 delivery
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            2.2  33286.0  98784.2 103079.2 103079.2 103079.2 newOrder
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2    704.6   6710.9   8053.1  12348.0  12348.0 orderStatus
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            2.4  32212.3  73014.4  94489.3 103079.2 103079.2 payment
01:38:44 qos_tpcc_background_olap.go:361:   180.0s            0.2   5905.6  38654.7  51539.6  62277.0  62277.0 stockLevel
01:38:44 qos_tpcc_background_olap.go:342: 

01:38:44 qos_tpcc_background_olap.go:343: TPCC results with OLAP queries running simultaneously with background QoS
01:38:44 qos_tpcc_background_olap.go:344: -------------------------------------------------------------------------
01:38:44 qos_tpcc_background_olap.go:350: Duration: 3m0.9, Warehouses: 30, Efficiency: 30.16, tpmC: 114.03
01:38:44 qos_tpcc_background_olap.go:352: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2  10737.4  20401.1  23622.3  33286.0  33286.0 delivery
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            1.9  60129.5 103079.2 103079.2 103079.2 103079.2 newOrder
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2   1677.7   8589.9  13421.8  34359.7  34359.7 orderStatus
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            1.7  26843.5  85899.3 103079.2 103079.2 103079.2 payment
01:38:44 qos_tpcc_background_olap.go:361:   181.0s            0.2   1543.5  30064.8  51539.6  98784.2  98784.2 stockLevel

Perhaps it needs to run much longer to show extreme overload.

Admission graphs with a 3-minute test, first 3 minutes starting at 1:09 use background QoS for tpch:
[admission graph screenshots omitted]

func registerTPCCQoSBackgroundOLAP(r registry.Registry) {
	specs := []tpccQoSBackgroundOLAPSpec{
		{
			CPUs:            4,
			Concurrency:     200,
			OLAPConcurrency: 64,
			Nodes:           3,
			Warehouses:      30,
			Duration:        3 * time.Minute,
			Ramp:            1 * time.Minute,
			numRuns:         1,
		},
	}
	for _, s := range specs {
		registerTPCCQoSBackgroundOLAPSpec(r, s)
	}
}

@msirek msirek changed the title from tests: add qos roachtest to [tests: add qos roachtest May 25, 2022
@msirek msirek changed the title from [tests: add qos roachtest to [DNM] tests: add qos roachtest May 25, 2022
@msirek msirek marked this pull request as draft May 25, 2022 18:17
@msirek
Contributor Author

msirek commented Sep 8, 2022

Note: double-check metamorphic settings aren't in use

@cockroach-teamcity
Member

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Mark Sirek seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@msirek msirek closed this Dec 22, 2023

7 participants