Conversation
ajwerner
left a comment
There was a problem hiding this comment.
Isn't this result backwards?
It feels like this test should run 3 things:
- TPCC with no concurrent load
- TPCC with lots of concurrent load via tpch (to the point that it fails and we see efficiency in the 60% or worse range)
- Same load as 2) but with the background QoS
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
I tried higher load, and can change the test to do this again. However, with background QoS I sometimes get much worse performance, so I would have to widen the allowed regression percentage so the test isn't failing all the time.. What would the first test show? Wouldn't it always be faster than tests 2 and 3?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
|
If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly |
msirek
left a comment
There was a problem hiding this comment.
I don't think the result is backwards. Giving higher admission control priority to TPCC than TPCH should allow it to run faster.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
If admission control worked as intended, the idea is that test 3 should come close to test 1 and test 2 should do poorly
Yeah, so far I am having a hard time finding a test that shows precisely this point, so I'm using the test to record the current level of performance and make sure we don't regress any further.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
ajwerner
left a comment
There was a problem hiding this comment.
Ah, sorry, I was looking at the latency profiles. At the end of the day, 96.88 vs 98.11 feels in the noise for a 3 minute run. I have a feeling that load-balancing variability could be the culprit. Both are clearly passing runs. Do we know that any admission control throttling kicked in? It'd be nice to validate that the delay over the time interval was non-zero. We have some other roachtest which validate metrics
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
No worries, OK, I'll increase the load and runtime. Any pointers on the other roachtests you're referring to is appreciated. Thanks.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
ajwerner
left a comment
There was a problem hiding this comment.
cockroach/pkg/cmd/roachtest/tests/ts_util.go
Lines 160 to 162 in 21a286f
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
I don't see a predefined metric.Metadata for admission control throttling. I guess a new one is needed.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
ajwerner
left a comment
There was a problem hiding this comment.
These are cockroach metrics. see like
cockroach/pkg/util/admission/work_queue.go
Lines 1457 to 1494 in ef29359
You can look at /_status/vars to find metrics too. admission_wait_durations_kv_stores_sum may be interesting
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
cucaroach
left a comment
There was a problem hiding this comment.
The admission.wait* metrics will show whether overload is being achieved, the overload page in the admin is good too.
Reviewed 2 of 4 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
Thanks. The latest code push shows my attempt to look up these metrics (see printAdmissionMetrics), though I am not seeing any entries in the response, at least on a local cluster. I am running again now on a cloud cluster, but if you see anything I'm doing wrong in the code, please let me know.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
|
I'm glad to see such a test -- I fully expect that we will see something surprising.
|
yuzefovich
left a comment
There was a problem hiding this comment.
Nice work! I have quite a few nits (sorry) and a suggestion.
Reviewed 3 of 4 files at r1, 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 36 at r3 (raw file):
Warehouses int Concurrency int OlapConcurrency int
nit: s/Olap/OLAP/ and s/Tpcc/TPCC/ throughout the file - I think we tend to either capitalize all letters in an acronym or none.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 42 at r3 (raw file):
} // run sets up TPCC then runs both TPCC plus TPCH OLAP queries involving the
nit: probably s/involving the/involving them/.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 46 at r3 (raw file):
// quality of service, then with OLAP queries using `regular` transaction // quality of service. The results are logged and compared. The test fails if // using `background` QoS doesn't result in a tpmC score from TPCC which is at
nit: s/which is at least/at least/.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 60 at r3 (raw file):
noThrottleTpmC = make([]float64, 0, s.numRuns) noThrottleTpccEfficiency = make([]float64, 0, s.numRuns) var tpmCQos, tpmCNoQoS float64
nit: s/Qos/QoS/ throughout the file for consistency.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 61 at r3 (raw file):
noThrottleTpccEfficiency = make([]float64, 0, s.numRuns) var tpmCQos, tpmCNoQoS float64 var efficiencyCQos, efficiencyNoQoS float64
nit: s/CQos/QoS/.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 67 at r3 (raw file):
crdbNodes, workloadNode := s.setupDatabases(ctx, t, c) s.runTpcc(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false)
nit: it's good to add an inline comment for constant arguments like false /* useBackgroundQoS */.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 135 at r3 (raw file):
if percentImprovement < maxAllowedRegression { ttycolor.Stdout(ttycolor.Red) failMessage :=
nit: seems like this and the next lines should be squashed.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 207 at r3 (raw file):
response := mustGetMetrics(t, adminURL, start, end, admissionMetrics) // Drop the first two minutes of datapoints as a "ramp-up" period.
nit: seems like a leftover?
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 256 at r3 (raw file):
) { m := c.NewMonitor(ctx, crdbNodes) // Kick off TPC-H with concurrency
nit: missing period.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go, line 394 at r3 (raw file):
CPUs: 4, Concurrency: 32, OlapConcurrency: 64,
This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false; - Q1 will be burning up CPU while staying within memory limits.
msirek
left a comment
There was a problem hiding this comment.
Thanks for the input. Right now I'm blocked a bit because printAdmissionMetrics is not able to retrieve metrics from the cluster. Any advice on what looks wrong in this code is appreciated.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 36 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/Olap/OLAP/ands/Tpcc/TPCC/throughout the file - I think we tend to either capitalize all letters in an acronym or none.
Done
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 42 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: probably
s/involving the/involving them/.
Removed involving the
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 46 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/which is at least/at least/.
It sounds better as is.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/Qos/QoS/throughout the file for consistency.
Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 61 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/CQos/QoS/.
Done
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 67 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: it's good to add an inline comment for constant arguments like
false /* useBackgroundQoS */.
Done
Code quote:
runTpcc(pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 135 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: seems like this and the next lines should be squashed.
Done
Code quote:
ttycolor.Stdout(ttycolor.Red)pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 207 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: seems like a leftover?
I don't know if I need this code yet. I am having trouble getting any metrics back at all, so I don't know if I need this code. I'll leave it in commented-out for now.
Code quote:
Drop the first two minutes of datapoints as apkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 256 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: missing period.
Added
Code quote:
Kick off TPC-H with concurrencypkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 394 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This concurrency might be a bit too high since it might push the cluster into the OOM territory. I'd probably drop it to 32. Or you had to bump it up to see the difference? If so, it might be better to only use TPCH Q1 and also do
SET CLUSTER SETTING sql.distsql.temp_storage.hash_agg.enabled = false;- Q1 will be burning up CPU while staying within memory limits.
Thanks for the advice, I'll try out different things. Right now the blocker for me is I'm unable to get any metrics at all in printAdmissionMetrics.
yuzefovich
left a comment
There was a problem hiding this comment.
Hm, I don't have any experience with those, so hopefully others can chime in.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @msirek and @sumeerbhola)
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):
Previously, msirek (Mark Sirek) wrote…
Sure. I was trying to follow our Go naming conventions because I thought I got review comments before indicating not to capitalize letters other than the first of an acronym. I have made this change.
I'm a big fan of consistency, and my nit came from the fact we had Qos and NoQoS previously - either Qos or QoS seems ok to me, but once picked, it should be used throughout the whole file.
msirek
left a comment
There was a problem hiding this comment.
Thanks. I'm talking with Tommy today, maybe he'll know the answer.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola and @yuzefovich)
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 60 at r3 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm a big fan of consistency, and my nit came from the fact we had
QosandNoQoSpreviously - eitherQosorQoSseems ok to me, but once picked, it should be used throughout the whole file.
Good point, thanks.
cucaroach
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @msirek, @sumeerbhola, and @yuzefovich)
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 64 at r4 (raw file):
var baselineTpmC, baselineEfficiency float64 for i := 0; i < s.numRuns; i++ {
numRuns is possibly redundant with roachtest --count
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 84 at r4 (raw file):
startNoQoS := timeutil.Now() s.runTPCCAndOLAPQueries(ctx, t, c, crdbNodes, workloadNode, histogramsPath, false /* useBackgroundQoS */)
Do we want to run 3 workloads back to back? We could also run 3 separate tests and not worry about having a pass fail. The semantics of restarting a workload are unknown to me, like does it try to insert the same records on subsequent runs and get a bunch of collisions? Consider also just leaving tpcc running the whole time and do a couple minutes
w/o tpch, a couple minutes w/ noqos, a couple minutes w/ qos and then a washout period at the end again with no tpch. Not sure how to get instantaneous tpmc metrics but maybe they can be copied and analyzed while the test is running?
pkg/cmd/roachtest/tests/qos_tpcc_background_olap.go line 392 at r4 (raw file):
{ CPUs: 4, Concurrency: 32,
I think in order for admission control queuing to occur you need sustained goroutines per cpu to be over 40. Check the overload page in the admin. I'd guess concurrency needs to be higher.
|
Try changing |
Thanks! |
This commit adds a new roachtest which tests admission control QoS by
running TPC-C concurrently with TPC-H OLAP queries, once with the OLAP
session having session variable `default_transaction_quality_of_service`
set to `background` and once with it set to `regular`. More often than
not this test should show a slight improvement using `background` QoS,
but since results are variable, the test passes if the TPC-C
transactions per minute regresses no more than 5% when the OLAP session
has `background` QoS.
The test can be run with the commands:
```
make bin/workload
make bin/roachtest
build/builder.sh mkrelease
roachtest run qos
```
Example abbreviated output:
```diff
...
15:56:34 util.go:52: up-replication complete
15:56:34 tpcc.go:160: test status: loading fixture
15:57:02 tpcc.go:176: test status:
15:57:02 qos_tpcc_background_olap.go:135: test status: loading TPCH tables
15:58:12 qos_tpcc_background_olap.go:181: test worker status: running tpcc
15:58:12 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16 with background quality of service
16:02:15 cluster.go:659: test status: getting perf/stats_qos.json
16:02:15 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats_qos.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats_qos.json
16:02:17 cluster.go:659: test status:
16:02:17 qos_tpcc_background_olap.go:181: test worker status: running tpcc
16:02:17 qos_tpcc_background_olap.go:166: test status: running TPCH with concurrency of 16
16:06:20 cluster.go:659: test status: getting perf/stats.json
16:06:20 cluster_synced.go:1595: msirek-1650297118-01-n4cpu8: getting (scp) perf/stats.json artifacts/qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8/run_1/stats.json
16:06:22 cluster.go:659: test status:
16:06:22 qos_tpcc_background_olap.go:231:
16:06:22 qos_tpcc_background_olap.go:232: TPCC results with OLAP queries running simultaneously
16:06:22 qos_tpcc_background_olap.go:233: -----------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 96.88, tpmC: 366.33
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 939.5 2550.1 2818.6 3623.9 5368.7 delivery
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.1 570.4 1476.4 1811.9 2415.9 4026.5 newOrder
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.5 176.2 385.9 453.0 637.5 738.2 orderStatus
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.2 335.5 906.0 1208.0 1610.6 2550.1 payment
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 96.5 419.4 536.9 1140.9 1811.9 stockLevel
16:06:22 qos_tpcc_background_olap.go:235:
16:06:22 qos_tpcc_background_olap.go:236: TPCC results with OLAP queries running simultaneously with background QoS
16:06:22 qos_tpcc_background_olap.go:237: -------------------------------------------------------------------------
16:06:22 qos_tpcc_background_olap.go:243: Duration: 2m59., Warehouses: 30, Efficiency: 98.11, tpmC: 371.00
16:06:22 qos_tpcc_background_olap.go:245: _elapsed___ops/sec(cum)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 1476.4 3489.7 4026.5 5100.3 6710.9 delivery
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.2 805.3 1946.2 2281.7 3087.0 4295.0 newOrder
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.5 209.7 436.2 570.4 704.6 872.4 orderStatus
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 6.0 503.3 1275.1 1543.5 2080.4 4295.0 payment
16:06:22 qos_tpcc_background_olap.go:254: 180.0s 0.6 192.9 453.0 486.5 1476.4 2013.3 stockLevel
16:06:22 qos_tpcc_background_olap.go:239:
16:06:22 qos_tpcc_background_olap.go:96: tpmC_No_QoS: 366.33 Efficiency: 96.88
16:06:22 qos_tpcc_background_olap.go:98: tpmC_Background_QoS: 371.00 Efficiency: 98.11
+ 16:06:22 qos_tpcc_background_olap.go:119: SUCCESS: TPCC run in parallel with OLAP queries using background QoS
+ had a tpmC score 1.3% higher than with regular QoS.
16:06:22 test_runner.go:886: tearing down after success; see teardown.log
--- PASS: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (751.23s)
16:06:22 test_runner.go:628: [w0] test passed: qos/tpcc_background_olap/nodes=3/cpu=8/w=30/c=8 (run 1)
...
```
Release note: none
|
Note: double-check metamorphic settings aren't in use |
|
Mark Sirek seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |





This commit adds a new roachtest which tests admission control QoS by
running TPC-C concurrently with TPC-H OLAP queries, once with the OLAP
session having session variable
default_transaction_quality_of_serviceset to
backgroundand once with it set toregular. More often thannot this test should show a slight improvement using
backgroundQoS,but since results are variable, the test passes if the TPC-C
transactions per minute regresses no more than 5% when the OLAP session
has
backgroundQoS.The test can be run with the commands:
Example abbreviated output:
Release note: none