sql: add multitenant fairness tests #77481
Conversation
sumeerbhola
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @tbg)
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 63 at r1 (raw file):
} // Test that the kvserver fairly distributes CPU token on a highly concurrent 4 MT workload.
4 MT?
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 146 at r1 (raw file):
fmt.Printf("Max latency delta: %d%% %v\n", int(maxDeltaPercentage*100), meanLatencies) // TODO: Can we do anything to verify that admission control kicked in? Check some stats?
What does the admin console look like when this is run, just as a rough sanity check?
Is there enough concurrency from each tenant to saturate some resource? I thought we would need --concurrency to be high (e.g. see roachtest/tests/kv.go).
Also, if each tenant has the same concurrency, I suspect they will get similar allocation even without admission control. We could manually turn off admission control and see if there is any change in the observed throughput for each tenant. An alternative would be to have tenants with different concurrency values, say tenant1 with C concurrency, tenant2 with 2C, tenant3 with 4C, tenant4 with 8*C, such that C is enough to use significantly more than 25% of the resource, and then see if everyone gets equal throughput.
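The skewed-concurrency scheme described here (C, 2C, 4C, 8C) can be sketched as a small standalone snippet; the base value and function name are illustrative, not from the PR:

```go
package main

import "fmt"

// concurrencyForTenant returns the workload --concurrency for tenant i
// (1-indexed) under the doubling scheme described above: C, 2C, 4C, 8C.
// base is the hypothetical C; it should be high enough that one tenant
// alone can consume well over 25% of the contended resource.
func concurrencyForTenant(i, base int) int {
	return base << (i - 1) // base * 2^(i-1)
}

func main() {
	const base = 64 // placeholder value for C
	for i := 1; i <= 4; i++ {
		fmt.Printf("tenant%d: --concurrency=%d\n", i, concurrencyForTenant(i, base))
	}
}
```

With perfect fairness, all four tenants should still see roughly equal throughput despite the skewed demand.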
cucaroach
left a comment
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 63 at r1 (raw file):
Previously, sumeerbhola wrote…
4 MT?
Sorry, what terminology should I use? 4 SQL pods?
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 146 at r1 (raw file):
Previously, sumeerbhola wrote…
What does the admin console look like when this is run, just as a rough sanity check?
Is there enough concurrency from each tenant to saturate some resource? I thought we would need --concurrency to be high (e.g. see roachtest/tests/kv.go).
Also, if each tenant has the same concurrency, I suspect they will get similar allocation even without admission control. We could manually turn off admission control and see if there is any change in the observed throughput for each tenant. An alternative would be to have tenants with different concurrency values, say tenant1 with C concurrency, tenant2 with 2C, tenant3 with 4C, tenant4 with 8*C, such that C is enough to use significantly more than 25% of the resource, and then see if everyone gets equal throughput.
What part of the console? The kv server looks very unbusy, since the console is biased towards SQL activity. It only shows 25% CPU usage, but I know from top et al. that the CPU is tapped out. Each workload instance has the default concurrency of 8, so there's 32 concurrency in total. I can see from custom metrics that there are 40 kv slots, so I'll try to crank up concurrency to see if I can get any queuing to happen. Then I'll play with varying concurrency per tenant. Thanks for the suggestions!
sumeerbhola
left a comment
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 146 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
What part of the console? The kv server looks very unbusy, since the console is biased towards SQL activity. It only shows 25% CPU usage, but I know from top et al. that the CPU is tapped out. Each workload instance has the default concurrency of 8, so there's 32 concurrency in total. I can see from custom metrics that there are 40 kv slots, so I'll try to crank up concurrency to see if I can get any queuing to happen. Then I'll play with varying concurrency per tenant. Thanks for the suggestions!
the overload page should show queuing in admission control if we are overloading cpu (higher runnable goroutine count) or causing higher read amp in the store.
sumeerbhola
left a comment
The mean delay is < 150us, and the 75th percentile is ~10ms. The slots don't look anywhere close to full. There must be some tiny spikes that are not caught by the low sampling frequency of the metric, which is why admission delay is not 0. What does the runnable goroutine count on the overload dashboard look like? I think we need significantly more load.
I think the machine is so overloaded that we're adjusting the number of available slots down? Isn't the dark line in the slots graph the number of slots we think we can handle? The CPU is pegged at 100% (this is a 4-CPU machine running 4 workload processes, 4 SQL pods, and 1 kvserver). Maybe going from 32 to 400 was too big a jump in concurrency; I'll experiment some more...
sumeerbhola
left a comment
I think the machine is so overloaded that we're adjusting the number of available slots down? Isn't the dark line in the slots graph the number of slots we think we can handle?
I have no idea what the dark line is. What do the labels say when you hover over it? I doubt we would see such low admission control latency if the slots were truly scarce.
The CPU is pegged at 100% (this is a 4-CPU machine running 4 workload processes, 4 SQL pods, and 1 kvserver). Maybe going from 32 to 400 was too big a jump in concurrency; I'll experiment some more...
We shouldn't run the kv server (or storage server in the terminology doc) on a shared machine with the others. It's going to add noise due to cpu fluctuations caused by the cpu usage in the SQL pods and the workload processes.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 63 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Sorry, what terminology should I use? 4 SQL pods?
yes, that's fine https://docs.google.com/presentation/d/1IvqkP3EOiKLPPf7lPsMpJsoUk6hcaaDm4xg5VqJ97TI/edit#slide=id.g1125de19347_0_185
tbg
left a comment
Gave an initial review with some high-level suggestions. Sorry this took so long, I was out on PTO until earlier this week and had been catching up.
Reviewed 3 of 3 files at r1, all commit messages.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 38 at r1 (raw file):
Run: func(ctx context.Context, t test.Test, c cluster.Cluster) { runMultiTenantFairness(ctx, t, c, *t.BuildVersion(), 10, // init duration
I'd introduce a multiTenantFairnessOptions struct to hold these values. It also seems like you can just have a BlockSize parameter, since all uses currently specify min=max.
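A rough sketch of what such an options struct could look like (field names here are my guesses, not the PR's final code); note the single blockSize field replacing the min/max pair, since all callers pass min=max:

```go
package main

import (
	"fmt"
	"time"
)

// multiTenantFairnessOptions gathers the per-test knobs that were
// previously passed as positional arguments. Hypothetical field names.
type multiTenantFairnessOptions struct {
	name         string
	initDuration time.Duration
	duration     time.Duration
	blockSize    int                  // all callers use min=max, so one field suffices
	concurrency  func(tenant int) int // per-tenant workload concurrency
}

func main() {
	opts := multiTenantFairnessOptions{
		name:         "concurrency-skew",
		initDuration: 10 * time.Second,
		duration:     5 * time.Minute,
		blockSize:    64,
		concurrency:  func(i int) int { return i * 8 },
	}
	fmt.Printf("%s: tenant 3 runs with --concurrency=%d\n", opts.name, opts.concurrency(3))
}
```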
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 69 at r1 (raw file):
c cluster.Cluster, v version.Version, initDuration int,
Use time.Duration here or the caller can invoke this with time.Second expecting to get a second but it's really going to be 1E9 seconds. The workload flags are also time.Duration so you shouldn't need to perform any conversions.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 100 at r1 (raw file):
// Init kv on each tenant. var wg sync.WaitGroup
Throughout: Use a c.NewMonitor, not a WaitGroup. A WaitGroup is oblivious to the cluster becoming unavailable and will cause test hangs that the test harness has a hard time recovering from. Monitor does the right thing.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 105 at r1 (raw file):
go func(pgurl string) { defer wg.Done() cmd := fmt.Sprintf("./cockroach workload run kv --min-block-bytes %d --max-block-bytes %d --splits 10 --init --batch 100 --duration=%ds %s", minBlockSize, maxBlockSize, initDuration, pgurl)
nit: I think it's deprecated to use run --init. If you're just initializing, ./cockroach workload init kv is the right thing to use.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 122 at r1 (raw file):
wg.Wait() // Pull workload performance from crdb_internal.statement_statistics.
I think this is one of the tests that would really benefit from using prometheus instead of ad-hoc metrics stuff. For a recent example, see https://github.com/cockroachdb/cockroach/pull/76147/files#diff-28a39375143eb2ad1e76868f4086035dc8099dc89f98658c3f62f52987945b1eR139-R152. There are also the multi-region DRT tests, drt.go, which do something similar (but that's where I learned from, so my code is probably more easily digested for your use case).
The main reasons I am suggesting prometheus here are that a) you can test whatever metrics you like without any limits and b) if the test fails, you can easily go through the metrics and figure out exactly what was wrong, since there'll be a prometheus snapshot in the artifacts.
Here are the results from the latest full run; some mixed signals, but overall, admission control on looks like a win. Going to look at some metrics for all these to make sure we're really hitting overload conditions.
I'm at a loss on how to turn these tests into reliable pass/fail-style tests and am leaning towards just publishing the numbers so we can track over time how we're doing; any thoughts on this would be much appreciated. I really don't want to add a flaky test that needs constant attention...
That would be my suggestion as well. I would think of these as performance tests, so that data accumulates over time and improvements/regressions can be identified. @stevendanna could you help Tommy figure out how to do that in case he needs help (which I assume he will)? I also wonder if we wrote down the basics somewhere (I'd like to read that myself, because I'm shaky on what exactly we support and where changes need to be made), for example in our knowledge base: https://cockroachlabs.atlassian.net/wiki/spaces/TE/pages/2208629091/Knowledge+Base
sumeerbhola
left a comment
Numbers are quite nice!
I am wondering about the following size-skew.
admission-control-on/kv-fairness/size-skew/run_1/test.log
27:19:03:22 multitenant_fairness.go:236: Max throughput delta: 14% [2.249276e+06 2.351748e+06 2.490747e+06 1.923401e+06]
28:19:03:22 multitenant_fairness.go:247: Max latency delta: 10% [0.0034406955788691936 0.0038719533858636 0.003989169634826118 0.004103228254822516]
is the one with the lowest throughput the one with the largest size? Could it be encountering more hits when reading keys (vs the key not being found)? Also, I am confused how the default hashGenerator for the kv workload behaves when the read percent is 100. Isn't it going to keep reading key 0 since it hasn't seen any writes?
Reviewed 1 of 3 files at r1.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 54 at r3 (raw file):
name: "concurrency-skew", warmUpDuration: func(i int) time.Duration { return 5 * time.Minute }, concurrency: func(i int) int { return i * 8 },
looks like this will be called with i in [1,5]. Is that correct? So lowest concurrency is 8 and highest is 40.
And we are using a 4 cpu KV node?
Do you know if 4 tenants with a concurrency of 8 each are sufficient to saturate cpu for this no writes (kv-fairness) case, and sufficient to saturate the store in the only writes (store-fairness) case below?
Perhaps the delta number below is because a concurrency of 8 is too low to utilize 25% of the store capacity.
admission-control-on/store-fairness/concurrency-skew/run_1/test.log
27:19:04:54 multitenant_fairness.go:236: Max throughput delta: 37% [4335 7058 7875 8621]
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 90 at r3 (raw file):
{ name: "size-skew", warmUpDuration: func(i int) time.Duration { return time.Duration(i*2) * time.Minute },
Do you expect initial size skew to matter in this case that does only writes? I am assuming the writes after setup are not limited to overwriting the existing keys, yes?
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 250 at r3 (raw file):
} // TODO: Can we do anything to verify that admission control kicked in? Check some stats?
there are no tables with admission control stats. Only metrics. We may be able to do something with prometheus wrt doing an aggregation over time. I am fine with leaving a TODO if you've eyeballed these runs and seen admission control kick in.
cucaroach
left a comment
Yes, although there's no clear relationship. I wonder if I should use --max-ops to get an exact size distribution instead of using time, which is probably noisy. I will dig into how the reads behave; I have wondered about that. I also wonder if I should simplify this and do just one kv run invocation, concurrently resetting the stats after the warmup period expires. That should eliminate a lot of pitfalls. Then maybe instead of doing a duration-based size skew I can skew the size based on min/max block sizes.
a discussion (no related file):
Thanks for the review feedback! Here's version 2, I've moved the pods/workloads to their own nodes and expanded things a bit to see how AC handles heterogenous concurrency and database size. Still a work in progress and need to analyze some runs and how well this is exercising things. Any and all feedback welcome!
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 122 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think this is one of the tests that would really benefit from using prometheus instead of ad-hoc metrics stuff. For a recent example, see https://github.com/cockroachdb/cockroach/pull/76147/files#diff-28a39375143eb2ad1e76868f4086035dc8099dc89f98658c3f62f52987945b1eR139-R152. There are also the multi-region DRT tests, drt.go, which do something similar (but that's where I learned from, so my code is probably more easily digested for your use case). The main reasons I am suggesting prometheus here are that a) you can test whatever metrics you like without any limits and b) if the test fails, you can easily go through the metrics and figure out exactly what was wrong, since there'll be a prometheus snapshot in the artifacts.
Sounds nice. What can I do with a prometheus snapshot? Can that encompass any metrics from the db, or is it just for metrics I produce from my test?
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 54 at r3 (raw file):
Previously, sumeerbhola wrote…
looks like this will be called with i in [1,5]. Is that correct? So lowest concurrency is 8 and highest is 40.
And we are using a 4 cpu KV node?
Do you know if 4 tenants with a concurrency of 8 each are sufficient to saturate cpu for this no writes (kv-fairness) case, and sufficient to saturate the store in the only writes (store-fairness) case below?
Perhaps the delta number below is because a concurrency of 8 is too low to utilize 25% of the store capacity.
admission-control-on/store-fairness/concurrency-skew/run_1/test.log 27:19:04:54 multitenant_fairness.go:236: Max throughput delta: 37% [4335 7058 7875 8621]
Here's what CPU % looks like for admission-control-on/kv-fairness/same:
Maybe not fully saturated but close...
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 90 at r3 (raw file):
Previously, sumeerbhola wrote…
Do you expect initial size skew to matter in this case that does only writes? I am assuming the writes after setup are not limited to overwriting the existing keys, yes?
Hmm, since I'm not passing write-seq, I guess it's possible we're overwriting existing keys if the seed is stable. One more reason to get rid of the multi-pass approach and just have one workload run.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 250 at r3 (raw file):
Previously, sumeerbhola wrote…
there are no tables with admission control stats. Only metrics. We may be able to do something with prometheus wrt doing an aggregation over time. I am fine with leaving a TODO if you've eyeballed these runs and seen admission control kick in.
Seems like it's kicking in a little. I'll try to crank it up a bit.
sumeerbhola
left a comment
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 54 at r3 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Here's what CPU % looks like for admission-control-on/kv-fairness/same:
Maybe not fully saturated but close...
it is not close enough for admission control to kick in for cpu. The default for admission.kv_slot_adjuster.overload_threshold (for runnable goroutines) is 32.
pkg/cmd/roachtest/tests/multitenant_fairness.go, line 250 at r3 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Seems like its kicking in a little. I'll try to crank it up a bit.
We do want the admission delay rate to consistently stay positive.
And the delay rate divided by the work rate (i.e., the mean admission latency) should be at least a few ms. Even the 120k spike in the admission delay rate above, when divided by the 20K work rate, is 6 micros, which is too low.
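The arithmetic in that comment, written out explicitly (units as I read them: the delay rate is µs of admission queueing accumulated per second, the work rate is requests admitted per second):

```go
package main

import "fmt"

func main() {
	delayRate := 120000.0 // µs of admission delay accumulated per second (the observed spike)
	workRate := 20000.0   // requests admitted per second
	meanLatencyMicros := delayRate / workRate
	fmt.Printf("mean admission latency: %.0f µs\n", meanLatencyMicros)
	// ~6 µs per request is effectively no queueing; a test that truly
	// exercises admission control should see a few ms here instead.
}
```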
Here are the latest results, which I think look pretty good.
@cucaroach these are older than the ones you shared on slack on April 29th, yes?
After bumping up kv/same concurrency to 250 we get more impressive throughput improvements, but I think the tput_delta is just too consistently low for no-admission. We can hang our hats on the concurrency-skew tput_delta looking good, though.
Rebased and ready to go, I think.
Looks good! "max_tput_delta": 0.129229 is certainly better than "max_tput_delta": 0.457215.
sumeerbhola
left a comment
Reviewed 3 of 7 files at r11, 1 of 6 files at r12, 1 of 2 files at r14, all commit messages.
pkg/cmd/roachtest/tests/multitenant_fairness.go line 54 at r14 (raw file):
{ name: "concurrency-skew", concurrency: func(i int) int { return i * 150 },
this should be i * 250.
We've verified that a concurrency of 250 is enough to consume 25% of the cpu when running "same" with no-admission, since it causes enough queueing in the goroutine scheduler. In the skewed case we want the smallest tenant to continue to be as demanding: with perfect fairness it would get 25%.
pkg/cmd/roachtest/tests/multitenant_fairness.go line 83 at r14 (raw file):
{ name: "concurrency-skew", concurrency: func(i int) int { return i * 20 },
same question. Why not i * 50.
Here are the latest results with beta.5. I don't know what happened, but for store at least, no-admission looks better. Any thoughts for digging into that? I think this happened when we jacked up the operation size to get more LSM L0 files; I'll revert those changes and see what happens.
Actually, reverting the value size changes didn't change much. I keep forgetting that store admission control can and should lower throughput to prevent the LSM from getting unhappy, although it makes me queasy to take performance off the table. I wonder what would happen if we ran this test for days; would the LSM implode at some point with too many L0 files? What would that look like?
I think we'd need to periodically log the slots and tokens allocated to each tenant to debug this.
The node will probably fail heartbeats, or other badness would happen. Tolerating a very poorly shaped LSM in order to get higher write throughput does not work out in a production setting.
Just pointing out
sumeerbhola
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @cucaroach, @sumeerbhola, and @tbg)
Informs: cockroachdb#65954

Roachtests intended to validate that the kv and store admission control
queues distribute database resources fairly. There are 8 tests: 2 for
"kv", i.e. CPU stressing, and 2 for "store", which are intended to
stress the LSM. One test is "same", where each of N SQL pods hits the
kvserver with equal concurrency, and another is "concurrency-skew",
which varies the concurrency from each pod.

We measure the variation in throughput across the pods
("max_tput_delta") and the max/min throughput and latency across the
pods. Sample results:

```
multitenant/fairness/kv/concurrency-skew/admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.340925, "max_tput": 483.200000, "min_tput": 256.056667, "max_latency": 3.282347, "min_latency": 0.771148}
multitenant/fairness/kv/concurrency-skew/no-admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.330760, "max_tput": 205.740000, "min_tput": 108.903333, "max_latency": 7.151178, "min_latency": 1.618236}
multitenant/fairness/kv/same/admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.245294, "max_tput": 293.990000, "min_tput": 197.026667, "max_latency": 0.831686, "min_latency": 0.762475}
multitenant/fairness/kv/same/no-admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.031199, "max_tput": 132.443333, "min_tput": 124.676667, "max_latency": 1.915801, "min_latency": 1.776664}
multitenant/fairness/store/same/admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.018095, "max_tput": 139.950000, "min_tput": 136.336667, "max_latency": 0.346295, "min_latency": 0.341212}
multitenant/fairness/store/same/no-admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.001886, "max_tput": 149.296667, "min_tput": 148.878333, "max_latency": 0.306853, "min_latency": 0.303392}
multitenant/fairness/store/concurrency-skew/admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.024872, "max_tput": 143.875000, "min_tput": 138.346667, "max_latency": 1.094262, "min_latency": 0.346674}
multitenant/fairness/store/concurrency-skew/no-admission/run_1/1.perf/stats.json
{ "max_tput_delta": 0.005439, "max_tput": 148.848333, "min_tput": 147.500000, "max_latency": 1.007741, "min_latency": 0.313597}
```

Release note: None
bors r+ |
Build succeeded: |



