roachtest: harmonize GCE, AWS, Azure machine types #117852
craig[bot] merged 1 commit into cockroachdb:master
I've manually tested most Azure machine types by randomizing
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/cluster_test.go line 467 at r1 (raw file):
}
func TestAzureMachineType(t *testing.T) {
Would this be better as a datadriven test? It would be nice not to have to replicate the same logic from the code, and instead to have a smaller number of targeted test cases that can be visually inspected.
I would make a test where each test case takes an arch and a memory spec as input, and the output is the chosen machine type for all clouds and all CPU sizes. They can be tabulated, e.g. something like:
cpus | 1,2,4,8       | 16,32,64     | 96,128
----------------------------------------------
GCE  | n2-standard-X | n-standard-X | ..
AWS  | ..            | ..           |
It would be a useful visual inspection of how they map to each other.
pkg/cmd/roachtest/cluster_test.go line 484 at r1 (raw file):
testCases = append(testCases, machineTypeTestCase{1, mem, false, arch, fmt.Sprintf("Standard_%s", strings.Replace(series, "?", strconv.Itoa(2), 1)), arch})
for i := 2; i <= 96; i *= 2 {
96 isn't a power of two, so this loop will stop at 64.
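A minimal standalone sketch of the point above: a doubling loop bounded by 96 never reaches 96 itself. The helper name `cpuCounts` is hypothetical and exists only for illustration; it is not part of the PR.

```go
package main

import "fmt"

// cpuCounts (hypothetical helper) collects the values visited by a loop of
// the form `for i := 2; i <= limit; i *= 2`, as in the quoted test code.
func cpuCounts(limit int) []int {
	var counts []int
	for i := 2; i <= limit; i *= 2 {
		counts = append(counts, i)
	}
	return counts
}

func main() {
	// After 64 the next value would be 128, which exceeds 96,
	// so 96 itself is never visited.
	fmt.Println(cpuCounts(96)) // prints [2 4 8 16 32 64]
}
```

If 96 vCPUs needs coverage, it has to be added as an explicit case rather than relying on the loop bound.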
pkg/cmd/roachtest/cluster_test.go line 663 at r1 (raw file):
// n2-highcpu-128 amd64
}
func ExampleSelectAWSMachineType() {
These kinds of tests are hard to update; it's better to write a datadriven test. The datadriven "command" can just be the cloud type.
renatolabs
left a comment
Agree with Radu's comment that datadriven would be a nicer experience than using Go Examples (separate test and output files, well understood/supported rewrite dev flag, etc).
Otherwise, this looks great!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @srosenberg)
Converted "example" tests into datadriven, using suggested tabulation. Much more readable, thanks! PTAL.
Previously, the same (performance) roachtest executed in GCE and AWS may have used a different memory (per CPU) multiplier and/or CPU family, e.g., Cascade Lake vs. Ice Lake. In the best case, this resulted in different performance baselines on an otherwise equivalent machine type. In the worst case, this resulted in OOMs due to VMs in AWS having half as much memory per CPU.

This change harmonizes GCE and AWS machine types by making them as isomorphic as possible with respect to memory, CPU family and price. The following heuristics are used depending on the specified MemPerCPU: Standard yields 4GB/cpu, High yields 8GB/cpu, Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu. Low is supported only in GCE. Consequently, n2-standard maps to m6i, n2-highmem maps to r6i, n2-custom maps to c6i, modulo local SSDs in which case m6id is used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake; isomorphic AWS machine types are exclusively on Ice Lake.

Roachprod is extended to show CPU family and architecture on List. Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted in [2]. Since [1], `SelectAzureMachineType` has been added. MemPerCPU is preserved across all three cloud providers. However, when mem is Auto (default) and cpus > 80, we switch to AMD Milan, both in GCE and AWS, but not Azure. (The latter doesn't support 2GB per AMD CPU.)

For complete lists of machine types see `ExampleXXXMachineType`.

[1] cockroachdb#111140
[2] cockroachdb#111633

Epic: none
Fixes: cockroachdb#106570
Release note: None
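The MemPerCPU heuristics in the description above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the type, the constant names and the function `memGBPerCPU` are assumptions, and Low (GCE only) is left out.

```go
package main

import "fmt"

// MemPerCPU is a hypothetical stand-in for the memory setting named in the
// description. Low is omitted because it is supported only in GCE.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
)

// memGBPerCPU (hypothetical) returns the GB of memory per vCPU implied by the
// heuristics in the description: Standard is 4, High is 8, and Auto is 4 up
// to and including 16 vCPUs, then 2.
func memGBPerCPU(mem MemPerCPU, cpus int) int {
	switch mem {
	case Standard:
		return 4
	case High:
		return 8
	default: // Auto
		if cpus <= 16 {
			return 4
		}
		return 2
	}
}

func main() {
	for _, cpus := range []int{4, 16, 32} {
		perCPU := memGBPerCPU(Auto, cpus)
		fmt.Printf("Auto, %d vCPUs: %dGB/cpu, %dGB total\n", cpus, perCPU, perCPU*cpus)
	}
}
```

Keeping the per-CPU multiplier in one function is what lets each cloud's selector pick a family with the same memory ratio for a given spec.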
TFTR! bors r=renatolabs
Build succeeded:
blathers backport 23.1 23.2
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

[1] cockroachdb#117852

Epic: none
Release note: None
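A rough sketch of the fallback described in this commit message, under stated assumptions: `pickAWSFamily` is a hypothetical helper, not the real `SelectAWSMachineType`, and the branch for 80 vCPUs or fewer is collapsed to the m6i/m6id pair purely for illustration.

```go
package main

import "fmt"

// pickAWSFamily (hypothetical) returns the chosen family and whether the
// local-SSD request is honored. Only the >80 vCPU fallback follows the
// commit message; the other branch is a simplification.
func pickAWSFamily(cpus int, shouldSupportLocalSSD bool) (family string, localSSD bool) {
	if cpus > 80 {
		// c6a (AMD Milan) doesn't support local SSDs, so the request
		// is ignored rather than treated as an error.
		return "c6a", false
	}
	if shouldSupportLocalSSD {
		return "m6id", true
	}
	return "m6i", false
}

func main() {
	for _, cpus := range []int{8, 96} {
		family, ssd := pickAWSFamily(cpus, true)
		fmt.Printf("cpus=%d family=%s localSSD=%t\n", cpus, family, ssd)
	}
}
```

Returning the effective local-SSD setting alongside the family makes the dropped request visible to the caller instead of leaving it implicit.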
In [1], we switched to azure `v5` machine series. Some of these newer machine types do not support hypervisor generation 1. By hardcoding generation 2, we effectively broke backward compatibility with older machine types. As of this change, the hypervisor generation is dynamically selected based on the machine type (see `imageSKU`).

[1] cockroachdb#117852

Epic: none
Release note: None
120172: roachprod(azure): use machine type to determine hypervisor generation r=DarrylWong a=srosenberg

In [1], we switched to azure `v5` machine series. Some of these newer machine types do not support hypervisor generation 1. By hardcoding generation 2, we effectively broke backward compatibility with older machine types. As of this change, the hypervisor generation is dynamically selected based on the machine type (see `imageSKU`).

[1] #117852

Epic: none
Release note: None

120205: server, ccl, sql: skip recent failures r=abarganier a=dhartunian

Epic: None
Release note: None

120225: kvserver: move some tests to heavier pools under `race`, `deadlock` r=celiala a=rickystewart

Epic: CRDB-8308
Release note: None

120229: release: released CockroachDB version 24.1.0-alpha.2. Next version: 24.1.0-alpha.3 r=DarrylWong a=cockroach-teamcity

Release note: None
Epic: None
Release justification: non-production (release infra) change.

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Justin Beaver <teamcity@cockroachlabs.com>
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

We also bump `EstimatedMaxGCE` and `EstimatedMaxAWS` (both empirically derived) for `tpccbench/nodes=9/cpu=4/multi-region` in order to reduce the number of steps during the line search. Otherwise, the test has been seen timing out, owing in large part to being executed on Ice Lake vs. Cascade Lake (prior to [1]).

[1] cockroachdb#117852

Epic: none
Release note: None
119900: roachtest: SelectAWSMachineType should fall back to `c6a` without loc… r=herkolategan,renatolabs a=srosenberg

…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

[1] #117852

Epic: none
Release note: None

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>