roachtest: harmonize GCE, AWS, Azure machine types by srosenberg · Pull Request #117852 · cockroachdb/cockroach

srosenberg · 2024-01-17T05:38:25Z

Previously, the same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
CPU family, e.g., Cascade Lake vs. Ice Lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having half the memory per CPU.

This change harmonizes GCE and AWS machine types by making them
as isomorphic as possible with respect to memory, CPU family, and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.
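
As a rough illustration of the memory heuristic and family pairing described above, here is a minimal Go sketch. The type, constant, and function names (`MemPerCPU`, `memGBPerCPU`, `awsFamilyFor`) are assumptions made for this example and are not taken from the roachtest code; the ratio for Low is not stated here, so the sketch reports it as unknown.

```go
package main

import "fmt"

// MemPerCPU mirrors the four settings named in the description.
// The names are illustrative, not the actual roachtest identifiers.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
	Low
)

// memGBPerCPU returns the GB of RAM per vCPU under the stated heuristics.
// ok is false for Low, whose ratio the description does not give.
func memGBPerCPU(mem MemPerCPU, cpus int) (gb int, ok bool) {
	switch mem {
	case Standard:
		return 4, true
	case High:
		return 8, true
	case Auto:
		if cpus <= 16 { // "up to and including 16 vCPUs"
			return 4, true
		}
		return 2, true
	default:
		return 0, false
	}
}

// awsFamilyFor records the GCE-to-AWS family pairs named in the description
// (without local SSDs).
var awsFamilyFor = map[string]string{
	"n2-standard": "m6i",
	"n2-highmem":  "r6i",
	"n2-custom":   "c6i",
}

func main() {
	for _, cpus := range []int{4, 16, 32} {
		gb, _ := memGBPerCPU(Auto, cpus)
		fmt.Printf("Auto, %d vCPUs: %d GB/cpu, %d GB total\n", cpus, gb, gb*cpus)
	}
	fmt.Println("n2-highmem pairs with", awsFamilyFor["n2-highmem"])
}
```

Under these assumptions, an Auto request for 32 vCPUs yields 64 GB in total, the same as a 16 vCPU request.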

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], SelectAzureMachineType has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)
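
The AMD Milan switch can be read as a small predicate. The sketch below is only an illustration of the rule as worded above; the function name `usesAMDMilan` and the lowercase cloud strings are assumptions, not identifiers from the PR.

```go
package main

import "fmt"

// usesAMDMilan reports whether a request lands on AMD Milan under the rule
// described above: mem is Auto, cpus > 80, and the cloud is GCE or AWS.
// Cloud names are illustrative lowercase strings.
func usesAMDMilan(cloud string, memAuto bool, cpus int) bool {
	if !memAuto || cpus <= 80 {
		return false
	}
	return cloud == "gce" || cloud == "aws"
}

func main() {
	for _, cloud := range []string{"gce", "aws", "azure"} {
		fmt.Printf("%s, Auto, 96 vCPUs: AMD Milan = %t\n", cloud, usesAMDMilan(cloud, true, 96))
	}
}
```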

For complete lists of machine types see ExampleXXXMachineType.

[1] #111140
[2] #111633

Epic: none
Fixes: #106570

Release note: None

cockroach-teamcity · 2024-01-17T05:38:32Z

This change is Reviewable

srosenberg · 2024-01-27T20:09:41Z

I've manually tested most Azure machine types by randomizing tpccbench,

  srosenberg-1705717757-02-n4cpu4sm                         [azure]     4     Standard_D4ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-03-n4cpu16                          [azure]     4    Standard_D16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-08-n4cpu16                          [azure]     4    Standard_D16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-11-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-12-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-13-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-14-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-15-n4cpu4hm                         [azure]     4    Standard_E4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-16-n12cpu4sm-geo                    [azure]     8    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705718037-01-n4cpu16sm                        [azure]     4   Standard_D16pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-04-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-05-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-09-n4cpu4sm                         [azure]     4     Standard_D4ds_v5                                    59m0s   11h27m6s
  srosenberg-1705718037-10-n4cpu4hm                         [azure]     4    Standard_E4pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-11-n4cpu16                          [azure]     4   Standard_D16pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-12-n4cpu16sm                        [azure]     4    Standard_D16ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-13-n4cpu4hm                         [azure]     4     Standard_E4ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-14-n4cpu4                           [azure]     4     Standard_D4ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-21-n10cpu4                          [azure]    10    Standard_D4pds_v5         arm64                      46m0s   11h27m6s
  srosenberg-1705720133-06-n4cpu8                           [azure]     4    Standard_D8pds_v5         arm64                      24m0s   12h27m6s
  srosenberg-1705720133-08-n4cpu16                          [azure]     4   Standard_D16pds_v5         arm64                      17m0s   12h27m6s
  srosenberg-1705720133-10-n4cpu4                           [azure]     4     Standard_D4ds_v5                                    21m0s   12h27m6s
  srosenberg-1705720133-12-n4cpu64                          [azure]     4   Standard_D64lds_v5                                    21m0s   12h27m6s
  srosenberg-1705720133-13-n4cpu64                          [azure]     4  Standard_D64plds_v5         arm64                      20m0s   12h27m6s
  srosenberg-1705720133-15-n4cpu16sm                        [azure]     4    Standard_D16ds_v5                                    18m0s   12h27m6s
  srosenberg-1705720133-17-n10cpu16hm                       [azure]     1   Standard_E16pds_v5         arm64                      18m0s   12h27m6s

srosenberg · 2024-01-30T06:34:41Z

Kicked off GCE and AWS runs with SELECT_PROBABILITY=0.5.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @renatolabs, and @srosenberg)

pkg/cmd/roachtest/cluster_test.go line 467 at r1 (raw file):

}

func TestAzureMachineType(t *testing.T) {

Would this be better as a datadriven test? It would be nice to not have to replicate the same logic from the code, and instead have a smaller number of targeted test cases that can be visually inspected.

I would make a test where each test case takes an arch and a memory spec as input, and the output is the chosen machine type for all clouds for all CPU sizes. They can be tabulated, e.g., something like:

cpus | 1,2,4,8       | 16,32,64     | 96,128
----------------------------------------------
GCE  | n2-standard-X | n-standard-X | ..
AWS  | ..            | ..           |

It would be a useful visual inspection of how they map to each other.
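
To make the suggested tabulation concrete, here is a hedged Go sketch that renders one row per cloud and one column per CPU count. The `selector` type and the two stand-in selectors are hypothetical; they only follow the public naming schemes of `n2-standard-N` and `m6i.<size>` and are not the PR's machine type selection functions.

```go
package main

import (
	"fmt"
	"strings"
)

// selector is a stand-in for a per-cloud machine type chooser.
type selector func(cpus int) string

// tabulate renders one row per cloud and one column per CPU count.
func tabulate(cpus []int, clouds []string, sel map[string]selector) string {
	var sb strings.Builder
	sb.WriteString("cpus")
	for _, c := range cpus {
		fmt.Fprintf(&sb, " | %-14d", c)
	}
	sb.WriteString("\n")
	for _, cloud := range clouds {
		fmt.Fprintf(&sb, "%-4s", cloud)
		for _, c := range cpus {
			fmt.Fprintf(&sb, " | %-14s", sel[cloud](c))
		}
		sb.WriteString("\n")
	}
	return sb.String()
}

// awsSize maps vCPU counts to AWS size suffixes for a few small sizes.
var awsSize = map[int]string{2: "large", 4: "xlarge", 8: "2xlarge", 16: "4xlarge"}

// gceStandard and awsStandard are illustrative selectors only.
func gceStandard(cpus int) string { return fmt.Sprintf("n2-standard-%d", cpus) }

func awsStandard(cpus int) string {
	size, ok := awsSize[cpus]
	if !ok {
		return "?"
	}
	return "m6i." + size
}

func main() {
	fmt.Print(tabulate(
		[]int{2, 4, 8, 16},
		[]string{"GCE", "AWS"},
		map[string]selector{"GCE": gceStandard, "AWS": awsStandard},
	))
}
```

A table like this could serve as the expected output of a datadriven test case, so that a change in the mapping shows up as a readable diff.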

pkg/cmd/roachtest/cluster_test.go line 484 at r1 (raw file):

			testCases = append(testCases, machineTypeTestCase{1, mem, false, arch,
				fmt.Sprintf("Standard_%s", strings.Replace(series, "?", strconv.Itoa(2), 1)), arch})
			for i := 2; i <= 96; i *= 2 {

96 isn't a power of two, this will stop at 64
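
The point can be checked in isolation. The helper below (a hypothetical name, `doublings`) collects every value visited by a loop of the same shape as the quoted one:

```go
package main

import "fmt"

// doublings returns every value visited by
// `for i := start; i <= limit; i *= 2`.
func doublings(start, limit int) []int {
	var out []int
	for i := start; i <= limit; i *= 2 {
		out = append(out, i)
	}
	return out
}

func main() {
	// After 64 the next value is 128, which exceeds 96, so 96 is never visited.
	fmt.Println(doublings(2, 96)) // prints [2 4 8 16 32 64]
}
```

One possible fix, if 96 is meant to be covered, is to iterate over an explicit slice such as `[]int{2, 4, 8, 16, 32, 64, 96}` rather than doubling.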

pkg/cmd/roachtest/cluster_test.go line 663 at r1 (raw file):

	//		n2-highcpu-128 amd64
}
func ExampleSelectAWSMachineType() {

These kinds of tests are hard to update; it's better to write a datadriven test. The datadriven "command" can just be the cloud type.

renatolabs

Agree with Radu's comment that datadriven would be a nicer experience than using Go Examples (separate test and output files, well understood/supported rewrite dev flag, etc).

Otherwise, this looks great!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @srosenberg)

srosenberg · 2024-02-08T05:58:34Z

Converted "example" tests into datadriven, using suggested tabulation. Much more readable, thanks! PTAL.

srosenberg · 2024-02-09T22:29:03Z

TFTR!

bors r=renatolabs

craig · 2024-02-09T23:53:59Z

Build succeeded:

Bazel Essential CI (Cockroach)

srosenberg · 2024-02-10T02:52:30Z

blathers backport 23.1 23.2

…al SSD In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it. [1] cockroachdb#117852 Epic: none Release note: None

In [1], we switched to azure `v5` machine series. Some of these newer machine types do not support hypervisor generation 1. By hardcoding generation 2, we effectively broke backward compatibility with older machine types. As of this change, the hypervisor generation is dynamically selected based on the machine type (see `imageSKU`). [1] cockroachdb#117852 Epic: none Release note: None

120172: roachprod(azure): use machine type to determine hypervisor generation r=DarrylWong a=srosenberg

In [1], we switched to azure `v5` machine series. Some of these newer machine types do not support hypervisor generation 1. By hardcoding generation 2, we effectively broke backward compatibility with older machine types. As of this change, the hypervisor generation is dynamically selected based on the machine type (see `imageSKU`).

[1] #117852

Epic: none
Release note: None

120205: server, ccl, sql: skip recent failures r=abarganier a=dhartunian

Epic: None
Release note: None

120225: kvserver: move some tests to heavier pools under `race`, `deadlock` r=celiala a=rickystewart

Epic: CRDB-8308
Release note: None

120229: release: released CockroachDB version 24.1.0-alpha.2. Next version: 24.1.0-alpha.3 r=DarrylWong a=cockroach-teamcity

Release note: None
Epic: None
Release justification: non-production (release infra) change.

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Justin Beaver <teamcity@cockroachlabs.com>

…al SSD In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType` when the requested number of vCPUs > 80. However, that family doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it. We also bump `EstimatedMaxGCE` and `EstimatedMaxAWS` (both empirically derived) for `tpccbench/nodes=9/cpu=4/multi-region` in order to reduce the number of steps during the line search. Otherwise, the test has been seen timing out, largely because it now runs on Ice Lake rather than Cascade Lake (as it did prior to [1]). [1] cockroachdb#117852 Epic: none Release note: None
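
A minimal sketch of the fallback described in this commit message follows. The function name `awsFamily` and its signature are hypothetical, and forming the local-SSD variant by appending `d` to the family name (as in `m6i` to `m6id`) is an assumption made for illustration. The condition is simplified to the vCPU count alone; the original PR description also conditions the switch on the memory setting being Auto.

```go
package main

import "fmt"

// awsFamily picks an AWS instance family for a request. baseFamily is the
// Ice Lake family chosen by the memory heuristic (e.g. "m6i").
func awsFamily(baseFamily string, cpus int, shouldSupportLocalSSD bool) (family string, localSSD bool) {
	if cpus > 80 {
		// Per the commit message, c6a (AMD Milan) doesn't support local SSDs,
		// so the local SSD request is ignored.
		return "c6a", false
	}
	if shouldSupportLocalSSD {
		// Assumption: the local-SSD variant is the base family plus "d".
		return baseFamily + "d", true
	}
	return baseFamily, false
}

func main() {
	family, ssd := awsFamily("m6i", 16, true)
	fmt.Printf("16 vCPUs: family=%s localSSD=%t\n", family, ssd)

	family, ssd = awsFamily("m6i", 96, true)
	fmt.Printf("96 vCPUs: family=%s localSSD=%t\n", family, ssd)
}
```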

119900: roachtest: SelectAWSMachineType should fall back to `c6a` without loc… r=herkolategan,renatolabs a=srosenberg

…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs. Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

[1] #117852

Epic: none
Release note: None

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>

srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 2 times, most recently from e86b049 to 0dc3016 Compare January 23, 2024 04:42

renatolabs mentioned this pull request Jan 24, 2024

roachtest: kv0/enc=false/nodes=1/size=64kb/conc=4096 failed #113279

Closed

srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 5 times, most recently from 3109acd to d960ff1 Compare January 27, 2024 20:08

srosenberg marked this pull request as ready for review January 27, 2024 20:08

srosenberg requested a review from a team as a code owner January 27, 2024 20:08

srosenberg requested review from DarrylWong and herkolategan and removed request for a team January 27, 2024 20:08

srosenberg requested review from RaduBerinde and renatolabs January 27, 2024 20:10

srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 2 times, most recently from 220a5d9 to 327d82e Compare January 27, 2024 21:32

RaduBerinde reviewed Jan 30, 2024

renatolabs approved these changes Jan 30, 2024

srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch from 327d82e to ea8abaf Compare February 8, 2024 05:56

srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch from ea8abaf to 5490f98 Compare February 8, 2024 07:42

craig bot merged commit 2420e5c into cockroachdb:master Feb 9, 2024

jbowens added a commit to cockroachdb/pebble that referenced this pull request Feb 13, 2024

docs: annotate change in benchmark hardware

f934dad

See cockroachdb/cockroach#117852.

This was referenced Feb 13, 2024

release-23.1: roachtest: harmonize GCE, AWS, Azure machine types #119147

Merged

release-23.2: roachtest: harmonize GCE, AWS, Azure machine types #119204

Merged

release-22.2: roachtest: harmonize GCE, AWS, Azure machine types #119264

Merged

srosenberg mentioned this pull request Mar 5, 2024

roachtest: SelectAWSMachineType should fall back to c6a without loc… #119900

Merged

srosenberg mentioned this pull request Mar 10, 2024

roachprod(azure): use machine type to determine hypervisor generation #120172

Merged

blathers-crl bot mentioned this pull request Mar 11, 2024

release-22.2: roachprod(azure): use machine type to determine hypervisor generation #120247

Merged

blathers-crl bot mentioned this pull request Mar 11, 2024

release-23.1: roachprod(azure): use machine type to determine hypervisor generation #120248

Merged

blathers-crl bot mentioned this pull request Mar 11, 2024

release-23.2: roachprod(azure): use machine type to determine hypervisor generation #120249

Merged

blathers-crl bot mentioned this pull request Mar 14, 2024

release-23.2: roachtest: SelectAWSMachineType should fall back to c6a without loc… #120464

Merged

srosenberg mentioned this pull request Mar 14, 2024

release-23.1: roachtest: SelectAWSMachineType should fall back to c6a without loc… #120465

Merged

srosenberg mentioned this pull request Mar 14, 2024

release-22.2: roachtest: SelectAWSMachineType should fall back to c6a without loc… #120466

Merged

This was referenced Mar 27, 2024

release-23.1: roachtest/mixedversion: wrap original errors in test failures #121230

Merged

release-23.2: roachtest/mixedversion: wrap original errors in test failures #121231

Merged

srosenberg mentioned this pull request Nov 11, 2025

perf: evaluate new machine types in GCE, AWS, Azure #157155

Open

craig[bot] merged 1 commit into cockroachdb:master from srosenberg:sr/roachtest_harmonize_machine_types2
