roachtest: harmonize GCE, AWS, Azure machine types #117852

Merged
craig[bot] merged 1 commit into cockroachdb:master from srosenberg:sr/roachtest_harmonize_machine_types2
Feb 9, 2024
Conversation

@srosenberg
Member

@srosenberg srosenberg commented Jan 17, 2024

Previously, the same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
CPU family, e.g., Cascade Lake vs. Ice Lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having half as much memory per CPU.

This change harmonizes GCE and AWS machine types by making them
as isomorphic as possible with respect to memory, CPU family and price.
The following heuristics are used depending on specified MemPerCPU:
Standard yields 4GB/cpu, High yields 8GB/cpu,
Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
Low is supported only in GCE.
Consequently, n2-standard maps to m6i, n2-highmem maps to r6i,
n2-custom maps to c6i, modulo local SSDs in which case m6id is
used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake;
isomorphic AWS machine types are exclusively on Ice Lake.
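
A minimal sketch of the memory heuristic described above, for illustration only. The type `MemPerCPU`, its constants and the helper `memGBPerCPU` are hypothetical names chosen here and are not the roachtest definitions; `Low` is left out because the description only says it is supported in GCE.

```go
package main

import "fmt"

// MemPerCPU mirrors the settings named in the description. The type and
// constant names are illustrative, not the roachtest definitions.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
)

// memGBPerCPU returns the GB-per-vCPU ratio stated in the description:
// Standard is 4, High is 8, Auto is 4 up to and including 16 vCPUs, then 2.
func memGBPerCPU(mem MemPerCPU, cpus int) int {
	switch mem {
	case Standard:
		return 4
	case High:
		return 8
	default: // Auto
		if cpus <= 16 {
			return 4
		}
		return 2
	}
}

func main() {
	for _, cpus := range []int{4, 16, 32} {
		fmt.Printf("auto cpus=%d: %d GB total\n", cpus, cpus*memGBPerCPU(Auto, cpus))
	}
}
```

Under this reading, a 16 vCPU and a 32 vCPU machine with Auto memory end up with the same total memory, which is the point at which the mapping moves from the standard families to the 2GB/cpu ones.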

Roachprod is extended to show cpu family and architecture on List.
Cost estimation now correctly deals with custom machine types.

Note, this PR essentially resurrects [1], after it was reverted
in [2]. Since [1], SelectAzureMachineType has been added.
MemPerCPU is preserved across all three cloud providers.
However, when mem is Auto (default) and cpus > 80, we switch
to AMD Milan, both in GCE and AWS, but not Azure. (The latter
doesn't support 2GB per AMD CPU.)

For complete lists of machine types see ExampleXXXMachineType.

[1] #111140
[2] #111633

Epic: none
Fixes: #106570

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 2 times, most recently from e86b049 to 0dc3016 Compare January 23, 2024 04:42
@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 5 times, most recently from 3109acd to d960ff1 Compare January 27, 2024 20:08
@srosenberg srosenberg marked this pull request as ready for review January 27, 2024 20:08
@srosenberg srosenberg requested a review from a team as a code owner January 27, 2024 20:08
@srosenberg srosenberg requested review from DarrylWong and herkolategan and removed request for a team January 27, 2024 20:08
@srosenberg
Member Author

I've manually tested most Azure machine types by randomizing tpccbench:

  srosenberg-1705717757-02-n4cpu4sm                         [azure]     4     Standard_D4ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-03-n4cpu16                          [azure]     4    Standard_D16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-08-n4cpu16                          [azure]     4    Standard_D16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-11-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-12-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-13-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                   1h3m0s   11h27m6s
  srosenberg-1705717757-14-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-15-n4cpu4hm                         [azure]     4    Standard_E4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705717757-16-n12cpu4sm-geo                    [azure]     8    Standard_D4pds_v5         arm64                     1h3m0s   11h27m6s
  srosenberg-1705718037-01-n4cpu16sm                        [azure]     4   Standard_D16pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-04-n4cpu4sm                         [azure]     4    Standard_D4pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-05-n4cpu16hm                        [azure]     4    Standard_E16ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-09-n4cpu4sm                         [azure]     4     Standard_D4ds_v5                                    59m0s   11h27m6s
  srosenberg-1705718037-10-n4cpu4hm                         [azure]     4    Standard_E4pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-11-n4cpu16                          [azure]     4   Standard_D16pds_v5         arm64                      58m0s   11h27m6s
  srosenberg-1705718037-12-n4cpu16sm                        [azure]     4    Standard_D16ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-13-n4cpu4hm                         [azure]     4     Standard_E4ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-14-n4cpu4                           [azure]     4     Standard_D4ds_v5                                    58m0s   11h27m6s
  srosenberg-1705718037-21-n10cpu4                          [azure]    10    Standard_D4pds_v5         arm64                      46m0s   11h27m6s
  srosenberg-1705720133-06-n4cpu8                           [azure]     4    Standard_D8pds_v5         arm64                      24m0s   12h27m6s
  srosenberg-1705720133-08-n4cpu16                          [azure]     4   Standard_D16pds_v5         arm64                      17m0s   12h27m6s
  srosenberg-1705720133-10-n4cpu4                           [azure]     4     Standard_D4ds_v5                                    21m0s   12h27m6s
  srosenberg-1705720133-12-n4cpu64                          [azure]     4   Standard_D64lds_v5                                    21m0s   12h27m6s
  srosenberg-1705720133-13-n4cpu64                          [azure]     4  Standard_D64plds_v5         arm64                      20m0s   12h27m6s
  srosenberg-1705720133-15-n4cpu16sm                        [azure]     4    Standard_D16ds_v5                                    18m0s   12h27m6s
  srosenberg-1705720133-17-n10cpu16hm                       [azure]     1   Standard_E16pds_v5         arm64                      18m0s   12h27m6s

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch 2 times, most recently from 220a5d9 to 327d82e Compare January 27, 2024 21:32
@srosenberg
Member Author

Kicked off GCE and AWS runs with SELECT_PROBABILITY=0.5.

Member

@RaduBerinde RaduBerinde left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/cluster_test.go line 467 at r1 (raw file):

}

func TestAzureMachineType(t *testing.T) {

Would this be better as a datadriven test? It would be nice to not have to replicate the same logic from the code, and instead have a smaller number of targeted test cases that can be visually inspected.

I would make a test where each test case takes an arch and a memory spec as input, and the output is the chosen machine type for all clouds and all CPU sizes. They can be tabulated, e.g. something like:

cpus | 1,2,4,8       | 16,32,64     | 96,128
----------------------------------------------
GCE  | n2-standard-X | n-standard-X | ..
AWS  | ..            | ..           |

It would be a useful visual inspection of how they map to each other.
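
One way the suggested tabulation could be rendered, as a rough sketch using only the standard library. The `selector` type and the `gceStandard` / `awsStandard` functions are hypothetical placeholders fixed to standard memory; they stand in for the real per-cloud selection functions and do not reproduce their logic.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// selector is a hypothetical stand-in for a per-cloud machine type selector.
type selector func(cpus int) string

// gceStandard and awsStandard are placeholders for illustration only.
func gceStandard(cpus int) string { return fmt.Sprintf("n2-standard-%d", cpus) }

func awsStandard(cpus int) string {
	sizes := map[int]string{4: "xlarge", 8: "2xlarge", 16: "4xlarge"}
	return "m6i." + sizes[cpus]
}

// row builds one table row: the cloud label followed by one cell per CPU size.
func row(label string, cpus []int, sel selector) []string {
	cells := []string{label}
	for _, c := range cpus {
		cells = append(cells, sel(c))
	}
	return cells
}

// tabulate aligns a header row of CPU sizes and the given rows into columns.
func tabulate(cpus []int, rows ...[]string) string {
	var sb strings.Builder
	tw := tabwriter.NewWriter(&sb, 0, 0, 1, ' ', 0)
	header := []string{"cpus"}
	for _, c := range cpus {
		header = append(header, fmt.Sprint(c))
	}
	for _, r := range append([][]string{header}, rows...) {
		fmt.Fprintln(tw, strings.Join(r, "\t| "))
	}
	tw.Flush()
	return sb.String()
}

func main() {
	cpus := []int{4, 8, 16}
	fmt.Print(tabulate(cpus, row("GCE", cpus, gceStandard), row("AWS", cpus, awsStandard)))
}
```

In a datadriven test, the string returned by `tabulate` would be the expected output stored in the test file, so a change in the mapping shows up as a readable diff.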


pkg/cmd/roachtest/cluster_test.go line 484 at r1 (raw file):

			testCases = append(testCases, machineTypeTestCase{1, mem, false, arch,
				fmt.Sprintf("Standard_%s", strings.Replace(series, "?", strconv.Itoa(2), 1)), arch})
			for i := 2; i <= 96; i *= 2 {

96 isn't a power of two, this will stop at 64
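
A small standalone sketch of the loop bound in question; `doublingSizes` is a hypothetical helper written for this illustration, not code from the PR.

```go
package main

import "fmt"

// doublingSizes collects the values visited by `for i := 2; i <= max; i *= 2`.
func doublingSizes(max int) []int {
	var sizes []int
	for i := 2; i <= max; i *= 2 {
		sizes = append(sizes, i)
	}
	return sizes
}

func main() {
	// 64*2 = 128 exceeds the bound, so 96 itself is never visited.
	fmt.Println(doublingSizes(96)) // prints "[2 4 8 16 32 64]"

	// If 96 is meant to be covered, it has to be listed explicitly.
	fmt.Println(append(doublingSizes(64), 96))
}
```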


pkg/cmd/roachtest/cluster_test.go line 663 at r1 (raw file):

	//		n2-highcpu-128 amd64
}
func ExampleSelectAWSMachineType() {

These kinds of tests are hard to update; it's better to write a datadriven test. The datadriven "command" can just be the cloud type.


@renatolabs renatolabs left a comment

Agree with Radu's comment that datadriven would be a nicer experience than using Go Examples (separate test and output files, well understood/supported rewrite dev flag, etc).

Otherwise, this looks great!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @srosenberg)

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch from 327d82e to ea8abaf Compare February 8, 2024 05:56
@srosenberg
Member Author

Converted "example" tests into datadriven, using suggested tabulation. Much more readable, thanks! PTAL.

@srosenberg srosenberg force-pushed the sr/roachtest_harmonize_machine_types2 branch from ea8abaf to 5490f98 Compare February 8, 2024 07:42
@srosenberg
Member Author

TFTR!

bors r=renatolabs

@craig
Contributor

craig bot commented Feb 9, 2024

Build succeeded:

@craig craig bot merged commit 2420e5c into cockroachdb:master Feb 9, 2024
@srosenberg
Member Author

blathers backport 23.1 23.2

jbowens added a commit to cockroachdb/pebble that referenced this pull request Feb 13, 2024
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Mar 5, 2024
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in
`SelectAWSMachineType`, when requested number of vCPUs > 80.
However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested,
we now ignore it.
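
A rough sketch of the fallback described in this commit message, combined with the family mapping from the original PR description. `awsFamily` and `MemPerCPU` are hypothetical names and this is not the body of `SelectAWSMachineType`; appending "d" for local SSDs generalizes the single `m6id` example from the description and is an assumption.

```go
package main

import "fmt"

// MemPerCPU is an illustrative type, not the roachtest definition.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
)

// awsFamily picks an AWS instance family (hypothetical helper).
func awsFamily(mem MemPerCPU, cpus int, localSSD bool) string {
	var family string
	switch {
	case mem == High:
		family = "r6i" // 8GB/cpu
	case mem == Standard:
		family = "m6i" // 4GB/cpu
	case cpus > 80:
		// Auto above 80 vCPUs falls back to AMD Milan. That family has no
		// local SSD support, so the localSSD request is ignored.
		return "c6a"
	case cpus > 16:
		family = "c6i" // Auto above 16 vCPUs: 2GB/cpu
	default:
		family = "m6i" // Auto up to and including 16 vCPUs: 4GB/cpu
	}
	if localSSD {
		family += "d" // assumption: "d" variants carry local SSDs, as with m6id
	}
	return family
}

func main() {
	fmt.Println(awsFamily(Auto, 96, true)) // local SSD request is dropped
	fmt.Println(awsFamily(Auto, 8, true))
}
```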

[1] cockroachdb#117852

Epic: none

Release note: None
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Mar 10, 2024
In [1], we switched to azure `v5` machine series. Some of these
newer machine types do not support hypervisor generation 1. By
hardcoding generation 2, we effectively broke backward compatibility
with older machine types.

As of this change, the hypervisor generation is dynamically
selected based on the machine type (see `imageSKU`).

[1] cockroachdb#117852

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Mar 11, 2024
120172: roachprod(azure): use machine type to determine hypervisor generation r=DarrylWong a=srosenberg

In [1], we switched to azure `v5` machine series. Some of these newer machine types do not support hypervisor generation 1. By hardcoding generation 2, we effectively broke backward compatibility with older machine types.

As of this change, the hypervisor generation is dynamically selected based on the machine type (see `imageSKU`).

[1] #117852

Epic: none

Release note: None

120205: server, ccl, sql: skip recent failures r=abarganier a=dhartunian

Epic: None
Release note: None


120225: kvserver: move some tests to heavier pools under `race`, `deadlock` r=celiala a=rickystewart

Epic: CRDB-8308
Release note: None

120229: release: released CockroachDB version 24.1.0-alpha.2. Next version: 24.1.0-alpha.3 r=DarrylWong a=cockroach-teamcity

Release note: None
Epic: None
Release justification: non-production (release infra) change.


Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Justin Beaver <teamcity@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Mar 11, 2024
In [1], we switched to azure `v5` machine series. Some of these
newer machine types do not support hypervisor generation 1. By
hardcoding generation 2, we effectively broke backward compatibility
with older machine types.

As of this change, the hypervisor generation is dynamically
selected based on the machine type (see `imageSKU`).

[1] #117852

Epic: none

Release note: None
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Mar 14, 2024
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in
`SelectAWSMachineType`, when requested number of vCPUs > 80.
However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested,
we now ignore it. We also bump `EstimatedMaxGCE`
and `EstimatedMaxAWS` (both empirically derived)
for `tpccbench/nodes=9/cpu=4/multi-region` in order
to reduce the number of steps during the line search.
Otherwise, the test has been seen timing out, largely
because it is now executed on Ice Lake vs.
Cascade Lake (prior to [1]).

[1] cockroachdb#117852

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2024
119900: roachtest: SelectAWSMachineType should fall back to `c6a` without loc… r=herkolategan,renatolabs a=srosenberg

…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType`, when requested number of vCPUs > 80. However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

[1] #117852

Epic: none

Release note: None

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
blathers-crl bot pushed a commit that referenced this pull request Mar 14, 2024
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in
`SelectAWSMachineType`, when requested number of vCPUs > 80.
However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested,
we now ignore it. We also bump `EstimatedMaxGCE`
and `EstimatedMaxAWS` (both empirically derived)
for `tpccbench/nodes=9/cpu=4/multi-region` in order
to reduce the number of steps during the line search.
Otherwise, the test has been seen timing out, largely
because it is now executed on Ice Lake vs.
Cascade Lake (prior to [1]).

[1] #117852

Epic: none

Release note: None
jasminejsun pushed a commit to jasminejsun/cockroach that referenced this pull request Mar 18, 2024
…al SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in
`SelectAWSMachineType`, when requested number of vCPUs > 80.
However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested,
we now ignore it. We also bump `EstimatedMaxGCE`
and `EstimatedMaxAWS` (both empirically derived)
for `tpccbench/nodes=9/cpu=4/multi-region` in order
to reduce the number of steps during the line search.
Otherwise, the test has been seen timing out, largely
because it is now executed on Ice Lake vs.
Cascade Lake (prior to [1]).

[1] cockroachdb#117852

Epic: none

Release note: None
Development

Successfully merging this pull request may close these issues.

roachtest: kv0/enc=false/nodes=1/size=64kb/conc=4096 failed

4 participants