roachtest: harmonize GCE and AWS machine types #111140
craig[bot] merged 1 commit into cockroachdb:master
4e356ad to e478b80
e478b80 to d315ea8
CI smoke test in progress: SELECT_PROBABILITY=0.4
pkg/cmd/roachtest/cluster_test.go
Outdated
    {"n2-standard-32", 32},
    {"n2-standard-64", 64},
    {"n2-standard-96", 96},
    // GCE machine types
Stray comment? There is already a comment above for "GCE machine types".
It might have a small bump on the cpu-bound workloads using smaller vCPU density (those are most likely |
If the effect would be small, then probably not. Otherwise, we'll have to directly compare the release roachperf graphs with each other to detect regressions; we can't simply look at the single graph for master. But if we're ok with the extra manual work, then that's fine too.
herkolategan left a comment
Reviewed 13 of 13 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @renatolabs, @smg260, and @srosenberg)
-- commits line 4 at r1:
silly nit: "some"
-- commits line 26 at r1:
silly nit: "intended"
Right. My guess is that the effect is small, possibly negligible. Otherwise, I'll revert the change and postpone it until after branch cut.
renatolabs left a comment
Thanks for taking the time to make sure GCE and AWS are consistent!
    }
    if shouldSupportLocalSSD {
        family = family + "d"
Can we simplify this function by only having one check at the end for `if shouldSupportLocalSSD { family += "d" }`?
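A minimal sketch of the suggested shape, not the code under review: choose the base family first, then append the local-SSD suffix exactly once at the end. The function name `awsFamily` and its integer memory-per-vCPU parameter are hypothetical; the family names come from the PR description.

```go
package main

import "fmt"

// awsFamily is a hypothetical helper. It maps GB of memory per vCPU to an
// AWS family and appends "d" in a single place when local SSDs are needed.
func awsFamily(memGBPerCPU int, shouldSupportLocalSSD bool) string {
	var family string
	switch memGBPerCPU {
	case 8:
		family = "r6i" // counterpart of n2-highmem
	case 4:
		family = "m6i" // counterpart of n2-standard
	default:
		family = "c6i" // counterpart of n2-custom (2GB/cpu)
	}
	// The only local-SSD check in the function.
	if shouldSupportLocalSSD {
		family += "d"
	}
	return family
}

func main() {
	fmt.Println(awsFamily(4, true))
	fmt.Println(awsFamily(8, false))
}
```

With a single check at the end, each branch above it only has to decide the base family.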
pkg/roachprod/vm/gce/gcloud.go
Outdated
    }
    MachineType string
    // CPU platform corresponding to machine type; see https://cloud.google.com/compute/docs/cpu-platforms
    CpuPlatform string
Nit: probably more idiomatic to call this CPUPlatform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).
pkg/roachprod/vm/gce/gcloud.go
Outdated
    if rand.Float64() < 0.75 {
        zones = []string{defaultZones[0]}
    } else {
        zones = []string{defaultZones[1]}
This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).
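To make the concern concrete, here is a small sketch of the selection logic in the snippet above. The random draw is passed in as an argument so the behavior is easy to check. The contents of `defaultZones` are an assumption based on the zones named in the PR description, and `pickZone` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/rand"
)

// defaultZones is assumed here; the order follows the ~75%/~25% split
// described in the PR description.
var defaultZones = []string{"us-east1-b", "us-central1-b"}

// pickZone returns defaultZones[0] when r < 0.75 and defaultZones[1]
// otherwise, where r is expected to be in [0, 1).
func pickZone(r float64) string {
	if r < 0.75 {
		return defaultZones[0]
	}
	return defaultZones[1]
}

func main() {
	rng := rand.New(rand.NewSource(1))
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickZone(rng.Float64())]++
	}
	// Roughly a quarter of the draws land outside us-east1.
	fmt.Println(counts)
}
```

Under this scheme, about one run in four is placed in `us-central1-b`, which is the case the comment warns about for tests that write large backups to buckets located only in us-east1.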
srosenberg left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @herkolategan, @renatolabs, and @smg260)
Previously, herkolategan (Herko Lategan) wrote…
silly nit: "some"
It's actually inteded (sic) to be same (sic), but I'll rephrase :) Basically, one roachtest executed in both clouds would have different multipliers.
Previously, herkolategan (Herko Lategan) wrote…
silly nit: "intended"
Fixing, thanks.
pkg/cmd/roachtest/spec/machine_type.go line 115 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Can we simplify this function by only having one check at the end for `if shouldSupportLocalSSD { family += "d" }`?
Yep, good catch!
pkg/roachprod/vm/vm.go line 52 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Would be nice to include some examples of what kinds of inputs this is known to work well: e.g., GCE tools, binary detection tool, etc.
Yep, good idea!
pkg/roachprod/vm/gce/gcloud.go line 131 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nit: probably more idiomatic to call this CPUPlatform: https://github.com/golang/go/wiki/CodeReviewComments#initialisms (and also more consistent with CPUArch and CPUFamily).
I struggled with this one; CPUPlatform just hurts my eyes :) I'll comply with the more idiomatic convention.
pkg/roachprod/vm/gce/gcloud.go line 973 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
This won't be great if we get unlucky and run a test that takes large backups and stores them in our backup testing buckets, which are only in us-east1 (not multiregion).
Yep, I am taking this out; to be dealt with in #111371
74bab9a to f89da86
    if family == "c7g" && size == "24xlarge" {
        family = "c6id"
    // There is no m7gd.24xlarge, fall back to (c|m|r)6id.24xlarge.
    if family == "m7gd" && size == "24xlarge" {
Good catch, I shouldn't have hurried!
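The corrected condition can be illustrated with a short sketch. The body of the `if` is not shown in the diff, so mapping `m7gd` to `m6id` is an assumption based on the "(c|m|r)6id" comment, and `awsMachineType` is a hypothetical helper.

```go
package main

import "fmt"

// awsMachineType joins family and size into an AWS machine type name,
// applying the fallback for the size that m7gd does not offer.
func awsMachineType(family, size string) string {
	// There is no m7gd.24xlarge; fall back to the 6id generation.
	// Assumption: the fallback stays within the same "m" class.
	if family == "m7gd" && size == "24xlarge" {
		family = "m6id"
	}
	return family + "." + size
}

func main() {
	fmt.Println(awsMachineType("m7gd", "24xlarge"))
	fmt.Println(awsMachineType("m7gd", "4xlarge"))
}
```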
f89da86 to e506755
…l revert
This is a backport of the "merged" diff of the following PRs:
roachtest: harmonize GCE and AWS machine types cockroachdb#111140
roachtest: revert harmonize GCE and AWS machine types cockroachdb#111633
Release justification: test-only code, keeping roachtest in sync.
Epic: none
Release note: None
We should revisit this. It still causes the occasional test failure (e.g., #113279).
Yep, I'll prepare a new PR. We can wait until |
TBD
[1] cockroachdb#111140
[2] cockroachdb#111633
Epic: none
Fixes: cockroachdb#106570
Release note: None
Previously, the same (performance) roachtest executed in GCE and AWS may have used a different memory (per CPU) multiplier and/or cpu family, e.g., cascade lake vs ice lake. In the best case, this resulted in different performance baselines on an otherwise equivalent machine type. In the worst case, this resulted in OOMs due to VMs in AWS having half as much memory per CPU.
This change harmonizes GCE and AWS machine types by making them as isomorphic as possible, wrt memory, cpu family and price. The following heuristics are used depending on specified MemPerCPU: Standard yields 4GB/cpu, High yields 8GB/cpu, Auto yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu. Low is supported only in GCE. Consequently, n2-standard maps to m6i, n2-highmem maps to r6i, n2-custom maps to c6i, modulo local SSDs in which case m6id is used, etc. Note, we also force --gce-min-cpu-platform to Ice Lake; isomorphic AWS machine types are exclusively on Ice Lake.
Roachprod is extended to show cpu family and architecture on List. Cost estimation now correctly deals with custom machine types.
Note, this PR essentially resurrects [1], after it was reverted in [2]. Since [1], `SelectAzureMachineType` has been added. MemPerCPU is preserved across all three cloud providers. However, when mem is Auto (default) and cpus > 80, we switch to AMD Milan, both in GCE and AWS, but not Azure. (The latter doesn't support 2GB per AMD CPU.) For complete lists of machine types see `ExampleXXXMachineType`.
[1] cockroachdb#111140
[2] cockroachdb#111633
Epic: none
Fixes: cockroachdb#106570
Release note: None
117852: roachtest: harmonize GCE, AWS, Azure machine types r=renatolabs a=srosenberg
Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Previously, the same (performance) roachtest executed in GCE and AWS
may have used a different memory (per CPU) multiplier and/or
cpu family, e.g., cascade lake vs ice lake. In the best case,
this resulted in different performance baselines on an otherwise
equivalent machine type. In the worst case, this resulted in OOMs
due to VMs in AWS having half as much memory per CPU.
This change harmonizes GCE and AWS machine types by making them
as isomorphic as possible, wrt memory, cpu family and price.
The following heuristics are used depending on specified
`MemPerCPU`: `Standard` yields 4GB/cpu, `High` yields 8GB/cpu,
`Auto` yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
`Low` is supported only in GCE.
Consequently, `n2-standard` maps to `m6i`, `n2-highmem` maps to `r6i`,
`n2-custom` maps to `c6i`, modulo local SSDs in which case `m6id` is
used, etc. Note, we also force `--gce-min-cpu-platform` to `Ice Lake`;
isomorphic AWS machine types are exclusively on `Ice Lake`.
Roachprod is extended to show cpu family and architecture on `List`.
Cost estimation now correctly deals with custom machine types.
Finally, we change the default zone allocation in GCE from exclusively
`us-east1-b` to ~25% `us-central1-b` and ~75% `us-east1-b`. This is
intended to balance the quotas for local SSDs until we eventually
switch to PD-SSDs.
Epic: none
Fixes: #106570
Release note: None
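The memory heuristics in the description can be sketched as a small function. This is an illustration under stated assumptions, not the code in this PR: the `MemPerCPU` type below is a local stand-in, `awsMemGBPerCPU` is a hypothetical name, and the `Low` case simply returns an error because the description only says it is unsupported outside GCE.

```go
package main

import (
	"errors"
	"fmt"
)

// MemPerCPU is a local stand-in for the setting named in the description.
type MemPerCPU int

const (
	Auto MemPerCPU = iota
	Standard
	High
	Low
)

// awsMemGBPerCPU returns GB of memory per vCPU following the stated
// heuristics: Standard is 4, High is 8, and Auto is 4 up to and including
// 16 vCPUs, then 2.
func awsMemGBPerCPU(mem MemPerCPU, cpus int) (int, error) {
	switch mem {
	case Standard:
		return 4, nil
	case High:
		return 8, nil
	case Auto:
		if cpus <= 16 {
			return 4, nil
		}
		return 2, nil
	case Low:
		return 0, errors.New("low memory per CPU is supported only in GCE")
	default:
		return 0, fmt.Errorf("unknown MemPerCPU value %d", mem)
	}
}

func main() {
	for _, cpus := range []int{8, 16, 32} {
		gb, err := awsMemGBPerCPU(Auto, cpus)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("Auto, %d vCPUs: %d GB/cpu, %d GB total\n", cpus, gb, gb*cpus)
	}
}
```

Applying the same ratio in every cloud is what keeps memory per vCPU comparable between equivalent machine types.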