roachtest: upgrade GCE to n2 and AWS to m6id and c6id#104419

Merged
craig[bot] merged 1 commit into cockroachdb:master from
srosenberg:sr/roachtest_bump_machine_types
Jun 11, 2023

Conversation

@srosenberg
Member

Previously, roachtest used n1 in GCE, m5d and c6d in AWS. CockroachDB Cloud hardware now uses n2 in GCE, m6i in AWS [1]. This change brings roachtest hardware into parity with CockroachDB Cloud; it is based on the draft PR in [2].

[1] https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/2799501550/CockroachDB+Cloud+Hardware
[2] #99991

Epic: none

Release note: None
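
For illustration, the n2 machine type names that show up in the logs later in this thread (n2-standard-8, n2-standard-16, n2-highcpu-32, n2-highcpu-96) are consistent with a simple rule keyed on vCPU count. The sketch below is a guess at such a rule, not roachtest's actual selection logic; the function name and the 16-vCPU threshold are assumptions.

```python
def gce_machine_type(cpus: int, high_cpu_threshold: int = 16) -> str:
    """Return an n2 machine type name for the given vCPU count.

    Assumption (not confirmed by the PR): clusters with more than
    `high_cpu_threshold` vCPUs use the n2-highcpu family, and everything
    else uses n2-standard.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    family = "n2-highcpu" if cpus > high_cpu_threshold else "n2-standard"
    return f"{family}-{cpus}"


if __name__ == "__main__":
    for cpus in (8, 16, 32, 96):
        print(cpus, "->", gce_machine_type(cpus))
```

The threshold is a parameter so that the rule can be adjusted without touching the naming logic.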

@srosenberg srosenberg requested a review from a team as a code owner June 6, 2023 16:38
@srosenberg srosenberg requested review from herkolategan, nicktrav, renatolabs and tbg and removed request for a team June 6, 2023 16:38
@cockroach-teamcity
Member

This change is Reviewable

@srosenberg srosenberg requested a review from erikgrinaker June 6, 2023 16:38
@srosenberg srosenberg added the backport-22.2.x and backport-23.1.x labels Jun 6, 2023
Contributor

@erikgrinaker erikgrinaker left a comment

Nice. Probably goes without saying that we'll want an annotation for this, as it's going to throw our benchmarks out of whack.

Member

@tbg tbg left a comment

Thanks! Also worth doing a canary run of 1-2 roachtests for both GCE and AWS, disregard if you've done it already.

@srosenberg
Member Author

srosenberg commented Jun 7, 2023

Also worth doing a canary run of 1-2 roachtests for both GCE and AWS, disregard if you've done it already.

Absolutely! Canary runs are queued up; won't merge until we have the signal.

AWS: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyAwsBazel/10440661?buildTab=tests
GCE: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/10445493?buildTab=tests

@srosenberg
Member Author

The AWS run succeeded, but the GCE run had a couple of issues:

grep -B 6 "test failed" test_runner-1686184082.log|grep "Quota 'N2_CPUS' exceeded" |sort |uniq -c
   9  - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west2.
grep -B 5 "test failed" test_runner-1686184082.log |grep "Number of local"|sort |uniq -c
  39  - Number of local SSDs for an instance of type n2-highcpu-32 should be one of [0, 4, 8, 16, 24], while [1] is requested.: exit status 1.
   3  - Number of local SSDs for an instance of type n2-highcpu-96 should be one of [0, 16, 24], while [1] is requested.: exit status 1.
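
The second class of errors arises because n2 instances accept only certain local SSD counts. One possible way to handle this is to round a requested count up to the nearest allowed value. The sketch below assumes that approach; the table contains only the two machine types quoted in the errors above, and the function and constant names are hypothetical, not roachprod's.

```python
import bisect

# Allowed local SSD counts, copied from the two error messages above.
# Other machine types are deliberately absent rather than guessed.
ALLOWED_LOCAL_SSDS = {
    "n2-highcpu-32": [0, 4, 8, 16, 24],
    "n2-highcpu-96": [0, 16, 24],
}


def round_up_local_ssds(machine_type: str, requested: int) -> int:
    """Return the smallest allowed SSD count that is >= requested."""
    allowed = ALLOWED_LOCAL_SSDS.get(machine_type)
    if allowed is None:
        # Unknown machine type: leave the request unchanged.
        return requested
    idx = bisect.bisect_left(allowed, requested)
    if idx == len(allowed):
        raise ValueError(
            f"{machine_type} supports at most {allowed[-1]} local SSDs, "
            f"but {requested} were requested"
        )
    return allowed[idx]


if __name__ == "__main__":
    print(round_up_local_ssds("n2-highcpu-32", 1))  # prints 4
    print(round_up_local_ssds("n2-highcpu-96", 1))  # prints 16
```

Rounding up keeps at least the requested capacity, at the cost of provisioning more disks than a test asked for.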

@srosenberg srosenberg force-pushed the sr/roachtest_bump_machine_types branch 4 times, most recently from 0ee6b20 to ac232a7 Compare June 10, 2023 16:51
@srosenberg
Member Author

After further tweaks, it looks like everything passed.

In the GCE run, there were two cluster creation errors. One is a known issue (resolved by [1]):

16:18:27 test_runner.go:668: [w13] Unable to create (or reuse) cluster for test restore/tpce/400GB/aws/nodes=9/cpus=8/zones=us-east-2b,us-west-2b,eu-west-1b due to: in provider: gce: Command: gcloud [compute instances create --subnet default --scopes cloud-platform --image ubuntu-2004-focal-v20210603 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --maintenance-policy MIGRATE --create-disk type=,size=1000GB,auto-delete=yes --machine-type n2-standard-8 --labels usage=roachtest,cluster=teamcity-10477812-1686369260-155-n9cpu8-geo,lifetime=12h0m0s,arch=amd64,created=2023-06-10t16_17_46z,roachprod=true --metadata-from-file startup-script=/tmp/gce-startup-script3417406819 --project cockroach-ephemeral --boot-disk-size=32GB --zone us-east-2b teamcity-10477812-1686369260-155-n9cpu8-geo-0001 teamcity-10477812-1686369260-155-n9cpu8-geo-0002 teamcity-10477812-1686369260-155-n9cpu8-geo-0003]
Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:

The other is a small quota in europe-west4, which I missed in the previous attempt:

15:28:26 test_runner.go:668: [w4] Unable to create (or reuse) cluster for test import/tpcc/warehouses=4000/geo due to: in provider: gce: Command: gcloud [compute instances create --subnet default --scopes cloud-platform --image ubuntu-2004-focal-v20210603 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --maintenance-policy MIGRATE --local-ssd interface=NVME --local-ssd interface=NVME --machine-type n2-standard-16 --labels usage=roachtest,cluster=teamcity-10477812-1686369260-131-n8cpu16-geo,lifetime=12h0m0s,arch=amd64,created=2023-06-10t15_25_32z,roachprod=true --metadata-from-file startup-script=/tmp/gce-startup-script3945224579 --project cockroach-ephemeral --boot-disk-size=32GB --zone europe-west4-b teamcity-10477812-1686369260-131-n8cpu16-geo-0003 teamcity-10477812-1686369260-131-n8cpu16-geo-0004]
Output: Created [https://www.googleapis.com/compute/v1/projects/cockroach-ephemeral/zones/europe-west4-b/instances/teamcity-10477812-1686369260-131-n8cpu16-geo-0003].
WARNING: Some requests generated warnings:
 - Disk size: '32 GB' is larger than image size: '10 GB'. You might need to resize the root repartition manually if the operating system does not support automatic resizing. See https://cloud.google.com/compute/docs/disks/add-persistent-disk#resize_pd for details.
 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210603' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20230605'.
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west4.
	metric name = compute.googleapis.com/n2_cpus
	limit name = N2-CPUS-per-project-region
	dimensions = region: europe-west4
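
Quota errors like the one above follow a regular shape, so they can be tallied per region straight from a test runner log. The following is a minimal sketch that assumes the message format shown in this thread; the regular expression and function name are illustrative and not part of roachtest.

```python
import re
from collections import Counter

# Matches lines such as:
#   - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west4.
QUOTA_RE = re.compile(
    r"Quota '(?P<metric>[A-Z0-9_]+)' exceeded\.\s+"
    r"Limit: (?P<limit>[0-9.]+) in region (?P<region>[a-z0-9-]+)\."
)


def quota_errors(log_text: str) -> Counter:
    """Count quota errors keyed by (metric, limit, region)."""
    counts = Counter()
    for match in QUOTA_RE.finditer(log_text):
        key = (match["metric"], float(match["limit"]), match["region"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = (
        " - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west2.\n"
        " - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west4.\n"
        " - Quota 'N2_CPUS' exceeded.  Limit: 24.0 in region europe-west4.\n"
    )
    for (metric, limit, region), n in sorted(quota_errors(sample).items()):
        print(f"{n:3d}  {metric} limit={limit} region={region}")
```

Grouping by region makes it easier to see which regional quotas need to be raised before the next run.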

Previously, roachtest used n1 in GCE, m5d and c6d in AWS.
CockroachDB Cloud hardware now uses n2 in GCE, m6i in AWS [1].
This change brings roachtest hardware into parity with
CockroachDB Cloud; it is based on the draft PR in [2].

[1] https://cockroachlabs.atlassian.net/wiki/spaces/MC/pages/2799501550/CockroachDB+Cloud+Hardware
[2] cockroachdb#99991

Epic: none

Release note: None

Co-authored-by: Nick Travers <travers@cockroachlabs.com>
@srosenberg srosenberg force-pushed the sr/roachtest_bump_machine_types branch from ac232a7 to 5ad20cc Compare June 11, 2023 00:12
@srosenberg
Member Author

Both runs succeeded, modulo the known issue with restore/tpce/400GB/aws/nodes=9/cpus=8/zones=us-east-2b,us-west-2b,eu-west-1b (in GCE). It's instructive to compare durations with this change (PR branch) against those without it (master):

  • AWS ~7h 27m (3-run average on PR branch) vs. ~7h 46m (7-run average on master)
  • GCE ~16h 42m (3-run average on PR branch) vs. ~19h 4m (7-run average on master)

N.B. variance between runs is much higher on master. Further study is queued up once we have more data points for this change.

TL;DR: shaving well over 2 hours (~2h 22m on the averages above) from a GCE run is a quick win, considering we were approaching 24h for some GCE runs.
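
The arithmetic behind the comparison can be checked directly from the averages quoted above. This is only a worked calculation on those four numbers; the helper names are made up for the example.

```python
from typing import Tuple


def minutes(hours: int, mins: int) -> int:
    """Convert an 'Xh Ym' duration to minutes."""
    return hours * 60 + mins


def savings(before: int, after: int) -> Tuple[int, float]:
    """Return (minutes saved, percent saved relative to `before`)."""
    saved = before - after
    return saved, 100.0 * saved / before


# (master average, PR branch average), taken from the bullets above.
RUNS = {
    "AWS": (minutes(7, 46), minutes(7, 27)),
    "GCE": (minutes(19, 4), minutes(16, 42)),
}

if __name__ == "__main__":
    for provider, (before, after) in RUNS.items():
        saved, pct = savings(before, after)
        print(f"{provider}: saved {saved // 60}h {saved % 60}m ({pct:.1f}%)")
    # AWS: saved 0h 19m (4.1%)
    # GCE: saved 2h 22m (12.4%)
```

Since the averages come from only 3 and 7 runs respectively, these figures are indicative rather than conclusive.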

@srosenberg
Member Author

TFTR!

bors r=erikgrinaker,tbg,herkolategan

@craig
Contributor

craig bot commented Jun 11, 2023

Build succeeded:

@craig craig bot merged commit 65b478a into cockroachdb:master Jun 11, 2023
@blathers-crl

blathers-crl bot commented Jun 11, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5ad20cc to blathers/backport-release-22.2-104419: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from 5ad20cc to blathers/backport-release-23.1-104419: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

srosenberg added a commit to srosenberg/cockroach that referenced this pull request Jun 20, 2023
Since the bump to new instance types in GCE and AWS [1],
we are still experiencing occasional cluster creation
issues owing to "insufficient capacity". GCE quota has
already been bumped, with `asia-northeast1` being the
latest, and hopefully last.

The most recent cluster creation failure in AWS is owing to
"insufficient capacity" of `c6id.24xlarge` in us-east-2a.
As a workaround, we extend the existing zone override
to place `c6id.24xlarge` into us-east-2b, which
allegedly has sufficient capacity.

Note, the long-term fix is to rework how cluster creation
retry currently operates, by effectively trying other AZs.

[1] cockroachdb#104419

Epic: none
Fixes: cockroachdb#78601 (comment)

Release note: None
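
The workaround described in this commit message amounts to a per-machine-type zone override. A minimal sketch of the idea follows, assuming us-east-2a is the default zone (implied by the message, not confirmed). The names are hypothetical, and only the `c6id.24xlarge` entry comes from the source; the pre-existing overrides are not shown in this thread, so they are omitted.

```python
# Assumption: the default zone, implied by the capacity error in us-east-2a.
DEFAULT_ZONE = "us-east-2a"

# Only this entry is taken from the commit message above.
ZONE_OVERRIDES = {
    "c6id.24xlarge": "us-east-2b",
}


def zone_for(machine_type: str, default_zone: str = DEFAULT_ZONE) -> str:
    """Return the zone to use for a machine type, honoring overrides."""
    return ZONE_OVERRIDES.get(machine_type, default_zone)


if __name__ == "__main__":
    print(zone_for("c6id.24xlarge"))  # prints us-east-2b
    print(zone_for("m6id.xlarge"))    # prints us-east-2a
```

A static table like this is simple but brittle, since capacity in a zone can change at any time.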
craig bot pushed a commit that referenced this pull request Jun 21, 2023
105234: roachprod: add aws AZ override for c6id.24xlarge r=renatolabs a=srosenberg

Since the bump to new instance types in GCE and AWS [1], we are still experiencing occasional cluster creation issues owing to "insufficient capacity". GCE quota has already been bumped, with `asia-northeast1` being the latest, and hopefully last.

The most recent cluster creation failure in AWS is owing to "insufficient capacity" of `c6id.24xlarge` in us-east-2a. As a workaround, we extend the existing zone override to place `c6id.24xlarge` into us-east-2b, which allegedly has sufficient capacity.

Note, the long-term fix is to rework how cluster creation retry currently operates, by effectively trying other AZs.

[1] #104419

Epic: none
Fixes: #78601 (comment)

Release note: None

Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
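
The long-term fix mentioned above, retrying cluster creation in other AZs, could look roughly like the sketch below. It is not how roachprod works today; the exception type, function name, and `create` callback are all hypothetical stand-ins for the real provider calls.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class InsufficientCapacity(Exception):
    """Hypothetical error for a zone that cannot fit the request."""


def create_with_zone_fallback(
    create: Callable[[str], T], zones: Sequence[str]
) -> T:
    """Call create(zone) for each zone in order until one succeeds.

    Only capacity errors trigger a fallback; any other exception
    propagates immediately so that real failures are not masked.
    """
    last_err: Optional[Exception] = None
    for zone in zones:
        try:
            return create(zone)
        except InsufficientCapacity as err:
            last_err = err
    raise RuntimeError(f"no capacity in any of {list(zones)}") from last_err


if __name__ == "__main__":
    def fake_create(zone: str) -> str:
        # Stand-in for a provider call; pretends us-east-2a is full.
        if zone == "us-east-2a":
            raise InsufficientCapacity(zone)
        return f"cluster created in {zone}"

    print(create_with_zone_fallback(fake_create, ["us-east-2a", "us-east-2b"]))
```

Compared with a static override, this approach adapts when capacity shifts between zones, though it may place a cluster in a zone the test did not expect.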
blathers-crl bot pushed a commit that referenced this pull request Jun 21, 2023
srosenberg added a commit to srosenberg/cockroach that referenced this pull request Jun 22, 2023

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

5 participants