roachtest: SelectAWSMachineType should fall back to c6a without local SSD #119900

Merged
craig[bot] merged 1 commit into cockroachdb:master from srosenberg:sr/fix_SelectAWSMachineType_c6a_no_local_ssd
Mar 14, 2024

Conversation

@srosenberg
Member

roachtest: SelectAWSMachineType should fall back to c6a without local SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in `SelectAWSMachineType` when the requested number of vCPUs is > 80. However, that instance family doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested, we now ignore it.

[1] #117852

Epic: none

Release note: None
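The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual roachtest code: the function signature, the non-c6a family names, and the vCPU-to-size mapping are all assumptions made for the sketch.

```go
package main

import "fmt"

// selectAWSMachineType sketches the behavior described in the commit message:
// above 80 vCPUs we fall back to c6a (AMD Milan), and since c6a has no
// local-SSD variant, a shouldSupportLocalSSD request is ignored in that case.
// Family names and the cpus/4 size mapping are illustrative assumptions.
func selectAWSMachineType(cpus int, shouldSupportLocalSSD bool) (machineType string, localSSD bool) {
	if cpus > 80 {
		// c6a does not support local SSDs; ignore the request rather than fail.
		return fmt.Sprintf("c6a.%dxlarge", cpus/4), false
	}
	if shouldSupportLocalSSD {
		// "d" variants carry local NVMe storage.
		return fmt.Sprintf("c6id.%dxlarge", cpus/4), true
	}
	return fmt.Sprintf("c6i.%dxlarge", cpus/4), false
}

func main() {
	// Above the 80-vCPU cutoff, the local-SSD request is silently dropped.
	mt, ssd := selectAWSMachineType(96, true)
	fmt.Println(mt, ssd)
}
```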

@srosenberg srosenberg requested a review from a team as a code owner March 5, 2024 04:22
@srosenberg srosenberg requested review from herkolategan and renatolabs and removed request for a team March 5, 2024 04:22
@cockroach-teamcity
Member

This change is Reviewable

@srosenberg srosenberg added backport-22.2.x backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only labels Mar 5, 2024
@srosenberg srosenberg force-pushed the sr/fix_SelectAWSMachineType_c6a_no_local_ssd branch from d1f5e12 to 58d1051 Compare March 8, 2024 05:00
Member Author


Increasing the step size tends to reduce the total number of line search iterations. E.g., two runs after the change,

grep -E "SEARCH ITER|MAX|success" artifacts/tpccbench/nodes\=9/cpu\=4/multi-region/run_1/test.log

00:55:09 tpcc.go:1678: --- SEARCH ITER FAIL: TPCC 2600 resulted in 19881.2 tpmC and failed due to efficiency value of 60.663357109930885 is below passing threshold of 85
01:14:14 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2570 resulted in 31678.6 tpmC (97.8% of max tpmC)
01:33:18 tpcc.go:1678: --- SEARCH ITER FAIL: TPCC 2585 resulted in 22991.8 tpmC and failed due to efficiency value of 70.56187811080218 is below passing threshold of 85
MAX WAREHOUSES = 2577
01:33:55 test_runner.go:1159: test completed successfully

03:32:24 tpcc.go:1678: --- SEARCH ITER FAIL: TPCC 2600 resulted in 27032.3 tpmC and failed due to efficiency value of 82.48333885126067 is below passing threshold of 85
03:51:31 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2570 resulted in 31911.8 tpmC (98.5% of max tpmC)
04:10:37 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2585 resulted in 32029.4 tpmC (98.3% of max tpmC)
MAX WAREHOUSES = 2592
04:11:12 test_runner.go:1159: test completed successfully

vs. before the change,

16:45:45 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2000 resulted in 25079.7 tpmC (99.5% of max tpmC)
17:04:47 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2015 resulted in 25286.7 tpmC (99.6% of max tpmC)
17:23:54 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2045 resulted in 25659.2 tpmC (99.5% of max tpmC)
17:43:00 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2105 resulted in 26349.3 tpmC (99.3% of max tpmC)
18:02:02 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2225 resulted in 27857.7 tpmC (99.3% of max tpmC)
18:21:07 tpcc.go:1678: --- SEARCH ITER FAIL: TPCC 2465 resulted in 23005.4 tpmC and failed due to efficiency value of 74.04061603766169 is below passing threshold of 85
18:40:13 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2345 resulted in 29223.6 tpmC (98.9% of max tpmC)
18:59:21 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2405 resulted in 29923.2 tpmC (98.7% of max tpmC)
19:18:30 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2435 resulted in 26741.5 tpmC (87.1% of max tpmC)
19:37:37 tpcc.go:1674: --- SEARCH ITER PASS: TPCC 2450 resulted in 27288.9 tpmC (88.4% of max tpmC)
MAX WAREHOUSES = 2457
19:38:14 test_runner.go:1159: test completed successfully
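The effect of a larger starting estimate and step size can be seen with a toy bracket-then-bisect line search. This is an illustrative sketch, not tpccbench's actual search algorithm: the pass/fail oracle, the numbers, and the assumption that the estimate starts at or below the true maximum are all made up for the example.

```go
package main

import "fmt"

// lineSearch widens upward from an initial estimate until a run fails, then
// bisects the failing bracket. Each call to pass stands in for one expensive
// benchmark run, so fewer iterations means a much shorter test. Assumes
// estimate <= trueMax.
func lineSearch(estimate, step, trueMax int) (max, iters int) {
	lo, hi := estimate, estimate
	pass := func(w int) bool { iters++; return w <= trueMax }
	for pass(hi) { // widen until a run fails
		lo, hi = hi, hi+step
	}
	for hi-lo > 1 { // bisect the [pass, fail] bracket
		mid := (lo + hi) / 2
		if pass(mid) {
			lo = mid
		} else {
			hi = mid
		}
	}
	return lo, iters
}

func main() {
	m1, i1 := lineSearch(2000, 50, 2580)  // low estimate, small step
	m2, i2 := lineSearch(2500, 150, 2580) // higher estimate, larger step
	// Both converge to the same maximum; the second uses far fewer runs.
	fmt.Println(m1, i1)
	fmt.Println(m2, i2)
}
```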

@srosenberg
Member Author

Kicked off a pair of runs matching tpccbench|kv0/enc=false/nodes=3/cpu=96,

@srosenberg
Member Author

srosenberg commented Mar 9, 2024

tpccbench/nodes=9/cpu=4/multi-region now takes ~3h 15m vs. 5h 18m before the change.

GCE before: (screenshot `gce_before`)
GCE after: (screenshot `gce_after`)
AWS before: (screenshot `aws_before`)
AWS after: (screenshot `aws_after`)

However, we're also seeing a slowdown in other configurations. E.g., `tpccbench/nodes=3/cpu=16/enc=true` took ~43m before and ~2h 31m after. Upon closer examination, the former executed in a freshly provisioned cluster whereas the latter ran in a reused cluster. This unexplained variance predates the changes in this PR; we see it happening even within the same run; e.g., `tpccbench/nodes=3/cpu=16` in GCE Before took ~2h 16m in a reused cluster. Why cluster reuse seems to harm the performance of `tpccbench` remains a mystery! 🕵️

Overall, the total sequential running time of all tpccbench roachtests is less than half what it was before the change, so I think it's safe to roll with it.

@srosenberg srosenberg force-pushed the sr/fix_SelectAWSMachineType_c6a_no_local_ssd branch from 58d1051 to 1df0644 Compare March 9, 2024 04:28
@srosenberg
Member Author

Adjusted several other tpccbench variants based on the max warehouses found in the previous run. Kicking off one final run before the merge:

GCE
AWS

@srosenberg srosenberg force-pushed the sr/fix_SelectAWSMachineType_c6a_no_local_ssd branch from 1df0644 to 627dd5e Compare March 10, 2024 16:22

@renatolabs renatolabs left a comment


A reminder that this test is only failing on the 23.1 release branch, so it would be nice to validate that we'd fix that issue if we merged and backported this PR.


Nits: might be clearer to avoid the double negative on the else. And maybe just set the "expectedMachineType" conditionally, to make it clearer that the rest remains the same.

Member Author


Yep, will simplify.

@srosenberg
Member Author

> A reminder that this test is only failing on the 23.1 release branch, so it would be nice to validate that we'd fix that issue if we merged and backported this PR.

Why only 23.1? It's failing on master and other branches afaik; e.g., last night's failure on master: #115949 (comment)

roachtest: SelectAWSMachineType should fall back to c6a without local SSD

In [1], we introduced falling back to `c6a` (AMD Milan) in
`SelectAWSMachineType`, when requested number of vCPUs > 80.
However, that family type doesn't support local SSDs.

Thus, when `shouldSupportLocalSSD=true` is requested,
we now ignore it. We also bump `EstimatedMaxGCE`
and `EstimatedMaxAWS` (both empirically derived)
for `tpccbench/nodes=9/cpu=4/multi-region` in order
to reduce the number of steps during the line search.
Otherwise, the test has been seen timing out, owing
largely to being executed on Ice Lake vs.
Cascade Lake (prior to [1]).

[1] cockroachdb#117852

Epic: none

Release note: None
@srosenberg srosenberg force-pushed the sr/fix_SelectAWSMachineType_c6a_no_local_ssd branch from 627dd5e to d550913 Compare March 14, 2024 02:15
@srosenberg
Member Author

Note that tpccbench/nodes=9/cpu=4/multi-region failed. The failure mode is very similar to several previously observed but unexplained failures. I had been investigating that failure, and I think I found something; will open a separate issue.

Meanwhile, I did another pair of full runs, both of which exhibited shorter overall durations. Thus, I believe this PR is good to go.

@srosenberg
Member Author

TFTR!

bors r=herkolategan,renatolabs

@craig
Contributor

craig bot commented Mar 14, 2024

@blathers-crl

blathers-crl bot commented Mar 14, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d550913 to blathers/backport-release-22.2-119900: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from d550913 to blathers/backport-release-23.1-119900: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@renatolabs

> Why only 23.1? It's failing on master and other branches afaik; e.g., last night's failure on master: #115949 (comment)

I believe this was a bad paste; the link is a cluster creation error for a kv0 test. And yes, unless I'm missing something, the tpccbench search timeout has only been happening on 23.1. Did a quick issue search and didn't find a similar failure on master.

@srosenberg
Member Author

> I believe this was a bad paste; the link is a cluster creation error for a kv0 test. And yes, unless I'm missing something, the tpccbench search timeout has only been happening on 23.1. Did a quick issue search and didn't find a similar failure on master.

Oh right, there are actually two issues addressed by this PR:

  • tpccbench search timeout on 23.1
  • kv0 on AWS using an illegal machine type

Both will have been resolved once the backports are merged.



4 participants