
Increase max Windows G5 count to 150 (same as windows.8xlarge.nvidia.gpu) #1376

Merged

huydhn merged 1 commit into pytorch:main from huydhn:increase-windows-g5-count on Jan 10, 2023

Conversation

@huydhn
Contributor

@huydhn huydhn commented Jan 10, 2023

I have obtained good results when trying to run Windows CUDA tests on G5 runners (pytorch/pytorch#91727). They are not only faster but also cheaper, with a 25% reduction in job duration and a 45% reduction in cost.

So I'm looking forward to rolling this out by increasing the max capacity of Windows G5 runners to 150 (the same as windows.8xlarge.nvidia.gpu). Do we have any concerns or limitations on the number of G5 runners in AWS us-east-1?

@huydhn huydhn requested review from a team, jeanschmidt and seemethere January 10, 2023 00:16
@huydhn huydhn self-assigned this Jan 10, 2023
@vercel

vercel bot commented Jan 10, 2023

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot facebook-github-bot added the CLA Signed label Jan 10, 2023
@vercel

vercel bot commented Jan 10, 2023

The latest updates on your projects.

1 Ignored Deployment
Name Status Preview Updated
torchci ⬜️ Ignored (Inspect) Jan 10, 2023 at 0:18AM (UTC)

@huydhn huydhn merged commit 2f7cef8 into pytorch:main Jan 10, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jan 13, 2023
### Changelist

* Change Windows `TORCH_CUDA_ARCH_LIST` from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by #91979
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one
  * This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to correctly use `GetDefaultCPUAllocator` instead of `GetDefaultMobileCPUAllocator`, which is meant for mobile builds
  * Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
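The first and last changes in the list above can be sketched as configuration, assuming a typical source build and the standard PyTorch TF32 switches (`torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`); the PR's CI uses its own build scripts and test-suite mechanism, which may differ:

```python
# Build time: target SM 8.6 (NVIDIA A10G) instead of SM 7.0 (V100), e.g.:
#   TORCH_CUDA_ARCH_LIST=8.6 python setup.py develop

# Test time: disable TF32 so CPU and GPU gradients agree closely enough
# for the Conv3d parity comparison on Ampere-class GPUs.
import torch

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```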

###  Performance gain

* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m for each shard, meaning a half-hour gain (**25%**)

### Pricing

On demand hourly rate:

* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~45%**
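The arithmetic above can be checked with a short script. This is a back-of-envelope sketch using the on-demand rates quoted in the PR; note that the formula as written actually comes out slightly above the conservatively quoted ~45%:

```python
# On-demand hourly rates from the PR (AWS us-east-1, Jan 2023).
P3_RATE = 3.428  # $/hour, p3.2xlarge (current runner)
G5_RATE = 2.36   # $/hour, g5.4xlarge (new runner)

# Duration: ~2h per shard today vs ~1.5h on G5.
duration_gain = (2.0 - 1.5) / 2.0  # 25% faster

# For each hour the p3 would run, the g5 only runs 0.75h,
# so compare P3_RATE against 0.75 * G5_RATE.
cost_saving = (P3_RATE - (1 - duration_gain) * G5_RATE) / P3_RATE

print(f"duration gain: {duration_gain:.0%}")  # 25%
print(f"cost saving:   {cost_saving:.1%}")    # 48.4%
```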

### Rolling out

pytorch/test-infra#1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.

Pull Request resolved: #91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
@huydhn huydhn deleted the increase-windows-g5-count branch February 9, 2023 19:45
