
Increase max Windows G5 count to 150 (same as windows.8xlarge.nvidia.gpu) #1376

Merged

huydhn merged 1 commit into pytorch:main from huydhn:increase-windows-g5-count on Jan 10, 2023

Conversation

@huydhn
Contributor

@huydhn huydhn commented Jan 10, 2023

I have obtained good results when trying to run Windows CUDA tests on G5 runners (pytorch/pytorch#91727). They are not only faster but also cheaper, with a 25% reduction in job duration and a 45% reduction in cost.

So I'm looking forward to rolling this out by increasing the max capacity of Windows G5 runners to 150 (the same as windows.8xlarge.nvidia.gpu). Do we have any concerns or limitations on the number of G5 runners in AWS us-east-1?

@huydhn huydhn requested review from a team, jeanschmidt and seemethere January 10, 2023 00:16
@huydhn huydhn self-assigned this Jan 10, 2023
@vercel

vercel bot commented Jan 10, 2023

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot facebook-github-bot added the CLA Signed label Jan 10, 2023
@vercel

vercel bot commented Jan 10, 2023

The latest updates on your projects.

1 Ignored Deployment
Name Status Preview Updated
torchci ⬜️ Ignored (Inspect) Jan 10, 2023 at 0:18AM (UTC)

@huydhn huydhn merged commit 2f7cef8 into pytorch:main Jan 10, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jan 13, 2023
### Changelist

* Change Windows `TORCH_CUDA_ARCH_LIST` from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by #91979
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one
  * This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to correctly use `GetDefaultCPUAllocator` instead of `GetDefaultMobileCPUAllocator`, which is meant for mobile builds
  * Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
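The first and last changes in the list above can be sketched as configuration, assuming a typical source build and the standard PyTorch TF32 switches (`torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`); the PR's CI uses its own build scripts and test-suite mechanism, which may differ:

```python
# Build time: target SM 8.6 (NVIDIA A10G) instead of SM 7.0 (V100), e.g.:
#   TORCH_CUDA_ARCH_LIST=8.6 python setup.py develop

# Test time: disable TF32 so CPU and GPU gradients agree closely enough
# for the Conv3d parity comparison on Ampere-class GPUs.
import torch

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```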

###  Performance gain

* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m for each shard, meaning a half-hour gain (**25%**)

### Pricing

On demand hourly rate:

* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~45%**
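The arithmetic above can be checked with a short script. This is a back-of-envelope sketch using the on-demand rates quoted in the PR; note that the formula as written actually comes out slightly above the conservatively quoted ~45%:

```python
# On-demand hourly rates from the PR (AWS us-east-1, Jan 2023).
P3_RATE = 3.428  # $/hour, p3.2xlarge (current runner)
G5_RATE = 2.36   # $/hour, g5.4xlarge (new runner)

# Duration: ~2h per shard today vs ~1.5h on G5.
duration_gain = (2.0 - 1.5) / 2.0  # 25% faster

# For each hour the p3 would run, the g5 only runs 0.75h,
# so compare P3_RATE against 0.75 * G5_RATE.
cost_saving = (P3_RATE - (1 - duration_gain) * G5_RATE) / P3_RATE

print(f"duration gain: {duration_gain:.0%}")  # 25%
print(f"cost saving:   {cost_saving:.1%}")    # 48.4%
```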

### Rolling out

pytorch/test-infra#1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.

Pull Request resolved: #91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
@huydhn huydhn deleted the increase-windows-g5-count branch February 9, 2023 19:45
