Increase max Windows G5 count to 150 (same as windows.8xlarge.nvidia.gpu) #1376
Merged
huydhn merged 1 commit into pytorch:main on Jan 10, 2023
Conversation
@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
seemethere approved these changes on Jan 10, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request on Jan 13, 2023
### Changelist

* Change the Windows `TORCH_CUDA_ARCH_LIST` from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU.
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by #91979.
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one.
* This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to correctly use `GetDefaultCPUAllocator` instead of `GetDefaultMobileCPUAllocator` for the mobile build.
* One periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, also fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.

### Performance gain

* (CURRENT) p3.2xlarge: https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 plus functorch) takes about 2 hours to finish (duration).
* (NEW RUNNER) g5.4xlarge: a rough estimate of the duration is 1h30m per shard, a half-hour gain (**25%**).

### Pricing

On-demand hourly rates:

* (CURRENT) p3.2xlarge: $3.428. Total = total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = total hours spent on Windows CUDA tests * duration ratio (0.75) * 2.36

So the current runner is not only more expensive but also slower. Switching Windows jobs to G5 runners should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~45%**.

### Rolling out

pytorch/test-infra#1376 needs to be reviewed and approved to ensure the capacity of the runners before this PR can be merged.

Pull Request resolved: #91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
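The pricing estimate above can be reproduced with a quick back-of-the-envelope calculation. This is just an illustrative sketch using the rates quoted in the PR description; the function name is hypothetical, not part of any PyTorch tooling:

```python
# Back-of-the-envelope cost comparison for switching Windows CUDA CI runners.
# Rates are the on-demand hourly prices quoted above (us-east-1, USD/hour).
P3_2XLARGE_RATE = 3.428   # current runner
G5_4XLARGE_RATE = 2.36    # proposed runner
DURATION_RATIO = 0.75     # G5 jobs finish in ~75% of the current time

def cost_saving_fraction(current_rate, new_rate, duration_ratio):
    """Fraction of cost saved: old cost minus new cost, relative to old cost.

    Both totals scale with the same number of "current" test hours, so that
    factor cancels out and only the rates and the duration ratio remain.
    """
    return (current_rate - duration_ratio * new_rate) / current_rate

saving = cost_saving_fraction(P3_2XLARGE_RATE, G5_4XLARGE_RATE, DURATION_RATIO)
print(f"Estimated cost reduction: {saving:.0%}")
```

Note that the formula as written evaluates to roughly 48%; the **~45%** figure in the description is presumably a more conservative rounding of the same estimate.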
I have obtained good results running Windows CUDA tests on G5 runners in pytorch/pytorch#91727. They are not only faster but also cheaper, with a 25% reduction in job duration and a 45% cost reduction.
So I'm looking forward to rolling this out by increasing the max capacity of Windows G5 runners to 150 (the same as windows.8xlarge.nvidia.gpu). Are there any concerns or limitations on the number of G5 runners in AWS us-east-1?