Switch Windows CI jobs to G5 runners #91727
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91727
✅ No failures as of commit fbe2846 (automatically generated by Dr. CI).
…gpu) (#1376) I have obtained good results running Windows CUDA tests on G5 runners (pytorch/pytorch#91727). They are not only faster but also cheaper, with a ~25% reduction in job duration and a ~45% cost reduction. So I'm looking forward to rolling this out by increasing the max capacity of Windows G5 runners to 150 (the same as windows.8xlarge.nvidia.gpu). Do we have any concerns or limitations on the number of G5 runners in AWS us-east-1?
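For context, runner capacity in pytorch/test-infra is controlled through a scale configuration file. A hypothetical sketch of the kind of change #1376 makes (the key names here are assumptions for illustration, not necessarily the exact schema of the real file):

```yaml
# Hypothetical scale-config fragment; the real file in pytorch/test-infra
# may use different field names.
runner_types:
  windows.g5.4xlarge.nvidia.gpu:
    instance_type: g5.4xlarge
    os: windows
    max_available: 150   # raised to match windows.8xlarge.nvidia.gpu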
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```python
self._test_gradients_helper(device, dtype, module_info, training, gradgradcheck)

@onlyCUDA
@with_tf32_off  # Turn off TF32 to compute at full precision https://github.com/pytorch/pytorch/issues/86798
```
…#92264) This is a small follow-up to #91727 to fix the flaky same-pointer check on Windows: https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20CPUAllocationPlanTest.with_profiling_alloc. AFAICT, keeping the same memory pointer is not guaranteed by the non-mobile memory allocator (or maybe this is Windows-specific behavior). The test can be flaky when the tensor is copied to a different memory location by the default allocator. This is OK as long as the values remain equal. Pull Request resolved: #92264 Approved by: https://github.com/ZainRizvi
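A hypothetical pure-Python analogue of the fix in #92264 (the real test is C++ in `cpu_profiling_allocator_test`): after a copy, the destination buffer may live at a different address, so asserting pointer equality is fragile, while asserting value equality is the robust check.

```python
import array

src = array.array("f", [1.0, 2.0, 3.0])
dst = array.array("f", src)  # copy into freshly allocated storage

# buffer_info()[0] is the address of the underlying buffer
same_pointer = src.buffer_info()[0] == dst.buffer_info()[0]
same_values = src.tolist() == dst.tolist()

print(same_pointer, same_values)  # → False True
```

The two live arrays necessarily occupy distinct buffers, so the pointer check fails even though the data is identical, which mirrors why the test was rewritten to compare values rather than addresses.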
|
@pytorchbot revert -c weird -m "Is causing instability amongst queuing time since linux and windows are using the same instance type" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
Reverting PR 91727 failed. Reason: Command … (Details for Dev Infra team: raised by workflow job)
This reverts commit 61cdae0.
These periodic tests were introduced in #92137. They've been consistently failing on trunk, so they are disabled until they're fixed. Sample failures: https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898 Pull Request resolved: #92902 Approved by: https://github.com/malfet
…ts (#92902)" This reverts commit bcbc522. Reverted #92902 on behalf of https://github.com/atalman due to Fixed by reverting #91727
Changelist
* The CUDA architecture list is bumped from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU.
* The runner has an AMD EPYC 7R32 CPU, not an Intel one, which surfaces a failure with `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to use `GetDefaultCPUAllocator` instead of `GetDefaultMobileCPUAllocator`, which is meant for mobile builds.
* `test_cpu_gpu_parity_nn_Conv3d_cuda_float32` fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.

Performance gain

Jobs run roughly 25% faster on G5 runners than on the current windows.8xlarge.nvidia.gpu runners.
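The TF32 issue in the last bullet comes from TF32 keeping only 10 explicit mantissa bits versus float32's 23, so small differences vanish when tensor cores round inputs. A self-contained illustration (not PyTorch's or NVIDIA's actual implementation) that mimics this by zeroing the low 13 bits of a float32 representation:

```python
import struct

def to_tf32(x: float) -> float:
    """Round a float32 value to TF32-like precision by zeroing the low
    13 mantissa bits, leaving 10 explicit mantissa bits. This mimics the
    precision loss of TF32 tensor cores; it is an illustration, not
    NVIDIA's exact rounding behavior."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

# 2**-11 is representable in float32 but falls below TF32 precision,
# so two values this close become indistinguishable under TF32:
print(to_tf32(1.0 + 2**-11))  # → 1.0
print(to_tf32(1.0 + 2**-10))  # → 1.0009765625 (still representable)
```

Gradients computed with TF32 matmuls on the GPU can therefore drift outside the default comparison tolerances against full-precision CPU results, which is why disabling TF32 for the parity test makes it pass.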
Pricing
On-demand hourly rate:

* windows.8xlarge.nvidia.gpu (current): $3.428/hour
* G5 runner: $2.36/hour
So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = ~45%
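As a sanity check, the arithmetic can be scripted. Note that with a flat 0.75 duration factor the formula actually evaluates closer to 48%, so the ~45% headline figure presumably reflects measured job times rather than the rounded 25% speedup estimate:

```python
# Back-of-the-envelope check of the cost claim, using the hourly rates
# above and a ~25% faster job duration on G5 (G5 jobs take ~0.75x the
# time of jobs on the current runner).
current_rate = 3.428   # windows.8xlarge.nvidia.gpu, USD/hour
g5_rate = 2.36         # G5 runner, USD/hour
duration_ratio = 0.75  # G5 job duration relative to the current runner

saving = (current_rate - duration_ratio * g5_rate) / current_rate
print(f"{saving:.0%}")  # → 48%
```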
Rolling out
pytorch/test-infra#1376 needs to be reviewed and approved, to ensure there is enough runner capacity, before this PR can be merged.