
Switch Windows CI jobs to G5 runners#91727

Closed
huydhn wants to merge 13 commits into pytorch:master from huydhn:windows-g5-runners

Conversation

Contributor

@huydhn huydhn commented Jan 4, 2023

Changelist

  • Change Windows TORCH_CUDA_ARCH_LIST from 7.0 to 8.6 to be compatible with the NVIDIA A10G GPU
  • Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has since been addressed by [SDPA] Guard mem efficient attention in deterministic mode #91979
  • The G5 runner has an AMD EPYC 7R32 CPU, not an Intel one
    • This seems to change the behavior of GetDefaultMobileCPUAllocator in cpu_profiling_allocator_test and might need further investigation (TODO: TRACKING ISSUE). In the meantime, the test has been updated to use GetDefaultCPUAllocator, keeping GetDefaultMobileCPUAllocator for mobile builds only
    • Also, one periodic test, test_cpu_gpu_parity_nn_Conv3d_cuda_float32, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for that test.
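The arch-list change above is a build-time setting; a sketch of the relevant environment configuration (build command shown for context only, not part of this PR's diff):

```shell
# TORCH_CUDA_ARCH_LIST selects the CUDA compute capabilities PyTorch
# compiles kernels for. The A10G on G5 runners is sm_86 (8.6); the V100
# on the old p3 runners is sm_70 (7.0).
export TORCH_CUDA_ARCH_LIST="8.6"
# python setup.py develop   # rebuild so the new arch list takes effect
```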

Performance gain

  • (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 plus functorch) takes about 2 hours to finish (duration)
  • (NEW RUNNER) g5.4xlarge - A rough estimate puts each shard at about 1h30m, a half-hour gain (~25%)

Pricing

On demand hourly rate:

  • (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
  • (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower. Switching Windows jobs to G5 runners should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = ~48%
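The estimate above can be reproduced with a few lines; the rates and the 0.75 duration factor are taken from the bullets, and the script is only the same back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope check of the cost estimate above.
p3_rate = 3.428          # p3.2xlarge on-demand $/hour (current runner)
g5_rate = 2.36           # g5.4xlarge on-demand $/hour (new runner)
duration_factor = 0.75   # G5 finishes in ~75% of the p3 wall-clock time

# For every hour of testing on p3, the G5 equivalent costs 0.75 * g5_rate.
savings = (p3_rate - duration_factor * g5_rate) / p3_rate
print(f"Estimated cost reduction: {savings:.0%}")  # prints "Estimated cost reduction: 48%"
```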

Rolling out

pytorch/test-infra#1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.

@huydhn huydhn added ciflow/trunk Trigger trunk jobs on your pull request test-config/functorch Use this label to run only functorch tests test-config/default labels Jan 4, 2023
@huydhn huydhn self-assigned this Jan 4, 2023
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 4, 2023

pytorch-bot bot commented Jan 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91727

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fbe2846:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn changed the title [EXPERIMENTATION] Switch Windows CI jobs to G5 runners Switch Windows CI jobs to G5 runners Jan 9, 2023
@huydhn huydhn marked this pull request as ready for review January 9, 2023 23:58
@huydhn huydhn requested a review from a team as a code owner January 9, 2023 23:58
@huydhn huydhn requested review from malfet and seemethere January 10, 2023 00:00
@huydhn huydhn added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR test-config/force_on_cpu labels Jan 10, 2023
huydhn added a commit to pytorch/test-infra that referenced this pull request Jan 10, 2023
…gpu) (#1376)

I have obtained good results when trying to run Windows CUDA tests on G5
runners pytorch/pytorch#91727. It's not only
faster but also cheaper, with a 25% reduction in job duration and a 45% cost
reduction.

So I'm looking forward to rolling this out by increasing the max capacity of
Windows G5 to 150 (same as windows.8xlarge.nvidia.gpu). Do we have any
concerns or limitations on the number of G5 runners in AWS us-east-1?
@huydhn
Contributor Author

huydhn commented Jan 13, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

self._test_gradients_helper(device, dtype, module_info, training, gradgradcheck)

@onlyCUDA
@with_tf32_off # Turn off TF32 to compute at full precision https://github.com/pytorch/pytorch/issues/86798
Contributor


👍 +1
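For context on the reviewed line: `with_tf32_off` runs the test with TF32 matmul kernels disabled so the CPU/GPU parity check happens at full float32 precision. A minimal sketch of the idea, using a stand-in settings object rather than PyTorch's real `torch.backends` flags (the names `_Backends`/`check_parity` are illustrative, not from the codebase):

```python
import functools

class _Backends:
    """Stand-in for torch.backends-style global flags (illustrative only)."""
    allow_tf32 = True

backends = _Backends()

def with_tf32_off(fn):
    """Run `fn` with TF32 disabled, restoring the previous setting after."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        saved = backends.allow_tf32
        backends.allow_tf32 = False
        try:
            return fn(*args, **kwargs)
        finally:
            backends.allow_tf32 = saved
    return wrapper

@with_tf32_off
def check_parity():
    # Inside the decorated test, TF32 is off.
    return backends.allow_tf32

print(check_parity())       # False while the test runs
print(backends.allow_tf32)  # True again afterwards
```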

pytorchmergebot pushed a commit that referenced this pull request Jan 17, 2023
…#92264)

This is a small follow-up from #91727 to fix the flaky same-pointer check on Windows https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20CPUAllocationPlanTest.with_profiling_alloc.  AFAICT, keeping the same memory pointer is not guaranteed by the non-mobile memory allocator (or maybe this is Windows-specific behavior).

The test might be flaky when the tensor is copied to a different memory location by the default allocator.  This is ok as long as the values remain equal.
Pull Request resolved: #92264
Approved by: https://github.com/ZainRizvi
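The fix described in that follow-up boils down to comparing values rather than addresses; a small Python stand-in for the C++ test's situation (a general-purpose allocator may place a logically identical buffer at a different address, so identity checks are flaky by design):

```python
# Value equality vs. identity: the property the fixed test relies on.
a = list(range(8))
b = list(a)           # a copy, typically at a different address
assert a == b         # value equality: what the test should check
assert a is not b     # identity: not guaranteed, and not what matters
print("value-equal, not identical")
```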
@seemethere
Member

@pytorchbot revert -c weird -m "Is causing instability amongst queuing time since linux and windows are using the same instance type"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 91727 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 61cdae0ce58bcbe048b143356fd9ded821225657 returned non-zero exit code 1

Auto-merging .github/workflows/_win-build.yml
Auto-merging .github/workflows/_win-test.yml
CONFLICT (content): Merge conflict in .github/workflows/_win-test.yml
Auto-merging .github/workflows/periodic.yml
Auto-merging .github/workflows/pull.yml
Auto-merging .github/workflows/trunk.yml
Auto-merging aten/src/ATen/test/cpu_profiling_allocator_test.cpp
Auto-merging test/test_modules.py
error: could not revert 61cdae0ce5... Switch Windows CI jobs to G5 runners (#91727)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
Details for Dev Infra team: raised by workflow job.

huydhn added a commit to huydhn/pytorch that referenced this pull request Jan 19, 2023
pytorchmergebot referenced this pull request Jan 25, 2023
These periodic tests were introduced in #92137

They've been consistently failing on trunk, so disabling them until they're fixed. Sample failures: https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
Pull Request resolved: #92902
Approved by: https://github.com/malfet
pytorchmergebot added a commit that referenced this pull request Jan 25, 2023
…ts (#92902)"

This reverts commit bcbc522.

Reverted #92902 on behalf of https://github.com/atalman due to Fixed by reverting #91727
@huydhn huydhn deleted the windows-g5-runners branch March 21, 2023 19:55
