
Switch Windows CI jobs to G5 runners#91727

Closed
huydhn wants to merge 13 commits into pytorch:master from huydhn:windows-g5-runners

Conversation

Contributor

@huydhn huydhn commented Jan 4, 2023

Changelist

  • Change Windows TORCH_CUDA_ARCH_LIST from 7.0 to 8.6 to be compatible with the NVIDIA A10G GPU
  • Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has since been addressed by [SDPA] Guard mem efficient attention in deterministic mode #91979
  • The G5 runner has an AMD EPYC 7R32 CPU, not an Intel one
    • This seems to change the behavior of GetDefaultMobileCPUAllocator in cpu_profiling_allocator_test and might need further investigation (TODO: TRACKING ISSUE). In the meantime, the test has been updated to use GetDefaultCPUAllocator, keeping GetDefaultMobileCPUAllocator for mobile builds only
    • Also, one periodic test, test_cpu_gpu_parity_nn_Conv3d_cuda_float32, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for that test.
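The arch-list change above is a build-time setting; a sketch of the relevant environment configuration (build command shown for context only, not part of this PR's diff):

```shell
# TORCH_CUDA_ARCH_LIST selects the CUDA compute capabilities PyTorch
# compiles kernels for. The A10G on G5 runners is sm_86 (8.6); the V100
# on the old p3 runners is sm_70 (7.0).
export TORCH_CUDA_ARCH_LIST="8.6"
# python setup.py develop   # rebuild so the new arch list takes effect
```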

Performance gain

  • (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 plus functorch) takes about 2 hours to finish (duration)
  • (NEW RUNNER) g5.4xlarge - A rough estimate puts each shard at about 1h30m, a half-hour gain (~25%)

Pricing

On demand hourly rate:

  • (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
  • (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower. Switching Windows jobs to G5 runners should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = ~48%
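The estimate above can be reproduced with a few lines; the rates and the 0.75 duration factor are taken from the bullets, and the script is only the same back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope check of the cost estimate above.
p3_rate = 3.428          # p3.2xlarge on-demand $/hour (current runner)
g5_rate = 2.36           # g5.4xlarge on-demand $/hour (new runner)
duration_factor = 0.75   # G5 finishes in ~75% of the p3 wall-clock time

# For every hour of testing on p3, the G5 equivalent costs 0.75 * g5_rate.
savings = (p3_rate - duration_factor * g5_rate) / p3_rate
print(f"Estimated cost reduction: {savings:.0%}")  # prints "Estimated cost reduction: 48%"
```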

Rolling out

pytorch/test-infra#1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.

@huydhn huydhn added ciflow/trunk Trigger trunk jobs on your pull request test-config/functorch Use this label to run only functorch tests test-config/default labels Jan 4, 2023
@huydhn huydhn self-assigned this Jan 4, 2023
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 4, 2023

pytorch-bot bot commented Jan 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91727

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fbe2846:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@huydhn huydhn changed the title [EXPERIMENTATION] Switch Windows CI jobs to G5 runners Switch Windows CI jobs to G5 runners Jan 9, 2023
@huydhn huydhn marked this pull request as ready for review January 9, 2023 23:58
@huydhn huydhn requested a review from a team as a code owner January 9, 2023 23:58
@huydhn huydhn requested review from malfet and seemethere January 10, 2023 00:00
@huydhn huydhn added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR test-config/force_on_cpu labels Jan 10, 2023
huydhn added a commit to pytorch/test-infra that referenced this pull request Jan 10, 2023
…gpu) (#1376)

I have obtained good results when trying to run Windows CUDA tests on G5
runners pytorch/pytorch#91727. It's not only
faster but also cheaper, with a 25% reduction in job duration and a 45% cost
reduction.

So I'm looking forward to rolling this out by increasing the max capacity of
Windows G5 to 150 (same as windows.8xlarge.nvidia.gpu). Do we have any
concerns or limitations on the number of G5 runners in AWS us-east-1?
@huydhn
Contributor Author

huydhn commented Jan 13, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

self._test_gradients_helper(device, dtype, module_info, training, gradgradcheck)

@onlyCUDA
@with_tf32_off # Turn off TF32 to compute at full precision https://github.com/pytorch/pytorch/issues/86798
Contributor


👍 +1
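For context on the reviewed line: `with_tf32_off` runs the test with TF32 matmul kernels disabled so the CPU/GPU parity check happens at full float32 precision. A minimal sketch of the idea, using a stand-in settings object rather than PyTorch's real `torch.backends` flags (the names `_Backends`/`check_parity` are illustrative, not from the codebase):

```python
import functools

class _Backends:
    """Stand-in for torch.backends-style global flags (illustrative only)."""
    allow_tf32 = True

backends = _Backends()

def with_tf32_off(fn):
    """Run `fn` with TF32 disabled, restoring the previous setting after."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        saved = backends.allow_tf32
        backends.allow_tf32 = False
        try:
            return fn(*args, **kwargs)
        finally:
            backends.allow_tf32 = saved
    return wrapper

@with_tf32_off
def check_parity():
    # Inside the decorated test, TF32 is off.
    return backends.allow_tf32

print(check_parity())       # False while the test runs
print(backends.allow_tf32)  # True again afterwards
```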

pytorchmergebot pushed a commit that referenced this pull request Jan 17, 2023
…#92264)

This is a small follow-up from #91727 to fix the flaky same-pointer check on Windows https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20CPUAllocationPlanTest.with_profiling_alloc.  AFAICT, keeping the same memory pointer is not guaranteed by the non-mobile memory allocator (or maybe this is Windows-specific behavior).

The test might be flaky when the tensor is copied to a different memory location by the default allocator.  This is ok as long as the values remain equal.
Pull Request resolved: #92264
Approved by: https://github.com/ZainRizvi
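The fix described in that follow-up boils down to comparing values rather than addresses; a small Python stand-in for the C++ test's situation (a general-purpose allocator may place a logically identical buffer at a different address, so identity checks are flaky by design):

```python
# Value equality vs. identity: the property the fixed test relies on.
a = list(range(8))
b = list(a)           # a copy, typically at a different address
assert a == b         # value equality: what the test should check
assert a is not b     # identity: not guaranteed, and not what matters
print("value-equal, not identical")
```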
@seemethere
Member

@pytorchbot revert -c weird -m "Is causing instability amongst queuing time since linux and windows are using the same instance type"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

Reverting PR 91727 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 61cdae0ce58bcbe048b143356fd9ded821225657 returned non-zero exit code 1

Auto-merging .github/workflows/_win-build.yml
Auto-merging .github/workflows/_win-test.yml
CONFLICT (content): Merge conflict in .github/workflows/_win-test.yml
Auto-merging .github/workflows/periodic.yml
Auto-merging .github/workflows/pull.yml
Auto-merging .github/workflows/trunk.yml
Auto-merging aten/src/ATen/test/cpu_profiling_allocator_test.cpp
Auto-merging test/test_modules.py
error: could not revert 61cdae0ce5... Switch Windows CI jobs to G5 runners (#91727)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
Details for Dev Infra team: raised by workflow job.

huydhn added a commit to huydhn/pytorch that referenced this pull request Jan 19, 2023
pytorchmergebot referenced this pull request Jan 25, 2023
These periodic tests were introduced in #92137

They've been consistently failing on trunk, so disabling them until they're fixed. Sample failures: https://hud.pytorch.org/pytorch/pytorch/commit/d8aa68c683bdf31f237bffb734b6038bc4f63898
Pull Request resolved: #92902
Approved by: https://github.com/malfet
pytorchmergebot added a commit that referenced this pull request Jan 25, 2023
…ts (#92902)"

This reverts commit bcbc522.

Reverted #92902 on behalf of https://github.com/atalman due to Fixed by reverting #91727
@huydhn huydhn deleted the windows-g5-runners branch March 21, 2023 19:55
