Refactoring Distributed test cases to be device agnostic [1/n]#145222
Refactoring Distributed test cases to be device agnostic [1/n]#145222AnantGulati wants to merge 7 commits intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145222
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ❌ 2 New Failures, 2 Unrelated FailuresAs of commit 908427d with merge base 106acf0 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@kwen2501 could you please help land this PR |
f90320a to
979e06c
Compare
|
The most recent commit changes the way cuda is set in DIstributedTestBase. Here we explicitly pass |
|
Changing the device setting logic from using |
|
@kwen2501 Could you please help land this PR |
|
"Try to land this since the failure is unrelated." |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2) Details for Dev Infra teamRaised by workflow job |
|
"Try to land this since the failure is unrelated." |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 4 checks: pull / linux-focal-py3.13-clang10 / test (dynamo_wrapped, 2, 3, lf.linux.2xlarge), trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
In this series of PR we intend to refactoring distributed test cases to enable to be completely device agnostic.
These changes will include the following approaches to do the same :
This should result in significant improvement in usability for all devices
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @xmfan