Refactoring pipeline parallelism test cases to be device agnostic [1/n] #146472
AnantGulati wants to merge 5 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146472
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 44719de with merge base 1c87280. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@kwen2501 Could you please review this PR?
H-Huang left a comment:
Overall looks fine to me, let's wait for CI before landing. These tests mostly cover pipeline parallelism utilities and are unrelated to the actual model splitting / execution used in pipeline parallelism. I think there are still some gaps in supporting torch.distributed.pipelining on CPU and other devices.
Yes, there is still more work required to support torch.distributed.pipelining on multiple devices. I am still analyzing the test cases that cover model splitting and execution in more detail, and I hope to add support for them in future PRs.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@H-Huang The merge is getting blocked because one ROCm check is not starting. Could you please advise? Thanks
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
In this series of PRs we intend to refactor the pipeline parallelism test cases to make them completely device agnostic.
These changes will include the following approaches:
This should improve usability across all devices.
For this PR we have shown support for the following devices:
To add another device, new users can simply append their device to the device list.
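To illustrate the device-list pattern described above, here is a minimal hedged sketch of how a device-agnostic test loop can be structured. The names `DEVICE_TYPES`, `run_on_each_device`, and `sample_test` are hypothetical and are not taken from the PR itself; they only show the general shape of enumerating a device list and running the same test body on each entry.

```python
import torch

# Hypothetical device list (illustrative, not the PR's actual list).
# New backends would be supported by appending their device type here.
DEVICE_TYPES = ["cpu"]
if torch.cuda.is_available():
    DEVICE_TYPES.append("cuda")


def run_on_each_device(fn):
    """Run a test body once per available device type and collect results."""
    results = {}
    for device_type in DEVICE_TYPES:
        results[device_type] = fn(torch.device(device_type))
    return results


def sample_test(device):
    # A trivial test body: the same tensor math runs unchanged on any device.
    x = torch.ones(4, device=device)
    return (x + x).sum().item()


# Prints one result per available device, e.g. a 'cpu' entry at minimum.
print(run_on_each_device(sample_test))
```

The point of the pattern is that test bodies never hard-code `"cuda"`; they receive the device as a parameter, so extending coverage to a new backend is a one-line change to the device list.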
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o