Enable capturing of comm collective parameters (#98) #85368
louisfeng wants to merge 1 commit into pytorch:master
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85368
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures, 1 Pending
As of commit 7dcc84c:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D38357077
de77af2 to 67d6c15
67d6c15 to 58d1398
Summary:
X-link: pytorch/pytorch#85368
Pull Request resolved: facebookresearch#98
Add tensor input, output, and other metadata for PyTorch comms.
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: 144852fffaba47434d15f121970121b2419026a1
58d1398 to 37f48cd
Summary:
X-link: pytorch/pytorch#85368
Pull Request resolved: facebookresearch#98
Add tensor input, output, and other metadata for PyTorch comms.
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: 6945ed29905be2466fe1056f5ad77b332da70005
37f48cd to fcd9e3c
H-Huang left a comment
Hi, thanks for the PR. I am actually not familiar with the original use case for RECORD_PARAM_COMMS. Where does it get logged, and is there a way to see the before/after of this change? Thanks!
Reason for adding this? (since initialization is not a collective)
The network.ai team requested this feature. We can collect when a network collective is created during initialization. cc: Pavani.
Since this is a general comm util, it probably shouldn't be named WorkNCCL (a backend-specific object); maybe name it just Work instead? I have another comment asking to clarify what is actually getting logged here.
Yes, makes sense, will fix.
This is supposed to be the ProcessGroup object, which together with the seq number can uniquely identify each collective operation.
Confused about recording this. Isn't this a pointer to ProcessGroupNCCL rather than a work obj? The same applies to the other collectives.
Modified the comment; the code is correct, but the comment was wrong. It should be the process group object.
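To illustrate the point above about identifying collectives, here is a minimal Python sketch. It assumes each process group hands out an increasing sequence number per collective, so the pair (process group, seq) is unique. `ToyProcessGroup` and `record_collective` are hypothetical names used for illustration only; they are not part of c10d or of this PR.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyProcessGroup:
    """Hypothetical stand-in for a process group; not the c10d class."""
    name: str
    # Per-group counter: each collective on this group gets the next number.
    _seq: count = field(default_factory=count, repr=False)

    def next_seq(self) -> int:
        return next(self._seq)


def record_collective(log: dict, pg: ToyProcessGroup, op: str) -> tuple:
    """Store one collective under the key (process group identity, seq)."""
    # id(pg) plays the role of the process group pointer in this sketch.
    key = (id(pg), pg.next_seq())
    log[key] = op
    return key


log = {}
pg_a = ToyProcessGroup("a")
pg_b = ToyProcessGroup("b")
k1 = record_collective(log, pg_a, "all_reduce")
k2 = record_collective(log, pg_a, "all_reduce")  # same group, next seq
k3 = record_collective(log, pg_b, "all_reduce")  # other group, seq restarts
```

The sequence number alone repeats across groups (here `k1` and `k3` both carry seq 0), which is why the process group identity has to be part of the key.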
Summary:
X-link: pytorch/pytorch#85368
Pull Request resolved: facebookresearch#98
Add tensor input, output, and other metadata for PyTorch comms.
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: 51f4c616816e8fb5bc72e4d713df24a7fecdcd41
/easycla As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
294a0ed to 7bec7ad
Summary:
Pull Request resolved: pytorch#85368
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan: P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: 71c256c6355abe8c384508d450c33fe8f8a51da4
7bec7ad to 86ca64f
Summary:
Pull Request resolved: pytorch#85368
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan: P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: 379a6b951d2051967cb4c90391241e42456c6497
86ca64f to 32171c3
Summary:
Pull Request resolved: pytorch#85368
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```
P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: f339b00ab09d5b7a21d2607ef54d21ff078af501
Summary:
Pull Request resolved: pytorch#85368
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```
P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: c5394ff04ebe6af223c0e0a69a9cb558cc4c3a3f
32171c3 to 7dcc84c
Summary:
X-link: pytorch/pytorch#85368
Pull Request resolved: #98
Add tensor input, output, and other metadata for PyTorch comms.
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: e9545eeea3311b68762781a5f6aa585971aa08fe
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Hey @louisfeng.
Summary:
Pull Request resolved: #85368
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```
P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077
fbshipit-source-id: e9545eeea3311b68762781a5f6aa585971aa08fe
Summary:
X-link: facebookresearch/torch_ucc#98
Add tensor input, output, and other metadata for PyTorch comms.
Test Plan: P517138779
Reviewed By: Pavani-Panakanti
Differential Revision: D38357077