Skip to content

Enable capturing of comm collective parameters (#98)#85368

Closed
louisfeng wants to merge 1 commit intopytorch:masterfrom
louisfeng:export-D38357077
Closed

Enable capturing of comm collective parameters (#98)#85368
louisfeng wants to merge 1 commit intopytorch:masterfrom
louisfeng:export-D38357077

Conversation

@louisfeng
Copy link
Contributor

Summary:
X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

@pytorch-bot
Copy link

pytorch-bot bot commented Sep 20, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85368

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 1 Pending

As of commit 7dcc84c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 20, 2022
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

louisfeng added a commit to louisfeng/torch_ucc that referenced this pull request Sep 20, 2022
Summary:
X-link: pytorch/pytorch#85368

Pull Request resolved: facebookresearch#98

Add tensor input, output, and other metadata for PyTorch comms.

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: 144852fffaba47434d15f121970121b2419026a1
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

louisfeng added a commit to louisfeng/torch_ucc that referenced this pull request Sep 21, 2022
Summary:
X-link: pytorch/pytorch#85368

Pull Request resolved: facebookresearch#98

Add tensor input, output, and other metadata for PyTorch comms.

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: 6945ed29905be2466fe1056f5ad77b332da70005
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for the PR. I am actually not familiar with the original use case for RECORD_PARAM_COMMS. Where does it get logged and is there a way to see the before / after of this change? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for adding this? (since initialization is not a collective)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The network.ai team requested this feature. We can collect when a network collective is created during initialization. cc: Pavani.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a general comm util. It probably shouldn't be named WorkNCCL (a backend specific object) and make instead named to just Work instead? I have another comment to clarify what is actually getting logged here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense, will fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This supposed to be the ProcessGroup object, with the seq number can uniquely identify each collective operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused about recording this. Isn't this a pointer to ProcessGroupNCCL rather than a work obj? Same applies to the other collectives

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified the comment, the code is correct, but the comment was wrong. Should be the process group object.

louisfeng added a commit to louisfeng/torch_ucc that referenced this pull request Sep 28, 2022
Summary:
X-link: pytorch/pytorch#85368

Pull Request resolved: facebookresearch#98

Add tensor input, output, and other metadata for PyTorch comms.

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: 51f4c616816e8fb5bc72e4d713df24a7fecdcd41
@louisfeng louisfeng requested review from H-Huang and removed request for awgu, mingzhe09088, wz337 and zhaojuanmao October 3, 2022 20:18
Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for updating

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 3, 2022
@facebook-github-bot
Copy link
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: louisfeng / name: Louis Feng (294a0ed)

louisfeng added a commit to louisfeng/pytorch that referenced this pull request Oct 4, 2022
Summary:
Pull Request resolved: pytorch#85368

X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: 71c256c6355abe8c384508d450c33fe8f8a51da4
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 6, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: louisfeng / name: Louis Feng (7dcc84c)

louisfeng added a commit to louisfeng/pytorch that referenced this pull request Oct 6, 2022
Summary:
Pull Request resolved: pytorch#85368

X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan: P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: 379a6b951d2051967cb4c90391241e42456c6497
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

louisfeng added a commit to louisfeng/pytorch that referenced this pull request Oct 7, 2022
Summary:
Pull Request resolved: pytorch#85368

X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```

P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: f339b00ab09d5b7a21d2607ef54d21ff078af501
Summary:
Pull Request resolved: pytorch#85368

X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```

P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: c5394ff04ebe6af223c0e0a69a9cb558cc4c3a3f
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D38357077

facebook-github-bot pushed a commit to facebookresearch/torch_ucc that referenced this pull request Oct 11, 2022
Summary:
X-link: pytorch/pytorch#85368

Pull Request resolved: #98

Add tensor input, output, and other metadata for PyTorch comms.

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: e9545eeea3311b68762781a5f6aa585971aa08fe
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions
Copy link
Contributor

Hey @louisfeng.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Oct 11, 2022
Summary:
Pull Request resolved: #85368

X-link: facebookresearch/torch_ucc#98

Add tensor input, output, and other metadata for PyTorch comms.

Test Plan:
```
buck build mode/opt-split-dwarf -c fbcode.nvcc_arch=a100 //hpc/models/ads:ads_10x_launcher
buck-out/gen/hpc/models/ads/ads_10x_launcher.par -- +launcher=local launcher.num_trainers=2 +data_loader=random
```

P517138779

Reviewed By: Pavani-Panakanti

Differential Revision: D38357077

fbshipit-source-id: e9545eeea3311b68762781a5f6aa585971aa08fe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request cla signed fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants