[ddp] use named_params and named_buffers explicitly by wanchaol · Pull Request #65181 · pytorch/pytorch

wanchaol · 2021-09-16T23:40:58Z

Stack from ghstack:

-> [ddp] use named_params and named_buffers explicitly #65181

This PR replaces `state_dict()` with explicit `named_parameters` and `named_buffers` during sync. The underlying motivation is that `state_dict()` does not necessarily equal "params + buffers" in all cases: `state_dict` is used mainly for checkpointing, while params/buffers are used for training. There may be cases where params/buffers take a different form from `state_dict` (e.g. in `state_dict` we might want to save small pieces of tensors, while in training we want to concatenate the tensors for performance reasons).
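A minimal sketch of the mismatch described above, not code from this PR. `FusedLinear`, its `weight_piece_*` checkpoint keys, and the `scale` buffer are hypothetical names chosen for illustration. The module trains on one fused weight but checkpoints it in pieces, and it also holds a non-persistent buffer, which `state_dict()` omits.

```python
import torch
import torch.nn as nn


class FusedLinear(nn.Module):
    """Hypothetical module: trains on one fused weight, checkpoints it in pieces."""

    def __init__(self, in_features=4, out_features=6, num_pieces=2):
        super().__init__()
        self.num_pieces = num_pieces
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        # Non-persistent buffers are used in training but excluded from state_dict().
        self.register_buffer("scale", torch.ones(1), persistent=False)

    def forward(self, x):
        return (x @ self.weight.t()) * self.scale

    def state_dict(self, *args, **kwargs):
        # Illustrative override: split the fused weight into smaller tensors
        # for checkpointing. Only exercised here on a top-level module.
        sd = super().state_dict(*args, **kwargs)
        prefix = kwargs.get("prefix", "")
        fused = sd.pop(prefix + "weight")
        for i, piece in enumerate(fused.chunk(self.num_pieces, dim=0)):
            sd[f"{prefix}weight_piece_{i}"] = piece
        return sd


model = FusedLinear()

# What a checkpoint sees.
checkpoint_keys = sorted(model.state_dict().keys())
# What training (and therefore a sync step) actually operates on.
training_keys = sorted(
    [name for name, _ in model.named_parameters()]
    + [name for name, _ in model.named_buffers()]
)

print(checkpoint_keys)  # ['weight_piece_0', 'weight_piece_1']
print(training_keys)    # ['scale', 'weight']
```

For a module like this, syncing the entries of `state_dict()` would skip `scale` and would not touch the fused `weight` tensor that training uses, which is the kind of case the description gives as motivation.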

Differential Revision: D31007085

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23


facebook-github-bot · 2021-09-16T23:41:02Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/65181

💊 CI failures summary and remediations

As of commit 6420f09 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚



rohan-varma

Looks good overall, but a couple of test failures that appear to be related.


wanchaol · 2021-09-22T17:00:46Z

> Looks good overall, but a couple of test failures that appear to be related.

Thanks, yeah, just fixed them. I wasn't detaching gradients in the last try; it should be ready now.
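One way to read the detaching remark, sketched here as an assumption rather than as the actual fix in this PR: `state_dict()` returns detached tensors by default (`keep_vars=False`), while `named_parameters()` yields the live `nn.Parameter` objects, so code that switches from one to the other has to detach explicitly. `nn.BatchNorm1d` is used only because it has both parameters and buffers.

```python
import torch
import torch.nn as nn

model = nn.BatchNorm1d(3)  # has parameters (weight, bias) and buffers

# state_dict() detaches its values by default (keep_vars=False).
from_state_dict = list(model.state_dict().values())

# named_parameters() returns the live parameters, still attached to autograd.
raw_params = [p for _, p in model.named_parameters()]

# Explicitly detached views over the same storage, parameters then buffers.
detached = [p.detach() for _, p in model.named_parameters()]
detached += [b.detach() for _, b in model.named_buffers()]

print(all(not t.requires_grad for t in from_state_dict))  # True
print(all(p.requires_grad for p in raw_params))           # True
print(all(not t.requires_grad for t in detached))         # True
```

Because `detach()` returns a tensor sharing storage with the original, in-place updates to the detached tensors are still visible through the module's parameters and buffers.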

rohan-varma

Thanks!

facebook-github-bot · 2021-09-23T00:38:34Z

This pull request has been merged in 2f67579.

wanchaol requested review from H-Huang, cbalioglu, mingzhe09088, mrshenli, pritamdamania87, rohan-varma, wayi1 and zhaojuanmao as code owners September 16, 2021 23:40

facebook-github-bot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) Sep 16, 2021

facebook-github-bot added the cla signed label Sep 16, 2021

rohan-varma reviewed Sep 17, 2021


wanchaol requested a review from rohan-varma September 22, 2021 16:59

rohan-varma approved these changes Sep 22, 2021


facebook-github-bot closed this in 2f67579 Sep 23, 2021

facebook-github-bot added the Merged label Sep 23, 2021

facebook-github-bot deleted the gh/wanchaol/172/head branch September 26, 2021 14:17

wanchaol wants to merge 3 commits into gh/wanchaol/172/base from gh/wanchaol/172/head


3 participants
