
Fix SpectralNorm with DataParallel #12671

Closed

ssnl wants to merge 5 commits into pytorch:master from ssnl:dp_dup_module

Conversation

@ssnl (Collaborator) commented Oct 15, 2018

There were two problems with SN + DP:

  1. In SN, the updated `_u` vector is saved back to the module via `setattr`. However, in DP, everything is run on a replica, so those updates are lost.
  2. In DP, the buffers are broadcast via `broadcast_coalesced`, so on the replicas they are all views. Therefore, the `detach_` call won't work. (See the repro sketch below.)
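
For context, a minimal repro sketch of the broken setup (illustrative only; it assumes at least two CUDA devices, and the layer shapes are arbitrary):

```python
import torch
import torch.nn as nn

# Sketch of the scenario this PR fixes (needs >= 2 GPUs).
layer = nn.utils.spectral_norm(nn.Linear(20, 40)).cuda()
model = nn.DataParallel(layer, device_ids=[0, 1])
out = model(torch.randn(8, 20).cuda())
# Before this fix, the power-iteration update landed on a replica via setattr
# and was discarded, so `layer.weight_u` never changed between iterations.
```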

Fixes are:

  1. Update the `_u` vector in-place so that, via the storage shared between the 1st replica and the parallelized module, the update is retained (see the sketch after this list).
  2. Do not call detach_.
  3. Added comments in SN about the subtlety.
  4. Added a note to the DP doc on this particular behavior of DP.
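
A sketch of what fix 1 amounts to (illustrative; the attribute name follows the spectral_norm convention, but this is not the actual library code):

```python
import torch

def save_updated_u(replica, new_u):
    # Lost under DP: rebinding only changes the replica object, which is discarded.
    # replica.weight_u = new_u

    # Retained: the 1st replica's buffer shares storage with the parallelized
    # module, so an in-place write is still visible after the forward pass.
    with torch.no_grad():
        replica.weight_u.copy_(new_u)
```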

cc @crcrpar @taesung89 @t-vi @YaoshengFu

Fixes #11476

@ssnl (Collaborator, Author) commented Oct 15, 2018

Autograd in eval mode still has problems, but I decided to fix that in a later PR due to BC complications.

@ssnl (Collaborator, Author) commented Oct 16, 2018

@t-vi or @crcrpar Wanna review this one? :)

@crcrpar (Collaborator) commented Oct 16, 2018

@ssnl I read through this and it seems solid! 👍

@t-vi (Collaborator) commented Oct 16, 2018

Looks all reasonable to me, but I lack the Distributed expertise for that to mean much.
One bit I'm not sure about is the following: do we / should we test that weight_orig/weight do get nonzero gradients on backward if they require grad?

Out of curiosity: what is the BC break when you recompute the weight in eval mode instead of detaching?

@ssnl (Collaborator, Author) commented Oct 16, 2018

@crcrpar @t-vi Thanks for looking!

@t-vi The problem with recomputing in eval mode is that we only store `weight_orig` and `weight_u`, but we also need `weight_v` to recompute the `weight` graph. Note that `weight_v` is not uniquely determined even though we also store `weight`, and because our power iteration updates the v vector first and the u vector second, it can't be computed from u either.

Therefore, I want to also register `weight_v` as a buffer. Additionally, I want to register `weight` not as a buffer but as a plain attribute: registering it as a buffer makes the state dict redundant, and it is not really a buffer. To make all these changes, I need to figure out a way to hook into `load_state_dict` when people register `spectral_norm`. I will add the test you mentioned when that patch happens, because `weight_orig` still doesn't get a gradient in eval as of this patch.
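
For reference, a minimal sketch of one power-iteration step (illustrative, not the library code) showing why the stored `u` alone does not determine `v`:

```python
import torch
import torch.nn.functional as F

def power_iteration_step(weight, u, eps=1e-12):
    v = F.normalize(torch.mv(weight.t(), u), dim=0, eps=eps)  # v comes from the *old* u
    u = F.normalize(torch.mv(weight, v), dim=0, eps=eps)      # u is then overwritten
    sigma = torch.dot(u, torch.mv(weight, v))                  # spectral norm estimate
    return u, v, sigma
```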

@facebook-github-bot (Contributor) left a comment

SsnL is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@YaoshengFu commented Oct 18, 2018

I updated my code with the latest spectral_norm implementation (I just replaced spectral_norm.py and function.py), but I got the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

This error disappears if I switch back to the old spectral_norm implementation. @ssnl

@ssnl deleted the dp_dup_module branch October 19, 2018 00:07
@ssnl (Collaborator, Author) commented Oct 19, 2018

@YaoshengFu I can't reproduce the error you see. Could you install the nightly and check if the error still happens?

@YaoshengFu

I have re-installed the latest version of pytorch from source and it still has the same error. I have tried it on different projects and I didn't see a difference. For example, you can try to run code from this repo: https://github.com/rosinality/sagan-pytorch

Just replace the spectral_norm implementation in model.py or model_resnet.py (which seems to be replicated from the previous official implementation as well) with the official one and run it; you should be able to see the same error (at least I did).

@ssnl (Collaborator, Author) commented Oct 20, 2018

@YaoshengFu Thanks, I will!

facebook-github-bot pushed a commit that referenced this pull request Nov 7, 2018
Summary:
Problems with SN and DP after #12671:
1. in eval mode, `weight_orig` is not getting the correct gradient (#12737).

    Fix: keep `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)` even in eval.

2. in training mode, the `weight` buffer of the parallelized module is never updated if someone touches `weight_orig` and/or `weight` and makes them stop sharing storage, so in `eval` the weight used is wrong.

    Fix: Make `weight` not a buffer anymore and always calculate it as above.

3. #12671 changed SN to update `u` in-place to make DP work correctly, but then it breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the 1st forward are changed in the 2nd forward.

    Fix: This PR clones `u` and `v` before using them.

To maintain BC, I added a hook interface for producing and loading the state_dict. This is ugly and we should really have a better interface for spectral_norm, but to fix this issue I am making this patch. Even with a better interface, a BC mechanism for loading legacy state_dicts would still be needed.

cc crcrpar
Pull Request resolved: #13350

Differential Revision: D12931044

Pulled By: SsnL

fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef
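
A sketch of the recomputation described in this summary, assuming `u` and `v` are both kept as buffers next to `weight_orig` (the real implementation also handles reshaping for weights with more than two dimensions):

```python
import torch

def compute_weight(weight_orig, u, v):
    sigma = torch.dot(u, torch.mv(weight_orig, v))  # u @ W_orig @ v
    return weight_orig / sigma
```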
vkuzo added a commit that referenced this pull request Apr 21, 2020
Summary:

DataParallel requires all params and buffers of child modules to be updated
in place because of how it implements model replication during the
forward pass (see #12671 for
context). Any params or buffers not updated in place are lost and not
propagated back to the master.

This diff updates (some quantized modules) (TBD: all quantized modules? determine a good cut
point) to do their parameter update in-place. This will enable static
quant and QAT to work correctly with DataParallel.

Test Plan:

script failed before and passes after the diff:
https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40

TODO before land: add integration testing

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request May 5, 2020
Summary:
Pull Request resolved: #37032

DataParallel requires all params and buffers of child modules to be updated
in place because of how it implements model replication during the
forward pass (see #12671 for
context). Any params or buffers not updated in place are lost and not
propagated back to the master.

This diff updates (some quantized modules) (TBD: all quantized modules? determine a good cut
point) to do their parameter update in-place. This will enable static
quant and QAT to work correctly with DataParallel.

TODO: #32684 needs to land before we can fix the graph mode test failures on this PR.

Test Plan:
script failed before and passes after the diff:
https://gist.github.com/vkuzo/78b06c01f23f98ee2aaaeb37e55f8d40

TODO before land: add integration testing

Imported from OSS

Differential Revision: D21206454

fbshipit-source-id: df6b4b04d0ae0f7ef582c82d81418163019e96f7
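
For illustration, a hypothetical minimal observer following the in-place pattern this commit describes (the class and buffer names are made up for the example and are not the actual torch.quantization observers):

```python
import torch
import torch.nn as nn

class MinMaxSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer('min_val', torch.tensor(float('inf')))
        self.register_buffer('max_val', torch.tensor(float('-inf')))

    def forward(self, x):
        x_det = x.detach()  # observer statistics should not flow into autograd
        # In-place updates survive DataParallel replication.
        self.min_val.copy_(torch.min(self.min_val, x_det.min()))
        self.max_val.copy_(torch.max(self.max_val, x_det.max()))
        # Rebinding (self.min_val = x_det.min()) would be lost on the replicas.
        return x
```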


Development

Successfully merging this pull request may close these issues.

DataParallel replicates network object for device 0
