Fix more spectral norm bugs#13350
Conversation
@YaoshengFu This PR will fix the spectral norm bug you see.
I really appreciate you guys. I have one question to check my understanding. Is this correct?
@crcrpar Almost!
Sorry for my late response. The changes are good, but I have some questions about the test code.
Thank you for your comments @crcrpar :)
I am still having a problem with spectral norm complaining about in-place changes when it is followed by batch normalization and distributed across 2 or more GPUs using DataParallel. This architecture arises in BigGAN (https://arxiv.org/abs/1809.11096). Interestingly, this is not an issue with group normalization, instance normalization, or no normalization, and it is also not an issue on a single GPU.
So, to get this sorted out a bit: how many cases do we have? with/without DP × eval/training × weight.requires_grad=True/False × ???
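As a sanity check on the combinatorics, the three binary axes listed above already give eight cases. A small sketch of that test matrix (the axis names here are illustrative, not the actual test parameter names):

```python
from itertools import product

# Each binary axis doubles the number of cases to cover.
axes = {
    "data_parallel": (False, True),        # with/without DP
    "training": (False, True),             # eval vs. training mode
    "weight_requires_grad": (False, True), # requires_grad on weight_orig
}

cases = [dict(zip(axes, values)) for values in product(*axes.values())]
print(len(cases))  # 8 combinations
```

Any further axis (the "???" above) would double the count again.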
@zmurez Even with this patch?
@t-vi Yeah, I think those 8 cases are all we need to consider. |
I think so. It complained regardless of normalization and number of GPUs prior to adding this patch. However, I just copied the relevant lines into my own spectral_norm.py file instead of pulling the entire branch, so it is possible I missed something, but that seems unlikely since all the other cases are fixed. Note that I am still using the latest stable release. To get this patch to work, I also grabbed a copy of the normalize function and implemented my own chain_matmul (a single if statement in this case, with 3 matrices). Does this bug exist for you? If not, I guess I will have to consider updating to the unstable version. Thanks!
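For reference, the three-matrix special case of chain_matmul mentioned above really does reduce to a single comparison. A hypothetical numpy stand-in (not the code actually used in the PR or in torch):

```python
import numpy as np

def chain_matmul3(A, B, C):
    """Multiply three matrices, choosing the cheaper association.

    For shapes (p, q), (q, r), (r, s):
      (A @ B) @ C costs p*q*r + p*r*s scalar multiplications,
      A @ (B @ C) costs q*r*s + p*q*s.
    A single if statement picks the cheaper one; the result is the same.
    """
    p, q = A.shape
    r = B.shape[1]
    s = C.shape[1]
    if p * q * r + p * r * s <= q * r * s + p * q * s:
        return (A @ B) @ C
    return A @ (B @ C)
```

For spectral norm the product is u @ W @ v with two vectors, so either association is cheap; the choice only matters for larger middle dimensions.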
@zmurez It is possible that other changes are needed. Do you have a small repro script I can try on my build?
t-vi left a comment
Thanks for adding all the new tests! They seem comprehensive, and I think the patch is good with them.
facebook-github-bot left a comment
@ssnl is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
It seems that when spectral norm is applied to a conv module, u is randomly initialized, and v is then computed such that the power-iteration invariant holds.
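For intuition, here is a small numpy sketch of that initialization and one power-iteration step (not PyTorch's implementation; l2_normalize is an illustrative stand-in for the eps-guarded normalize used by spectral norm):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Eps-guarded L2 normalization, mirroring torch.nn.functional.normalize.
    return x / (np.linalg.norm(x) + eps)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))  # conv weights are flattened to 2D before SN

# u is drawn randomly; v is then computed so that the invariant
# v = normalize(W^T u) holds from the start.
u = l2_normalize(rng.standard_normal(4))
v = l2_normalize(W.T @ u)

# One power-iteration step; sigma estimates the largest singular value of W,
# and more iterations tighten the estimate.
u = l2_normalize(W @ v)
sigma = u @ W @ v
```

The normalized weight is then W / sigma, so the module's effective weight has (approximately) unit spectral norm.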
@zmurez Good point. I'll think about ways to fix it.
For the initialization, I think it should not be terribly important whether u or v is initialized randomly, so that could likely be changed. For loading the state dict: this only happens when you load an "old version" state dict. How about warning about the performance cost and (potentially) offering a switch to re-compute u via a forward pass instead of solving for v? (I think it might be safe to "do the right, slow thing" by default rather than the other way round.)
I agree with t-vi. Random initialization of |
facebook-github-bot left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: In `broadcast_coalesced`, since multiple variables can be "views" of a big flattened tensor, they can share the same version counter. However, this base flat tensor is not exposed, and they don't share any memory locations, so this is not necessary. Furthermore, it can cause problems: e.g., when two buffers are broadcast together in `DataParallel` and one of them is modified in-place during `forward` but the other is needed in backward, the autograd engine will complain. Fixes the bug discovered at #13350 (comment).

edit: This is a very real problem, e.g., consider using Spectral Norm + Batch Norm together.

Pull Request resolved: #13594
Differential Revision: D12967311
Pulled By: SsnL
fbshipit-source-id: 52998dbabe149f575cf0fb79e7016f0b95e4b9e5
Summary: Problems with SN and DP after pytorch#12671:

1. In eval mode, `weight_orig` is not getting the correct gradient (pytorch#12737). Fix: keep the `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)`, even in eval.
2. In training mode, the `weight` buffer of the parallelized module is never updated if someone touches `weight_orig` and/or `weight` and makes them stop sharing storage, so in eval the weight used is wrong. Fix: make `weight` not a buffer anymore and always calculate it as above.
3. pytorch#12671 changed SN to update `u` in-place to make DP work correctly, but that breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the first forward are changed in the second forward. Fix: this PR clones `u` and `v` before using them.

To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly, and we should really have a better interface for spectral_norm, but for the purpose of fixing this issue I make this patch. Even with a better interface, a BC mechanism for loading legacy state_dicts would still be needed.

cc @t-vi @crcrpar

Pull Request resolved: pytorch#13350
Differential Revision: D12931044
Pulled By: SsnL
fbshipit-source-id: 8be6f934eaa62414d76d2c644dedd7e1b7eb31ef
Summary: Causing a problem with spectral norm, although SN won't use that anymore after pytorch#13350.

Pull Request resolved: pytorch#13352
Differential Revision: D14209562
Pulled By: ezyang
fbshipit-source-id: f5e3183e1e7050ac5a66d203de6f8cf56e775134
Problems with SN and DP after #12671:

1. In eval mode, `weight_orig` is not getting the correct gradient (SpectralNorm in eval doesn't connect grad to `weight_orig`, #12737). Fix: keep the `v` vector around as a buffer and always calculate `W = W_orig / (u @ W_orig @ v)`, even in eval.
2. In training mode, the `weight` buffer of the parallelized module is never updated if someone touches `weight_orig` and/or `weight` and makes them stop sharing storage, so in eval the weight used is wrong. Fix: make `weight` not a buffer anymore and always calculate it as above.
3. Fix SpectralNorm with DataParallel (#12671) changed SN to update `u` in-place to make DP work correctly, but that breaks backward through two forwards (e.g., the common GAN loss `D(real) - D(fake)`) because the vectors needed to backprop the first forward are changed in the second forward. Fix: this PR clones `u` and `v` before using them.

To maintain BC, I added a hook interface for producing and loading state_dict. This is ugly, and we should really have a better interface for spectral_norm, but for the purpose of fixing this issue I make this patch. Even with a better interface, a BC mechanism for loading legacy state_dicts would still be needed.

cc @t-vi @crcrpar
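The `W = W_orig / (u @ W_orig @ v)` calculation above can be sketched numerically. This is a hedged numpy illustration of the idea, not the PR's torch code (the name `spectral_normalize` is hypothetical, and the autograd-relevant cloning of `u` and `v` only appears here as a comment):

```python
import numpy as np

def spectral_normalize(weight_orig, u, v):
    """Always recompute the normalized weight from weight_orig, u, and v.

    sigma = u @ W_orig @ v approximates the largest singular value of
    W_orig when u and v approximate its top singular vectors. In the
    actual PR, u and v are cloned before use so that a second forward
    (e.g., D(real) - D(fake)) does not clobber the tensors saved for the
    first forward's backward pass.
    """
    sigma = u @ weight_orig @ v
    return weight_orig / sigma

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 5))

# Use the exact top singular vectors here, so sigma equals the top
# singular value and the result has spectral norm exactly 1.
U, S, Vt = np.linalg.svd(W)
W_sn = spectral_normalize(W, U[:, 0], Vt[0])
print(np.linalg.svd(W_sn, compute_uv=False)[0])  # ~1.0
```

In practice u and v come from power iteration rather than an SVD, so the normalized weight's spectral norm is only approximately 1, but recomputing `weight` this way on every forward is what makes eval mode and DataParallel replicas see a correct, gradient-connected weight.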