Fix SyncBatchNorm running var update issue #22248
unlimblue wants to merge 16 commits into pytorch:master from unlimblue:master
Conversation
@pytorchbot retest this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mrshenli could you retest pytorch_xla_linux_trusty_py3_6_gcc5_4_build again?
mrshenli left a comment
Thanks for contributing! Can you add a test in test/test_distributed.py to cover different input sizes?
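For reference, a minimal sketch of what such a test might look like, assuming the usual test/test_distributed.py multiprocess harness (process group already initialized, one CUDA device per rank); skip decorators and DDP wiring are elided, and `momentum=1.0` is used so the running stats equal the current batch stats:

```python
# Sketch only: assumes the standard test/test_distributed.py harness.
import torch
import torch.distributed as dist

def _test_sync_batchnorm_different_input_sizes(self):
    rank = dist.get_rank()
    # momentum=1.0 makes running stats equal the current batch stats,
    # which keeps the reference computation simple
    bn = torch.nn.SyncBatchNorm(4, momentum=1.0).cuda(rank)
    # each rank gets a different batch size: 2, 4, 6, ...
    inp = torch.randn(2 * (rank + 1), 4, 8, 8, device=rank)
    bn(inp)
    # reference moments, reduced over all ranks
    n = torch.tensor([inp.numel() / 4.0], device=rank)
    s = inp.sum(dim=(0, 2, 3))
    sq = (inp * inp).sum(dim=(0, 2, 3))
    for t in (n, s, sq):
        dist.all_reduce(t)
    mean = s / n
    unbiased_var = (sq / n - mean * mean) * (n / (n - 1))
    self.assertTrue(torch.allclose(bn.running_mean, mean, atol=1e-5))
    self.assertTrue(torch.allclose(bn.running_var, unbiased_var, atol=1e-5))
```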
```diff
  CUDA: batch_norm_elemt_cuda

- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
+ func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
```
Will this cause any backward compatibility issue? It's true that this method is only used in torch/nn/modules/_functions.py in PyTorch, but we do expose it as an API. I am a little worried it is used in other projects. @gchanan is it OK to break BC here? Or should we keep both APIs, where a single count arg is automatically expanded to a list?
Copy that. Is it possible to change const Tensor& counts to const ArrayRef<index_t> counts? I don't know the best way to convert an ArrayRef<index_t> to a PyTorch Tensor in C++.
Thanks, @hux999 is trying to fix the Tensor counts issue.
> I don't know the best way to convert an ArrayRef<index_t> to a PyTorch Tensor in C++.

Can you try using the from_blob API (learned from @yf225)? See the forum post.
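A minimal sketch of that approach, assuming the counts are an IntArrayRef backed by int64_t data (the helper name is illustrative):

```cpp
#include <torch/torch.h>

// Illustrative helper: wrap an IntArrayRef's buffer in a Tensor.
// from_blob does not take ownership of the memory, so clone() before
// the source buffer can go out of scope.
at::Tensor counts_to_tensor(at::IntArrayRef counts) {
  return torch::from_blob(
             const_cast<int64_t*>(counts.data()),
             {static_cast<int64_t>(counts.size())},
             torch::kLong)
      .clone();
}
```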
Use IntArrayRef instead of Tensor for the type of counts
@pytorchbot retest this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…uple of ints, but found element of type list at pos 1
We have found the reason and are trying to fix it.
@mrshenli is there some CI cache issue?

```
Jul 01 06:22:58 ======================================================================
Jul 01 06:22:58 ERROR: test_clip_grad_norm (__main__.TestNN)
Jul 01 06:22:58 ----------------------------------------------------------------------
Jul 01 06:22:58 Traceback (most recent call last):
Jul 01 06:22:58   File "test_nn.py", line 1829, in test_clip_grad_norm
Jul 01 06:22:58     norm = clip_grad_norm_(l.parameters(), max_norm, norm_type=norm_type)
Jul 01 06:22:58   File "/opt/conda/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in clip_grad_norm_
Jul 01 06:22:58     clip_coef = torch.tensor(max_norm, device=device) / (total_norm + 1e-6)
Jul 01 06:22:58 UnboundLocalError: local variable 'device' referenced before assignment
Jul 01 06:22:58
Jul 01 06:22:58 ----------------------------------------------------------------------
```
@unlimblue that's not your fault. Other PRs (e.g., #22392) also hit this error.
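For context on that traceback: `device` in clip_grad_norm_ is only assigned inside the loop over parameters, so a path that never enters the loop (e.g., an empty parameter list) leaves it unbound by the time the clip_coef line runs. A hedged sketch of a guarded version, not the actual fix that landed:

```python
import torch

def clip_grad_norm_sketch(parameters, max_norm, norm_type=2.0):
    # materialize and filter, so an empty list is detectable up front
    grads = [p.grad for p in parameters if p.grad is not None]
    if len(grads) == 0:
        return torch.tensor(0.0)  # guard: nothing to clip, device never needed
    device = grads[0].device
    total_norm = torch.norm(
        torch.stack([torch.norm(g.detach(), norm_type).to(device) for g in grads]),
        norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for g in grads:
            g.detach().mul_(clip_coef.to(g.device))
    return total_norm
```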
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```diff
  CUDA: batch_norm_elemt_cuda

- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
+ func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int[] counts) -> (Tensor, Tensor)
```
I believe this is still BC-breaking. As mentioned above, can you try keeping both the original one and the new one?
@mrshenli you should probably also deprecate the old one so we can get rid of it in the future.
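One way to keep backward compatibility along those lines, sketched below: retain the scalar-count entry point as a thin, deprecated shim that expands the single count and forwards to the counts-based overload. The `_with_counts` name is illustrative, not the final API:

```cpp
#include <ATen/ATen.h>

// Hypothetical new overload taking per-process counts (declared here
// for self-containment; the real name/signature may differ).
std::tuple<at::Tensor, at::Tensor> batch_norm_gather_stats_with_counts(
    const at::Tensor&, const at::Tensor&, const at::Tensor&,
    const at::Tensor&, const at::Tensor&, double, double, const at::Tensor&);

// Hypothetical BC shim: old callers pass one int64_t count, meaning
// every process contributed the same number of elements. Expand it to
// a per-process counts tensor and forward to the new overload.
std::tuple<at::Tensor, at::Tensor> batch_norm_gather_stats_compat(
    const at::Tensor& self, const at::Tensor& mean, const at::Tensor& invstd,
    const at::Tensor& running_mean, const at::Tensor& running_var,
    double momentum, double eps, int64_t count) {
  // one entry per gathered per-process mean, i.e. world size
  auto counts = at::full({mean.size(0)}, count,
                         self.options().dtype(at::kLong));
  return batch_norm_gather_stats_with_counts(
      self, mean, invstd, running_mean, running_var, momentum, eps, counts);
}
```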
There is an issue tracking the test error: #22052. Ignore it.
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
facebook-github-bot left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli left a comment
Thanks for contributing!
Summary:

## Fix pytorch/pytorch#22192

+ change signature: `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change cuda & cuda header: the trailing `int64_t count` parameter becomes `const Tensor& counts`

```cuda
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(
    const Tensor& self, const Tensor& mean, const Tensor& invstd,
    const Tensor& running_mean, const Tensor& running_var,
    double momentum, double epsilon, const Tensor& counts)
```

+ change python interface

```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var,
                eps, momentum, process_group, world_size):
        ...
```

Pull Request resolved: pytorch/pytorch#22248
Differential Revision: D16002146
Pulled By: mrshenli
fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
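To illustrate the python interface change, a loose sketch of how the Function's forward might gather per-process counts alongside mean/invstd. Names follow torch/nn/modules/_functions.py loosely, and the gather call matches this PR's Tensor-counts signature; the exact details are assumptions:

```python
import torch
import torch.distributed as dist

# Loose sketch of the forward path: each process contributes its own
# element count, so ranks with different batch sizes are weighted
# correctly when the global statistics are gathered.
def sync_bn_forward_sketch(input, running_mean, running_var,
                           eps, momentum, process_group, world_size):
    # local count: N * H * W elements per channel
    count = torch.full((1,), input.numel() // input.size(1),
                       dtype=torch.long, device=input.device)
    mean, invstd = torch.batch_norm_stats(input, eps)

    count_all = [torch.empty_like(count) for _ in range(world_size)]
    mean_all = [torch.empty_like(mean) for _ in range(world_size)]
    invstd_all = [torch.empty_like(invstd) for _ in range(world_size)]
    dist.all_gather(count_all, count, group=process_group)
    dist.all_gather(mean_all, mean, group=process_group)
    dist.all_gather(invstd_all, invstd, group=process_group)

    # global stats weighted by per-process counts (the new counts argument)
    mean, invstd = torch.batch_norm_gather_stats(
        input, torch.stack(mean_all), torch.stack(invstd_all),
        running_mean, running_var, momentum, eps, torch.cat(count_all))
    return mean, invstd
```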
Summary: Previous PR #22248, which added support for variadic batch sizes across processes, doesn't account for the mean_dy/mean_dy_xmu on the backward path, which produces wrong dgrad.

Pull Request resolved: #36382
Differential Revision: D20984446
Pulled By: ngimel
fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
Fix #22192
func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
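For completeness, a plain-Python reference sketch of how per-process counts enter the reduction and the running_var update that issue #22192 is about. This mirrors what the CUDA kernel plausibly computes; the details are an assumption, not the kernel itself:

```python
import torch

def gather_stats_reference(mean_all, invstd_all, counts, eps, momentum,
                           running_mean=None, running_var=None):
    # mean_all, invstd_all: (world_size, C); counts: (world_size,)
    counts = counts.to(mean_all.dtype)
    total = counts.sum()
    mean = (mean_all * counts[:, None]).sum(0) / total
    # recover per-process biased variance from invstd = 1/sqrt(var + eps)
    var = 1.0 / invstd_all.pow(2) - eps
    # sum of squared deviations from the global mean:
    #   sum_i n_i * (var_i + mean_i^2) - n_total * mean^2
    var_sum = ((var + mean_all.pow(2)) * counts[:, None]).sum(0) \
        - total * mean.pow(2)
    invstd = torch.rsqrt(var_sum / total + eps)
    if running_mean is not None:
        running_mean.mul_(1 - momentum).add_(momentum * mean)
    if running_var is not None:
        # the crux of #22192: the unbiased update must divide by the
        # total element count across all processes, minus one
        running_var.mul_(1 - momentum).add_(momentum * var_sum / (total - 1))
    return mean, invstd
```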