
Revert "Move batch_norm to ATen/native, speed up (#12368)"#13191

Closed
houseroad wants to merge 1 commit into pytorch:master from houseroad:revert_pr12368

Conversation

@houseroad
Member

Revert #12368 since it is causing ONNX-related test cases to fail.

@ssnl @t-vi

houseroad requested a review from bddppq October 26, 2018 22:22
Contributor

facebook-github-bot left a comment


houseroad has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bddppq requested a review from ssnl October 26, 2018 22:42
@ssnl
Collaborator

ssnl commented Oct 26, 2018 via email

@t-vi
Collaborator

t-vi commented Oct 27, 2018

I could really use a hint about which test is failing.


zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 27, 2018
Summary:
Revert #12368 since it is causing ONNX-related test cases to fail.
pytorch/pytorch#12368

SsnL
Pull Request resolved: pytorch/pytorch#13191

Reviewed By: BIT-silence

Differential Revision: D12810778

Pulled By: houseroad

fbshipit-source-id: 1c373b92628580097cffcd237dccc5b3d8697577
@t-vi
Collaborator

t-vi commented Oct 27, 2018

So I merged master into my batch norm branch, built and installed onnx from third_party/onnx, and ran

PYTHONPATH=$(pwd)/build/lib.linux-x86_64-3.6/ python3 test/onnx/test_pytorch_onnx_caffe2.py TestCaffe2Backend.test_instance_norm
PYTHONPATH=$(pwd)/build/lib.linux-x86_64-3.6/ python3 test/onnx/test_models.py TestModels.test_dcgan_netG TestModels.test_inception

and couldn't get the failure locally. :(

@houseroad
Member Author

Since I have already reverted your changes on master, to reproduce the problem I would suggest the following:

git checkout dc211c7de41a4a7c05e480b6277a5f2e91beb71a
python setup.py clean --all # this will clean your local build cache
DEBUG=1 NO_CUDA=1 python setup.py build_deps develop

then run

 python test/onnx/test_models.py TestModels.test_dcgan_netG

This test became flaky on CI, but locally I can always reproduce the problem.

@houseroad
Member Author

@t-vi, please let me know if you still cannot reproduce the problem, and I will dig into it on my side. :-)

@houseroad
Member Author

@t-vi I also forgot to mention: thanks a lot for the contribution, it's very useful. We just reverted it temporarily to unbreak the CI tests, and I will help you land it again.

@t-vi
Collaborator

t-vi commented Oct 27, 2018

I do appreciate your help with reproducing it and once I get to that, I hope we'll be able to fix it soon, too.

@t-vi
Collaborator

t-vi commented Oct 28, 2018

(Silly question, but) so it's the batch norm on CPU that is problematic? I'll try to test that more systematically.

Unfortunately, I cannot seem to reproduce the error. What I did was roughly

git checkout dc211c7de41a4a7c05e480b6277a5f2e91beb71a
python setup.py clean --all # this will clean your local build cache
DEBUG=1 NO_CUDA=1 python setup.py bdist_wheel
sudo pip install dist/torch-1.0.0a0+dc211c7-cp27-cp27mu-linux_x86_64.whl
cd third_party/onnx
python setup.py bdist_wheel
sudo pip install dist/onnx-*.whl  # the onnx wheel built above; exact filename may differ
cd ../..

(I also installed future (past) and torchvision.)

I ran the test python test/onnx/test_models.py TestModels.test_dcgan_netG 10 times without failure.

@houseroad
Member Author

No worries, I have uploaded a Docker image to my repo: https://hub.docker.com/r/houseroad/onnx-docker/tags/

docker pull houseroad/onnx-docker:repro
docker run -it houseroad/onnx-docker:repro /bin/bash

After launching the container, try

cd pytorch
python test/onnx/test_models.py TestModels.test_dcgan_netG

@t-vi
Collaborator

t-vi commented Oct 29, 2018

Thank you for this! I can reproduce the error. Whatever made it seem like a good idea at the time, not calling sum() inside parallel_for seems to fix the problem:

diff --git a/aten/src/ATen/native/Normalization.cpp b/aten/src/ATen/native/Normalization.cpp
index ca5f4ae..1774ec6 100644
--- a/aten/src/ATen/native/Normalization.cpp
+++ b/aten/src/ATen/native/Normalization.cpp
@@ -69,9 +69,12 @@ std::tuple<Tensor,Tensor,Tensor> batch_norm_cpu_template(const Tensor& input, co
 
 	if (train) {
 	  // compute mean per input
-	  accscalar_t sum = in.sum()._local_scalar().to<accscalar_t>(); // accumulation dtype ?
+	  accscalar_t sum = 0;
+	  CPU_tensor_apply1<scalar_t>(in, [&] (const scalar_t& i) {
+	      sum += i;
+	    });
 
-	  mean = (scalar_t) sum / n;
+	  mean = (scalar_t) (sum / n);
 	  save_mean_a[f] = mean;
 
 	  // compute variance per input

If it does for you, too, I'd have another go at the PR.
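
For context, here is a minimal standalone sketch (not the PR's actual code; the tensor shape and variable names are illustrative) of the pattern the patch moves to: accumulating each channel serially inside at::parallel_for, rather than calling a tensor-level reduction such as in.sum() that may itself try to parallelize from inside the parallel region.

#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  at::Tensor input = at::randn({8, 4, 16, 16});   // N, C, H, W, float32
  int64_t n_channel = input.size(1);
  at::Tensor mean = at::empty({n_channel}, input.options());
  auto mean_a = mean.accessor<float, 1>();

  // Parallelize over channels only; each channel's reduction stays serial,
  // so no nested parallel reduction is launched from inside parallel_for.
  at::parallel_for(0, n_channel, 1, [&](int64_t begin, int64_t end) {
    for (int64_t c = begin; c < end; ++c) {
      at::Tensor in = input.select(1, c).contiguous();
      const float* data = in.data_ptr<float>();
      double sum = 0;  // accumulate in a wider type
      for (int64_t i = 0; i < in.numel(); ++i) {
        sum += data[i];
      }
      mean_a[c] = static_cast<float>(sum / in.numel());
    }
  });

  for (int64_t c = 0; c < n_channel; ++c) {
    std::cout << "channel " << c << " mean: " << mean_a[c] << "\n";
  }
  return 0;
}

This sketch only covers the per-channel mean and builds against public libtorch headers; the real code in Normalization.cpp also computes the variance and handles the accumulation dtype.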

@houseroad
Member Author

@t-vi Awesome, it works for me on my local machine.

@soumith
Collaborator

soumith commented Oct 29, 2018

cc: @colesbury. That seems like a bug with sum_parallel more than anything else?

ezyang added the merged label Jun 25, 2019
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026