
Revert "Move batch_norm to ATen/native, speed up (#12368)"#13191

Closed
houseroad wants to merge 1 commit into pytorch:master from houseroad:revert_pr12368

Conversation

@houseroad
Member

Revert #12368 since it is causing ONNX-related test cases to fail.

@ssnl @t-vi

houseroad requested a review from bddppq October 26, 2018 22:22
Contributor

facebook-github-bot left a comment


houseroad has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bddppq requested a review from ssnl October 26, 2018 22:42
@ssnl
Collaborator

ssnl commented Oct 26, 2018 via email

@t-vi
Collaborator

t-vi commented Oct 27, 2018

I could really use a hint about which test is failing.


zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 27, 2018
Summary:
Revert #12368 since it is causing ONNX-related test cases to fail.
pytorch/pytorch#12368

SsnL
Pull Request resolved: pytorch/pytorch#13191

Reviewed By: BIT-silence

Differential Revision: D12810778

Pulled By: houseroad

fbshipit-source-id: 1c373b92628580097cffcd237dccc5b3d8697577
@t-vi
Collaborator

t-vi commented Oct 27, 2018

So I merged master into my batch norm branch, built and installed onnx from third_party/onnx, and ran

PYTHONPATH=$(pwd)/build/lib.linux-x86_64-3.6/ python3 test/onnx/test_pytorch_onnx_caffe2.py TestCaffe2Backend.test_instance_norm
PYTHONPATH=$(pwd)/build/lib.linux-x86_64-3.6/ python3 test/onnx/test_models.py TestModels.test_dcgan_netG TestModels.test_inception

and couldn't get the failure locally. :(

@houseroad
Member Author

Since I have already reverted your changes on master, to reproduce the problem I would suggest the following:

git checkout dc211c7de41a4a7c05e480b6277a5f2e91beb71a
python setup.py clean --all # this will clean your local build cache
DEBUG=1 NO_CUDA=1 python setup.py build_deps develop

then run

 python test/onnx/test_models.py TestModels.test_dcgan_netG

This test became flaky on CI, but locally I can always reproduce the problem.

@houseroad
Member Author

@t-vi, please let me know if you still cannot reproduce the problem, and I will dig into it on my side. :-)

@houseroad
Member Author

@t-vi I also forgot to mention: thanks a lot for the contribution, it's very useful. We just reverted it temporarily to unbreak the CI tests, and I will help you land it again.

@t-vi
Collaborator

t-vi commented Oct 27, 2018

I do appreciate your help with reproducing it and once I get to that, I hope we'll be able to fix it soon, too.

@t-vi
Collaborator

t-vi commented Oct 28, 2018

(Silly question, but) so it's the batch norm on CPU that is problematic? I'll try to test that more systematically.

Unfortunately, I cannot seem to reproduce the error. What I did was roughly

git checkout dc211c7de41a4a7c05e480b6277a5f2e91beb71a
python setup.py clean --all # this will clean your local build cache
DEBUG=1 NO_CUDA=1 python setup.py bdist_wheel
sudo pip install dist/torch-1.0.0a0+dc211c7-cp27-cp27mu-linux_x86_64.whl
cd third_party/onnx
python setup.py bdist_wheel
sudo pip install dist/onnx-*.whl  # the onnx wheel built above; exact filename may differ
cd ../..

(I also installed future (past) and torchvision.)

I ran the test python test/onnx/test_models.py TestModels.test_dcgan_netG 10 times without failure.

@houseroad
Member Author

No worries, I have uploaded a Docker image to my repo: https://hub.docker.com/r/houseroad/onnx-docker/tags/

docker pull houseroad/onnx-docker:repro
docker run -it houseroad/onnx-docker:repro /bin/bash

After launching the container, try

cd pytorch
python test/onnx/test_models.py TestModels.test_dcgan_netG

@t-vi
Collaborator

t-vi commented Oct 29, 2018

Thank you for this! I can reproduce the error. Whatever made it seem like a good idea at the time, not calling sum() inside parallel_for seems to fix the problem:

diff --git a/aten/src/ATen/native/Normalization.cpp b/aten/src/ATen/native/Normalization.cpp
index ca5f4ae..1774ec6 100644
--- a/aten/src/ATen/native/Normalization.cpp
+++ b/aten/src/ATen/native/Normalization.cpp
@@ -69,9 +69,12 @@ std::tuple<Tensor,Tensor,Tensor> batch_norm_cpu_template(const Tensor& input, co
 
 	if (train) {
 	  // compute mean per input
-	  accscalar_t sum = in.sum()._local_scalar().to<accscalar_t>(); // accumulation dtype ?
+	  accscalar_t sum = 0;
+	  CPU_tensor_apply1<scalar_t>(in, [&] (const scalar_t& i) {
+	      sum += i;
+	    });
 
-	  mean = (scalar_t) sum / n;
+	  mean = (scalar_t) (sum / n);
 	  save_mean_a[f] = mean;
 
 	  // compute variance per input

If it does for you, too, I'd have another go at the PR.
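
For context, here is a minimal standalone sketch (not the PR's actual code; the tensor shape and variable names are illustrative) of the pattern the patch moves to: accumulating each channel serially inside at::parallel_for, rather than calling a tensor-level reduction such as in.sum() that may itself try to parallelize from inside the parallel region.

#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <iostream>

int main() {
  at::Tensor input = at::randn({8, 4, 16, 16});   // N, C, H, W, float32
  int64_t n_channel = input.size(1);
  at::Tensor mean = at::empty({n_channel}, input.options());
  auto mean_a = mean.accessor<float, 1>();

  // Parallelize over channels only; each channel's reduction stays serial,
  // so no nested parallel reduction is launched from inside parallel_for.
  at::parallel_for(0, n_channel, 1, [&](int64_t begin, int64_t end) {
    for (int64_t c = begin; c < end; ++c) {
      at::Tensor in = input.select(1, c).contiguous();
      const float* data = in.data_ptr<float>();
      double sum = 0;  // accumulate in a wider type
      for (int64_t i = 0; i < in.numel(); ++i) {
        sum += data[i];
      }
      mean_a[c] = static_cast<float>(sum / in.numel());
    }
  });

  for (int64_t c = 0; c < n_channel; ++c) {
    std::cout << "channel " << c << " mean: " << mean_a[c] << "\n";
  }
  return 0;
}

This sketch only covers the per-channel mean and builds against public libtorch headers; the real code in Normalization.cpp also computes the variance and handles the accumulation dtype.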

@houseroad
Member Author

@t-vi Awesome, it works for me on my local machine.

@soumith
Collaborator

soumith commented Oct 29, 2018

cc: @colesbury. That seems like a bug with sum_parallel more than anything else?

ezyang added the merged label Jun 25, 2019
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026