
Fused weightnorm for ATen#10842

Closed
mcarilli wants to merge 22 commits into pytorch:master from mcarilli:upstream_weightnorm

Conversation

@mcarilli (Collaborator) commented Aug 24, 2018

This PR contains a C++ implementation of weight norm. The user-side exposure of weight norm through torch.nn.utils.weight_norm is unchanged.

If running on the GPU, and the norm is requested over the first or last dimension of the weight tensor, the forward pass is carried out using the fused kernels I wrote for our Fairseq GTC hero run, which offer superior performance to primitive ops and superior numerical stability when running in FP16. In the common case that the backward pass is not itself constructing a graph (i.e., not attempting to set up double backward), the backward pass is carried out using another fused kernel. If the backward pass is constructing a graph, an alternate code path is taken, which does the math using differentiable primitive ops. In this way, the implementation allows double backward even if the fused kernel was used in forward (although in this case, you don't benefit from the performance and stability of the fused backward kernel).
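As a rough illustration of what that differentiable fallback has to compute for w = g * v / ||v|| (a sketch of my own derivation for the dim == 0 case, not the exact code in this PR; the function name and shapes are illustrative):

#include <ATen/ATen.h>
#include <tuple>

// Differentiable-primitive backward for w = g * v / ||v||, dim == 0 case.
// Per output row: grad_g = <grad_w, v> / ||v||
//                 grad_v = (g / ||v||) * grad_w - (g * <grad_w, v> / ||v||^3) * v
std::tuple<at::Tensor, at::Tensor> weight_norm_backward_sketch(
    const at::Tensor& grad_w, const at::Tensor& v, const at::Tensor& g) {
  auto v2d    = v.contiguous().view({v.size(0), -1});
  auto grad2d = grad_w.contiguous().view({v.size(0), -1});
  auto g2d    = g.contiguous().view({v.size(0), 1});
  auto norms  = v2d.norm(2, 1, /*keepdim=*/true);          // per-row ||v||
  auto dots   = (grad2d * v2d).sum(1, /*keepdim=*/true);   // per-row <grad_w, v>
  auto grad_g = (dots / norms).view_as(g);
  auto grad_v = (g2d / norms) * grad2d - (g2d * dots / norms.pow(3)) * v2d;
  return std::make_tuple(grad_v.view_as(v), grad_g);
}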

If running on the CPU, or if norming over an interior dim, the forward pass is carried out using double-differentiable primitive ops.
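Schematically, the dispatch described above ends up looking something like this (a simplified sketch, not the literal entry point; it assumes the weight_norm_cuda_interface op declared later in this PR and a norm_except_dim helper like the one excerpted in the review below):

#include <ATen/ATen.h>
#include <tuple>

// Simplified sketch of the forward dispatch.
at::Tensor weight_norm_forward_sketch(const at::Tensor& v, const at::Tensor& g, int64_t dim) {
  const bool can_use_fused = v.is_cuda() && (dim == 0 || dim == v.dim() - 1);
  if (can_use_fused) {
    // Fused CUDA path: one kernel computes w and the per-slice norms saved for backward.
    return std::get<0>(at::weight_norm_cuda_interface(v, g, dim));
  }
  // CPU or interior-dim path: double-differentiable primitive ops.
  return v * (g / at::norm_except_dim(v, 2, dim));
}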

Figuring out how to generate all the right plumbing for this was tricky, but it was a fun experience learning how the autogenerator works and how the graph is constructed. Thanks to @colesbury for useful guidance on this front.

I do have a few lingering questions:

  • Should I unify my return statements (i.e., by default-constructing Tensors outside the if blocks and using operator= within)?
  • What is the significance of non_blocking when calling e.g. auto norms = saved_norms.to(saved_g.type().scalarType(), non_blocking=True/False);? I am currently omitting non_blocking, so it defaults to False, but I didn't see any associated synchronizations on the timeline, so I'm wondering what it means.
  • Is there an "official" mapping from at::ScalarTypes to the corresponding accumulate types, as there is for the PODs + Half in AccumulateType.h? I looked for an equivalent mapping for ScalarTypes, didn't find one, and ended up rigging it myself (at::ScalarType AccType = g.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : g.type().scalarType(); see the sketch after this list).
  • Are sparse tensors a concern? Should I include another check for sparse tensors in the _weight_norm entry point, and send those along the fallback CPU path as well?
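For concreteness, the code behind the second and third bullets looks roughly like this (the helper name here is mine, not something in the PR):

#include <ATen/ATen.h>

// Ad hoc accumulate-type mapping: Half accumulates in Float; every other type in itself.
at::ScalarType acc_type_for(const at::Tensor& g) {
  auto st = g.type().scalarType();
  return st == at::ScalarType::Half ? at::ScalarType::Float : st;
}

// The conversion from the second bullet, with non_blocking written out explicitly
// (in C++ it is a positional bool argument):
// auto norms = saved_norms.to(saved_g.type().scalarType(), /*non_blocking=*/false);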

// Per-slice norm of v over every dim except dim 0, reshaped so it broadcasts against v.
{
std::vector<int64_t> output_size(v.dim(), 1);
output_size[0] = v.size(0);
return v.contiguous().view({v.size(0), -1}).norm(pow, 1).view(output_size);
}


@mcarilli (Collaborator, Author)

Thanks for the heart even though my builds are crashing and burning :P

Why are they crashing and burning? Everything builds and runs without trouble on my local machine. Do the new files (native/WeightNorm.cpp and native/cuda/WeightNorm.cu) need to be registered with the build system in some way that I overlooked? On my machine it appears they are autodetected.

As near as I can tell, the common factor is that I'm getting linker errors when linking caffe2 tests:

01:15:40 [ 88%] Linking CXX executable ../bin/tbb_init_test
01:15:40 /var/lib/jenkins/workspace/build/lib/libcaffe2.so: undefined reference to `at::native::weight_norm_differentiable_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long)'

which may mean that the linker can't find the object file containing those functions, or those object files were never created in the first place.

The ROCm builds are failing as well, with variants of

/var/lib/jenkins/workspace/aten/src/ATen/native/cuda/WeightNorm.cu:353:9: error: no matching function for call to 'hipLaunchKernelGGL'

Looks like a compile-time error; otherwise I have no idea.

@ezyang I'm told you're the build system expert. Any help would be greatly appreciated!

@zou3519 (Contributor) commented Aug 28, 2018

Native files shouldn't need to be registered. Have you tried a combination of:

  • python setup.py clean
  • rebuild
  • git pull --rebase upstream master (pull in changes from master)

@mcarilli (Collaborator, Author) commented Sep 5, 2018

@ezyang @zou3519
I fixed most of the failing builds, which turned out to be user error: I was mixing the CPU and GPU code paths in a way that caused CPU-only builds to fail (this was tricky to identify, because some of the failing builds were nominally CUDA builds). Thanks to @ngimel for suggesting that I attempt a local build in a CUDA-free container, which let me reproduce and fix the compilation errors.

I've reorganized my dispatch paths in a way that properly disentangles CPU-only builds from any GPU-specific function dependencies.

Unfortunately, I still have two problems:

  1. ROCm builds are still failing, with the same error as before, variants of
22:04:03 /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/WeightNorm.cu:353:9: error: no matching function for call to 'hipLaunchKernelGGL'
...
22:04:03 /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:86:13: note: candidate function [with Args = <float *, float *, float *, float *, int>, F = void (*)(float *, float *, float *, float *, int)] not viable: no overload of 'weight_norm_fwd_first_dim_kernel' matching 'void (*)(float *, float *, float *, float *, int)' for 1st argument
  2. ONNX tests are failing, with errors like
23:00:16 E           RuntimeError: ONNX export failed: Couldn't export operator aten::_weight_norm

Again, I'd appreciate any advice you can give, or at least a pointer to the right people to ask.

@houseroad (Member)

@mcarilli I think https://github.com/mcarilli/pytorch/pull/1 should fix your ONNX problem :-)

@bddppq (Contributor) commented Sep 7, 2018

@mcarilli The ROCm hcc compiler has some difficulty with template type deduction, so you need to annotate the type params of the templated kernel function, like here:

fused_dropout_kernel<scalar_t, accscalar_t, unsigned int, 1><<<grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>(self_info, ret_info, mask_info, nelem, pa, next_philox_seed(gen,counter_offset));
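A minimal, self-contained illustration of that workaround (a generic toy kernel, not the actual weight-norm kernel), just to show the pattern of spelling out the template arguments at the launch site:

#include <cuda_runtime.h>

// Toy kernel: the point is only the explicit <float, float> at the launch site below,
// so hcc/hip never has to deduce the template parameters from the argument types.
template <typename scalar_t, typename accscalar_t>
__global__ void toy_scale_kernel(scalar_t* out, const scalar_t* in, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    accscalar_t acc = static_cast<accscalar_t>(in[i]) * static_cast<accscalar_t>(scale);
    out[i] = static_cast<scalar_t>(acc);
  }
}

void launch_toy_scale(float* out, const float* in, float scale, int n, cudaStream_t stream) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  // Explicit template arguments instead of relying on deduction:
  toy_scale_kernel<float, float><<<blocks, threads, 0, stream>>>(out, in, scale, n);
}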

@Jorghi12 @iotamudelta

@mcarilli (Collaborator, Author) commented Sep 7, 2018

@houseroad @bddppq Thank you very much! I'll work on getting these changes integrated and hopefully resubmit today.

@houseroad (Member)

@mcarilli no problem, you can directly merge my PR into this branch :-)

- func: _weight_norm(Tensor v, Tensor g, int64_t dim=0) -> Tensor
variants: function

- func: weight_norm_cuda_interface(Tensor v, Tensor g, int64_t dim=0) -> (Tensor, Tensor)
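For reference, given existing tensors v and g, the generated C++ entry points implied by these schemas would be called roughly like this (a usage sketch inferred from the declarations, not code from the PR):

#include <ATen/ATen.h>
#include <tuple>

// Usage sketch for the two schemas above, given existing tensors v and g.
void weight_norm_usage_sketch(const at::Tensor& v, const at::Tensor& g) {
  // _weight_norm: the device-agnostic entry point; dim defaults to 0.
  at::Tensor w = at::_weight_norm(v, g, /*dim=*/0);

  // weight_norm_cuda_interface: returns the normed weight plus a second tensor
  // (presumably the per-slice norms that the fused backward reuses).
  at::Tensor w_fused, norms;
  std::tie(w_fused, norms) = at::weight_norm_cuda_interface(v, g, /*dim=*/0);
  (void)w; (void)w_fused; (void)norms;
}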


@facebook-github-bot (Contributor)

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Sep 11, 2018

Don't worry about the CircleCI results.

@weiyangfb (Contributor)

is this ready to merge? @ezyang

@facebook-github-bot (Contributor)

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 12, 2018
zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 12, 2018
Pull Request resolved: pytorch/pytorch#10842

Differential Revision: D9735531

Pulled By: ezyang

fbshipit-source-id: 24431d46532cf5503876b3bd450d5ca775b3eaee
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
