
Fused weightnorm for ATen#10842

Closed
mcarilli wants to merge 22 commits into pytorch:master from mcarilli:upstream_weightnorm

Conversation

@mcarilli (Collaborator) commented Aug 24, 2018

This PR contains a C++ implementation of weight norm. The user-side exposure of weight norm through torch.nn.utils.weight_norm is unchanged.

If running on the GPU, and the norm is requested over the first or last dimension of the weight tensor, the forward pass is carried out using the fused kernels I wrote for our Fairseq GTC hero run, which offer superior performance to primitive ops and superior numerical stability when running in FP16. In the common case that the backward pass is not itself constructing a graph (i.e., not attempting to set up double backward), the backward pass is carried out using another fused kernel. If the backward pass is constructing a graph, an alternate code path is taken, which does the math using differentiable primitive ops. In this way, the implementation allows double backward even if the fused kernel was used in forward (although in this case, you don't benefit from the performance and stability of the fused backward kernel).
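As a rough illustration of what that differentiable fallback has to compute for w = g * v / ||v|| (a sketch of my own derivation for the dim == 0 case, not the exact code in this PR; the function name and shapes are illustrative):

#include <ATen/ATen.h>
#include <tuple>

// Differentiable-primitive backward for w = g * v / ||v||, dim == 0 case.
// Per output row: grad_g = <grad_w, v> / ||v||
//                 grad_v = (g / ||v||) * grad_w - (g * <grad_w, v> / ||v||^3) * v
std::tuple<at::Tensor, at::Tensor> weight_norm_backward_sketch(
    const at::Tensor& grad_w, const at::Tensor& v, const at::Tensor& g) {
  auto v2d    = v.contiguous().view({v.size(0), -1});
  auto grad2d = grad_w.contiguous().view({v.size(0), -1});
  auto g2d    = g.contiguous().view({v.size(0), 1});
  auto norms  = v2d.norm(2, 1, /*keepdim=*/true);          // per-row ||v||
  auto dots   = (grad2d * v2d).sum(1, /*keepdim=*/true);   // per-row <grad_w, v>
  auto grad_g = (dots / norms).view_as(g);
  auto grad_v = (g2d / norms) * grad2d - (g2d * dots / norms.pow(3)) * v2d;
  return std::make_tuple(grad_v.view_as(v), grad_g);
}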

If running on the CPU, or if norming over an interior dim, the forward pass is carried out using double-differentiable primitive ops.
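Schematically, the dispatch described above ends up looking something like this (a simplified sketch, not the literal entry point; it assumes the weight_norm_cuda_interface op declared later in this PR and a norm_except_dim helper like the one excerpted in the review below):

#include <ATen/ATen.h>
#include <tuple>

// Simplified sketch of the forward dispatch.
at::Tensor weight_norm_forward_sketch(const at::Tensor& v, const at::Tensor& g, int64_t dim) {
  const bool can_use_fused = v.is_cuda() && (dim == 0 || dim == v.dim() - 1);
  if (can_use_fused) {
    // Fused CUDA path: one kernel computes w and the per-slice norms saved for backward.
    return std::get<0>(at::weight_norm_cuda_interface(v, g, dim));
  }
  // CPU or interior-dim path: double-differentiable primitive ops.
  return v * (g / at::norm_except_dim(v, 2, dim));
}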

Figuring out how to generate all the right plumbing for this was tricky, but it was a fun experience learning how the autogenerator works and how the graph is constructed. Thanks to @colesbury for useful guidance on this front.

I do have a few lingering questions:

  • Should I unify my return statements (i.e., by default-constructing Tensors outside the if blocks and using operator= within)?
  • What is the significance of non_blocking when calling e.g. auto norms = saved_norms.to(saved_g.type().scalarType(), non_blocking=True/False);? I am currently omitting non_blocking, so it defaults to False, but I didn't see any associated synchronizations on the timeline, so I'm wondering what it means.
  • Is there an "official" mapping from at::ScalarTypes to the corresponding accumulate types, as there is for the PODs + Half in AccumulateType.h? I looked for an equivalent mapping for ScalarTypes, didn't find one, and ended up rigging it myself (at::ScalarType AccType = g.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : g.type().scalarType(); see the sketch after this list).
  • Are sparse tensors a concern? Should I include another check for sparse tensors in the _weight_norm entry point, and send those along the fallback CPU path as well?
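For concreteness, the code behind the second and third bullets looks roughly like this (the helper name here is mine, not something in the PR):

#include <ATen/ATen.h>

// Ad hoc accumulate-type mapping: Half accumulates in Float; every other type in itself.
at::ScalarType acc_type_for(const at::Tensor& g) {
  auto st = g.type().scalarType();
  return st == at::ScalarType::Half ? at::ScalarType::Float : st;
}

// The conversion from the second bullet, with non_blocking written out explicitly
// (in C++ it is a positional bool argument):
// auto norms = saved_norms.to(saved_g.type().scalarType(), /*non_blocking=*/false);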

// Per-slice norm of v over every dim except dim 0, reshaped so it broadcasts against v.
{
std::vector<int64_t> output_size(v.dim(), 1);
output_size[0] = v.size(0);
return v.contiguous().view({v.size(0), -1}).norm(pow, 1).view(output_size);
}


@mcarilli (Collaborator, Author)

Thanks for the heart even though my builds are crashing and burning :P

Why are they crashing and burning? Everything builds and runs without trouble on my local machine. Do the new files (native/WeightNorm.cpp and native/cuda/WeightNorm.cu) need to be registered with the build system in some way that I overlooked? On my machine it appears they are autodetected.

As near as I can tell, the common factor is that I'm getting linker errors when linking caffe2 tests:

01:15:40 [ 88%] Linking CXX executable ../bin/tbb_init_test
01:15:40 /var/lib/jenkins/workspace/build/lib/libcaffe2.so: undefined reference to `at::native::weight_norm_differentiable_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long)'

which may mean that the linker can't find the object file containing those functions, or those object files were never created in the first place.

The ROCm builds are failing as well, with variants of

/var/lib/jenkins/workspace/aten/src/ATen/native/cuda/WeightNorm.cu:353:9: error: no matching function for call to 'hipLaunchKernelGGL'

Looks like a compile-time error; otherwise I have no idea.

@ezyang I'm told you're the build system expert. Any help would be greatly appreciated!

@zou3519 (Contributor) commented Aug 28, 2018

Native files shouldn't need to be registered. Have you tried a combination of:

  • python setup.py clean
  • rebuild
  • git pull --rebase upstream master (pull in changes from master)

@mcarilli (Collaborator, Author) commented Sep 5, 2018

@ezyang @zou3519
I fixed most of the failing builds, which turned out to be user error: I was mixing the CPU and GPU code paths in a way that caused CPU-only builds to fail (this was tricky to identify, because some of the failing builds were nominally CUDA builds). Thanks to @ngimel for suggesting that I attempt a local build in a CUDA-free container, which let me reproduce and fix the compilation errors.

I've reorganized my dispatch paths in a way that properly disentangles CPU-only builds from any GPU-specific function dependencies.

Unfortunately, I still have two problems:

  1. ROCm builds are still failing, with the same error as before, variants of
22:04:03 /var/lib/jenkins/workspace/aten/src/ATen/native/cuda/WeightNorm.cu:353:9: error: no matching function for call to 'hipLaunchKernelGGL'
...
22:04:03 /opt/rocm/hip/include/hip/hcc_detail/functional_grid_launch.hpp:86:13: note: candidate function [with Args = <float *, float *, float *, float *, int>, F = void (*)(float *, float *, float *, float *, int)] not viable: no overload of 'weight_norm_fwd_first_dim_kernel' matching 'void (*)(float *, float *, float *, float *, int)' for 1st argument
  2. ONNX tests are failing, with errors like
23:00:16 E           RuntimeError: ONNX export failed: Couldn't export operator aten::_weight_norm

Again, I'd appreciate any advice you can give, or at least a pointer to the right people to ask.

@houseroad (Member)

@mcarilli I think https://github.com/mcarilli/pytorch/pull/1 should fix your ONNX problem :-)

@bddppq (Contributor) commented Sep 7, 2018

@mcarilli The ROCm hcc compiler has some difficulty with template type deduction, so you need to annotate the type params of the templated kernel function, like here:

fused_dropout_kernel<scalar_t, accscalar_t, unsigned int, 1><<<grid, dim_block, 0, at::cuda::getCurrentCUDAStream()>>>(self_info, ret_info, mask_info, nelem, pa, next_philox_seed(gen,counter_offset));
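A minimal, self-contained illustration of that workaround (a generic toy kernel, not the actual weight-norm kernel), just to show the pattern of spelling out the template arguments at the launch site:

#include <cuda_runtime.h>

// Toy kernel: the point is only the explicit <float, float> at the launch site below,
// so hcc/hip never has to deduce the template parameters from the argument types.
template <typename scalar_t, typename accscalar_t>
__global__ void toy_scale_kernel(scalar_t* out, const scalar_t* in, float scale, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    accscalar_t acc = static_cast<accscalar_t>(in[i]) * static_cast<accscalar_t>(scale);
    out[i] = static_cast<scalar_t>(acc);
  }
}

void launch_toy_scale(float* out, const float* in, float scale, int n, cudaStream_t stream) {
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  // Explicit template arguments instead of relying on deduction:
  toy_scale_kernel<float, float><<<blocks, threads, 0, stream>>>(out, in, scale, n);
}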

@Jorghi12 @iotamudelta

@mcarilli (Collaborator, Author) commented Sep 7, 2018

@houseroad @bddppq Thank you very much! I'll work on getting these changes integrated and hopefully resubmit today.

@houseroad (Member)

@mcarilli no problem, you can directly merge my PR into this branch :-)

- func: _weight_norm(Tensor v, Tensor g, int64_t dim=0) -> Tensor
variants: function

- func: weight_norm_cuda_interface(Tensor v, Tensor g, int64_t dim=0) -> (Tensor, Tensor)
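For reference, given existing tensors v and g, the generated C++ entry points implied by these schemas would be called roughly like this (a usage sketch inferred from the declarations, not code from the PR):

#include <ATen/ATen.h>
#include <tuple>

// Usage sketch for the two schemas above, given existing tensors v and g.
void weight_norm_usage_sketch(const at::Tensor& v, const at::Tensor& g) {
  // _weight_norm: the device-agnostic entry point; dim defaults to 0.
  at::Tensor w = at::_weight_norm(v, g, /*dim=*/0);

  // weight_norm_cuda_interface: returns the normed weight plus a second tensor
  // (presumably the per-slice norms that the fused backward reuses).
  at::Tensor w_fused, norms;
  std::tie(w_fused, norms) = at::weight_norm_cuda_interface(v, g, /*dim=*/0);
  (void)w; (void)w_fused; (void)norms;
}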


@facebook-github-bot (Contributor)

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Sep 11, 2018

Don't worry about the CircleCI results.

@weiyangfb (Contributor)

is this ready to merge? @ezyang

@facebook-github-bot (Contributor)

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

petrex pushed a commit to petrex/pytorch that referenced this pull request Sep 12, 2018
zdevito pushed a commit to zdevito/ATen that referenced this pull request Sep 12, 2018
Pull Request resolved: pytorch/pytorch#10842

Differential Revision: D9735531

Pulled By: ezyang

fbshipit-source-id: 24431d46532cf5503876b3bd450d5ca775b3eaee
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
