Fused weightnorm for ATen #10842
mcarilli wants to merge 22 commits into pytorch:master from mcarilli:upstream_weightnorm
Conversation
```cpp
{
  std::vector<int64_t> output_size(v.dim(), 1);
  output_size[0] = v.size(0);
  return v.contiguous().view({v.size(0), -1}).norm(pow, 1).view(output_size);
}
```
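For intuition, this reduction flattens v to 2D, takes one norm per slice of dim 0, and reshapes the result so it broadcasts against v. A standalone sketch of the same computation (the wrapper name is hypothetical, not from the PR):

```cpp
#include <torch/torch.h>

// Per-slice norm over every dimension except dim 0, shaped with trailing
// size-1 dims so the result broadcasts against v. For a {64, 3, 3, 3}
// conv weight this returns a {64, 1, 1, 1} tensor of norms.
torch::Tensor norm_per_output_slice(const torch::Tensor& v, double pow) {
  std::vector<int64_t> output_size(v.dim(), 1);
  output_size[0] = v.size(0);
  return v.contiguous().view({v.size(0), -1}).norm(pow, 1).view(output_size);
}
```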
Thanks for the heart even though my builds are crashing and burning :P Why are they crashing and burning? Everything builds and runs without trouble on my local machine. Do the new files (native/WeightNorm.cpp and native/cuda/WeightNorm.cu) need to be registered with the build system in some way that I overlooked? On my machine they appear to be autodetected. As near as I can tell, the common factor is that I'm getting linker errors when linking the caffe2 tests, which may mean the linker can't find the object files containing those functions, or that those object files were never created in the first place. The ROCm builds are failing as well, with variants of what looks like a compile-time error; beyond that I have no idea. @ezyang I'm told you're the build system expert. Any help would be greatly appreciated!
Native files shouldn't need to be registered. Have you tried a combination of:
@ezyang @zou3519 I've reorganized my dispatch paths in a way that properly disentangles CPU-only builds from any GPU-specific function dependencies. Unfortunately, I still have two problems:
Again, I'd appreciate any advice you can give, or at least a pointer to the right people to ask.
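For readers following along, the disentangling presumably looks something like the sketch below: a generic entry point that only reaches the CUDA kernels through the ATen dispatcher, so CPU-only builds never reference a GPU symbol directly. The function name and the exact binding are assumptions, not the PR's actual code:

```cpp
#include <ATen/ATen.h>
#include <tuple>

// Hypothetical sketch of the dispatch split. The fused path is reached
// only through the dispatcher, so a CPU-only build links cleanly even
// though the CUDA implementation lives in a separate .cu file.
at::Tensor weight_norm_entry(const at::Tensor& v, const at::Tensor& g,
                             int64_t dim) {
  const bool fused_ok = v.is_cuda() && (dim == 0 || dim == v.dim() - 1);
  if (fused_ok) {
    // Dispatched to the CUDA implementation; returns (w, norms).
    return std::get<0>(at::weight_norm_cuda_interface(v, g, dim));
  }
  // Fallback: double-differentiable primitive ops, valid on CPU too.
  return v * (g / at::norm_except_dim(v, 2, dim));
}
```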
@mcarilli I think https://github.com/mcarilli/pytorch/pull/1 should fix your onnx problem :-)
@mcarilli The ROCm hcc compiler has some difficulties with template type deduction, so you need to annotate the type parameters of the templated kernel function, as done here: pytorch/aten/src/ATen/native/cuda/Dropout.cu, line 121 at f9595e7
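A sketch of the workaround being described (kernel and launcher names are hypothetical): spell out the template arguments at the launch site instead of relying on deduction. Note that an accumulate type that never appears in the argument list can't be deduced anywhere, and hcc is additionally shaky on the parameters that can:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernel: accumulates the norm in accscalar_t for FP16
// stability, reads/writes data in scalar_t.
template <typename scalar_t, typename accscalar_t>
__global__ void weight_norm_fwd(scalar_t* w, const scalar_t* v,
                                const scalar_t* g, int64_t n) {
  // ... accumulate in accscalar_t, store results as scalar_t ...
}

template <typename scalar_t, typename accscalar_t>
void launch_weight_norm_fwd(scalar_t* w, const scalar_t* v,
                            const scalar_t* g, int64_t n,
                            dim3 grid, dim3 block, cudaStream_t stream) {
  // Explicit template arguments, in the style of Dropout.cu line 121,
  // rather than leaving anything to deduction.
  weight_norm_fwd<scalar_t, accscalar_t>
      <<<grid, block, 0, stream>>>(w, v, g, n);
}
```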
@houseroad @bddppq Thank you very much! I'll work on getting these changes integrated and hopefully resubmit today.
@mcarilli no problem, you can directly merge my PR into this branch :-)
```yaml
- func: _weight_norm(Tensor v, Tensor g, int64_t dim=0) -> Tensor
  variants: function

- func: weight_norm_cuda_interface(Tensor v, Tensor g, int64_t dim=0) -> (Tensor, Tensor)
```
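If the second output is the per-slice norms saved for the fused backward, which the Summary at the end of this thread suggests, usage would look roughly like the sketch below (the binding name is taken from the declaration above; the semantics of the second output are an inference, not confirmed here):

```cpp
#include <ATen/ATen.h>
#include <tuple>
#include <utility>

// Hypothetical usage sketch: the CUDA interface returns the normed
// weight plus the per-slice norms, so the fused backward can reuse
// the norms instead of recomputing them.
std::pair<at::Tensor, at::Tensor> run_fused_forward(const at::Tensor& v,
                                                    const at::Tensor& g) {
  auto out = at::weight_norm_cuda_interface(v, g, /*dim=*/0);
  at::Tensor w = std::get<0>(out);      // g * v / ||v||_2 along dim 0
  at::Tensor norms = std::get<1>(out);  // stashed for the backward pass
  return {w, norms};
}
```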
facebook-github-bot left a comment:
ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Don't worry about the CircleCI results.
Is this ready to merge? @ezyang
facebook-github-bot left a comment:
ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
* master: (165 commits)
  - Aibench for asr decoder
  - Explicitly set locale on docs build. (pytorch#11595)
  - Documentation for debugging JIT
  - Fused weightnorm for ATen (pytorch#10842)
  - Move Type, Tensor, TensorMethods to core.
  - Add reminder % to the jit
  - Fix reloading modules back into python (pytorch#11552)
  - Add trigonometry functions to docs/source/onnx.rst
  - Add EndToEndHybridModel CUDA tests (pytorch#11544)
  - minor formatting error log (pytorch#11528)
  - Warn that export+import module always load onto the CPU (pytorch#11485)
  - caffe2::StorageImpl use at::DataPtr (pytorch#11282)
  - Sync all libnccl soversions, not just libnccl.so.1 (pytorch#11575)
  - Document BatchNorm and update default behavior (pytorch#11484)
  - Typo fix in randomness.rst (pytorch#11571)
  - Move some bmm/baddbmm to ATen (pytorch#11292)
  - Make c10d test work on CPU only build (pytorch#11567)
  - Clean up some C++ cruftiness in the script lexer.
  - Allow setting deletion constant
  - Make C10d support CPU only build (pytorch#11513)
  - ...
Pull Request resolved: pytorch/pytorch#10842
Differential Revision: D9735531
Pulled By: ezyang
fbshipit-source-id: 24431d46532cf5503876b3bd450d5ca775b3eaee
This PR contains a C++ implementation of weight norm. The user-side exposure of weight norm through torch.nn.utils.weight_norm is unchanged.
If running on the GPU, and the norm is requested over the first or last dimension of the weight tensor, the forward pass is carried out using the fused kernels I wrote for our Fairseq GTC hero run, which offer superior performance to primitive ops and superior numerical stability when running in FP16. In the common case that the backward pass is not itself constructing a graph (i.e. not attempting to set up double backward), the backward pass is carried out using another fused kernel. If the backward pass is constructing a graph, an alternate code path is taken, which does the math using differentiable primitive ops. In this way, the implementation allows double backward even if the fused kernel was used in the forward pass (although in this case, you don't benefit from the performance and stability of the fused backward kernel).
If running on the CPU, or if norming over an interior dim, the forward pass is carried out using double-differentiable primitive ops.
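For reference, here is the per-slice math both code paths implement, standard weight-norm algebra (Salimans & Kingma), not copied from the PR:

```latex
% Forward: reparameterize the weight as a direction times a magnitude,
% with the norm taken over every dimension except `dim`.
w = g \, \frac{v}{\lVert v \rVert}

% Backward, given the upstream gradient \nabla_w L (per slice):
\nabla_g L = \frac{\langle \nabla_w L, \, v \rangle}{\lVert v \rVert},
\qquad
\nabla_v L = \frac{g}{\lVert v \rVert} \, \nabla_w L
  - \frac{g \, \langle \nabla_w L, \, v \rangle}{\lVert v \rVert^{3}} \, v
```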
Figuring out how to generate all the right plumbing for this was tricky, but it was a fun experience learning how the autogenerator works and how the graph is constructed. Thanks to @colesbury for useful guidance on this front.
I do have a few lingering questions:
- Should I unify my return statements (i.e. by default-constructing Tensors outside if blocks and using operator= within)?
- What is the significance of `non_blocking` when calling e.g. `auto norms = saved_norms.to(saved_g.type().scalarType(), non_blocking=True/False);`? I am currently omitting `non_blocking`, so it defaults to False, but I didn't see any associated synchronizes on the timeline, so I'm wondering what it means.
- Is there an "official" mapping from at::ScalarTypes to corresponding accumulate types, as there are for the PODs + Half in [AccumulateType.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/AccumulateType.h)? I looked for an equivalent mapping for ScalarTypes, didn't find one, and ended up rigging it myself (`at::ScalarType AccType = g.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : g.type().scalarType();`) — see the sketch after this list.
- Are sparse tensors a concern? Should I include another check for sparse tensors in the `_weight_norm` entry point, and send those along the fallback CPU path as well?
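On the accumulate-type question, the ad-hoc mapping quoted above amounts to the following (a sketch of the rigged mapping, not an official ATen API; the thread establishes no ScalarType-level equivalent of AccumulateType.h exists):

```cpp
#include <ATen/ATen.h>

// Sketch of the ad-hoc ScalarType -> accumulate-type mapping: accumulate
// Half results in Float for numerical stability, leave everything else
// unchanged.
at::ScalarType accumulate_scalar_type(at::ScalarType t) {
  return t == at::ScalarType::Half ? at::ScalarType::Float : t;
}
```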