
move softmax/logsoftmax to ATen #6786

Merged
ezyang merged 8 commits into pytorch:master from ngimel:softmaxATen
May 4, 2018
Conversation

@ngimel
Collaborator

@ngimel ngimel commented Apr 19, 2018

THCUNN kernels are mostly unchanged, with minimal changes to types so that more intermediate values are preserved in AccumT.
Softmax/LogSoftmax from THNN are combined into a single templated function.
Remaining issues:

  • In THNN, accumulation for float used to be in double. ATen does not support this now and would need something like https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/AccumulateType.cuh; should we add it? With this PR, accumulation for float is in float.
  • In softmax_backward, the self argument is added just so that the gradient of self can be computed in double backward. The self tensor itself is not necessary either for softmax_backward or for softmax_double_backward. I don't know if there's a better way of handling it.
  • Scalar handling is somewhat awkward (I check .dim() == 0 and call .view(1) for scalars); is there a better way of doing it?
  • Legacy tests are failing because softmax is no longer legacy. I can rewrite the legacy softmax forward to use torch.softmax instead of [backend].SoftMax_UpdateOutput, but what to do with backward? Softmax backward is no longer exposed.

@bddppq
Contributor

bddppq commented Apr 20, 2018

@onnxbot retest this please

@vadimkantorov
Contributor

vadimkantorov commented Apr 20, 2018

A naming question: we have logsigmoid and log_softmax. If log_softmax is being moved to ATen, should it be renamed to match logsigmoid and drop the underscore, or left as is?

@ngimel
Collaborator Author

ngimel commented Apr 20, 2018

@vadimkantorov good point, I can rename if there's an agreement from core devs.

@ezyang
Contributor

ezyang commented Apr 22, 2018

CC @colesbury on the legacy test issues. Last time I chatted with folks about it, we weren't planning to delete the legacy support, so something might have to be done.

@ezyang
Contributor

ezyang commented Apr 22, 2018

in THNN accumulation for float used to be in double, ATen does not support this now, would need something like https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/AccumulateType.cuh, should we add it? With this PR, accumulation for float is in float.

Yes, that would be great. I'd probably make another version of this template for "AccAccumulateType" or similar (following the TH convention)

in softmax_backward, self argument is added just so that gradient of self can be computed in double backward. Self tensor itself is not necessary neither for softmax_backward, nor for softmax_double_backward.

I'm a little perplexed by the double-backwards situation; if _backward has no data dependence on self, then there shouldn't be any gradient formula for the double backwards, right?

CC @gchanan about scalars

@ngimel
Collaborator Author

ngimel commented Apr 22, 2018

  1. For legacy, I'd need to figure out how to expose softmax_backward to python. Any advice?
  2. I'll add cpu accumulate type template that follows what TH/THNN has now, and use it on the cpu paths (as a side note, I see cpu reductions now being moved from TH to ATen, they should face the same issue).
  3. It has an implicit dependence on self via output/result, which is actually used in the gradient formula. (A side note: in derivatives.yaml, the first derivative of softmax_backward has to be declared with the result argument, not output, otherwise the generator does not properly insert unpacking code; for the second derivative, softmax_backward has to be declared with the output argument, otherwise the compiler complains about something, I don't remember what.)
     So in this piece of code

     name: softmax_backward(Tensor grad_output, Tensor output, int64_t dim, Tensor self)
       grad_output: softmax_backward(grad, output, dim, self)
       self: softmax_double_backward(grad, grad_output, dim, output)

self in the first line is there just so that there can be self in the third line. Does that make sense?

@apaszke
Contributor

apaszke commented Apr 23, 2018

Re 3: that makes sense, but IIRC that's not what we used to do some time ago. Isn't it possible to rewrite the derivative of grad_output in terms of output?

@ezyang
Contributor

ezyang commented Apr 23, 2018

@apaszke No, the issue is different; none of the gradient formulas reference self.

I think we should leave it for now but eventually we should figure out what exactly is going on here. I would have thought that maybe autograd should have worked out that if there's a differential on output, then you need to in turn compute the differential on self because output was computed from self.

@apaszke
Contributor

apaszke commented Apr 23, 2018

If they don't reference self then why does it need to be passed around at all? I think we should resolve this before we merge this. Softmax is often used in places that are pretty critical for memory usage, and should be efficient.

@ngimel
Collaborator Author

ngimel commented Apr 23, 2018

It's passed around so that the double backward with respect to self can be defined, as in the 3-line snippet I posted above. It's no worse than the current situation (THNN/THCUNN also does not use input in backward, yet it is there as an argument):
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml#L761
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml#L962
but I'm all for resolving it.

@apaszke
Contributor

apaszke commented Apr 23, 2018

Hmm I see now. Alright, that sounds a bit complicated, so we don't have to block on that.

@ngimel
Collaborator Author

ngimel commented Apr 27, 2018

Rebased on master; fixed legacy by renaming *backward to *backward_data (native functions ending in backward are blacklisted from being exposed in Python). I don't particularly like this solution, but I can't find any other way.
Moved AccumulateType to be a cpu/gpu accumulate type, with accum types set to what they are now in TH/THC. For CPU float softmax, behavior now replicates THNN: accumulation is in double. Had to change includes in a few existing files to ATen/AccumulateType.h instead of ATen/cuda/AccumulateType.cuh, and fix the source accordingly.

@vadimkantorov
Contributor

@apaszke what's your view on making log_softmax and logsigmoid naming consistent? #6786 (comment)

@apaszke
Contributor

apaszke commented Apr 28, 2018

If there's something to be renamed it should be logsigmoid. log_softmax is used in half of PyTorch scripts and that's too much to change. I also like the separated syntax more.

Contributor

@apaszke apaszke left a comment


Not a complete review by any means, but looks OK at a glance. Just curious why some T changed to AccumT in the kernels.


@ssnl
Collaborator

ssnl commented May 1, 2018

@ngimel can you kindly rebase please? :)

@ngimel
Collaborator Author

ngimel commented May 1, 2018

I don't see an actual error in the failed trusty build, other than gcc exiting with error code 1. Did it time out? For the failed xenial-cuda build, the failure also does not look related (there's no softmax in the test script).

@ezyang
Contributor

ezyang commented May 3, 2018

Yeah, the gcc 7.2 failure is known flakiness, see #7202.

@pytorchbot retest this please

@ngimel
Collaborator Author

ngimel commented May 3, 2018

Hmm, cuda9-cudnn7 failed again. How do I run the test/cpp/ tests locally?

@ngimel
Collaborator Author

ngimel commented May 3, 2018

Interestingly, I can make adagrad fail on master by commenting out integration.cpp from the test_api sources (I was too lazy to wait for mnist training); in this case adagrad runs for 3000 epochs with the loss stuck at 0.48.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_api is a Catch v2.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
optim
  adagrad
-------------------------------------------------------------------------------
/raid/pytorch/test/cpp/api/optim.cpp:56
...............................................................................

/raid/pytorch/test/cpp/api/optim.cpp:64: FAILED:

Optimizer tests seem kind of flaky

@ngimel
Collaborator Author

ngimel commented May 4, 2018

So, given that the failing test is flaky (#7288) and this PR does not touch anything the failing test exercises, this should be good to go?

@ezyang ezyang merged commit a015d57 into pytorch:master May 4, 2018
@cpuhrsch
Contributor

cpuhrsch commented May 7, 2018

Hey @ngimel, are you planning on further changing the CPU implementation? I'm currently working on vectorizing softmax and I have just finished what you merged on my end as well haha.

@ngimel
Collaborator Author

ngimel commented May 7, 2018

No, I did not plan on further optimizing the CPU path, but now that it's in ATen it should be easier for you to work on.

@cpuhrsch
Contributor

cpuhrsch commented May 7, 2018

Thanks @ngimel!

@ngimel ngimel deleted the softmaxATen branch May 7, 2018 20:40
Jorghi12 pushed a commit to wsttiger/pytorch that referenced this pull request May 10, 2018
* move softmax/logsoftmax to ATen

* specify cpu and gpu accum types

* use accreal for CPU

* expose softmax backward to python, fix legacy interface

* fix Distributions.cu to use common AccumulateType

* fix cuda 8 build

* delete commented out lines

* rebase on master, fix breakages
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
facebook-github-bot pushed a commit that referenced this pull request Aug 24, 2018
Summary:
**Summary**: This PR is a followup of mruberry's #9318. It tries to achieve the following:
- Specializing std common math functions for `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to  demonstrate the best practice for developers and hence, making way for its deprecation.
- Remove legacy/redundant code path.
- Remove unused CUDA HALF macros (see separate PR #10147)

**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. The following is the rationale for why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using a binary CUDA kernel or CUDA tensor pointwise apply (check #8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are C/C++ standard compliant have been specialized for user-defined types; for instance, the std namespace has been opened up for `at::Half`, providing math function definitions for `at::Half`. Check `Half-inl.h`.
- Some standard-compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, and `isnan` are specialized to save one API call versus when used with std. Although this is subject to change, depending on whether we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, numeric limits for `at::Half` are present in `Half-inl.h`. I understood that HIP has some issue with `std::numeric_limits`, and this is the related GitHub issue I found: ROCm/hip#374. AlexVlx mentions that the issue can be avoided by launching `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see a reason why `std::numeric_limits` won't compile on HIP if launched with a device context within a kernel, unless I am not aware of the real reason why max/min was in THCNumerics in the first place. (I haven't ever tried a build with HIP.)

Here are some reference PRs that were handy in refactoring TH into ATen:
- #6786
- #5475
- #9401
- #8689
- #8919
Pull Request resolved: #10301

Differential Revision: D9204758

Pulled By: soumith

fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 25, 2018
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
7 participants