
move softmax/logsoftmax to ATen #6786

Merged
ezyang merged 8 commits into pytorch:master from ngimel:softmaxATen
May 4, 2018
Conversation

@ngimel
Collaborator

@ngimel ngimel commented Apr 19, 2018

THCUNN kernels are mostly unchanged, with minimal changes to types so that more intermediate values are preserved in AccumT.
Softmax/LogSoftmax from THNN are combined into a single templated function.
Remaining issues:

  • In THNN, accumulation for float used to be in double. ATen does not support this now and would need something like https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/AccumulateType.cuh; should we add it? With this PR, accumulation for float is in float.
  • In softmax_backward, the self argument is added just so that the gradient of self can be computed in double backward. The self tensor itself is not necessary either for softmax_backward or for softmax_double_backward. I don't know if there's a better way of handling it.
  • Scalar handling is somewhat awkward (I check .dim() == 0 and call .view(1) for scalars); is there a better way of doing it?
  • Legacy tests are failing because softmax is no longer legacy. I can rewrite the legacy softmax forward to use torch.softmax instead of [backend].SoftMax_UpdateOutput, but what to do with backward? Softmax backward is no longer exposed.

@bddppq
Contributor

bddppq commented Apr 20, 2018

@onnxbot retest this please

@vadimkantorov
Contributor

vadimkantorov commented Apr 20, 2018

A naming question: we have logsigmoid and log_softmax. If log_softmax is being moved to ATen, should it be renamed to match logsigmoid and drop the underscore, or left as is?

@ngimel
Collaborator Author

ngimel commented Apr 20, 2018

@vadimkantorov good point, I can rename if there's an agreement from core devs.

@ezyang
Contributor

ezyang commented Apr 22, 2018

CC @colesbury on the legacy test issues. Last time I chatted with folks about it, we weren't planning to delete the legacy support, so something might have to be done.

@ezyang
Contributor

ezyang commented Apr 22, 2018

in THNN accumulation for float used to be in double, ATen does not support this now, would need something like https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/AccumulateType.cuh, should we add it? With this PR, accumulation for float is in float.

Yes, that would be great. I'd probably make another version of this template for "AccAccumulateType" or similar (following the TH convention)

in softmax_backward, self argument is added just so that gradient of self can be computed in double backward. Self tensor itself is not necessary neither for softmax_backward, nor for softmax_double_backward.

I'm a little perplexed by the double-backwards situation; if _backward has no data dependence on self, then there shouldn't be any gradient formula for the double backwards, right?

CC @gchanan about scalars

@ngimel
Collaborator Author

ngimel commented Apr 22, 2018

  1. For legacy, I'd need to figure out how to expose softmax_backward to python. Any advice?
  2. I'll add cpu accumulate type template that follows what TH/THNN has now, and use it on the cpu paths (as a side note, I see cpu reductions now being moved from TH to ATen, they should face the same issue).
  3. It has an implicit dependence on self via output/result, which is actually used in the gradient formula. (A side note: in derivatives.yaml, the first derivative of softmax_backward has to be declared with the result argument, not output, otherwise the generator does not properly insert unpacking code; for the second derivative, softmax_backward has to be declared with the output argument, otherwise the compiler complains about something, I don't remember what.)
     So in this piece of code

     name: softmax_backward(Tensor grad_output, Tensor output, int64_t dim, Tensor self)
       grad_output: softmax_backward(grad, output, dim, self)
       self: softmax_double_backward(grad, grad_output, dim, output)

self in the first line is there just so that there can be self in the third line. Does that make sense?

@apaszke
Contributor

apaszke commented Apr 23, 2018

Re 3: that makes sense, but IIRC that's not what we used to do some time ago. Isn't it possible to rewrite the derivative of grad_output in terms of output?

@ezyang
Contributor

ezyang commented Apr 23, 2018

@apaszke No, the issue is different; none of the gradient formulas reference self.

I think we should leave it for now but eventually we should figure out what exactly is going on here. I would have thought that maybe autograd should have worked out that if there's a differential on output, then you need to in turn compute the differential on self because output was computed from self.

@apaszke
Contributor

apaszke commented Apr 23, 2018

If they don't reference self then why does it need to be passed around at all? I think we should resolve this before we merge this. Softmax is often used in places that are pretty critical for memory usage, and should be efficient.

@ngimel
Collaborator Author

ngimel commented Apr 23, 2018

It's passed around so that the double backward with respect to self can be defined, as in the 3-line snippet I posted above. It's no worse than the current situation (THNN/THCUNN also does not use input in backward, yet it is there as an argument):
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml#L761
https://github.com/pytorch/pytorch/blob/master/tools/autograd/derivatives.yaml#L962
but I'm all for resolving it.

@apaszke
Contributor

apaszke commented Apr 23, 2018

Hmm I see now. Alright, that sounds a bit complicated, so we don't have to block on that.

@ngimel
Collaborator Author

ngimel commented Apr 27, 2018

Rebased on master; fixed legacy by renaming *backward to *backward_data (native functions ending in backward are blacklisted from being exposed in Python). I don't particularly like this solution, but I can't find any other way.
Moved AccumulateType to be a cpu/gpu accumulate type, with accum types set to what they are now in TH/THC. For CPU float softmax, behavior now replicates THNN: accumulation is in double. Had to change includes in a few existing files to ATen/AccumulateType.h instead of ATen/cuda/AccumulateType.cuh, and fix the source accordingly.

@vadimkantorov
Contributor

@apaszke what's your view on making log_softmax and logsigmoid naming consistent? #6786 (comment)

@apaszke
Contributor

apaszke commented Apr 28, 2018

If there's something to be renamed it should be logsigmoid. log_softmax is used in half of PyTorch scripts and that's too much to change. I also like the separated syntax more.

Contributor

@apaszke apaszke left a comment


Not a complete review by any means, but looks OK at a glance. Just curious why some T changed to AccumT in the kernels.


@ssnl
Collaborator

ssnl commented May 1, 2018

@ngimel can you kindly rebase please? :)

@ngimel
Collaborator Author

ngimel commented May 1, 2018

I don't see an actual error in the failed trusty build, other than gcc exiting with error code 1. Did it time out? For the failed xenial-cuda build, the failure also does not look related (there's no softmax in the test script).

@ezyang
Contributor

ezyang commented May 3, 2018

Yeah, the gcc 7.2 failure is known flakiness, see #7202.

@pytorchbot retest this please

@ngimel
Collaborator Author

ngimel commented May 3, 2018

Hmm, cuda9-cudnn7 failed again. How do I run the test/cpp/ tests locally?

@ngimel
Collaborator Author

ngimel commented May 3, 2018

Interestingly, I can make adagrad fail on master by commenting out integration.cpp from the test_api sources (I was too lazy to wait for mnist training); in this case adagrad runs for 3000 epochs with the loss stuck at 0.48.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_api is a Catch v2.2.1 host application.
Run with -? for options

-------------------------------------------------------------------------------
optim
  adagrad
-------------------------------------------------------------------------------
/raid/pytorch/test/cpp/api/optim.cpp:56
...............................................................................

/raid/pytorch/test/cpp/api/optim.cpp:64: FAILED:

Optimizer tests seem kind of flaky

@ngimel
Collaborator Author

ngimel commented May 4, 2018

So, given that the failing test is flaky (#7288) and this PR does not touch anything the failing test exercises, this should be good to go?

@ezyang ezyang merged commit a015d57 into pytorch:master May 4, 2018
@cpuhrsch
Contributor

cpuhrsch commented May 7, 2018

Hey @ngimel, are you planning on further changing the CPU implementation? I'm currently working on vectorizing softmax and I have just finished what you merged on my end as well haha.

@ngimel
Collaborator Author

ngimel commented May 7, 2018

No, I did not plan on further optimizing the CPU path, but now that it's in ATen it should be easier for you to work on.

@cpuhrsch
Contributor

cpuhrsch commented May 7, 2018

Thanks @ngimel!

@ngimel ngimel deleted the softmaxATen branch May 7, 2018 20:40
Jorghi12 pushed a commit to wsttiger/pytorch that referenced this pull request May 10, 2018
* move softmax/logsoftmax to ATen

* specify cpu and gpu accum types

* use accreal for CPU

* expose softmax backward to python, fix legacy interface

* fix Distributions.cu to use common AccumulateType

* fix cuda 8 build

* delete commented out lines

* rebase on master, fix breakages
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
facebook-github-bot pushed a commit that referenced this pull request Aug 24, 2018
Summary:
**Summary**: This PR is a followup of mruberry's #9318. It tries to achieve the following:
- Specializing std common math functions for `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to  demonstrate the best practice for developers and hence, making way for its deprecation.
- Remove legacy/redundant code path.
- Remove unused CUDA HALF macros (see separate PR #10147)

**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. The following is the rationale for why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using a binary CUDA kernel or CUDA tensor pointwise apply (check #8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are C/C++ standard compliant have been specialized for user-defined types; for instance, the std namespace has been opened up for `at::Half`, providing math function definitions for `at::Half`. Check `Half-inl.h`.
- Some standard-compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, and `isnan` are specialized to save one API call versus when used with std. Although this is subject to change, depending on whether we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, numeric limits for `at::Half` are present in `Half-inl.h`. I understood that HIP has some issue with `std::numeric_limits`, and this is the related GitHub issue I found: ROCm/hip#374. AlexVlx mentions that the issue can be avoided by launching `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see a reason why `std::numeric_limits` won't compile on HIP if launched with a device context within a kernel, unless I am not aware of the real reason why max/min was in THCNumerics in the first place. (I haven't ever tried a build with HIP.)

Here are some reference PRs that were handy in refactoring TH into ATen:
- #6786
- #5475
- #9401
- #8689
- #8919
Pull Request resolved: #10301

Differential Revision: D9204758

Pulled By: soumith

fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 25, 2018
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
7 participants