Conversation
Also fusing ReLU + Dropout should be possible (https://discuss.pytorch.org/t/relu-dropout-inplace/13467)

@pytorchbot retest this please

@vadimkantorov that type of fusion will be better handled when random generation is available in the JIT compiler.

I don't know what's going on with the ROCm build failure.
facebook-github-bot left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
torch/nn/_functions/dropout.py (Outdated)

    @staticmethod
    def backward(ctx, grad_output):
        if ctx.p > 0 and ctx.train:
            if hasattr(ctx, 'use_fused_kernel') and ctx.use_fused_kernel:
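A rough pure-Python model of what the fused backward path computes, under assumed semantics (this is a sketch, not the actual implementation): the saved byte mask zeroes the gradients of dropped elements, and the survivors are rescaled by 1/(1-p) to match the forward scaling.

```python
def masked_scale(values, mask, scale):
    # Assumed semantics of the masked_scale op added in this PR:
    # zero out dropped positions and rescale survivors in one pass.
    return [v * m * scale for v, m in zip(values, mask)]

def dropout_backward(grad_output, mask, p):
    # Gradient flows only through kept elements, rescaled by 1/(1-p).
    return masked_scale(grad_output, mask, 1.0 / (1.0 - p))

print(dropout_backward([1.0, 1.0, 1.0, 1.0], [1, 0, 1, 1], 0.5))
# [2.0, 0.0, 2.0, 2.0]
```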
torch/nn/_functions/dropout.py (Outdated)

    import torch
    from torch.autograd.function import InplaceFunction
    from itertools import repeat
    from torch.autograd.function import once_differentiable
    - func: logsumexp_out(Tensor result, Tensor self, int64_t dim, bool keepdim=False) -> Tensor
      variants: function

    - func: masked_scale(Tensor self, Tensor mask, double scale) -> Tensor
    - func: dot_out(Tensor result, Tensor self, Tensor tensor) -> Tensor
      variants: function

    - func: fused_dropout(Tensor self, double p, Generator* generator=nullptr) -> (Tensor, Tensor)
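The fused_dropout declaration returns both the scaled output and the byte mask. A pure-Python model of those semantics (a sketch, not the CUDA kernel, which draws its random numbers from Philox on the GPU; it also assumes p is the drop probability as in nn.Dropout, whereas the ATen declaration may instead take the keep probability):

```python
import random

def fused_dropout_reference(values, p, seed=None):
    # Keep each element with probability 1 - p; return both the scaled
    # output and the byte mask, as the fused_dropout declaration does.
    rng = random.Random(seed)
    mask = [1 if rng.random() >= p else 0 for _ in values]
    out = [v * m / (1.0 - p) for v, m in zip(values, mask)]
    return out, mask

out, mask = fused_dropout_reference([1.0] * 8, p=0.5, seed=0)
# zeros in `out` line up exactly with zeros in `mask`
```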
aten/src/ATen/native/cuda/Dropout.cu (Outdated)

    namespace {

    //due to limitations of philox generator UNROLL has to be 4
aten/src/ATen/native/cuda/Dropout.cu (Outdated)

    idx,
    seeds.second,
    &state);
    IndexType rounded_size = ((totalElements - 1)/(blockDim.x*gridDim.x*UNROLL)+1)*blockDim.x*gridDim.x*UNROLL;
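The rounded_size line rounds the element count up to a multiple of one full unrolled grid pass (blockDim.x * gridDim.x * UNROLL), so every thread runs the same number of loop iterations and Philox draws stay aligned. The arithmetic can be checked in plain Python (the dimensions below are made up for illustration):

```python
UNROLL = 4  # the Philox generator yields 4 random values per call

def rounded_size(total_elements, block_dim, grid_dim):
    stride = block_dim * grid_dim * UNROLL
    # Round total_elements up to the next multiple of stride, matching
    # ((totalElements - 1)/stride + 1) * stride in the kernel.
    return ((total_elements - 1) // stride + 1) * stride

print(rounded_size(1000, block_dim=256, grid_dim=1))  # 1024
```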
aten/src/ATen/native/cuda/Dropout.cu (Outdated)

    Tensor ret = at::empty_like(self);
    Tensor mask = self.type().toScalarType(kByte).tensor(self.sizes());
    const int64_t nelem = self.numel();
    int64_t block_size = 256;
aten/src/ATen/native/cuda/Dropout.cu (Outdated)

    dim3 dim_block(block_size);
    dim3 grid((nelem + block_size -1)/block_size);
    grid.x = std::min((unsigned int)at::cuda::getCurrentDeviceProperties()->multiProcessorCount * blocks_per_sm, grid.x);
    int64_t nrep = ((nelem - 1)/(block_size*grid.x*UNROLL)+1)*UNROLL;
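The launch configuration caps grid.x at multiProcessorCount * blocks_per_sm, so only as many blocks are launched as can be resident on the GPU, and each thread then covers its share via a grid-stride loop of nrep elements. The sizing arithmetic in plain Python (sm_count and blocks_per_sm here are illustrative assumptions, not values from the PR):

```python
UNROLL = 4

def launch_config(nelem, block_size=256, sm_count=80, blocks_per_sm=4):
    # One block per block_size elements, capped so the grid fits the GPU.
    grid_x = (nelem + block_size - 1) // block_size
    grid_x = min(sm_count * blocks_per_sm, grid_x)
    # Each thread then processes nrep elements in its grid-stride loop.
    nrep = ((nelem - 1) // (block_size * grid_x * UNROLL) + 1) * UNROLL
    return grid_x, nrep

print(launch_config(1 << 20))  # (320, 16) for these assumed values
```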
    ret_info.collapseDims();
    mask_info.collapseDims(); //ret and mask are collapsed to 1d contiguous tensor
    switch (self_info.dims) {
    case 1:
facebook-github-bot left a comment
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Any advice on fixing the ROCm build failure? Put curandPhiloxState inside the device function? cc @iotamudelta

Philox is unsupported with hcRNG on ROCm currently. We are working on integrating rocRAND (ROCm repo #41), which will support it.

Ok, I'll disable Dropout.cu in

SGTM

cc: @ngimel for conflicting file

@ailzhang done

Anything I should do here to move forward?

Nay, let's try to land it.
facebook-github-bot left a comment
ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There's an internal build system issue; the fix is easy though; I'll push it to this branch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
facebook-github-bot left a comment
ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always generates `return self` for inplace functions, and I need to return both the original tensor and the mask, so inplace goes through the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, memory used is just a little larger than for inplace dropout, due to savings on the mask. Once dropout is moved to ATen, these kernels can still be used for an efficient implementation. Pull Request resolved: pytorch/pytorch#9666 Reviewed By: SsnL Differential Revision: D8948077 Pulled By: ezyang fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
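The memory point in the summary, that a uint8 mask costs only a quarter of what a same-shaped float32 mask would, comes down to bytes per element (a trivial arithmetic check, with a made-up element count):

```python
# Bytes needed for the dropout mask at different dtypes, N elements:
N = 1_000_000
float_mask_bytes = N * 4  # a float32 mask: 4 bytes per element
byte_mask_bytes = N * 1   # the ByteTensor mask from the fused kernel

print(byte_mask_bytes / float_mask_bytes)  # 0.25
```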