
add fused dropout kernels#9666

Closed
ngimel wants to merge 9 commits into pytorch:master from ngimel:dropout

Conversation

@ngimel
Collaborator

@ngimel ngimel commented Jul 20, 2018

While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always emits `return self` for inplace functions, and I need to return both the original tensor and the mask, so inplace goes down the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, the memory used is only a little larger than for inplace dropout, thanks to the savings on the mask.
Once dropout is moved to ATen, these kernels can still be used for an efficient implementation.
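The semantics the fused kernel implements can be sketched in plain Python (`fused_dropout_reference` is a made-up name; the real kernel does all of this in a single CUDA pass):

```python
import random

def fused_dropout_reference(xs, p, rng=random.random):
    # Hypothetical pure-Python sketch of the fused kernel's semantics:
    # one pass yields both the scaled output and a one-byte-per-element
    # keep mask (the real kernel stores the mask as a ByteTensor).
    mask = [1 if rng() > p else 0 for _ in xs]
    out = [x * m / (1.0 - p) for x, m in zip(xs, mask)]
    return out, mask
```

Surviving elements are scaled by 1/(1-p) so the expected activation magnitude is unchanged at inference time.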

@vadimkantorov
Contributor

Also fusing ReLU + Dropout should be possible (https://discuss.pytorch.org/t/relu-dropout-inplace/13467)

@soumith
Collaborator

soumith commented Jul 21, 2018

@pytorchbot retest this please

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

The build breaks because of #9435. I'll submit a fix for this PR, but #9435 also breaks other PyTorch C++ extensions in the wild (e.g. apex).

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

@vadimkantorov that type of fusion will be better handled once random number generation is available in the JIT compiler.

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

I don't know what's causing the ROCm build failure; Distributions.cu seems to use the philox state in the same way, without any special ROCm handling.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@staticmethod
def backward(ctx, grad_output):
    if ctx.p > 0 and ctx.train:
        if hasattr(ctx, 'use_fused_kernel') and ctx.use_fused_kernel:
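On the fused path, backward can reuse the saved byte mask and the scale; a pure-Python sketch of that gradient rule (the helper name is assumed, mirroring the `masked_scale` schema below):

```python
def masked_scale_backward(grad_output, mask, p):
    # Dropout is linear in its input, so the backward pass just replays
    # the saved keep-mask and the 1/(1-p) scale on the incoming gradient.
    scale = 1.0 / (1.0 - p)
    return [g * m * scale for g, m in zip(grad_output, mask)]
```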


import torch
from torch.autograd.function import InplaceFunction
from itertools import repeat
from torch.autograd.function import once_differentiable


- func: logsumexp_out(Tensor result, Tensor self, int64_t dim, bool keepdim=False) -> Tensor
variants: function

- func: masked_scale(Tensor self, Tensor mask, double scale) -> Tensor


- func: dot_out(Tensor result, Tensor self, Tensor tensor) -> Tensor
variants: function

- func: fused_dropout(Tensor self, double p, Generator* generator=nullptr) -> (Tensor, Tensor)



namespace {

//due to limitations of philox generator UNROLL has to be 4
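Philox4x32 emits exactly four 32-bit values per invocation, which is why the unroll factor is fixed at 4; a pure-Python sketch of the per-thread consumption pattern (`rand4` stands in for the output of one generator call):

```python
def consume_philox_quad(rand4, xs, p):
    # One Philox4x32 call yields exactly four uniforms, so each thread
    # processes four elements per round -- hence UNROLL must be 4: no
    # random numbers are wasted and counter offsets stay aligned.
    assert len(rand4) == 4 and len(xs) == 4
    return [x * (1 if r > p else 0) / (1.0 - p) for x, r in zip(xs, rand4)]
```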


idx,
seeds.second,
&state);
IndexType rounded_size = ((totalElements - 1)/(blockDim.x*gridDim.x*UNROLL)+1)*blockDim.x*gridDim.x*UNROLL;
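The `rounded_size` expression pads the element count up to a multiple of `blockDim.x*gridDim.x*UNROLL`, so the grid-stride loop runs the same number of 4-wide iterations in every thread; the same arithmetic in Python:

```python
def rounded_size(total_elements, block, grid, unroll=4):
    # Round up to a multiple of block*grid*unroll: every thread then has
    # an identical trip count, keeping its Philox offsets aligned.
    stride = block * grid * unroll
    return ((total_elements - 1) // stride + 1) * stride
```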


Tensor ret = at::empty_like(self);
Tensor mask = self.type().toScalarType(kByte).tensor(self.sizes());
const int64_t nelem = self.numel();
int64_t block_size = 256;
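Allocating the mask at `kByte` rather than the input dtype is what keeps the non-inplace overhead small; a back-of-the-envelope comparison (assuming 4-byte float32 elements; the helper name is made up):

```python
def dropout_extra_bytes(nelem, elem_bytes=4):
    # Memory beyond the input tensor itself:
    fused = nelem * (elem_bytes + 1)    # non-inplace output + 1-byte mask
    inplace_float = nelem * elem_bytes  # inplace output + full float mask
    return fused, inplace_float
```

For float32 this is 5 bytes/element versus 4, i.e. "just a little larger" than the inplace path with a float mask.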


dim3 dim_block(block_size);
dim3 grid((nelem + block_size -1)/block_size);
grid.x = std::min((unsigned int)at::cuda::getCurrentDeviceProperties()->multiProcessorCount * blocks_per_sm, grid.x);
int64_t nrep = ((nelem - 1)/(block_size*grid.x*UNROLL)+1)*UNROLL;
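The launch configuration first covers `nelem` with blocks, then caps the grid by an occupancy limit tied to the SM count, and finally derives how many UNROLL-wide rounds each thread needs; the same arithmetic in Python (`sm_count` and `blocks_per_sm` are assumed example values, not the kernel's constants):

```python
def launch_config(nelem, block=256, sm_count=80, blocks_per_sm=4, unroll=4):
    # Grid: enough blocks to cover nelem, capped at sm_count*blocks_per_sm
    # so oversized inputs fall back to a grid-stride loop.
    grid = min(sm_count * blocks_per_sm, (nelem + block - 1) // block)
    # Per-thread repetition count, kept a multiple of unroll for Philox.
    nrep = ((nelem - 1) // (block * grid * unroll) + 1) * unroll
    return grid, nrep
```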


ret_info.collapseDims();
mask_info.collapseDims(); //ret and mask are collapsed to 1d contiguous tensor
switch (self_info.dims) {
case 1:
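`collapseDims` merges adjacent dimensions whose memory layout is already contiguous, which shrinks the `switch` over `self_info.dims`; a pure-Python sketch of that merge (a simplified stand-in for `TensorInfo::collapseDims`, not its exact logic):

```python
def collapse_dims(sizes, strides):
    # Walk from the innermost dim outward, folding dim i into its inner
    # neighbour whenever stride[i] == stride[inner] * size[inner],
    # i.e. there is no layout gap between the two dims.
    out_sizes, out_strides = [sizes[-1]], [strides[-1]]
    for size, stride in zip(reversed(sizes[:-1]), reversed(strides[:-1])):
        if stride == out_strides[0] * out_sizes[0]:
            out_sizes[0] *= size               # contiguous: fold together
        else:
            out_sizes.insert(0, size)          # gap: keep as separate dim
            out_strides.insert(0, stride)
    return out_sizes, out_strides
```

A fully contiguous tensor collapses to a single dimension, which is why ret and mask always hit the cheap 1-d indexing case.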


Contributor

@facebook-github-bot facebook-github-bot left a comment


@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator Author

ngimel commented Jul 24, 2018

Any advice on fixing the ROCm build failure? Put curandPhiloxState inside a device function? cc @iotamudelta
Also, the macOS failure is unclear, and it looks like the Windows build was just broken yesterday.

@iotamudelta
Contributor

iotamudelta commented Jul 24, 2018

Philox is currently unsupported with hcRNG on ROCm. We are working on integrating rocRAND (ROCm repo #41), which will support it.

@ngimel
Collaborator Author

ngimel commented Jul 24, 2018

Ok, I'll disable Dropout.cu in pytorch/tools/amd_build/disabled_features.yaml then.

@ezyang
Contributor

ezyang commented Jul 24, 2018

SGTM

@ailzhang
Contributor

cc: @ngimel for the conflicting file

@ngimel
Collaborator Author

ngimel commented Jul 31, 2018

@ailzhang done

@ngimel
Collaborator Author

ngimel commented Aug 6, 2018

Anything I should do here to move forward?

@ezyang
Contributor

ezyang commented Aug 6, 2018

Nay, let's try to land it.

Contributor

@facebook-github-bot facebook-github-bot left a comment


ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Aug 7, 2018

There's an internal build system issue; the fix is easy though; I'll push it to this branch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Contributor

@facebook-github-bot facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 7, 2018
Summary:
While waiting for dropout to be fully ported to ATen, here's performance fix for the most common dropout case. Dropout is still in python function, I just added efficient path to it. I could not make inplace work, because generator always generates `return self` for inplace function, and I need to return both original tensor and mask, so inplace goes on the existing pass. Even with non-inplace version, since mask is now a ByteTensor, memory used is just a little larger than for inplace dropout, due to savings on mask.
Once dropout is moved to aten, these kernels still can be used for efficient implementation.
Pull Request resolved: pytorch/pytorch#9666

Reviewed By: SsnL

Differential Revision: D8948077

Pulled By: ezyang

fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Aug 10, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
@ngimel ngimel deleted the dropout branch September 23, 2018 21:09
8 participants