
add fused dropout kernels#9666

Closed
ngimel wants to merge 9 commits into pytorch:master from ngimel:dropout

Conversation

@ngimel
Collaborator

@ngimel ngimel commented Jul 20, 2018

While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always emits `return self` for inplace functions, and I need to return both the original tensor and the mask, so inplace goes down the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, the memory used is only a little larger than for inplace dropout, thanks to the savings on the mask.
Once dropout is moved to ATen, these kernels can still be used for an efficient implementation.
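The semantics the fused kernel implements can be sketched in plain Python (`fused_dropout_reference` is a made-up name; the real kernel does all of this in a single CUDA pass):

```python
import random

def fused_dropout_reference(xs, p, rng=random.random):
    # Hypothetical pure-Python sketch of the fused kernel's semantics:
    # one pass yields both the scaled output and a one-byte-per-element
    # keep mask (the real kernel stores the mask as a ByteTensor).
    mask = [1 if rng() > p else 0 for _ in xs]
    out = [x * m / (1.0 - p) for x, m in zip(xs, mask)]
    return out, mask
```

Surviving elements are scaled by 1/(1-p) so the expected activation magnitude is unchanged at inference time.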

@vadimkantorov
Contributor

Also fusing ReLU + Dropout should be possible (https://discuss.pytorch.org/t/relu-dropout-inplace/13467)

@soumith
Collaborator

soumith commented Jul 21, 2018

@pytorchbot retest this please

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

The build breaks because of #9435. I'll submit a fix for this PR, but #9435 also breaks other PyTorch C++ extensions in the wild (e.g. apex).

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

@vadimkantorov that type of fusion will be better handled once random number generation is available in the JIT compiler.

@ngimel
Collaborator Author

ngimel commented Jul 21, 2018

I don't know what's causing the ROCm build failure; Distributions.cu seems to use the philox state in the same way, without any special ROCm handling.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@staticmethod
def backward(ctx, grad_output):
    if ctx.p > 0 and ctx.train:
        if hasattr(ctx, 'use_fused_kernel') and ctx.use_fused_kernel:
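On the fused path, backward can reuse the saved byte mask and the scale; a pure-Python sketch of that gradient rule (the helper name is assumed, mirroring the `masked_scale` schema below):

```python
def masked_scale_backward(grad_output, mask, p):
    # Dropout is linear in its input, so the backward pass just replays
    # the saved keep-mask and the 1/(1-p) scale on the incoming gradient.
    scale = 1.0 / (1.0 - p)
    return [g * m * scale for g, m in zip(grad_output, mask)]
```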


import torch
from torch.autograd.function import InplaceFunction
from itertools import repeat
from torch.autograd.function import once_differentiable


- func: logsumexp_out(Tensor result, Tensor self, int64_t dim, bool keepdim=False) -> Tensor
variants: function

- func: masked_scale(Tensor self, Tensor mask, double scale) -> Tensor


- func: dot_out(Tensor result, Tensor self, Tensor tensor) -> Tensor
variants: function

- func: fused_dropout(Tensor self, double p, Generator* generator=nullptr) -> (Tensor, Tensor)



namespace {

//due to limitations of philox generator UNROLL has to be 4
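Philox4x32 emits exactly four 32-bit values per invocation, which is why the unroll factor is fixed at 4; a pure-Python sketch of the per-thread consumption pattern (`rand4` stands in for the output of one generator call):

```python
def consume_philox_quad(rand4, xs, p):
    # One Philox4x32 call yields exactly four uniforms, so each thread
    # processes four elements per round -- hence UNROLL must be 4: no
    # random numbers are wasted and counter offsets stay aligned.
    assert len(rand4) == 4 and len(xs) == 4
    return [x * (1 if r > p else 0) / (1.0 - p) for x, r in zip(xs, rand4)]
```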


idx,
seeds.second,
&state);
IndexType rounded_size = ((totalElements - 1)/(blockDim.x*gridDim.x*UNROLL)+1)*blockDim.x*gridDim.x*UNROLL;
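The `rounded_size` expression pads the element count up to a multiple of `blockDim.x*gridDim.x*UNROLL`, so the grid-stride loop runs the same number of 4-wide iterations in every thread; the same arithmetic in Python:

```python
def rounded_size(total_elements, block, grid, unroll=4):
    # Round up to a multiple of block*grid*unroll: every thread then has
    # an identical trip count, keeping its Philox offsets aligned.
    stride = block * grid * unroll
    return ((total_elements - 1) // stride + 1) * stride
```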


Tensor ret = at::empty_like(self);
Tensor mask = self.type().toScalarType(kByte).tensor(self.sizes());
const int64_t nelem = self.numel();
int64_t block_size = 256;
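Allocating the mask at `kByte` rather than the input dtype is what keeps the non-inplace overhead small; a back-of-the-envelope comparison (assuming 4-byte float32 elements; the helper name is made up):

```python
def dropout_extra_bytes(nelem, elem_bytes=4):
    # Memory beyond the input tensor itself:
    fused = nelem * (elem_bytes + 1)    # non-inplace output + 1-byte mask
    inplace_float = nelem * elem_bytes  # inplace output + full float mask
    return fused, inplace_float
```

For float32 this is 5 bytes/element versus 4, i.e. "just a little larger" than the inplace path with a float mask.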


dim3 dim_block(block_size);
dim3 grid((nelem + block_size -1)/block_size);
grid.x = std::min((unsigned int)at::cuda::getCurrentDeviceProperties()->multiProcessorCount * blocks_per_sm, grid.x);
int64_t nrep = ((nelem - 1)/(block_size*grid.x*UNROLL)+1)*UNROLL;
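The launch configuration first covers `nelem` with blocks, then caps the grid by an occupancy limit tied to the SM count, and finally derives how many UNROLL-wide rounds each thread needs; the same arithmetic in Python (`sm_count` and `blocks_per_sm` are assumed example values, not the kernel's constants):

```python
def launch_config(nelem, block=256, sm_count=80, blocks_per_sm=4, unroll=4):
    # Grid: enough blocks to cover nelem, capped at sm_count*blocks_per_sm
    # so oversized inputs fall back to a grid-stride loop.
    grid = min(sm_count * blocks_per_sm, (nelem + block - 1) // block)
    # Per-thread repetition count, kept a multiple of unroll for Philox.
    nrep = ((nelem - 1) // (block * grid * unroll) + 1) * unroll
    return grid, nrep
```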


ret_info.collapseDims();
mask_info.collapseDims(); //ret and mask are collapsed to 1d contiguous tensor
switch (self_info.dims) {
case 1:
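`collapseDims` merges adjacent dimensions whose memory layout is already contiguous, which shrinks the `switch` over `self_info.dims`; a pure-Python sketch of that merge (a simplified stand-in for `TensorInfo::collapseDims`, not its exact logic):

```python
def collapse_dims(sizes, strides):
    # Walk from the innermost dim outward, folding dim i into its inner
    # neighbour whenever stride[i] == stride[inner] * size[inner],
    # i.e. there is no layout gap between the two dims.
    out_sizes, out_strides = [sizes[-1]], [strides[-1]]
    for size, stride in zip(reversed(sizes[:-1]), reversed(strides[:-1])):
        if stride == out_strides[0] * out_sizes[0]:
            out_sizes[0] *= size               # contiguous: fold together
        else:
            out_sizes.insert(0, size)          # gap: keep as separate dim
            out_strides.insert(0, stride)
    return out_sizes, out_strides
```

A fully contiguous tensor collapses to a single dimension, which is why ret and mask always hit the cheap 1-d indexing case.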


Contributor

@facebook-github-bot facebook-github-bot left a comment


@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator Author

ngimel commented Jul 24, 2018

Any advice on fixing the ROCm build failure? Put curandPhiloxState inside a device function? cc @iotamudelta
Also, the macOS failure is unclear, and it looks like the Windows build was just broken yesterday.

@iotamudelta
Contributor

iotamudelta commented Jul 24, 2018

Philox is currently unsupported with hcRNG on ROCm. We are working on integrating rocRAND (ROCm repo #41), which will support it.

@ngimel
Collaborator Author

ngimel commented Jul 24, 2018

Ok, I'll disable Dropout.cu in pytorch/tools/amd_build/disabled_features.yaml then.

@ezyang
Contributor

ezyang commented Jul 24, 2018

SGTM

@ailzhang
Contributor

cc: @ngimel for the conflicting file

@ngimel
Collaborator Author

ngimel commented Jul 31, 2018

@ailzhang done

@ngimel
Collaborator Author

ngimel commented Aug 6, 2018

Anything I should do here to move forward?

@ezyang
Contributor

ezyang commented Aug 6, 2018

Nay, let's try to land it.

Contributor

@facebook-github-bot facebook-github-bot left a comment


ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor

ezyang commented Aug 7, 2018

There's an internal build system issue; the fix is easy though; I'll push it to this branch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Contributor

@facebook-github-bot facebook-github-bot left a comment


ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 7, 2018
Summary:
While waiting for dropout to be fully ported to ATen, here's performance fix for the most common dropout case. Dropout is still in python function, I just added efficient path to it. I could not make inplace work, because generator always generates `return self` for inplace function, and I need to return both original tensor and mask, so inplace goes on the existing pass. Even with non-inplace version, since mask is now a ByteTensor, memory used is just a little larger than for inplace dropout, due to savings on mask.
Once dropout is moved to aten, these kernels still can be used for efficient implementation.
Pull Request resolved: pytorch/pytorch#9666

Reviewed By: SsnL

Differential Revision: D8948077

Pulled By: ezyang

fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Aug 10, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
@ngimel ngimel deleted the dropout branch September 23, 2018 21:09
8 participants