Implement gpu_kernel_multiple_outputs #37969

Closed

crcrpar wants to merge 12 commits into pytorch:master from crcrpar:multiple-outputs-gpu-kernel

Conversation

@crcrpar
Collaborator

@crcrpar crcrpar commented May 6, 2020

This PR introduces a variant of gpu_kernel for functions that return multiple values via thrust::tuple.
With it, I simplified prelu_cuda_backward_share_weights_kernel.

Why use thrust::tuple?

Because std::tuple does not support operator= in device code, which makes the implementation complicated.
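To illustrate the shape of the API (a hypothetical host-side analogue, not code from the PR — the function and parameter names here are made up), the idea is a single functor that returns a tuple, with each tuple element scattered into its own output array:

```cpp
#include <array>
#include <cstddef>
#include <tuple>

// Hypothetical host-side analogue of a multiple-output elementwise kernel:
// the functor returns a tuple, and each tuple element is scattered into its
// own output array. On the GPU, the PR uses thrust::tuple for the return
// type because std::tuple's operator= is not usable in device code.
template <typename F, std::size_t N>
void kernel_multiple_outputs(const std::array<float, N>& a,
                             const std::array<float, N>& b,
                             std::array<float, N>& out0,
                             std::array<float, N>& out1,
                             F f) {
  for (std::size_t i = 0; i < N; ++i) {
    std::tuple<float, float> r = f(a[i], b[i]);  // one call, two results
    out0[i] = std::get<0>(r);
    out1[i] = std::get<1>(r);
  }
}
```

For example, passing `[](float x, float y) { return std::make_tuple(x + y, x * y); }` fills both a sum array and a product array in a single pass over the inputs.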

crcrpar added 6 commits May 6, 2020 15:13
make return value of lambda argument of Loader struct

rewrite `prelu_cuda_backward_kernel_share_weight`

remove `legacy` methods

fix result store function

reduce copy-paste

remove unused array

fix except for tuple

revert changes in CUDALoops

move `is_tuple` to Loops.cuh
- changes some asserts
- typename out_t -> typename ...Args and thrust::tuple<Args...>
@dr-ci

dr-ci Bot commented May 6, 2020

💊 CI failures summary and remediations

As of commit 43289f8 (more details on the Dr. CI page):


  • 3/3 failures possibly* introduced in this PR
    • 2/3 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda11.0_build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Error generating file
Retry attempt 3: 
C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(236): error: identifier "cusparseScsrmm2" is undefined

C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu(259): error: identifier "cusparseDcsrmm2" is undefined

2 errors detected in the compilation of "C:/Users/circleci/project/aten/src/ATen/native/sparse/cuda/SparseCUDABlas.cu".
SparseCUDABlas.cu
-- Removing C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
C:/Jenkins/Miniconda3/Library/bin/cmake.exe -E remove C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj
CMake Error at torch_cuda_generated_SparseCUDABlas.cu.obj.Release.cmake:281 (message):
  Error generating file
  C:/Users/circleci/project/build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/sparse/cuda/./torch_cuda_generated_SparseCUDABlas.cu.obj



ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 20 times.

Collaborator

@zasdfgbnm zasdfgbnm left a comment


@ngimel This refactors the logic of TensorIterator offset calculation, 32-bit indexing splitting, etc. of the prelu kernel into gpu_kernel_multiple_outputs, so that the code in Activation.cu becomes cleaner. It is a GPU kernel that supports multiple outputs but, for simplicity, does not support dynamic casting. It is intended to cover cases where gpu_kernel is not usable.
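To make the refactor concrete (a sketch only — `prelu_backward_ref` is a hypothetical name, and this is plain host C++ with std::tuple standing in for the thrust::tuple the CUDA version returns), the per-element PReLU backward math can be expressed as one functor producing two values:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical sketch of the per-element math a multiple-output kernel can
// run for the shared-weight PReLU backward. For f(x) = x > 0 ? x : w * x:
//   d(out)/d(input)  = x > 0 ? 1 : w
//   d(out)/d(weight) = x > 0 ? 0 : x
// One functor returns both the grad_input value and the per-element
// grad_weight contribution.
std::pair<std::vector<float>, std::vector<float>>
prelu_backward_ref(const std::vector<float>& input,
                   const std::vector<float>& grad_out,
                   float weight) {
  auto f = [weight](float x, float go) {
    float dx = go * (x > 0.f ? 1.f : weight);  // chain rule w.r.t. input
    float dw = x > 0.f ? 0.f : go * x;         // chain rule w.r.t. weight
    return std::make_tuple(dx, dw);
  };
  std::vector<float> grad_in(input.size()), grad_w(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    std::tie(grad_in[i], grad_w[i]) = f(input[i], grad_out[i]);
  }
  return {grad_in, grad_w};  // grad_w is reduced (summed) separately
}
```

The per-element grad_weight values still need a reduction to produce the final scalar gradient for the shared weight; only the elementwise part maps onto the multiple-output kernel.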

@zasdfgbnm zasdfgbnm requested a review from ngimel May 7, 2020 01:23
@z-a-f

z-a-f commented Jul 31, 2020

@zasdfgbnm, @ngimel I will import it to Phabricator (I want to try it out internally). Please commandeer; the diff is ready to land.

@crcrpar Please resolve the merge conflicts whenever you have a chance.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@z-a-f has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Comment thread aten/src/ATen/native/cuda/MemoryAccess.cuh
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@crcrpar
Collaborator Author

crcrpar commented Aug 1, 2020

@z-a-f Thanks. It looks like @zasdfgbnm resolved the conflicts on my behalf, including applying the fix suggested by @ngimel.

Thank you @zasdfgbnm, @ngimel

@z-a-f

z-a-f commented Aug 5, 2020

@ngimel, should I land this? Is it ready to be landed?

@zasdfgbnm
Collaborator

zasdfgbnm commented Aug 5, 2020

@z-a-f This is not ready. Some copy-pasting is needed to make it build on ROCm. Working on it.

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ngimel merged this pull request in eb9ae7c.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This PR introduces a variant of `gpu_kernel` for functions that return multiple values via `thrust::tuple`.
With it, I simplified `prelu_cuda_backward_share_weights_kernel`.

### Why use `thrust::tuple`?
Because `std::tuple` does not support `operator=` in device code, which makes the implementation complicated.

Pull Request resolved: pytorch#37969

Reviewed By: paulshaoyuqiao

Differential Revision: D22868670

Pulled By: ngimel

fbshipit-source-id: eda0a29ac0347ad544b24bf60e3d809a7db1a929