
Use explicit templates in gpu_kernel_with_scalars #40992

Closed
malfet wants to merge 1 commit into pytorch:master from malfet:malfet/CUDALoops-expilcit-templates

Conversation

@malfet
Contributor

@malfet malfet commented Jul 5, 2020

This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by the CUDA-10.1 toolchain for sm_75 was 4.2Mb before the change and 3.8Mb after.

@malfet malfet requested review from ezyang, ngimel and zasdfgbnm July 5, 2020 21:00
Collaborator

@zasdfgbnm zasdfgbnm left a comment


Why is the binary size reduced?

Contributor

@facebook-github-bot facebook-github-bot left a comment


@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet
Contributor Author

malfet commented Jul 6, 2020

@zasdfgbnm I'm not entirely sure, to tell the truth, but my guess is that too many lambdas cause both the host and GPU compilers to emit multiple identical instantiations of the same template.
I.e., nm torch_cuda_generated_BinaryMulDivKernel.cu.o returns 2578 symbols before the change, but only 2325 after.

@facebook-github-bot
Contributor

@malfet merged this pull request in 87f9b55.

@malfet malfet deleted the malfet/CUDALoops-expilcit-templates branch July 7, 2020 00:24
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
Summary:
This trick should have no effect on performance, but it reduces size of kernels using the template by 10%
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by CUDA-10.1 toolchain for sm_75 before the change was 4.2Mb, after 3.8Mb

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
facebook-github-bot pushed a commit that referenced this pull request Jul 9, 2020
Summary:
Follow up after #40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: #41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
csarofeen added a commit to csarofeen/pytorch that referenced this pull request Aug 16, 2020
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This trick should have no effect on performance, but it reduces size of kernels using the template by 10%
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by CUDA-10.1 toolchain for sm_75 before the change was 4.2Mb, after 3.8Mb

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Follow up after pytorch#40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: pytorch#41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14


5 participants