
Use explicit templates in gpu_kernel_with_scalars #40992

Closed
malfet wants to merge 1 commit into pytorch:master from malfet:malfet/CUDALoops-expilcit-templates

Conversation

@malfet
Contributor

@malfet malfet commented Jul 5, 2020

This trick should have no effect on performance, but it reduces the size of kernels using the template by 10%.
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by the CUDA-10.1 toolchain for sm_75 was 4.2Mb before the change and 3.8Mb after.

@malfet malfet requested review from ezyang, ngimel and zasdfgbnm July 5, 2020 21:00
Collaborator

@zasdfgbnm zasdfgbnm left a comment


Why is the binary size reduced?

Contributor

@facebook-github-bot facebook-github-bot left a comment


@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@malfet
Contributor Author

malfet commented Jul 6, 2020

@zasdfgbnm I'm not entirely sure, to tell the truth, but my guess is that too many lambdas cause both the host and GPU compilers to emit multiple identical instantiations of the same template.
I.e., nm torch_cuda_generated_BinaryMulDivKernel.cu.o returns 2578 symbols before the change, but only 2325 after.

@facebook-github-bot
Contributor

@malfet merged this pull request in 87f9b55.

@malfet malfet deleted the malfet/CUDALoops-expilcit-templates branch July 7, 2020 00:24
csarofeen pushed a commit to csarofeen/pytorch that referenced this pull request Jul 7, 2020
Summary:
This trick should have no effect on performance, but it reduces size of kernels using the template by 10%
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by CUDA-10.1 toolchain for sm_75 before the change was 4.2Mb, after 3.8Mb

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
facebook-github-bot pushed a commit that referenced this pull request Jul 9, 2020
Summary:
Follow up after #40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: #41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
csarofeen added a commit to csarofeen/pytorch that referenced this pull request Aug 16, 2020
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
This trick should have no effect on performance, but it reduces size of kernels using the template by 10%
For example, sizeof(BinaryMulDivKernel.cu.o) compiled by CUDA-10.1 toolchain for sm_75 before the change was 4.2Mb, after 3.8Mb

Pull Request resolved: pytorch#40992

Differential Revision: D22398733

Pulled By: malfet

fbshipit-source-id: 6576f4da00dc5fc2575b2313577f52c6571d5e6f
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Follow up after pytorch#40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: pytorch#41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14


5 participants