Use explicit templates in CUDALoops kernels #41059
Closed
malfet wants to merge 7 commits into pytorch:master
💊 CI failures summary and remediations
As of commit 92b99f7 (more details on the Dr. CI page):
🚧 1 fixed upstream failure: These were probably caused by upstream breakages that were already fixed.
Please rebase on the
5f31caa to 305c44a
ngimel approved these changes Jul 7, 2020
This reduces binary size from 3.8 to 3.5Mb
…Kernel Reduces sizeof(CompareEQKernel.cu.o) from 1.8Mb to 1.7Mb by eliminating 11 duplicated symbols.
…l.cu This reduces object file size from 2.0 to 1.8Mb
Reduces binary size from 2.6 to 2.3Mb
Reduces binary size with no perf side effects
305c44a to 92b99f7
facebook-github-bot (Contributor) left a comment:
@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@malfet This PR is breaking the ROCm build.
facebook-github-bot pushed a commit that referenced this pull request Sep 25, 2020
Summary: Reland attempt of #41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
Pull Request resolved: #44286
Reviewed By: ngimel
Differential Revision: D23859691
Pulled By: malfet
fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: Follow up after pytorch#40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
Pull Request resolved: pytorch#41059
Differential Revision: D22458928
Pulled By: malfet
fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: Reland attempt of pytorch#41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
Pull Request resolved: pytorch#44286
Reviewed By: ngimel
Differential Revision: D23859691
Pulled By: malfet
fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d
Follow up after #40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per CU without affecting perf, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb
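For readers unfamiliar with the pattern, here is a minimal host-only C++ sketch of the difference between passing a lambda and passing an explicit functor template to a templated loop. It is not the code from this PR: `binary_loop`, `MulFunctor`, `mul_with_lambda` and `mul_with_functor` are hypothetical names chosen for illustration, and the loop runs on the CPU rather than launching a CUDA kernel. The only language fact it relies on is that every lambda expression has its own unique closure type, so a template instantiated on a lambda cannot be shared with any other lambda site, whereas a named functor type can be reused. Whether this is exactly what produced the savings listed above is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical host-side stand-in for a templated elementwise loop.
// A kernel launcher templated on the functor type behaves the same way
// with respect to template instantiation.
template <typename T, typename Func>
std::vector<T> binary_loop(const std::vector<T>& a, const std::vector<T>& b, Func f) {
  assert(a.size() == b.size());
  std::vector<T> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = f(a[i], b[i]);
  }
  return out;
}

// Explicit template: one named type per element type, reusable at every call site.
template <typename T>
struct MulFunctor {
  T operator()(T a, T b) const { return a * b; }
};

// Lambda style: binary_loop is instantiated on a closure type that is
// unique to this particular lambda expression.
template <typename T>
std::vector<T> mul_with_lambda(const std::vector<T>& a, const std::vector<T>& b) {
  return binary_loop(a, b, [](T x, T y) { return x * y; });
}

// Functor style: binary_loop is instantiated on MulFunctor<T>, a type that
// any other call site can name and therefore share.
template <typename T>
std::vector<T> mul_with_functor(const std::vector<T>& a, const std::vector<T>& b) {
  return binary_loop(a, b, MulFunctor<T>());
}

// Two textually identical lambdas still have distinct closure types, so each
// one would give binary_loop a separate instantiation.
const auto lambda_one = [](float x, float y) { return x * y; };
const auto lambda_two = [](float x, float y) { return x * y; };
static_assert(!std::is_same<decltype(lambda_one), decltype(lambda_two)>::value,
              "each lambda expression has its own closure type");
```

Calling `mul_with_lambda` and `mul_with_functor` on the same inputs gives the same products; only the type that `binary_loop` is instantiated on differs.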