Use explicit templates in CUDALoops kernels#41059

Closed
malfet wants to merge 7 commits into pytorch:master from malfet:malfet/CUDALoops-more-explicit-templates

Conversation

@malfet
Contributor

@malfet malfet commented Jul 7, 2020

Follow up after #40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per compilation unit without affecting performance, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

@malfet malfet requested review from gchanan, ngimel and zasdfgbnm July 7, 2020 02:27
@dr-ci

dr-ci Bot commented Jul 7, 2020

💊 CI failures summary and remediations

As of commit 92b99f7 (more details on the Dr. CI page):



🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

If your commit is newer than viable/strict, you can try basing on an older, stable commit:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)

If your commit is older than viable/strict:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 14 times.

@malfet malfet force-pushed the malfet/CUDALoops-more-explicit-templates branch 2 times, most recently from 5f31caa to 305c44a Compare July 7, 2020 19:48
Collaborator

@ngimel ngimel left a comment

lgtm

Comment thread on aten/src/ATen/native/cuda/BinaryMulDivKernel.cu (outdated)
malfet added 7 commits July 7, 2020 19:57:

- This reduces binary size from 3.8 to 3.5Mb
- …Kernel
- Reduces sizeof(CompareEQKernel.cu.o) from 1.8Mb to 1.7Mb by eliminating 11 duplicated symbols.
- …l.cu
- This reduces object file size from 2.0 to 1.8Mb
- Reduces binary size from 2.6 to 2.3Mb
- Reduces binary size with no perf side effects
@malfet malfet force-pushed the malfet/CUDALoops-more-explicit-templates branch from 305c44a to 92b99f7 Compare July 8, 2020 03:10
Contributor

@facebook-github-bot facebook-github-bot left a comment

@malfet is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@malfet merged this pull request in e374280.

@jeffdaily
Collaborator

@malfet This PR is breaking the ROCm build.

@malfet malfet deleted the malfet/CUDALoops-more-explicit-templates branch September 8, 2020 00:55
facebook-github-bot pushed a commit that referenced this pull request Sep 25, 2020
Summary:
Reland attempt of #41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per compilation unit without affecting performance, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: #44286

Reviewed By: ngimel

Differential Revision: D23859691

Pulled By: malfet

fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Follow up after pytorch#40992
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per compilation unit without affecting performance, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: pytorch#41059

Differential Revision: D22458928

Pulled By: malfet

fbshipit-source-id: cca623bb6e769cfe372977b08463d98b1a02dd14
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Reland attempt of pytorch#41059
Use explicit templates instead of lambdas to reduce binary size by 100-200Kb per arch per compilation unit without affecting performance, namely:
BinaryMulDivKernel.cu 3.8Mb -> 3.5Mb
CompareEQKernel.cu 1.8Mb -> 1.7Mb
BinaryAddSubKernel.cu 2.0Mb -> 1.8Mb
BinaryBitwiseOpsKernels.cu 2.6Mb -> 2.3Mb

Pull Request resolved: pytorch#44286

Reviewed By: ngimel

Differential Revision: D23859691

Pulled By: malfet

fbshipit-source-id: 2c4e86f35e0f94a62294dc5d52a3ba364db23e2d