New CUDA Fuser: Unrolling support, interface refactor #36435
Closed
csarofeen wants to merge 59 commits into pytorch:master
Conversation
Commits (truncated titles):
- …hecking (still not recursive). Add start index to For Loops.
- …ditionals back to Int.
- …omains as they are being transformed in these operations.
- …iew APIs are pass through.
jjsjann123 (Collaborator) approved these changes on Apr 14, 2020 and left a comment:
LGTM.
The failing CI gives very good hints on the minor code changes needed. We should fix those.
soumith approved these changes on Apr 14, 2020.
facebook-github-bot (Contributor) left a comment:
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request on Oct 29, 2022:
Summary: same as the pull request description below.
Pull Request resolved: pytorch/pytorch#36435
Reviewed By: ZolotukhinM
Differential Revision: D21024011
Pulled By: soumith
fbshipit-source-id: e852e282fa7a304aba962e1926f756098c011fe0
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request on Nov 10, 2022 (same commit message as above).
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request on Apr 24, 2026 (same commit message as above).
Pull request description:
Unrolling support has been added in a way that produces well-performing code on GPUs. Not sure how long this link will last, but an example of a generated unrolled kernel is:
https://godbolt.org/z/i0uAv3
What can be seen there are multiple ld.global.f32 instructions with no st.global.f32 in between them (and vice versa). This means we are launching multiple loads that can run in parallel, as well as multiple stores that can run in parallel, which can be a crucial optimization for memory-bound kernels. This was generally a point of concern in TVM, as an attempt at a similar kernel from TVM produces https://godbolt.org/z/Vu97vG, which wraps load-store pairs in conditional branches, preventing the benefits of unrolling.
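To make the pattern concrete, here is a minimal hand-written CUDA sketch of the same idea (the kernel name, the scaling operation, and the unroll factor of 4 are illustrative assumptions, not the fuser's generated code). With the inner loops fully unrolled and no branch between the loads and the stores, the compiler is free to emit all the ld.global.f32 instructions back to back before any st.global.f32, so the loads overlap in flight:

```cuda
// Minimal illustrative sketch (NOT the fuser's codegen): each thread handles
// UNROLL consecutive elements. Staging all loads into registers first, then
// issuing all stores, lets the compiler batch the ld.global.f32 instructions
// with no st.global.f32 or branch interleaved between them.
constexpr int UNROLL = 4;

__global__ void scale_unrolled(const float* __restrict__ in,
                               float* __restrict__ out,
                               float s, int n) {
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
  if (base + UNROLL <= n) {            // fast path: full tile, no per-element branches
    float r[UNROLL];
#pragma unroll
    for (int i = 0; i < UNROLL; ++i)   // all loads issued first
      r[i] = in[base + i];
#pragma unroll
    for (int i = 0; i < UNROLL; ++i)   // then all stores
      out[base + i] = r[i] * s;
  } else {                             // tail: leftover elements, handled serially
    for (int i = base; i < n; ++i)
      out[i] = in[i] * s;
  }
}
```

The contrast with the TVM output linked above is the per-element conditional: wrapping each load-store pair in its own branch forces the pairs to serialize, which is exactly what the branch-free fast path here avoids.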