[nvFuser] Reduction support in codegen, fp16 support #38627
Closed
csarofeen wants to merge 40 commits into pytorch:master from
Conversation
…rder on TensorView as it's a passthrough to TensorDomain.
support fp16 by adding cast in parser, simple test added for half
Right now, launch and kernel configuration is very naive. We need to revisit this when handling reductions.
Repro for the failing Python test:
```
# This prints out the generated CUDA code
export PYTORCH_CUDA_FUSER_DEBUG=1
# The failing test can be repro'ed with:
python test_jit_cuda_fuser.py --ge_config profiling
```
…ytorch into reduction_clean
Clean up a few compiler warnings
Collaborator
Error is not relevant. I'm stamping approval on this. Let's try merging master after the CI fix.
jjsjann123
approved these changes
May 20, 2020
Contributor
facebook-github-bot
left a comment
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
malfet
reviewed
May 20, 2020
Contributor
soumith
approved these changes
May 21, 2020
Contributor
glaringlee
pushed a commit
that referenced
this pull request
May 27, 2020
This is to reland #38675 and test cpp_extension compatibility in _test only; this is enough, since the purpose of the test is to make sure PyTorch and cpp extensions are compatible with xenial + cuda 9.2 + gcc 5.4.

Two changes that are not gcc 5.4 (+ cuda 9.2) compatible were introduced recently: #37849 and #38627, which caused the following failures:
https://app.circleci.com/pipelines/github/pytorch/pytorch/173756/workflows/7445e169-9c26-4ec4-a23a-ff6160d155b1/jobs/5582207/steps
https://app.circleci.com/pipelines/github/pytorch/pytorch/173970/workflows/bf0de0f2-9156-4c8f-a097-53ca8e20d4b0/jobs/5589265/steps

The root cause is that gcc 5.4 does not support uniform initialization lists well: it cannot deduce the correct type in some cases. This is probably a bug in the gcc 5 compiler, so I modified the code a little to make it compatible with cuda 9.2 + gcc 5.4. People are still using xenial + gcc 5.4 + cuda 9.x, and this environment should be covered until xenial is deprecated.

Differential Revision: [D21731026](https://our.internmc.facebook.com/intern/diff/D21731026)

[ghstack-poisoned]
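For illustration, a minimal sketch (not the actual PyTorch code touched here) of the kind of pattern that trips gcc 5.4: with `auto`, a braced list deduces `std::initializer_list`, and gcc 5 handled some of these deductions inconsistently; naming the target type explicitly sidesteps the deduction.
```cpp
#include <initializer_list>
#include <vector>

int main() {
  // With auto, a braced list always deduces std::initializer_list<int>;
  // gcc 5.4 mis-deduces or rejects similar patterns in some template contexts.
  auto xs = {1, 2, 3};

  // Workaround in the spirit of the fix: spell out the target type instead
  // of relying on type deduction through the braced list.
  std::vector<int> ys = {1, 2, 3};
  return static_cast<int>(xs.size() + ys.size());
}
```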
jjsjann123
pushed a commit
to jjsjann123/nvfuser
that referenced
this pull request
Oct 29, 2022
Summary: Adds reduction support for the code generator. Reductions are fully supported with the split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to match parity with tensor iterator

The PR also provides FP16 support for fusions: we insert casts on FP16 inputs to FP32, and we insert casts back to FP16 on FP16 outputs. Also working towards reductions and shape inference for reductions in the fusion pass.

Pull Request resolved: pytorch/pytorch#38627
Reviewed By: albanD
Differential Revision: D21663196
Pulled By: soumith
fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e
Adds reduction support for the code generator. Reductions are fully supported with the split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.
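As a flavor of what cross-thread reduction means at the CUDA level, here is a minimal hand-written sketch of a shared-memory tree reduction within one block. This illustrates the technique only; the kernel name and signature are made up, and the generated code handles arbitrary schedules rather than this fixed pattern.
```cuda
// Minimal intra-block (cross-thread) sum reduction: each block reduces its
// chunk of `in` to a single partial sum in `out[blockIdx.x]`.
// Launch with dynamic shared memory:
//   block_sum<<<grid, block, block * sizeof(float)>>>(in, out, n);
__global__ void block_sum(const float* in, float* out, int n) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;

  smem[tid] = (i < n) ? in[i] : 0.0f;  // load, padding the tail with zeros
  __syncthreads();

  // Tree reduction: halve the number of active threads each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) smem[tid] += smem[tid + stride];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = smem[0];
}
```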
The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to match parity with tensor iterator
The PR also provides FP16 support for fusions now: we insert casts on FP16 inputs to FP32, and we insert casts back to FP16 on FP16 outputs.
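Conceptually, the inserted casts behave like this hand-written libtorch analogue (the function is hypothetical, not the fuser's actual codegen): FP16 inputs are widened to FP32, the math runs in FP32, and FP16 outputs are narrowed back.
```cpp
#include <torch/torch.h>

// Hand-written analogue of a fused FP16 pointwise op: cast inputs up,
// compute in FP32, cast the FP16 output back down.
torch::Tensor fused_add_relu_half(const torch::Tensor& a_h,
                                  const torch::Tensor& b_h) {
  auto a = a_h.to(torch::kFloat32);  // cast inserted on FP16 input
  auto b = b_h.to(torch::kFloat32);
  auto out = torch::relu(a + b);     // arithmetic carried out in FP32
  return out.to(torch::kHalf);       // cast to FP16 on the FP16 output
}
```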
Also working towards reductions and shape inference for reductions in the fusion pass.