[nvFuser] Reduction support in codegen, fp16 support#38627

Closed
csarofeen wants to merge 40 commits into pytorch:master from csarofeen:reduction_update

Conversation

@csarofeen
Contributor

Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross thread (intra-block) reduction support.
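The cross-thread reduction can be illustrated with a minimal sketch (ordinary host C++ standing in for the generated CUDA; the function name and layout are illustrative, not the actual codegen output): each lane combines its value with the lane `stride` away, halving the active range each step, the way a shared-memory tree reduction does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch only -- not the actual generated kernel. Host C++
// stands in for CUDA: "lane" i accumulates the value at i + stride, and
// the set of active lanes halves each step, mirroring a shared-memory
// tree reduction across the threads of a block.
float block_reduce(std::vector<float> vals) {
  for (std::size_t stride = 1; stride < vals.size(); stride *= 2) {
    for (std::size_t i = 0; i + stride < vals.size(); i += 2 * stride) {
      vals[i] += vals[i + stride];  // lane i absorbs lane i + stride
    }
  }
  return vals.empty() ? 0.0f : vals[0];
}
```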

The two remaining pieces missing for reduction support are:

  • Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
  • Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions. We insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.
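The cast-insertion pattern can be sketched as follows, with `float` standing in for FP16 and `double` for FP32 (a minimal illustration with a made-up function name, not the fuser's actual code): inputs are widened before any arithmetic, and only the final result is narrowed back.

```cpp
#include <cassert>

// Illustrative sketch of the inserted casts: `float` stands in for FP16
// and `double` for FP32. The math runs entirely in the wide type; the
// narrow type appears only at the input and output boundaries.
float fused_mul_add(float a_h, float b_h) {
  double a = static_cast<double>(a_h);  // inserted input cast (fp16 -> fp32)
  double b = static_cast<double>(b_h);
  double out = a * b + a;               // arithmetic in the wide type
  return static_cast<float>(out);       // inserted output cast (fp32 -> fp16)
}
```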

We are also working toward reduction support and shape inference for reductions in the fusion pass.

tlemo and others added 28 commits April 30, 2020 16:10
…rder on TensorView as it's a passthrough to TensorDomain.
support fp16 by adding cast in parser, simple test added for half
Right now launch and kernel configuration is very naive. We need to revisit this
when handling reduction.
repro the failing python test:
```
# This prints out the generated CUDA code
export PYTORCH_CUDA_FUSER_DEBUG=1

# The failing test can be reproduced with:
python test_jit_cuda_fuser.py --ge_config profiling
```
Clean up a few compiler warnings
@csarofeen requested a review from apaszke as a code owner May 17, 2020 15:45
@facebook-github-bot added the oncall: jit label May 17, 2020
@jjsjann123
Collaborator

The error is not relevant. I'm stamping approval on this. Let's try merging master after the CI fix.

@jjsjann123 self-requested a review May 20, 2020 00:14
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Comment thread torch/csrc/jit/codegen/cuda/arith.cpp Outdated
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in 8e69c3b.

@csarofeen mentioned this pull request May 22, 2020
glaringlee pushed a commit that referenced this pull request May 27, 2020
This relands #38675 and tests cpp_extension compatibility in _test only, which is enough; the purpose of this test is to make sure PyTorch and cpp extensions are compatible with xenial + CUDA 9.2 + gcc 5.4.

Two recently introduced changes are not compatible with gcc 5.4 (+ CUDA 9.2):
#37849
#38627
They caused the following CI failures:
https://app.circleci.com/pipelines/github/pytorch/pytorch/173756/workflows/7445e169-9c26-4ec4-a23a-ff6160d155b1/jobs/5582207/steps
https://app.circleci.com/pipelines/github/pytorch/pytorch/173970/workflows/bf0de0f2-9156-4c8f-a097-53ca8e20d4b0/jobs/5589265/steps

The root cause is that gcc 5.4 does not handle uniform initialization lists well: it cannot deduce the correct type in some cases. This is probably a bug in the gcc 5 compiler. I modified this code slightly to make it compatible with CUDA 9.2 + gcc 5.4.

People are still using xenial + gcc 5.4 + CUDA 9.x, so this environment should be covered until xenial is deprecated.

Differential Revision: [D21731026](https://our.internmc.facebook.com/intern/diff/D21731026)

[ghstack-poisoned]
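The kind of deduction failure described in the commit message can be sketched as follows (illustrative only; this is not the code touched in the PR). Under pre-C++17 rules, `auto x{1};` deduces `std::initializer_list<int>` rather than `int`, and gcc 5.4 mishandles some brace-initialized expressions beyond that; spelling the type explicitly sidesteps the deduction entirely.

```cpp
#include <cassert>
#include <vector>

// Illustrative only -- not the code changed in the PR. Under pre-C++17
// rules, `auto x{1};` deduces std::initializer_list<int>, not int, and
// gcc 5.4 additionally mis-deduces some brace-initialized expressions.
// Naming the type explicitly avoids relying on that deduction.
int safe_init_sum() {
  int x = 1;                        // explicit type instead of `auto x{1};`
  std::vector<int> v = {x, x + 1};  // plain list-initialization is fine
  return v[0] + v[1];
}
```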
@csarofeen deleted the reduction_update branch June 5, 2020 14:01
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions. We insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.

We are also working toward reduction support and shape inference for reductions in the fusion pass.
Pull Request resolved: pytorch/pytorch#38627

Reviewed By: albanD

Differential Revision: D21663196

Pulled By: soumith

fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e

Labels

Merged · oncall: jit · open source


9 participants