[nvFuser] Reduction support in codegen, fp16 support#38627

Closed
csarofeen wants to merge 40 commits into pytorch:master from csarofeen:reduction_update

Conversation

@csarofeen
Contributor

Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross thread (intra-block) reduction support.
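The cross-thread reduction can be illustrated with a minimal sketch (ordinary host C++ standing in for the generated CUDA; the function name and layout are illustrative, not the actual codegen output): each lane combines its value with the lane `stride` away, halving the active range each step, the way a shared-memory tree reduction does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch only -- not the actual generated kernel. Host C++
// stands in for CUDA: "lane" i accumulates the value at i + stride, and
// the set of active lanes halves each step, mirroring a shared-memory
// tree reduction across the threads of a block.
float block_reduce(std::vector<float> vals) {
  for (std::size_t stride = 1; stride < vals.size(); stride *= 2) {
    for (std::size_t i = 0; i + stride < vals.size(); i += 2 * stride) {
      vals[i] += vals[i + stride];  // lane i absorbs lane i + stride
    }
  }
  return vals.empty() ? 0.0f : vals[0];
}
```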

The two remaining pieces missing for reduction support are:

  • Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
  • Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions. We insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.
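The cast-insertion pattern can be sketched as follows, with `float` standing in for FP16 and `double` for FP32 (a minimal illustration with a made-up function name, not the fuser's actual code): inputs are widened before any arithmetic, and only the final result is narrowed back.

```cpp
#include <cassert>

// Illustrative sketch of the inserted casts: `float` stands in for FP16
// and `double` for FP32. The math runs entirely in the wide type; the
// narrow type appears only at the input and output boundaries.
float fused_mul_add(float a_h, float b_h) {
  double a = static_cast<double>(a_h);  // inserted input cast (fp16 -> fp32)
  double b = static_cast<double>(b_h);
  double out = a * b + a;               // arithmetic in the wide type
  return static_cast<float>(out);       // inserted output cast (fp32 -> fp16)
}
```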

We are also working toward reduction support and shape inference for reductions in the fusion pass.

tlemo and others added 28 commits April 30, 2020 16:10
…rder on TensorView as it's a passthrough to TensorDomain.
support fp16 by adding cast in parser, simple test added for half
Right now launch and kernel configuration is very naive. We need to revisit this
when handling reduction.
repro the failing python test:
```
# This prints out the generated CUDA code
export PYTORCH_CUDA_FUSER_DEBUG=1

# The failing test can be reproduced with:
python test_jit_cuda_fuser.py --ge_config profiling
```
Clean up a few compiler warnings
@csarofeen requested a review from apaszke as a code owner May 17, 2020 15:45
@facebook-github-bot added the oncall: jit label May 17, 2020
@jjsjann123
Collaborator

The error is not relevant. I'm stamping approval on this. Let's try merging master after the CI fix.

@jjsjann123 self-requested a review May 20, 2020 00:14
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Comment thread torch/csrc/jit/codegen/cuda/arith.cpp Outdated
Contributor

@facebook-github-bot facebook-github-bot left a comment


@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@soumith merged this pull request in 8e69c3b.

@csarofeen mentioned this pull request May 22, 2020
glaringlee pushed a commit that referenced this pull request May 27, 2020
This relands #38675 and tests cpp_extension compatibility in _test only, which is enough; the purpose of this test is to make sure PyTorch and cpp extensions are compatible with xenial + CUDA 9.2 + gcc 5.4.

Two recently introduced changes are not compatible with gcc 5.4 (+ CUDA 9.2):
#37849
#38627
They caused the following CI failures:
https://app.circleci.com/pipelines/github/pytorch/pytorch/173756/workflows/7445e169-9c26-4ec4-a23a-ff6160d155b1/jobs/5582207/steps
https://app.circleci.com/pipelines/github/pytorch/pytorch/173970/workflows/bf0de0f2-9156-4c8f-a097-53ca8e20d4b0/jobs/5589265/steps

The root cause is that gcc 5.4 does not handle uniform initialization lists well: it cannot deduce the correct type in some cases. This is probably a bug in the gcc 5 compiler. I modified this code slightly to make it compatible with CUDA 9.2 + gcc 5.4.

People are still using xenial + gcc 5.4 + CUDA 9.x, so this environment should be covered until xenial is deprecated.

Differential Revision: [D21731026](https://our.internmc.facebook.com/intern/diff/D21731026)

[ghstack-poisoned]
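The kind of deduction failure described in the commit message can be sketched as follows (illustrative only; this is not the code touched in the PR). Under pre-C++17 rules, `auto x{1};` deduces `std::initializer_list<int>` rather than `int`, and gcc 5.4 mishandles some brace-initialized expressions beyond that; spelling the type explicitly sidesteps the deduction entirely.

```cpp
#include <cassert>
#include <vector>

// Illustrative only -- not the code changed in the PR. Under pre-C++17
// rules, `auto x{1};` deduces std::initializer_list<int>, not int, and
// gcc 5.4 additionally mis-deduces some brace-initialized expressions.
// Naming the type explicitly avoids relying on that deduction.
int safe_init_sum() {
  int x = 1;                        // explicit type instead of `auto x{1};`
  std::vector<int> v = {x, x + 1};  // plain list-initialization is fine
  return v[0] + v[1];
}
```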
@csarofeen deleted the reduction_update branch June 5, 2020 14:01
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Adds reduction support for the code generator. Reductions are fully supported with split/merge/reorder/rfactor/computeAt/unroll operators. There is also cross-thread (intra-block) reduction support.

The two remaining pieces missing for reduction support are:
- Safety: if cross-thread reduction was used, child operators shouldn't be able to bind that thread dim anymore
- Cross-block reduction: we will want inter-block reduction support to reach parity with TensorIterator

This PR also provides FP16 support for fusions. We insert casts from FP16 inputs to FP32, and casts back to FP16 on FP16 outputs.

We are also working toward reduction support and shape inference for reductions in the fusion pass.
Pull Request resolved: pytorch/pytorch#38627

Reviewed By: albanD

Differential Revision: D21663196

Pulled By: soumith

fbshipit-source-id: 3ff2df563f86c39cd5821ab9c1148149e5172a9e

Labels

Merged · oncall: jit · open source


9 participants