
Nvfuser code bump 12 5 #69428

Closed
jjsjann123 wants to merge 538 commits into pytorch:master from jjsjann123:nvfuser_code_bump_12_5

Conversation

@jjsjann123
Collaborator

@jjsjann123 jjsjann123 commented Dec 5, 2021

Things added in this PR that require review:

  1. cuLaunchCooperativeKernel driver API added
    aten/src/ATen/cuda/detail/LazyNVRTC.cpp
    aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:

  1. perf tuning of the codegen scheduler that improves performance.
  2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:

  1. aten::gelu with approximation
  2. local changes that are upstreamed in the PR fixing removeProfilingNodes duplicated functions (#1282) #68804

rdspring1 and others added 30 commits June 3, 2021 12:37
Create ops directory to hold all fusion definitions
Create named variables for ops with multiple outputs
Update batch norm with welford operation
Rename WelfordResult var to var_sum
Co-authored-by: Ryan Spring <rspring@nvidia.com>
In cuda_fp16.hpp, the constructor of __half is declared with the default keyword.
Apparently that can reduce register usage in some cases. More
specifically, in the following, x and y may result in different register usage.

```
__half x;
__half y[1];
```

See
https://gitlab-master.nvidia.com/nmaruyama/register-pressure/-/blob/master/register_pressure.cu for a concrete example.
* add repro

* add fix

* clang format

* format

* format
1. Allow binary dump when compiling to sass
2. Skip assertion for kernel code in release build, which greatly reduces register usage
3. Add env switch to dump register usage via ptxas verbose option
Allows segmentation to consider output-to-input aliasing. We add the aliased input to its corresponding SegmentedGroup, so the executor has the tensor to be aliased at kernel execution.
A refactor to make it easier to modify the signature of the parse function.
repro added. assertion added
Co-authored-by: jiej <jiej@nvidia.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Remove MagicScheduler title from benchmarks

Co-authored-by: Ryan Spring <rspring@nvidia.com>
* update iterVisitor to output ordered exprs

* update fusion printer

* simplify logic
* Add autocast op parsing in fuser.
* Add symbolic script changes to make autocast ops autodiff compatible.
* Add proper symbolic scripting of autocast backward support.
* Adding aten::to parsing.
* enable profile int to profile ScalarType

Co-authored-by: jiej <jiej@nvidia.com>
…h#928)

Outdated due to 
- added invariance in computeAt csarofeen#838
- the change to barrier sync allowing block broadcast/reduce to be placed in conditional code
- persistent buffers being considered on inputs
Undoes some of the changes of pytorch#928 as layer norm half was failing. This just doesn't run computeWithOutputs on inputs that aren't inputs to the reduction.
* pipe through index mode

* replace codegen strings

* cache index mode

* use std limit

* move definitions

* rename INDEX_TYPE
gcc-7.x can't work out the copy elision for a return type of std::optional.
E.g., in the example below, a copy is made during return, while on a later compiler (9.x), NRVO kicks in and no copy/move is issued.

```
std::optional<T> foo() {
  T ret = ...;
  return ret;
}
```

So we update the code to avoid the implicit conversion during return.
Unswitch can be used for non-const IterDomains as it doesn't move the
allocation.
Generate predicates based on reference tensors. Be more aggressive on single indexing into iteration domains comprising only merges. Add a new predicate method to unswitch predicates.
Clean up in thread predicates.
@albanD albanD requested a review from ngimel December 10, 2021 14:52
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 10, 2021
@ngimel ngimel removed their request for review December 10, 2021 20:16
@ngimel
Collaborator

ngimel commented Dec 10, 2021

Sorry, I'm not the one handling those.

@jjsjann123
Collaborator Author

Since Natalia has confirmed the changes outside of codegen/nvfuser, can we start the import & internal CI to work towards landing?

cc'ing @wconstab to track the status of aten::_softmax parsing for LTC.

Collaborator

@ngimel ngimel left a comment

Approving cuLaunchKernel related changes.

@facebook-github-bot
Contributor

@wconstab has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


wconstab pushed a commit to wconstab/pytorch that referenced this pull request Dec 16, 2021
Summary:
Pull Request resolved: pytorch#69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. perf tuning of the codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch#68804

Pull Request resolved: pytorch#69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e2d5ecefa69ab2ef823854bd9eb6ab2d054645b1
jjsjann123 added a commit to csarofeen/pytorch that referenced this pull request Jan 1, 2022
jjsjann123 added a commit to csarofeen/pytorch that referenced this pull request Jan 1, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026