
Nvfuser code bump 12 5 #69428

Closed
jjsjann123 wants to merge 538 commits into pytorch:master from jjsjann123:nvfuser_code_bump_12_5

Conversation

@jjsjann123
Collaborator

@jjsjann123 jjsjann123 commented Dec 5, 2021

Things added in this PR that require review:

  1. cuLaunchCooperativeKernel driver API added
    aten/src/ATen/cuda/detail/LazyNVRTC.cpp
    aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:

  1. perf tuning of the codegen scheduler that improves performance.
  2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:

  1. aten::gelu with approximation
  2. local changes that are upstreamed in the PR fixing removeProfilingNodes duplicated functions (#1282) #68804

rdspring1 and others added 30 commits June 3, 2021 12:37
Create ops directory to hold all fusion definitions
Create named variables for ops with multiple outputs
Update batch norm with welford operation
Rename WelfordResult var to var_sum
Co-authored-by: Ryan Spring <rspring@nvidia.com>
In cuda_fp16.hpp, the constructor of __half is declared with the default keyword.
Apparently that can reduce register usage in some cases. More
specifically, in the following, x and y may result in different register usage.

```
__half x;
__half y[1];
```

See
https://gitlab-master.nvidia.com/nmaruyama/register-pressure/-/blob/master/register_pressure.cu for a concrete example.
* add repro

* add fix

* clang format

* format

* format
1. Allow binary dump when compiling to sass
2. Skip assertion for kernel code in release build, which greatly reduces register usage
3. Add env switch to dump register usage via ptxas verbose option
Allows segmentation to consider output-to-input aliasing. We add the aliased input to its corresponding SegmentedGroup, so the executor has the tensor to be aliased at kernel execution.
A refactor to make it easier to modify the signature of the parse function.
repro added. assertion added
Co-authored-by: jiej <jiej@nvidia.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Remove MagicScheduler title from benchmarks

Co-authored-by: Ryan Spring <rspring@nvidia.com>
* update iterVisitor to output ordered exprs

* update fusion printer

* simplify logic
* Add autocast op parsing in fuser.
* Add symbolic script changes to make autocast ops autodiff compatible.
* Add proper symbolic scripting of autocast backward support.
* Adding aten::to parsing.
* enable profile int to profile ScalarType

Co-authored-by: jiej <jiej@nvidia.com>
…h#928)

Outdated due to 
- added invariance in computeAt csarofeen#838
- the change to barrier sync allowing block broadcast/reduce to be placed in conditional code
- persistent buffers being considered on inputs
Undoes some of the changes of pytorch#928 as layer norm half was failing. This just doesn't run computeWithOutputs on inputs that aren't inputs to the reduction.
* pipe through index mode

* replace codegen strings

* cache index mode

* use std limit

* move definitions

* rename INDEX_TYPE
gcc-7.x can't work out the copy elision for a return type of std::optional.
E.g., in the example below, a copy is made during return, while on a later compiler (9.x), NRVO kicks in and no copy/move is issued.

```
std::optional<T> foo() {
  T ret = ...;
  return ret;
}
```

So we update the code to avoid the implicit conversion during return.
Unswitch can be used for non-const IterDomains as it doesn't move the
allocation.
Generate predicates based on reference tensors. Be more aggressive on single indexing into iteration domains comprising only merges. Add a new predicate method to unswitch predicates.
Clean up in thread predicates.
@albanD albanD requested a review from ngimel December 10, 2021 14:52
@albanD albanD added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 10, 2021
@ngimel ngimel removed their request for review December 10, 2021 20:16
@ngimel
Collaborator

ngimel commented Dec 10, 2021

Sorry, I'm not the one handling those.

@jjsjann123
Collaborator Author

Since Natalia has confirmed the changes outside of codegen/nvfuser, can we start the import & internal CI to work towards landing?

cc'ing @wconstab to track the status of aten::_softmax parsing for LTC.

Collaborator

@ngimel ngimel left a comment

Approving cuLaunchKernel related changes.

@facebook-github-bot
Contributor

@wconstab has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


wconstab pushed a commit to wconstab/pytorch that referenced this pull request Dec 16, 2021
Summary:
Pull Request resolved: pytorch#69964

Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
aten/src/ATen/cuda/detail/LazyNVRTC.cpp
aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h

nvfuser code update:
1. perf tuning of the codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last (the improvements can be observed on the PW benchmark).

Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch#68804

Pull Request resolved: pytorch#69428

Reviewed By: ngimel

Differential Revision: D33073817

Pulled By: wconstab

fbshipit-source-id: e2d5ecefa69ab2ef823854bd9eb6ab2d054645b1
jjsjann123 added a commit to csarofeen/pytorch that referenced this pull request Jan 1, 2022
jjsjann123 added a commit to csarofeen/pytorch that referenced this pull request Jan 1, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026