Nvfuser code bump 12 5 #69428
Closed
jjsjann123 wants to merge 538 commits into pytorch:master
Create ops directory to hold all fusion definitions
Create named variables for ops with multiple outputs
Update batch norm with welford operation
Rename WelfordResult var to var_sum
Co-authored-by: Ryan Spring <rspring@nvidia.com>
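For readers unfamiliar with the Welford operation mentioned above, here is a minimal host-side sketch of the recurrence. The struct and function names are hypothetical and are not nvfuser's `WelfordResult`; the only detail taken from the commit message is that the accumulated quantity is a `var_sum` (a running sum of squared deviations), which is assumed here to be divided by the count only when a variance is actually needed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical accumulator; field names are illustrative only.
struct WelfordState {
  double avg = 0.0;      // running mean
  double var_sum = 0.0;  // running sum of squared deviations from the mean
  std::int64_t n = 0;    // number of values seen so far
};

// Fold one value into the running state (single-pass Welford update).
void welford_update(WelfordState& s, double x) {
  s.n += 1;
  const double delta = x - s.avg;
  s.avg += delta / static_cast<double>(s.n);
  s.var_sum += delta * (x - s.avg);
}

// Merge two partial states, as a parallel reduction would do.
WelfordState welford_combine(const WelfordState& a, const WelfordState& b) {
  if (a.n == 0) return b;
  if (b.n == 0) return a;
  WelfordState out;
  out.n = a.n + b.n;
  const double na = static_cast<double>(a.n);
  const double nb = static_cast<double>(b.n);
  const double delta = b.avg - a.avg;
  out.avg = a.avg + delta * nb / (na + nb);
  out.var_sum = a.var_sum + b.var_sum + delta * delta * na * nb / (na + nb);
  return out;
}

// Sequentially reduce a whole vector.
WelfordState welford_reduce(const std::vector<double>& xs) {
  WelfordState s;
  for (double x : xs) welford_update(s, x);
  return s;
}

// Biased (population) variance, derived from var_sum only at the end.
double biased_variance(const WelfordState& s) {
  assert(s.n > 0);
  return s.var_sum / static_cast<double>(s.n);
}
```

Keeping `var_sum` undivided is what makes `welford_combine` cheap: partial results from different threads or blocks can be merged without first converting back from a variance.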
In cuda_fp16.hpp, the constructor of half is set by the default keyword. Apparently that can reduce register usage in some cases. More specifically, in the following, x and y may result in different register usage.
```
__half x;
__half y[1];
```
See https://gitlab-master.nvidia.com/nmaruyama/register-pressure/-/blob/master/register_pressure.cu for a concrete example.
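A host-side sketch of what the `default` keyword changes at the language level. The two wrapper types below are hypothetical and are not the real `__half`. Whether constructor triviality is what actually drives the register difference in device code is an assumption; the note above only reports the observed effect.

```cpp
#include <type_traits>

// Hypothetical 16-bit wrapper with a defaulted constructor.
struct HalfDefaulted {
  HalfDefaulted() = default;  // defaulted on first declaration: trivial
  unsigned short bits;
};

// Same layout, but the constructor is user-provided (even with an empty body).
struct HalfUserProvided {
  HalfUserProvided() {}  // user-provided: no longer trivial
  unsigned short bits;
};

// Small helper so the property can be queried by name.
template <class T>
constexpr bool has_trivial_default_ctor() {
  return std::is_trivially_default_constructible<T>::value;
}

static_assert(has_trivial_default_ctor<HalfDefaulted>(),
              "defaulted constructor keeps the type trivially constructible");
static_assert(!has_trivial_default_ctor<HalfUserProvided>(),
              "user-provided constructor makes default construction non-trivial");
```

With a non-trivial constructor, declaring an array such as `HalfUserProvided y[1];` formally runs a constructor call per element, which is one plausible reason a compiler might treat scalars and arrays differently.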
* add repro
* add fix
* clang format
* format
* format
1. Allow binary dump when compiling to sass
2. Skip assertion for kernel code in release build, which greatly reduces register usage
3. Add env switch to dump register usage via ptxas verbose option
Allows segmentation to consider output-to-input aliasing. We add the aliased input to its corresponding SegmentedGroup, so the executor has the tensor to be aliased at kernel execution.
A refactor to make it easier to modify the signature of the parse function.
repro added. assertion added
Co-authored-by: jiej <jiej@nvidia.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>
Remove MagicScheduler title from benchmarks
Co-authored-by: Ryan Spring <rspring@nvidia.com>
* update iterVisitor to output ordered exprs
* update fusion printer
* simplify logic
* Add autocast op parsing in fuser.
* Add symbolic script changes to make autocast ops autodiff compatible.
* Add proper symbolic scripting of autocast backward support.
* Adding aten::to parsing.
* enable profile int to profile ScalarType
Co-authored-by: jiej <jiej@nvidia.com>
…h#928)
Outdated due to
- added invariance in computeAt csarofeen#838
- the change to barrier sync allowing block broadcast/reduce to be placed in conditional code
- persistent buffers being considered on inputs
Undoes some of the changes of pytorch#928 as layer norm half was failing. This just doesn't run computeWithOutputs on inputs that aren't inputs to the reduction.
* pipe through index mode
* replace codegen strings
* cache index mode
* use std limit
* move definitions
* rename INDEX_TYPE
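The commit list above does not say how the index mode is chosen. A minimal sketch, assuming "index mode" means picking 32-bit or 64-bit indexing per kernel and that "use std limit" refers to `std::numeric_limits`. The enum and function names are hypothetical, strides are assumed to be non-negative, and 64-bit overflow of the offset itself is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical names; not taken from the nvfuser sources.
enum class IndexMode { Int32, Int64 };

// Pick the narrowest index type that can address every element of a tensor
// described by (sizes, strides), both given in elements.
IndexMode choose_index_mode(const std::vector<std::int64_t>& sizes,
                            const std::vector<std::int64_t>& strides) {
  assert(sizes.size() == strides.size());
  std::int64_t max_offset = 0;
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    if (sizes[i] == 0) {
      return IndexMode::Int32;  // empty tensor: nothing to index
    }
    // Largest offset contributed by this dimension.
    max_offset += (sizes[i] - 1) * strides[i];
  }
  return max_offset <= std::numeric_limits<std::int32_t>::max()
      ? IndexMode::Int32
      : IndexMode::Int64;
}
```

Caching the result per kernel, as the "cache index mode" item suggests, would avoid recomputing this for every launch with the same shapes.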
gcc-7.x can't work out the copy elision for a return type with std::optional.
For example, in the snippet below, gcc-7.x makes a copy during return, while on later compilers (9.x) NRVO kicks in and no copy/move is issued.

    std::optional<T> foo() {
      T ret = ...;
      return ret;
    }

So we update the code to avoid the implicit conversion during return.
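The note above does not show the updated code. A minimal sketch of two ways to avoid the implicit conversion at the return statement, using a hypothetical `Tracked` type that counts copies and moves so the effect can be observed. This is an illustration under those assumptions, not the change made in the PR.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical payload type that records how it gets constructed.
struct Tracked {
  static inline int copies = 0;
  static inline int moves = 0;
  Tracked() = default;
  Tracked(const Tracked&) { ++copies; }
  Tracked(Tracked&&) noexcept { ++moves; }
};

// Option 1: build the optional explicitly from an rvalue, so the return
// statement no longer relies on an implicit T -> std::optional<T> conversion.
std::optional<Tracked> make_explicit() {
  Tracked ret;
  return std::optional<Tracked>(std::move(ret));
}

// Option 2: make the local variable the return type itself and construct
// the payload in place.
std::optional<Tracked> make_in_place() {
  std::optional<Tracked> ret;
  ret.emplace();
  assert(ret.has_value());
  return ret;
}
```

Neither variant should copy the payload. Option 1 pays exactly one move. Option 2 pays at most one, and none when the compiler elides the return.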
21 6 17 devel bump
Unswitch can be used for non-const IterDomains as it doesn't move the allocation.
Generate predicates based on reference tensors. Be more aggressive on single indexing into iteration domains consisting only of merges. Add new predicate method to unswitch predicates.
Clean up in thread predicates.
Collaborator
Sorry, I'm not the one handling those.
Collaborator Author
Since we got Natalia confirming changes outside of codegen/nvfuser, can we start the import & internal CI to work towards a landing? cc'ing @wconstab to track the status of
ngimel approved these changes Dec 13, 2021
Collaborator ngimel left a comment
Approving cuLaunchKernel related changes.
Contributor
@wconstab has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
wconstab pushed a commit to wconstab/pytorch that referenced this pull request Dec 16, 2021
Summary:
Pull Request resolved: pytorch#69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
   aten/src/ATen/cuda/detail/LazyNVRTC.cpp
   aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch#68804
Pull Request resolved: pytorch#69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e2d5ecefa69ab2ef823854bd9eb6ab2d054645b1
jjsjann123 added a commit to csarofeen/pytorch that referenced this pull request Jan 1, 2022
Summary:
Pull Request resolved: pytorch#69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
   aten/src/ATen/cuda/detail/LazyNVRTC.cpp
   aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch#68804
Pull Request resolved: pytorch#69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Pull Request resolved: pytorch/pytorch#69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
   aten/src/ATen/cuda/detail/LazyNVRTC.cpp
   aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch/pytorch#68804
Pull Request resolved: pytorch/pytorch#69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary:
Pull Request resolved: pytorch/pytorch#69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
   aten/src/ATen/cuda/detail/LazyNVRTC.cpp
   aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch/pytorch#68804
Pull Request resolved: pytorch/pytorch#69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Summary:
Pull Request resolved: pytorch#69964
Things added in this PR that require review:
1. cuLaunchCooperativeKernel driver API added
   aten/src/ATen/cuda/detail/LazyNVRTC.cpp
   aten/src/ATen/cuda/nvrtc_stub/ATenNVRTC.h
nvfuser code update:
1. perf tuning on codegen scheduler that improves performance.
2. permutation support has been extended beyond contiguous/channels-last. (The improvements could be observed on PW benchmark)
Things reverted from local changes:
1. aten::gelu with approximation
2. local changes that are upstreamed in PR pytorch#68804
Pull Request resolved: pytorch#69428
Reviewed By: ngimel
Differential Revision: D33073817
Pulled By: wconstab
fbshipit-source-id: e77d32e81d037d7370822b040456fd4c3bd68edb