[NVFuser] Upstream push 0907 #84626
Conversation
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Rebased and squashed commits from nvfuser upstream push 0907. RUN_TORCHBENCH: nvfuser [ghstack-poisoned]
🔗 Helpful links
❌ 3 New Failures, 11 Pending as of commit d9845a9 (more details on the Dr. CI page).
🕵️ 2 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Rebased and squashed commits from nvfuser upstream push 0907. RUN_TORCHBENCH: nvfuser ghstack-source-id: 52040e7 Pull Request resolved: #84626
😢 forgot to add
errr. I'm seeing real build errors on macOS and ROCm test failures from the smoke test, which has the trunk label #84240. Start digging there...
The ROCm failure is not very informative for debugging. We are hitting a driver error 200, and the assert seen in the log just checks that nvrtcCompileProgram succeeded. Link to the ROCm failure below: https://github.com/pytorch/pytorch/runs/8233369438?check_suite_focus=true cc'ing @jeffdaily in case you want to take a look. Meanwhile, should we disable the tests so we can merge, and fix them later once I resolve the macOS build issues? @davidberard98
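For context, the check such an assert typically wraps looks like the following. This is a minimal sketch against the public NVRTC API, not the exact nvfuser source; the point is that the real diagnostics live in the program log, not in the assert itself:

```cpp
// Minimal NVRTC compile-and-check sketch (illustrative; build with -lnvrtc).
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>
#include <string>

void compileOrAbort(const char* cuda_src) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, cuda_src, "fusion.cu",
                     /*numHeaders=*/0, nullptr, nullptr);

  const char* opts[] = {"--std=c++14"};
  nvrtcResult status = nvrtcCompileProgram(prog, 1, opts);

  if (status != NVRTC_SUCCESS) {
    // The program log carries the actual compiler error text.
    size_t log_size = 0;
    nvrtcGetProgramLogSize(prog, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    std::fprintf(stderr, "nvrtcCompileProgram failed:\n%s\n", log.c_str());
    std::abort();  // roughly where the CI assert fires
  }
  nvrtcDestroyProgram(&prog);
}

int main() { compileOrAbort("__global__ void k() {}"); }
```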
wait... build CI on this PR actually passed. 🎉 There's some cxx11-abi test that's failing, but it looks like just a permission issue, maybe it's my account?! https://github.com/pytorch/pytorch/runs/8233758234?check_suite_focus=true
I'm guessing the cxx11-abi failure is an infra failure; it looks like most PRs on master are also failing https://hud.pytorch.org/. The sanity-checks job will be skipped once we rebase this PR. But we do need to fix or disable the ROCm issues; I'm fine with disabling as long as @jeffdaily agrees.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
There are a number of failed tests, but they all seem to originate from the same block of generated code:
But reduceGroup here is not being partially specialized... it's just an overload, and we do have all template arguments specified. cc'ing @naoyam in case I was wrong about overloaded template functions.
I'm not aware of any code that results in partial specialization of function templates. Where can I see the generated code?
Thanks for confirming this.
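For readers following the template discussion: C++ does not allow partial specialization of function templates at all (only class templates can be partially specialized), so a call with all template arguments specified can only be resolving an overload. A hedged sketch; names and signatures are illustrative, not nvfuser's actual reduceGroup:

```cpp
#include <cstddef>

// Two overloaded function templates: legal C++, and the situation the
// generated code above is in once all template arguments are specified.
template <int BDIMX, typename T>
void reduceGroup(T* out, const T* in, std::size_t n) { /* ... */ }

template <int BDIMX, typename T>
void reduceGroup(T* out, const T* in, std::size_t n, T init) { /* ... */ }

// By contrast, a *partial specialization* of a function template is
// ill-formed C++ and rejected by every compiler, nvrtc included:
//
//   template <typename T>
//   void reduceGroup<128, T>(T* out, const T* in, std::size_t n);

int main() {
  float out = 0.f, in = 1.f;
  reduceGroup<128>(&out, &in, 1);       // resolves to the first overload
  reduceGroup<128>(&out, &in, 1, 0.f);  // resolves to the second overload
}
```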
You don't need AMD hardware to reproduce the build errors: docker pull rocm/dev-ubuntu-20.04:5.2 (GitHub doesn't support uploading *.cu files, so a *.txt extension was added to the file). Inside docker:
@naoyam the code is attached.
@pytorchbot rebase -s |
@pytorchbot successfully started a rebase job. Check the current status here |
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:
- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

[ghstack-poisoned]
Successfully rebased |
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Rebased and squashed commits from nvfuser upstream push 0907. RUN_TORCHBENCH: nvfuser ghstack-source-id: dc174be Pull Request resolved: #84626
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84626
Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7bc72f5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I don't think we have bandwidth to spare on ROCm failures for our upstream PRs at the moment. This is not something we promised to support in nvfuser.
@jjsjann123 sorry, can we get one more rebase? Seems like the test runner for the internal tests failed the last few times, not sure why.
@davidberard98 any specifics on the internal test failures? ROCm-specific or not?
Do we also have internal ROCm-specific tests? Asking since I don't have visibility into that, and it looks like the public ROCm CIs are green here.
The tests are just flaky from what I can tell. I'm not familiar with the ROCm tests internally.
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes and devel-branch commits as described above. RUN_TORCHBENCH: nvfuser ghstack-source-id: 34c0b92 Pull Request resolved: #84626
@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Daily bump for an update, since I'm getting nervous as we approach the deadline. 🙇
@malfet, @davidberard98 is there anything holding this PR back internally from being merged?
@pytorchbot merge |
@pytorchbot successfully started a merge job. Check the current status here.
@jjsjann123 @IvanYashchuk I am merging manually because the internal diff has already been merged. In general we still need to merge via the internal workflow, but in this case I am merging via GitHub just because the bot hasn't done it for some reason.
Hey @jjsjann123. |
…g during tests. (#85319) Fixes issue Jie found in his PR: #84626 (comment) Pull Request resolved: #85319 Approved by: https://github.com/jjsjann123
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes and devel-branch commits as described above. RUN_TORCHBENCH: nvfuser Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552) Pull Request resolved: #84626 Approved by: https://github.com/malfet
Cherry-picking upstream build failure patches from PR pytorch#84626. Changes include:
1. added throw in stringify
2. split fused_reduction.cu as its size exceeds the limit in MSVC
3. update bzl build for runtime header
4. fix a bug originally reported in pytorch/pytorch#84626
5. Meta internal build fix
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Stack from ghstack (oldest at bottom):
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Codegen changes include:
codegen improvement:
i. improved view support on pointwise and transpose scheduler
ii. grouped grid welford added for better outer-norm grid persistence in normalization (see the Welford sketch after this list)
misc:
i. new composite ops added: variance_mean, arange
ii. fixes misaligned address for transpose scheduler
iii. refactor on separation of compilation API from execution API to prepare us for async compilation (see the compile/run sketch after this list)
iv. double type support on expression evaluator
v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN
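A hedged illustration of the Welford update that both the grouped grid welford kernel and the new variance_mean op build on: this is the serial single-pass building block that the grid version parallelizes, not nvfuser's actual generated code.

```cpp
#include <cstdio>
#include <vector>

// Running (count, mean, m2) triplet; m2 is the sum of squared
// deviations from the current running mean.
struct WelfordTriplet {
  long long n = 0;
  double mean = 0.0;
  double m2 = 0.0;
};

// One Welford step: numerically stable single-pass mean/variance.
void welfordUpdate(WelfordTriplet& w, double x) {
  w.n += 1;
  double delta = x - w.mean;
  w.mean += delta / w.n;
  w.m2 += delta * (x - w.mean);  // second factor uses the updated mean
}

int main() {
  WelfordTriplet w;
  for (double x : std::vector<double>{1, 2, 3, 4}) welfordUpdate(w, x);
  // variance_mean-style result: unbiased variance and mean in one pass.
  std::printf("mean=%f var=%f\n", w.mean, w.m2 / (w.n - 1));  // 2.5, 1.667
}
```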
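And a sketch of the compilation/execution separation from item iii, assuming the point of the refactor is to make compilation a standalone step that can later move to a worker thread; class and method names here are illustrative, not nvfuser's actual API.

```cpp
#include <iostream>
#include <string>

// Stand-in for the module/function handle produced by nvrtc.
struct CompiledKernel {
  std::string ptx;
};

class Executor {
 public:
  // Step 1: compile only. No execution state is touched, so this call
  // could be dispatched asynchronously in a later change.
  CompiledKernel compile(const std::string& cuda_src) {
    return CompiledKernel{"<ptx for: " + cuda_src + ">"};
  }

  // Step 2: launch an already-compiled kernel; compile once, run many.
  void run(const CompiledKernel& kernel) {
    std::cout << "launching " << kernel.ptx << "\n";
  }
};

int main() {
  Executor ex;
  CompiledKernel k = ex.compile("__global__ void f() {}");
  ex.run(k);
  ex.run(k);  // reuse without recompiling
}
```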
Commits that are in this PR from the devel branch:
RUN_TORCHBENCH: nvfuser
Differential Revision: D39324552