Convert Embedding to new style.#1916
Conversation
Any particular reason

Isn't embedding in the functional interface just an indexing operation?
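
To make the question above concrete, here is a minimal framework-free sketch of an embedding lookup written as plain row indexing. It is not PyTorch's implementation; `embedding_lookup`, `weight`, and `indices` are hypothetical names chosen for illustration, and nested lists stand in for tensors.

```python
def embedding_lookup(weight, indices):
    """Return a copy of weight[i] for every i in indices (plain row indexing)."""
    return [list(weight[i]) for i in indices]


# Hypothetical 4-word vocabulary with 2-dimensional embeddings.
weight = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
indices = [2, 0, 2]

rows = embedding_lookup(weight, indices)
print(rows)  # [[2.0, 2.1], [0.0, 0.1], [2.0, 2.1]]
```

Under this view the forward pass adds nothing beyond indexing; the differences discussed in this thread show up only in how the gradient is represented.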
7ada9be to
4225d38
Compare

@soumith a bit unrelated: the gradients of indexing operators are dense, while in

@fmassa it would be great to have that, but we'd need an automatic conversion in case that grad would be used in a function not implemented for sparse tensors. Most backward functions would break if you gave them sparse grad outputs.
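
The trade-off in the two comments above can be sketched without any framework. The dense gradient of an indexing operation allocates a full table of zeros, while a sparse gradient stores only the rows that were touched; a consumer that understands only dense gradients then needs an explicit conversion. All function names here (`dense_grad`, `sparse_grad`, `to_dense`) are hypothetical and do not correspond to any PyTorch API.

```python
def dense_grad(num_rows, dim, indices, grad_output):
    """Scatter-add grad_output rows into a full (num_rows x dim) table."""
    grad = [[0.0] * dim for _ in range(num_rows)]
    for idx, g in zip(indices, grad_output):
        for j in range(dim):
            grad[idx][j] += g[j]
    return grad


def sparse_grad(indices, grad_output):
    """Keep only touched rows, as a {row index: accumulated gradient} dict."""
    rows = {}
    for idx, g in zip(indices, grad_output):
        acc = rows.setdefault(idx, [0.0] * len(g))
        for j, v in enumerate(g):
            acc[j] += v
    return rows


def to_dense(sparse, num_rows, dim):
    """Fallback conversion for consumers that only handle dense gradients."""
    grad = [[0.0] * dim for _ in range(num_rows)]
    for idx, row in sparse.items():
        grad[idx] = list(row)
    return grad


indices = [2, 0, 2]
grad_output = [[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]

dense = dense_grad(4, 2, indices, grad_output)
sparse = sparse_grad(indices, grad_output)
print(len(sparse), "of", len(dense), "rows stored in the sparse form")
```

In this toy case the sparse form stores two rows instead of four; with a large vocabulary and a small batch the gap grows accordingly, which is why an automatic sparse-to-dense conversion would be needed before sparse gradients could flow into arbitrary backward functions.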

@apaszke I think whenever we add support for implicit type conversion between functions (do we plan to add that?), we could extend it to sparse-dense tensors as well?

@fmassa can you elaborate? I'm not sure what you mean.

By the way, somewhat related to @jekbradbury's comment: I've created a PR for Embedding here: #1987
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
4225d38 to
49f790a
Compare
…37632c Summary: Previous import was 971311db58f2fa8306d15e1458b5fd47dbc8d11c Included changes: - **[7a112a6f](onnx/onnx@7a112a6f)**: add quantization ops in onnx (pytorch#1908) <Ashwini Khade> - **[6de42d7d](onnx/onnx@6de42d7d)**: Create working-groups.md (pytorch#1916) <Prasanth Pulavarthi> Differential Revision: D14855745 fbshipit-source-id: a2c9f32930b80d70feaff721eefe40b18fd98af1
…4f4ab5 Summary: Previous import was 971311db58f2fa8306d15e1458b5fd47dbc8d11c Included changes: - **[3717dc61](onnx/onnx@3717dc61)**: Shape Inference Tests for QOps (pytorch#1929) <Ashwini Khade> - **[a80c3371](onnx/onnx@a80c3371)**: Prevent unused variables from generating warnings across all platforms. (pytorch#1930) <Pranav Sharma> - **[be9255c1](onnx/onnx@be9255c1)**: add title (pytorch#1919) <Prasanth Pulavarthi> - **[7a112a6f](onnx/onnx@7a112a6f)**: add quantization ops in onnx (pytorch#1908) <Ashwini Khade> - **[6de42d7d](onnx/onnx@6de42d7d)**: Create working-groups.md (pytorch#1916) <Prasanth Pulavarthi> Differential Revision: D14937172 fbshipit-source-id: 0bf94ae699026a20e2a367738a80cd0c094c7d75
…ee80ab (pytorch#19339) Summary: Pull Request resolved: pytorch#19339 Previous import was 971311db58f2fa8306d15e1458b5fd47dbc8d11c Included changes: - **[ad731347](onnx/onnx@ad731347)**: Fix shape inference for matmul (pytorch#1941) <Bowen Bao> - **[3717dc61](onnx/onnx@3717dc61)**: Shape Inference Tests for QOps (pytorch#1929) <Ashwini Khade> - **[a80c3371](onnx/onnx@a80c3371)**: Prevent unused variables from generating warnings across all platforms. (pytorch#1930) <Pranav Sharma> - **[be9255c1](onnx/onnx@be9255c1)**: add title (pytorch#1919) <Prasanth Pulavarthi> - **[7a112a6f](onnx/onnx@7a112a6f)**: add quantization ops in onnx (pytorch#1908) <Ashwini Khade> - **[6de42d7d](onnx/onnx@6de42d7d)**: Create working-groups.md (pytorch#1916) <Prasanth Pulavarthi> Differential Revision: D14969962 fbshipit-source-id: e752f9ab6aebb341d16e22287793337698365b85
…ee80ab (#19339) Summary: Pull Request resolved: #19339 Previous import was 971311db58f2fa8306d15e1458b5fd47dbc8d11c Included changes: - **[ad731347](onnx/onnx@ad731347)**: Fix shape inference for matmul (#1941) <Bowen Bao> - **[3717dc61](onnx/onnx@3717dc61)**: Shape Inference Tests for QOps (#1929) <Ashwini Khade> - **[a80c3371](onnx/onnx@a80c3371)**: Prevent unused variables from generating warnings across all platforms. (#1930) <Pranav Sharma> - **[be9255c1](onnx/onnx@be9255c1)**: add title (#1919) <Prasanth Pulavarthi> - **[7a112a6f](onnx/onnx@7a112a6f)**: add quantization ops in onnx (#1908) <Ashwini Khade> - **[6de42d7d](onnx/onnx@6de42d7d)**: Create working-groups.md (#1916) <Prasanth Pulavarthi> Reviewed By: yinghai Differential Revision: D14969962 fbshipit-source-id: 5ec64ef7aee5161666ed0c03e201be0ae20826f9
…ee80ab (pytorch#19339) Summary: Pull Request resolved: pytorch#19339 Previous import was 971311db58f2fa8306d15e1458b5fd47dbc8d11c Included changes: - **[ad731347](onnx/onnx@ad731347)**: Fix shape inference for matmul (pytorch#1941) <Bowen Bao> - **[3717dc61](onnx/onnx@3717dc61)**: Shape Inference Tests for QOps (pytorch#1929) <Ashwini Khade> - **[a80c3371](onnx/onnx@a80c3371)**: Prevent unused variables from generating warnings across all platforms. (pytorch#1930) <Pranav Sharma> - **[be9255c1](onnx/onnx@be9255c1)**: add title (pytorch#1919) <Prasanth Pulavarthi> - **[7a112a6f](onnx/onnx@7a112a6f)**: add quantization ops in onnx (pytorch#1908) <Ashwini Khade> - **[6de42d7d](onnx/onnx@6de42d7d)**: Create working-groups.md (pytorch#1916) <Prasanth Pulavarthi> Reviewed By: yinghai Differential Revision: D14969962 fbshipit-source-id: 5ec64ef7aee5161666ed0c03e201be0ae20826f9
Also cleaned up the relevant code path a little
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

[ghstack-poisoned]
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

ghstack-source-id: d0d88cf

Pull Request resolved: #84626
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes include: - codegen improvement: i. improved view support on pointwise and transpose scheduler ii. grouped grid welford added for better outer-norm grid persistence in normalization - misc: i. new composite ops added: variance_mean , arange, ii. fixes misaligned address for transpose scheduler iii. refactor on separation of compilation API from execution API to prepare us for async compilation iv. double type support on expression evaluator v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN Commits that's in this PR from the devel branch: ``` 89330aa Tensor factories must set the output shape as its input (#1939) b2fd01e arange support (#1933) 56c00fd Double support on all expression evaluators (#1937) 371f282 Improve trivial reduction merge support (#1931) 1d0c267 Test `rand` in a fusion with zero tensor input (#1932) 0dab160 Fix softmax bwd sizes. (#1890) ef98f36 Fix a bug (#1936) 63132a0 Propagate permissive mapping information into indexing pass (#1929) b4ac2c8 Map IterationDomains through view operations. 
(#1919) c0a187a do not use deprecated functions (#1935) 88de85e Upstream cherry pick fixes 0811 (#1934) b247dcf Separate kernel compilation API from kernel execution API (#1914) b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924) 14a53e6 Nullary RNGOp (#1892) 3c3c89e Misc fixes/tuning for transpose scheduler (#1912) 20cf109 Grouped grid welford (#1921) 6cf7eb0 Transpose scheduler small dim sizes better support (#1910) 9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922) 057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918) 3fb3d80 Add variance_mean function using Welford (#1907) 98febf6 Remove DisableOption::UnrollWithRng (#1913) ee8ef33 Minor fix for the debug interface of using PTX directly (#1917) 6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916) 5eefa9a dopt is only available since nvrtc 11.7 (#1915) 2ec8fc7 Kill computeAtBetween (#1911) d0d106a Improve view support on pointwise and transpose scheduler (#1906) e71e1ec Fix name clash of RNG with shared memory (#1904) 3381793 Fix mutator and sameAs for expanded IterDomain (#1902) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552) [ghstack-poisoned]
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/ Codegen changes include: - codegen improvement: i. improved view support on pointwise and transpose scheduler ii. grouped grid welford added for better outer-norm grid persistence in normalization - misc: i. new composite ops added: variance_mean , arange, ii. fixes misaligned address for transpose scheduler iii. refactor on separation of compilation API from execution API to prepare us for async compilation iv. double type support on expression evaluator v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN Commits that's in this PR from the devel branch: ``` 89330aa Tensor factories must set the output shape as its input (#1939) b2fd01e arange support (#1933) 56c00fd Double support on all expression evaluators (#1937) 371f282 Improve trivial reduction merge support (#1931) 1d0c267 Test `rand` in a fusion with zero tensor input (#1932) 0dab160 Fix softmax bwd sizes. (#1890) ef98f36 Fix a bug (#1936) 63132a0 Propagate permissive mapping information into indexing pass (#1929) b4ac2c8 Map IterationDomains through view operations. 
(#1919) c0a187a do not use deprecated functions (#1935) 88de85e Upstream cherry pick fixes 0811 (#1934) b247dcf Separate kernel compilation API from kernel execution API (#1914) b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924) 14a53e6 Nullary RNGOp (#1892) 3c3c89e Misc fixes/tuning for transpose scheduler (#1912) 20cf109 Grouped grid welford (#1921) 6cf7eb0 Transpose scheduler small dim sizes better support (#1910) 9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922) 057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918) 3fb3d80 Add variance_mean function using Welford (#1907) 98febf6 Remove DisableOption::UnrollWithRng (#1913) ee8ef33 Minor fix for the debug interface of using PTX directly (#1917) 6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916) 5eefa9a dopt is only available since nvrtc 11.7 (#1915) 2ec8fc7 Kill computeAtBetween (#1911) d0d106a Improve view support on pointwise and transpose scheduler (#1906) e71e1ec Fix name clash of RNG with shared memory (#1904) 3381793 Fix mutator and sameAs for expanded IterDomain (#1902) ``` RUN_TORCHBENCH: nvfuser Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552) [ghstack-poisoned]
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

[ghstack-poisoned]
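The commit list above includes "Add variance_mean function using Welford (#1907)". For readers unfamiliar with the technique, here is a minimal single-pass sketch of Welford's algorithm in plain Python. It is not the nvfuser implementation; the function name `welford_variance_mean` and the `unbiased` parameter are illustrative assumptions, and the (variance, mean) return order is chosen only to mirror the op's name.

```python
import math
from typing import Iterable, Tuple


def welford_variance_mean(values: Iterable[float], unbiased: bool = True) -> Tuple[float, float]:
    """Return (variance, mean) computed in one pass with Welford's update.

    Hypothetical helper for illustration only; not taken from nvfuser.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean

    for x in values:
        count += 1
        delta = x - mean           # deviation from the old mean
        mean += delta / count      # update the running mean
        m2 += delta * (x - mean)   # uses both old and new mean

    if count == 0:
        raise ValueError("welford_variance_mean requires at least one value")

    divisor = count - 1 if unbiased else count
    variance = m2 / divisor if divisor > 0 else math.nan
    return variance, mean
```

Calling `welford_variance_mean([1.0, 2.0, 3.0, 4.0])` yields a mean of 2.5 and an unbiased variance of 5/3. The single-pass update avoids storing the inputs and is numerically more stable than subtracting the squared mean from the mean of squares.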
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

- codegen improvement:
  i. improved view support on pointwise and transpose scheduler
  ii. grouped grid welford added for better outer-norm grid persistence in normalization
- misc:
  i. new composite ops added: variance_mean, arange
  ii. fixes misaligned address for transpose scheduler
  iii. refactor on separation of compilation API from execution API to prepare us for async compilation
  iv. double type support on expression evaluator
  v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN

Commits that are in this PR from the devel branch:

```
89330aa Tensor factories must set the output shape as its input (#1939)
b2fd01e arange support (#1933)
56c00fd Double support on all expression evaluators (#1937)
371f282 Improve trivial reduction merge support (#1931)
1d0c267 Test `rand` in a fusion with zero tensor input (#1932)
0dab160 Fix softmax bwd sizes. (#1890)
ef98f36 Fix a bug (#1936)
63132a0 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c8 Map IterationDomains through view operations. (#1919)
c0a187a do not use deprecated functions (#1935)
88de85e Upstream cherry pick fixes 0811 (#1934)
b247dcf Separate kernel compilation API from kernel execution API (#1914)
b34e3b9 Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6 Nullary RNGOp (#1892)
3c3c89e Misc fixes/tuning for transpose scheduler (#1912)
20cf109 Grouped grid welford (#1921)
6cf7eb0 Transpose scheduler small dim sizes better support (#1910)
9341ea9 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f Fix CUDA driver error: misaligned address for transpose scheduler (#1918)
3fb3d80 Add variance_mean function using Welford (#1907)
98febf6 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953 Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a dopt is only available since nvrtc 11.7 (#1915)
2ec8fc7 Kill computeAtBetween (#1911)
d0d106a Improve view support on pointwise and transpose scheduler (#1906)
e71e1ec Fix name clash of RNG with shared memory (#1904)
3381793 Fix mutator and sameAs for expanded IterDomain (#1902)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)

Pull Request resolved: #84626

Approved by: https://github.com/malfet
Signed-off-by: Edward Z. Yang <ezyang@fb.com>