
[NVFuser] Upstream push 0907 #84626

Closed
jjsjann123 wants to merge 8 commits into gh/jjsjann123/4/base from gh/jjsjann123/4/head

Conversation

@jjsjann123
Collaborator

jjsjann123 commented Sep 7, 2022

Stack from ghstack (oldest at bottom):

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

  • codegen improvements:
    i. improved view support in the pointwise and transpose schedulers
    ii. grouped grid Welford added for better outer-norm grid persistence in normalization (see the Welford sketch after this list)

  • misc:
    i. new composite ops added: variance_mean, arange
    ii. fixed a misaligned address in the transpose scheduler
    iii. separated the compilation API from the execution API to prepare for async compilation
    iv. double type support in the expression evaluator
    v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN
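
Both the grouped grid Welford kernel and the new variance_mean composite op build on Welford's single-pass update, which tracks a running count, mean, and M2 (sum of squared deviations) so that mean and variance come out of one pass over the data. A minimal host-side C++ sketch of the math only, not nvfuser's generated code:

```
#include <cstdio>
#include <initializer_list>

// Welford running state: count, mean, and M2 (sum of squared deviations).
struct WelfordState {
  long long count = 0;
  double mean = 0.0;
  double m2 = 0.0;

  void update(double x) {
    ++count;
    const double delta = x - mean;
    mean += delta / static_cast<double>(count);
    m2 += delta * (x - mean);  // uses the freshly updated mean
  }
};

int main() {
  WelfordState w;
  for (double x : {1.0, 2.0, 3.0, 4.0}) w.update(x);
  // variance_mean-style outputs in a single pass: unbiased variance and mean
  std::printf("var = %f, mean = %f\n",
              w.m2 / static_cast<double>(w.count - 1), w.mean);
  return 0;
}
```

A grid-parallel version additionally merges partial (count, mean, M2) states across threads and blocks; the per-element update stays the same.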

Commits in this PR from the devel branch:

89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)

RUN_TORCHBENCH: nvfuser

Differential Revision: D39324552

pytorch-bot added the release notes: jit label Sep 7, 2022
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 7, 2022

🔗 Helpful links

❌ 3 New Failures, 11 Pending

As of commit d9845a9 (more details on the Dr. CI page):

  • 3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build Lint / pr-sanity-checks (1/2)

Step: "PR size check" (full log | diagnosis details)

2022-09-07T09:55:18.9046776Z ##[error]Process completed with exit code 1.
2022-09-07T09:55:18.9033369Z + echo 'please contact @albanD or @seemethere.'
2022-09-07T09:55:18.9033580Z + echo
2022-09-07T09:55:18.9033728Z + false
2022-09-07T09:55:18.9033861Z 
2022-09-07T09:55:18.9034025Z Your PR is 11536 LOC which is more than the 2000 maximum
2022-09-07T09:55:18.9034311Z allowed within PyTorch infra. PLease make sure to split up
2022-09-07T09:55:18.9034918Z your PR into smaller pieces that can be reviewed.
2022-09-07T09:55:18.9035209Z If you think that this rule should not apply to your PR,
2022-09-07T09:55:18.9035462Z please contact @albanD or @seemethere.
2022-09-07T09:55:18.9035628Z 
2022-09-07T09:55:18.9046776Z ##[error]Process completed with exit code 1.
2022-09-07T09:55:18.9095497Z Post job cleanup.
2022-09-07T09:55:18.9127531Z Post job cleanup.
2022-09-07T09:55:19.0202708Z [command]/usr/bin/git version
2022-09-07T09:55:19.0253229Z git version 2.37.3
2022-09-07T09:55:19.0297595Z Temporarily overriding HOME='/home/runner/work/_temp/2de4cf5d-81bf-4e90-aea1-87debfc8fc92' before making global git config changes
2022-09-07T09:55:19.0298373Z Adding repository directory to the temporary git global config as a safe directory
2022-09-07T09:55:19.0304686Z [command]/usr/bin/git config --global --add safe.directory /home/runner/work/pytorch/pytorch
2022-09-07T09:55:19.0347043Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2022-09-07T09:55:19.0384686Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2022-09-07T09:55:19.0620053Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader

See GitHub Actions build linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-test / build (2/2)

Step: "Download Build Artifacts" (full log | diagnosis details)

2022-09-07T17:21:14.0066970Z ##[error]An error ...torch/pytorch/pytorch/'. No such file or directory
2022-09-07T17:21:14.0024264Z   ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine
2022-09-07T17:21:14.0024537Z   ANACONDA_USER: pytorch
2022-09-07T17:21:14.0024732Z   AWS_DEFAULT_REGION: us-east-1
2022-09-07T17:21:14.0024915Z   BINARY_ENV_FILE: /tmp/env
2022-09-07T17:21:14.0025153Z   BUILD_ENVIRONMENT: linux-binary-libtorch-cxx11-abi
2022-09-07T17:21:14.0025508Z   GITHUB_TOKEN: ***
2022-09-07T17:21:14.0025668Z   PR_NUMBER: 
2022-09-07T17:21:14.0025858Z   PYTORCH_FINAL_PACKAGE_DIR: /artifacts
2022-09-07T17:21:14.0026084Z   SHA1: d9845a9f809b51e5762e336f1f18458dc358852a
2022-09-07T17:21:14.0026271Z ##[endgroup]
2022-09-07T17:21:14.0066970Z ##[error]An error occurred trying to start process '/usr/bin/bash' with working directory '/home/ec2-user/actions-runner/_work/pytorch/pytorch/pytorch/'. No such file or directory
2022-09-07T17:21:14.0087984Z ##[group]Run # Ensure the working directory gets chowned back to the current user
2022-09-07T17:21:14.0088304Z �[36;1m# Ensure the working directory gets chowned back to the current user�[0m
2022-09-07T17:21:14.0088625Z �[36;1mdocker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .�[0m
2022-09-07T17:21:14.0098589Z shell: /usr/bin/bash -e {0}
2022-09-07T17:21:14.0098755Z env:
2022-09-07T17:21:14.0098923Z   PYTORCH_ROOT: /pytorch
2022-09-07T17:21:14.0099112Z   BUILDER_ROOT: /builder
2022-09-07T17:21:14.0099286Z   PACKAGE_TYPE: libtorch
2022-09-07T17:21:14.0099464Z   DESIRED_CUDA: cpu
2022-09-07T17:21:14.0099639Z   GPU_ARCH_VERSION: 

🕵️‍♀️ 1 failure not recognized by patterns:

The following CI failures may be due to changes from the PR
Job: GitHub Actions linux-focal-rocm5.2-py3.7 / test (default, 1, 2, linux.rocm.gpu)
Step: Unknown

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


jjsjann123 added a commit that referenced this pull request Sep 7, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Rebase and squashed commits from nvfuser upstream push 0907

RUN_TORCHBENCH: nvfuser

ghstack-source-id: 52040e7
Pull Request resolved: #84626
facebook-github-bot added the oncall: jit label Sep 7, 2022
jjsjann123 added the ciflow/trunk label Sep 7, 2022
@jjsjann123
Collaborator Author

😢 forgot to add trunk label last night....

@jjsjann123
Collaborator Author

Errr, I'm seeing real build errors on macOS and ROCm test failures from the smoke test, which has the trunk label: #84240. Starting to dig there...

@jjsjann123
Collaborator Author

jjsjann123 commented Sep 7, 2022

The ROCm failure is not very informative for debugging. We are hitting driver error 200 (CUDA_ERROR_INVALID_IMAGE), see below:

2022-09-07T09:45:48.6209995Z RuntimeError: The following operation failed in the TorchScript interpreter.
2022-09-07T09:45:48.6210996Z Traceback of TorchScript (most recent call last):
2022-09-07T09:45:48.6211823Z RuntimeError: CUDA driver error: 200

and the assert seen in the log is just checking that nvrtcCompileProgram succeeded:
2022-09-07T09:45:49.8770329Z RuntimeError: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/codegen/cuda/executor_utils.cpp":1180, please report a bug to PyTorch. #pragma clang force_cuda_host_device begin

Link to the ROCm failure: https://github.com/pytorch/pytorch/runs/8233369438?check_suite_focus=true

cc'ing @jeffdaily in case you want to take a look. Meanwhile, should we disable the tests so we can merge later, once I resolve the macOS build issues? @davidberard98

@jjsjann123
Collaborator Author

wait... build CI on this PR actually passed. 🎉

There's a cxx11-abi test that's failing, but it looks like just a permission issue; maybe it's my account?! https://github.com/pytorch/pytorch/runs/8233758234?check_suite_focus=true

@davidberard98
Contributor

I'm guessing the cxx11-abi failure is an infra failure; it looks like most PRs on master are also failing (https://hud.pytorch.org/). The sanity-checks job will be skipped once we rebase this PR. But we do need to fix or disable the ROCm issues; I'm fine with disabling as long as @jeffdaily agrees.

@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jeffdaily
Collaborator

There are a number of failed tests, but they all seem to originate from the same block of generated code:

/tmp/comgr-1caa3b/input/CompileSource:5601:5: error: function template partial specialization is not allowed
    reduce<Func, Types...>(
    ^     ~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6018:5: error: function template partial specialization is not allowed
    reduce<Func, Types...>(
    ^     ~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6075:5: error: function template partial specialization is not allowed
    reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
    ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6260:5: error: function template partial specialization is not allowed
    reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
    ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6320:5: error: function template partial specialization is not allowed
    reduceGroupBlock<BLOCK_BROADCAST, DataTypes..., Funcs..., BoolTypes...>(
    ^               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6397:5: error: function template partial specialization is not allowed
    reduceGroupLastBlock<DataTypes..., Funcs..., BoolTypes...>(
    ^                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6757:5: error: function template partial specialization is not allowed
    welfordGroup<NumArgs, DataType, IndexType>(
    ^           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6958:5: error: function template partial specialization is not allowed
    welfordGroupBlock<BLOCK_BROADCAST, NumVals, DataType, IndexType>(
    ^                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:7027:5: error: function template partial specialization is not allowed
    welfordGroupLastBlock<NumVals, DataType, IndexType>(
    ^                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@jjsjann123
Collaborator Author

/tmp/comgr-1caa3b/input/CompileSource:6075:5: error: function template partial specialization is not allowed
reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6260:5: error: function template partial specialization is not allowed
reduceGroup<DataTypes..., Funcs..., BoolTypes...>(

But reduceGroup here is not being partially specialized... it's just an overload, and we do have all template arguments specified (a minimal sketch of the pattern follows below). This sounds like a ROCm compiler support issue; we can't back out CUDA support because ROCm is broken. Any chance we have a quick WAR for this, @jeffdaily?

cc'ing @naoyam in case I was wrong about the overloaded template functions.
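
To make the distinction concrete, here is a minimal, self-contained sketch of the pattern in question (hypothetical names echoing the generated kernels, not nvfuser's actual runtime code): overloaded variadic function templates called with explicit template-argument lists formed by expanding packs from the enclosing template. This is ordinary overload resolution, not partial specialization, and standard C++ compilers accept it:

```
#include <cstdio>

// Two overloads of a variadic function template (hypothetical names).
template <bool BLOCK_BROADCAST, typename... DataTypes>
void reduceGroup(DataTypes... vals) {
  std::printf("block variant over %zu values\n", sizeof...(vals));
}

template <typename... DataTypes>
void reduceGroup(DataTypes... vals) {
  std::printf("plain variant over %zu values\n", sizeof...(vals));
}

// Host-side stand-in for the enclosing kernel template.
template <typename... DataTypes>
void fusedKernel(DataTypes... vals) {
  // Packs from the enclosing template are expanded into the explicit
  // template-argument list of each call: a call, not a specialization.
  reduceGroup<true, DataTypes...>(vals...);  // resolves to the bool overload
  reduceGroup<DataTypes...>(vals...);        // resolves to the plain overload
}

int main() {
  fusedKernel(1.0f, 2.0);
  return 0;
}
```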

@naoyam

naoyam commented Sep 7, 2022

I'm not aware of any code that results in partial specialization of function templates. Where can I see generated code?

@jjsjann123
Collaborator Author

I'm not aware of any code that results in partial specialization of function templates. Where can I see generated code?

Thanks for confirming this.
This is on the ROCm stack, and I assume you'll need AMD hardware, which is why we really can't provide much help from the NV side on ROCm failures... We'll see if @jeffdaily is able to provide a quick patch. We can also turn off the ROCm tests and handle the failure in parallel, so we are at least not blocked on this.

@jeffdaily
Collaborator

jeffdaily commented Sep 7, 2022

You don't need AMD hardware for build errors.

docker pull rocm/dev-ubuntu-20.04:5.2

rocm_build_error.cu.txt

(GitHub doesn't support uploading *.cu files, so a *.txt extension was added to the file.)

Inside docker:

hipcc -c rocm_build_error.cu -std=c++17 --amdgpu-target=gfx906

@naoyam the code is attached.

@davidberard98
Contributor

@pytorchbot rebase -s

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/jjsjann123/4/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/84626)

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Rebase and squashed commits from nvfuser upstream push 0907

RUN_TORCHBENCH: nvfuser

ghstack-source-id: dc174be
Pull Request resolved: #84626
@pytorch-bot

pytorch-bot Bot commented Sep 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84626

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7bc72f5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

You don't need AMD hardware for build errors.

docker pull rocm/dev-ubuntu-20.04:5.2

rocm_build_error.cu.txt

(github doesn't support uploading *.cu files, so *.txt extension added to file)

Inside docker:

hipcc -c rocm_build_error.cu -std=c++17 --amdgpu-target=gfx906

@naoyam the code is attached.

I don't think we have bandwidth to spare on ROCm failures for our upstream PRs at the moment; this is not something we promised to support in nvfuser.
I'm suggesting that we disable the test for this PR. We'd be more than happy to review future PRs that patch and re-enable the related tests.

@davidberard98
Contributor

@jjsjann123 sorry, can we get one more rebase?

Seems like the test runner for the internal tests failed the last few times; not sure why.

@jeffdaily
Collaborator

@davidberard98 any specifics for internal test failures? ROCm-specific or no?

@jjsjann123
Collaborator Author

@davidberard98 any specifics for internal test failures? ROCm-specific or no?

Do we also have internal ROCm-specific tests? Asking since I don't have visibility into that, and it looks like the public ROCm CIs are green here.

@davidberard98
Contributor

Tests are just flaky, from what I can tell. I'm not familiar with ROCm tests internally.

jjsjann123 added a commit that referenced this pull request Sep 21, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

ghstack-source-id: 34c0b92
Pull Request resolved: #84626
@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

Daily bump for an update, since I'm getting nervous as we approach the deadline. 🙇

@IvanYashchuk
Collaborator

IvanYashchuk commented Sep 23, 2022

@malfet, @davidberard98: is there anything holding this PR back from being merged internally?

@davidberard98
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@davidberard98
Contributor

@jjsjann123 @IvanYashchuk I am merging manually because the internal diff has already been merged. In general we still need to merge via the internal workflow, but in this case I am merging via GitHub just because the bot hasn't done it for some reason.

@github-actions
Contributor

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

jjsjann123 deleted the gh/jjsjann123/4/head branch September 23, 2022 20:31
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
…g during tests. (#85319)

Fixes issue Jie found in his PR:

#84626 (comment)
Pull Request resolved: #85319
Approved by: https://github.com/jjsjann123
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: #84626
Approved by: https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch/pytorch#84626
Approved by: https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Cherry-picking upstream build failure patches from PR pytorch#84626

Changes include:
1. added throw in stringify
2. Split fused_reduction.cu as its size exceeds the limit in MSVC
3. update bzl build for runtime header
4. Fix a bug originally reported in pytorch/pytorch#84626
5. Meta internal build fix

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch/pytorch#84626
Approved by: https://github.com/malfet
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet

Labels

ciflow/trunk, cla signed, Merged, oncall: jit, open source, release notes: jit, skip-pr-sanity-checks
