
[NVFuser] Upstream push 0907 #84626

Closed
jjsjann123 wants to merge 8 commits into gh/jjsjann123/4/base from gh/jjsjann123/4/head

Conversation

@jjsjann123
Collaborator

jjsjann123 commented Sep 7, 2022

Stack from ghstack (oldest at bottom):

Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Codegen changes include:

  • codegen improvements:
    i. improved view support in the pointwise and transpose schedulers
    ii. grouped grid Welford added for better outer-norm grid persistence in normalization (see the Welford sketch after this list)

  • misc:
    i. new composite ops added: variance_mean, arange
    ii. fixed a misaligned address in the transpose scheduler
    iii. separated the compilation API from the execution API to prepare for async compilation
    iv. double type support in the expression evaluator
    v. PYTORCH_NVFUSER_DUMP refactor to save PTX and CUBIN
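
Both the grouped grid Welford kernel and the new variance_mean composite op build on Welford's single-pass update, which tracks a running count, mean, and M2 (sum of squared deviations) so that mean and variance come out of one pass over the data. A minimal host-side C++ sketch of the math only, not nvfuser's generated code:

```
#include <cstdio>
#include <initializer_list>

// Welford running state: count, mean, and M2 (sum of squared deviations).
struct WelfordState {
  long long count = 0;
  double mean = 0.0;
  double m2 = 0.0;

  void update(double x) {
    ++count;
    const double delta = x - mean;
    mean += delta / static_cast<double>(count);
    m2 += delta * (x - mean);  // uses the freshly updated mean
  }
};

int main() {
  WelfordState w;
  for (double x : {1.0, 2.0, 3.0, 4.0}) w.update(x);
  // variance_mean-style outputs in a single pass: unbiased variance and mean
  std::printf("var = %f, mean = %f\n",
              w.m2 / static_cast<double>(w.count - 1), w.mean);
  return 0;
}
```

A grid-parallel version additionally merges partial (count, mean, M2) states across threads and blocks; the per-element update stays the same.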

Commits in this PR from the devel branch:

89330aa23aa804340b2406ab58899d816e3dc3d2 Tensor factories must set the output shape as its input (#1939)
b2fd01ea9346712c6d6f623ca6addbc4888d008e arange support (#1933)
56c00fd3922dad7dfc57351ad7d780f0f2f8e4ed Double support on all expression evaluators (#1937)
371f28223e57fe3f6b5e50a0a45177e6a5c0785c Improve trivial reduction merge support (#1931)
1d0c26790e5647920b40d419d26815bbe310b3a6 Test `rand` in a fusion with zero tensor input (#1932)
0dab160fb2177d178eef3148c6a529e0855009e9 Fix softmax bwd sizes. (#1890)
ef98f360f6d3e3e1cc662ecb65202d88150f128d Fix a bug (#1936)
63132a0c56508c550084b07fb76a3df865102d00 Propagate permissive mapping information into indexing pass (#1929)
b4ac2c88d78078ee4d8b21c4fc51645b5710a282 Map IterationDomains through view operations. (#1919)
c0a187a7619d7cf9dc920294e15461791e8d6d4d do not use deprecated functions (#1935)
88de85e758c5e4afb7b6e746573c0d9a53b4cea7 Upstream cherry pick fixes 0811 (#1934)
b247dcf7c57dc6ac3f7a799b0a6beb7770536a74 Separate kernel compilation API from kernel execution API (#1914)
b34e3b93ee1a8030730c14af3995dd95665af07d Fix `ir_utils::hasBlockSync` + misc fixes in transpose scheduler (#1924)
14a53e6707f43bf760494c238a46386d69830822 Nullary RNGOp (#1892)
3c3c89e638f5172cafb0761f22bacd1fd695eec3 Misc fixes/tuning for transpose scheduler (#1912)
20cf109c8b44d48f61977e35bae94368985144ac Grouped grid welford (#1921)
6cf7eb024c9e53c358cbe56597e117bad56efefd Transpose scheduler small dim sizes better support (#1910)
9341ea9a5bf42f9b14ccad0c94edbc79fc5bb552 Disabled ViewPersistentShmoo sizes that results in NAN (#1922)
057237f66deeea816bb943d802a97c1b7e4414ab Fix CUDA driver error: misaligned address for transpose scheduler  (#1918)
3fb3d80339e4f794767a53eb8fdd61e64cf404a2 Add variance_mean function using Welford (#1907)
98febf6aa3b8c6fe4fdfb2864cda9e5d30089262 Remove DisableOption::UnrollWithRng (#1913)
ee8ef33a5591b534cf587d347af11e48ba7a15d4 Minor fix for the debug interface of using PTX directly (#1917)
6e8f953351f9dabfd1f991d8431cecb6c2ce684d Add PYTORCH_NVFUSER_DUMP options to save PTX and CUBIN (#1916)
5eefa9a72385f6a4b145680a9dcc52d7e8293763 dopt is only available since nvrtc 11.7 (#1915)
2ec8fc711eafc72451eebf0f5e2a98a38bf3f6ef Kill computeAtBetween (#1911)
d0d106a1d9af118d71673173674e875be35d259d Improve view support on pointwise and transpose scheduler (#1906)
e71e1ecefe67219846070590bbed54bbc7416b79 Fix name clash of RNG with shared memory (#1904)
3381793a253689abf224febc73fd3fe2a0dbc921 Fix mutator and sameAs for expanded IterDomain (#1902)

RUN_TORCHBENCH: nvfuser

Differential Revision: D39324552

pytorch-bot added the release notes: jit label Sep 7, 2022
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 7, 2022

🔗 Helpful links

❌ 3 New Failures, 11 Pending

As of commit d9845a9 (more details on the Dr. CI page):

  • 3/3 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build Lint / pr-sanity-checks (1/2)

Step: "PR size check" (full log | diagnosis details)

2022-09-07T09:55:18.9046776Z ##[error]Process completed with exit code 1.
2022-09-07T09:55:18.9033369Z + echo 'please contact @albanD or @seemethere.'
2022-09-07T09:55:18.9033580Z + echo
2022-09-07T09:55:18.9033728Z + false
2022-09-07T09:55:18.9033861Z 
2022-09-07T09:55:18.9034025Z Your PR is 11536 LOC which is more than the 2000 maximum
2022-09-07T09:55:18.9034311Z allowed within PyTorch infra. PLease make sure to split up
2022-09-07T09:55:18.9034918Z your PR into smaller pieces that can be reviewed.
2022-09-07T09:55:18.9035209Z If you think that this rule should not apply to your PR,
2022-09-07T09:55:18.9035462Z please contact @albanD or @seemethere.
2022-09-07T09:55:18.9035628Z 
2022-09-07T09:55:18.9046776Z ##[error]Process completed with exit code 1.
2022-09-07T09:55:18.9095497Z Post job cleanup.
2022-09-07T09:55:18.9127531Z Post job cleanup.
2022-09-07T09:55:19.0202708Z [command]/usr/bin/git version
2022-09-07T09:55:19.0253229Z git version 2.37.3
2022-09-07T09:55:19.0297595Z Temporarily overriding HOME='/home/runner/work/_temp/2de4cf5d-81bf-4e90-aea1-87debfc8fc92' before making global git config changes
2022-09-07T09:55:19.0298373Z Adding repository directory to the temporary git global config as a safe directory
2022-09-07T09:55:19.0304686Z [command]/usr/bin/git config --global --add safe.directory /home/runner/work/pytorch/pytorch
2022-09-07T09:55:19.0347043Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2022-09-07T09:55:19.0384686Z [command]/usr/bin/git submodule foreach --recursive git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :
2022-09-07T09:55:19.0620053Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader

See GitHub Actions build linux-binary-libtorch-cxx11-abi / libtorch-cpu-shared-with-deps-cxx11-abi-test / build (2/2)

Step: "Download Build Artifacts" (full log | diagnosis details)

2022-09-07T17:21:14.0066970Z ##[error]An error ...torch/pytorch/pytorch/'. No such file or directory
2022-09-07T17:21:14.0024264Z   ALPINE_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine
2022-09-07T17:21:14.0024537Z   ANACONDA_USER: pytorch
2022-09-07T17:21:14.0024732Z   AWS_DEFAULT_REGION: us-east-1
2022-09-07T17:21:14.0024915Z   BINARY_ENV_FILE: /tmp/env
2022-09-07T17:21:14.0025153Z   BUILD_ENVIRONMENT: linux-binary-libtorch-cxx11-abi
2022-09-07T17:21:14.0025508Z   GITHUB_TOKEN: ***
2022-09-07T17:21:14.0025668Z   PR_NUMBER: 
2022-09-07T17:21:14.0025858Z   PYTORCH_FINAL_PACKAGE_DIR: /artifacts
2022-09-07T17:21:14.0026084Z   SHA1: d9845a9f809b51e5762e336f1f18458dc358852a
2022-09-07T17:21:14.0026271Z ##[endgroup]
2022-09-07T17:21:14.0066970Z ##[error]An error occurred trying to start process '/usr/bin/bash' with working directory '/home/ec2-user/actions-runner/_work/pytorch/pytorch/pytorch/'. No such file or directory
2022-09-07T17:21:14.0087984Z ##[group]Run # Ensure the working directory gets chowned back to the current user
2022-09-07T17:21:14.0088304Z �[36;1m# Ensure the working directory gets chowned back to the current user�[0m
2022-09-07T17:21:14.0088625Z �[36;1mdocker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .�[0m
2022-09-07T17:21:14.0098589Z shell: /usr/bin/bash -e {0}
2022-09-07T17:21:14.0098755Z env:
2022-09-07T17:21:14.0098923Z   PYTORCH_ROOT: /pytorch
2022-09-07T17:21:14.0099112Z   BUILDER_ROOT: /builder
2022-09-07T17:21:14.0099286Z   PACKAGE_TYPE: libtorch
2022-09-07T17:21:14.0099464Z   DESIRED_CUDA: cpu
2022-09-07T17:21:14.0099639Z   GPU_ARCH_VERSION: 

🕵️‍♀️ 1 failure not recognized by patterns:

The following CI failures may be due to changes from the PR
Job: GitHub Actions linux-focal-rocm5.2-py3.7 / test (default, 1, 2, linux.rocm.gpu)
Step: Unknown

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


jjsjann123 added a commit that referenced this pull request Sep 7, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Rebase and squashed commits from nvfuser upstream push 0907

RUN_TORCHBENCH: nvfuser

ghstack-source-id: 52040e7
Pull Request resolved: #84626
facebook-github-bot added the oncall: jit label Sep 7, 2022
jjsjann123 added the ciflow/trunk label Sep 7, 2022
@jjsjann123
Collaborator Author

😢 forgot to add trunk label last night....

@jjsjann123
Collaborator Author

Errr, I'm seeing real build errors on macOS and ROCm test failures from the smoke test, which has the trunk label: #84240. Starting to dig there...

@jjsjann123
Collaborator Author

jjsjann123 commented Sep 7, 2022

The ROCm failure is not very informative for debugging. We are hitting driver error 200 (CUDA_ERROR_INVALID_IMAGE), see below:

2022-09-07T09:45:48.6209995Z RuntimeError: The following operation failed in the TorchScript interpreter.
2022-09-07T09:45:48.6210996Z Traceback of TorchScript (most recent call last):
2022-09-07T09:45:48.6211823Z RuntimeError: CUDA driver error: 200

and the assert seen in the log is just checking that nvrtcCompileProgram succeeded:
2022-09-07T09:45:49.8770329Z RuntimeError: false INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/codegen/cuda/executor_utils.cpp":1180, please report a bug to PyTorch. #pragma clang force_cuda_host_device begin

Link to the ROCm failure: https://github.com/pytorch/pytorch/runs/8233369438?check_suite_focus=true

cc'ing @jeffdaily in case you want to take a look. Meanwhile, should we disable the tests so we can merge later, once I resolve the macOS build issues? @davidberard98

@jjsjann123
Collaborator Author

wait... build CI on this PR actually passed. 🎉

There's a cxx11-abi test that's failing, but it looks like just a permission issue; maybe it's my account?! https://github.com/pytorch/pytorch/runs/8233758234?check_suite_focus=true

@davidberard98
Contributor

I'm guessing the cxx11-abi failure is an infra failure; it looks like most PRs on master are also failing (https://hud.pytorch.org/). The sanity-checks job will be skipped once we rebase this PR. But we do need to fix or disable the ROCm issues; I'm fine with disabling as long as @jeffdaily agrees.

@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jeffdaily
Collaborator

There are a number of failed tests, but they all seem to originate from the same block of generated code:

/tmp/comgr-1caa3b/input/CompileSource:5601:5: error: function template partial specialization is not allowed
    reduce<Func, Types...>(
    ^     ~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6018:5: error: function template partial specialization is not allowed
    reduce<Func, Types...>(
    ^     ~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6075:5: error: function template partial specialization is not allowed
    reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
    ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6260:5: error: function template partial specialization is not allowed
    reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
    ^          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6320:5: error: function template partial specialization is not allowed
    reduceGroupBlock<BLOCK_BROADCAST, DataTypes..., Funcs..., BoolTypes...>(
    ^               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6397:5: error: function template partial specialization is not allowed
    reduceGroupLastBlock<DataTypes..., Funcs..., BoolTypes...>(
    ^                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6757:5: error: function template partial specialization is not allowed
    welfordGroup<NumArgs, DataType, IndexType>(
    ^           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6958:5: error: function template partial specialization is not allowed
    welfordGroupBlock<BLOCK_BROADCAST, NumVals, DataType, IndexType>(
    ^                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:7027:5: error: function template partial specialization is not allowed
    welfordGroupLastBlock<NumVals, DataType, IndexType>(
    ^                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@jjsjann123
Collaborator Author

/tmp/comgr-1caa3b/input/CompileSource:6075:5: error: function template partial specialization is not allowed
reduceGroup<DataTypes..., Funcs..., BoolTypes...>(
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/comgr-1caa3b/input/CompileSource:6260:5: error: function template partial specialization is not allowed
reduceGroup<DataTypes..., Funcs..., BoolTypes...>(

But reduceGroup here is not being partially specialized... it's just an overload, and we do have all template arguments specified (a minimal sketch of the pattern follows below). This sounds like a ROCm compiler support issue; we can't back out CUDA support because ROCm is broken. Any chance we have a quick WAR for this, @jeffdaily?

cc'ing @naoyam in case I was wrong about the overloaded template functions.
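
To make the distinction concrete, here is a minimal, self-contained sketch of the pattern in question (hypothetical names echoing the generated kernels, not nvfuser's actual runtime code): overloaded variadic function templates called with explicit template-argument lists formed by expanding packs from the enclosing template. This is ordinary overload resolution, not partial specialization, and standard C++ compilers accept it:

```
#include <cstdio>

// Two overloads of a variadic function template (hypothetical names).
template <bool BLOCK_BROADCAST, typename... DataTypes>
void reduceGroup(DataTypes... vals) {
  std::printf("block variant over %zu values\n", sizeof...(vals));
}

template <typename... DataTypes>
void reduceGroup(DataTypes... vals) {
  std::printf("plain variant over %zu values\n", sizeof...(vals));
}

// Host-side stand-in for the enclosing kernel template.
template <typename... DataTypes>
void fusedKernel(DataTypes... vals) {
  // Packs from the enclosing template are expanded into the explicit
  // template-argument list of each call: a call, not a specialization.
  reduceGroup<true, DataTypes...>(vals...);  // resolves to the bool overload
  reduceGroup<DataTypes...>(vals...);        // resolves to the plain overload
}

int main() {
  fusedKernel(1.0f, 2.0);
  return 0;
}
```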

@naoyam

naoyam commented Sep 7, 2022

I'm not aware of any code that results in partial specialization of function templates. Where can I see generated code?

@jjsjann123
Collaborator Author

I'm not aware of any code that results in partial specialization of function templates. Where can I see generated code?

Thanks for confirming this.
This is on the ROCm stack, and I assume you'll need AMD hardware, which is why we really can't provide much help from the NV side on ROCm failures... We'll see if @jeffdaily is able to provide a quick patch. We can also turn off the ROCm tests and handle the failure in parallel, so we are at least not blocked on this.

@jeffdaily
Collaborator

jeffdaily commented Sep 7, 2022

You don't need AMD hardware for build errors.

docker pull rocm/dev-ubuntu-20.04:5.2

rocm_build_error.cu.txt

(GitHub doesn't support uploading *.cu files, so a *.txt extension was added to the file.)

Inside docker:

hipcc -c rocm_build_error.cu -std=c++17 --amdgpu-target=gfx906

@naoyam the code is attached.

@davidberard98
Contributor

@pytorchbot rebase -s

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/jjsjann123/4/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/84626)

pytorchmergebot pushed a commit that referenced this pull request Sep 8, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/
Rebase and squashed commits from nvfuser upstream push 0907

RUN_TORCHBENCH: nvfuser

ghstack-source-id: dc174be
Pull Request resolved: #84626
@pytorch-bot

pytorch-bot Bot commented Sep 8, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84626

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7bc72f5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

You don't need AMD hardware for build errors.

docker pull rocm/dev-ubuntu-20.04:5.2

rocm_build_error.cu.txt

(github doesn't support uploading *.cu files, so *.txt extension added to file)

Inside docker:

hipcc -c rocm_build_error.cu -std=c++17 --amdgpu-target=gfx906

@naoyam the code is attached.

I don't think we have bandwidth to spare on ROCm failures for our upstream PRs at the moment; this is not something we promised to support in nvfuser.
I'm suggesting that we disable the test for this PR. We'd be more than happy to review future PRs that patch and re-enable the related tests.

@davidberard98
Contributor

@jjsjann123 sorry, can we get one more rebase?

Seems like the test runner for the internal tests failed the last few times; not sure why.

@jeffdaily
Collaborator

@davidberard98 any specifics for internal test failures? ROCm-specific or no?

@jjsjann123
Collaborator Author

@davidberard98 any specifics for internal test failures? ROCm-specific or no?

Do we also have internal ROCm-specific tests? Asking since I don't have visibility into that, and it looks like the public ROCm CIs are green here.

@davidberard98
Contributor

Tests are just flaky, from what I can tell. I'm not familiar with ROCm tests internally.

jjsjann123 added a commit that referenced this pull request Sep 21, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

ghstack-source-id: 34c0b92
Pull Request resolved: #84626
@davidberard98
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

Daily bump for an update, since I'm getting nervous as we approach the deadline. 🙇

@IvanYashchuk
Collaborator

IvanYashchuk commented Sep 23, 2022

@malfet, @davidberard98: is there anything holding this PR back from being merged internally?

@davidberard98
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@davidberard98
Contributor

@jjsjann123 @IvanYashchuk I am merging manually because the internal diff has already been merged. In general we still need to merge via the internal workflow, but in this case I am merging via GitHub just because the bot hasn't done it for some reason.

@github-actions
Contributor

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

jjsjann123 deleted the gh/jjsjann123/4/head branch September 23, 2022 20:31
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
…g during tests. (#85319)

Fixes issue Jie found in his PR:

#84626 (comment)
Pull Request resolved: #85319
Approved by: https://github.com/jjsjann123
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: #84626
Approved by: https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch/pytorch#84626
Approved by: https://github.com/malfet
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Cherry-picking upstream build failure patches from PR pytorch#84626

Changes include:
1. added throw in stringify
2. Split fused_reduction.cu as its size exceeds the limit in MSVC
3. update bzl build for runtime header
4. Fix a bug originally reported in pytorch/pytorch#84626
5. Meta internal build fix

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch/pytorch#84626
Approved by: https://github.com/malfet
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

RUN_TORCHBENCH: nvfuser

Differential Revision: [D39324552](https://our.internmc.facebook.com/intern/diff/D39324552)
Pull Request resolved: pytorch#84626
Approved by: https://github.com/malfet

Labels

ciflow/trunk, cla signed, Merged, oncall: jit, open source, release notes: jit, skip-pr-sanity-checks
