Fixes for PyTorch/XLA functionalization integration #88787
wonjoo-wj wants to merge 11 commits into pytorch:master
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88787
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Failure as of commit 1f72367. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 612fc0a to df3bf42
Force-pushed cfc2922 to 86f8b11
@wonjoolee95 You probably want to hide your debug hints while running the CI here. They make the log too large and impossible to parse in the browser. Also, a rebase would be appreciated.
I guess it's better to log that information with proper log-level control, so that you can still use it locally for debugging without dramatically increasing the test log size.
Force-pushed bd0a882 to 2197973
Hmm, seems like some functorch tests are still failing even after applying the generated diff.
It looks like those machines have GPU devices. Do you have a GPU env? I guess you need to run those tests on a GPU env; otherwise, the tests will be skipped.
Please correct me if I'm wrong, but it seems like only the first two failing tests are on GPU devices? I'll wait for the rest of the CI to complete and then fix all the non-GPU tests first.
Force-pushed f9804d3 to 048200a
With the latest commit, all the TestAOTAutograd tests should succeed. I'm still unsure about the previous GPU failure, but I'll let the CI run one more time before spending more time on the GPU test.
That failure is very likely unrelated. |
According to hud, the deploy failure shouldn't be related. |
Force-pushed 18db0a9 to 14d131d
Force-pushed 14d131d to af74813
@bdhirsh Is this guard safe to remove? With it, I crashed on xla symbolic expand:
```
root@t1v-n-307ffe96-w-0:/workspaces/work/pytorch/xla# PJRT_DEVICE=CPU python test/test_dynamic_shapes.py -v TestDynamicShapes.test_simple_expand
test_simple_expand (__main__.TestDynamicShapes) ... ERROR

======================================================================
ERROR: test_simple_expand (__main__.TestDynamicShapes)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/test_dynamic_shapes.py", line 18, in test_simple_expand
    t5.expand(t2.size(0))
RuntimeError: /workspaces/work/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd.cpp:2109: SymIntArrayRef expected to contain only concrete integers

----------------------------------------------------------------------
Ran 1 test in 0.022s

FAILED (errors=1)
```
Okay, it regresses. It's not safe.
@alanwaketan so, the idea motivating this bit of code is the following:
- pytorch/XLA doesn't care about strides, so when comparing a PyTorch program run on CUDA vs. XLA, the user will witness different strides on the tensors throughout their program
- functionalization gives XLA the ability to fix that problem; XLA can choose to ignore the value of strides in all of its kernels, but functionalization can run the meta function for every ATen op to properly set the strides
One question here is: do you think that's a benefit worth trying to capture for pytorch/XLA (stride correctness, from the user's perspective, for XLA tensors)? I'd be interested in @JackCaoG 's opinion.
Our options are either to:
(1) kill that code, and not bother trying to get correct strides
(2) make it more robust so it works on this test
In this test, it looks like you're using dynamic shapes, and the meta function we're calling doesn't play well with dynamic shapes. The way the dynamic shapes workstream in core has been handling this is that we have Python implementations/decompositions of a bunch of our ops that we want to run when dynamic shapes are enabled. And it looks like, for some reason, we aren't calling that Python impl and are instead calling the C++ one?
There's probably a better way to arrange for this to work with XLA, but one option is to enable the Python dispatcher in your test, which should override a bunch of C++ meta kernels with their Python equivalents:
```python
from torch._dispatch.python import enable_python_dispatcher

with enable_python_dispatcher():
    test()
```
Here is the follow up on the xla side: pytorch/xla#4448.
Looks like enabling the Python dispatcher and implementing the missing sym size ops can work around this. But then it raises a bigger question of whether we should enable the Python dispatcher for dynamic shapes or not.
Force-pushed ab6871a to 38fcb95
Force-pushed 38fcb95 to 6ebd38f
Rebased this and the XLA POC PR with master.
Force-pushed 6ebd38f to 016d9e6
Force-pushed 016d9e6 to c29cfe4
Force-pushed c29cfe4 to 55dac54
A bunch of tests are failing. The last commit does touch dynamo-related code, but the CI used to be green even with that commit. While I try to reproduce this locally, let me also try to rebase onto master and re-run the CI.
We can always use hud.pytorch.org to determine whether the test is broken at the tip of the tree.
Force-pushed 55dac54 to c48991e
Force-pushed e5811f1 to 34fad84
Force-pushed 34fad84 to 4c2a5ca
This reverts commit 1e66139201e493235fd4dadbe454d36eaa569aa7.
Force-pushed 4c2a5ca to f09903a
Review comment on `native_functions.yaml`:

```yaml
  CUDA: _fused_adamw_kernel_cuda_
  autogen: _fused_adamw, _fused_adamw.out

- func: _propagate_xla_data(Tensor input, Tensor output) -> ()
```
@bdhirsh I have added the op you suggested in pytorch/xla#4505 (comment). Please review. You can just check the commit called: [Functionalization] Adds _propagate_xla_data.
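For reference, a full declaration for such an op in `native_functions.yaml` would look roughly like this (a sketch on my part; the `variants` and `dispatch` lines are assumptions, not the PR's actual diff — the schema line is the one quoted above):

```yaml
# Sketch only: dispatch entries below are assumed, not taken from the PR.
- func: _propagate_xla_data(Tensor input, Tensor output) -> ()
  variants: function
  dispatch:
    CompositeExplicitAutograd: _propagate_xla_data
```

The idea is that the default kernel can be a no-op, while the XLA backend overrides it to copy backend-specific metadata from `input` to `output` during functionalization.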
Force-pushed 7896fe8 to 1f72367
Moving to a new PR with a new branch -- #94537. Marking this one closed.

Picking up #88506