
fix maxpool2d for XLA dynamo tracing #4276

Merged
JackCaoG merged 11 commits into master from dynamo_maxpool_fix on Dec 15, 2022

Conversation

@bdhirsh (Contributor) commented Dec 5, 2022

waiting for CI before getting review

bdhirsh pushed a commit to pytorch/pytorch that referenced this pull request Dec 5, 2022
Today when XLA registers an autograd.Function to the `AutogradXLA` key, the "add to the autograd graph" step and the "run the forward kernel" step happen all in one go. That's wrong, and prevents other dispatcher code from executing in the middle.

While trying to fix this, I noticed a bug in the codegen: we register kernels for both the XLA and AutogradXLA dispatch keys to the same class. This prevents XLA from registering separate kernels to the XLA and AutogradXLA keys, which is what this PR attempts to address.

Companion patch to fix XLA's max_pool2d registration here, which was blocking the dynamo integration: pytorch/xla#4276

After this PR, XLA should generate two separate header files: `XLANativeFunctions.h` and `AutogradXLANativeFunctions.h`. Before, all of the kernels (including autograd kernels) were thrown into `XLANativeFunctions.h`.

cc JackCaoG

[ghstack-poisoned]
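
To illustrate the split described above: with separate keys, a backend can register one kernel at its autograd key (which adds the op to the autograd graph and then redispatches) and another at the backend key (which runs the actual forward lowering). A minimal C++ sketch of that shape using the dispatcher's public registration API - the kernel bodies here are hypothetical placeholders, not XLA's actual generated code:

#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical stand-in for the generated XLA forward kernel.
at::Tensor xla_max_pool2d(const at::Tensor& self, at::IntArrayRef kernel_size,
                          at::IntArrayRef stride, at::IntArrayRef padding,
                          at::IntArrayRef dilation, bool ceil_mode) {
  // ... lower max_pool2d to XLA and run the forward computation ...
  return self;  // placeholder
}

// Hypothetical stand-in for the autograd wrapper: records the backward
// node, then redispatches so other dispatcher code (and eventually the
// XLA kernel below) can run in between.
at::Tensor autograd_xla_max_pool2d(const at::Tensor& self, at::IntArrayRef kernel_size,
                                   at::IntArrayRef stride, at::IntArrayRef padding,
                                   at::IntArrayRef dilation, bool ceil_mode) {
  // ... set up the autograd node, then redispatch ...
  return self;  // placeholder
}

TORCH_LIBRARY_IMPL(aten, AutogradXLA, m) {
  m.impl("max_pool2d", autograd_xla_max_pool2d);  // runs first, at the autograd key
}

TORCH_LIBRARY_IMPL(aten, XLA, m) {
  m.impl("max_pool2d", xla_max_pool2d);  // runs later, at the backend key
}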
@bdhirsh (Contributor, Author) commented Dec 6, 2022

@JackCaoG do you know why the option to re-run is greyed out on the CI failure? Not sure if it's a permissions thing or something else. It looks like CI didn't pick up my torch pin or something; trying to kick it off again:
[screenshot: the greyed-out re-run option on the CI failure]

@JackCaoG (Collaborator) commented Dec 6, 2022

Hmm, I was able to restart the CI. I thought you were an admin with all permissions; let me double-check.

@JackCaoG (Collaborator) commented Dec 6, 2022

Hmm, I think the torch pin has taken effect. From the full log:

+ TORCH_PIN=/tmp/pytorch/xla/scripts/../torch_patches/.torch_pin
+ '[' -f /tmp/pytorch/xla/scripts/../torch_patches/.torch_pin ']'
++ cat /tmp/pytorch/xla/scripts/../torch_patches/.torch_pin
+ CID='#90226'
+ [[ #90226 = \#* ]]
+ PRNUM=90226
+ set +x
Fetching PyTorch PR #90226
/tmp/pytorch /tmp/pytorch
From https://github.com/pytorch/pytorch
 * [new ref]               refs/pull/90226/head -> 90226
Switched to branch '90226'
M	third_party/ideep
M	third_party/kineto
Submodule path 'third_party/ideep': checked out 'ececd0a4f53c39f2d91caaddee0de1cd214f5b99'
Submodule path 'third_party/kineto': checked out '0703c78999061b8329dfab7ec5046fc5764a5573'


bdhirsh force-pushed the dynamo_maxpool_fix branch from 69f54bb to c4d9828 on December 6, 2022
@JackCaoG (Collaborator) commented Dec 7, 2022

======================================================================
ERROR: test_pooling_shape_xla (__main__.TestPoolingNNDeviceTypeXLA)
Test the output shape calculation for pooling functions
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 391, in instantiated_test
    raise rte
  File "/opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_device_type.py", line 378, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/pytorch/xla/test/../../test/nn/test_pooling.py", line 593, in test_pooling_shape
    check((1, 1, 3, 3, 4), (1, 1, 5, 6, 7), kernel_size=1, stride=2, padding=0, ceil_mode=True)
  File "/tmp/pytorch/xla/test/../../test/nn/test_pooling.py", line 591, in check
    self.assertEqual(op(t, *args, **kwargs).shape, expected_out_shape[:i + 2])
  File "/opt/conda/lib/python3.7/site-packages/torch/_jit_internal.py", line 485, in fn
    return if_false(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 782, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: 0 INTERNAL ASSERT FAILED at "/tmp/pytorch/aten/src/ATen/core/boxing/KernelFunction.cpp":19, please report a bug to PyTorch. fallthrough_kernel was executed but it should have been short-circuited by the dispatcher. This could occur if you registered a fallthrough kernel as a override for a specific operator (as opposed to a backend fallback); this is NOT currently supported, and we do not intend to add support for it in the near future.  If you do find yourself in need of this, let us know in the bug tracker.

Seems like it has something to do with a fallthrough kernel? The test is about pooling, so I think this is a real failure.
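
For context on that assert: the dispatcher supports fallthrough kernels as backend-wide fallbacks, but not as the registered kernel for one specific operator, which is exactly what the message says. A minimal C++ sketch of the two cases (illustrative only; not code from this PR):

#include <torch/library.h>

// Supported: a backend-wide fallback telling the dispatcher to skip
// this dispatch key entirely and fall through to the next one.
TORCH_LIBRARY_IMPL(_, AutogradXLA, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}

// NOT supported: a fallthrough registered as the kernel for a single op.
// This registers, but the dispatcher cannot short-circuit it, so it trips
// the INTERNAL ASSERT seen in the traceback above.
TORCH_LIBRARY_IMPL(aten, AutogradXLA, m) {
  m.impl("max_pool2d", torch::CppFunction::makeFallthrough());
}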

@bdhirsh (Contributor, Author) commented Dec 8, 2022

@JackCaoG I'm trying to rebuild XLA locally on my new devserver (I'm anticipating issues), but I just stared at the code for a while and I think I know what the problem is. Just pushed it, so I'll see what the latest round of CI yields.

@shunting314 I'll give you a shout when this PR looks ready to test - when it is, can you try re-running your dynamo-XLA integration with max_pool2d (both fw and bw) and confirm if there are issues?

@shunting314 (Collaborator) commented:

@bdhirsh sure, I'd be glad to do the tests

@bdhirsh (Contributor, Author) commented Dec 8, 2022

Hey @shunting314, it looks like the max_pool2d unit tests are passing. I do see a failure in the XLA-dynamo tests, but it doesn't seem related to this change (?). Can you try running E2E tests again?

ERROR: test_simple_model (__main__.DynamoBasicTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pytorch/xla/test/dynamo/test_dynamo.py", line 41, in test_simple_model
    res_xla_dynamo_2 = self.fn_simple_dynamo(xla_x, xla_y)
  File "/opt/conda/lib/python3.7/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/tmp/pytorch/xla/test/dynamo/test_dynamo.py", line 21, in fn_simple_dynamo
    @dynamo.optimize('torchxla_trace_once')
  File "/opt/conda/lib/python3.7/site-packages/torch/_dynamo/eval_frame.py", line 209, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/_dynamo/optimizations/backends.py", line 755, in fwd
    model = subgraph.model
NameError: free variable 'subgraph' referenced before assignment in enclosing scope

@JackCaoG (Collaborator) commented Dec 8, 2022

A rebase should fix the issue.

bdhirsh force-pushed the dynamo_maxpool_fix branch from 443f718 to 7bf44ba on December 8, 2022
@JackCaoG (Collaborator) commented Dec 8, 2022

@bdhirsh you might also need to rebase your pytorch PR.

bdhirsh pushed a commit to pytorch/pytorch that referenced this pull request Dec 8, 2022 (same commit message as above)
@bdhirsh (Contributor, Author) commented Dec 8, 2022

yep - done

@shunting314 (Collaborator) commented:

@bdhirsh is this the only PR I need to patch? (I previously saw you had 2 related PRs?)

@shunting314 (Collaborator) commented:

Do I need to patch pytorch/pytorch#90226 as well?

@bdhirsh (Contributor, Author) commented Dec 8, 2022

Yes, sorry - you'll need both :)

@shunting314 (Collaborator) commented:

@bdhirsh I see the following errors when building torch_xla:

# BUILD_CPP_TESTS=0 python setup.py develop
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Building torch_xla version: 1.14
XLA Commit ID: c6acdf5417e736429e353e7abf2a26cde1d96122
PyTorch Commit ID: 9f04175b952abbe841e52e65fc95c0bd61a2f2cd
Traceback (most recent call last):
  File "/pytorch/xla/scripts/gen_lazy_tensor.py", line 120, in <module>
    get_device_fn="torch_xla::bridge::GetXlaDevice")
  File "/pytorch/torchgen/gen_lazy_tensor.py", line 367, in run_gen_lazy_tensor
    source_yaml, grouped_native_functions, backend_indices
  File "/pytorch/torchgen/gen_backend_stubs.py", line 246, in parse_backend_yaml
    {forward_kernels[0].kernel} is listed under "supported", but {backward_kernels[0].kernel} is listed under "autograd".'
AssertionError: Currently, all variants of an op must either be registered to a backend key, or to a backend's autograd key. They cannot be mix and matched. If this is something you need, feel free to create an issue! max_pool2d is listed under "supported", but max_pool2d is listed under "autograd".
Failed to generate lazy files: ['python', '/pytorch/xla/scripts/gen_lazy_tensor.py']

@shunting314 (Collaborator) commented:

Oh, nvm - let me patch the pytorch-side PR as well.

@shunting314 (Collaborator) commented:

@bdhirsh I still see the issue after patching this PR and the corresponding PR on the pytorch side. I've created a standalone test that doesn't require patching my PR. You can repro using this simple script in your environment: pytorch/torchdynamo#1837 (comment)

Comment thread on torch_xla/csrc/aten_autograd_ops.cpp (outdated):
c10::DispatchKey::Conjugate,
c10::DispatchKey::Negative,
c10::DispatchKey::ZeroTensor,
c10::DispatchKey::ADInplaceOrView,
@bdhirsh (Contributor, Author) commented:

Oh, we should take ADInplaceOrView out of here.
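
For context, a key set like the one quoted above is typically used with an RAII exclude guard so that a redispatch from inside the autograd kernel skips those functionality keys. A rough sketch of the pattern (illustrative; not the exact code in this file):

#include <c10/core/DispatchKeySet.h>
#include <c10/core/impl/LocalDispatchKeySet.h>

void redispatch_without_wrapper_keys() {
  // Keys to bypass on the nested dispatch. Per the comment above,
  // ADInplaceOrView should not be in this set.
  c10::DispatchKeySet skip_keys{
      c10::DispatchKey::Conjugate,
      c10::DispatchKey::Negative,
      c10::DispatchKey::ZeroTensor,
  };
  // While the guard is alive, dispatches on this thread exclude skip_keys.
  c10::impl::ExcludeDispatchKeyGuard guard(skip_keys);
  // ... call back into the op here; the excluded keys are skipped ...
}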

bdhirsh force-pushed the dynamo_maxpool_fix branch 2 times, most recently from 98d1059 to 83876d5, on December 13, 2022
@bdhirsh (Contributor, Author) commented Dec 14, 2022

Hey @JackCaoG - do you mind finishing up the landing? @shunting314 confirmed that this fixes the E2E tests for max_pool2d.

@JackCaoG (Collaborator) left a review comment:

Thanks!

JackCaoG merged commit 4013c57 into master on Dec 15, 2022