
integrate functionalization <> LTC torchscript backend #75527

Closed
bdhirsh wants to merge 69 commits into gh/bdhirsh/199/base from gh/bdhirsh/199/head

Conversation


@bdhirsh bdhirsh commented Apr 8, 2022

This PR integrates functionalization into LazyTensorCore. The high level is:

(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
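The rewrite in (1) can be sketched with a toy example, using plain Python lists in place of tensors; `view_copy`, `add`, and `view_inverse` here are illustrative names, not the exact generated ops:

```python
# Toy model: functionalization turns "b = a.view(2, 2); b.add_(1)" into a
# purely functional program built from non-aliasing *_copy ops.
def view_copy(base, shape):
    # returns a fresh (rows, cols) nested list: no aliasing with `base`
    rows, cols = shape
    return [base[i * cols:(i + 1) * cols] for i in range(rows)]

def add(t, scalar):
    # out-of-place add: replaces the in-place add_
    return [[x + scalar for x in row] for row in t]

def view_inverse(updated_view):
    # propagates the functional update on the view back onto a new base
    return [x for row in updated_view for x in row]

def functionalized_program(a):
    b = view_copy(a, (2, 2))   # was: b = a.view(2, 2)
    b2 = add(b, 1)             # was: b.add_(1)
    a2 = view_inverse(b2)      # the mutation is replayed onto the base
    return a2, b2

a2, b2 = functionalized_program([0, 1, 2, 3])
# a2 == [1, 2, 3, 4], b2 == [[1, 2], [3, 4]]
```

The backend only ever sees `view_copy`, `add`, and `view_inverse`: no aliases, no mutations.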

(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.

(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.

## What is the interface between functionalization and LTC?

There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:

(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.

(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.

(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
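The three boundaries above can be sketched in toy form (all names are illustrative stand-ins, not the real C++ API, and the sync step is hand-waved):

```python
# Minimal stand-ins for the inner lazy tensor and the functional wrapper.
class LazyTensor:
    def __init__(self, data):
        self.data = data

class FunctionalTensorWrapper:
    def __init__(self, inner):
        self.inner = inner  # the "real" lazy tensor lives underneath

# (a) factory functions return a *wrapped* tensor, so every later op
#     passes through functionalization before reaching the LTC backend.
def empty(n):
    return FunctionalTensorWrapper(LazyTensor([0] * n))

# (b) device conversion: sync + unwrap on the way out, wrap on the way in.
def to(t, device):
    if device == "cpu" and isinstance(t, FunctionalTensorWrapper):
        return list(t.inner.data)  # (sync of pending updates elided) unwrap
    if device == "lazy":
        return FunctionalTensorWrapper(LazyTensor(list(t)))
    return t

# (c) bindings that bypass the dispatcher must unwrap manually.
def mark_step(tensors):
    return [t.inner if isinstance(t, FunctionalTensorWrapper) else t
            for t in tensors]

t = empty(3)
assert isinstance(t, FunctionalTensorWrapper)
assert to(t, "cpu") == [0, 0, 0]
```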

## What's the set of changes / what order should I look at things in?

LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:

**(1) `ts_native_functions.yaml`**

Here, I basically removed a bunch of view ops, and added corresponding `view_copy` variants that automatically get codegen'd. `view_copy` ops are "ordinary" out-of-place ops, so the codegen for them should just work.

**(2) `ts_native_functions.cpp`**

This is probably where the most important changes to LTC are. There are 4 major changes in this file:

(a) I removed the hand-written kernels for most of the view ops.

(b) I added the wrapping/unwrapping logic for `empty`/`empty_strided` and `to.device` that I mentioned in the integration section above.

(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.

(d) There are 10 problematic aten operators that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle them, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for each of them that explicitly calls into its decomposition.
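The idea behind functionalizing a composite kernel can be sketched in toy form; the op names and the `ops` routing table below are purely illustrative, not the real aten machinery:

```python
# Toy model of "functionalizing" a composite kernel: run its
# decomposition, but with every view op it uses swapped for the
# non-aliasing *_copy variant.
def narrow_copy(t, start, length):
    return list(t[start:start + length])  # fresh copy, no aliasing

FUNCTIONAL_OPS = {"narrow": narrow_copy}

def functionalize_op(decomposition):
    # the "one-liner" backend kernel: call the decomposition, routing
    # view ops to their functional *_copy variants
    def kernel(*args):
        return decomposition(FUNCTIONAL_OPS, *args)
    return kernel

# a composite op whose decomposition internally calls a view op
def halve_decomp(ops, t):
    return ops["narrow"](t, 0, len(t) // 2)

lazy_halve = functionalize_op(halve_decomp)
out = lazy_halve([1, 2, 3, 4])
# out == [1, 2], and it does not alias the input
```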

**(3) `lazy_ir.py`**

Some codegen changes. There are two main changes in the codegen:

(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration: the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outliving the string. I fixed that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`.

(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op and its `view_copy` variant actually support meta tensors: you just need to run the composite implementation (`at::compositeexplicitautograd`) and plumb meta tensors through. I added some codegen support for this.
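This meta-tensor trick can be sketched as follows; `MetaTensor`, `transpose_copy_meta`, and `infer_shape` are illustrative names, not the real codegen output:

```python
# Toy sketch: instead of a hand-written shape formula per view_copy op,
# run the op's meta implementation on shape-only "meta tensors" and
# read the output's shape.
class MetaTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)

def transpose_copy_meta(t, dim0, dim1):
    # meta kernel: computes only the output shape, touches no data
    shape = list(t.shape)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    return MetaTensor(shape)

def infer_shape(meta_op, input_shape, *args):
    # generic shape-inference rule: plumb a meta tensor through the op
    return meta_op(MetaTensor(input_shape), *args).shape

assert infer_shape(transpose_copy_meta, (2, 3, 4), 0, 2) == (4, 3, 2)
```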

**(4) `ts_eager_fallback.cpp`**

I had to update the eager fallback to ensure that it unwraps/wraps tensors properly when converting from an LTC device to a non-LTC device and back. I also updated the check to raise an error if the fallback sees any view ops, since LTC should never see view ops, so we never expect the fallback to encounter one.

**(5) `shape_inference.h/cpp`**

Added some shape formulas for a few of the new `view_copy` ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.

**(6) `init.cpp`**

Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.

**(7) `test_ts_opinfo.py`**

Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
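The property that new test checks can be sketched in toy form (this is not the real test; `ToyLazyTensor` and shared-list storage stand in for lazy tensors and aliased storage):

```python
# A mutation through one alias must still be visible through the other
# alias after a sync point like mark_step().
class ToyLazyTensor:
    def __init__(self, storage):
        self.storage = storage  # a shared list models aliased storage

    def view(self):
        return ToyLazyTensor(self.storage)  # alias: same storage

    def add_(self, v):
        for i in range(len(self.storage)):
            self.storage[i] += v

def mark_step(*tensors):
    # a correct implementation may flush pending work, but must NOT
    # copy or detach the storages (that would sever the aliasing)
    pass

a = ToyLazyTensor([1, 2, 3])
b = a.view()
mark_step(a, b)
a.add_(1)
assert b.storage == [2, 3, 4]  # the alias relationship survived the step
```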

## Other functionalization changes (not specific to LTC)

This is basically the stuff in this PR inside of `aten`. The important changes are:

(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.

I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.

(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is that LTC needs some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.

(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.


Differential Revision: D35705375


facebook-github-bot commented Apr 8, 2022


❌ 6 New Failures

As of commit f207a08 (more details on the Dr. CI page):

  • 6/6 failures introduced in this PR

🕵️ 6 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / win-vs2019-cuda11.6-py3 / test (default, 2, 5, windows.8xlarge.nvidia.gpu) (1/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T08:37:09.3256650Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2022-06-22T08:37:09.3221492Z 
2022-06-22T08:37:09.3221824Z For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
2022-06-22T08:37:09.3222300Z   warnings.warn(errors.NumbaWarning(msg))
2022-06-22T08:37:09.3222721Z C:\Jenkins\Miniconda3\lib\site-packages\numba\cuda\envvars.py:17: NumbaWarning:
2022-06-22T08:37:09.3223376Z Environment variables with the 'NUMBAPRO' prefix are deprecated and consequently ignored, found use of NUMBAPRO_LIBDEVICE=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\nvvm\libdevice.
2022-06-22T08:37:09.3223774Z 
2022-06-22T08:37:09.3224074Z For more information about alternatives visit: ('https://numba.pydata.org/numba-doc/latest/cuda/overview.html', '#cudatoolkit-lookup')
2022-06-22T08:37:09.3224541Z   warnings.warn(errors.NumbaWarning(msg))
2022-06-22T08:37:09.3224948Z ok (0.993s)
2022-06-22T08:37:09.3246416Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.003s)
2022-06-22T08:37:09.3256650Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2022-06-22T08:37:09.3258144Z ok (0.001s)
2022-06-22T08:37:09.3276036Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2022-06-22T08:37:09.3345228Z   test_chained_then (__main__.TestFuture) ... ok (0.000s)
2022-06-22T08:37:09.4398808Z   test_collect_all (__main__.TestFuture) ... ok (0.113s)
2022-06-22T08:37:09.4414908Z   test_done (__main__.TestFuture) ... ok (0.001s)
2022-06-22T08:37:09.4437016Z   test_done_exception (__main__.TestFuture) ... ok (0.003s)
2022-06-22T08:37:09.4466297Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.003s)
2022-06-22T08:37:09.4483369Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2022-06-22T08:37:09.4483788Z 
2022-06-22T08:37:09.4483875Z At:

See GitHub Actions build trunk / win-vs2019-cuda11.6-py3 / test (force_on_cpu, 1, 1, windows.4xlarge) (2/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T07:52:08.7807375Z test_cast (__mai...Error: VariableType::ID() not implemented (0.000s)
2022-06-22T07:52:08.6990134Z   test_call_python_mod_from_tracing_fn (__main__.TestScript) ... ok (0.010s)
2022-06-22T07:52:08.7050864Z   test_call_script_fn_from_script_fn (__main__.TestScript) ... ok (0.006s)
2022-06-22T07:52:08.7145519Z   test_call_script_fn_from_script_module (__main__.TestScript) ... ok (0.009s)
2022-06-22T07:52:08.7255689Z   test_call_script_fn_from_tracing_fn (__main__.TestScript) ... ok (0.011s)
2022-06-22T07:52:08.7327179Z   test_call_script_mod_from_script_fn (__main__.TestScript) ... ok (0.007s)
2022-06-22T07:52:08.7457137Z   test_call_script_mod_from_script_module (__main__.TestScript) ... ok (0.013s)
2022-06-22T07:52:08.7469284Z   test_call_script_mod_from_tracing_fn (__main__.TestScript) ... skip: error in first class mode (0.002s)
2022-06-22T07:52:08.7602375Z   test_call_traced_fn_from_tracing_fn (__main__.TestScript) ... ok (0.013s)
2022-06-22T07:52:08.7613967Z   test_call_traced_mod_from_tracing_fn (__main__.TestScript) ... skip: error in first class mode (0.001s)
2022-06-22T07:52:08.7798732Z   test_canonicalize_control_outputs (__main__.TestScript) ... ok (0.012s)
2022-06-22T07:52:08.7807375Z   test_cast (__main__.TestScript) ... skip: RuntimeError: VariableType::ID() not implemented (0.000s)
2022-06-22T07:52:08.8006007Z   test_cat (__main__.TestScript) ... ok (0.016s)
2022-06-22T07:52:08.8095184Z   test_cat_lifts (__main__.TestScript) ... ok (0.016s)
2022-06-22T07:52:08.8152321Z   test_chr (__main__.TestScript) ... ok (0.000s)
2022-06-22T07:52:08.8167964Z   test_circular_dependency (__main__.TestScript)
2022-06-22T07:52:08.8658539Z https://github.com/pytorch/pytorch/issues/25871 ... ok (0.061s)
2022-06-22T07:52:08.8889018Z   test_class_as_attribute (__main__.TestScript) ... ok (0.009s)
2022-06-22T07:52:08.8925387Z   test_class_attribute (__main__.TestScript) ... ok (0.016s)
2022-06-22T07:52:08.8964688Z   test_class_attribute_in_script (__main__.TestScript) ... ok (0.000s)
2022-06-22T07:52:08.9036035Z   test_class_with_comment_at_lower_indentation (__main__.TestScript) ... ok (0.000s)
2022-06-22T07:52:08.9046136Z   test_code_with_constants (__main__.TestScript)

See GitHub Actions build pull / win-vs2019-cpu-py3 / test (default, 2, 2, windows.4xlarge) (3/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T07:15:59.2322147Z ls: cannot access ...d/win_tmp/ci_scripts/*': No such file or directory
2022-06-22T07:15:59.0948981Z + export TEST_DIR_WIN
2022-06-22T07:15:59.0949232Z + export PYTORCH_FINAL_PACKAGE_DIR=/c/2540234189/build-results/
2022-06-22T07:15:59.0949515Z + PYTORCH_FINAL_PACKAGE_DIR=/c/2540234189/build-results/
2022-06-22T07:15:59.1019670Z ++ cygpath -w /c/2540234189/build-results/
2022-06-22T07:15:59.1133421Z + PYTORCH_FINAL_PACKAGE_DIR_WIN='C:\2540234189\build-results\'
2022-06-22T07:15:59.1133715Z + export PYTORCH_FINAL_PACKAGE_DIR_WIN
2022-06-22T07:15:59.1134017Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/build/win_tmp/build/torch
2022-06-22T07:15:59.1421388Z + CI_SCRIPTS_DIR=/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts
2022-06-22T07:15:59.1421797Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts
2022-06-22T07:15:59.1633303Z ++ ls '/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts/*'
2022-06-22T07:15:59.2322147Z ls: cannot access '/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts/*': No such file or directory
2022-06-22T07:15:59.2325700Z + '[' -n '' ']'
2022-06-22T07:15:59.2326097Z + export SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/.jenkins/pytorch/win-test-helpers
2022-06-22T07:15:59.2326511Z + SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/.jenkins/pytorch/win-test-helpers
2022-06-22T07:15:59.2326829Z + [[ win-vs2019-cpu-py3 == *cuda11* ]]
2022-06-22T07:15:59.2327063Z + [[ default = \f\o\r\c\e\_\o\n\_\c\p\u ]]
2022-06-22T07:15:59.2327276Z + [[ win-vs2019-cpu-py3 == *cuda* ]]
2022-06-22T07:15:59.2327964Z + run_tests
2022-06-22T07:15:59.2328304Z + for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe
2022-06-22T07:15:59.2328654Z + [[ -x /c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe ]]
2022-06-22T07:15:59.2330909Z + '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe'

See GitHub Actions build trunk / win-vs2019-cuda11.6-py3 / test (default, 1, 5, windows.8xlarge.nvidia.gpu) (4/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T07:38:43.1847592Z ls: cannot access ...d/win_tmp/ci_scripts/*': No such file or directory
2022-06-22T07:38:43.0425020Z + export TEST_DIR_WIN
2022-06-22T07:38:43.0425629Z + export PYTORCH_FINAL_PACKAGE_DIR=/c/2540316661/build-results/
2022-06-22T07:38:43.0426224Z + PYTORCH_FINAL_PACKAGE_DIR=/c/2540316661/build-results/
2022-06-22T07:38:43.0520378Z ++ cygpath -w /c/2540316661/build-results/
2022-06-22T07:38:43.0684680Z + PYTORCH_FINAL_PACKAGE_DIR_WIN='C:\2540316661\build-results\'
2022-06-22T07:38:43.0685368Z + export PYTORCH_FINAL_PACKAGE_DIR_WIN
2022-06-22T07:38:43.0686095Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/build/win_tmp/build/torch
2022-06-22T07:38:43.1184192Z + CI_SCRIPTS_DIR=/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts
2022-06-22T07:38:43.1184994Z + mkdir -p /c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts
2022-06-22T07:38:43.1463822Z ++ ls '/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts/*'
2022-06-22T07:38:43.1847592Z ls: cannot access '/c/actions-runner/_work/pytorch/pytorch/build/win_tmp/ci_scripts/*': No such file or directory
2022-06-22T07:38:43.1852865Z + '[' -n '' ']'
2022-06-22T07:38:43.1853654Z + export SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/.jenkins/pytorch/win-test-helpers
2022-06-22T07:38:43.1854292Z + SCRIPT_HELPERS_DIR=/c/actions-runner/_work/pytorch/pytorch/.jenkins/pytorch/win-test-helpers
2022-06-22T07:38:43.1854750Z + [[ win-vs2019-cuda11.6-py3 == *cuda11* ]]
2022-06-22T07:38:43.1855110Z + export BUILD_SPLIT_CUDA=ON
2022-06-22T07:38:43.1855571Z + BUILD_SPLIT_CUDA=ON
2022-06-22T07:38:43.1855997Z + [[ default = \f\o\r\c\e\_\o\n\_\c\p\u ]]
2022-06-22T07:38:43.1856289Z + [[ win-vs2019-cuda11.6-py3 == *cuda* ]]
2022-06-22T07:38:43.1856591Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda
2022-06-22T07:38:43.1856889Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda

See GitHub Actions build pull / win-vs2019-cpu-py3 / test (default, 1, 2, windows.4xlarge) (5/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T08:16:35.3452607Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2022-06-22T08:16:35.3416161Z   C:\Jenkins\Miniconda3\lib\unittest\suite.py(122): run
2022-06-22T08:16:35.3416443Z   C:\Jenkins\Miniconda3\lib\unittest\suite.py(84): __call__
2022-06-22T08:16:35.3416736Z   C:\Jenkins\Miniconda3\lib\site-packages\xmlrunner\runner.py(67): run
2022-06-22T08:16:35.3417044Z   C:\Jenkins\Miniconda3\lib\unittest\main.py(271): runTests
2022-06-22T08:16:35.3417333Z   C:\Jenkins\Miniconda3\lib\unittest\main.py(101): __init__
2022-06-22T08:16:35.3417690Z   C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\build\torch\testing\_internal\common_utils.py(688): run_tests
2022-06-22T08:16:35.3418009Z   test_futures.py(331): <module>
2022-06-22T08:16:35.3418133Z 
2022-06-22T08:16:35.3418200Z ok (0.564s)
2022-06-22T08:16:35.3442250Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.016s)
2022-06-22T08:16:35.3452607Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2022-06-22T08:16:35.3453832Z ok (0.001s)
2022-06-22T08:16:35.3469913Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2022-06-22T08:16:35.3519085Z   test_chained_then (__main__.TestFuture) ... ok (0.005s)
2022-06-22T08:16:35.4546226Z   test_collect_all (__main__.TestFuture) ... ok (0.103s)
2022-06-22T08:16:35.4560190Z   test_done (__main__.TestFuture) ... ok (0.001s)
2022-06-22T08:16:35.4578047Z   test_done_exception (__main__.TestFuture) ... ok (0.000s)
2022-06-22T08:16:35.4600759Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.000s)
2022-06-22T08:16:35.4614980Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:201] Got the following error when running the callback: ValueError: Expected error
2022-06-22T08:16:35.4615311Z 
2022-06-22T08:16:35.4615360Z At:

See GitHub Actions build trunk / macos-11-py3-x86-64 / test (default, 1, 2, macos-12) (6/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-22T09:10:16.6678810Z FAIL [0.108s]: tes...ntization.core.test_quantized_op.TestQuantizedOps)
2022-06-22T09:10:16.6675130Z     assert_equal(
2022-06-22T09:10:16.6675630Z   File "/Users/runner/miniconda3/envs/build/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-06-22T09:10:16.6675970Z     raise error_metas[0].to_error(msg)
2022-06-22T09:10:16.6676330Z AssertionError: Tensor-likes are not close!
2022-06-22T09:10:16.6676820Z 
2022-06-22T09:10:16.6676930Z Mismatched elements: 58 / 1044 (5.6%)
2022-06-22T09:10:16.6677370Z Greatest absolute difference: 1.0 at index (0, 0, 0, 0) (up to 1e-05 allowed)
2022-06-22T09:10:16.6678110Z Greatest relative difference: 1.0 at index (0, 0, 0, 0) (up to 1.3e-06 allowed) : torch results are off
2022-06-22T09:10:16.6678370Z 
2022-06-22T09:10:16.6678490Z ======================================================================
2022-06-22T09:10:16.6678810Z FAIL [0.108s]: test_qrelu6 (quantization.core.test_quantized_op.TestQuantizedOps)
2022-06-22T09:10:16.6679290Z ----------------------------------------------------------------------
2022-06-22T09:10:16.6679590Z Traceback (most recent call last):
2022-06-22T09:10:16.6679950Z   File "/Users/runner/work/pytorch/pytorch/test/quantization/core/test_quantized_op.py", line 277, in test_qrelu6
2022-06-22T09:10:16.6680430Z     self._test_activation_function(X, 'relu6', relu6_test_configs)
2022-06-22T09:10:16.6680840Z   File "/Users/runner/work/pytorch/pytorch/test/quantization/core/test_quantized_op.py", line 225, in _test_activation_function
2022-06-22T09:10:16.6681350Z     self.assertEqual(qY, qY_hat, msg='{} - {} failed: ({} vs. {})'.format(
2022-06-22T09:10:16.6681940Z   File "/Users/runner/miniconda3/envs/build/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2238, in assertEqual
2022-06-22T09:10:16.6682280Z     assert_equal(
2022-06-22T09:10:16.6682780Z   File "/Users/runner/miniconda3/envs/build/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
2022-06-22T09:10:16.6683140Z     raise error_metas[0].to_error(msg)

This comment was automatically generated by Dr. CI.

bdhirsh added a commit that referenced this pull request Apr 14, 2022
bdhirsh added a commit that referenced this pull request Apr 14, 2022
bdhirsh added a commit that referenced this pull request Apr 15, 2022
bdhirsh added a commit that referenced this pull request Apr 15, 2022
bdhirsh added a commit that referenced this pull request Apr 17, 2022

bdhirsh commented Apr 17, 2022

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

bdhirsh added a commit that referenced this pull request Apr 20, 2022
bdhirsh added a commit that referenced this pull request Jun 13, 2022
This PR integrates functionalization into LazyTensorCore. The high level is:

(1)  LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.

(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.

(3)  A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.

## What is the interface between functionalization and LTC?

There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:

(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.

(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.

(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.

## What's the set of changes / what order should I look at things in?

LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:

**(1) `ts_native_functions.yaml`**

Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.

**(2) `ts_native_functions.cpp`**

This is probably where the most important changes to LTC are. There are 4 major changes in this file:

(a) I removed the hand-written kernels for the most of the view ops.

(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.

(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.

(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.


**(3) `lazy_ir.py`**

Some codegen changes. There are two main changes in the codegen:

(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`

(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op and `view_copy` op actually supports meta tensors: you just need to run the composite implementation (`at::compositeexplicitautograd`) and plumb meta tensors through. I added codegen support for this.


**(4) `ts_eager_fallback.cpp`**

I had to update the eager fallback to ensure that when converting from an LTC device to a non-LTC device and back, it unwraps/wraps properly. I also updated the check to error if the fallback sees any view ops (since LTC should never see view ops, the fallback should never see one either).

**(5) `shape_inference.h/cpp`**

Added shape formulas for a few of the new `view_copy` ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we could keep this PR a bit smaller and fully rip out the LTC view infrastructure later.


**(6) `init.cpp`**

Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.

**(7) `test_ts_opinfo.py`**

Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.

 

## Other functionalization changes (not specific to LTC)

This covers the parts of this PR inside `aten`. The important changes are:

(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA, though, since those are the only contexts under which autograd will be called directly on a `FunctionalTensorWrapper` object.

I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.

(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea is that LTC needs some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted a helper function to make this case easy to handle.

(3) Some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` key to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.


Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)

[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Jun 14, 2022
bdhirsh added a commit that referenced this pull request Jun 15, 2022
bdhirsh added a commit that referenced this pull request Jun 15, 2022
bdhirsh added a commit that referenced this pull request Jun 16, 2022
**`torchgen/gen.py`** (outdated)
```python
    mapMaybe(gen_composite_view_copy_kernel, view_groups)
),
"SymIntViewCopyKernel_Definitions": list(
    mapMaybe(lambda pair: gen_symint_view_copy_kernel(pair[0], pair[1]), view_copy_with_symint_pairs)
```
cc @ezyang I remember hearing that long term we'd like to have `view*.SymInt` fully subsume the existing view / view_copy ops, so we can always rip this out later.

But for now, I'm codegen'ing `{view}_copy.SymInt` kernel overloads to call into their `{view}_copy` variants, which is what the existing `expand_copy.SymInt` kernel does today.

```python
if remove_non_owning_ref_types:
    return NamedCType(binds, VectorCType(BaseCType(SymIntT)))
else:
    return NamedCType(binds, BaseCType(symIntArrayRefT))
```
Hey @Krovatkin, if you're interested - the changes here + in `translate.py` are needed to get functionalization working with sym ints :). There are still a few other things that I need to fix, but this basically tells the codegen how to:

(1) convert `SymIntArrayRef` -> `std::vector<SymInt>` (needed because functionalization stashes SymInt argument inputs into a lambda, which can outlive the original `SymIntArrayRef`)
(2) convert `std::vector<SymInt>` -> `SymIntArrayRef` (going the other way)
(3) convert from `SymIntArrayRef` -> `IntArrayRef` (needed for the `expand_copy.SymInt` -> `expand_copy` kernel)


bdhirsh commented Jun 17, 2022

@pytorchbot help


pytorch-bot bot commented Jun 17, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'help' (choose from 'merge', 'revert', 'rebase')

usage: @pytorchbot [-h] {merge,revert,rebase} ...

Try @pytorchbot --help for more info.

bdhirsh added 5 commits June 17, 2022 06:20
This PR integrates functionalization into LazyTensorCore. The high level is:

(1)  LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.

(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.

(3)  A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.

## What is the interface between functionalization and LTC?

There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:

(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.

(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.

(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
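The wrap/unwrap boundaries in (a) and (b) can be sketched as a toy Python model (hypothetical names; the real `FunctionalTensorWrapper` is a C++ `TensorImpl` subclass and the real boundary logic lives in the LTC kernels):

```python
# Toy sketch of the wrapper indirection and where it is crossed.

class LazyTensor:
    def __init__(self, data):
        self.data = data

class FunctionalTensorWrapper:
    def __init__(self, inner):
        self.inner = inner  # the wrapped LazyTensor

def lazy_empty(data):
    # (a) Factory functions return wrapped tensors, so every later op on
    # the result routes through functionalization before hitting LTC.
    return FunctionalTensorWrapper(LazyTensor(data))

def to_device(t, device):
    # (b) Device conversions cross the wrapper boundary.
    if device == "cpu" and isinstance(t, FunctionalTensorWrapper):
        return t.inner.data  # sync any pending updates, then unwrap
    if device == "lazy" and not isinstance(t, FunctionalTensorWrapper):
        return FunctionalTensorWrapper(LazyTensor(t))  # wrap
    return t

t = lazy_empty([0.0, 0.0])
assert isinstance(t, FunctionalTensorWrapper)
assert to_device(t, "cpu") == [0.0, 0.0]
assert isinstance(to_device([1.0], "lazy"), FunctionalTensorWrapper)
```

Case (c), python bindings, is the same unwrap as `to('cpu')`, just performed manually because those bindings never reach the functionalization kernels.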

## What's the set of changes / what order should I look at things in?

LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:

**(1) `ts_native_functions.yaml`**

Here, I basically removed a bunch of view ops and added corresponding `view_copy` variants that automatically get codegen'd. `view_copy` ops are "ordinary" out-of-place ops, so the codegen for them should just work.

**(2) `ts_native_functions.cpp`**

This is probably where the most important changes to LTC are. There are 4 major changes in this file:

(a) I removed the hand-written kernels for most of the view ops.

(b) I added the wrapping/unwrapping logic for `empty`/`empty_strided` and `to.device` that I mentioned in the integration section above.

(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.

(d) There are a total of 10 aten operators that are problematic and need a bit of extra handling. Why? The high-level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle them, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for each that explicitly calls into its decomposition.
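The shape of the helper in (d) can be sketched with a toy Python model (illustrative only; the real helper is `at::functionalization::functionalize_aten_op` in C++, and real decompositions dispatch through the dispatcher rather than an op table):

```python
# Toy model of "functionalizing" a composite kernel whose decomposition
# calls a view op internally.

def narrow(xs, start, length):
    # Eager view op (pretend this aliases; Python slices actually copy).
    return xs[start:start + length]

def narrow_copy(xs, start, length):
    # Non-aliasing *_copy variant.
    return list(xs[start:start + length])

def composite_first_half(xs, ops):
    # A CompositeExplicitAutograd-style decomposition, parameterized over
    # an op table so we can swap view ops for their copy variants.
    return ops["narrow"](xs, 0, len(xs) // 2)

EAGER_OPS = {"narrow": narrow}
COPY_OPS = {"narrow": narrow_copy}

def functionalize(decomposition):
    # The "one-liner kernel": re-run the decomposition with copy ops, so
    # the backend below only ever sees non-aliasing operators.
    return lambda *args: decomposition(*args, ops=COPY_OPS)

lazy_first_half = functionalize(composite_first_half)
assert composite_first_half([1, 2, 3, 4], EAGER_OPS) == [1, 2]
assert lazy_first_half([1, 2, 3, 4]) == [1, 2]
```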


**(3) `lazy_ir.py`**

Some codegen changes. There are two main changes in the codegen:

(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that happened to surface only after the integration: the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outliving the lifetime of the string. I fixed that by ensuring that we store an owned `std::string` on the node instead of a `c10::string_view`.

(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op and its `view_copy` variant actually support meta tensors: you just need to run the composite implementation (`at::compositeexplicitautograd`) and plumb meta tensors through it. I added some codegen support for this.
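The meta-tensor trick in (b) can be sketched as a toy Python model (hypothetical names; real meta tensors carry dtype/strides too, and the real kernels live in ATen): run the op on shape-only inputs and read the output shape, instead of writing a per-op shape formula.

```python
# Toy sketch of shape inference by running an op on "meta" (shape-only)
# inputs.

class MetaTensor:
    # Carries only a shape, no data.
    def __init__(self, shape):
        self.shape = tuple(shape)

def expand_copy_meta(t, sizes):
    # Simplified same-rank expand semantics: -1 keeps the existing dim.
    out = [old if s == -1 else s for old, s in zip(t.shape, sizes)]
    return MetaTensor(out)

def infer_shape(meta_kernel, *meta_args):
    # Generic shape rule: just run the kernel on meta inputs.
    return meta_kernel(*meta_args).shape

assert infer_shape(expand_copy_meta, MetaTensor((1, 3)), [4, -1]) == (4, 3)
```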


**(4) `ts_eager_fallback.cpp`**

I had to update the eager fallback to ensure that when converting from an LTC device to a non-LTC device and back, it unwraps/wraps properly. I also updated the check to error out on any view ops: LTC should never see a view op, so we never expect the fallback to see one either.

**(5) `shape_inference.h/cpp`**

Added shape formulas for a few of the new `view_copy` ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called anymore. We could just delete them, but I figured we can keep this PR a bit smaller and fully rip out the LTC view infrastructure later.


**(6) `init.cpp`**

Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.

**(7) `test_ts_opinfo.py`**

Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
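The alias-preservation property under test can be sketched as a toy Python model (illustrative only; the real test lives in `test_ts_opinfo.py` and goes through `mark_step()` on real lazy tensors): after a step, aliased tensors must still share the same storage, so mutations remain visible through the alias.

```python
# Toy version of the mark_step alias-preservation check.

class Storage:
    def __init__(self, data):
        self.data = data

class ToyLazyTensor:
    def __init__(self, storage):
        self.storage = storage

def mark_step(tensors):
    # Materialize pending work; must NOT sever aliasing: after the step,
    # aliased tensors still share the same Storage object.
    return tensors

base = ToyLazyTensor(Storage([1.0, 2.0]))
view = ToyLazyTensor(base.storage)   # alias: shares the base's storage
mark_step([base, view])
base.storage.data[0] = 5.0
assert view.storage is base.storage   # aliasing survived the step
assert view.storage.data[0] == 5.0    # mutation still visible via alias
```

Before this PR, the analogous real-world failure was `mark_step()` effectively giving `view` its own fresh storage, silently breaking the aliasing contract.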

 

## Other functionalization changes (not specific to LTC)

This is basically the stuff in this PR inside of `aten`. The important changes are:

(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA, though, since they are the only contexts in which autograd is called directly on a `FunctionalTensorWrapper` object.

I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.

(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is that LTC needs some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted a helper function that makes this case easy to handle.

(3) Some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` key to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.


Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)

[ghstack-poisoned]
bdhirsh added 2 commits June 20, 2022 23:14
bdhirsh commented Jun 22, 2022

@pytorchbot rebase

@pytorchmergebot

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot

Successfully rebased gh/bdhirsh/199/orig onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/75527)

@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.
