# integrate functionalization <> LTC torchscript backend #75527
bdhirsh wants to merge 69 commits into gh/bdhirsh/199/base
As of commit f207a08, Dr. CI reported 6 new failures recognized by patterns; they do not appear to be due to upstream breakages.
This PR integrates functionalization into LazyTensorCore. At a high level:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization runs "above" LTC, which only sees the non-aliasing `*_copy` variant of each view operator. Functionalization also removes mutations, so (for the most part) LTC only sees functional/out-of-place operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
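The rewrite in (1) can be sketched with a toy example, using plain Python lists as tensors. The helper names here (`view_copy`, `view_copy_scatter`) are illustrative stand-ins, not the real aten kernels: a program that views and then mutates gets replaced by out-of-place copies plus an explicit write-back, so the backend never sees aliasing.

```python
# Toy model of the functionalization rewrite. The original program
#     b = a.view(2, 2); b.add_(1)
# is rewritten into out-of-place ops plus a write-back to the base.

def view_copy(flat, rows, cols):
    # out-of-place stand-in for view(rows, cols): a fresh nested-list copy
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def add(t, v):
    # functional add: returns a new value, leaves its input untouched
    return [[x + v for x in row] for row in t]

def view_copy_scatter(updated):
    # write the mutated "view" back into a fresh flat base tensor
    return [x for row in updated for x in row]

a = [1, 2, 3, 4]
b = view_copy(a, 2, 2)     # was: b = a.view(2, 2)
b = add(b, 1)              # was: b.add_(1)
a = view_copy_scatter(b)   # the mutation is propagated back to the base
# a == [2, 3, 4, 5], b == [[2, 3], [4, 5]]; no operation aliased memory
```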
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Some python bindings (like `mark_step()`) don't go through the dispatcher, which means they need to do the unwrapping themselves instead of relying on functionalization kernels to do it automatically.
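The three boundary points above can be sketched with a minimal toy. `FunctionalWrapper` here is an illustrative stand-in for the C++ `FunctionalTensorWrapper`; all names are hypothetical, not the real API.

```python
# Toy sketch of the wrap/unwrap boundary between functionalization and LTC.

class FunctionalWrapper:
    def __init__(self, inner):
        self.inner = inner  # the "lazy tensor" living underneath

def empty_lazy(numel):
    # (a) factory function: returns a *wrapped* tensor, so every later op
    # routes through functionalization before reaching the backend
    return FunctionalWrapper([0.0] * numel)

def to_cpu(t):
    # (b) lazy -> cpu: sync pending updates (elided here), then unwrap
    assert isinstance(t, FunctionalWrapper)
    return list(t.inner)

def to_lazy(cpu_tensor):
    # (b) cpu -> lazy: wrap the tensor up
    return FunctionalWrapper(list(cpu_tensor))

def mark_step_binding(tensors):
    # (c) a binding that skips the dispatcher must unwrap by hand
    return [t.inner if isinstance(t, FunctionalWrapper) else t
            for t in tensors]
```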
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are 10 problematic aten operators that I had to add a bit of extra handling for. Why? The high-level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle them, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for each that explicitly calls into its decomposition.
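The `functionalize_aten_op` idea can be sketched as follows. The dict-based "dispatch table" is purely illustrative (the real helper reinvokes the op's decomposition under the functionalization dispatch key), but it shows the shape of the fix: the composite decomposition is re-run with its aliasing ops swapped for `*_copy` variants, and the backend-facing kernel becomes a one-liner.

```python
# Toy version of "functionalize a composite kernel".

def view(flat, rows, cols):
    raise RuntimeError("the lazy backend must never see a view op")

def view_copy(flat, rows, cols):
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

view_ops = {"view": view}  # what the decomposition dispatches to

def make_block_2x2(flat):
    # "CompositeExplicitAutograd" decomposition: functional at its
    # boundary, but calls a view operator internally
    return view_ops["view"](flat, 2, 2)

def functionalize(op):
    # re-run the decomposition with aliasing ops swapped for *_copy variants
    def wrapped(*args):
        saved = view_ops["view"]
        view_ops["view"] = view_copy
        try:
            return op(*args)
        finally:
            view_ops["view"] = saved
    return wrapped

# the "one-liner" backend kernel: just call the functionalized decomposition
lazy_make_block_2x2 = functionalize(make_block_2x2)
```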
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free bug with ops that take a `std::string`. This was UB that only surfaced once I did the integration: the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outliving the string. I fixed it by ensuring that we store a `std::string` on the node instead of a `c10::string_view`.
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
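The meta-tensor trick in (b) can be sketched in miniature. `MetaTensor` and the expand rule below are illustrative, not the real `at::meta` machinery: running the composite implementation on tensors that carry only a shape yields the output shape without a hand-written formula.

```python
# Toy sketch of shape inference by plumbing "meta" (shape-only) tensors
# through a composite implementation.

class MetaTensor:
    def __init__(self, shape):
        self.shape = tuple(shape)  # no data, just the shape

def expand_copy_meta(t, sizes):
    # toy expand_copy shape rule: -1 keeps the input dimension
    return MetaTensor([t.shape[i] if s == -1 else s
                       for i, s in enumerate(sizes)])

def infer_shape(meta_op, *args):
    # the codegen'd node asks the meta implementation for its output
    # shape, so no hand-written shape formula is needed for the *_copy op
    return meta_op(*args).shape
```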
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should eventually delete them, but I figured we could keep this PR a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is that LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) Some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` key to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
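The shape of the new aliasing test, sketched as a plain-Python toy (hypothetical `Storage`/`LazyTensor` classes): two tensors sharing storage must still alias after a `mark_step()`-style sync.

```python
class Storage:
    def __init__(self, data):
        self.data = list(data)

class LazyTensor:
    def __init__(self, storage):
        self.storage = storage

    def view(self):
        # alias: the new tensor shares the same storage
        return LazyTensor(self.storage)

    def add_(self, v):
        self.storage.data = [x + v for x in self.storage.data]

def mark_step(*tensors):
    # A correct sync materializes values but must NOT re-point storages;
    # the old bug was equivalent to giving each tensor a fresh Storage here.
    pass

a = LazyTensor(Storage([1, 2]))
b = a.view()
mark_step(a, b)
a.add_(10)  # b must observe this, because aliasing survived the step
```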
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea is that LTC needs special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass; this helper makes that case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
**torchgen/gen.py** (outdated diff context):

    mapMaybe(gen_composite_view_copy_kernel, view_groups)
    ),
    "SymIntViewCopyKernel_Definitions": list(
        mapMaybe(lambda pair: gen_symint_view_copy_kernel(pair[0], pair[1]), view_copy_with_symint_pairs)
cc @ezyang I remember hearing that long term we'd like to have view*.SymInt fully subsume the existing view/view copy ops, so we can always rip this out later.
But for now, I'm codegen'ing {view}_copy.SymInt kernel overloads to call into their {view}_copy variants, which is what the existing expand_copy.SymInt kernel does today.
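A hypothetical mini-version of that codegen pattern: each `{view}_copy.SymInt` overload is generated to forward to the plain `{view}_copy` kernel. Note the `to_int_array_ref` conversion helper below is a made-up placeholder, not a real c10 function.

```python
# Template for a forwarding {view}_copy.SymInt kernel (names are illustrative).
TEMPLATE = """\
at::Tensor {name}_copy_symint(const at::Tensor& self, c10::SymIntArrayRef size) {{
  return {name}_copy(self, to_int_array_ref(size));
}}
"""

def gen_symint_view_copy_kernel(name):
    # e.g. gen_symint_view_copy_kernel("expand") emits expand_copy_symint
    return TEMPLATE.format(name=name)
```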
    if remove_non_owning_ref_types:
        return NamedCType(binds, VectorCType(BaseCType(SymIntT)))
    else:
        return NamedCType(binds, BaseCType(symIntArrayRefT))
Hey @Krovatkin if you're interested - the changes here + in translate.py are needed to get functionalization working with sym ints :). There are still a few other things that I need to fix, but this basically tells the codegen how to:
(1) convert SymIntArrayRef -> std::vector<SymInt> (needed because functionalization stashes SymInt argument inputs into a lambda, which can outlive the original SymIntArrayRef)
(2) convert std::vector<SymInt> -> SymIntArrayRef (going the other way)
(3) convert from SymIntArrayRef -> IntArrayRef (needed for the expand_copy.SymInt -> expand_copy kernel)
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced for some reason when I did the integration, but the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outlasting the life-time of the string. I added some logic to fix that by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from ltc -> non-ltc device and back, it unwraps/wraps properly. Also updated the check to error if it sees any view ops (since LTC should never see view ops, so we never expect the fallback to see one).
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only context under which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalization" `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd`, but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have `FunctionalTensorWrapper(LazyTensorImpl)`.
(3) A bunch of aliasing bugs are now fixed. The most significant one is that `mark_step()` no longer severs aliasing relationships between tensors. I included a test in the PR.
## What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (`LazyNativeFunctions::empty/empty_strided`). This is the main integration point - I updated those functions to return a wrapped `FunctionalTensorWrapper` object, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.
(b) converting between devices. When you call `ltc_tensor.to('cpu')`, we need to sync any updates and "unwrap" the tensor. When you call `cpu_tensor.to('lazy')`, we need to wrap the tensor up.
(c) python bindings. Python bindings (like `mark_step()`) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.
## What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
**(1) `ts_native_functions.yaml`**
Here, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
**(2) `ts_native_functions.cpp`**
This is probably where the most important changes to LTC are. There are 4 major changes in this file:
(a) I removed the hand-written kernels for the most of the view ops.
(b) I added the wrapping/unwrapping logic for `empty`/ `empty_strided`, and `to.device` that I mentioned in the integration section above.
(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.
(d) There are a total of 10 aten operators that are problematic, that I had to add a bit of extra handling for. Why? The high level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run **underneath** functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for them that explicitly calls into their decomposition.
**(3) `lazy_ir.py`**
Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free error with ops that take in a `std::string`. This was UB that only surfaced when I did the integration: the codegen'd nodes for ops like `div.rounding_mode` were storing the string argument as a `c10::string_view`, and the constructed node was outliving the string. I fixed this by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`.
(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op + view_copy actually supports meta tensors. You just need to run the composite implementation (`at::compositeexplicitautograd`), and plumb meta tensors through. I added some codegen support for this.
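The shape-inference trick above can be sketched in a few lines: rather than writing one shape rule per view_copy op, run the op's implementation on "meta" tensors that carry only shape metadata and read the output's shape. This is an illustrative Python model (the `MetaTensor` class and `transpose_copy_meta` are hypothetical stand-ins for the real meta-tensor plumbing):

```python
from dataclasses import dataclass

@dataclass
class MetaTensor:
    shape: tuple  # no data, just metadata

def transpose_copy_meta(t: MetaTensor, d0: int, d1: int) -> MetaTensor:
    # The op's implementation, running on shape-only inputs.
    shape = list(t.shape)
    shape[d0], shape[d1] = shape[d1], shape[d0]
    return MetaTensor(tuple(shape))

def infer_shape(op, *meta_args):
    # Plumb meta tensors through the implementation; the output's shape
    # is the inferred shape -- no hand-written shape rule needed.
    return op(*meta_args).shape

assert infer_shape(transpose_copy_meta, MetaTensor((2, 3, 4)), 0, 2) == (4, 3, 2)
```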
**(4) `ts_eager_fallback.cpp`**
I had to update the eager fallback to ensure that when converting from an ltc to a non-ltc device and back, it unwraps/wraps properly. I also updated the check to error if it sees any view ops: LTC should never see view ops, so we never expect the fallback to see one.
**(5) `shape_inference.h/cpp`**
Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR just a bit smaller and fully rip out the LTC view infrastructure later.
**(6) `init.cpp`**
Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
**(7) `test_ts_opinfo.py`**
Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.
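The shape of the aliasing test can be modeled in a few lines: two handles that share storage must still alias after a `mark_step()`-style flush. This is a toy Python model of the *behavior* being tested, not the real `test_ts_opinfo.py` test or the real `mark_step()` API.

```python
class Storage:
    def __init__(self, data):
        self.data = list(data)

class Handle:
    def __init__(self, storage):
        self.storage = storage

def mark_step(handles):
    # Flush: materialize results in place, keeping the shared Storage
    # object intact. (The bug fixed by this PR was, in effect, giving
    # each handle its own fresh storage here, severing the alias.)
    for h in handles:
        h.storage.data = [x for x in h.storage.data]  # no-op "compute"

s = Storage([1, 2, 3])
a, b = Handle(s), Handle(s)     # b aliases a
mark_step([a, b])
a.storage.data[0] = 42
assert b.storage.data[0] == 42  # alias relationship survived the step
```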
## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:
(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA though, since they are the only contexts in which autograd will directly be called on a `FunctionalTensorWrapper` object.
I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.
(2) A helper function for "functionalizing" a `CompositeExplicitAutograd` kernel: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is that LTC needs some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
(3) Some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` key to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.
Differential Revision: [D35705375](https://our.internmc.facebook.com/intern/diff/D35705375)
[ghstack-poisoned]
|
@pytorchbot rebase |
|
@pytorchbot successfully started a rebase job. Check the current status here |
|
Successfully rebased |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as `Stale`. |
This PR integrates functionalization into LazyTensorCore. The high level is:
(1) LTC will no longer see view/aliasing operators directly. Instead, functionalization will run "above" LTC, which will only see non-aliasing *_copy variants of each view operator. It will also remove mutations, so (for the most part) LTC will only see "functional/out-of-place" operators.
(2) At the C++ level, every lazy tensor is wrapped in a layer of indirection: we now have
FunctionalTensorWrapper(LazyTensorImpl).(3) A bunch of aliasing bugs are now fixed. The most significant one is that
mark_step()no longer severs aliasing relationships between tensors. I included a test in the PR.What is the interface between functionalization and LTC?
There needs to be some code that "promotes/demotes" a tensor from a functional wrapper to its inner LTC tensor. The places where that happens are:
(a) factory functions (
LazyNativeFunctions::empty/empty_strided). This is the main integration point - I updated those functions to return a wrappedFunctionalTensorWrapperobject, which will cause every future usage of the returned tensor to pass through functionalization for every operator (which does the unwrapping) before hitting the LTC backend again.(b) converting between devices. When you call
ltc_tensor.to('cpu'), we need to sync any updates and "unwrap" the tensor. When you callcpu_tensor.to('lazy'), we need to wrap the tensor up.(c) python bindings. Python bindings (like
mark_step()) that don't go through the dispatcher. That means that they need to do the unwrapping themselves, instead of relying on functionalization kernels to do it automatically.What's the set of changes / what order should I look at things in?
LTC folks can focus just on the LTC-specific changes. I'd recommend looking at the following:
(1)
ts_native_functions.yamlHere, I basically removed a bunch of view ops, and added corresponding "view_copy" variants that automatically get codegen'd. view_copy ops are "ordinary" out-of-place ops, so the codegen for them should just work.
(2) `ts_native_functions.cpp`

This is probably where the most important changes to LTC are. There are 4 major changes in this file:

(a) I removed the hand-written kernels for most of the view ops.

(b) I added the wrapping/unwrapping logic for `empty/empty_strided` and `to.device` that I mentioned in the integration section above.

(c) I added a lowering for the `at::lift` operator. This is a new op that's needed for the `torch.tensor()` constructor, where we need to explicitly "lift" LTC tensors into functional tensor objects.

(d) There are a total of 10 problematic aten operators that I had to add a bit of extra handling for. Why? The high-level idea is that a few ops (like `block_diag`) are `CompositeExplicitAutograd`, which means that they run underneath functionalization. These ops are "functional" (no aliasing info), but they internally call view operators. To handle these ops, I added a helper function in core that lets you "functionalize" a composite kernel: `at::functionalization::functionalize_aten_op`. The change for LTC is basically that these ops used to work "for free", whereas now you need to manually write a (one-liner) kernel for each of them that explicitly calls into its decomposition.

(3) `lazy_ir.py`

Some codegen changes. There are two main changes in the codegen:
(a) Fixed a use-after-free bug with ops that take in a `std::string`. This was UB that only happened to surface when I did the integration: the codegen'd nodes for ops like `div.rounding_mode` stored the string argument as a `c10::string_view`, and the constructed node outlived the lifetime of the string it referenced. I fixed this by explicitly ensuring that we store a `std::string` on the node instead of a `c10::string_view`.

(b) Now that we're codegen'ing a bunch of `view_copy` nodes, I didn't want to have to write shape inference rules for all of them (since they don't have `at::meta::` implementations). However, every view op and view_copy op actually supports meta tensors: you just need to run the composite implementation (`at::compositeexplicitautograd`) and plumb meta tensors through. I added some codegen support for this.

(4) `ts_eager_fallback.cpp`

I had to update the eager fallback to ensure that when converting from an LTC to a non-LTC device and back, it unwraps/wraps properly. I also updated the check to error if it sees any view ops (LTC should never see view ops, so we never expect the fallback to see one).
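The wrap/unwrap discipline at the fallback boundary can be sketched in a few lines. This is a toy Python model, not the real C++ fallback; `FunctionalTensorWrapper` here is a stand-in class and `eager_fallback` is an illustrative name. The point is the round-trip: strip the functional wrapper before handing tensors to the eager kernel, then re-wrap the result so later ops go back through functionalization.

```python
# Toy model of the eager-fallback boundary: unwrap, run eagerly, re-wrap.
class FunctionalTensorWrapper:
    def __init__(self, inner):
        self.inner = inner

def eager_fallback(eager_kernel, *wrapped_args):
    # lazy -> eager: strip the functional wrappers from the inputs
    inner_args = [a.inner if isinstance(a, FunctionalTensorWrapper) else a
                  for a in wrapped_args]
    result = eager_kernel(*inner_args)
    # eager -> lazy: re-wrap so future ops hit functionalization again
    return FunctionalTensorWrapper(result)

out = eager_fallback(lambda xs: [v * 2 for v in xs],
                     FunctionalTensorWrapper([1, 2, 3]))
# out.inner == [2, 4, 6]
```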
(5) `shape_inference.h/cpp`

Added some shape formulas for a few of the new view_copy ops. I also updated the formulas for some of the existing view ops to explicitly raise an error, since they should never be called. We should just delete them, but I figured we can make this PR a bit smaller and fully rip out the LTC view infrastructure later.
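For flavor, a shape formula for a view_copy op is typically just arithmetic on the input sizes. The real formulas are C++ `compute_shape_*` functions; the Python version below is a hypothetical analogue (the function name and signature are illustrative, not the real API), shown for a narrow_copy-style op.

```python
# Hypothetical Python analogue of a view_copy shape formula: narrow_copy
# keeps every dimension of the input except `dim`, which becomes `length`.
def compute_shape_narrow_copy(input_shape, dim, start, length):
    out = list(input_shape)
    out[dim] = length
    return out

compute_shape_narrow_copy([2, 5, 3], dim=1, start=1, length=4)
# -> [2, 4, 3]
```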
(6) `init.cpp`

Updated the python bindings to "unwrap" functional wrapper tensor inputs, as mentioned in the integration section above.
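Since these bindings bypass the dispatcher, no functionalization kernel will do the unwrapping for them; they have to peel the wrapper off themselves. A minimal sketch of that behavior, with toy names (the real code is C++ in `init.cpp`):

```python
# Toy model: a dispatcher-bypassing binding must unwrap its own inputs.
class FunctionalTensorWrapper:
    def __init__(self, inner):
        self.inner = inner

def maybe_unwrap(tensor):
    # peel off the functional wrapper if present; pass other inputs through
    if isinstance(tensor, FunctionalTensorWrapper):
        return tensor.inner
    return tensor

inputs = [FunctionalTensorWrapper("lazy_tensor_0"), "plain_tensor_1"]
unwrapped = [maybe_unwrap(t) for t in inputs]
# unwrapped == ["lazy_tensor_0", "plain_tensor_1"]
```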
(7) `test_ts_opinfo.py`

Some basic test cleanup. Also added a test explicitly for `mark_step()` preserving alias relationships.

## Other functionalization changes (not specific to LTC)
This is basically the stuff in this PR inside of `aten`. The important changes are:

(1) `detach()` support for functionalization (in `FunctionalTensorWrapper.h/cpp`). This is only actually relevant to LTC/XLA, since they are the only contexts under which autograd will directly be called on a `FunctionalTensorWrapper` object. I ended up duplicating a bit of the detach logic from `TensorImpl.h` to get this to work, but I couldn't think of a better way to do it.

(2) A helper function for "functionalizing" `CompositeExplicitAutograd` kernels: `at::functionalization::functionalize_aten_op` (in `FunctionalTensorWrapper.h/cpp`). The idea here is that LTC needs to add some special handling for ops like `block_diag` that are `CompositeExplicitAutograd` but call into view operators "underneath" the functionalization pass. I wanted to add a helper function to make this case easy to handle.
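The idea behind the helper can be sketched in plain Python. This is illustrative only (the real helper is a C++ template; `narrow_view`, `narrow_copy`, `composite_kernel`, and `functionalize` are all toy names): re-run a composite kernel with each internal view op swapped for its `*_copy` variant, so the backend underneath never sees aliasing.

```python
# Toy sketch of "functionalizing" a composite kernel.
def narrow_view(storage, start, length):
    # stand-in aliasing view op: records an offset into shared storage
    return (storage, start, length)

def narrow_copy(storage, start, length):
    # stand-in *_copy variant: materializes fresh storage
    return (storage[start:start + length], 0, length)

def composite_kernel(storage, narrow_op=narrow_view):
    # a "functional" composite op (think block_diag) whose implementation
    # internally calls a view op
    return narrow_op(storage, 1, 2)

def functionalize(kernel):
    # re-bind the kernel so its internal view op is the copy variant
    return lambda storage: kernel(storage, narrow_op=narrow_copy)

data = [10, 20, 30, 40]
aliased = composite_kernel(data)                    # shares `data`
functional = functionalize(composite_kernel)(data)  # fresh storage
```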
(3) Some `native_functions.yaml` changes. This is mostly just me using the new `CompositeExplicitAutogradNonFunctional` key to pre-emptively prevent XLA/LTC from accidentally using the "problematic" decompositions. This will also make XLA failures easier to spot.

Stack from ghstack (oldest at bottom):
Differential Revision: D35705375