Tests for fallback boxed dispatch (including TLS mode) #26719
ezyang wants to merge 17 commits into gh/ezyang/375/base from
Conversation
Signed-off-by: Edward Z. Yang <ezyang@fb.com> [ghstack-poisoned]
Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D17549624](https://our.internmc.facebook.com/intern/diff/D17549624) [ghstack-poisoned]
This PR adds a pair of tests for fallback boxed dispatch, exercising two different ways you might use it: (1) to implement a "wrapper" tensor type (e.g., LazyTensor, NestedTensor), and (2) to implement a toggleable "mode" (e.g., Profiling, Tracing). Both implement the most trivial possible version of their type: they "wrap" a real tensor and simply forward along to the real implementation. This PR also adds the necessary feature support for a toggleable mode, which was part of the original generic dispatch abstraction design but was not previously implemented. I had not originally intended to add this, but it turns out writing a new "mode" is a lot simpler than writing a "wrapper" type, so I ended up writing the mode version first.

General structure of the PR:

* Add two new testing tensor type ids, `TESTING_ONLY_GenericWrapperTensorId` and `TESTING_ONLY_GenericModeTensorId`, which our tests use. They might find use in other tests if necessary.
* Add support for toggling the availability of `TESTING_ONLY_GenericModeTensorId`. Introduces a new thread-local variable, accessible via `tls_included_tensor_type_set()`, which is considered as part of dispatch. (It is a little goofy that we now have to do two TLS accesses; I'm not sure if there's a way to avoid this.)
* There is a screed about how modes (Profiling/Tracing) are different from hybrid wrapper-modes (Variable).
* The mode fallback is very simple: it increments a counter and then passes the call on to the underlying kernel by invoking the JIT.
* The wrapper fallback is more complex: it parses the arguments, unwrapping any wrapped tensor arguments, then invokes the JIT, and then rewraps the outputs.

The examples here are somewhat simplistic; there are a number of engineering improvements that could be applied. We could save these for later (landing this patch to get immediate testing), or incorporate them into this patch:

* `getOperator` is horrible. Bram Wasti and I discussed a plan for how to make this easier, by simply refactoring the JIT interface.
* `GenericWrapperTensorImpl` doesn't populate all of its fields accurately. Most notably, size is not set up correctly.
* `generic_wrapper_fallback` should handle tensor lists in arguments and returns properly.

One pitfall: fallback dispatch only works with non-c10 code. That's why I test using `batch_norm`.

Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D17549624](https://our.internmc.facebook.com/intern/diff/D17549624) [ghstack-poisoned]
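The mechanism described above can be sketched in a few lines. This is an illustrative model only, not the actual c10 API: the names (`TensorTypeSet` as a raw bitmask, `kTestingGenericModeId`, `dispatch`) are hypothetical. The key idea is that dispatch considers the tensor's own type set plus a thread-local "included" set minus a thread-local "excluded" set, and the mode fallback counts the call, masks its own id, and redispatches to reach the real kernel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of TLS-augmented dispatch; one bit per tensor type id.
using TensorTypeSet = uint64_t;

constexpr TensorTypeSet kCPUTensorId          = 1ull << 0;
constexpr TensorTypeSet kTestingGenericModeId = 1ull << 1;  // higher priority

thread_local TensorTypeSet tls_included = 0;  // e.g. toggled mode
thread_local TensorTypeSet tls_excluded = 0;

// The effective set consulted by dispatch.
TensorTypeSet effective_set(TensorTypeSet tensor_properties) {
  return (tensor_properties | tls_included) & ~tls_excluded;
}

int mode_call_count = 0;            // what the test's mode fallback records
int real_kernel(int x) { return x + 1; }

int dispatch(TensorTypeSet tensor_properties, int x) {
  if (effective_set(tensor_properties) & kTestingGenericModeId) {
    // Mode fallback: count the call, then forward to the underlying kernel
    // by excluding the mode id for the redispatch.
    ++mode_call_count;
    tls_excluded |= kTestingGenericModeId;
    int r = dispatch(tensor_properties, x);
    tls_excluded &= ~kTestingGenericModeId;  // restore
    return r;
  }
  return real_kernel(x);
}
```

With the mode bit set in `tls_included`, a plain CPU tensor's call is intercepted exactly once before reaching `real_kernel`; with it cleared, dispatch goes straight through.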
zdevito
left a comment
Generally looks good. I have a few medium level comments about the API and use of thread locals.
c10/core/impl/LocalTensorTypeSet.cpp
Outdated

    #endif

    void set_enabled_via_excluded(TensorTypeId tid, bool enabled) {
The names of these functions are confusing to me. The caller still needs to understand that the final set will be `(tensor_properties | included) - excluded`, but then there is also a second translation between how this function affects included/excluded. Wouldn't it make more sense to just have it follow the set operations directly: `excluded.add`, `excluded.remove`, `included.add`, `included.remove`?
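The API shape suggested here could look like the following sketch. All names (`TensorTypeIdSet`, `final_set`) are illustrative, not the real c10 interface; the point is that the caller manipulates the sets directly and only has to keep the one rule `final = (tensor_properties | included) - excluded` in mind.

```cpp
#include <cassert>
#include <cstdint>

// Direct set operations on a per-thread id set (hypothetical names).
struct TensorTypeIdSet {
  uint64_t bits = 0;
  void add(uint64_t tid)       { bits |= (1ull << tid); }
  void remove(uint64_t tid)    { bits &= ~(1ull << tid); }
  bool has(uint64_t tid) const { return bits & (1ull << tid); }
};

thread_local TensorTypeIdSet included;
thread_local TensorTypeIdSet excluded;

// The single rule the caller reasons about:
// final = (tensor_properties | included) - excluded
uint64_t final_set(uint64_t tensor_properties) {
  return (tensor_properties | included.bits) & ~excluded.bits;
}
```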
c10/core/impl/LocalTensorTypeSet.h
Outdated

    C10_API void tls_variable_set_enabled(bool enabled);
    C10_API TensorTypeSet tls_excluded_tensor_type_set();

    C10_API bool TESTING_ONLY_tls_generic_mode_is_enabled();
Is it the best idea to have separate functions here for every key? I am concerned that over time non-simple operations on the tensor set will end up here, and I'd prefer the opposite: that the only thing you can do to a key is enable it in a thread or in the tensor's properties.
Solves my giant comment problem too.
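One way to enforce "the only thing you can do to a key is enable it in a thread" is a scoped RAII guard, so arbitrary set manipulation never leaks past a scope. This is a hypothetical sketch (the guard name and the raw `tls_included` mask are illustrative, not the real c10 API):

```cpp
#include <cassert>
#include <cstdint>

thread_local uint64_t tls_included = 0;  // one bit per tensor type id

// Scoped enablement of a single key in the current thread (illustrative).
class IncludeTensorTypeIdGuard {
  uint64_t mask_;
  bool was_set_;
 public:
  explicit IncludeTensorTypeIdGuard(uint64_t tid)
      : mask_(1ull << tid), was_set_(tls_included & mask_) {
    tls_included |= mask_;
  }
  ~IncludeTensorTypeIdGuard() {
    if (!was_set_) tls_included &= ~mask_;  // restore the prior state
  }
};

bool tid_enabled(uint64_t tid) { return tls_included & (1ull << tid); }
```

The key is enabled only while the guard is alive; nesting is safe because the guard restores whatever state it found.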
c10/core/impl/LocalTensorTypeSet.cpp
Outdated

    // NB: Zero initialized!
    thread_local uint64_t raw_excluded;
    thread_local uint64_t raw_included;
These are used adjacent to each other in dispatch, so it would be more efficient if they were in a single thread_local struct; otherwise there is a sequence of about four loads to get to each one. I would imagine this struct would grow into the generic thread-local dispatch state that the chain-of-responsibility "call super" mechanism would need as well.
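The suggestion amounts to something like this sketch (struct and field names are illustrative): both masks live in one thread_local object, so dispatch performs a single TLS address computation and reads two adjacent fields.

```cpp
#include <cassert>
#include <cstdint>

// One thread_local struct instead of two separate thread_local variables
// (hypothetical names; the real layout is up to the implementation).
struct LocalDispatchState {
  uint64_t excluded = 0;  // NB: zero initialized
  uint64_t included = 0;
  // Room to grow into generic thread-local dispatch state later.
};

thread_local LocalDispatchState tls_state;

uint64_t effective(uint64_t tensor_properties) {
  // One TLS access covers both fields.
  const LocalDispatchState& s = tls_state;
  return (tensor_properties | s.included) & ~s.excluded;
}
```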
This PR introduces a dirt-simple registration API to allow custom Tensors. The API aims to minimize code complexity and compilation times; it makes an explicit trade-off, favoring developer productivity over compute efficiency. A couple of reasons to have this:

- Tests are easier to write (see #26719)
- It addresses the limit of 64 possible TensorTypeIds we can have with TensorTypeSet
- It drastically reduces mental overhead for folks working in research who want to add a Tensor
- It allows entirely decoupled static registration for tensors defined in separate repos (planned usage in pytorch/sparse)

[ghstack-poisoned]
Summary: Pull Request resolved: pytorch/pytorch#26719 (the full PR description appears above). Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D17549624 Test Plan: Imported from OSS Pulled By: ezyang fbshipit-source-id: 57dbdd8d6812a66082aa6db2934c8edcda340ea6