
[wip][lazy] A Lazy Tensor Implementation #25753

Closed
bwasti wants to merge 322 commits into pytorch:master from bwasti:lazy_tensor

Conversation

@bwasti
Contributor

@bwasti bwasti commented Sep 6, 2019

Lazy tensors create a lazy execution environment for PyTorch. Instead of immediately running operators, lazy tensors defer execution until the last possible moment (or when the user converts the tensor to eager).

By deferring execution, the backend is able to transparently view a larger chunk of the program. It can then optimize the execution plan or compile better code than what is possible in an immediate execution setting.

Laziness may hurt the debugging experience, as users will need to disable the feature to get the exact lines of failure in their Python code.

Design

The general design requires hijacking the dispatch mechanism for all operators, shimming these operators to return Lazy Tensors and storing a record of their invocation in the output Tensor. When we need to execute the operators, we lower the recorded "history" stored within the Tensor to PyTorch's JIT IR and run that. Autograd is set up in the same way as TorchScript.
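The recording scheme described above can be sketched in a few lines of Python (all names here are illustrative stand-ins, not the PR's actual API): each operator call returns a lazy tensor that stores the op and its inputs, and converting to eager walks the recorded "history" and executes it.

```python
# Hypothetical sketch of the lazy-recording design, not the PR's real code.
class LazyTensor:
    def __init__(self, op=None, inputs=(), value=None):
        self.op = op          # recorded operator name; None for leaf tensors
        self.inputs = inputs  # recorded input LazyTensors
        self.value = value    # concrete payload, only for leaves

    def to_eager(self):
        # Leaves (the "schema is null" case discussed below) just return
        # their payload; interior nodes recursively execute their inputs.
        if self.op is None:
            return self.value
        args = [t.to_eager() for t in self.inputs]
        return {"add": lambda a, b: a + b,
                "mul": lambda a, b: a * b}[self.op](*args)

def add(a, b): return LazyTensor("add", (a, b))
def mul(a, b): return LazyTensor("mul", (a, b))

x = LazyTensor(value=2)
y = LazyTensor(value=3)
z = mul(add(x, y), y)   # nothing executes yet; only the graph is recorded
print(z.to_eager())     # the recorded history is lowered and run here -> 15
```

In the actual PR the recorded history is lowered to PyTorch's JIT IR rather than interpreted directly, but the shape of the idea is the same.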

Status

This diff is not in a landable state, but it provides a fully working lazy tensor implementation for a large subset of PyTorch operators. I'm putting it up for initial design comments and to motivate the conversation surrounding dispatch with concrete code and needs.

The change is blocked by the development of a stable multi-dispatch mechanism.

The change is also missing a couple of things that will need to exist before landing:

  • non-CPU laziness (not too hard to add)
  • thorough testing (full models etc)

cc @nikitaved @pearu @cpuhrsch @IvanYashchuk @yf225 @glaringlee @zou3519 @ezyang @bhosmer @smessmer @ljk53 @bdhirsh @albanD @mruberry @jbschlosser @ngimel

@bwasti bwasti requested a review from apaszke as a code owner September 6, 2019 05:27
@pytorchbot pytorchbot added oncall: jit Add this issue/PR to JIT oncall triage queue module: internals Related to internal abstractions in c10 and ATen module: operators labels Sep 6, 2019
@vadimkantorov
Contributor

Would this enable custom user kernel codegen plug-ins? Like KeOps?

@soumith
Collaborator

soumith commented Sep 6, 2019

@vadimkantorov it could if one writes a KeOps-pytorch-jit plugin similar to how https://github.com/pytorch/tvm is a jit plugin

Contributor

Repeating from what was discussed in realtime: these changes will conflict with the post-multi dispatch implementation at

void* getOp(TensorTypeId tid) const {

Contributor

The recommendation:

Currently, the dispatcher logic in #25653 looks like:

op = dispatch_table[op_id][valid_keys[0]]
if not op:
  # fallback logic...
return op(args)

We adjust fallback handling to now be:

op = dispatch_table[op_id][valid_keys[0]]
if not op and default_dispatch_table[valid_keys[0]]:
  op = BoxedOpFunctor<FuncType>(default_dispatch_table[valid_keys[0]])
if not op:
  # fallback logic...
return op(args)

There are two pieces:

  1. The default dispatch table contains registrations of functions default_dispatch(Operator op, Stack&& ivalues) which get passed in what operator they need to process, and arguments in boxed form. The expectation is a backend like lazy would write this function once to work polymorphically over everything. Operator is some type that gives you the schema string, and also a way to redispatch on the operator.
  2. The boxed op functor BoxedOpFunctor unpacks FuncType into its boxed form and then calls the default dispatch function. It is templated here, as is done in this PR.
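As a rough model of the two pieces above (the table names, keys, and op ids here are illustrative stand-ins, not the real dispatcher types), the lookup plus boxed backend fallback might behave like:

```python
# Illustrative model of per-op lookup with a per-key boxed fallback.
dispatch_table = {"aten::add": {"CPU": lambda stack: stack[0] + stack[1]}}

def lazy_default_dispatch(op_id, stack):
    # One polymorphic boxed function for a whole backend: it receives the
    # operator it must process plus boxed arguments; here it just records.
    return ("deferred", op_id, tuple(stack))

default_dispatch_table = {"Lazy": lazy_default_dispatch}

def call(op_id, key, stack):
    op = dispatch_table.get(op_id, {}).get(key)
    if op is None and key in default_dispatch_table:
        # The BoxedOpFunctor role: close over op_id so the default kernel
        # knows which operator it is handling.
        fallback = default_dispatch_table[key]
        op = lambda s: fallback(op_id, s)
    if op is None:
        raise RuntimeError(f"no kernel for {op_id} on {key}")  # fallback logic
    return op(stack)

print(call("aten::add", "CPU", [1, 2]))   # -> 3
print(call("aten::add", "Lazy", [1, 2]))  # -> ('deferred', 'aten::add', (1, 2))
```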

Contributor

How much of this is actually changed, versus just code motion? (Can we split this off into a diff before this in the stack?)

Contributor

Furthermore, I would not expect our executors to change at all with this patch. Why did they need to change? This part of the PR is very hard to review because it mixes a ton of code motion with some changes.

Contributor Author

yea, it'll be split out in any later (non-WIP) diffs

Contributor

I know in earlier conversation, @zdevito has objected to API creep here. I personally think these functions are probably useful as a debugging mechanism, but we probably shouldn't make them methods (and give them suitably private sounding names).

Contributor

This doesn't make sense to me, there is already a device_opt_ on the parent struct...

Contributor

Ditto here

Contributor

It looks like in some cases, schema_ == nullptr. When this is the case, what does the lazy tensor semantically represent? It kind of sounds like you have an invariant like "if schema is null, then inps_.size() == 1". Is this true?

Contributor

@ezyang ezyang Sep 6, 2019

Based on further reading, I think this invariant holds? It would be clearer if there were a more semantic function we use to test for this case, rather than manually always poking to check if schema_ is null; maybe something like:

optional<Tensor> constant_opt();

Contributor

@ezyang ezyang Sep 6, 2019

I know in this particular case it is safe, but unchecked static casts always make me nervous (probably because I'm afraid someone will copy-paste this code to a place where the precondition that a tensor is lazy is NOT satisfied).

Contributor

This looks highly questionable. Can't we just impose the invariant that the result of to_eager() is always a Variable? When can it not be a Variable?

Contributor Author

I empirically found this to fail; I'll dig into what is going on here.

Contributor

Inferring that the dtype/device of the result is the same as the first argument will not work for many operations. (I guess lazy tensor is going to need this information, somehow, too.)

Contributor

Err... a std::map? Really?

Contributor

(is this for determinism?)

Contributor Author

will swap to unordered, oversight on my part

Contributor

In this struct, I don't see any place for storing materialized results after you run to_eager on a graph. This seems to imply to me that if I do something like:

x = ... some lazy computation ...
x.to_eager()
x.to_eager()

I will end up redoing the entirety of lazy computation twice. Is that true?

Contributor Author

yep, that's true
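The caching being discussed here can be sketched as follows (names are hypothetical, not the PR's API): store the materialized result on the first to_eager() so a second call returns the cached value instead of re-running the recorded graph.

```python
# Hypothetical memoization sketch for repeated to_eager() calls.
class LazyNode:
    def __init__(self, fn):
        self.fn = fn        # the deferred computation (stands in for a graph)
        self.cached = None  # materialized result, filled on first run
        self.runs = 0       # execution counter, to demonstrate the caching

    def to_eager(self):
        if self.cached is None:
            self.runs += 1
            self.cached = self.fn()
        return self.cached

x = LazyNode(lambda: sum(range(1000)))
x.to_eager()
x.to_eager()
print(x.runs)  # -> 1: the computation ran once, not twice
```

One design wrinkle this glosses over: a cache like this must be invalidated if any input to the graph is mutated in place.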

Contributor

Didn't you add a t.is_lazy() specifically to not have to type all this out? XD

Contributor

I've been thinking about whether, in the "type set" world, lazy tensors should also have dispatch keys for, e.g., variable and cpu/cuda tensor.

On the side of yes: tensor type sets are currently used for instanceof tests, and in transparent lazy usage, you want lazy cpu tensors to report as cpu tensors

On the side of no: tensor type sets are used to make claims about the structure of the TensorImpl; but a lazy cpu tensor "is not a" cpu tensor.

@zdevito Maybe this means we really should organize one of these notions some other way on Tensor.

Contributor

Dead variable?

Contributor

Question for the design of boxed dispatch key fallback: in your prototype, you save const char* schema. However, tracing the usages of it, it seems that what you really care about in the end is the symbol corresponding to the function, so you can make a JIT op for it. (1) Is this true? (2) Are there other uses of, e.g., FunctionSchema or similar that you anticipate needing?

Contributor

No actual change here right?

Contributor

@apaszke apaszke left a comment

This is quite interesting. I've been thinking quite a bit about how the JIT and autograd should relate in the future, and this seems like a step in a direction that I've been considering. There's one big issue though: torch::jit::Graph and all other JIT abstractions are not designed to be manipulated in the hot path. There's a whole lot of pointer chasing that makes it very convenient to implement many algorithms, but would completely kill our performance. Are we planning to reevaluate that implementation and possibly change how we represent things?

@ezyang
Contributor

ezyang commented Sep 10, 2019

Are we planning to reevaluate that implementation and possibly change how we represent things?

Yes, an alternate IR representation for speed is on the table. JIT IR is being used here because it's easy, and there are plenty of other problems to solve even without taking on IR design to start.

Contributor

@zdevito zdevito left a comment

I think this needs more documentation and explanation before I can give an accurate review. In particular I'd like to see:

  • An overview of the lifetime of a lazy tensor, and which code pieces introduce laziness and remove it.
  • Documentation of the machinery: there are things like make_lazy and its counterpart to make things eager. What are they? Can we put this machinery in its own files so it can be understood to be distinct from the standard pieces of the ATen core?
  • An explanation of why details of the graph executor got exposed. I can't find the code that uses them, and I do not believe it should have had to change.


Contributor

These are implementation details of the executor. They should not be exposed.

Contributor Author

I just want to use DifferentiableGraphOp, which seems to pull all of this in

Contributor

But why do you need that?

Contributor

I can't follow what is going on here. Can you document it? Why is this working with raw char* schema strings?

@jeffreyksmithjr
Contributor

Request/suggestion: we could use some sort of lazy tag on GH, either as a module:lazy variant or just a plain old lazy label.

This is required for boxed backend fallback kernels (e.g. lazy, AMP) because they need to know which op was actually called.

Differential Revision: [D18282746](https://our.internmc.facebook.com/intern/diff/D18282746/)

[ghstack-poisoned]
This makes for a nicer API, especially in backend fallback kernels, which get an OperatorHandle instance and can directly call these methods on it.

Differential Revision: [D18357424](https://our.internmc.facebook.com/intern/diff/D18357424/)

[ghstack-poisoned]
…nel API"

This argument is needed by boxing wrappers so they're able to get a pointer to the corresponding unboxed kernel and call into it.
But if a kernel is registered in a boxed way, we don't need it and should hide this from the API.
This is especially needed for the backend fallback API where users would only be left wondering why this argument is there and what it does.
Also, hiding it allows us to potentially totally remove it in a future refactoring if we find some way to do so.

Differential Revision: [D18361991](https://our.internmc.facebook.com/intern/diff/D18361991/)

[ghstack-poisoned]
Remove callUnboxedOnly() and instead use metaprogramming to figure out if an operator can use a boxed fallback or not.
This enables boxed fallback for ops in native_functions.yaml even if they don't have `use_c10_dispatcher: full` set, as long as they're in the range of supported types.

Differential Revision: [D18462653](https://our.internmc.facebook.com/intern/diff/D18462653/)

[ghstack-poisoned]
This PR re-introduces backend_fallback_test.cpp, which was previously called boxed_fallback_test.cpp and showed how to use the backend fallback API.

Differential Revision: [D18462654](https://our.internmc.facebook.com/intern/diff/D18462654/)

[ghstack-poisoned]
@bwasti
Contributor Author

bwasti commented Nov 16, 2019

rebased onto unblocking PR:
#29682

we're back in business!

smessmer and others added 2 commits November 16, 2019 16:23
@mcarilli
Collaborator

mcarilli commented Nov 18, 2019

@bwasti I'm trying to implement autocasting (automatic casting of inputs to the most performant precision for a given op). We can talk on slack if you want more context but I'm also interposing a layer on the dispatch path (for a new "AutocastTensorId") that casts inputs to the desired type then runs the op. There are a few points I'm curious about, loosely organized in two topics:

1. Will autocasting play well with LazyTensor?

I think autocasting is compatible with your approach, as long as AutocastTensorId is given higher priority than LazyTensorId. Any casts to half or float will redispatch to subsequent dispatch layers, including the LazyTensor layer, so the graph that LazyTensor records will include all the casts, and jit will have the freedom to fuse them as it sees fit.

Currently, I give AmpTensorId higher priority than VariableTensorId, to ensure that inputs saved for backward by a given op are post-cast (saving inputs after a cast to FP16 provides significant memory savings). It appears you're giving LazyTensorId lower priority than VariableTensorId (i.e., routing through the LazyTensor hijack occurs after VariableTensorId's autograd history recording). So as-is, AmpTensorId will have higher priority than LazyTensorId, and I think autocasting will work with LazyTensors, but I don't want to accidentally paint myself into a corner. Please let me know if my approach might break something.

2. What's your planned API? Can I draw inspiration from it?

I'm also trying to figure out the python API for autocasting. Any information I can gather on other APIs that alter/control how ops are dispatched is valuable. Right now, we're leaning towards exposing autocasting via a context manager that sets AmpTensorId to be globally included, but not baking AmpTensorId into the tensors themselves. Based on your tests, it seems you're pursuing the other approach: laziness is carried as a tensor attribute, and the user-facing control points are tensor.to_lazy() and tensor.to_eager(). Is that correct? Is it complete, or are there other APIs you're considering? Also, what happens with c = a + b where a is a LazyTensor and b is not? My guess based on the usual dispatch logic is that it would route through LazyTensor and c would be a lazy tensor as well, since the dispatch ID is computed as a superset, but I'm wondering if there's any special-casing you considered.

Contributor

@facebook-github-bot facebook-github-bot left a comment

@bwasti has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bwasti
Contributor Author

bwasti commented Nov 19, 2019

  1. Will autocasting play well with LazyTensor?
    I give AmpTensorId higher priority than VariableTensorId

Yep, LazyTensor is conceptually just a hack to inject a lazy execution model into PyTorch. It can really function at any level, but below autograd makes the most sense imo. Seems like it'll play well with automatic casting :)

  1. What's your planned API? Can I draw inspiration from it?
    tensor.to_lazy() and tensor.to_eager()

Yep, that's pretty much it. Ideally there wouldn't be any user interaction at the tensor level, so I don't think I'll be adding anything to the API.

c = a + b where a is a LazyTensor and b is not

I'm not sure this is well handled, but ideally b would be implicitly "cast" to LazyTensor.

@mcarilli
Collaborator

mcarilli commented Nov 20, 2019

@bwasti Thanks! To elaborate a bit on what I said earlier, the current default behavior is to compute an op's TensorTypeSet by taking the superset (inclusive or) of the TensorTypeSets of any participating Tensors, then folding in (inclusive or) the thread-local tensor-independent included TensorTypeSet, then subtracting out the thread-local excluded TensorTypeSet. Maybe I'm repeating stuff you already know, but I expect c = a + b (where a is a LazyTensor and b is not) will route through the LazyTensor hijacking and produce c as another LazyTensor. I was wondering if you had additional code to explicitly alter this default behavior.
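The default TensorTypeSet computation described above can be modeled with plain set arithmetic (the key names here are stand-ins for the real dispatch key enum):

```python
# Illustrative model: superset (inclusive or) of all input tensors' key
# sets, fold in the thread-local included set, subtract the excluded set.
def compute_dispatch_set(tensor_sets, tls_included, tls_excluded):
    s = set()
    for ts in tensor_sets:
        s |= ts              # superset of participating tensors' keys
    s |= tls_included        # fold in thread-local included keys
    s -= tls_excluded        # subtract thread-local excluded keys
    return s

a = {"Lazy", "Variable", "CPU"}  # a is a LazyTensor
b = {"Variable", "CPU"}          # b is a plain tensor
keys = compute_dispatch_set([a, b], set(), set())
print("Lazy" in keys)  # -> True: c = a + b routes through the lazy layer
```

Under this default behavior, c = a + b picks up the Lazy key from a alone, which is why c comes out lazy without any special-casing.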

The to_lazy()/to_eager() interface shows us an interesting alternative to a context manager. We could mimic that with to_amp() to assign autocastability, and explicit calls to float() or half() would remove autocastability.

In your case, there are a couple details I'm additionally curious about:
If I'm reading the code correctly, a.to_lazy() stashes a as an input to the graph. Does it also deep-copy the memory as it does so? And is a itself altered by a.to_lazy(), or does a.to_lazy() simply return a new "Tensor" (lightweight handle without any actual associated memory) that can be used to build the graph further?

Finally, what's the default behavior of print(a) or a.item() if a is a lazy tensor? Does that automatically trigger a to_eager(), or will it error and request that you call to_eager manually?

Comment on lines -50 to -53
if (!tensors.empty()) {
TORCH_INTERNAL_ASSERT(
torch::autograd::compute_requires_grad(tensors),
"Received tensors do not require grad, addRecvRpcBackward should not be called");
Contributor

Wondering if these changes in distributed are intended?

std::vector<char>(pickledPythonUDF.begin(), pickledPythonUDF.end()),
tensors);
return sendMessageWithAutograd(
agent, dst, std::move(*pythonCall).toMessage());
Contributor

same as above

@bwasti
Contributor Author

bwasti commented Dec 3, 2019

due to an internal build requirement -- LazyTensor impl has moved to this PR, which is exported from internal code:

#30674


Labels

cla signed module: cpp Related to C++ API module: cpp-extensions Related to torch.utils.cpp_extension module: cuda Related to torch.cuda, and CUDA support in general module: internals Related to internal abstractions in c10 and ATen module: lazy module: nn Related to torch.nn module: sparse Related to torch.sparse oncall: jit Add this issue/PR to JIT oncall triage queue
