
Add FakeTensorMode #77972

Closed

eellison wants to merge 16 commits into gh/eellison/295/base from gh/eellison/295/head

Conversation

@eellison
Contributor

@eellison eellison commented May 20, 2022

Stack from ghstack (oldest at bottom):

This adds a mode which will intercept calls to `__torch_dispatch__` even if the inputs are not already `FakeTensor`s. This mimics the convenient [prior existing usage](https://pytorch.org/torchdistx/latest/fake_tensor.html). It does so by wrapping input tensors in `FakeTensor`s and then continuing to run the operators.

Not Yet Implemented:

I still need to memoize conversion of non-fake tensors to fake tensors (and internally, to `meta` devices), following along with the [class here](https://github.com/pytorch/pytorch/blob/master/test/test_meta.py#L70).

One open question is what the duration of the `FakeTensorConverter` should be. IMO, it would make sense and be convenient for it to live for the duration of the `FakeTensorMode`. Since we shouldn't be allocating any new tensors with actual data (just on `meta` devices), it is probably fine for those tensors to live for the duration of the `FakeTensorMode`.

If that is not sufficient, we could try using a `weakref.WeakKeyDictionary` mapping tensors to their fake equivalents. I looked into this a bit and there are at least a few incompatibilities that need to be dealt with.
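The `weakref.WeakKeyDictionary` idea can be sketched in plain Python. The `RealTensor` and `FakeTensor` classes below are hypothetical stand-ins for illustration, not the PyTorch types:

```python
import weakref

class FakeTensor:
    """Stand-in for a fake tensor: records only metadata, no real data."""
    def __init__(self, shape, device):
        self.shape = shape
        self.device = device

class RealTensor:
    """Hypothetical stand-in for a real torch.Tensor."""
    def __init__(self, shape, device="cpu"):
        self.shape = shape
        self.device = device

class FakeTensorConverter:
    """Memoizes real-to-fake conversion so each real tensor maps to one fake tensor.

    A WeakKeyDictionary lets each cache entry die with its real tensor, so the
    converter can live for the duration of the mode without pinning inputs alive.
    """
    def __init__(self):
        self._cache = weakref.WeakKeyDictionary()

    def from_real(self, t):
        if t not in self._cache:
            self._cache[t] = FakeTensor(t.shape, t.device)
        return self._cache[t]

conv = FakeTensorConverter()
x = RealTensor((2, 3))
assert conv.from_real(x) is conv.from_real(x)  # memoized: same fake both times
```

This is only a sketch of the caching shape under those assumptions; the incompatibilities mentioned above (e.g. which tensor objects are weak-referenceable) are exactly what the real implementation has to resolve.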

@facebook-github-bot
Contributor

facebook-github-bot commented May 20, 2022


❌ 1 New Failures

As of commit 791b39f (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-focal-py3.7-gcc7-mobile-lightweight-dispatch-build / build (1/1)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2022-05-31T14:14:47.7623462Z ##[error]Process completed with exit code 137.
2022-05-31T14:12:32.4887724Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/x8-lut/gen/lut-avx512skx-vpshufb-x256.c.o
2022-05-31T14:12:32.5971334Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2-k-over-64.c.o
2022-05-31T14:12:32.6515761Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2-k-over-2048.c.o
2022-05-31T14:12:32.7050982Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2minus-k-over-4.c.o
2022-05-31T14:12:32.7556644Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2minus-k-over-8.c.o
2022-05-31T14:12:32.8127663Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2minus-k-over-16.c.o
2022-05-31T14:12:32.8659333Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2minus-k-over-64.c.o
2022-05-31T14:12:32.9194429Z [ 98%] Building C object confu-deps/XNNPACK/CMakeFiles/all_microkernels.dir/src/tables/exp2minus-k-over-2048.c.o
2022-05-31T14:12:32.9933254Z [ 98%] Built target all_microkernels
2022-05-31T14:12:32.9999721Z [ 98%] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/RegisterCodegenUnboxedKernels_7.cpp.o
2022-05-31T14:14:47.7623462Z ##[error]Process completed with exit code 137.
2022-05-31T14:14:47.8232475Z Prepare all required actions
2022-05-31T14:14:47.8314724Z ##[group]Run ./.github/actions/teardown-linux
2022-05-31T14:14:47.8315081Z with:
2022-05-31T14:14:47.8315340Z env:
2022-05-31T14:14:47.8315615Z   IN_CI: 1
2022-05-31T14:14:47.8315896Z   IS_GHA: 1
2022-05-31T14:14:47.8316167Z ##[endgroup]
2022-05-31T14:14:47.8360161Z ##[group]Run .github/scripts/wait_for_ssh_to_drain.sh
2022-05-31T14:14:47.8360635Z .github/scripts/wait_for_ssh_to_drain.sh
2022-05-31T14:14:47.9517213Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


eellison pushed a commit that referenced this pull request May 20, 2022
ghstack-source-id: cf2089f
Pull Request resolved: #77972
@eellison eellison requested review from Chillee and ezyang May 20, 2022 16:18
eellison pushed a commit that referenced this pull request May 20, 2022
ghstack-source-id: 62ea367
Pull Request resolved: #77972
```python
    return common_device

class FakeTensorMode(FakeTensor):
    context = no_dispatch
```
Contributor

Can we do a modern style mode instead pretty please :)

Contributor

You didn't actually use context, AFAICT?

Contributor Author

@eellison eellison May 23, 2022

What is a modern style mode?

Contributor

inherit from TorchDispatchMode

Contributor Author

Will do... mind linking me to the differences between the two, and why one should use the modern style over the existing one?

Contributor

the main difference is you can store instance variables on the mode, since it is an actual object not a class
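The difference can be sketched in plain Python. The `DispatchMode` class below is a hypothetical stand-in, not the real `TorchDispatchMode` API: an instance-style mode is an object used as a context manager, so per-use state such as a converter cache can live on the instance rather than on the class:

```python
# Hypothetical sketch: a "modern style" mode is an instance, not a class,
# so each use gets its own state (e.g. a FakeTensorConverter cache).
class DispatchMode:
    def __init__(self):
        self.converter_cache = {}   # instance state, impossible on a class-level mode
        self.active = False

    def __enter__(self):
        self.active = True
        return self

    def __exit__(self, *exc):
        self.active = False
        return False

with DispatchMode() as mode:
    mode.converter_cache["x"] = "fake_x"
    assert mode.active

assert not mode.active
assert mode.converter_cache["x"] == "fake_x"  # state survives on the instance
```

Two independent `DispatchMode()` instances would each carry their own cache, which is the practical payoff over a class-level mode.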

Contributor Author

nice... I probably don't need `setup_mode` then

```python
# TODO: no real reason to restrict multiple outputs
return (
    len(schema.returns) == 1 and schema.returns[0].type is torch._C.TensorType.get()
)
```
Contributor

Tag opportunity :) cc @anjali411

```python
    func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True
)
# cpu is default device if none is specified
out_device = new_kwargs.pop("device", torch.device("cpu"))
```
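The default-device fallback above follows the ordinary `dict.pop` pattern; a minimal sketch with a hypothetical helper and plain strings standing in for `torch.device`:

```python
# Hypothetical helper illustrating the pattern: pop an optional "device"
# kwarg, falling back to a default when the caller did not specify one.
def resolve_out_device(new_kwargs, default="cpu"):
    return new_kwargs.pop("device", default)

kwargs = {"dtype": "float32"}
assert resolve_out_device(kwargs) == "cpu"    # default when unspecified
kwargs = {"device": "cuda"}
assert resolve_out_device(kwargs) == "cuda"   # explicit device wins
```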
Contributor

technically it's `torch.get_default_tensor_type()` lol but ok

Contributor Author

which is immutable 😛

Elias Ellison added 4 commits May 23, 2022 13:28
@eellison
Contributor Author

@eellison has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@eellison
Contributor Author

@eellison has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Differential Revision: [D36618464](https://our.internmc.facebook.com/intern/diff/D36618464)
eellison pushed a commit that referenced this pull request May 24, 2022
ghstack-source-id: 00439da
Pull Request resolved: #77972
@eellison
Contributor Author

@eellison has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@eellison
Contributor Author

I actually think the current behavior makes sense. The easiest mental model IMO is that everything is simulated, and that no real tensors will be affected. I think allowing in-place ops, views, and other things to actually affect the input tensors would be a mistake. If you did, you would get into a situation where `input.t_()` does affect your input, but `input.add_(fake_tensor)` doesn't...

Additionally, the behavior above with `resize_` should just work, because the first use of that tensor will get converted to fake, and then for any subsequent uses, the cached version with the `resize_` applied will be used, which correctly simulates the compute you would have done.
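The caching behavior described above can be sketched in plain Python (hypothetical `Fake` class and cache dict, not the real converter): in-place ops mutate only the cached fake, the real input is untouched, and later uses of the same input observe the simulated `resize_` through the cache:

```python
class Fake:
    """Hypothetical fake-tensor stand-in holding only shape metadata."""
    def __init__(self, shape):
        self.shape = shape

cache = {}

def to_fake(real_id, shape):
    # First use converts and caches; subsequent uses hit the cache.
    if real_id not in cache:
        cache[real_id] = Fake(shape)
    return cache[real_id]

f = to_fake("input", (2, 3))
f.shape = (4, 4)                                  # simulated resize_ on the fake only
assert to_fake("input", (2, 3)).shape == (4, 4)   # a subsequent use sees it
```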

Elias Ellison added 3 commits May 24, 2022 15:59
@ezyang
Contributor

ezyang commented May 25, 2022

We discussed this in person and we decided that for torchdynamo the easiest thing will be to just wrap all the tensors as fake tensors before running the computation. That means we don't have to support in place modifying non-fake tensors.
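That decision can be sketched in plain Python (hypothetical names throughout): wrap every input as a fake before calling the function, so the computation only ever mutates fakes and real inputs never need in-place support:

```python
class Fake:
    """Hypothetical wrapper standing in for a fake tensor."""
    def __init__(self, value):
        self.value = value

def run_fakeified(fn, inputs):
    # Wrap everything up front; fn only ever sees fakes.
    fakes = [x if isinstance(x, Fake) else Fake(x) for x in inputs]
    return fn(*fakes)

def double_inplace(t):
    t.value = t.value * 2   # the "in-place" op touches only the fake
    return t

real = 21
out = run_fakeified(double_inplace, [real])
assert out.value == 42
assert real == 21  # the real input is untouched
```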

@eellison eellison mentioned this pull request May 31, 2022
@eellison
Contributor Author

@pytorchbot merge this please

@github-actions
Contributor

Hey @eellison.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

dzdang added a commit that referenced this pull request May 31, 2022
Pull Request resolved: #77972

Approved by: https://github.com/ezyang
ghstack-source-id: de497bb
facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2022
Summary:
Pull Request resolved: #77972

Approved by: https://github.com/ezyang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/cea7dd1646ab147edac8f0e22f0aa85cf3136fef

Reviewed By: seemethere

Differential Revision: D36784784

Pulled By: seemethere

fbshipit-source-id: 55175d158483e4b388402a4ddcc273b69ef403c7
```python
def _is_tensor_constructor(func: OpOverload):
    assert isinstance(func, OpOverload)
    schema = func._schema
    if any(contains_tensor_types(arg.type) for arg in schema.arguments):
```
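A plain-Python analogue of the predicate above (the `is_tensor_constructor` helper over lists of type names is hypothetical, not the real schema objects): a tensor constructor takes no tensor arguments but returns a tensor. Note that a `*_like` op does take a tensor argument, so this check rejects it, which is the gap raised in the review below:

```python
def is_tensor_constructor(arg_types, return_types):
    # Any tensor argument disqualifies the op (this is why *_like ops fail here).
    if "Tensor" in arg_types:
        return False
    # Single tensor return, mirroring the single-output restriction above.
    return return_types == ["Tensor"]

assert is_tensor_constructor([], ["Tensor"])              # e.g. a randn-style op
assert not is_tensor_constructor(["Tensor"], ["Tensor"])  # e.g. an ones_like-style op
```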
Contributor

what about `_like` ops?

Contributor Author

those got added later

Contributor

@anjali411 anjali411 Jun 14, 2022

I mean, `_is_tensor_constructor` would still return `False` for them, but it should return `True`, right?

Contributor Author

ya. `_no_tensor_arg_constructor` is a better name
