
Remove eager mode support from CommTensor #84978

Closed
mrshenli wants to merge 6 commits into gh/mrshenli/335/base from gh/mrshenli/335/head

Conversation

@mrshenli
Contributor

@mrshenli mrshenli commented Sep 14, 2022

Stack from ghstack (oldest at bottom):

We don't need eager mode support (automatic wait on read) for now.
Removing it to simplify the code. We can always add it back if
necessary in the future.

Note that we still need the eager-mode code in `__torch_dispatch__`,
as `make_fx` also runs the ops in eager mode to get the output.
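A torch-free sketch of why that eager path still matters: a `make_fx`-style tracer records each op into a graph and *also* executes it to produce real outputs. `RecordingTracer` below is a hypothetical stand-in for illustration, not the real `make_fx`:

```python
import operator

# Hypothetical stand-in for a make_fx-style tracer: it records each op
# into a graph *and* runs it eagerly to produce a real output.
class RecordingTracer:
    def __init__(self):
        self.graph = []

    def call(self, op, *args):
        self.graph.append((op.__name__, args))
        return op(*args)  # the op still executes eagerly during tracing

tracer = RecordingTracer()
out = tracer.call(operator.add, 1, 2)  # recorded AND evaluated
```

Because the traced op is actually evaluated, any eager-mode branch inside `__torch_dispatch__` is exercised during tracing too.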

@pytorch-bot

pytorch-bot bot commented Sep 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84978

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 2 Pending

As of commit 7648f90:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 14, 2022
Collaborator

@wanchaol wanchaol left a comment


Looks good to me overall; I have some questions about how/when `work.wait()` should be triggered in "eager mode".



def _get_tracer(obj: Any) -> Optional[torch.fx.Tracer]:
    slots = get_proxy_slots(obj)
Collaborator


There's a global fx tracer table, and that's why we can do this trick?

Contributor Author


Yep, exactly. Below is how `proxy_tensor.py` sets the slots:

def set_proxy_slot(obj, tracer, proxy):
    d = obj.__dict__.setdefault(proxy_slot, weakref.WeakKeyDictionary())
    assert isinstance(d, weakref.WeakKeyDictionary)
    d[tracer] = proxy

And it uses a similar approach to get the proxy instance:

# Gets the proxy for a tensor, if it exists.
def get_proxy(obj):
    res = get_proxy_slots(obj)
    if res is None:
        return None
    vals = tuple(res.values())
    assert len(vals) == 1
    return vals[0]

t = tensor._tensor if isinstance(tensor, CommTensor) else tensor
if _get_tracer(t) is None:
    # noop for eager mode
    return tensor
Collaborator


What would eager mode look like then? I think for device mesh collectives we would always wrap the input tensor in CommTensor, so for "eager mode" we just return the same tensor, and users of device mesh would need to manually call `work.wait()` to sync the stream and get the results?

Contributor Author


> users of device mesh would need to manually call `work.wait()` to sync the stream and get the results?

Yep, in eager mode, users/DT will have to explicitly call wait. I assume that is already the case today; otherwise it wouldn't be semantically correct? Let me know if DT relies on CommTensor to correctly wait for communication. In that case, I will revert this change and keep only the cleanup code.
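The explicit-wait contract discussed above can be sketched with the standard library. `Work` here is a stand-in for torch.distributed's work handle, not the real class:

```python
import threading

class Work:  # stand-in for torch.distributed's Work handle
    def __init__(self, fn):
        self._done = threading.Event()
        self._result = None

        def run():
            self._result = fn()
            self._done.set()

        threading.Thread(target=run).start()

    def wait(self):
        # Without automatic wait-on-read, the caller must synchronize
        # explicitly before consuming the result of an async collective.
        self._done.wait()
        return self._result

work = Work(lambda: sum(range(10)))
result = work.wait()  # explicit synchronization point
```

Reading the result before calling `wait()` would race with the background work, which is exactly the hazard the removed automatic wait-on-read used to paper over.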

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/335/head branch September 18, 2022 14:20
mehtanirav pushed a commit that referenced this pull request Oct 4, 2022
Pull Request resolved: #84978
Approved by: https://github.com/wanchaol

Labels

cla signed · oncall: distributed · topic: not user facing


3 participants