Test tracing consecutive comms on the same input tensor #84980

Closed

mrshenli wants to merge 3 commits into gh/mrshenli/336/base from gh/mrshenli/336/head

Conversation


@mrshenli mrshenli commented Sep 14, 2022


pytorch-bot bot commented Sep 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84980

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures, 2 Pending

As of commit bb9024a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Sep 14, 2022
@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed labels Sep 14, 2022
@mrshenli (Contributor, Author)

We can see that the 2nd allreduce properly waits for the completion of the 1st allreduce, and the final result properly waits for the 2nd allreduce.

```
opcode         name                  target                                          args                                                     kwargs
-------------  --------------------  ----------------------------------------------  -------------------------------------------------------  --------
placeholder    x_1                   x_1                                             ()                                                       {}
call_function  add                   aten.add.Tensor                                 (x_1, x_1)                                               {}
get_attr       _tensor_constant0     _tensor_constant0                               ()                                                       {}
get_attr       _tensor_constant1     _tensor_constant1                               ()                                                       {}
call_function  allreduce__default    c10d.allreduce_.default                         ([add], _tensor_constant0, _tensor_constant1, -1)        {}
call_function  comm_result           <function _wrap_comm_result at 0x7f2c12ee90d0>  (allreduce__default,)                                    {}
call_function  getitem               <built-in function getitem>                     (comm_result, 0)                                         {}
call_function  getitem_1             <built-in function getitem>                     (getitem, 0)                                             {}
call_function  wait_comm             <function _wait_comm at 0x7f2c1b25f310>         (getitem_1,)                                             {}
get_attr       _tensor_constant2     _tensor_constant2                               ()                                                       {}
get_attr       _tensor_constant3     _tensor_constant3                               ()                                                       {}
call_function  allreduce__default_1  c10d.allreduce_.default                         ([wait_comm], _tensor_constant2, _tensor_constant3, -1)  {}
call_function  comm_result_1         <function _wrap_comm_result at 0x7f2c12ee90d0>  (allreduce__default_1,)                                  {}
call_function  getitem_3             <built-in function getitem>                     (comm_result_1, 0)                                       {}
call_function  getitem_4             <built-in function getitem>                     (getitem_3, 0)                                           {}
call_function  wait_comm_1           <function _wait_comm at 0x7f2c1b25f310>         (getitem_4,)                                             {}
call_function  mul                   aten.mul.Tensor                                 (wait_comm_1, 2)                                         {}
output         output                output                                          (mul,)                                                   {}
```
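For context, an eager-mode function with the same shape as the graph above (add, two consecutive in-place allreduces on the same tensor, then multiply) might look like the following. This is a sketch, not the PR's actual test code; the single-rank gloo process group is only there so the snippet can run on CPU anywhere.

```python
import os
import torch
import torch.distributed as dist

# Single-rank gloo group so this sketch runs on CPU (illustrative setup only).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

def comm_fn(x):
    y = x + x
    work1 = dist.all_reduce(y, async_op=True)  # 1st in-place allreduce
    work1.wait()                               # 2nd allreduce must see its result
    work2 = dist.all_reduce(y, async_op=True)  # 2nd allreduce on the same tensor
    work2.wait()                               # final result waits on the 2nd
    return y * 2

out = comm_fn(torch.ones(2))
# With world_size == 1 each allreduce is an identity, so out == (1 + 1) * 2.
print(out)
dist.destroy_process_group()
```

With more than one rank, each `all_reduce` would sum the tensor across ranks; the wait-before-reuse ordering is exactly the dependency chain that the traced graph above preserves.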

@mrshenli mrshenli requested a review from wanchaol September 14, 2022 03:39
@mrshenli mrshenli changed the title Support tracing consecutive comms on the same input tensor Test tracing consecutive comms on the same input tensor Sep 14, 2022
mrshenli added a commit that referenced this pull request Sep 14, 2022
```python
def _test_consecutive_comm_work_wait(self, tensor):
    def comm_fn(tensor, group=None):
        work1 = dist.all_reduce(tensor, group=group, async_op=True)
        work1.wait()
```
Collaborator

Shall we also have a test case where this work1.wait() is called after work2 is produced?

@mrshenli (Contributor, Author)

I can add that, but I am not sure we should. Waiting after work2 is produced is only correct because the ProcessGroup internally uses the same stream for communication. If that subtle assumption no longer holds, we could run into a race condition, e.g., the 2nd allreduce starts reading the memory before the 1st finishes writing to it.

Curious: is there a use case where we need to call work1.wait() after work2 is produced?
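To make the ordering concern concrete, here is a toy model of async work handles in pure Python (no torch, no real communication; the `Work` and `all_reduce` names are made up for illustration). It records event order to show the safe pattern: work1.wait() completes before the 2nd collective is launched, so the 2nd allreduce never reads the buffer while the 1st is still writing it.

```python
# Toy model of async comm handles; purely illustrative, no real communication.
log = []

class Work:
    """Stand-in for a c10d Work handle that just records events."""
    def __init__(self, name):
        self.name = name
        log.append(f"{name} launched")

    def wait(self):
        log.append(f"{self.name} waited")

def all_reduce(tensor, async_op=True):
    # Number handles by how many collectives were launched so far.
    idx = sum(1 for e in log if e.endswith("launched")) + 1
    return Work(f"allreduce#{idx}")

buf = [1.0]
w1 = all_reduce(buf)
w1.wait()              # complete the 1st before launching the 2nd
w2 = all_reduce(buf)
w2.wait()

print(log)
# ['allreduce#1 launched', 'allreduce#1 waited',
#  'allreduce#2 launched', 'allreduce#2 waited']
```

Deferring `w1.wait()` until after `w2` is launched would interleave the two collectives on the shared buffer, which is only safe if some other mechanism (such as a single shared communication stream) serializes them anyway.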

@mrshenli (Contributor, Author)

@pytorchbot merge -g

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the green (-g) flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

pytorchmergebot pushed a commit that referenced this pull request Feb 15, 2024
Seeing this error for c10d tests when running on 1 GPU. Adding a skip when there are insufficient GPUs.

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
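A common way to guard such a test is a conditional skip on the available GPU count. The sketch below uses plain `unittest.skipIf` for illustration; PyTorch's own test suite has dedicated skip helpers for this purpose, and the class and test names here are hypothetical.

```python
import unittest

try:
    import torch
    gpu_count = torch.cuda.device_count()
except ImportError:  # fallback so the sketch is self-contained without torch
    gpu_count = 0

class TestC10d(unittest.TestCase):
    # Skip cleanly instead of hitting "CUDA error: invalid device ordinal"
    # when the test asks for a second device that does not exist.
    @unittest.skipIf(gpu_count < 2, "requires at least 2 GPUs")
    def test_needs_two_gpus(self):
        pass  # multi-GPU test body would go here
```

Run on a 1-GPU (or CPU-only) machine, the test reports as skipped rather than erroring out.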
referring to #84980

Pull Request resolved: #119938
Approved by: https://github.com/eqy, https://github.com/fegin

Labels

cla signed · Merged · oncall: distributed · topic: not user facing

4 participants