make DebugMode wait on collectives before hashing outputs#169295
make DebugMode wait on collectives before hashing outputs#169295bdhirsh wants to merge 16 commits intogh/bdhirsh/680/basefrom
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169295
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New FailureAs of commit 9aadcdd with merge base c49edf7 ( NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| # This doesn't actually test that we synchronize on collectives. | ||
| # I found this hard to test robustly since previously we would race. | ||
| # However, it does test that we do not crash. | ||
| # This doesn't actually test that we synchronize on collectives, |
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
|
fyi I landed #169027! |
torch/utils/_debug_mode.py
Outdated
| if ( | ||
| a.name == "async_op" | ||
| and type(a.type) is torch.BoolType | ||
| and i < len(args) |
There was a problem hiding this comment.
I'm doing some testing with _allgather_base_, and I think this check might be too restrictive, because in eager, aync_op is default to True, and it doesn't show up in args, but we should wait for it
There was a problem hiding this comment.
This worked for me:
for i, a in enumerate(func._schema.arguments):
# Annoying, async_op is not a kwarg in the non-functional collectives.
# I'm using OpOverload._schema to figure out if it exists
if a.name == "async_op" and type(a.type) is torch.BoolType:
# Determine async_op value from args, kwargs, or default
if i < len(args):
async_op_value = args[i]
elif "async_op" in kwargs:
async_op_value = kwargs["async_op"]
elif a.has_default_value():
async_op_value = a.default_value
else:
async_op_value = False
if async_op_value:
for item in result:
if isinstance(item, torch.ScriptObject) and hasattr(
item, "wait"
):
item.wait()
break
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks: (1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment)) (2) For functional collectives, we can always call `wait_tensor` on the output. The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase. [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 2 checks: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15), trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: Command Details for Dev Infra teamRaised by workflow job |
this is an attempt to re-land #168119 with a few tweaks:
(1) for non-functional collectives, only wait on the work item with
async=True. See comment(2) For functional collectives, we can always call
wait_tensoron the output.The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.
Stack from ghstack (oldest at bottom):