Skip to content

make DebugMode wait on collectives before hashing outputs#169295

Open
bdhirsh wants to merge 16 commits intogh/bdhirsh/680/basefrom
gh/bdhirsh/680/head
Open

make DebugMode wait on collectives before hashing outputs#169295
bdhirsh wants to merge 16 commits intogh/bdhirsh/680/basefrom
gh/bdhirsh/680/head

Conversation

@bdhirsh
Copy link
Collaborator

@bdhirsh bdhirsh commented Dec 1, 2025

this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with async=True. See comment

(2) For functional collectives, we can always call wait_tensor on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.

Stack from ghstack (oldest at bottom):

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169295

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 9aadcdd with merge base c49edf7 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh added a commit that referenced this pull request Dec 1, 2025
@bdhirsh bdhirsh requested review from pianpwk and yushangdi December 1, 2025 15:12
# This doesn't actually test that we synchronize on collectives.
# I found this hard to test robustly since previously we would race.
# However, it does test that we do not crash.
# This doesn't actually test that we synchronize on collectives,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit on duplicated comment

Copy link
Contributor

@pianpwk pianpwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Brian!

this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 1, 2025
@yushangdi
Copy link
Contributor

fyi I landed #169027!

if (
a.name == "async_op"
and type(a.type) is torch.BoolType
and i < len(args)
Copy link
Contributor

@yushangdi yushangdi Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm doing some testing with _allgather_base_, and I think this check might be too restrictive, because in eager, aync_op is default to True, and it doesn't show up in args, but we should wait for it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked for me:

            for i, a in enumerate(func._schema.arguments):
                # Annoying, async_op is not a kwarg in the non-functional collectives.
                # I'm using OpOverload._schema to figure out if it exists
                if a.name == "async_op" and type(a.type) is torch.BoolType:
                    # Determine async_op value from args, kwargs, or default
                    if i < len(args):
                        async_op_value = args[i]
                    elif "async_op" in kwargs:
                        async_op_value = kwargs["async_op"]
                    elif a.has_default_value():
                        async_op_value = a.default_value
                    else:
                        async_op_value = False

                    if async_op_value:
                        for item in result:
                            if isinstance(item, torch.ScriptObject) and hasattr(
                                item, "wait"
                            ):
                                item.wait()
                    break

this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 12, 2025
@bdhirsh bdhirsh added the topic: not user facing topic category label Dec 12, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 12, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 15, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 26, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 31, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 31, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Dec 31, 2025
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Jan 2, 2026
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Jan 6, 2026
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Jan 6, 2026
this is an attempt to re-land #168119 with a few tweaks:

(1) for non-functional collectives, only wait on the work item with `async=True`. [See comment](#168119 (comment))

(2) For functional collectives, we can always call `wait_tensor` on the output.

The test in this PR will probably conflict with the test in #169027, so ill wait for that PR to land first and rebase.




[ghstack-poisoned]
bdhirsh added a commit that referenced this pull request Jan 7, 2026
@bdhirsh
Copy link
Collaborator Author

bdhirsh commented Jan 14, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 14, 2026
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team Raised by workflow job

@bdhirsh
Copy link
Collaborator Author

bdhirsh commented Jan 14, 2026

@pytorchbot merge -i

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15), trunk / linux-jammy-rocm-py3.10 / test (default, 2, 6, linux.rocm.gpu.gfx942.1)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 6071d23205b78735c3a03a00758aa345734a1f21 returned non-zero exit code 1

Auto-merging torch/_functorch/aot_autograd.py
CONFLICT (content): Merge conflict in torch/_functorch/aot_autograd.py
error: could not apply 6071d23205b... ensure user modes are disabled when running aot_compile_joint_with_descriptors
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request open source topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants