Skip to content

[inductor collectives] sink waits iterative#157708

Closed
IvanKobzarev wants to merge 4 commits intogh/IvanKobzarev/122/basefrom
gh/IvanKobzarev/122/head
Closed

[inductor collectives] sink waits iterative#157708
IvanKobzarev wants to merge 4 commits intogh/IvanKobzarev/122/basefrom
gh/IvanKobzarev/122/head

Conversation

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157708

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9a737bb with merge base d27d361 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue labels Jul 7, 2025
IvanKobzarev added a commit that referenced this pull request Jul 7, 2025
ghstack-source-id: 6c13e33
Pull Request resolved: #157708
@IvanKobzarev
Copy link
Contributor Author

@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 7, 2025
@IvanKobzarev IvanKobzarev added the topic: not user facing topic category label Jul 7, 2025
cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763)

[ghstack-poisoned]
@IvanKobzarev
Copy link
Contributor Author

@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

"reorder_for_compute_comm_overlap": True,
"reorder_for_compute_comm_overlap_passes": [
_reorder_communication_preserving_peak_memory,
sink_waits_iterative,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expected the sink would happen before the reorder. I guess now I'm not sure if it actually matters what order it is in.

For the case where collectives all independent of one another, i think moving the wait can be either before or after reorder, since the wait is not an obstacle for the reorder pass.

For the case where one collective produces a temporary that feeds into another collective, i guess it could be worse to sink waits first, becuase doing so could lock the second collective in later position.

OK- i'm convinced this is right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline - I've added a ban of swaps of groups that both contain collective_ops.
Once I added - the result of [reorder, sink_waits] and [sink_waits, reorder] became the same.

) -> tuple[list[BaseSchedulerNode], dict[BaseSchedulerNode, SinkWaitInfo]]:
from torch._inductor.scheduler import GroupedSchedulerNode, init_group_node

def _group_name(snode, with_bufs=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we elevate this to a util and use it from the reorder pass as well?

gsnode = gsnodes[i]
if contains_wait(gsnode):
info = stats[gsnode.snodes[0]] = SinkWaitInfo()
for j in range(i + 1, n):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since a lot of stuff is boilerplate copy/paste from the reorder pass (the loop structure, the logging stuff) i wonder if we can make a reusable thing that has callbacks for the internal logic.

otoh i'd probably rather get things working nicely first and think about this later

@@ -1613,6 +1614,7 @@ def _reorder_communication_preserving_peak_memory(
"reorder_for_compute_comm_overlap": True,
"reorder_for_compute_comm_overlap_passes": [
_reorder_communication_preserving_peak_memory,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it'd be good to add a test that shows that we can move a wait (wait_a) past a collective and past another wait (wait_b), in the case that wait_b and collective are not actually depending on the wait_a

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763)

[ghstack-poisoned]
@IvanKobzarev
Copy link
Contributor Author

@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D77861763](https://our.internmc.facebook.com/intern/diff/D77861763)

[ghstack-poisoned]
IvanKobzarev added a commit that referenced this pull request Jul 7, 2025
ghstack-source-id: b77eeb5
Pull Request resolved: #157708
@IvanKobzarev
Copy link
Contributor Author

@IvanKobzarev has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@IvanKobzarev
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/IvanKobzarev/122/head branch August 8, 2025 02:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: inductor oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants