
port distributed pipeline test files for Intel GPU#159033

Closed
wincent8 wants to merge 3 commits into pytorch:main from wincent8:wliao2/add_pipeline

Conversation

@wincent8
Contributor

@wincent8 wincent8 commented Jul 24, 2025

In this PR we port all the distributed pipeline test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:

  1. use instantiate_device_type_tests()
  2. use torch.accelerator.current_accelerator() to determine the accelerator backend
  3. use requires_accelerator_dist_backend() to replace requires_nccl()
  4. use get_default_backend_for_device() to get the backend
  5. enable XPU for some test paths
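The device-agnostic pattern behind steps 2–4 can be sketched in plain Python. The mapping below is an illustrative assumption for this sketch, not the actual implementation of get_default_backend_for_device in torch:

```python
# Hypothetical sketch of device-agnostic backend selection for porting tests.
# The device -> backend mapping is an assumption for illustration; the real
# helpers live in torch's internal testing/distributed utilities.

DEVICE_TO_BACKEND = {
    "cuda": "nccl",  # NVIDIA GPUs
    "xpu": "xccl",   # Intel GPUs
    "hpu": "hccl",   # Habana accelerators
    "cpu": "gloo",   # CPU fallback
}

def get_default_backend_for_device(device: str) -> str:
    """Return the default distributed backend for a device string like 'xpu:0'."""
    device_type = device.split(":")[0]
    try:
        return DEVICE_TO_BACKEND[device_type]
    except KeyError:
        raise ValueError(f"no default distributed backend for {device!r}")

print(get_default_backend_for_device("xpu:0"))  # xccl
print(get_default_backend_for_device("cuda"))   # nccl
```

Selecting the backend from the device string, rather than hard-coding "nccl", is what lets the same test body run on CUDA and XPU machines.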

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @gujinghui @fengyuan14 @guangyey

@wincent8 wincent8 requested a review from a team as a code owner July 24, 2025 10:31
@pytorch-bot

pytorch-bot bot commented Jul 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159033

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 28da538 with merge base 67fc16c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 24, 2025
@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 24, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@wincent8
Contributor Author

@pytorchbot label "module: xpu"
@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 25, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@pytorch-bot

pytorch-bot bot commented Jul 25, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@guangyey guangyey added the topic: not user facing topic category label Jul 25, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "module: xpu"

@wincent8
Contributor Author

@pytorchbot label "triaged"

devices = ["cpu", "cuda", "hpu", "xpu"]
instantiate_device_type_tests(UnflattenTests, globals(), only_for=devices)
instantiate_device_type_tests(
    UnflattenTests, globals(), only_for=devices, allow_xpu=True
Collaborator

Good catch!

Contributor Author

thanks

Collaborator

Why the need for adding allow_xpu=True ?

Collaborator

Without allow_xpu=True, these test cases will not actually be instantiated; refer to:

def get_desired_device_type_test_bases(
    except_for=None, only_for=None, include_lazy=False, allow_mps=False, allow_xpu=False
):
    # allow callers to specifically opt tests into being tested on MPS, similar to `include_lazy`
    test_bases = device_type_test_bases.copy()
    if allow_mps and TEST_MPS and MPSTestBase not in test_bases:
        test_bases.append(MPSTestBase)
    if allow_xpu and TEST_XPU and XPUTestBase not in test_bases:
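The opt-in gating quoted above can be demonstrated with a minimal stand-alone sketch. The torch internals (TEST_XPU, XPUTestBase, device_type_test_bases) are replaced here with plain stand-ins for illustration:

```python
# Minimal sketch of the allow_xpu opt-in gating, with torch internals
# replaced by plain stand-ins (these names are assumptions for the sketch).
TEST_MPS = False
TEST_XPU = True  # pretend an XPU build/device was detected

class CPUTestBase: ...
class MPSTestBase: ...
class XPUTestBase: ...

device_type_test_bases = [CPUTestBase]

def get_desired_device_type_test_bases(allow_mps=False, allow_xpu=False):
    test_bases = device_type_test_bases.copy()
    if allow_mps and TEST_MPS and MPSTestBase not in test_bases:
        test_bases.append(MPSTestBase)
    if allow_xpu and TEST_XPU and XPUTestBase not in test_bases:
        test_bases.append(XPUTestBase)
    return test_bases

# Without the flag, XPU tests are silently skipped even when XPU is available:
print([b.__name__ for b in get_desired_device_type_test_bases()])
# ['CPUTestBase']
print([b.__name__ for b in get_desired_device_type_test_bases(allow_xpu=True)])
# ['CPUTestBase', 'XPUTestBase']
```

This is why listing "xpu" in only_for is not enough by itself: the XPU test base is only appended when the caller explicitly opts in.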

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 26, 2025
Collaborator

@guangyey guangyey left a comment

Overall LGTM. I recommend changing TEST_MULTIGPU to TEST_MULTIACCELERATOR.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 28, 2025
guangyey
guangyey previously approved these changes Jul 28, 2025
@guangyey guangyey changed the title [WIP]port distributed pipeline test files for Intel GPU port distributed pipeline test files for Intel GPU Jul 28, 2025
@guangyey
Collaborator

@d4l3k May I know if the internal CI is green?

@guangyey
Collaborator

@wincent8, please fix conflicts.

@wincent8
Contributor Author

@wincent8, please fix conflicts.

done

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/159033/head returned non-zero exit code 1

Rebasing (1/14)
Auto-merging test/distributed/pipelining/test_schedule_multiproc.py
CONFLICT (content): Merge conflict in test/distributed/pipelining/test_schedule_multiproc.py
Auto-merging test/distributed/pipelining/test_stage.py
error: could not apply 190e5a02391... enable distributed pipeline case for xpu
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 190e5a02391... enable distributed pipeline case for xpu

Raised by https://github.com/pytorch/pytorch/actions/runs/17144110931

@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D80223633.

Collaborator

@kwen2501 kwen2501 left a comment

LGTM. Left some comments on whether we can make the backend strings go away.

Comment on lines -743 to +744
- @requires_nccl()
+ @requires_accelerator_dist_backend(["nccl", "xccl"])
Collaborator

As far as I know, requires_accelerator_dist_backend accepts None as an argument, in which case it searches basically the same list of strings. I would prefer to leave it empty for a lighter maintenance load.

Member

This is done explicitly to avoid running these tests on MTIA, which has issues with them.
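The trade-off discussed in this thread can be sketched in plain Python. The decorator body and AVAILABLE_BACKENDS below are hypothetical stand-ins, not PyTorch's implementation: passing an explicit list restricts which detected backends qualify, while None accepts any available backend.

```python
import functools
import unittest

AVAILABLE_BACKENDS = {"xccl"}  # stand-in for runtime backend detection

def requires_accelerator_dist_backend(backends=None):
    """Skip the test unless an allowed distributed backend is available.

    backends=None accepts any detected backend; an explicit list such as
    ["nccl", "xccl"] deliberately excludes other backends (e.g. MTIA,
    per the discussion above).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            allowed = AVAILABLE_BACKENDS if backends is None else set(backends)
            if not (allowed & AVAILABLE_BACKENDS):
                raise unittest.SkipTest(f"none of {sorted(allowed)} available")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_accelerator_dist_backend(["nccl", "xccl"])
def test_pipeline():
    return "ran"

print(test_pipeline())  # ran
```

With this shape, an explicit allow-list is slightly more maintenance but gives precise control over which accelerators run a given test.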

@d4l3k
Member

d4l3k commented Aug 22, 2025

I kicked off a land -- this should be merged soon so just sit tight

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@seemethere
Member

@pytorchbot merge -f 'merged internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here


Labels

ci-no-td: Do not run TD on this PR
ciflow/trunk: Trigger trunk jobs on your pull request
ciflow/xpu: Run XPU CI tasks
Merged
module: inductor
module: xpu: Intel XPU related issues
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (checkpoint)
Reverted
topic: not user facing: topic category
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

10 participants