
port distributed pipeline test files for Intel GPU#159033

Closed
wincent8 wants to merge 3 commits into pytorch:main from wincent8:wliao2/add_pipeline

Conversation

@wincent8
Contributor

@wincent8 wincent8 commented Jul 24, 2025

In this PR we port all the distributed pipeline test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:

  1. use instantiate_device_type_tests()
  2. use torch.accelerator.current_accelerator() to determine the accelerator backend
  3. use requires_accelerator_dist_backend() to replace requires_nccl()
  4. use get_default_backend_for_device() to get the backend
  5. enable XPU for some test paths
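The device-agnostic pattern behind steps 2–4 can be sketched in plain Python. The mapping below is an illustrative assumption for this sketch, not the actual implementation of get_default_backend_for_device in torch:

```python
# Hypothetical sketch of device-agnostic backend selection for porting tests.
# The device -> backend mapping is an assumption for illustration; the real
# helpers live in torch's internal testing/distributed utilities.

DEVICE_TO_BACKEND = {
    "cuda": "nccl",  # NVIDIA GPUs
    "xpu": "xccl",   # Intel GPUs
    "hpu": "hccl",   # Habana accelerators
    "cpu": "gloo",   # CPU fallback
}

def get_default_backend_for_device(device: str) -> str:
    """Return the default distributed backend for a device string like 'xpu:0'."""
    device_type = device.split(":")[0]
    try:
        return DEVICE_TO_BACKEND[device_type]
    except KeyError:
        raise ValueError(f"no default distributed backend for {device!r}")

print(get_default_backend_for_device("xpu:0"))  # xccl
print(get_default_backend_for_device("cuda"))   # nccl
```

Selecting the backend from the device string, rather than hard-coding "nccl", is what lets the same test body run on CUDA and XPU machines.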

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @gujinghui @fengyuan14 @guangyey

@wincent8 wincent8 requested a review from a team as a code owner July 24, 2025 10:31
@pytorch-bot

pytorch-bot bot commented Jul 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159033

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 28da538 with merge base 67fc16c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 24, 2025
@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 24, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@wincent8
Contributor Author

@pytorchbot label "module: xpu"
@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 25, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@pytorch-bot

pytorch-bot bot commented Jul 25, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@guangyey guangyey added the topic: not user facing topic category label Jul 25, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 25, 2025
@wincent8
Contributor Author

@pytorchbot label "module: xpu"

@wincent8
Contributor Author

@pytorchbot label "triaged"

devices = ["cpu", "cuda", "hpu", "xpu"]
instantiate_device_type_tests(UnflattenTests, globals(), only_for=devices)
instantiate_device_type_tests(
    UnflattenTests, globals(), only_for=devices, allow_xpu=True
Collaborator

Good catch!

Contributor Author

thanks

Collaborator

Why the need for adding allow_xpu=True ?

Collaborator

Without allow_xpu=True, these test cases will not actually be instantiated; refer to:

def get_desired_device_type_test_bases(
    except_for=None, only_for=None, include_lazy=False, allow_mps=False, allow_xpu=False
):
    # allow callers to specifically opt tests into being tested on MPS, similar to `include_lazy`
    test_bases = device_type_test_bases.copy()
    if allow_mps and TEST_MPS and MPSTestBase not in test_bases:
        test_bases.append(MPSTestBase)
    if allow_xpu and TEST_XPU and XPUTestBase not in test_bases:
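The opt-in gating quoted above can be demonstrated with a minimal stand-alone sketch. The torch internals (TEST_XPU, XPUTestBase, device_type_test_bases) are replaced here with plain stand-ins for illustration:

```python
# Minimal sketch of the allow_xpu opt-in gating, with torch internals
# replaced by plain stand-ins (these names are assumptions for the sketch).
TEST_MPS = False
TEST_XPU = True  # pretend an XPU build/device was detected

class CPUTestBase: ...
class MPSTestBase: ...
class XPUTestBase: ...

device_type_test_bases = [CPUTestBase]

def get_desired_device_type_test_bases(allow_mps=False, allow_xpu=False):
    test_bases = device_type_test_bases.copy()
    if allow_mps and TEST_MPS and MPSTestBase not in test_bases:
        test_bases.append(MPSTestBase)
    if allow_xpu and TEST_XPU and XPUTestBase not in test_bases:
        test_bases.append(XPUTestBase)
    return test_bases

# Without the flag, XPU tests are silently skipped even when XPU is available:
print([b.__name__ for b in get_desired_device_type_test_bases()])
# ['CPUTestBase']
print([b.__name__ for b in get_desired_device_type_test_bases(allow_xpu=True)])
# ['CPUTestBase', 'XPUTestBase']
```

This is why listing "xpu" in only_for is not enough by itself: the XPU test base is only appended when the caller explicitly opts in.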

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 26, 2025
Collaborator

@guangyey guangyey left a comment

Overall LGTM. I recommend changing TEST_MULTIGPU to TEST_MULTIACCELERATOR.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 28, 2025
guangyey
guangyey previously approved these changes Jul 28, 2025
@guangyey guangyey changed the title [WIP]port distributed pipeline test files for Intel GPU port distributed pipeline test files for Intel GPU Jul 28, 2025
@guangyey
Collaborator

@d4l3k May I know if the internal CI is green?

@guangyey
Collaborator

@wincent8, please fix conflicts.

@wincent8
Contributor Author

@wincent8, please fix conflicts.

done

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/159033/head returned non-zero exit code 1

Rebasing (1/14)
Auto-merging test/distributed/pipelining/test_schedule_multiproc.py
CONFLICT (content): Merge conflict in test/distributed/pipelining/test_schedule_multiproc.py
Auto-merging test/distributed/pipelining/test_stage.py
error: could not apply 190e5a02391... enable distributed pipeline case for xpu
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 190e5a02391... enable distributed pipeline case for xpu

Raised by https://github.com/pytorch/pytorch/actions/runs/17144110931

@facebook-github-bot
Contributor

@d4l3k has imported this pull request. If you are a Meta employee, you can view this in D80223633.

Collaborator

@kwen2501 kwen2501 left a comment

LGTM. Left some comments on whether we can make the backend strings go away.

Comment on lines -743 to +744
- @requires_nccl()
+ @requires_accelerator_dist_backend(["nccl", "xccl"])
Collaborator

As far as I know, requires_accelerator_dist_backend accepts None as an argument, in which case it searches basically the same list of strings. I would prefer to leave it empty for a lighter maintenance load.

Member

This is done explicitly to avoid running these tests on MTIA, which has issues with them.
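The trade-off discussed in this thread can be sketched in plain Python. The decorator body and AVAILABLE_BACKENDS below are hypothetical stand-ins, not PyTorch's implementation: passing an explicit list restricts which detected backends qualify, while None accepts any available backend.

```python
import functools
import unittest

AVAILABLE_BACKENDS = {"xccl"}  # stand-in for runtime backend detection

def requires_accelerator_dist_backend(backends=None):
    """Skip the test unless an allowed distributed backend is available.

    backends=None accepts any detected backend; an explicit list such as
    ["nccl", "xccl"] deliberately excludes other backends (e.g. MTIA,
    per the discussion above).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            allowed = AVAILABLE_BACKENDS if backends is None else set(backends)
            if not (allowed & AVAILABLE_BACKENDS):
                raise unittest.SkipTest(f"none of {sorted(allowed)} available")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires_accelerator_dist_backend(["nccl", "xccl"])
def test_pipeline():
    return "ran"

print(test_pipeline())  # ran
```

With this shape, an explicit allow-list is slightly more maintenance but gives precise control over which accelerators run a given test.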

@d4l3k
Member

d4l3k commented Aug 22, 2025

I kicked off a land -- this should be merged soon so just sit tight

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team Raised by workflow job

@seemethere
Member

@pytorchbot merge -f 'merged internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here


Labels

ci-no-td: Do not run TD on this PR
ciflow/trunk: Trigger trunk jobs on your pull request
ciflow/xpu: Run XPU CI tasks
Merged
module: inductor
module: xpu: Intel XPU related issues
oncall: distributed: Add this issue/PR to distributed oncall triage queue
open source
release notes: distributed (checkpoint)
Reverted
topic: not user facing: topic category
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

10 participants