
[Reland][2/N] Port several test files under test/distributed to Intel GPU #159473

Closed
daisyden wants to merge 2 commits into pytorch:main from daisyden:daisyden/distributed_s2

Conversation

@daisyden
Collaborator

@daisyden daisyden commented Jul 30, 2025

As part of #114850, we are porting the distributed tests to Intel GPU. This PR covers several test files under test/distributed. We enable Intel GPU with the following methods, keeping the original code style as much as possible:

  • Use instantiate_device_type_tests().
  • Use torch.accelerator.current_accelerator() to determine the accelerator backend.
  • Use requires_accelerator_dist_backend to allow both NCCL and XCCL tests.
  • Enable XPU on some test paths.
  • Change the hardcoded world_size according to device_count.
  • Unify common code under torch/testing/_internal for multiple backends; for example:
    add xpu to Backend.backend_capability and dist.Backend.register_backend().
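The backend-selection idea in the bullets above can be sketched without PyTorch as follows. This is a minimal, hypothetical sketch: the device type is passed in as a plain string, whereas the actual tests read it from torch.accelerator.current_accelerator().

```python
# Hypothetical sketch of mapping an accelerator type to a distributed
# backend, as described in the bullets above. Real tests query
# torch.accelerator; here the device type is an explicit argument so
# the mapping is testable on its own.

BACKEND_FOR_DEVICE = {
    "cuda": "nccl",  # NVIDIA GPUs use the NCCL process-group backend
    "xpu": "xccl",   # Intel GPUs use the XCCL process-group backend
    "cpu": "gloo",   # CPU fallback
}

def dist_backend_for(device_type: str) -> str:
    """Return the distributed backend name for a given device type."""
    try:
        return BACKEND_FOR_DEVICE[device_type]
    except KeyError:
        raise ValueError(f"no distributed backend known for {device_type!r}")
```

With this shape, a test body stays identical across vendors and only the backend name changes.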

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @gujinghui @fengyuan14 @guangyey

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159473

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d25250e with merge base e900a27 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jul 30, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 30, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@daisyden
Collaborator Author

daisyden commented Aug 1, 2025

@pytorchbot label "ciflow/xpu"

@pytorch-bot

pytorch-bot bot commented Aug 1, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'ciflow/xpu' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@daisyden
Collaborator Author

daisyden commented Aug 1, 2025

@pytorchbot label "module: xpu"
@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Aug 1, 2025
@daisyden daisyden added the keep-going Don't stop on first failure, keep running tests until the end label Aug 7, 2025
@daisyden
Collaborator Author

daisyden commented Aug 7, 2025

@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Aug 7, 2025

Didn't find following labels among repository labels: rebase

Comment on lines +341 to +350

```python
if TEST_CUDA:
    if not c10d.is_nccl_available():
        return skip_but_pass_in_sandcastle(
            "c10d was not compiled with the NCCL backend",
        )
    else:
        return skip_but_pass_in_sandcastle_if(
            torch.cuda.nccl.version() < version,
            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
        )
```

Collaborator

Suggested change:

```diff
-if TEST_CUDA:
-    if not c10d.is_nccl_available():
-        return skip_but_pass_in_sandcastle(
-            "c10d was not compiled with the NCCL backend",
-        )
-    else:
-        return skip_but_pass_in_sandcastle_if(
-            torch.cuda.nccl.version() < version,
-            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
-        )
+if not TEST_CUDA:
+    return lambda f: f
+if not c10d.is_nccl_available():
+    return skip_but_pass_in_sandcastle(
+        "c10d was not compiled with the NCCL backend",
+    )
+return skip_but_pass_in_sandcastle_if(
+    torch.cuda.nccl.version() < version,
+    f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
+)
```
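The suggested refactor replaces nesting with early returns. The same guard-clause pattern can be shown in a self-contained sketch; the functions below are plain stand-ins for the real skip_but_pass_in_sandcastle helpers and TEST_CUDA/version probes, not PyTorch code.

```python
# Guard-clause sketch of the version-gating decorator discussed above.
# `available` stands in for TEST_CUDA, and `found` for
# torch.cuda.nccl.version(); both are hypothetical inputs here.

def skip_with_reason(reason):
    """Replace the decorated test with one that reports a skip."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            return ("skipped", reason)
        return wrapper
    return deco

def requires_backend_version(version, available, found):
    if not available:        # backend absent: leave the test unchanged
        return lambda f: f
    if found < version:      # backend too old: skip with a reason
        return skip_with_reason(f"needs version >= {version}, found {found}")
    return lambda f: f       # backend new enough: run normally
```

Each failure mode exits at a single indentation level, which is what the suggested change achieves in the real helper.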

```python
]

if TEST_XPU:
    backend_config_strings_and_expected_values.remove(
```

Collaborator

What will happen if these values are not removed here?

Comment on lines +278 to +280

```python
mem_usage[i] = torch.get_device_module(
    self.device.type
).max_memory_allocated()
```

Collaborator

Suggested change:

```diff
-mem_usage[i] = torch.get_device_module(
-    self.device.type
-).max_memory_allocated()
+mem_usage[i] = torch.accelerator.max_memory_allocated()
```

Refer to:

```python
def max_memory_allocated(device_index: _device_t = None, /) -> int:
```

```python
os.environ["LOCAL_RANK"] = f"{local_rank}"


@skipIfXpu
```

Collaborator

I remember skipIfXpu has a side effect that unexpectedly skips all UTs in this class.

Collaborator

Refer to #151315.

```diff
 # Set the device on each process before DeviceMesh constructor,
 # and device to be different than the default world rank
-torch.cuda.set_device((self.rank + 2) % self.world_size)
+torch.accelerator.set_device_idx((self.rank + 2) % self.world_size)
```

Collaborator

set_device_idx has been deprecated. Use set_device_index instead.

```python
# optimizer, you should be able to repro it single process!
@requires_nccl()
# # optimizer, you should be able to repro it single process!
# @skip_but_pass_in_sandcastle_if(
```

Collaborator

Are these commented-out lines unnecessary?

Comment on lines +699 to +701

```python
@unittest.skipIf(
    TEST_XPU, "torch._inductor.cudagraph_trees is not supported on XPU"
)
```

Collaborator

Is @requires_cuda_and_triton enough?

```python
@requires_cuda
@requires_accelerator_dist_backend(["nccl", "xccl"])
@unittest.skipIf(
    not torch.cuda.is_available() and not torch.xpu.is_available(),
```

Collaborator

Suggested change:

```diff
-    not torch.cuda.is_available() and not torch.xpu.is_available(),
+    not torch.accelerator.is_available(),
```
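The suggestion works because a single unified-accelerator check is logically equivalent to the pair of per-backend checks. With boolean flags standing in for torch.cuda.is_available() and torch.xpu.is_available(), the equivalence is just De Morgan's law:

```python
# De Morgan: (not cuda) and (not xpu)  ==  not (cuda or xpu).
# A unified accelerator check covers both backends in one call,
# assuming accelerator_ok == (cuda_ok or xpu_ok).

def skip_old(cuda_ok: bool, xpu_ok: bool) -> bool:
    """Original condition: skip when neither backend is available."""
    return not cuda_ok and not xpu_ok

def skip_new(accelerator_ok: bool) -> bool:
    """Suggested condition: skip when no accelerator is available."""
    return not accelerator_ok

# Exhaustively verify the two conditions agree.
for cuda_ok in (False, True):
    for xpu_ok in (False, True):
        assert skip_old(cuda_ok, xpu_ok) == skip_new(cuda_ok or xpu_ok)
```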

```python
self._verify_runtime_estimation(fn, (inp,))

# lack of profiler on XPU
@expectedFailureXPU
```

Collaborator

I think it is better to separate these code changes into another PR since they are unrelated to this PR.

Collaborator Author

This case passed because I registered 'xpu' in fake_pg, so I removed the two lines to avoid an unexpected-success issue.

Collaborator

OK. Makes sense.

Collaborator

@guangyey guangyey left a comment

LGTM.

@guangyey
Collaborator

@pytorchbot rebase

@guangyey guangyey requested a review from d4l3k August 22, 2025 01:58

```python
@with_comms
def test_set_mesh_dim_group_options(self):
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
```

Collaborator

The logic has been changed here; it is not aligned with the original.

Collaborator Author

Can I change it to:

```python
device_type = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
```

@daisyden
Collaborator Author

daisyden commented Sep 13, 2025

I found that some logic has already been changed. Let’s sync offline to resolve the remaining gaps, and then re-land the change once everything is aligned.

Thanks, I have reviewed the code once more and will submit a commit to align with the latest code.

@daisyden daisyden closed this Sep 13, 2025
@daisyden daisyden reopened this Sep 13, 2025
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 13, 2025
@daisyden daisyden added keep-going Don't stop on first failure, keep running tests until the end and removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request labels Sep 13, 2025
@daisyden daisyden force-pushed the daisyden/distributed_s2 branch from fc09a54 to bc2f5f9 Compare September 15, 2025 02:23
@daisyden daisyden changed the title [2/N]Port several test files under test/distributed to Intel GPU [Reland][2/N]Port several test files under test/distributed to Intel GPU Sep 16, 2025
Collaborator

@guangyey guangyey left a comment

LGTM.

@guangyey
Collaborator

The failure is irrelevant. Let's try to land this PR again.
@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 16, 2025
@pytorchmergebot
Collaborator

@guangyey guangyey removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/h100-distributed labels Sep 16, 2025
@guangyey
Collaborator

@daisyden please rebase to the latest viable/strict branch.

  • Only add the xpu backend in register_backend when XPU is available and the device arg is None.
  • Remove expectedFailureXPU for passing cases in test/inductor/test_snode_runtime.py when fake_pg is registered on XPU.
@daisyden daisyden force-pushed the daisyden/distributed_s2 branch from bc2f5f9 to d25250e Compare September 16, 2025 08:06
@pytorchmergebot
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team: raised by workflow job.

@guangyey
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/inductor
  • ciflow/mps (Run MPS tests, subset of trunk)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • ciflow/xpu (Run XPU CI tasks)
  • keep-going (Don't stop on first failure, keep running tests until the end)
  • Merged
  • module: inductor
  • module: xpu (Intel XPU related issues)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • open source
  • Reverted
  • topic: not user facing (topic category)


9 participants