
[Reland][2/N] Port several test files under test/distributed to Intel GPU #159473

Closed
daisyden wants to merge 2 commits into pytorch:main from daisyden:daisyden/distributed_s2

Conversation

@daisyden
Collaborator

@daisyden daisyden commented Jul 30, 2025

As part of #114850, we are porting the distributed tests to Intel GPU. This PR covers several test files under test/distributed. We enable Intel GPU with the following methods, keeping the original code style as much as possible:

  • Use instantiate_device_type_tests().
  • Use torch.accelerator.current_accelerator() to determine the accelerator backend.
  • Use requires_accelerator_dist_backend to allow both NCCL and XCCL tests.
  • Enable XPU on some test paths.
  • Change the hardcoded world_size according to device_count.
  • Unify common code under torch/testing/_internal for multiple backends; for example:
    add xpu to Backend.backend_capability and dist.Backend.register_backend().
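The backend-selection idea in the bullets above can be sketched without PyTorch as follows. This is a minimal, hypothetical sketch: the device type is passed in as a plain string, whereas the actual tests read it from torch.accelerator.current_accelerator().

```python
# Hypothetical sketch of mapping an accelerator type to a distributed
# backend, as described in the bullets above. Real tests query
# torch.accelerator; here the device type is an explicit argument so
# the mapping is testable on its own.

BACKEND_FOR_DEVICE = {
    "cuda": "nccl",  # NVIDIA GPUs use the NCCL process-group backend
    "xpu": "xccl",   # Intel GPUs use the XCCL process-group backend
    "cpu": "gloo",   # CPU fallback
}

def dist_backend_for(device_type: str) -> str:
    """Return the distributed backend name for a given device type."""
    try:
        return BACKEND_FOR_DEVICE[device_type]
    except KeyError:
        raise ValueError(f"no distributed backend known for {device_type!r}")
```

With this shape, a test body stays identical across vendors and only the backend name changes.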

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @gujinghui @fengyuan14 @guangyey

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159473

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d25250e with merge base e900a27 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jul 30, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Jul 30, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Jul 31, 2025
@daisyden
Collaborator Author

daisyden commented Aug 1, 2025

@pytorchbot label "ciflow/xpu"

@pytorch-bot

pytorch-bot bot commented Aug 1, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'ciflow/xpu' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@daisyden
Collaborator Author

daisyden commented Aug 1, 2025

@pytorchbot label "module: xpu"
@pytorchbot label "triaged"

@pytorch-bot pytorch-bot bot added the module: xpu Intel XPU related issues label Aug 1, 2025
@daisyden daisyden added the keep-going Don't stop on first failure, keep running tests until the end label Aug 7, 2025
@daisyden
Collaborator Author

daisyden commented Aug 7, 2025

@pytorchbot rebase

@pytorch-bot

pytorch-bot bot commented Aug 7, 2025

Didn't find following labels among repository labels: rebase

Comment on lines +341 to +350

```python
if TEST_CUDA:
    if not c10d.is_nccl_available():
        return skip_but_pass_in_sandcastle(
            "c10d was not compiled with the NCCL backend",
        )
    else:
        return skip_but_pass_in_sandcastle_if(
            torch.cuda.nccl.version() < version,
            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
        )
```

Collaborator

Suggested change:

```diff
-if TEST_CUDA:
-    if not c10d.is_nccl_available():
-        return skip_but_pass_in_sandcastle(
-            "c10d was not compiled with the NCCL backend",
-        )
-    else:
-        return skip_but_pass_in_sandcastle_if(
-            torch.cuda.nccl.version() < version,
-            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
-        )
+if not TEST_CUDA:
+    return lambda f: f
+if not c10d.is_nccl_available():
+    return skip_but_pass_in_sandcastle(
+        "c10d was not compiled with the NCCL backend",
+    )
+return skip_but_pass_in_sandcastle_if(
+    torch.cuda.nccl.version() < version,
+    f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
+)
```
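The suggested refactor replaces nesting with early returns. The same guard-clause pattern can be shown in a self-contained sketch; the functions below are plain stand-ins for the real skip_but_pass_in_sandcastle helpers and TEST_CUDA/version probes, not PyTorch code.

```python
# Guard-clause sketch of the version-gating decorator discussed above.
# `available` stands in for TEST_CUDA, and `found` for
# torch.cuda.nccl.version(); both are hypothetical inputs here.

def skip_with_reason(reason):
    """Replace the decorated test with one that reports a skip."""
    def deco(fn):
        def wrapper(*args, **kwargs):
            return ("skipped", reason)
        return wrapper
    return deco

def requires_backend_version(version, available, found):
    if not available:        # backend absent: leave the test unchanged
        return lambda f: f
    if found < version:      # backend too old: skip with a reason
        return skip_with_reason(f"needs version >= {version}, found {found}")
    return lambda f: f       # backend new enough: run normally
```

Each failure mode exits at a single indentation level, which is what the suggested change achieves in the real helper.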

```python
]

if TEST_XPU:
    backend_config_strings_and_expected_values.remove(
```

Collaborator

What will happen if these values are not removed here?

Comment on lines +278 to +280

```python
mem_usage[i] = torch.get_device_module(
    self.device.type
).max_memory_allocated()
```

Collaborator

Suggested change:

```diff
-mem_usage[i] = torch.get_device_module(
-    self.device.type
-).max_memory_allocated()
+mem_usage[i] = torch.accelerator.max_memory_allocated()
```

Refer to:

```python
def max_memory_allocated(device_index: _device_t = None, /) -> int:
```

```python
os.environ["LOCAL_RANK"] = f"{local_rank}"


@skipIfXpu
```

Collaborator

I remember skipIfXpu has a side effect that unexpectedly skips all UTs in this class.

Collaborator

Refer to #151315.

```diff
 # Set the device on each process before DeviceMesh constructor,
 # and device to be different than the default world rank
-torch.cuda.set_device((self.rank + 2) % self.world_size)
+torch.accelerator.set_device_idx((self.rank + 2) % self.world_size)
```

Collaborator

set_device_idx has been deprecated. Use set_device_index instead.

```python
# optimizer, you should be able to repro it single process!
@requires_nccl()
# # optimizer, you should be able to repro it single process!
# @skip_but_pass_in_sandcastle_if(
```

Collaborator

Are these commented-out lines unnecessary?

Comment on lines +699 to +701

```python
@unittest.skipIf(
    TEST_XPU, "torch._inductor.cudagraph_trees is not supported on XPU"
)
```

Collaborator

Is @requires_cuda_and_triton enough?

```python
@requires_cuda
@requires_accelerator_dist_backend(["nccl", "xccl"])
@unittest.skipIf(
    not torch.cuda.is_available() and not torch.xpu.is_available(),
```

Collaborator

Suggested change:

```diff
-    not torch.cuda.is_available() and not torch.xpu.is_available(),
+    not torch.accelerator.is_available(),
```
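The suggestion works because a single unified-accelerator check is logically equivalent to the pair of per-backend checks. With boolean flags standing in for torch.cuda.is_available() and torch.xpu.is_available(), the equivalence is just De Morgan's law:

```python
# De Morgan: (not cuda) and (not xpu)  ==  not (cuda or xpu).
# A unified accelerator check covers both backends in one call,
# assuming accelerator_ok == (cuda_ok or xpu_ok).

def skip_old(cuda_ok: bool, xpu_ok: bool) -> bool:
    """Original condition: skip when neither backend is available."""
    return not cuda_ok and not xpu_ok

def skip_new(accelerator_ok: bool) -> bool:
    """Suggested condition: skip when no accelerator is available."""
    return not accelerator_ok

# Exhaustively verify the two conditions agree.
for cuda_ok in (False, True):
    for xpu_ok in (False, True):
        assert skip_old(cuda_ok, xpu_ok) == skip_new(cuda_ok or xpu_ok)
```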

```python
self._verify_runtime_estimation(fn, (inp,))

# lack of profiler on XPU
@expectedFailureXPU
```

Collaborator

I think it is better to separate these code changes into another PR since they are unrelated to this PR.

Collaborator Author

This case passed because I registered 'xpu' in fake_pg, so I removed the two lines to avoid an unexpected-success issue.

Collaborator

OK. Makes sense.

Collaborator

@guangyey guangyey left a comment

LGTM.

@guangyey
Collaborator

@pytorchbot rebase

@guangyey guangyey requested a review from d4l3k August 22, 2025 01:58

```python
@with_comms
def test_set_mesh_dim_group_options(self):
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
```

Collaborator

The logic has been changed here; it is not aligned with the original.

Collaborator Author

Can I change it to:

```python
device_type = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
```

@daisyden
Collaborator Author

daisyden commented Sep 13, 2025

I found that some logic has already been changed. Let’s sync offline to resolve the remaining gaps, and then re-land the change once everything is aligned.

Thanks, I have reviewed the code once more and will submit a commit to align with the latest code.

@daisyden daisyden closed this Sep 13, 2025
@daisyden daisyden reopened this Sep 13, 2025
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 13, 2025
@daisyden daisyden added keep-going Don't stop on first failure, keep running tests until the end and removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request labels Sep 13, 2025
@daisyden daisyden force-pushed the daisyden/distributed_s2 branch from fc09a54 to bc2f5f9 Compare September 15, 2025 02:23
@daisyden daisyden changed the title [2/N]Port several test files under test/distributed to Intel GPU [Reland][2/N]Port several test files under test/distributed to Intel GPU Sep 16, 2025
Collaborator

@guangyey guangyey left a comment

LGTM.

@guangyey
Collaborator

The failure is irrelevant. Let's try to land this PR again.
@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 16, 2025
@pytorchmergebot
Collaborator

@guangyey guangyey removed ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/h100-distributed labels Sep 16, 2025
@guangyey
Collaborator

@daisyden please rebase to the latest viable/strict branch.

  • Only add the xpu backend in register_backend when XPU is available and the device arg is None.
  • Remove expectedFailureXPU for passing cases in test/inductor/test_snode_runtime.py when fake_pg is registered on XPU.
@daisyden daisyden force-pushed the daisyden/distributed_s2 branch from bc2f5f9 to d25250e Compare September 16, 2025 08:06
@pytorchmergebot
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team: raised by workflow job.

@guangyey
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

  • ci-no-td (Do not run TD on this PR)
  • ciflow/inductor
  • ciflow/mps (Run MPS tests, subset of trunk)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • ciflow/xpu (Run XPU CI tasks)
  • keep-going (Don't stop on first failure, keep running tests until the end)
  • Merged
  • module: inductor
  • module: xpu (Intel XPU related issues)
  • oncall: distributed (Add this issue/PR to distributed oncall triage queue)
  • open source
  • Reverted
  • topic: not user facing (topic category)


9 participants