[Reland][2/N] Port several test files under test/distributed to Intel GPU #159473
daisyden wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159473
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit d25250e with merge base e900a27.
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "ciflow/xpu"
❌ 🤖 pytorchbot command failed: Try
@pytorchbot label "module: xpu"
@pytorchbot rebase
Didn't find following labels among repository labels: rebase
```python
if TEST_CUDA:
    if not c10d.is_nccl_available():
        return skip_but_pass_in_sandcastle(
            "c10d was not compiled with the NCCL backend",
        )
    else:
        return skip_but_pass_in_sandcastle_if(
            torch.cuda.nccl.version() < version,
            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
        )
```
Suggested change:
```diff
-if TEST_CUDA:
-    if not c10d.is_nccl_available():
-        return skip_but_pass_in_sandcastle(
-            "c10d was not compiled with the NCCL backend",
-        )
-    else:
-        return skip_but_pass_in_sandcastle_if(
-            torch.cuda.nccl.version() < version,
-            f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
-        )
+if not TEST_CUDA:
+    return lambda f: f
+if not c10d.is_nccl_available():
+    return skip_but_pass_in_sandcastle(
+        "c10d was not compiled with the NCCL backend",
+    )
+return skip_but_pass_in_sandcastle_if(
+    torch.cuda.nccl.version() < version,
+    f"Requires NCCL version greater than or equal to: {version}, found: {torch.cuda.nccl.version()}, reason: {msg}",
+)
```
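The guard-clause refactor in the suggestion can be sketched generically. Everything below is a pure-Python stand-in (`device_present`, `backend_version`, `make_skip` are hypothetical names standing in for `TEST_CUDA`, `c10d.is_nccl_available()`, and the `skip_but_pass_in_sandcastle*` helpers), not the real PyTorch test utilities:

```python
def requires_backend_version(required, device_present, backend_version, make_skip):
    """Early-return version of the nested if/else: each guard exits
    immediately, so there is no dangling else branch and the happy
    path falls through unindented."""
    if not device_present:              # stand-in for `not TEST_CUDA`
        return lambda f: f              # no device at all: leave the test untouched
    if backend_version is None:         # stand-in for `not c10d.is_nccl_available()`
        return make_skip("backend was not compiled in")
    if backend_version < required:
        return make_skip(f"requires version >= {required}, found {backend_version}")
    return lambda f: f
```

The key behavioral point of the suggestion is the first guard: when no device is present at all, the decorator is the identity, rather than falling through to a version check that cannot succeed.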
test/distributed/test_c10d_common.py
Outdated
```python
]

if TEST_XPU:
    backend_config_strings_and_expected_values.remove(
```
What will happen if these values are not removed here?
```python
mem_usage[i] = torch.get_device_module(
    self.device.type
).max_memory_allocated()
```
Suggested change:
```diff
-mem_usage[i] = torch.get_device_module(
-    self.device.type
-).max_memory_allocated()
+mem_usage[i] = torch.accelerator.max_memory_allocated()
```
Refer to pytorch/torch/accelerator/memory.py, line 123 (at commit 3c8c509).
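The shape of that suggestion can be sketched with stand-in objects (`FakeDeviceModule` and `AcceleratorFacade` are hypothetical names; the real APIs are `torch.get_device_module(...)` and the `torch.accelerator` module):

```python
class FakeDeviceModule:
    """Hypothetical stand-in for a backend module such as torch.cuda or torch.xpu."""
    def __init__(self, peak_bytes):
        self.peak_bytes = peak_bytes

    def max_memory_allocated(self):
        return self.peak_bytes


class AcceleratorFacade:
    """Sketch of the torch.accelerator idea: a single entry point that
    forwards to whichever backend is current, so test code needs no
    per-device-type lookup like torch.get_device_module(self.device.type)."""
    def __init__(self, modules, current_type):
        self._modules = modules            # e.g. {"cuda": ..., "xpu": ...}
        self._current_type = current_type  # e.g. "xpu"

    def max_memory_allocated(self):
        return self._modules[self._current_type].max_memory_allocated()
```

The design point is that the device-type string stops appearing in test code at all; the facade resolves it once, which is exactly what makes the test portable to XPU.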
test/distributed/test_device_mesh.py
Outdated
```python
os.environ["LOCAL_RANK"] = f"{local_rank}"


@skipIfXpu
```
I remember skipIfXpu has a side effect: it unexpectedly skips all UTs in this class.
test/distributed/test_device_mesh.py
Outdated
```diff
 # Set the device on each process before DeviceMesh constructor,
 # and device to be different than the default world rank
-torch.cuda.set_device((self.rank + 2) % self.world_size)
+torch.accelerator.set_device_idx((self.rank + 2) % self.world_size)
```
set_device_idx has been deprecated; use set_device_index instead.
```python
# optimizer, you should be able to repro it single process!
@requires_nccl()
# # optimizer, you should be able to repro it single process!
# @skip_but_pass_in_sandcastle_if(
```
Are these commented-out lines unnecessary?
```python
@unittest.skipIf(
    TEST_XPU, "torch._inductor.cudagraph_trees is not supported on XPU"
)
```
Would @requires_cuda_and_triton alone be enough?
```python
@requires_cuda
@requires_accelerator_dist_backend(["nccl", "xccl"])
@unittest.skipIf(
    not torch.cuda.is_available() and not torch.xpu.is_available(),
```
Suggested change:
```diff
-    not torch.cuda.is_available() and not torch.xpu.is_available(),
+    not torch.accelerator.is_available(),
```
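The consolidation in that suggestion can be sketched device-agnostically (`backends` is a hypothetical `{name: is_available_fn}` mapping standing in for the per-backend checks; the real call is `torch.accelerator.is_available()`):

```python
def accelerator_is_available(backends):
    """Sketch of the consolidated check: true when any accelerator
    backend reports availability, replacing explicit chains like
    `torch.cuda.is_available() or torch.xpu.is_available()`."""
    return any(check() for check in backends.values())
```

Because the set of backends lives in one place, adding a new accelerator type no longer requires touching every skip condition in the test suite.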
```python
self._verify_runtime_estimation(fn, (inp,))

# lack of profiler on XPU
@expectedFailureXPU
```
I think it would be better to separate these code changes into another PR, since they are unrelated to this one.
This case passed because I registered 'xpu' in fake_pg, so I removed the two lines to avoid an unexpected-success issue.
@pytorchbot rebase
```python
@with_comms
def test_set_mesh_dim_group_options(self):
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
```
The logic has been changed here; it is not aligned with the original.
Can I change it to:
```python
device_type = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
```
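That fallback expression can be sketched in pure Python (`Dev` below is a hypothetical stand-in for `torch.device`; `torch.accelerator.current_accelerator()` returns the current accelerator device, or `None` when no accelerator is available):

```python
def pick_device_type(current_accelerator):
    """Device-agnostic replacement for the hard-coded
    `"cuda" if torch.cuda.is_available() else "cpu"` pattern:
    use the current accelerator's type when one exists, else CPU.
    `current_accelerator` stands in for the result of
    torch.accelerator.current_accelerator()."""
    return current_accelerator.type if current_accelerator is not None else "cpu"
```

On a CUDA box this still yields "cuda", so the change is behavior-preserving there while also covering "xpu" and any future backend.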
Thanks, I have reviewed the code once more and will submit a commit to align with the latest code.
fc09a54 to bc2f5f9 (force-pushed)
The failure is irrelevant. Try to land this PR again.
Merge started. Your change will be merged while ignoring the following 5 checks: xpu / linux-jammy-xpu-n-py3.10 / test (default, 1, 8, linux.idc.xpu), xpu / linux-jammy-xpu-n-py3.10 / test (default, 7, 8, linux.idc.xpu), periodic / linux-jammy-cuda12.4-py3.10-gcc11 / test (legacy_nvidia_driver, 3, 5, lf.linux.4xlarge.nvidia.gpu), periodic / linux-jammy-cuda12.8-py3.10-gcc9-debug / test (default, 3, 7, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build), s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / build. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@daisyden please rebase to the latest viable/strict branch.
- only add xpu backend when xpu is available in register_backend when device arg is None
- remove expectedFailureXPU for passed cases in test/inductor/test_snode_runtime.py when fake_pg registered on XPU
bc2f5f9 to d25250e (force-pushed)
Merge failed. Reason: New commits were pushed while merging. Please rerun the merge command. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
For #114850, we will port distributed tests to Intel GPU. This PR covers some test files under test/distributed. We enable Intel GPU with the following methods, trying our best to keep the original code style:
- Added xpu for Backend.backend_capability and dist.Backend.register_backend()
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @gujinghui @fengyuan14 @guangyey