[ROCm] Skip test_index on MI300X due to timeout #173181

Closed

c0de128 wants to merge 1 commit into pytorch:main from c0de128:fix/rocm-skip-test-index-mi300

Conversation

@c0de128
Contributor

c0de128 commented Jan 23, 2026

Summary

Skip DistTensorOpsTest.test_index on MI300 architecture (gfx942) in addition to MI200. The test times out after 300 seconds on MI300X.

Fixes #171119

Background

The test was disabled after #171051 updated the slow tests list. The test already had the @skipIfRocmArch(MI200_ARCH) decorator; this PR extends it to also skip on MI300X (MI300_ARCH) until the underlying performance issue is resolved.

Changes

  • Added MI300_ARCH import to test file
  • Extended @skipIfRocmArch(MI200_ARCH) to @skipIfRocmArch(MI200_ARCH + MI300_ARCH) (sketched below)
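
For context, a minimal self-contained sketch of the decorator change, using hypothetical stand-ins for the PyTorch test helpers (in the real tree, skipIfRocmArch and the arch tuples are expected to come from the shared test utilities; gfx942 for MI300X is stated above, while gfx90a for MI200 is an assumption):

```python
import unittest

# Hypothetical stand-ins for the helpers named in this PR.
# gfx942 (MI300X) is stated in the PR text; gfx90a for MI200 is an assumption.
MI200_ARCH = ("gfx90a",)
MI300_ARCH = ("gfx942",)

def skipIfRocmArch(archs):
    """Stand-in decorator: skip when the detected gfx arch is in `archs`."""
    detected = "gfx942"  # a real run would query the ROCm device properties
    return unittest.skipIf(detected in archs, f"skipped on ROCm arch {detected}")

class DistTensorOpsTest(unittest.TestCase):
    # Before this PR: @skipIfRocmArch(MI200_ARCH)
    @skipIfRocmArch(MI200_ARCH + MI300_ARCH)  # tuple concat: skip on both
    def test_index(self):
        pass  # the real test exercises DTensor indexing ops

if __name__ == "__main__":
    unittest.main()
```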

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang


PR authored with assistance from Claude.

@pytorch-bot

pytorch-bot Bot commented Jan 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/173181

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b0367b7 with merge base c7e67ec:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@c0de128
Contributor Author

c0de128 commented Jan 23, 2026

@pytorchbot label 'module: rocm' 'topic: not user facing'

pytorch-bot added the module: rocm and topic: not user facing labels on Jan 23, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jan 23, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: c0de128 / name: Kevin McKay (b0367b7)

Skip DistTensorOpsTest.test_index on MI300 architecture (gfx942) in addition
to MI200. The test times out after 300 seconds on MI300X, similar to the
existing MI200 skip.

Fixes pytorch#171119

Signed-off-by: c0de128 <kevin.mckay@outlook.com>
c0de128 force-pushed the fix/rocm-skip-test-index-mi300 branch from 2e6dc23 to b0367b7 on January 23, 2026 15:59
@c0de128
Contributor Author

c0de128 commented Jan 24, 2026

@jeffdaily @sunway513 Could you approve CI for this ROCm fix? It addresses issue #171119 by extending the test_index skip to MI300X (the test times out after 300s). Thanks!

bdhirsh requested a review from jeffdaily on January 26, 2026 17:06
bdhirsh added the triaged label on Jan 26, 2026
weifengpy added a commit that referenced this pull request Feb 14, 2026
It's timing out because it was moved out of the slow tests list in #171051.

Some devices already disable test_index, just not CUDA: #173181

From Claude:

  Root Cause

  The test_index method in test/distributed/tensor/test_tensor_ops.py:623 was causing the test suite to hang (taking
  >10 minutes for a single test, with the full suite never completing).

  Why: test_index made 15 calls to _test_op, which uses DTensorConverter to generate all possible sharding placement
  combinations via itertools.product. The 8 three-tensor calls (lines 672-729) each generated 40-80 combinations, for
  a total of ~504 combinations out of 564. Each combination requires multiple NCCL collective operations
  (distribute_tensor + full_tensor), making the test extremely slow. The test runs twice — once in DistTensorOpsTest
  and once in DistTensorOpsTestWithLocalTensor.

  Breakdown of combinations per call:
  - 2-tensor calls: 8-16 combinations each (76 total) — reasonable
  - 3-tensor calls: 40-80 combinations each (504 total) — combinatorial explosion from 4×4×4=64 or 5×4×4=80 products

  Fix

  Reduced the 3-tensor _test_op calls from 8 to 2 representative ones:
  1. x[z, y] — basic multi-index (64 combinations)
  2. x[:, z, :, y] with broadcast — covers 4D tensor + broadcast pattern (60 combinations)

  This reduces total combinations from 564 to ~200, bringing test_index from >10 minutes down to ~2 minutes, and the
  full suite from never-completing to ~11 minutes.

[ghstack-poisoned]
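
To make the blow-up concrete, here is a toy illustration (not the actual DTensorConverter code; the placement names are illustrative) of how itertools.product scales with the number of tensors:

```python
from itertools import product

# Illustrative sharding candidates per tensor; the real set is generated by
# DTensorConverter and varies per operand (4 or 5 options per the text above).
placements = ["Replicate", "Shard(0)", "Shard(1)", "Shard(2)"]

pairs = list(product(placements, repeat=2))    # 4 * 4  = 16 combinations
triples = list(product(placements, repeat=3))  # 4**3   = 64 combinations

print(len(pairs), len(triples))  # -> 16 64
# Eight 3-tensor calls at 40-80 combinations each is what produced the
# ~504-of-564 total above; each combination pays for several collectives.
```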
pytorchmergebot pushed a commit that referenced this pull request Feb 17, 2026

Pull Request resolved: #175030
Approved by: https://github.com/wconstab
@c0de128
Contributor Author

c0de128 commented Feb 24, 2026

Closing — no maintainer engagement after 4+ weeks.

c0de128 closed this on Feb 24, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026

Labels

module: rocm, open source, topic: not user facing, triaged


Development

Successfully merging this pull request may close these issues.

DISABLED test_index (__main__.DistTensorOpsTest)

3 participants