[ROCm] Skip test_index on MI300X due to timeout #173181
Closed
c0de128 wants to merge 1 commit into pytorch:main from
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/173181
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit b0367b7 with merge base c7e67ec.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Contributor
Author
@pytorchbot label 'module: rocm' 'topic: not user facing'
Skip DistTensorOpsTest.test_index on MI300 architecture (gfx942) in addition to MI200. The test times out after 300 seconds on MI300X, similar to the existing MI200 skip. Fixes pytorch#171119 Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Force-pushed from 2e6dc23 to b0367b7
Contributor
Author
@jeffdaily @sunway513 Could you approve CI for this ROCm fix? It addresses issue #171119 by extending the test_index skip to MI300X (the test times out after 300s). Thanks!
weifengpy added a commit that referenced this pull request on Feb 14, 2026
It's timing out because it was moved out of the slow-test list in #171051; some devices had already disabled test_index, just not the CUDA device: #173181. Analysis from claude.

Root Cause
The test_index method in test/distributed/tensor/test_tensor_ops.py:623 was causing the test suite to hang (taking >10 minutes for a single test, with the full suite never completing).

Why: test_index made 15 calls to _test_op, which uses DTensorConverter to generate all possible sharding placement combinations via itertools.product. The 8 three-tensor calls (lines 672-729) each generated 40-80 combinations, for a total of ~504 combinations out of 564. Each combination requires multiple NCCL collective operations (distribute_tensor + full_tensor), making the test extremely slow. The test runs twice: once in DistTensorOpsTest and once in DistTensorOpsTestWithLocalTensor.

Breakdown of combinations per call:
- 2-tensor calls: 8-16 combinations each (76 total), which is reasonable
- 3-tensor calls: 40-80 combinations each (504 total), a combinatorial explosion from 4×4×4=64 or 5×4×4=80 products

Fix
Reduced the 3-tensor _test_op calls from 8 to 2 representative ones:
1. x[z, y]: basic multi-index (64 combinations)
2. x[:, z, :, y] with broadcast: covers the 4D tensor + broadcast pattern (60 combinations)

This reduces total combinations from 564 to ~200, bringing test_index from >10 minutes down to ~2 minutes, and the full suite from never-completing to ~11 minutes.

[ghstack-poisoned]
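A minimal sketch of the combinatorial arithmetic quoted above; the per-input placement counts are illustrative assumptions, not DTensorConverter's actual enumeration:

```python
import itertools

# Hypothetical per-input placement counts matching the arithmetic above
# (4 x 4 = 16, 4 x 4 x 4 = 64, 5 x 4 x 4 = 80); the real counts come from
# DTensorConverter enumerating Replicate/Shard placements per input tensor.
calls = {
    "2-tensor call": [4, 4],
    "3-tensor call": [4, 4, 4],
    "3-tensor call, 4-D x": [5, 4, 4],
}

for name, counts in calls.items():
    # itertools.product crosses the candidates, so cost multiplies per input
    n = len(list(itertools.product(*(range(c) for c in counts))))
    print(f"{name}: {n} sharding combinations to test")
```

Each extra indexed tensor multiplies the combination count, which is why trimming the 3-tensor calls dominates the savings.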
pytorchmergebot pushed a commit that referenced this pull request on Feb 17, 2026, landing the test_index reduction described above.
Pull Request resolved: #175030
Approved by: https://github.com/wconstab
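For reference, the two index patterns retained by that fix look like this on plain eager tensors (a sketch ignoring DTensor sharding; shapes are illustrative assumptions):

```python
import torch

x = torch.randn(8, 8, 8, 8)
y = torch.randint(0, 8, (3,))     # index tensor, shape (3,)
z = torch.randint(0, 8, (2, 1))   # index tensor, broadcasts with y to (2, 3)

a = x[z, y]        # basic multi-index: adjacent advanced indices on dims 0, 1
b = x[:, z, :, y]  # 4-D + broadcast: advanced indices on dims 1 and 3

print(a.shape)  # torch.Size([2, 3, 8, 8])
print(b.shape)  # torch.Size([2, 3, 8, 8]); non-adjacent advanced indices
                # move the broadcast index dims to the front of the result
```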
Contributor
Author
Closing: no maintainer engagement after 4+ weeks.
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request on Mar 30, 2026, carrying the same test_index reduction (pytorch#175030).
Summary
Skip DistTensorOpsTest.test_index on MI300 architecture (gfx942) in addition to MI200. The test times out after 300 seconds on MI300X.

Fixes #171119
Background
The test was disabled after #171051 updated the slow tests list. The test already had the @skipIfRocmArch(MI200_ARCH) decorator; this PR extends it to also skip on MI300X (MI300_ARCH) until the underlying performance issue is resolved.

Changes
- Add the MI300_ARCH import to the test file
- Change @skipIfRocmArch(MI200_ARCH) to @skipIfRocmArch(MI200_ARCH + MI300_ARCH) (see the sketch below)

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang
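A hedged sketch of the resulting change; the module paths and the contents of the arch tuples are assumptions based on PyTorch's test utilities, not taken from this PR's verified diff:

```python
# Sketch only: import locations and tuple contents are assumed, not verified.
from torch.testing._internal.common_utils import (
    MI200_ARCH,
    MI300_ARCH,  # the new import this PR adds; covers gfx942 per the summary
    skipIfRocmArch,
)
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
)


class DistTensorOpsTest(DTensorTestBase):
    # Before this PR: @skipIfRocmArch(MI200_ARCH)
    # After: concatenating the arch tuples skips on both MI200 and MI300.
    @skipIfRocmArch(MI200_ARCH + MI300_ARCH)
    def test_index(self):
        ...
```

Passing the concatenated tuple keeps the existing MI200 behavior intact while adding the MI300 skip in a single decorator.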
PR authored with assistance from Claude.