[Dist][CI] fix distributed timeout #175030
weifengpy wants to merge 3 commits into gh/weifengpy/78/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175030
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 3 Unrelated Failures as of commit 9eaa9ff with merge base c031272.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch.randint(5, (12, 8, 12)),
torch.randint(2, (12, 8, 12)),
)
# Commented out to fix distributed CI timeout: each 3-tensor call
@seemethere @atalman Wondering if it would be sufficient to wrap this with TEST_WITH_SLOW instead of commenting it out? Do you know what the timeout threshold is for the TEST_WITH_SLOW category?
wconstab left a comment
Stamp to unblock. I will follow up, possibly converting these to TEST_WITH_SLOW instead.
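As a hedged sketch of that alternative: the expensive exhaustive cases could be gated behind PyTorch's slow-test machinery instead of being removed. The slowTest decorator and TEST_WITH_SLOW flag come from torch.testing._internal.common_utils; the separate test_index_slow method below is purely hypothetical, not the actual code in this PR.

```python
# Sketch only: keep the exhaustive 3-tensor cases but run them only in the
# slow-test shard. @slowTest skips the test unless the suite is run with
# PYTORCH_TEST_WITH_SLOW=1 (i.e. TEST_WITH_SLOW is true).
import unittest

from torch.testing._internal.common_utils import slowTest


class ExampleDistTensorOpsTest(unittest.TestCase):
    def test_index(self):
        # fast 2-tensor cases stay here and always run in regular CI
        ...

    @slowTest  # hypothetical split: only runs in the slow-test shard
    def test_index_slow(self):
        # the eight exhaustive 3-tensor _test_op calls could move here
        ...
```

Whether the slow shard's timeout is long enough for all 564 combinations is the open question raised above.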
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, linux.arm64.m7g.4xlarge), trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#175030
Approved by: https://github.com/wconstab
Stack from ghstack (oldest at bottom):
It's timing out because it was moved out of the slow tests in #171051.
Some devices have already disabled test_index, just not the cuda device: #173181.
From Claude:
Root Cause
The test_index method in test/distributed/tensor/test_tensor_ops.py:623 was causing the test suite to hang (taking >10 minutes for a single test, with the full suite never completing).
Why: test_index made 15 calls to _test_op, which uses DTensorConverter to generate all possible sharding placement
combinations via itertools.product. The 8 three-tensor calls (lines 672-729) each generated 40-80 combinations, for
a total of ~504 combinations out of 564. Each combination requires multiple NCCL collective operations
(distribute_tensor + full_tensor), making the test extremely slow. The test runs twice — once in DistTensorOpsTest
and once in DistTensorOpsTestWithLocalTensor.
Breakdown of combinations per call:
- 2-tensor calls: 8-16 combinations each (76 total), which is reasonable
- 3-tensor calls: 40-80 combinations each (504 total), a combinatorial explosion from 4×4×4=64 or 5×4×4=80 products
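To make the arithmetic concrete, here is a minimal counting sketch. It is not the real DTensorConverter API; it only reproduces the itertools.product structure described above, with the per-argument candidate counts of 4 and 5 taken from the figures in this description.

```python
# Minimal sketch of the combinatorial counting described above. The real
# DTensorConverter enumerates Placement objects per tensor argument; here we
# only mimic the itertools.product arithmetic with integer stand-ins.
import itertools


def count_combinations(candidates_per_arg):
    # One candidate-placement count per tensor argument
    # (e.g. Replicate plus one Shard per tensor dim).
    return sum(1 for _ in itertools.product(*(range(n) for n in candidates_per_arg)))


print(count_combinations([4, 4, 4]))  # 64 -> 3-tensor call, 4 candidates per arg
print(count_combinations([5, 4, 4]))  # 80 -> 3-tensor call whose first arg has 5 candidates
print(count_combinations([4, 4]))     # 16 -> a typical 2-tensor call stays small
```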
Fix
Reduced the 3-tensor _test_op calls from 8 to 2 representative ones (sketched on plain tensors at the end of this description):
1. x[z, y]: basic multi-index (64 combinations)
2. x[:, z, :, y] with broadcast: covers the 4D tensor + broadcast pattern (60 combinations)
This reduces total combinations from 564 to ~200, bringing test_index from >10 minutes down to ~2 minutes, and the
full suite from never-completing to ~11 minutes.
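For illustration, the two index patterns kept by the fix can be exercised on plain torch tensors, with the DTensor/mesh plumbing omitted. The randint shapes mirror the snippet quoted in the review thread; the shapes of x and x4 are assumptions for this sketch, not the literal values in the test.

```python
# Plain-tensor illustration of the two representative index patterns kept by
# the fix; no DeviceMesh or DTensor involved, so this runs on CPU as-is.
import torch

# 1. Basic multi-index: x[z, y]
x = torch.randn(16, 32, 16)
z = torch.randint(5, (12, 8, 12))   # indexes dim 0 (values < 16)
y = torch.randint(2, (12, 8, 12))   # indexes dim 1 (values < 32)
print(x[z, y].shape)                # torch.Size([12, 8, 12, 16])

# 2. 4D tensor + broadcast pattern: x[:, z, :, y]
x4 = torch.randn(16, 32, 16, 12)
yb = torch.randint(2, (12, 1, 12))  # broadcasts against z along dim 1
print(x4[:, z, :, yb].shape)        # advanced dims move to the front: [12, 8, 12, 16, 16]
```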