[Dist][CI] fix distributed timeout #175030
weifengpy wants to merge 3 commits into gh/weifengpy/78/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175030
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 3 Unrelated Failures as of commit 9eaa9ff with merge base c031272.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch.randint(5, (12, 8, 12)),
torch.randint(2, (12, 8, 12)),
)
# Commented out to fix distributed CI timeout: each 3-tensor call
@seemethere @atalman Wondering if it would be sufficient to wrap this with TEST_WITH_SLOW instead of commenting it out? Do you know what the timeout threshold is for the TEST_WITH_SLOW category?
wconstab left a comment
Stamp to unblock. I will follow up, possibly converting these to TEST_WITH_SLOW instead.
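As a hedged sketch of that alternative: the expensive exhaustive cases could be gated behind PyTorch's slow-test machinery instead of being removed. The slowTest decorator and TEST_WITH_SLOW flag come from torch.testing._internal.common_utils; the separate test_index_slow method below is purely hypothetical, not the actual code in this PR.

```python
# Sketch only: keep the exhaustive 3-tensor cases but run them only in the
# slow-test shard. @slowTest skips the test unless the suite is run with
# PYTORCH_TEST_WITH_SLOW=1 (i.e. TEST_WITH_SLOW is true).
import unittest

from torch.testing._internal.common_utils import slowTest


class ExampleDistTensorOpsTest(unittest.TestCase):
    def test_index(self):
        # fast 2-tensor cases stay here and always run in regular CI
        ...

    @slowTest  # hypothetical split: only runs in the slow-test shard
    def test_index_slow(self):
        # the eight exhaustive 3-tensor _test_op calls could move here
        ...
```

Whether the slow shard's timeout is long enough for all 564 combinations is the open question raised above.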
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, linux.arm64.m7g.4xlarge), trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#175030
Approved by: https://github.com/wconstab
Stack from ghstack (oldest at bottom):
It's timing out because it was moved out of the slow tests in #171051.
Some devices have already disabled test_index, just not the cuda device: #173181.
From Claude:
Root Cause
The test_index method in test/distributed/tensor/test_tensor_ops.py:623 was causing the test suite to hang (taking >10 minutes for a single test, with the full suite never completing).
Why: test_index made 15 calls to _test_op, which uses DTensorConverter to generate all possible sharding placement
combinations via itertools.product. The 8 three-tensor calls (lines 672-729) each generated 40-80 combinations, for
a total of ~504 combinations out of 564. Each combination requires multiple NCCL collective operations
(distribute_tensor + full_tensor), making the test extremely slow. The test runs twice — once in DistTensorOpsTest
and once in DistTensorOpsTestWithLocalTensor.
Breakdown of combinations per call:
- 2-tensor calls: 8-16 combinations each (76 total), which is reasonable
- 3-tensor calls: 40-80 combinations each (504 total), a combinatorial explosion from 4×4×4=64 or 5×4×4=80 products
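To make the arithmetic concrete, here is a minimal counting sketch. It is not the real DTensorConverter API; it only reproduces the itertools.product structure described above, with the per-argument candidate counts of 4 and 5 taken from the figures in this description.

```python
# Minimal sketch of the combinatorial counting described above. The real
# DTensorConverter enumerates Placement objects per tensor argument; here we
# only mimic the itertools.product arithmetic with integer stand-ins.
import itertools


def count_combinations(candidates_per_arg):
    # One candidate-placement count per tensor argument
    # (e.g. Replicate plus one Shard per tensor dim).
    return sum(1 for _ in itertools.product(*(range(n) for n in candidates_per_arg)))


print(count_combinations([4, 4, 4]))  # 64 -> 3-tensor call, 4 candidates per arg
print(count_combinations([5, 4, 4]))  # 80 -> 3-tensor call whose first arg has 5 candidates
print(count_combinations([4, 4]))     # 16 -> a typical 2-tensor call stays small
```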
Fix
Reduced the 3-tensor _test_op calls from 8 to 2 representative ones (sketched on plain tensors at the end of this description):
1. x[z, y]: basic multi-index (64 combinations)
2. x[:, z, :, y] with broadcast: covers the 4D tensor + broadcast pattern (60 combinations)
This reduces total combinations from 564 to ~200, bringing test_index from >10 minutes down to ~2 minutes, and the
full suite from never-completing to ~11 minutes.
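For illustration, the two index patterns kept by the fix can be exercised on plain torch tensors, with the DTensor/mesh plumbing omitted. The randint shapes mirror the snippet quoted in the review thread; the shapes of x and x4 are assumptions for this sketch, not the literal values in the test.

```python
# Plain-tensor illustration of the two representative index patterns kept by
# the fix; no DeviceMesh or DTensor involved, so this runs on CPU as-is.
import torch

# 1. Basic multi-index: x[z, y]
x = torch.randn(16, 32, 16)
z = torch.randint(5, (12, 8, 12))   # indexes dim 0 (values < 16)
y = torch.randint(2, (12, 8, 12))   # indexes dim 1 (values < 32)
print(x[z, y].shape)                # torch.Size([12, 8, 12, 16])

# 2. 4D tensor + broadcast pattern: x[:, z, :, y]
x4 = torch.randn(16, 32, 16, 12)
yb = torch.randint(2, (12, 1, 12))  # broadcasts against z along dim 1
print(x4[:, z, :, yb].shape)        # advanced dims move to the front: [12, 8, 12, 16, 16]
```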