
[Dist][CI] fix distributed timeout #175030

Closed
weifengpy wants to merge 3 commits into gh/weifengpy/78/base from gh/weifengpy/78/head

Conversation

Contributor

@weifengpy weifengpy commented Feb 14, 2026

Stack from ghstack (oldest at bottom):

It's timing out because it was moved out of the slow-test set in #171051

Some devices have already disabled test_index, just not the CUDA device: #173181

From Claude:

Root Cause

The test_index method in test/distributed/tensor/test_tensor_ops.py:623 was causing the test suite to hang (taking >10 minutes for a single test, with the full suite never completing).

Why: test_index made 15 calls to _test_op, which uses DTensorConverter to generate all possible sharding placement
combinations via itertools.product. The 8 three-tensor calls (lines 672-729) each generated 40-80 combinations, for
a total of ~504 combinations out of 564. Each combination requires multiple NCCL collective operations
(distribute_tensor + full_tensor), making the test extremely slow. The test runs twice — once in DistTensorOpsTest
and once in DistTensorOpsTestWithLocalTensor.
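
To see where the time goes, here is a hedged sketch of the round trip each combination pays, using the public torch.distributed.tensor APIs named above. The shapes and the Shard(0) placement are illustrative assumptions, and it needs an initialized process group (e.g. run under torchrun):

```python
# Per-combination cost sketch: every placement combination pays at
# least one scatter (distribute_tensor) and one all-gather
# (full_tensor), each backed by NCCL collectives on CUDA.
# Shapes and the Shard(0) placement are illustrative.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

x = torch.randn(16, 32)
dx = distribute_tensor(x, mesh, [Shard(0)])  # collective: scatter
full = dx.full_tensor()                      # collective: all-gather
# _test_op repeats this round trip for every input tensor in every
# placement combination, so ~500 combinations means thousands of
# collectives per test run.
```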

Breakdown of combinations per call (a counting sketch follows this list):

  • 2-tensor calls: 8-16 combinations each (76 total) — reasonable
  • 3-tensor calls: 40-80 combinations each (504 total) — combinatorial explosion from 4×4×4=64 or 5×4×4=80 products
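
To make the arithmetic concrete, here is a minimal counting sketch. The per-tensor placement counts (4 or 5 candidate placements, e.g. Replicate plus one Shard per tensor dim on a 1D mesh) are illustrative assumptions; DTensorConverter derives the actual sets from each input's rank:

```python
# Minimal sketch of the combinatorial blow-up; the placement counts
# per tensor are illustrative assumptions, not DTensorConverter's
# exact enumeration.
from itertools import product

def num_combinations(placements_per_tensor):
    # DTensorConverter effectively takes itertools.product over the
    # candidate placements of every input tensor.
    return len(list(product(*(range(n) for n in placements_per_tensor))))

print(num_combinations([4, 4]))     # 16 -- a 2-tensor call
print(num_combinations([4, 4, 4]))  # 64 -- a 3-tensor call (4x4x4)
print(num_combinations([5, 4, 4]))  # 80 -- a 3-tensor call (5x4x4)
```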

Fix

Reduced the 3-tensor _test_op calls from 8 to 2 representative ones:

  1. x[z, y] — basic multi-index (64 combinations)
  2. x[:, z, :, y] with broadcast — covers 4D tensor + broadcast pattern (60 combinations)

This reduces total combinations from 564 to ~200, bringing test_index from >10 minutes down to ~2 minutes, and the
full suite from never-completing to ~11 minutes.
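
For reference, a sketch of what the two retained 3-tensor cases boil down to. The shapes and the simplified _test_op stand-in are assumptions modeled on the randint hunk quoted in the review below, not the PR's exact code:

```python
# Hypothetical sketch of the two retained 3-tensor cases; shapes and
# the simplified _test_op stand-in are assumptions, not the PR's code.
import torch

def _test_op(op, *args):
    # Stand-in: the real helper runs op over every sharding placement
    # combination of args and compares against this eager result.
    return op(*args)

x = torch.randn(16, 8, 16, 12)
y = torch.randint(2, (12, 8, 12))  # indexes a dim of size >= 2
z = torch.randint(5, (12, 8, 12))  # indexes a dim of size >= 5

# 1. Basic multi-index (~64 combinations)
_test_op(lambda a, b, c: a[c, b], x, y, z)

# 2. 4D tensor + broadcast advanced indexing (~60 combinations)
_test_op(lambda a, b, c: a[:, c, :, b], x, y, z)
```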


[ghstack-poisoned]

pytorch-bot Bot commented Feb 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175030

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 9eaa9ff with merge base c031272:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

weifengpy added a commit that referenced this pull request Feb 14, 2026
ghstack-source-id: 2f0bbd9
Pull Request resolved: #175030
weifengpy added a commit that referenced this pull request Feb 14, 2026
ghstack-source-id: 0ce0bfb
Pull Request resolved: #175030
weifengpy added a commit that referenced this pull request Feb 14, 2026
ghstack-source-id: 3b02003
Pull Request resolved: #175030
@weifengpy weifengpy requested a review from wconstab February 14, 2026 15:49
The review comment below is attached to this hunk in test/distributed/tensor/test_tensor_ops.py:

torch.randint(5, (12, 8, 12)),
torch.randint(2, (12, 8, 12)),
)
# Commented out to fix distributed CI timeout: each 3-tensor call
Contributor

@wconstab wconstab Feb 17, 2026


@seemethere @atalman wondering if it is sufficient to instead of commenting this out, wrap it with TEST_WITH_SLOW? Do you know what the timeout threshold is for the TEST_WITH_SLOW category?
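
For context, a sketch of what that gating could look like. TEST_WITH_SLOW and slowTest are real helpers in torch.testing._internal.common_utils, but applying them here is only the suggestion above, not what this PR does:

```python
# Sketch only (the suggestion above, not what this PR does): gate the
# heavy 3-tensor cases on slow-test mode instead of removing them.
# TEST_WITH_SLOW is True when PYTORCH_TEST_WITH_SLOW=1 is set, and
# @slowTest skips the decorated test entirely outside slow mode.
from torch.testing._internal.common_utils import (
    TEST_WITH_SLOW,
    TestCase,
    run_tests,
    slowTest,
)


class DistTensorOpsTestSketch(TestCase):
    def test_index(self):
        # cheap 2-tensor cases would always run here
        if TEST_WITH_SLOW:
            pass  # also run the combinatorially heavy 3-tensor cases

    @slowTest
    def test_index_three_tensor(self):
        pass  # split-out variant that only runs in slow-test CI


if __name__ == "__main__":
    run_tests()
```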

Contributor

@wconstab wconstab left a comment


stamp to unblock. i will follow up with possibly converting these to TEST_WITH_SLOW instead

@wconstab
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 17, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable)

Details for Dev Infra team. Raised by workflow job.

@wconstab
Contributor

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 4 checks: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, linux.arm64.m7g.4xlarge), trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@github-actions github-actions Bot deleted the gh/weifengpy/78/head branch March 20, 2026 02:22
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#175030
Approved by: https://github.com/wconstab

Labels

ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · topic: not user facing (topic category)
