
[fix] DISABLED test_index (__main__.DistTensorOpsTest) #172373

Open

umarinkovic wants to merge 7 commits into pytorch:main from umarinkovic:fix/dist_tensorops_test

Conversation

@umarinkovic
Contributor

Fixes #171119

@pytorch-bot

pytorch-bot Bot commented Jan 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172373

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot Bot added the topic: not user facing label Jan 13, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jan 13, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

Contributor

@soulitzer left a comment


Not sure we want to special-case this in the tools/testing file?

soulitzer added the triaged label Jan 14, 2026
@umarinkovic
Contributor Author

Not sure we want to special-case this in the tools/testing file?

Hello, sorry for the delayed response. Yes, that makes perfect sense; my idea had been to do some more probing with the CI to see whether moving the test out of the slow category really was the cause of the failure.

I have since looked more into the testing infra and the test itself. There are no logs available from the failure apart from the message:

Timing out after 300 seconds and killing subprocesses.

It seems to me to be a resource contention issue, and adding this decorator should fix that. I've looked into splitting the test as well, but I don't think it makes much sense at the moment, since the indexing ops are run serially. Perhaps a combination would be in order, where some indexing ops are left to run in parallel with other tests and only the ops that cause the resource exhaustion are run serially in a separate test. Until we get more logs, though, I don't think it's possible to determine which ops belong in the parallel and serial categories. A sketch of the decorator change follows below.
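
For reference, here's a minimal sketch of what the serial marking could look like. serialTest does exist in torch.testing._internal.common_utils, but the class body and import paths below are my assumptions, since the diff itself isn't shown in this thread:

# Hypothetical sketch, not the actual diff: mark test_index so it runs
# serially instead of competing with parallel test shards for resources.
from torch.testing._internal.common_utils import run_tests, serialTest
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,  # base class assumed; import path may differ by version
    with_comms,
)

class DistTensorOpsTest(DTensorTestBase):
    @serialTest()  # opt this test out of parallel test execution
    @with_comms
    def test_index(self):
        ...  # the indexing ops themselves already run serially here

if __name__ == "__main__":
    run_tests()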

umarinkovic requested a review from soulitzer January 28, 2026 00:27
@soulitzer
Contributor

Thanks, looks like you still need to sign the CLA

@umarinkovic
Contributor Author

Thanks, looks like you still need to sign the CLA

Hi, I've signed the CLA.

@soulitzer previously approved these changes Jan 30, 2026
@soulitzer
Contributor

@pytorchbot merge

pytorch-bot Bot added the ciflow/trunk label Jan 30, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

pytorch-bot Bot removed the ciflow/trunk label Jan 30, 2026
@umarinkovic
Contributor Author

@soulitzer

Hi, I see that some jobs failed. From the logs:

2026-01-30T21:26:09.7613267Z ##[warning]Attempt 2 failed. Reason: Child_process exited with error code 1
2026-01-30T21:26:09.7828363Z + pushd ./.ci/docker
2026-01-30T21:26:09.7829291Z ~/actions-runner/_work/pytorch/pytorch/.ci/docker ~/actions-runner/_work/pytorch/pytorch
2026-01-30T21:26:09.7834468Z ++ echo ci-image:pytorch-linux-jammy-py3.14t-clang15-ae24b0be5176ba0a7a4a3a2a6ecb08b195685363
2026-01-30T21:26:09.7835099Z ++ awk -F '[:,]' '{print $1}'
2026-01-30T21:26:09.7854914Z + IMAGE_NAME=ci-image
2026-01-30T21:26:09.7856407Z + ./build.sh ci-image -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-py3.14t-clang15-ae24b0be5176ba0a7a4a3a2a6ecb08b195685363
2026-01-30T21:26:09.7868385Z + image=ci-image
2026-01-30T21:26:09.7868863Z + shift
2026-01-30T21:26:09.7869238Z + '[' -z ci-image ']'
2026-01-30T21:26:09.7869652Z + [[ ci-image == *xla* ]]
2026-01-30T21:26:09.7870404Z + [[ ci-image == *-jammy* ]]
2026-01-30T21:26:09.7870712Z + [[ ci-image == *-noble* ]]
2026-01-30T21:26:09.7871166Z + [[ ci-image == *ubuntu* ]]
2026-01-30T21:26:09.7871686Z + '[' -n '' ']'
2026-01-30T21:26:09.7872063Z + echo 'Unable to derive operating system base...'
2026-01-30T21:26:09.7872413Z + exit 1
2026-01-30T21:26:09.7872651Z Unable to derive operating system base...
2026-01-30T21:27:40.8558456Z ##[error]Final attempt failed. Child_process exited with error code 1

I pushed a linter fix during the CI run; could that be the reason?
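
Side note on the trace itself: the awk call reduces the full image tag to the bare repository name before build.sh checks it for an OS base, so none of the patterns can match. A rough Python rendering of that parsing step, purely for illustration:

# Mirrors `awk -F '[:,]' '{print $1}'` and the pattern checks visible in
# the log above (illustrative only, not part of the actual CI scripts).
image = "ci-image:pytorch-linux-jammy-py3.14t-clang15-ae24b0be5176ba0a7a4a3a2a6ecb08b195685363"
image_name = image.split(":")[0]               # -> "ci-image"
patterns = ("-jammy", "-noble", "ubuntu")      # OS-base checks from build.sh
print(any(p in image_name for p in patterns))  # False -> "Unable to derive operating system base..."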

@soulitzer
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix/dist_tensorops_test onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix/dist_tensorops_test && git pull --rebase)

pytorchmergebot force-pushed the fix/dist_tensorops_test branch from feccb91 to cf87455 February 2, 2026 17:06
@soulitzer
Contributor

@pytorchbot merge -i

@pytorchmergebot
Collaborator

@jeffdaily
Collaborator

This change did not fix the issue as claimed. Reverting to remove unnecessary change.

@jeffdaily
Collaborator

@pytorchbot revert -c nosignal -m "PR claims to fix ROCm DISABLED issue but it did not"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 16, 2026
…)"

This reverts commit 7072636.

Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment)))
@pytorchmergebot
Collaborator

@umarinkovic your PR has been successfully reverted.

pytorchmergebot added the Reverted and ci-no-td labels Feb 16, 2026
pytorch-bot Bot dismissed soulitzer's stale review February 16, 2026 16:59

This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.

jeffdaily pushed a commit that referenced this pull request Feb 16, 2026
…)"

This reverts commit 7072636.

Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment)))
atalman pushed a commit that referenced this pull request Feb 16, 2026
…)" (#175094)

This reverts commit 7072636.

Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment)))

Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
kumartanmay-28 pushed a commit to kumartanmay-28/pytorch_tanmay that referenced this pull request Feb 17, 2026
…ch#172373)"

This reverts commit 7072636.

Reverted pytorch#172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](pytorch#172373 (comment)))
pytorch-bot Bot removed the ciflow/trunk label Feb 20, 2026
@umarinkovic
Contributor Author

umarinkovic commented Feb 27, 2026

This change did not fix the issue as claimed. Reverting to remove unnecessary change.

Hi, I've changed the decorator to slowTest instead, to see if we can get it working this way, since the current indication is that the test started failing once it was taken out of the slow category. Do we have any logs from the failure? I wasn't able to find any, and the only thing I'm going by is the comment in the original issue saying that the test timed out after being taken out of the slow category.

Also, I don't have the means of testing this locally, so we should make sure CI passes consistently before merging.
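
For reference, a minimal sketch of the slowTest variant. slowTest does exist in torch.testing._internal.common_utils, but as before, the class body and import paths are my assumptions, since the diff isn't shown in this thread:

# Hypothetical sketch, not the actual diff: put test_index back in the
# slow category so it is skipped unless slow tests are enabled
# (PYTORCH_TEST_WITH_SLOW=1 in the slow CI shards).
from torch.testing._internal.common_utils import run_tests, slowTest
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,  # base class assumed; import path may differ by version
    with_comms,
)

class DistTensorOpsTest(DTensorTestBase):
    @slowTest  # unlike serialTest, slowTest is applied without parentheses
    @with_comms
    def test_index(self):
        ...  # same indexing ops; only the scheduling category changes

if __name__ == "__main__":
    run_tests()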

@github-actions
Contributor

github-actions Bot commented May 1, 2026

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

github-actions Bot added the Stale label May 1, 2026

Labels

ci-no-td (Do not run TD on this PR), Merged, open source, Reverted, Stale, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DISABLED test_index (__main__.DistTensorOpsTest)

5 participants