[fix] DISABLED test_index (__main__.DistTensorOpsTest) #172373
umarinkovic wants to merge 7 commits into pytorch:main
Conversation
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/172373
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
soulitzer left a comment:
Not sure we want to special case in the tools/testing file?
Hello, sorry for the delayed response. Yes, that makes perfect sense; my idea had been to get more probing done with the CI to see whether the cause of the failure really was moving the test out of the slow category. I have since looked into the testing infra and the test itself, since there are no logs available from the failure beyond the message:
It seems to me to be a resource contention issue, and adding this decorator should fix it (see the sketch below). I've looked into splitting the test as well, but I don't think it makes a lot of sense at the moment, since the indexing ops are run serially within the test. A combination might be in order, where some indexing ops are left to run in parallel with other tests while only the ops that cause the resource exhaustion are run serially in a separate test. But until we get more logs, I don't think it's possible to determine which ops belong in the parallel and serial categories.
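As a minimal sketch of what the change could look like (this is not the PR's actual diff), assuming the serialTest decorator from torch.testing._internal.common_utils and the usual DTensor test helpers:

```python
# Hypothetical sketch, not the actual patch: mark test_index to run serially
# so it does not contend with parallel test processes for GPU/host resources.
# serialTest comes from torch.testing._internal.common_utils; the DTensor
# helper imports below are assumptions about where the test class lives.
from torch.testing._internal.common_utils import run_tests, serialTest
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)


class DistTensorOpsTest(DTensorTestBase):
    @serialTest()  # ask the test runner to schedule this test outside the parallel pool
    @with_comms
    def test_index(self):
        ...  # the indexing ops under test, elided here


if __name__ == "__main__":
    run_tests()
```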
Thanks, it looks like you still need to sign the CLA.

Hi, I've signed the CLA.

@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
Hi, I see that some jobs failed. From the logs:

```
2026-01-30T21:26:09.7613267Z ##[warning]Attempt 2 failed. Reason: Child_process exited with error code 1
2026-01-30T21:26:09.7828363Z + pushd ./.ci/docker
2026-01-30T21:26:09.7829291Z ~/actions-runner/_work/pytorch/pytorch/.ci/docker ~/actions-runner/_work/pytorch/pytorch
2026-01-30T21:26:09.7834468Z ++ echo ci-image:pytorch-linux-jammy-py3.14t-clang15-ae24b0be5176ba0a7a4a3a2a6ecb08b195685363
2026-01-30T21:26:09.7835099Z ++ awk -F '[:,]' '{print $1}'
2026-01-30T21:26:09.7854914Z + IMAGE_NAME=ci-image
2026-01-30T21:26:09.7856407Z + ./build.sh ci-image -t 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-py3.14t-clang15-ae24b0be5176ba0a7a4a3a2a6ecb08b195685363
2026-01-30T21:26:09.7868385Z + image=ci-image
2026-01-30T21:26:09.7868863Z + shift
2026-01-30T21:26:09.7869238Z + '[' -z ci-image ']'
2026-01-30T21:26:09.7869652Z + [[ ci-image == *xla* ]]
2026-01-30T21:26:09.7870404Z + [[ ci-image == *-jammy* ]]
2026-01-30T21:26:09.7870712Z + [[ ci-image == *-noble* ]]
2026-01-30T21:26:09.7871166Z + [[ ci-image == *ubuntu* ]]
2026-01-30T21:26:09.7871686Z + '[' -n '' ']'
2026-01-30T21:26:09.7872063Z + echo 'Unable to derive operating system base...'
2026-01-30T21:26:09.7872413Z + exit 1
2026-01-30T21:26:09.7872651Z Unable to derive operating system base...
2026-01-30T21:27:40.8558456Z ##[error]Final attempt failed. Child_process exited with error code 1
```

I pushed a linter fix during the CI run, could that be the reason?

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased feccb91 to cf87455.
@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 6 checks: Lint / lintrunner-noclang-all / linux-job, trunk / macos-py3-arm64 / test (default, 2, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (default, 1, 3, macos-m1-stable), trunk / macos-py3-arm64 / test (openreg, 1, 1, macos-m1-stable), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This change did not fix the issue as claimed. Reverting to remove the unnecessary change.

@pytorchbot revert -c nosignal -m "PR claims to fix ROCm DISABLED issue but it did not"

@pytorchbot successfully started a revert job. Check the current status here.
…)" This reverts commit 7072636. Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment)))
@umarinkovic your PR has been successfully reverted.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
…)" This reverts commit 7072636. Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment)))
…)" (#175094) This reverts commit 7072636. Reverted #172373 on behalf of https://github.com/jeffdaily due to PR claims to fix ROCm DISABLED issue but it did not ([comment](#172373 (comment))) Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Hi, I've changed the decorator to slowTest instead (roughly as sketched below), to see if we can get it working this way, since the current indication is that the test started failing once it was taken out of the slow category of tests. Do we have any logs from the failure? I wasn't able to find any, and the only thing I'm going by is the comment in the original issue saying that the test timed out after being taken out of the slow category. Also, I don't have the means to test this locally, so we should make sure CI passes consistently before merging.
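A minimal sketch of the slowTest variant, assuming the slowTest decorator from torch.testing._internal.common_utils (to my understanding it skips the test unless the run sets PYTORCH_TEST_WITH_SLOW, so the test executes only in the dedicated slow-test shards):

```python
# Hypothetical sketch: move the test back into the slow category. slowTest
# raises SkipTest unless PYTORCH_TEST_WITH_SLOW is set in the environment,
# so the test runs only in the slow-test CI shards. The helper imports are
# assumptions, as in the earlier sketch.
from torch.testing._internal.common_utils import run_tests, slowTest
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)


class DistTensorOpsTest(DTensorTestBase):
    @slowTest
    @with_comms
    def test_index(self):
        ...  # the indexing ops under test, elided here


if __name__ == "__main__":
    run_tests()
```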
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Fixes #171119