Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177101
Note: Links to docs will display an error until the docs builds have been completed.
⏳ 3 Pending, 1 Unrelated Failure as of commit 1fd9c49 with merge base 5843119.
FLAKY: The following job failed but was likely due to flakiness present on trunk.
UNSTABLE: The following jobs are marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ryanzhang22 left a comment:
One CI test failure, but at first glance it looks unrelated.
I dug into the new failure (https://github.com/pytorch/pytorch/actions/runs/22930355625/job/66606473310?pr=177101), and I think it's unrelated to the Kineto changes, as it's a test about CUDA graph capture.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few of them: trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.4). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i

Merge started. Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3.10-gcc11 / test (distributed, 1, 2, lf.linux.2xlarge), trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.4). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "I could be wrong, but look like this introduced rocm distributed failures, see https://hud.pytorch.org/hud/pytorch/pytorch/d9d7c0bef5db069c6c47a49c1472f2d3fe034aec/2?per_page=50&name_filter=trunk%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(dist&mergeEphemeralLF=true" -c ignoredsignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 17af810. Reverted #177101 on behalf of https://github.com/malfet due to: I could be wrong, but look like this introduced rocm distributed failures, see https://hud.pytorch.org/hud/pytorch/pytorch/d9d7c0bef5db069c6c47a49c1472f2d3fe034aec/2?per_page=50&name_filter=trunk%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(dist&mergeEphemeralLF=true ([comment](#177101))
@scotts your PR has been successfully reverted.
@malfet @scotts While it does seem like the kineto bump (and hence rocprofiler-sdk) made ROCm distributed tests run slower and hence frequently time out, there are also other issues causing distributed timeouts on ROCm. One of them is being addressed by #176251. While we try to get this situation under control, I have marked the trunk distributed jobs as unstable: #177301
Let's reland this then? |
@pytorchbot merge

cc @malfet in case you want to cancel the merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
## Motivation

Fixes #3962

- The `rocprofiler-sdk` shared library is not being preloaded, causing `librocprofiler-sdk.so.1` to be missing at runtime. This is because the PyTorch `kineto` submodule was bumped, which switched from `roctracer` to `rocprofiler-sdk`: pytorch/pytorch#177101
- `test_mempool_expandable` was enabled on ROCm by pytorch/pytorch#173330. This test requires the rocm[devel] packages and was causing a crash: https://github.com/ROCm/TheRock/actions/runs/23164829934/job/67321547840. It is already skipped for other torch versions.
- Also skip `test_mempool_empty_cache_inactive`, `test_mempool_limited_memory_with_allocator`, `test_deleted_mempool_not_used_on_oom`, and `test_mempool_ctx_multithread`, as these also require building `dummy_allocator` and are skipped in other torch versions.

## Technical Details

- Adds `rocprofiler-sdk` to `LINUX_LIBRARY_PRELOADS` in `build_prod_wheels.py` so that `librocprofiler-sdk.so` is loaded.
- Registers `rocprofiler-sdk` as a `LibraryEntry` in `_dist_info.py` so the `rocm_sdk` package can resolve the name to the actual `.so` file.

## Test Plan

- Verify that ROCm builds, the nightly smoke tests pass, and running the torch tests does not crash.

## Test Result

- ROCm builds successfully: https://github.com/ROCm/TheRock/actions/runs/23152017500
- Smoke tests pass for torch nightly and the runner does not crash: https://github.com/ROCm/TheRock/actions/runs/23253453219

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Updates Kineto to hash 00355051f09eef00ba32c366326e73e8057421da from March 10, 2026. See: https://github.com/pytorch/kineto/tree/00355051f09eef00ba32c366326e73e8057421da Pull Request resolved: pytorch#177101 Approved by: https://github.com/ryanzhang22, https://github.com/jathu