bump kineto submodule to 0035505 #177101

Closed

scotts wants to merge 1 commit into pytorch:main from scotts:bump_kineto_03_10_2026

Conversation

@scotts (Contributor) commented Mar 11, 2026

Updates Kineto to hash 00355051f09eef00ba32c366326e73e8057421da from March 10, 2026. See: https://github.com/pytorch/kineto/tree/00355051f09eef00ba32c366326e73e8057421da
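For context, a submodule bump like this one moves only the gitlink recorded in the superproject. The sketch below is a self-contained demo of that mechanic using throwaway local repositories (the repo layout, helper name, and two-commit history are invented stand-ins for pytorch and kineto; these are not the PR's actual commands):

```python
import os
import subprocess
import tempfile

def git(*args, cwd=None):
    """Run git with a throwaway identity and return stdout."""
    env = dict(os.environ,
               GIT_AUTHOR_NAME="demo", GIT_AUTHOR_EMAIL="demo@example.com",
               GIT_COMMITTER_NAME="demo", GIT_COMMITTER_EMAIL="demo@example.com")
    return subprocess.run(["git", *args], cwd=cwd, env=env, check=True,
                          capture_output=True, text=True).stdout.strip()

work = tempfile.mkdtemp()
kineto = os.path.join(work, "kineto")   # stand-in for pytorch/kineto
super_ = os.path.join(work, "super")    # stand-in for pytorch/pytorch

# Stand-in "kineto" with two commits; the newer one is the bump target.
git("init", "-q", kineto)
open(os.path.join(kineto, "f"), "w").write("v1")
git("add", "f", cwd=kineto); git("commit", "-qm", "v1", cwd=kineto)
open(os.path.join(kineto, "f"), "w").write("v2")
git("commit", "-qam", "v2", cwd=kineto)
new_hash = git("rev-parse", "HEAD", cwd=kineto)

# Stand-in superproject pinning the submodule at the older commit.
git("init", "-q", super_)
git("-c", "protocol.file.allow=always", "submodule", "--quiet", "add",
    kineto, "third_party/kineto", cwd=super_)
sub = os.path.join(super_, "third_party", "kineto")
git("checkout", "-q", "HEAD~1", cwd=sub)
git("add", "third_party/kineto", cwd=super_)
git("commit", "-qm", "pin old kineto", cwd=super_)

# The bump itself: check out the target hash inside the submodule, then
# commit the updated gitlink in the superproject.
git("fetch", "-q", "origin", cwd=sub)
git("checkout", "-q", new_hash, cwd=sub)
git("add", "third_party/kineto", cwd=super_)
git("commit", "-qm", "bump kineto submodule", cwd=super_)

recorded = git("ls-tree", "HEAD", "third_party/kineto", cwd=super_).split()[2]
print("superproject now records:", recorded)
```

Because only the commit hash moves, `.gitmodules` is untouched by a bump like this; the submodule's URL and path stay the same.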

@pytorch-bot bot added the topic: not user facing (topic category) label on Mar 11, 2026
@pytorch-bot bot commented Mar 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177101

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 3 Pending, 1 Unrelated Failure

As of commit 1fd9c49 with merge base 5843119:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@scotts added labels on Mar 11, 2026: ciflow/trunk (Trigger trunk jobs on your pull request), ci-no-td (Do not run TD on this PR), ciflow/rocm-mi200 (Trigger "default" config CI on ROCm MI200), ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
@ryanzhang22 (Contributor) left a comment


One CI test failure, but at first glance it looks unrelated.

@jathu (Contributor) left a comment


thanks!

@scotts (Contributor, Author) commented Mar 11, 2026

I dug into the new failure (https://github.com/pytorch/pytorch/actions/runs/22930355625/job/66606473310?pr=177101), and I think it's unrelated to the Kineto changes as it's a test about CUDA graph capture:

distributed/test_c10d_ops_nccl.py::ProcessGroupNCCLOpTest::test_allreduce_in_cudagraph [rank1]:[E311 13:57:29.074453410 ProcessGroupNCCL.cpp:2126] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: operation not permitted when stream is capturing

AMD docs: https://rocm.docs.amd.com/projects/HIP/en/docs-6.3.3/reference/error_codes.html#term-hipErrorStreamCaptureUnsupported

See the test:

def test_allreduce_in_cudagraph(self):

@scotts (Contributor, Author) commented Mar 11, 2026

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.4)

Details for Dev Infra team: raised by workflow job.

@ngimel (Collaborator) commented Mar 11, 2026

@pytorchbot merge -i

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3.10-gcc11 / test (distributed, 1, 2, lf.linux.2xlarge), trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.4)


@malfet (Contributor) commented Mar 12, 2026

@pytorchbot revert -m "I could be wrong, but look like this introduced rocm distributed failures, see https://hud.pytorch.org/hud/pytorch/pytorch/d9d7c0bef5db069c6c47a49c1472f2d3fe034aec/2?per_page=50&name_filter=trunk%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(dist&mergeEphemeralLF=true" -c ignoredsignal

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.

@pytorchmergebot (Collaborator)

@scotts your PR has been successfully reverted.

@jithunnair-amd (Collaborator):

@malfet @scotts While it does seem like the kineto bump (and hence rocprofiler-sdk) made ROCm distributed tests run slower and hence frequently timeout, there are also other issues causing distributed timeouts on ROCm. One of them is being addressed by #176251. While we try to get this situation under control, I have marked the trunk distributed jobs as unstable: #177301

@ngimel (Collaborator) commented Mar 12, 2026

Let's reland this then?

@huydhn (Contributor) commented Mar 12, 2026

@pytorchbot merge

cc @malfet in case you want to cancel the merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


darren-amd added a commit to ROCm/TheRock that referenced this pull request Mar 19, 2026
## Motivation

Fixes #3962

- The `rocprofiler-sdk` shared library is not being preloaded, causing
`librocprofiler-sdk.so.1` to be missing at runtime. This is because the
PyTorch `kineto` submodule was bumped which switched from `roctracer` to
`rocprofiler-sdk`: pytorch/pytorch#177101
- `test_mempool_expandable` was enabled on ROCm by
pytorch/pytorch#173330. This test was failing as
it requires the rocm[devel] packages but was causing a crash:
https://github.com/ROCm/TheRock/actions/runs/23164829934/job/67321547840.
This test is currently already skipped for other torch versions.
- Also skip `test_mempool_empty_cache_inactive`,
`test_mempool_limited_memory_with_allocator`,
`test_deleted_mempool_not_used_on_oom`, and
`test_mempool_ctx_multithread` as these also require building
`dummy_allocator` and are skipped in other torch versions.

## Technical Details

- Adds `rocprofiler-sdk` to `LINUX_LIBRARY_PRELOADS` in
`build_prod_wheels.py` so that `librocprofiler-sdk.so` is loaded
- Registers `rocprofiler-sdk` as a `LibraryEntry` in `_dist_info.py` so
the `rocm_sdk` package can resolve the name to the actual `.so` file.

## Test Plan

- Verify that ROCm builds, the nightly smoke tests pass and that running
the torch tests do not crash

## Test Result

- ROCm builds successfully:
https://github.com/ROCm/TheRock/actions/runs/23152017500
- Smoke tests pass for torch nightly and the runner is not crashing:
https://github.com/ROCm/TheRock/actions/runs/23253453219

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
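The preload fix described above relies on a standard dynamic-linking mechanism: loading a shared library with `RTLD_GLOBAL` before its dependents run makes its symbols and identity resolvable in-process. As a rough illustration (this is not TheRock's actual code; the `preload` helper is invented, and `libm` stands in for `librocprofiler-sdk.so`, which is not available here):

```python
import ctypes
import ctypes.util

def preload(short_name):
    """Resolve and load a shared library the way a preload list would.

    Hypothetical helper for illustration only. `short_name` is the
    unprefixed name, e.g. "m" for libm.
    """
    path = ctypes.util.find_library(short_name)
    if path is None:
        raise OSError(f"cannot resolve lib{short_name}")
    # RTLD_GLOBAL exports the library's symbols to anything dlopen'd
    # afterwards, so later code that expects the .so to already be
    # present in the process (as PyTorch expects librocprofiler-sdk.so.1
    # after the kineto bump) can find it.
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)

handle = preload("m")
print("loaded:", handle._name)
```

This is why registering the library in a preload list (rather than patching every consumer) is enough to fix a "missing at runtime" error for a transitively required `.so`.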
chiranjeevipattigidi pushed a commit to ROCm/TheRock that referenced this pull request Mar 23, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026

Labels

ci-no-td (Do not run TD on this PR), ciflow/rocm-mi200 (Trigger "default" config CI on ROCm MI200), ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, Reverted, topic: not user facing (topic category)
