Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/173330
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 13 Unrelated Failures as of commit 7a4a90a with merge base 8be2451:
- NEW FAILURE - The following job has failed:
- FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
- BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Force-pushed b19983c to 3b3ffe3.
Force-pushed 3b3ffe3 to e4e0d36.
We have found that for the unit tests to fully pass, we need this HIP patch: ROCm/rocm-systems#3023.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased: e4e0d36 to d6fed57.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased: 78869d4 to 922a0d4.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
Noting that the one current failure is also seen on other PRs, so it's not related.
@pytorchbot merge -f "need to use force merge due to unrelated blocking failure, all other flaky CI is known; reason for revert has been addressed"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "reverted internally, original:D96556656, revert diff: D96725665" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 088c5a7. Reverted #173330 on behalf of https://github.com/yangw-dev due to: reverted internally, original: D96556656, revert diff: D96725665.
@pragupta your PR has been successfully reverted.
Let me import to reland this |
## Motivation
Fixes #3962
- The `rocprofiler-sdk` shared library is not being preloaded, causing `librocprofiler-sdk.so.1` to be missing at runtime. This is because the PyTorch `kineto` submodule was bumped, which switched from `roctracer` to `rocprofiler-sdk`: pytorch/pytorch#177101
- `test_mempool_expandable` was enabled on ROCm by pytorch/pytorch#173330. This test requires the rocm[devel] packages and was causing a crash: https://github.com/ROCm/TheRock/actions/runs/23164829934/job/67321547840. It is already skipped for other torch versions.
- Also skip `test_mempool_empty_cache_inactive`, `test_mempool_limited_memory_with_allocator`, `test_deleted_mempool_not_used_on_oom`, and `test_mempool_ctx_multithread`, as these also require building `dummy_allocator` and are skipped for other torch versions.

## Technical Details
- Adds `rocprofiler-sdk` to `LINUX_LIBRARY_PRELOADS` in `build_prod_wheels.py` so that `librocprofiler-sdk.so` is loaded.
- Registers `rocprofiler-sdk` as a `LibraryEntry` in `_dist_info.py` so the `rocm_sdk` package can resolve the name to the actual `.so` file.

## Test Plan
- Verify that ROCm builds, the nightly smoke tests pass, and running the torch tests does not crash.

## Test Result
- ROCm builds successfully: https://github.com/ROCm/TheRock/actions/runs/23152017500
- Smoke tests pass for torch nightly and the runner does not crash: https://github.com/ROCm/TheRock/actions/runs/23253453219

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Summary: Original pull request: #173330
Fixes #168737. Fixes #168736.

The original diff enabled expandable segments for ROCm by adding `#ifdef USE_ROCM` guards throughout CUDACachingAllocator.cpp to use HIP APIs (hipMemAddressReserve, hipMemCreate, hipMemMap, etc.) instead of CUDA driver APIs when building for ROCm.

Root cause: In HIP/ROCm 6.2.1, the field name for memory allocation properties is `requestedHandleType` (singular), not `requestedHandleTypes` (plural) as in CUDA. Additionally, `hipMemHandleTypeFabric` does not exist in HIP, so the `CU_MEM_HANDLE_TYPE_FABRIC` assignment must be skipped on ROCm.

Fix applied on top of the original diff (from D96652342):
- Use `prop.requestedHandleType = hipMemHandleTypePosixFileDescriptor` under `#ifdef USE_ROCM` (singular field name, HIP constant)
- Use `prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR` for CUDA (plural field name, CUDA constant)
- Skip the `CU_MEM_HANDLE_TYPE_FABRIC` assignment entirely on ROCm under `#ifndef USE_ROCM`, as `hipMemHandleTypeFabric` does not exist in HIP

Co-authored-by: Prachi Gupta prachi.gupta@amd.com
Co-authored-by: Jeff Daily jeff.daily@amd.com
Co-authored-by: moonshadow-25 moonshadow-25@users.noreply.github.com
Co-authored-by: Vighanesh Sharma vighaneshsharma@gmail.com

Test Plan:
```
fbpkg build //aps_models/ads/ecosystem/eval/cogwheel_tests/amd:cogwheel_aps_ads_icvr_kd_eval_amd_test_harness --build-remote
```
https://www.internalfb.com/sandcastle/workflow/1049338713192153464

Differential Revision: D97211385
Pull Request resolved: #177974. Approved by: https://github.com/jeffdaily, https://github.com/echen4096
closing this one as relanded here: #177974
…77974) (cherry picked from commit 5792701)
…77974) (#3106) Cherry-picked from commit 5792701 (pytorch#177974).
Co-authored-by: Haoyu Zhang <haoyuz@meta.com>
Pull Request resolved: pytorch#173330 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes #168737.
Fixes #168736.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela