[ROCm] Build FBGEMM_GENAI for gfx942 only #162648
jithunnair-amd wants to merge 5 commits into pytorch:main
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162648
❌ 1 New Failure, 5 Cancelled Jobs, 1 Unrelated Failure as of commit 0aac9b5 with merge base 3a7db34.
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Can you clarify exactly what code was removed in my PR that caused the build time to increase for ROCm? It's not clear to me and I'd like to understand, thanks. |
b6d0a9e#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777L277-L282 |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. |
|
Successfully rebased |
force-pushed from 3e42b38 to a55349f
|
@jithunnair-amd this change looks good to me overall. I just clarified what I meant before: for now we should only build for gfx942 with FBGEMM + ROCm. Do you plan to merge it soon? |
@cthi Yes, however, the CUDA build failures were a bit baffling. I'm going to try rebasing again. |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. |
|
Successfully rebased |
force-pushed from a55349f to 5b12488
|
@pytorchbot merge -f "CI failures unrelated. Merging to restore nightly libtorch builds" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). |
Despite narrowing down the [FBGEMM_GENAI build to gfx942](#162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897). This PR increases timeout for ROCm builds for both [libtorch](https://github.com/pytorch/pytorch/actions/runs/17969771026) and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently.
This PR is a more ROCm-targeted version of #162880 (which is for release/2.9 branch).
Pull Request resolved: #163776
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes build timeouts >4h on libtorch build jobs: https://hud.pytorch.org/hud/pytorch/pytorch/75e7f49f9c70116d7c4f8f86c3d0688ade306284/1?per_page=50&name_filter=inux-binary-libtorch%20%2F%20libtorch-rocm&mergeEphemeralLF=true
Brings back code to narrow down CK compilation targets from pytorch@69a25f6#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777
gfx942 supports fp8
Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738
Validation: [rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h.
Pull Request resolved: pytorch#162648
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
[ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776)
Despite narrowing down the [FBGEMM_GENAI build to gfx942](#162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897). This PR increases timeout for ROCm builds for both [libtorch](https://github.com/pytorch/pytorch/actions/runs/17969771026) and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently.
This PR is a more ROCm-targeted version of #162880 (which is for release/2.9 branch).
Pull Request resolved: #163776
Approved by: https://github.com/jeffdaily
(cherry picked from commit 0ec946a)
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
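To put the timeout change in numbers, here is a rough Python sketch of the headroom between build duration and job timeout. It assumes the previous limit was exactly 240 minutes (the thread only says builds timed out at ">4h") and uses the 3.9h figure reported for the validation builds as an upper bound. The function and constant names are made up for illustration and do not come from the PyTorch workflows.

```python
# Illustrative arithmetic only; the 240-minute old limit is an assumption
# based on the "4hr mark" wording, and 3.9h is the reported validation time.

def headroom_minutes(build_hours, timeout_minutes):
    """Minutes left between a build's duration and the job timeout."""
    return timeout_minutes - round(build_hours * 60)

OLD_TIMEOUT_MIN = 240  # assumed previous limit ("4hr mark")
NEW_TIMEOUT_MIN = 300  # 5 hours, as stated in the commit title

print(headroom_minutes(3.9, OLD_TIMEOUT_MIN))  # 6
print(headroom_minutes(3.9, NEW_TIMEOUT_MIN))  # 66
```

Under these assumptions a 3.9h build leaves only a few minutes for post-build steps at the old limit, which is consistent with the timeouts described above.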
Fixes build timeouts >4h on libtorch build jobs: https://hud.pytorch.org/hud/pytorch/pytorch/75e7f49f9c70116d7c4f8f86c3d0688ade306284/1?per_page=50&name_filter=inux-binary-libtorch%20%2F%20libtorch-rocm&mergeEphemeralLF=true
Brings back code to narrow down CK compilation targets from 69a25f6#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777
gfx942 supports fp8
Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738
Validation:
rocm6.4 and rocm6.3 libtorch builds finished within 3.9h.
cc @jeffdaily @sunway513 @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @danielvegamyhre (since their change had removed this snippet, causing ROCm build times to exceed 4h)
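The actual change lives in PyTorch's CMake files, which are not shown here. As a loose illustration of the idea in the description (compile the FBGEMM_GENAI/CK kernels only for gfx942, leaving out gfx950 for now), the filtering can be sketched in Python. All names below are hypothetical, the requested architecture list is just an example, and skipping the component when no supported architecture is requested is an assumption rather than confirmed behavior.

```python
# Hypothetical sketch of narrowing GPU compilation targets; this is NOT
# PyTorch's CMake logic, only an illustration of the filtering idea.

# gfx942 supports fp8; gfx950 is deliberately left out for now.
FBGEMM_GENAI_SUPPORTED_ARCHS = ("gfx942",)

def narrow_fbgemm_genai_archs(requested_archs):
    """Return the requested archs that FBGEMM_GENAI should be compiled for."""
    return [a for a in requested_archs if a in FBGEMM_GENAI_SUPPORTED_ARCHS]

# Example semicolon-separated arch list (illustrative values).
requested = "gfx90a;gfx942;gfx950;gfx1100".split(";")
selected = narrow_fbgemm_genai_archs(requested)

# Assumption: with no supported arch requested, the component is skipped.
build_fbgemm_genai = bool(selected)
print(selected, build_fbgemm_genai)
```

Compiling for one architecture instead of every requested one reduces the number of kernel instantiations, which is where the build-time saving would come from.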