
[ROCm] Optimize the stride one indexing backwards kernel#146420

Closed
doru1004 wants to merge 2 commits into pytorch:main from doru1004:improve-backwards-indexing-with-stride-1

Conversation

@doru1004
Contributor

@doru1004 doru1004 commented Feb 4, 2025

This patch makes several changes to the stride-1 backwards indexing kernel:

  • The computation across the `sorted_indices` array now happens in parallel across all the lanes in the warp, so accesses to `sorted_indices` are fully coalesced.
  • Duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
  • Skipping is enabled during the duplicate count: when the number of duplicates is large, the count can skip 32 values at a time.
  • For a low number of duplicates (fewer than warp-size), only the tail reduction is performed, which avoids a wasteful parallel reduction across the warp (it would only add zero values).
  • For a high number of duplicates (more than warp-size), the full warp of lanes still computes the reduced value with as much parallelism as possible: all lanes stick around and cooperatively execute the reduction whenever a single `idx` has a large number of duplicates (a duplicate spike). Shared memory is used to pass the duplicate count, computed in parallel in the first part of the kernel, to the cooperative-reduction part of the kernel.

Benchmarks on examples extracted from workloads show a 3.6x to 10x speed-up.

Co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>
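The skip-ahead duplicate count described above can be sketched as a scalar Python model. The real kernel is HIP C++ running one count per warp lane; `count_duplicates` and the fixed `WARP_SIZE` of 32 here are illustrative assumptions (ROCm wavefronts may actually be 64 lanes wide), not the kernel's actual code:

```python
WARP_SIZE = 32  # skip step from the description; real wavefronts may differ

def count_duplicates(sorted_indices, start):
    """Count how many times sorted_indices[start] repeats from `start` onward.

    Models the skip optimization: because the array is sorted, if the value
    WARP_SIZE - 1 positions ahead still matches, every value in between must
    match too, so the count can jump WARP_SIZE values at a time.
    """
    n = len(sorted_indices)
    idx = sorted_indices[start]
    pos = start
    count = 0
    # Skip phase: jump WARP_SIZE values at a time while a whole chunk matches.
    while pos + WARP_SIZE <= n and sorted_indices[pos + WARP_SIZE - 1] == idx:
        count += WARP_SIZE
        pos += WARP_SIZE
    # Tail phase: count the remaining (< WARP_SIZE) duplicates one by one.
    while pos < n and sorted_indices[pos] == idx:
        count += 1
        pos += 1
    return count
```

For a run of 70 duplicates this takes two skip steps plus a 6-element tail instead of 70 single-element comparisons.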

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
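The low- and high-duplicate reduction paths can likewise be modeled in scalar Python. `reduce_run`, the strided slices, and the tree combine below are hypothetical stand-ins for the kernel's per-lane tail loop and warp-shuffle reduction, not its actual code:

```python
WARP_SIZE = 32

def reduce_run(values):
    """Sum one run of duplicate gradients, modeling the two kernel paths."""
    if len(values) < WARP_SIZE:
        # Low-duplicate path: a plain serial tail reduction by one lane;
        # a warp-wide parallel reduction here would only add zero values.
        total = 0.0
        for v in values:
            total += v
        return total
    # High-duplicate path: all lanes cooperate on the spike. Each lane
    # accumulates a strided slice of the run...
    partials = [sum(values[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
    # ...then the partial sums are combined pairwise, modeling a
    # shuffle-based tree reduction across the warp.
    width = WARP_SIZE
    while width > 1:
        width //= 2
        for lane in range(width):
            partials[lane] += partials[lane + width]
    return partials[0]
```

The branch on `WARP_SIZE` mirrors the description's split: short runs avoid the reduction machinery entirely, while a duplicate spike keeps every lane busy.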

@pytorch-bot

pytorch-bot bot commented Feb 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146420

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 18 Pending, 4 Unrelated Failures

As of commit 7e97847 with merge base d0f08dc:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: cuda release notes category labels Feb 4, 2025
@pruthvistony pruthvistony added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm rocm This tag is for PRs from ROCm team topic: not user facing topic category and removed release notes: cuda release notes category labels Feb 4, 2025
@soulitzer soulitzer requested a review from jeffdaily February 5, 2025 04:12
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from d08531f to d409805 Compare February 5, 2025 06:42
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 7, 2025
@doru1004 doru1004 changed the title [ROCm] Enable tunable warp size for stride one indexing backwards kernel [ROCm] Optimize the stride one indexing backwards kernel Feb 17, 2025
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch 3 times, most recently from 5a1e23e to 7468adc Compare February 17, 2025 22:03
@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased improve-backwards-indexing-with-stride-1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout improve-backwards-indexing-with-stride-1 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the improve-backwards-indexing-with-stride-1 branch from 7468adc to 4c8777a Compare February 18, 2025 00:38
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch 4 times, most recently from f0129a2 to 533b929 Compare February 19, 2025 22:11
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Feb 20, 2025
…se/OneDupOpt (#1897)" (#1918)

This reverts commit a6f1375. 
We have a better fix that is being validated
(pytorch#146420)
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-cuda12.4-py3-gcc9-slow-gradcheck / test (default, 1, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck)

Details for Dev Infra team Raised by workflow job

@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/146420/head returned non-zero exit code 1

Rebasing (1/9)
Rebasing (2/9)
Rebasing (3/9)
Rebasing (4/9)
Rebasing (5/9)
Rebasing (6/9)
Auto-merging tools/amd_build/build_amd.py
CONFLICT (content): Merge conflict in tools/amd_build/build_amd.py
error: could not apply bd10dc3714a... do not replace kernel launches
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply bd10dc3714a... do not replace kernel launches

Raised by https://github.com/pytorch/pytorch/actions/runs/13501846960

@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from 18b49e9 to 2ce680b Compare February 24, 2025 15:49
@doru1004
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #146420, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from 2ce680b to bee9dd0 Compare February 24, 2025 17:33
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from bee9dd0 to 7e97847 Compare February 24, 2025 17:34
@jeffdaily
Collaborator

@pytorchbot merge -f "rocm-only ifdef'd change, all relevant rocm jobs passing; dist rocm jobs are failing due to broken commit on main, not this PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

aditew01 pushed a commit that referenced this pull request Feb 28, 2025
Pull Request resolved: #146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Mar 4, 2025
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Mar 4, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Mar 11, 2025

Labels

ciflow/periodic, ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, rocm, topic: not user facing, triaged


6 participants