
[ROCm] Optimize the stride one indexing backwards kernel#146420

Closed
doru1004 wants to merge 2 commits into pytorch:main from doru1004:improve-backwards-indexing-with-stride-1

Conversation

@doru1004
Contributor

@doru1004 doru1004 commented Feb 4, 2025

This patch makes several changes to the stride-1 backwards indexing kernel:

  • The computation across the `sorted_indices` array now happens in parallel across all the lanes in the warp, so accesses to `sorted_indices` are fully coalesced.
  • Duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
  • Skipping is enabled during the duplicate count: when the number of duplicates is large, the count can skip 32 values at a time.
  • For a low number of duplicates (fewer than warp-size), only the tail reduction is performed, which avoids a wasteful parallel reduction across the warp (it would only add zero values).
  • For a high number of duplicates (more than warp-size), the full warp of lanes still computes the reduced value with as much parallelism as possible: all lanes stick around and cooperatively execute the reduction whenever a single `idx` has a large number of duplicates (a duplicate spike). Shared memory is used to pass the duplicate count, computed in parallel in the first part of the kernel, to the cooperative-reduction part of the kernel.

Benchmarks on examples extracted from workloads show a 3.6x to 10x speed-up.

Co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>
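The skip-ahead duplicate count described above can be sketched as a scalar Python model. The real kernel is HIP C++ running one count per warp lane; `count_duplicates` and the fixed `WARP_SIZE` of 32 here are illustrative assumptions (ROCm wavefronts may actually be 64 lanes wide), not the kernel's actual code:

```python
WARP_SIZE = 32  # skip step from the description; real wavefronts may differ

def count_duplicates(sorted_indices, start):
    """Count how many times sorted_indices[start] repeats from `start` onward.

    Models the skip optimization: because the array is sorted, if the value
    WARP_SIZE - 1 positions ahead still matches, every value in between must
    match too, so the count can jump WARP_SIZE values at a time.
    """
    n = len(sorted_indices)
    idx = sorted_indices[start]
    pos = start
    count = 0
    # Skip phase: jump WARP_SIZE values at a time while a whole chunk matches.
    while pos + WARP_SIZE <= n and sorted_indices[pos + WARP_SIZE - 1] == idx:
        count += WARP_SIZE
        pos += WARP_SIZE
    # Tail phase: count the remaining (< WARP_SIZE) duplicates one by one.
    while pos < n and sorted_indices[pos] == idx:
        count += 1
        pos += 1
    return count
```

For a run of 70 duplicates this takes two skip steps plus a 6-element tail instead of 70 single-element comparisons.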

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
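The low- and high-duplicate reduction paths can likewise be modeled in scalar Python. `reduce_run`, the strided slices, and the tree combine below are hypothetical stand-ins for the kernel's per-lane tail loop and warp-shuffle reduction, not its actual code:

```python
WARP_SIZE = 32

def reduce_run(values):
    """Sum one run of duplicate gradients, modeling the two kernel paths."""
    if len(values) < WARP_SIZE:
        # Low-duplicate path: a plain serial tail reduction by one lane;
        # a warp-wide parallel reduction here would only add zero values.
        total = 0.0
        for v in values:
            total += v
        return total
    # High-duplicate path: all lanes cooperate on the spike. Each lane
    # accumulates a strided slice of the run...
    partials = [sum(values[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
    # ...then the partial sums are combined pairwise, modeling a
    # shuffle-based tree reduction across the warp.
    width = WARP_SIZE
    while width > 1:
        width //= 2
        for lane in range(width):
            partials[lane] += partials[lane + width]
    return partials[0]
```

The branch on `WARP_SIZE` mirrors the description's split: short runs avoid the reduction machinery entirely, while a duplicate spike keeps every lane busy.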

@pytorch-bot

pytorch-bot bot commented Feb 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146420

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 18 Pending, 4 Unrelated Failures

As of commit 7e97847 with merge base d0f08dc:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: cuda release notes category labels Feb 4, 2025
@pruthvistony pruthvistony added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm rocm This tag is for PRs from ROCm team topic: not user facing topic category and removed release notes: cuda release notes category labels Feb 4, 2025
@soulitzer soulitzer requested a review from jeffdaily February 5, 2025 04:12
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from d08531f to d409805 Compare February 5, 2025 06:42
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Feb 7, 2025
@doru1004 doru1004 changed the title [ROCm] Enable tunable warp size for stride one indexing backwards kernel [ROCm] Optimize the stride one indexing backwards kernel Feb 17, 2025
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch 3 times, most recently from 5a1e23e to 7468adc Compare February 17, 2025 22:03
@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased improve-backwards-indexing-with-stride-1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout improve-backwards-indexing-with-stride-1 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the improve-backwards-indexing-with-stride-1 branch from 7468adc to 4c8777a Compare February 18, 2025 00:38
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch 4 times, most recently from f0129a2 to 533b929 Compare February 19, 2025 22:11
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Feb 20, 2025
…se/OneDupOpt (#1897)" (#1918)

This reverts commit a6f1375. 
We have a better fix that is being validated
(pytorch#146420)
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: periodic / linux-focal-cuda12.4-py3-gcc9-slow-gradcheck / test (default, 1, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck)

Details for Dev Infra team Raised by workflow job

@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/146420/head returned non-zero exit code 1

Rebasing (1/9)
Rebasing (2/9)
Rebasing (3/9)
Rebasing (4/9)
Rebasing (5/9)
Rebasing (6/9)
Auto-merging tools/amd_build/build_amd.py
CONFLICT (content): Merge conflict in tools/amd_build/build_amd.py
error: could not apply bd10dc3714a... do not replace kernel launches
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply bd10dc3714a... do not replace kernel launches

Raised by https://github.com/pytorch/pytorch/actions/runs/13501846960

@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from 18b49e9 to 2ce680b Compare February 24, 2025 15:49
@doru1004
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@doru1004
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Tried to rebase and push PR #146420, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main

@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from 2ce680b to bee9dd0 Compare February 24, 2025 17:33
@doru1004 doru1004 force-pushed the improve-backwards-indexing-with-stride-1 branch from bee9dd0 to 7e97847 Compare February 24, 2025 17:34
@jeffdaily
Collaborator

@pytorchbot merge -f "rocm-only ifdef'd change, all relevant rocm jobs passing; dist rocm jobs are failing due to broken commit on main, not this PR"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

aditew01 pushed a commit that referenced this pull request Feb 28, 2025
Pull Request resolved: #146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Mar 4, 2025
rocm-mici pushed a commit to ROCm/pytorch that referenced this pull request Mar 4, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Mar 11, 2025

Labels

ciflow/periodic, ciflow/rocm, ciflow/trunk, Merged, module: rocm, open source, rocm, topic: not user facing, triaged


6 participants