[ROCm] Optimize the stride one indexing backwards kernel #146420
doru1004 wants to merge 2 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146420
Note: Links to docs will display an error until the docs builds have been completed. ⏳ 18 Pending, 4 Unrelated Failures as of commit 7e97847 with merge base d0f08dc.
FLAKY - The following jobs failed, but likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were also failing on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed d08531f to d409805 (Compare)
Force-pushed 5a1e23e to 7468adc (Compare)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased |
Force-pushed 7468adc to 4c8777a (Compare)
Force-pushed f0129a2 to 533b929 (Compare)
…se/OneDupOpt (#1897)" (#1918) This reverts commit a6f1375. We have a better fix that is being validated (pytorch#146420)
Merge failed. Reason: 1 job failed: periodic / linux-focal-cuda12.4-py3-gcc9-slow-gradcheck / test (default, 1, 8, linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck). Details for Dev Infra team: raised by workflow job.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Rebase failed due to a command failure. Raised by https://github.com/pytorch/pytorch/actions/runs/13501846960
Force-pushed 18b49e9 to 2ce680b (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Force-pushed 2ce680b to bee9dd0 (Compare)
Force-pushed bee9dd0 to 7e97847 (Compare)
@pytorchbot merge -f "rocm-only ifdef'd change, all relevant rocm jobs passing; dist rocm jobs are failing due to broken commit on main, not this PR"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This patch makes several changes to the stride 1 backwards indexing kernel, as follows:
- The computation across the `sorted_indices` array now happens in parallel across all the lanes in the warp, which means the accesses to `sorted_indices` are fully coalesced.
- Duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- Skipping is enabled during the duplicate count: for a large number of duplicates, the kernel can skip 32 values at a time to speed up the count.
- For a low number of duplicates (fewer than `warp-size`), only the tail reduction is performed, which avoids a wasteful parallel reduction across the warp in this case (it would only add zero values).
- For a high number of duplicates (more than `warp-size`), the full warp of lanes still computes the reduced value with as much parallelism as possible. This is done by making sure all lanes stick around and cooperatively execute the reduction when a single `idx` has a large number of duplicates (i.e. a duplicate spike). For this to happen, shared memory passes the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.

Benchmarks on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>

Pull Request resolved: #146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
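As a rough illustration of the duplicate-count-with-skipping idea, here is a scalar Python model (not the actual HIP kernel; `WARP_SIZE`, `count_duplicates`, and `accumulate` are hypothetical names chosen for this sketch). It shows how a counter can jump a whole warp-width of entries at a time while the far end of the chunk still matches, falling back to a one-by-one tail walk for the remainder:

```python
WARP_SIZE = 32  # assumption: 32 lanes, matching the "skip 32 values at a time" description

def count_duplicates(sorted_indices, start):
    """Count how many consecutive entries equal sorted_indices[start].

    Skip phase jumps WARP_SIZE entries at a time while the value
    WARP_SIZE ahead still matches; tail phase walks the remaining
    (< WARP_SIZE) duplicates one by one.
    """
    idx = sorted_indices[start]
    n = len(sorted_indices)
    pos = start
    while pos + WARP_SIZE < n and sorted_indices[pos + WARP_SIZE] == idx:
        pos += WARP_SIZE
    while pos < n and sorted_indices[pos] == idx:
        pos += 1
    return pos - start

def accumulate(grad_out, sorted_indices, grad_values):
    """Scalar stand-in for the backwards pass: sum all gradient values
    that target the same destination index."""
    i = 0
    while i < len(sorted_indices):
        dup = count_duplicates(sorted_indices, i)
        # In the kernel, dup < WARP_SIZE takes the serial tail reduction,
        # while dup >= WARP_SIZE uses the whole warp cooperatively.
        grad_out[sorted_indices[i]] += sum(grad_values[i:i + dup])
        i += dup
```

In the real kernel the skip test is a warp-coalesced load rather than a scalar comparison, but the control flow is the same: a coarse stride-32 scan followed by a fine tail scan.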
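The split between the tail reduction and the cooperative warp reduction can likewise be sketched in Python (again a model under assumptions, not the HIP code; the tree loop stands in for a warp shuffle-down reduction):

```python
WARP_SIZE = 32  # assumption: 32 lanes used for illustration

def reduce_duplicates(grad_values, start, dup_count):
    """Model of the two reduction paths for one duplicated index.

    Low-duplicate case (dup_count < WARP_SIZE): serial tail sum, since a
    warp-wide reduction would mostly add zeros. High-duplicate case: each
    lane accumulates a strided partial sum, then the partials are combined
    with a log2(WARP_SIZE)-step tree, mimicking a shuffle-down reduction.
    """
    vals = grad_values[start:start + dup_count]
    if dup_count < WARP_SIZE:
        total = 0.0
        for v in vals:  # tail reduction: plain serial sum
            total += v
        return total
    # Cooperative path: lane l owns elements l, l+32, l+64, ...
    partials = [sum(vals[lane::WARP_SIZE]) for lane in range(WARP_SIZE)]
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            # models lane l reading lane l+offset's partial (shuffle-down)
            partials[lane] += partials[lane + offset]
        offset //= 2
    return partials[0]
```

This is where the shared-memory handoff described above matters: the duplicate count produced in the first phase tells every lane whether to take the tail path or stay for the cooperative tree.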
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd