Use cub::BlockRadixSort to improve medium length sort performance#79628

Closed
peterbell10 wants to merge 10 commits intogh/peterbell10/335/basefrom
gh/peterbell10/335/head

Conversation

@peterbell10 (Collaborator) commented Jun 15, 2022

Stack from ghstack (oldest at bottom):

In my testing, replacing the custom bitonic sort with cub's block-level
radix sort primitives improves overall sort performance by up to 3x,
depending on input length. The replacement also has the benefit of being
a stable sort, so we get up to a 25x speedup for small stable sorts and
around a 2x speedup at the largest supported size.

In testing, the radix sort benefits a lot from having more items per
thread, which means it breaks down a bit at very small sizes. So, for
the 32-item sort I've left the bitonic sorting algorithm in place.

Binary size is also reduced by this change, because I have moved the
`descending` branch into the kernel itself, which I found not to affect
performance. The result is a 1.9 MB decrease in `torch_cuda.so` on
my build for one CUDA architecture.

peterbell10 added a commit that referenced this pull request Jun 15, 2022
ghstack-source-id: 7261a56
Pull Request resolved: #79628
@facebook-github-bot (Contributor) commented Jun 15, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 0ddb0e4 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@peterbell10 (Collaborator, Author) commented Jun 16, 2022

Here are my detailed benchmark results for all sizes in range(0, 4097, 4), expressed as speedup relative to the old implementation. All results come from an RTX 2060.

Unstable sort varies from a 1x to 3x speedup:
[speedup graph]

Stable sort varies from a 1x to 25x speedup (note that the 1x is for the 0-length sort, included as a sanity check):
[speedup graph]

@peterbell10 peterbell10 marked this pull request as ready for review June 16, 2022 16:56
@peterbell10 peterbell10 requested a review from ngimel June 16, 2022 16:56
@peterbell10 peterbell10 added labels Jun 16, 2022: module: performance (Issues related to performance, either of kernel code or framework glue), module: cuda (Related to torch.cuda, and CUDA support in general), release notes: cuda (release notes category), topic: performance (topic category)
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Jun 16, 2022
ghstack-source-id: 6fb39b3
Pull Request resolved: pytorch#79628
peterbell10 added a commit that referenced this pull request Jun 16, 2022
ghstack-source-id: 6a02461
Pull Request resolved: #79628
peterbell10 added a commit that referenced this pull request Jun 17, 2022
ghstack-source-id: 8e38580
Pull Request resolved: #79628
@peterbell10 peterbell10 requested a review from zasdfgbnm June 17, 2022 19:26
typename K, typename V, typename IndexType>
C10_LAUNCH_BOUNDS_1(block_size)
__global__ void
radixSortKVInPlace(at::cuda::detail::TensorInfo<K, IndexType> keys,
Collaborator:
Is this better than the DeviceSegmentedRadixSort that's already used for some configurations? It seems like it should be pretty similar.

Collaborator (Author):
Yes, it is significantly better. DeviceSegmentedRadixSort is used by launch_stable_sort_kernel, and if you look at the stable sort speedup graph you can see radixSortKVInPlace is at worst 1.5x faster and at best 25x faster, depending on the length of the dimension being sorted.

Collaborator:
Hm, interesting, seems like cub should have gotten it right

@ngimel (Collaborator) commented Jun 22, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):
Merge failed: this PR is too stale; the last push was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2543862234

pytorchmergebot added a commit that referenced this pull request Jun 22, 2022
ghstack-source-id: 03875d2
Pull Request resolved: #79628
peterbell10 added a commit that referenced this pull request Jun 22, 2022
ghstack-source-id: fa42980
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge -g

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):
Merge failed: refusing to merge as mandatory check(s) `pull` failed for rule `superuser`.
Raised by https://github.com/pytorch/pytorch/actions/runs/2544645176

peterbell10 added a commit that referenced this pull request Jun 23, 2022
ghstack-source-id: 7bafb13
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge -g

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@janeyx99 (Contributor):
@pytorchbot revert -m "Sorry, reverting as it breaks ROCm build on trunk https://hud.pytorch.org/pytorch/pytorch/commit/67a5d0bf40b10d8ebfb6b10b86f73583b9a8c461" -c nosignal

To get ROCm signal when you reopen this PR, please add the ciflow/trunk label!

@peterbell10 peterbell10 added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 23, 2022
@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here

@pytorchmergebot (Collaborator):
@peterbell10 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jun 23, 2022
@peterbell10 peterbell10 reopened this Jun 23, 2022
peterbell10 added a commit that referenced this pull request Jun 23, 2022
ghstack-source-id: 5944c4e
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/335/head branch June 27, 2022 14:17
facebook-github-bot pushed a commit that referenced this pull request Jun 27, 2022
…9628) (#79628)

Pull Request resolved: #79628
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8c0796e57fa7ad2ad588874168698c0ff1f76e67

Reviewed By: seemethere

Differential Revision: D37423665

Pulled By: seemethere

fbshipit-source-id: 881d5efd9ded6bbcc561d11ad5bac77f4e86cc99
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Pull Request resolved: pytorch#79628

Approved by: https://github.com/ngimel
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: performance (Issues related to performance, either of kernel code or framework glue), open source, release notes: cuda (release notes category), Reverted, topic: performance (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants