Use cub::BlockRadixSort to improve medium length sort performance #79628
peterbell10 wants to merge 10 commits into gh/peterbell10/335/base from
Conversation
In my testing, replacing the custom bitonic sort with cub's block-level radix sort primitives improves overall sort performance by up to 3x, depending on input length. This also benefits from being a stable sort, so we get up to a 25x speedup for small stable sorts and around a 2x speedup at the largest supported size. In testing, the radix sort benefits a lot from having more items per thread, so it does break down at very small sizes. For the 32-item sort I've therefore left the bitonic sorting algorithm in place.
🔗 Helpful links
✅ No Failures (0 Pending) as of commit 0ddb0e4 (more details on the Dr. CI page). This comment was automatically generated by Dr. CI.
Binary size is also reduced by this change, because I have moved the `descending` branch into the kernel itself, which I found not to affect performance. The result is a 1.9 MB decrease in `torch_cuda.so` on my build for one cuda architecture.
typename K, typename V, typename IndexType>
C10_LAUNCH_BOUNDS_1(block_size)
__global__ void
radixSortKVInPlace(at::cuda::detail::TensorInfo<K, IndexType> keys,
Is this better than the DeviceSegmentedRadixSort that's already used for some configurations? It seems like it should be pretty similar.
Yes, it is significantly better. DeviceSegmentedRadixSort is used by launch_stable_sort_kernel, and if you look at the stable sort speedup graph you can see radixSortKVInPlace is at worst 1.5x faster and at best 25x faster, depending on the length of the dimension being sorted.
Hm, interesting, seems like cub should have gotten it right
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Merge failed due to: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here.

Merge failed due to: Refusing to merge as mandatory check(s) pull failed for rule superuser.
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here.

@pytorchbot revert -m "Sorry, reverting as it breaks ROCm build on trunk https://hud.pytorch.org/pytorch/pytorch/commit/67a5d0bf40b10d8ebfb6b10b86f73583b9a8c461" -c nosignal

To get rocm signal when you reopen this PR, please add the ciflow/trunk label!

@pytorchbot successfully started a revert job. Check the current status here.

@peterbell10 your PR has been successfully reverted.
…ance" This reverts commit 67a5d0b. Reverted #79628 on behalf of https://github.com/janeyx99 due to Sorry, reverting as it breaks ROCm build on trunk https://hud.pytorch.org/pytorch/pytorch/commit/67a5d0bf40b10d8ebfb6b10b86f73583b9a8c461
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.
…9628) (#79628)
Summary: In my testing, replacing the custom bitonic sort with cub's block-level radix sort primitives improves overall sort performance by up to 3x, depending on input length. This also benefits from being a stable sort, so we get up to a 25x speedup for small stable sorts and around a 2x speedup at the largest supported size. In testing, the radix sort benefits a lot from having more items per thread, meaning it breaks down a bit at very small sizes. So, for the 32-item sort I've left the bitonic sorting algorithm in place. Binary size is also reduced by this change, because I have moved the `descending` branch into the kernel itself, which I found not to affect performance. The result is a 1.9 MB decrease in `torch_cuda.so` on my build for one cuda architecture.
Pull Request resolved: #79628
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8c0796e57fa7ad2ad588874168698c0ff1f76e67
Reviewed By: seemethere
Differential Revision: D37423665
Pulled By: seemethere
fbshipit-source-id: 881d5efd9ded6bbcc561d11ad5bac77f4e86cc99


Stack from ghstack (oldest at bottom):