Use cub::BlockRadixSort to improve medium length sort performance#79628

Closed
peterbell10 wants to merge 10 commits intogh/peterbell10/335/basefrom
gh/peterbell10/335/head

Conversation

@peterbell10 (Collaborator) commented Jun 15, 2022

Stack from ghstack (oldest at bottom):

In my testing, replacing the custom bitonic sort with cub's block-level
radix sort primitives improves overall sort performance by up to 3x,
depending on input length. The replacement also has the benefit of being
a stable sort, so we get up to a 25x speedup for small stable sorts and
around a 2x speedup at the largest supported size.

In testing, the radix sort benefits a lot from having more items per
thread, which means it breaks down a bit at very small sizes. So, for
the 32-item sort I've left the bitonic sorting algorithm in place.

Binary size is also reduced by this change, because I have moved the
`descending` branch into the kernel itself, which I found not to affect
performance. The result is a 1.9 MB decrease in `torch_cuda.so` on
my build for one CUDA architecture.

peterbell10 added a commit that referenced this pull request Jun 15, 2022
ghstack-source-id: 7261a56
Pull Request resolved: #79628
@facebook-github-bot (Contributor) commented Jun 15, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 0ddb0e4 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@peterbell10 (Collaborator, Author) commented Jun 16, 2022

Here are my detailed benchmark results for all sizes in range(0, 4097, 4), expressed as speedup relative to the old implementation. All results come from an RTX 2060.

Unstable sort varies from a 1x to 3x speedup:
[speedup graph]

Stable sort varies from a 1x to 25x speedup (note that the 1x is for the 0-length sort, included as a sanity check):
[speedup graph]

@peterbell10 peterbell10 marked this pull request as ready for review June 16, 2022 16:56
@peterbell10 peterbell10 requested a review from ngimel June 16, 2022 16:56
@peterbell10 peterbell10 added labels Jun 16, 2022: module: performance (Issues related to performance, either of kernel code or framework glue), module: cuda (Related to torch.cuda, and CUDA support in general), release notes: cuda (release notes category), topic: performance (topic category)
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Jun 16, 2022
ghstack-source-id: 6fb39b3
Pull Request resolved: pytorch#79628
peterbell10 added a commit that referenced this pull request Jun 16, 2022
ghstack-source-id: 6a02461
Pull Request resolved: #79628
peterbell10 added a commit that referenced this pull request Jun 17, 2022
ghstack-source-id: 8e38580
Pull Request resolved: #79628
@peterbell10 peterbell10 requested a review from zasdfgbnm June 17, 2022 19:26
typename K, typename V, typename IndexType>
C10_LAUNCH_BOUNDS_1(block_size)
__global__ void
radixSortKVInPlace(at::cuda::detail::TensorInfo<K, IndexType> keys,
Collaborator:
Is this better than the DeviceSegmentedRadixSort that's already used for some configurations? It seems like it should be pretty similar.

Collaborator (Author):
Yes, it is significantly better. DeviceSegmentedRadixSort is used by launch_stable_sort_kernel, and if you look at the stable sort speedup graph you can see radixSortKVInPlace is at worst 1.5x faster and at best 25x faster, depending on the length of the dimension being sorted.

Collaborator:
Hm, interesting, seems like cub should have gotten it right

@ngimel (Collaborator) commented Jun 22, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):
Merge failed: this PR is too stale; the last push was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2543862234

pytorchmergebot added a commit that referenced this pull request Jun 22, 2022
ghstack-source-id: 03875d2
Pull Request resolved: #79628
peterbell10 added a commit that referenced this pull request Jun 22, 2022
ghstack-source-id: fa42980
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge -g

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot (Collaborator):
Merge failed: refusing to merge as mandatory check(s) `pull` failed for rule `superuser`.
Raised by https://github.com/pytorch/pytorch/actions/runs/2544645176

peterbell10 added a commit that referenced this pull request Jun 23, 2022
ghstack-source-id: 7bafb13
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge -g

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@janeyx99 (Contributor):
@pytorchbot revert -m "Sorry, reverting as it breaks ROCm build on trunk https://hud.pytorch.org/pytorch/pytorch/commit/67a5d0bf40b10d8ebfb6b10b86f73583b9a8c461" -c nosignal

To get ROCm signal when you reopen this PR, please add the ciflow/trunk label!

@peterbell10 peterbell10 added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 23, 2022
@pytorchmergebot (Collaborator):
@pytorchbot successfully started a revert job. Check the current status here

@pytorchmergebot (Collaborator):
@peterbell10 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jun 23, 2022
@peterbell10 peterbell10 reopened this Jun 23, 2022
peterbell10 added a commit that referenced this pull request Jun 23, 2022
ghstack-source-id: 5944c4e
Pull Request resolved: #79628
@peterbell10 (Collaborator, Author):
@pytorchbot merge

@pytorchmergebot (Collaborator):
@pytorchbot successfully started a merge job. Check the current status here

@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/335/head branch June 27, 2022 14:17
facebook-github-bot pushed a commit that referenced this pull request Jun 27, 2022
…9628) (#79628)

Pull Request resolved: #79628
Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/8c0796e57fa7ad2ad588874168698c0ff1f76e67

Reviewed By: seemethere

Differential Revision: D37423665

Pulled By: seemethere

fbshipit-source-id: 881d5efd9ded6bbcc561d11ad5bac77f4e86cc99
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
Pull Request resolved: pytorch#79628

Approved by: https://github.com/ngimel
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), cla signed, Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: performance (Issues related to performance, either of kernel code or framework glue), open source, release notes: cuda (release notes category), Reverted, topic: performance (topic category)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants