Improve small sort performance on CUDA#79627
peterbell10 wants to merge 7 commits into gh/peterbell10/334/base
Conversation
Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If that dimension happens to be very small (<128 elements), this results in low thread occupancy. Instead, this changes `bitonicSortKVInPlace` to operate with a 2d block. Sorting happens along the x dimension, and the y dimension is a fixed size batch. [ghstack-poisoned]
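For intuition, here is a minimal CPU sketch (not the PR's CUDA kernel) of what a batched, in-place bitonic sort computes. The row index plays the role of the kernel's y (batch) dimension, and the position within a row plays the role of the x (sort) dimension; the function name is a hypothetical illustration.

```cpp
#include <utility>
#include <vector>

// Sort each of `batch` rows of length n (n must be a power of two)
// independently, using the classic bitonic sorting network.
void bitonic_sort_batch(std::vector<float>& data, int batch, int n) {
  for (int b = 0; b < batch; ++b) {
    float* row = data.data() + static_cast<long>(b) * n;
    // k is the size of the bitonic sequences being merged,
    // j is the compare-exchange stride within a merge step.
    for (int k = 2; k <= n; k <<= 1) {
      for (int j = k >> 1; j > 0; j >>= 1) {
        for (int i = 0; i < n; ++i) {
          int partner = i ^ j;
          if (partner > i) {
            // Direction alternates by k-sized block: ascending when (i & k) == 0.
            bool ascending = (i & k) == 0;
            if (ascending == (row[i] > row[partner])) {
              std::swap(row[i], row[partner]);
            }
          }
        }
      }
    }
  }
}
```

On the GPU, the inner compare-exchange loop is what runs in parallel across threadIdx.x, which is why a tiny sort dimension starves a one-slice-per-block kernel of threads.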
Running under `nvprof` shows speedups for both 32 and 128, which are the sizes changed here:

| Shape            | Master (ms) | This PR (ms) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1.4794      | 0.9683       | 1.53    |
| (128, 1024, 128) | 6.8891      | 6.5528       | 1.05    |

Binary size is also marginally reduced, with my `torch_cuda.so` getting 500KB smaller.
ghstack-source-id: c67c1a1
Pull Request resolved: pytorch#79627
What would happen if you don't have enough slices, that is, the whole array you are sorting is small, like 128x128? With your approach you'd launch fewer threadblocks and possibly regress performance? (Although the whole operation should be pretty fast anyway)
Yes, there is absolutely a trade-off at small batch sizes. Though a batch of 1024 is enough for this to be a significant win for 32 and a wash for 128.
```diff
 case 2: \
-  HANDLE_CASE(TYPE, A, 32); \
+  HANDLE_CASE(TYPE, A, 32, 16); \
   break; \
```
These batch sizes can be tweaked to find a different balance for small batch performance if desired. Or I could even re-purpose some of the binary size savings from #79628 to have separate kernels for small and large batch sizes.
Oh actually, it should be possible to run the same kernel with a different runtime value for block.y as long as it's not greater than the templated block_dim_y. I'll see how that works out.
These new commits use the occupancy calculator functions to suggest the minimum grid size, then scale the batch size to achieve it. In my testing, this almost completely removed the slowdown on the small end.
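The batch-scaling decision can be modeled as plain arithmetic. The sketch below is a hypothetical illustration of the heuristic described above, not the PR's actual code: the real kernel launch would obtain `min_grid_size` from CUDA's `cudaOccupancyMaxPotentialBlockSize`, and `max_batch` corresponds to the templated `block_dim_y` upper bound that the runtime batch must not exceed.

```cpp
// Shrink the per-block batch (block.y) until the resulting grid is at
// least the occupancy calculator's suggested minimum, so small inputs
// still launch enough blocks to fill the device.
int pick_batch_size(long num_slices, int min_grid_size, int max_batch) {
  int batch = max_batch;
  // Halve the batch while the grid (ceil(num_slices / batch) blocks)
  // would be smaller than the suggested minimum.
  while (batch > 1 && (num_slices + batch - 1) / batch < min_grid_size) {
    batch /= 2;
  }
  return batch;
}
```

With many slices the templated maximum batch is kept; with few slices the batch shrinks toward 1, trading per-block work for more blocks.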
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here

Merge failed: this PR is too stale; the last push was more than 3 days ago. Please rebase and try again.

@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased |
ghstack-source-id: 87c7d9c
Pull Request resolved: #79627
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here
@peterbell10 your PR has been successfully merged.
Summary: Changes `bitonicSortKVInPlace` to operate with a 2d block so that small sort dimensions no longer suffer low thread occupancy.
Pull Request resolved: #79627
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/61305cd638b6fcd73a0b66b4cde7014fecb9e8ce
Reviewed By: atalman
Differential Revision: D37343983
Pulled By: atalman
fbshipit-source-id: b842735b6e621457a57d314d7d327f2003c30a75
Stack from ghstack (oldest at bottom):

Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If that dimension happens to be very small (<128 elements), this results in low thread occupancy. Instead, this changes `bitonicSortKVInPlace` to operate with a 2d block. Sorting happens along the x dimension, and the y dimension is a fixed size batch.

Running under `nvprof` shows speedups for both 32 and 128, which are the sizes changed here:

| Shape            | Master (ms) | This PR (ms) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1.4794      | 0.9683       | 1.53    |
| (128, 1024, 128) | 6.8891      | 6.5528       | 1.05    |

Binary size is also marginally reduced, with my `torch_cuda.so` getting 500KB smaller.