Improve small sort performance on CUDA#79627
peterbell10 wants to merge 7 commits into gh/peterbell10/334/base
Conversation
Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If that dimension happens to be very small (<128 elements), this results in low thread occupancy. Instead, this changes `bitonicSortKVInPlace` to operate with a 2d block. Sorting happens along the x dimension, and the y dimension is a fixed size batch. [ghstack-poisoned]
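For intuition, here is a minimal CPU sketch (not the PR's CUDA kernel) of what a batched, in-place bitonic sort computes. The row index plays the role of the kernel's y (batch) dimension, and the position within a row plays the role of the x (sort) dimension; the function name is a hypothetical illustration.

```cpp
#include <utility>
#include <vector>

// Sort each of `batch` rows of length n (n must be a power of two)
// independently, using the classic bitonic sorting network.
void bitonic_sort_batch(std::vector<float>& data, int batch, int n) {
  for (int b = 0; b < batch; ++b) {
    float* row = data.data() + static_cast<long>(b) * n;
    // k is the size of the bitonic sequences being merged,
    // j is the compare-exchange stride within a merge step.
    for (int k = 2; k <= n; k <<= 1) {
      for (int j = k >> 1; j > 0; j >>= 1) {
        for (int i = 0; i < n; ++i) {
          int partner = i ^ j;
          if (partner > i) {
            // Direction alternates by k-sized block: ascending when (i & k) == 0.
            bool ascending = (i & k) == 0;
            if (ascending == (row[i] > row[partner])) {
              std::swap(row[i], row[partner]);
            }
          }
        }
      }
    }
  }
}
```

On the GPU, the inner compare-exchange loop is what runs in parallel across threadIdx.x, which is why a tiny sort dimension starves a one-slice-per-block kernel of threads.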
Running under `nvprof` shows speedups for both 32 and 128, which are the sizes changed here:

| Shape            | Master (ms) | This PR (ms) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1.4794      | 0.9683       | 1.53    |
| (128, 1024, 128) | 6.8891      | 6.5528       | 1.05    |

Binary size is also marginally reduced, with my `torch_cuda.so` getting 500KB smaller.
ghstack-source-id: c67c1a1
Pull Request resolved: pytorch#79627
What would happen if you don't have enough slices, that is, the whole array you are sorting is small, like 128x128? With your approach you'd launch fewer threadblocks and possibly regress performance? (Although the whole operation should be pretty fast anyway)
Yes, there is absolutely a trade-off at small batch sizes. Though a batch of 1024 is enough for this to be a significant win for 32 and a wash for 128.
```diff
 case 2: \
-  HANDLE_CASE(TYPE, A, 32); \
+  HANDLE_CASE(TYPE, A, 32, 16); \
   break; \
```
These batch sizes can be tweaked to find a different balance for small batch performance if desired. Or I could even re-purpose some of the binary size savings from #79628 to have separate kernels for small and large batch sizes.
Oh actually, it should be possible to run the same kernel with a different runtime value for block.y as long as it's not greater than the templated block_dim_y. I'll see how that works out.
These new commits use the occupancy calculator functions to suggest the minimum grid size, then scale the batch size to achieve it. In my testing, this almost completely removed the slowdown on the small end.
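The batch-scaling decision can be modeled as plain arithmetic. The sketch below is a hypothetical illustration of the heuristic described above, not the PR's actual code: the real kernel launch would obtain `min_grid_size` from CUDA's `cudaOccupancyMaxPotentialBlockSize`, and `max_batch` corresponds to the templated `block_dim_y` upper bound that the runtime batch must not exceed.

```cpp
// Shrink the per-block batch (block.y) until the resulting grid is at
// least the occupancy calculator's suggested minimum, so small inputs
// still launch enough blocks to fill the device.
int pick_batch_size(long num_slices, int min_grid_size, int max_batch) {
  int batch = max_batch;
  // Halve the batch while the grid (ceil(num_slices / batch) blocks)
  // would be smaller than the suggested minimum.
  while (batch > 1 && (num_slices + batch - 1) / batch < min_grid_size) {
    batch /= 2;
  }
  return batch;
}
```

With many slices the templated maximum batch is kept; with few slices the batch shrinks toward 1, trading per-block work for more blocks.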
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here

Merge failed: this PR is too stale; the last push was more than 3 days ago. Please rebase and try again.

@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here
Successfully rebased |
ghstack-source-id: 87c7d9c
Pull Request resolved: #79627
@pytorchbot merge -g

@pytorchbot successfully started a merge job. Check the current status here
@peterbell10 your PR has been successfully merged.
Summary: Changes `bitonicSortKVInPlace` to operate with a 2d block so that small sort dimensions no longer suffer low thread occupancy.
Pull Request resolved: #79627
Approved by: https://github.com/ngimel
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/61305cd638b6fcd73a0b66b4cde7014fecb9e8ce
Reviewed By: atalman
Differential Revision: D37343983
Pulled By: atalman
fbshipit-source-id: b842735b6e621457a57d314d7d327f2003c30a75
Stack from ghstack (oldest at bottom):

Currently, `bitonicSortKVInPlace` is written to sort one array per block of threads. If that dimension happens to be very small (<128 elements), this results in low thread occupancy. Instead, this changes `bitonicSortKVInPlace` to operate with a 2d block. Sorting happens along the x dimension, and the y dimension is a fixed size batch.

Running under `nvprof` shows speedups for both 32 and 128, which are the sizes changed here:

| Shape            | Master (ms) | This PR (ms) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1.4794      | 0.9683       | 1.53    |
| (128, 1024, 128) | 6.8891      | 6.5528       | 1.05    |

Binary size is also marginally reduced, with my `torch_cuda.so` getting 500KB smaller.