
Improve small sort performance on CUDA #79627

Closed

peterbell10 wants to merge 7 commits into gh/peterbell10/334/base from gh/peterbell10/334/head

Conversation

@peterbell10
Collaborator

@peterbell10 peterbell10 commented Jun 15, 2022

Stack from ghstack (oldest at bottom):

Currently, bitonicSortKVInPlace is written to sort one array per
block of threads. If that dimension happens to be very small
(<128 elements), this results in low thread occupancy.

Instead, this changes bitonicSortKVInPlace to operate with a 2d
block. Sorting happens along the x dimension, and the y dimension
is a fixed size batch.
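For context, `bitonicSortKVInPlace` implements a bitonic sorting network. The comparator pattern that such a kernel parallelizes can be sketched in plain Python; this is an illustration of the algorithm only, not the CUDA kernel itself. Every compare-and-swap within a stage is data-independent, which is what lets a block execute a whole stage with one thread per comparator:

```python
def bitonic_sort(a):
    """In-place bitonic sort of a power-of-two-length list (illustrative sketch)."""
    n = len(a)
    assert n & (n - 1) == 0, "bitonic networks need power-of-two sizes"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # comparator stride within this merge step
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # (i & k) selects whether this subsequence sorts ascending
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The inner `for` loop is the part a CUDA block runs in parallel; this PR additionally stacks several such sorts along `threadIdx.y`.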

Running under `nvprof` shows speedups for both of the sort sizes changed here, 32 and 128:

| Shape            | Master (ms) | This PR (ms) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1.4794      | 0.9683       | 1.53    |
| (128, 1024, 128) | 6.8891      | 6.5528       | 1.05    |

Binary size is also marginally reduced, with my torch_cuda.so getting 500KB smaller.

@facebook-github-bot
Contributor

facebook-github-bot commented Jun 15, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 3b24801 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@peterbell10 peterbell10 marked this pull request as ready for review June 16, 2022 16:56
@peterbell10 peterbell10 requested a review from ngimel June 16, 2022 16:56
@peterbell10 peterbell10 added labels Jun 16, 2022: module: performance (Issues related to performance, either of kernel code or framework glue), module: cuda (Related to torch.cuda, and CUDA support in general), release notes: cuda (release notes category), topic: performance (topic category)
peterbell10 added a commit to peterbell10/pytorch that referenced this pull request Jun 16, 2022
ghstack-source-id: c67c1a1
Pull Request resolved: pytorch#79627
@ngimel ngimel requested a review from zasdfgbnm June 16, 2022 17:20
@ngimel
Collaborator

ngimel commented Jun 16, 2022

What would happen if you don't have enough slices, that is, the whole array you are sorting is small, like 128x128? With your approach you'd launch fewer threadblocks and possibly regress performance? (Although the whole operation should be pretty fast anyway)

@peterbell10
Collaborator Author

Yes, there is absolutely a trade-off at small batch sizes. Though a batch of 1024 is enough for this to be a significant win for 32 and a wash for 128.

| Shape            | Master (us) | This PR (us) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1479.4      | 968.3        | 1.53    |
| (128, 1024, 128) | 6889.1      | 6552.8       | 1.05    |
| (1024, 32)       | 21.312      | 17.187       | 1.25    |
| (1024, 128)      | 79.680      | 79.608       | 1.00    |
| (128, 32)        | 7.7030      | 9.0610       | 0.85    |
| (128, 128)       | 15.041      | 18.735       | 0.80    |
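Back-of-the-envelope arithmetic makes the regression on the last two rows concrete. This is an illustrative sketch only: the batch size of 16 matches the `HANDLE_CASE(TYPE, A, 32, 16)` change in this diff, and treating it as fixed is a simplification.

```python
import math

def blocks_launched(num_slices, batch=1):
    """Grid size when each block sorts `batch` independent slices (illustrative)."""
    return math.ceil(num_slices / batch)

# Shape (128, 32): 128 independent slices of length 32.
before = blocks_launched(128)            # one slice per block -> 128 blocks
after = blocks_launched(128, batch=16)   # 16 slices per block -> 8 blocks
```

Eight blocks cannot fill a GPU with dozens of SMs, consistent with the 0.85 and 0.80 rows above, while at shape (1024, 32) the 64 blocks that remain are still enough to keep the device busy.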

```diff
 case 2: \
-  HANDLE_CASE(TYPE, A, 32); \
+  HANDLE_CASE(TYPE, A, 32, 16); \
   break; \
```
Collaborator Author

These batch sizes can be tweaked to find a different balance for small batch performance if desired. Or I could even re-purpose some of the binary size savings from #79628 to have separate kernels for small and large batch sizes.

Collaborator Author

Oh actually, it should be possible to run the same kernel with a different runtime value for block.y as long as it's not greater than the templated block_dim_y. I'll see how that works out.

@peterbell10
Collaborator Author

These new commits use the occupancy calculator functions to suggest the minimum grid size, then scale the batch size to achieve it. In my testing, this almost completely removed the slowdown on the small end.

| Shape            | Master (us) | This PR (us) | Speedup |
|------------------|-------------|--------------|---------|
| (128, 1024, 32)  | 1479.4      | 968.5        | 1.53    |
| (128, 1024, 128) | 6889.1      | 6553.3       | 1.05    |
| (1024, 32)       | 21.312      | 15.584       | 1.37    |
| (1024, 128)      | 79.680      | 79.136       | 1.01    |
| (128, 32)        | 7.7030      | 7.744        | 0.99    |
| (128, 128)       | 15.041      | 14.976       | 1.00    |
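The scaling step described above can be sketched as host-side logic. This is illustrative Python with hypothetical names, not the PR's CUDA host code; in CUDA the suggested minimum grid size would come from an occupancy API such as `cudaOccupancyMaxPotentialBlockSize`.

```python
def scale_batch_for_occupancy(num_slices, max_batch, min_grid_size):
    """Shrink the per-block batch (block.y) until enough blocks are launched
    to reach the occupancy-suggested minimum grid size, never exceeding the
    templated maximum batch. Names are illustrative, not from the PR."""
    batch = max_batch
    # ceil-divide to count blocks; halve the batch while the grid is too small
    while batch > 1 and (num_slices + batch - 1) // batch < min_grid_size:
        batch //= 2
    return batch
```

For example, with 128 slices, a templated maximum batch of 16, and a suggested minimum grid of 32 blocks, the batch drops to 4 so that 32 blocks are launched; with 1024 slices the full batch of 16 already yields 64 blocks and is kept.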

@ngimel
Collaborator

ngimel commented Jun 21, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot
Collaborator

Merge failed: this PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2538359620

@ngimel
Collaborator

ngimel commented Jun 21, 2022

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/peterbell10/334/orig onto refs/remotes/origin/master, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/79627)

pytorchmergebot added a commit that referenced this pull request Jun 21, 2022
ghstack-source-id: 87c7d9c
Pull Request resolved: #79627
@peterbell10
Collaborator Author

@pytorchbot merge -g

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@pytorchmergebot
Collaborator

@peterbell10 your PR has been successfully merged.

facebook-github-bot pushed a commit that referenced this pull request Jun 22, 2022

Pull Request resolved: #79627

Approved by: https://github.com/ngimel

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/61305cd638b6fcd73a0b66b4cde7014fecb9e8ce

Reviewed By: atalman

Differential Revision: D37343983

Pulled By: atalman

fbshipit-source-id: b842735b6e621457a57d314d7d327f2003c30a75
@facebook-github-bot facebook-github-bot deleted the gh/peterbell10/334/head branch June 25, 2022 14:16
miladm pushed a commit to miladm/pytorch that referenced this pull request Jun 27, 2022

Pull Request resolved: pytorch#79627

Approved by: https://github.com/ngimel
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 25, 2026

Pull Request resolved: pytorch#79627

Approved by: https://github.com/ngimel

Labels

cla signed, Merged, module: cuda (Related to torch.cuda, and CUDA support in general), module: performance (Issues related to performance, either of kernel code or framework glue), open source, release notes: cuda (release notes category), topic: performance (topic category)

5 participants