Removes threadfence from topk kernel to improve AMD performance#145536
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145536
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit 7e0d3ec with merge base 0d28188:
- FLAKY - The following job failed but was likely due to flakiness present on trunk.
- BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Are these comments still in sync with, e.g., `auto ks_to_find_buffer = allocator.allocate(2 * numInputSlices * sizeof(uint32_t));` below?
In the kernel, the size of `ks_to_find_in` is still `num_slices`, so the kernel comment is correct. The allocation is now twice that size because we cannot update it in place.
@ngimel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…rch#145536)

Also marginally improves cuda perf.

Pull Request resolved: pytorch#145536
Approved by: https://github.com/eqy
…to eliminate redundant memory access (#164459)

# TL;DR
This PR removes the `torch.topk` regression introduced in torch 2.7.0 and delivers much better performance for large inputs. The table below reports execution times on an H20 for various input sizes with float32 data, extracting the top-100 values. The results indicate that this PR restores and improves performance, especially on large inputs.

| Input Shape | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1M) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |

# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 went from **36 ms** to **1.6 s**. Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](#145536).

# Analysis
`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:

1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.

Before [PR #145536](#145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were complete before the reduction. In PR #145536, the global histogram computation was merged with the subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:

- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.

# This PR
To address this inefficiency, we introduce the following optimizations:

1. **Dedicated kernel**: Refactor the global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up the local histogram reads.

# Performance
We benchmarked `torch.topk` on an NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.

| Input Shape | torch 2.6.0 (ms) | torch 2.8.0 (ms) | 2.8.0 + this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B) | 36.6 | 1564.1 | 25.6 |
| (1, 100M) | 3.56 | 17.4 | 2.54 |
| (1, 1M) | 0.135 | 0.145 | 0.098 |
| (512, 128000) | 1.33 | 1.33 | 1.32 |
| (8192, 128000) | 19.6 | 19.6 | 19.4 |

We have also verified the correctness of this PR with a variety of inputs.

Pull Request resolved: #164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007