[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config #135397
hongxiayang wants to merge 1 commit into pytorch:main
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135397.
Note: links to docs will display an error until the docs builds have been completed. ✅ You can merge normally (2 unrelated failures). As of commit 6cd7c04 with merge base de74aaf. FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @malfet: can you help merge this PR? The two test failures are not related. Thank you!
@pytorchbot merge -f "Lint is green" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchmergebot cherry-pick --help |
❌ 🤖 pytorchbot command failed.
@pytorchmergebot cherry-pick --onto release/2.5 -c critical |
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)

Fixes #132964

This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform. With a larger value, the reduction uses fewer thread blocks, which improves performance.

Test: Tested on MI300X and H100. MI300X perf improved from ~1690 GB/s to 3205 GB/s for the test case and is now slightly better than H100 (3136 GB/s). Also tested with other tensor sizes and saw perf improvements.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')
ms = do_bench(lambda: x.sum(dim=-1))
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)
time_s = ms / 1000
bw_per_second = bandwidth_gbyte / time_s
print(bw_per_second)
```

Co-author: @carlobertolli
Pull Request resolved: #135397
Approved by: https://github.com/eqy, https://github.com/malfet
(cherry picked from commit eb38ee2)
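To see why a larger per-thread cap means fewer thread blocks, a back-of-the-envelope sketch of the grid-size arithmetic (the thread counts and caps below are illustrative, not the actual setReduceConfig() values):

```python
import math

def num_blocks(numel, threads_per_block, values_per_thread):
    # Each block reduces threads_per_block * values_per_thread elements,
    # so a larger per-thread cap directly shrinks the grid.
    return math.ceil(numel / (threads_per_block * values_per_thread))

n = 2**30  # element count from the benchmark above
print(num_blocks(n, 512, 16))   # small cap  -> 131072 blocks
print(num_blocks(n, 512, 256))  # larger cap -> 8192 blocks
```

Fewer blocks means less inter-block combination work and scheduling overhead, which is where the bandwidth win on MI300X comes from.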
Cherry picking #135397: the cherry pick PR is at #135624, and it is recommended to link a critical cherry pick PR with an issue. Details for the Dev Infra team: raised by workflow job.
Follow-up to pytorch#135397. AMD GPUs perform better with fewer thread blocks, so increase min_values_per_thread as well. This helped improve [CvT](https://github.com/facebookresearch/FAMBench/tree/main/benchmarks/cvt) benchmark performance on MI300X. Co-author: @carlobertolli
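Why raising the minimum (not just the maximum) matters can be sketched as a clamp: if the per-thread workload heuristic is bounded to a range, lifting the lower bound forces fewer, fatter blocks even for shapes where the heuristic would pick a small workload. A hypothetical illustration, not the actual Reduce.cuh logic:

```python
def effective_values_per_thread(heuristic, min_vpt, max_vpt):
    # Hypothetical sketch: clamp a heuristic per-thread workload into
    # [min_vpt, max_vpt]. Raising min_vpt lifts small heuristic choices,
    # which reduces the block count for those shapes.
    return max(min_vpt, min(heuristic, max_vpt))

print(effective_values_per_thread(4, 16, 256))    # min lifts small choices -> 16
print(effective_values_per_thread(512, 16, 256))  # max caps large choices -> 256
```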
Thanks @hongxiayang! I can confirm that this fixes it.
Fixes #132964
This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform.
With a larger value, the reduction uses fewer thread blocks, which improves performance for large tensors.
Test:
Tested on MI300X and H100: MI300X perf improved from ~1690 GB/s to 3205 GB/s for the test case and is now slightly better than H100 (3136 GB/s).
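As a sanity check on the reported numbers, the implied kernel time can be derived from the tensor size and bandwidth, assuming the reduction reads the 2**30-element float32 tensor exactly once:

```python
# Back-of-the-envelope: implied kernel time at each reported bandwidth
# for the 2**30-element float32 tensor used in the benchmark.
numel = 2**30
itemsize = 4  # bytes per float32
gbytes = numel * itemsize / 10**9  # ~4.295 GB of memory traffic

for name, bw_gbs in [("MI300X before", 1690.0),
                     ("MI300X after", 3205.0),
                     ("H100", 3136.0)]:
    ms = gbytes / bw_gbs * 1000  # implied kernel time in milliseconds
    print(f"{name}: {ms:.2f} ms")
```

This mirrors the bandwidth formula in the benchmark snippet (bytes / 10**9 divided by seconds), just solved for time instead of bandwidth.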
Co-author: @carlobertolli
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo