[ROCm] TopK optimizations for AMD GPUs#146387

Closed
apakbin wants to merge 7 commits into pytorch:main from apakbin:topk_rocm_tune

Conversation

Contributor

@apakbin apakbin commented Feb 4, 2025

TopK on ROCm performs better on the test suite with the default config.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot added the "release notes: cuda" (release notes category) label Feb 4, 2025

pytorch-bot bot commented Feb 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146387

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ff7648f with merge base 71855a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@malfet malfet left a comment

Sure, though it would be good to paste an example of benchmark run and perf observed before and after the change

@apakbin apakbin marked this pull request as draft February 4, 2025 17:41
Contributor Author

apakbin commented Feb 4, 2025

> Sure, though it would be good to paste an example of benchmark run and perf observed before and after the change

Thank you, will add that as well. This PR is a work in progress.

Contributor Author

apakbin commented Feb 11, 2025

A comparison of the ROCm config versus the default config, which motivated changing the calculations of regs_per_mp and max_blocks_per_mp:
[benchmark plot: default_config]

@apakbin apakbin changed the title from "AMD: reverting to default config for better performance." to "TOPK: AMD-specific optimizations: calculations of regs_per_mp and max_blocks_per_mp + heuristic" Feb 11, 2025
Contributor Author

apakbin commented Feb 11, 2025

The reason the heuristic was changed: we first observed that the sort path is better for one-dimensional data:
[latency plot: sort vs. native]

Looking more closely at the one-dimensional case, we observed:
[latency plot: 1-dim data]

We then arrived at the heuristic: choose sort if the tensor is one-dimensional and has at least 10,000 elements.
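The heuristic above can be sketched as a small standalone predicate. This is a hypothetical illustration, not the actual ATen code: `sizes` stands in for `self.sizes()`, the function name is made up, and the 10,000 threshold follows the experiments described in this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the heuristic discussed above, outside of ATen:
// prefer the sort-based path when exactly one dimension is non-trivial
// (size > 1) and the tensor is large enough for sort to win.
bool topk_prefers_sort(const std::vector<int64_t>& sizes) {
  std::size_t n_multidims = 0; // dimensions with size > 1
  int64_t numel = 1;           // total number of elements
  for (int64_t s : sizes) {
    n_multidims += (s > 1);
    numel *= s;
  }
  return n_multidims == 1 && numel >= 10000;
}
```

For example, a shape of {20000} would take the sort path, while {100, 200} (two non-trivial dimensions) or {500} (too small) would not.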

@pruthvistony pruthvistony added the labels "topic: not user facing" (topic category), "rocm" (This tag is for PRs from ROCm team), "rocm priority" (high priority ROCm PRs from performance or other aspects), "ciflow/rocm" (Trigger "default" config CI on ROCm), and "ciflow/inductor-rocm" (Trigger "inductor" config CI on ROCm), and removed the "release notes: cuda" (release notes category) label Feb 11, 2025
@apakbin apakbin marked this pull request as ready for review February 12, 2025 16:13
@apakbin apakbin changed the title from "TOPK: AMD-specific optimizations: calculations of regs_per_mp and max_blocks_per_mp + heuristic" to "[ROCm] TopK optimizations for AMD GPUs" Feb 12, 2025
@pytorch-bot pytorch-bot bot added the "module: rocm" (AMD GPU support for Pytorch) label Feb 12, 2025
@pruthvistony pruthvistony requested a review from ngimel February 13, 2025 00:35
@pruthvistony
Collaborator

@ngimel,
Can you please review this PR?

Comment on lines +38 to +42
size_t n_multidims = 0; // number of dimensions with dimensionality more than one
for (int s : self.sizes()) {
    n_multidims += (s > 1);
}
return (n_multidims == 1 && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
Collaborator

I think you want self.numel() == self.sizes(dim)

Contributor Author

Thank you @ngimel for your suggestion. Will change it to "return (self.numel() == self.sizes(dim) && self.numel() >= 10000);"
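To illustrate the difference the reviewer is pointing at, here is a hypothetical standalone sketch (not ATen code; function names are made up): counting non-trivial dimensions ignores which dimension topk reduces over, whereas numel() == size(dim) additionally requires that all elements lie along the reduction dimension.

```cpp
#include <cstdint>
#include <vector>

// Original predicate: exactly one dimension has size > 1,
// regardless of which dimension topk reduces over.
bool one_nontrivial_dim(const std::vector<int64_t>& sizes) {
  std::size_t n_multidims = 0;
  for (int64_t s : sizes) n_multidims += (s > 1);
  return n_multidims == 1;
}

// Reviewer's predicate: every element lies along the reduction
// dimension `dim`, i.e. numel() == size(dim).
bool all_elems_along_dim(const std::vector<int64_t>& sizes, std::size_t dim) {
  int64_t numel = 1;
  for (int64_t s : sizes) numel *= s;
  return numel == sizes[dim];
}
```

For a shape of {1, 20000} with dim = 0, the first predicate holds, but the second does not: the reduction dimension has size 1, so the sort path would gain nothing there.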

apakbin added a commit to apakbin/pytorch that referenced this pull request Feb 14, 2025
Contributor Author

apakbin commented Feb 14, 2025

@pytorchbot merge


pytorch-bot bot commented Feb 14, 2025

Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those, or skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@pruthvistony
Collaborator

@apakbin
Waiting for the PR to finish the CI run.

Contributor Author

apakbin commented Feb 14, 2025

@pytorchbot rebase

@pruthvistony
Collaborator

@apakbin,
The UTs are failing on the post-merge job, which runs on MI300, compared to the pre-merge job, which runs on MI200. Please debug and check what could be causing this.

cc @jeffdaily @jithunnair-amd

bool disable_sort_for_topk();
bool should_use_sort(const Tensor& self, int64_t dim) {
#if defined(USE_ROCM)
return (self.numel() == self.size(dim) && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
Collaborator

Suggested change
return (self.numel() == self.size(dim) && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
return (self.numel() >= 10000 && self.numel() == self.size(dim)); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387

Contributor Author

Thank you, @ngimel.

Contributor Author

apakbin commented Feb 18, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team
Raised by workflow job

Failing merge rule: Core Maintainers

Contributor Author

apakbin commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

apakbin added a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
TopK performance on ROCm performs better on the test suite with the default config.

Pull Request resolved: pytorch#146387
Approved by: https://github.com/malfet, https://github.com/ngimel
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
apakbin added a commit to ROCm/pytorch that referenced this pull request Feb 26, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Feb 26, 2025

Labels

ciflow/inductor-rocm: Trigger "inductor" config CI on ROCm
ciflow/rocm: Trigger "default" config CI on ROCm
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
module: rocm: AMD GPU support for Pytorch
open source
rocm priority: high priority ROCm PRs from performance or other aspects
rocm: This tag is for PRs from ROCm team
topic: not user facing: topic category

6 participants