[ROCm] TopK optimizations for AMD GPUs#146387

Closed
apakbin wants to merge 7 commits into pytorch:main from apakbin:topk_rocm_tune

Conversation

Contributor

@apakbin apakbin commented Feb 4, 2025

TopK on ROCm performs better on the test suite with the default config.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot pytorch-bot bot added the "release notes: cuda" (release notes category) label Feb 4, 2025

pytorch-bot bot commented Feb 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146387

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ff7648f with merge base 71855a1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@malfet malfet left a comment

Sure, though it would be good to paste an example of benchmark run and perf observed before and after the change

@apakbin apakbin marked this pull request as draft February 4, 2025 17:41
Contributor Author

apakbin commented Feb 4, 2025

> Sure, though it would be good to paste an example of benchmark run and perf observed before and after the change

Thank you, will add that as well. This PR is a work in progress.

Contributor Author

apakbin commented Feb 11, 2025

A comparison of the ROCm config versus the default config, which motivated changing the calculations of regs_per_mp and max_blocks_per_mp:
[benchmark plot: default_config]

@apakbin apakbin changed the title from "AMD: reverting to default config for better performance." to "TOPK: AMD-specific optimizations: calculations of regs_per_mp and max_blocks_per_mp + heuristic" Feb 11, 2025
Contributor Author

apakbin commented Feb 11, 2025

The reason the heuristic was changed: we first observed that the sort path is better for one-dimensional data:
[latency plot: sort vs. native]

Looking more closely at the one-dimensional case, we observed:
[latency plot: 1-dim data]

We then arrived at the heuristic: choose sort if the tensor is one-dimensional and has at least 10,000 elements.
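The heuristic above can be sketched as a small standalone predicate. This is a hypothetical illustration, not the actual ATen code: `sizes` stands in for `self.sizes()`, the function name is made up, and the 10,000 threshold follows the experiments described in this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the heuristic discussed above, outside of ATen:
// prefer the sort-based path when exactly one dimension is non-trivial
// (size > 1) and the tensor is large enough for sort to win.
bool topk_prefers_sort(const std::vector<int64_t>& sizes) {
  std::size_t n_multidims = 0; // dimensions with size > 1
  int64_t numel = 1;           // total number of elements
  for (int64_t s : sizes) {
    n_multidims += (s > 1);
    numel *= s;
  }
  return n_multidims == 1 && numel >= 10000;
}
```

For example, a shape of {20000} would take the sort path, while {100, 200} (two non-trivial dimensions) or {500} (too small) would not.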

@pruthvistony pruthvistony added the labels "topic: not user facing" (topic category), "rocm" (This tag is for PRs from ROCm team), "rocm priority" (high priority ROCm PRs from performance or other aspects), "ciflow/rocm" (Trigger "default" config CI on ROCm), and "ciflow/inductor-rocm" (Trigger "inductor" config CI on ROCm), and removed the "release notes: cuda" (release notes category) label Feb 11, 2025
@apakbin apakbin marked this pull request as ready for review February 12, 2025 16:13
@apakbin apakbin changed the title from "TOPK: AMD-specific optimizations: calculations of regs_per_mp and max_blocks_per_mp + heuristic" to "[ROCm] TopK optimizations for AMD GPUs" Feb 12, 2025
@pytorch-bot pytorch-bot bot added the "module: rocm" (AMD GPU support for Pytorch) label Feb 12, 2025
@pruthvistony pruthvistony requested a review from ngimel February 13, 2025 00:35
@pruthvistony
Collaborator

@ngimel,
Can you please review this PR?

Comment on lines +38 to +42
size_t n_multidims = 0; // number of dimensions with dimensionality more than one
for (int s : self.sizes()) {
    n_multidims += (s > 1);
}
return (n_multidims == 1 && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
Collaborator

I think you want self.numel() == self.sizes(dim)

Contributor Author

Thank you @ngimel for your suggestion. Will change it to "return (self.numel() == self.sizes(dim) && self.numel() >= 10000);"
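To illustrate the difference the reviewer is pointing at, here is a hypothetical standalone sketch (not ATen code; function names are made up): counting non-trivial dimensions ignores which dimension topk reduces over, whereas numel() == size(dim) additionally requires that all elements lie along the reduction dimension.

```cpp
#include <cstdint>
#include <vector>

// Original predicate: exactly one dimension has size > 1,
// regardless of which dimension topk reduces over.
bool one_nontrivial_dim(const std::vector<int64_t>& sizes) {
  std::size_t n_multidims = 0;
  for (int64_t s : sizes) n_multidims += (s > 1);
  return n_multidims == 1;
}

// Reviewer's predicate: every element lies along the reduction
// dimension `dim`, i.e. numel() == size(dim).
bool all_elems_along_dim(const std::vector<int64_t>& sizes, std::size_t dim) {
  int64_t numel = 1;
  for (int64_t s : sizes) numel *= s;
  return numel == sizes[dim];
}
```

For a shape of {1, 20000} with dim = 0, the first predicate holds, but the second does not: the reduction dimension has size 1, so the sort path would gain nothing there.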

apakbin added a commit to apakbin/pytorch that referenced this pull request Feb 14, 2025
Contributor Author

apakbin commented Feb 14, 2025

@pytorchbot merge


pytorch-bot bot commented Feb 14, 2025

Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those, or skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@pruthvistony
Collaborator

@apakbin
Waiting for the PR to finish the CI run.

Contributor Author

apakbin commented Feb 14, 2025

@pytorchbot rebase

@pruthvistony
Collaborator

@apakbin,
The UTs are failing on the post-merge job, which runs on MI300, compared to the pre-merge job, which runs on MI200. Please debug and check what could be causing this.

cc @jeffdaily @jithunnair-amd

bool disable_sort_for_topk();
bool should_use_sort(const Tensor& self, int64_t dim) {
#if defined(USE_ROCM)
return (self.numel() == self.size(dim) && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
Collaborator

Suggested change
return (self.numel() == self.size(dim) && self.numel() >= 10000); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387
return (self.numel() >= 10000 && self.numel() == self.size(dim)); // based on the experiments in https://github.com/pytorch/pytorch/pull/146387

Contributor Author

Thank you, @ngimel.

Contributor Author

apakbin commented Feb 18, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team
Raised by workflow job

Failing merge rule: Core Maintainers

Contributor Author

apakbin commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

apakbin added a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
TopK performance on ROCm performs better on the test suite with the default config.

Pull Request resolved: pytorch#146387
Approved by: https://github.com/malfet, https://github.com/ngimel
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
apakbin added a commit to ROCm/pytorch that referenced this pull request Feb 26, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Feb 26, 2025

Labels

ciflow/inductor-rocm: Trigger "inductor" config CI on ROCm
ciflow/rocm: Trigger "default" config CI on ROCm
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
module: rocm: AMD GPU support for Pytorch
open source
rocm priority: high priority ROCm PRs from performance or other aspects
rocm: This tag is for PRs from ROCm team
topic: not user facing: topic category

6 participants