[Graph][Benchmark] Update benchmark function #363
Merged
Aalanli merged 11 commits into hidet-org:main on Oct 12, 2023
Member
Hi Allan, the PR looks good to me. Before merging, though, it would be better to have a demo showing that clearing the L2 cache improves the accuracy of the parallel-k selection.
Contributor
Author
After some further investigation, it appears that clearing the L2 cache is not the greatest contributor, but the usage of
Contributor
Author
`orig-latency` is the latency measured with the original benchmark function.
Contributor
Author
I just removed the torch dependencies.
yaoyaoding approved these changes on Oct 11, 2023


The old benchmark function did not clear the L2 cache between runs, so repeated runs were biased by data left in the cache by earlier runs.
This is especially noticeable when tuning parallel-k parts: the tuner always selects `k_parts=1` because of L2 cache hits, even when that is not the fastest implementation.
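
To illustrate the idea, here is a minimal sketch of a benchmark loop that evicts cached data before every timed run. It is not the code in this PR: `benchmark`, `run`, and `clear_cache` are hypothetical names, and timing is done on the host with `time.perf_counter`. On a GPU, `clear_cache` would typically overwrite a scratch buffer larger than the L2 cache, and `run` would have to synchronize the device before returning for the timing to be meaningful.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(
    run: Callable[[], None],
    clear_cache: Optional[Callable[[], None]] = None,
    warmup: int = 3,
    repeat: int = 10,
) -> float:
    """Return the median latency of `run` in milliseconds.

    `clear_cache` is a hypothetical hook (an assumption of this sketch).
    It is called before each timed run, outside the timed region, so that
    every measurement starts from a cold cache.
    """
    # Warm-up runs are not timed; they absorb one-time costs such as
    # lazy initialization.
    for _ in range(warmup):
        run()

    latencies = []
    for _ in range(repeat):
        if clear_cache is not None:
            clear_cache()  # evict cached data; not included in the timing
        start = time.perf_counter()
        run()  # assumed to block until the work has finished
        latencies.append((time.perf_counter() - start) * 1000.0)

    return statistics.median(latencies)
```

Calling `benchmark(run)` without `clear_cache` reproduces the cache-warm measurement described above, so running both variants on the same candidate gives a direct comparison of the two timings.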