
[Graph][Benchmark] Update benchmark function #363

Merged
Aalanli merged 11 commits into hidet-org:main from Aalanli:update-benchmark
Oct 12, 2023
Conversation

@Aalanli (Contributor) commented Oct 10, 2023

The old benchmarking function did not clear the L2 cache, so repeated runs are biased.
This is especially prevalent when tuning the parallel-k partitions, where k_parts=1 is always selected because of L2 cache hits, even when it is not the fastest implementation.
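The flush-between-runs pattern described above can be sketched as follows. This is a minimal, hypothetical illustration (not hidet's actual benchmark code): a large scratch buffer is overwritten between timed iterations so repeated calls cannot reuse cached inputs. On a GPU the buffer would be a device allocation sized past the L2 cache; here a host `bytearray` stands in so the sketch runs anywhere. All names (`benchmark`, `flush_bytes`) are illustrative.

```python
import time

def benchmark(fn, warmup=5, repeat=20, flush_bytes=64 * 1024 * 1024):
    """Sketch of cache-flushing benchmarking (illustrative, CPU stand-in).

    Overwrites a scratch buffer between timed runs so that `fn` cannot
    benefit from data left in cache by the previous iteration. Returns
    the median latency in milliseconds.
    """
    scratch = bytearray(flush_bytes)
    for _ in range(warmup):
        fn()
    latencies = []
    for i in range(repeat):
        # Evict cached data by writing the whole scratch buffer.
        scratch[:] = bytes([i % 256]) * flush_bytes
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1e3)  # ms
    latencies.sort()
    return latencies[len(latencies) // 2]  # median is robust to outliers
```

Without the flush, the first iteration pays for cold caches while every later one runs warm, which systematically favors candidates (such as k_parts=1) whose working set fits in cache.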

@yaoyaoding (Member) commented
Hi Allan,

The PR looks good to me. But before merging, it would be better to have a demo showing how clearing the L2 cache improves the accuracy of the parallel-k selection.

@Aalanli (Contributor, Author) commented Oct 11, 2023

After some further investigation, it appears that clearing the L2 cache is not the largest contributor; rather, it is the use of torch.cuda.Event, which I assume is more accurate than time.time().
Here is my benchmarking script for reference: https://gist.github.com/Aalanli/b81d1a751a78ea72b491d872aa993f9e

[screenshots: latency comparison plots for the three benchmark variants]

@Aalanli (Contributor, Author) commented Oct 11, 2023

- `orig-latency` is the original benchmark function.
- `new-latency` is the benchmark function used by this PR.
- `orig-latency-with-event` is a benchmark function that uses torch.cuda.Event but does not clear the L2 cache.
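The accuracy gap between time.time()-style and event-style timing comes from when the clock is read relative to asynchronous GPU work. The sketch below is a CPU-only analogy (it uses a background thread as a stand-in for an asynchronous kernel launch; all names are hypothetical, and this is not hidet's or torch's implementation): reading the clock right after the launch call returns captures only launch overhead, while waiting for completion first, analogous to recording a pair of torch.cuda.Event objects and calling elapsed_time after synchronization, captures the full latency.

```python
import threading
import time

def fake_kernel():
    """Stand-in for a ~50 ms GPU kernel (hypothetical workload)."""
    time.sleep(0.05)

def launch_async(work):
    """Stand-in for an asynchronous kernel launch: returns immediately
    while the work runs in the background, like a CUDA stream."""
    t = threading.Thread(target=work)
    t.start()
    return t

def naive_latency_ms():
    # time.time()-style: read the clock as soon as the launch call
    # returns, so only the launch overhead is measured.
    t0 = time.perf_counter()
    handle = launch_async(fake_kernel)
    ms = (time.perf_counter() - t0) * 1e3
    handle.join()  # clean up the background work
    return ms

def synced_latency_ms():
    # Event-style: wait for completion before reading the clock,
    # analogous to start/end torch.cuda.Event pairs with elapsed_time.
    t0 = time.perf_counter()
    handle = launch_async(fake_kernel)
    handle.join()  # "synchronize"
    return (time.perf_counter() - t0) * 1e3
```

In the real GPU setting, torch.cuda.Event also records timestamps on the device itself rather than on the host, which removes host-side launch and Python overhead from the measurement.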

@Aalanli (Contributor, Author) commented Oct 11, 2023

I just removed the torch dependencies.

@Aalanli Aalanli merged commit 82ddb8c into hidet-org:main Oct 12, 2023