empty gpu memory cache between different benchmark cases (#2242)#2243

Merged
liqiangxl merged 1 commit into devel from llu/benckmark_variance
Dec 6, 2022
Conversation

@liqiangxl
Collaborator

Fixes #2242
(1) Reason: PyTorch uses a caching memory allocator to speed up memory allocations. Running multiple cases in a single benchmark run leaves less available GPU memory for the last case and more allocated blocks in the memory pool, so it may take longer to find an appropriate block.
(2) Fix: clear the memory pool between benchmark cases.
(3) Results: after the fix, the performance difference dropped from 13% to 1%.
Before fix: running 7 cases, the last case achieves 1.042 TB/s (1067/1024)
After fix: running 7 cases, the last case achieves 1.190 TB/s
Running only the last case: 1.178 TB/s
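The idea in the fix can be sketched as a small benchmark harness that empties the caching allocator's pool before each case, so every case starts from a comparable memory state. This is a minimal Python sketch, assuming PyTorch is installed (`torch.cuda.empty_cache()` is PyTorch's public API for releasing cached blocks); the helper and case names are hypothetical, not the PR's actual C++ code.

```python
import time

try:
    import torch  # assumption: PyTorch is available in the benchmark environment
    _HAVE_TORCH = True
except ImportError:
    _HAVE_TORCH = False


def clear_gpu_memory_cache():
    """Return unused cached blocks to the CUDA driver.

    Without this, earlier cases leave the caching allocator's pool
    fragmented, which can slow allocations in later cases.
    """
    if _HAVE_TORCH and torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for pending kernels to finish
        torch.cuda.empty_cache()   # release unused cached blocks


def run_benchmarks(cases):
    """Run (name, fn) pairs, clearing the GPU cache between cases."""
    results = {}
    for name, fn in cases:
        clear_gpu_memory_cache()   # the fix: empty the pool between cases
        start = time.perf_counter()
        fn()
        results[name] = time.perf_counter() - start
    return results
```

With this structure the last case in a multi-case run sees roughly the same allocator state as a run containing only that case, which is what closes the 13% gap reported above.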

@liqiangxl liqiangxl requested a review from naoyam December 6, 2022 14:03
@csarofeen
Owner

I was just having a conversation with someone recently about how we profile results, specifically whether we should be concerned about how performance changes under a full workload versus isolated benchmarks. It might have been with @drzejan2. This is a pretty interesting example where even the allocator can interfere. The caching allocator has worked really well for us for a long time, but it might be time for us to start placing allocations more carefully in real workloads.

I'm fine with the change, just an interesting concrete example of concerns that are often hand-wavy.

@naoyam
Collaborator

naoyam commented Dec 6, 2022

That's a nice catch, @liqiangxl! I thought we didn't include memory allocations in the measurement.

Collaborator

@naoyam naoyam left a comment


LGTM

@liqiangxl liqiangxl merged commit 673d40c into devel Dec 6, 2022
@liqiangxl liqiangxl deleted the llu/benckmark_variance branch December 6, 2022 23:23
@drzejan2
Collaborator

drzejan2 commented Dec 7, 2022

@csarofeen thanks for letting me know about this fix.
I haven't moved to the project's benchmark infrastructure yet, but I plan to start moving in that direction really soon. For the time being I will rely on a simple sample and the CUDA events from its execution.



Development

Successfully merging this pull request may close these issues.

10% difference in benckmark results of different runs with different number of cases

4 participants