empty gpu memory cache between different benchmark cases (#2242)#2243
Conversation
I was just having a conversation with someone recently about how we profile results, specifically whether we're concerned about how performance can change under the full workload versus isolated benchmarks. Might have been with @drzejan2. This is a pretty interesting example where even the allocator can interfere. The caching allocator has worked really well for us for a long time, but it might be time for us to start placing allocations more carefully in real workloads. I'm fine with the change, just an interesting concrete example of concerns that are often hand-wavy.
That's a nice catch, @liqiangxl! I thought we didn't include memory allocations in the measurement. |
@csarofeen thanks for letting me know about this fix. |
Fixes #2242
(1) Reason: PyTorch uses a caching memory allocator to speed up memory allocations. Testing multiple cases in a single run leaves less available GPU memory for the last case and more allocated pieces in the memory pool, so it may take longer to find an appropriate piece of memory.
(2) Fix: clear the memory pool between benchmark cases.
(3) Results: after the fix, the performance difference dropped from 13% to 1%.
Before fix: running 7 cases, the performance of the last case is 1.042 TB/s (1067/1024)
After fix: running 7 cases, the performance of the last case is 1.190 TB/s
Running only the last case: 1.178 TB/s
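For illustration, the fix described above amounts to releasing the caching allocator's unused blocks between cases. A minimal sketch of that pattern, assuming a PyTorch-based harness (the `benchmark_cases` helper and its wall-clock timing are hypothetical stand-ins for the actual benchmark infrastructure, which would use CUDA events):

```python
import time

import torch


def benchmark_cases(cases, iters=10):
    """Time each callable in `cases`, clearing the CUDA caching
    allocator's pool between cases so one case's cached blocks
    don't fragment the pool seen by the next (a sketch of the
    pattern, not the actual harness)."""
    results = []
    for fn in cases:
        if torch.cuda.is_available():
            # Return cached, unused blocks to the driver so the next
            # case starts from a clean memory pool.
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        results.append(time.perf_counter() - start)
    return results
```

Note that `torch.cuda.empty_cache()` does not free tensors that are still alive; it only releases blocks the allocator is holding in reserve, which is exactly the state that accumulates across consecutive benchmark cases.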