Profile: Shuffle profile round robin thread order before taking every sample#41732
Conversation

@vtjnash: is this more or less what you had in mind? Thanks!

Note that I don't think this will reduce the appearance of artificial contention, since threads are still likely to pile up against mutexes held while stopped for the profile; it's just that the contention will now be evenly distributed over the threads, instead of being proportional to thread id.

I don't know if I have the power to invoke a benchmark suite run. Can you do that? Do I just run NanoSoldier? Are there profiling-specific benchmarks to run here?
Actually, it looks like there aren't any benchmarks for

Uses O(n) "modern Fisher–Yates shuffle" - https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm
Add C buffer to store order for sampling threads in Profile, which is shuffled on every sample.
Okay excellent, it looks like there's no measurable perf impact here:

Before (Julia 1.7):

julia> Threads.nthreads()
100

julia> function myfunc()
           A = rand(200, 200, 400)
           maximum(A)
       end
myfunc (generic function with 1 method)

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  40.615 ms (2 allocations: 122.07 MiB)
0.9999999702097592

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  42.577 ms (2 allocations: 122.07 MiB)
0.9999998284031375

After (this PR):

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  41.659 ms (2 allocations: 122.07 MiB)
0.9999999728178411

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  41.263 ms (2 allocations: 122.07 MiB)
0.9999999880968238

I also tried with

So, do I just merge this now? 😬 Sorry, still not sure about the etiquette now that I have commit rights! 😬
Okay, after some offline confirmation, I'm merging this now!

Excellent!
One approach to addressing #33490.
As described here: #9224 (comment), this PR has Profile.jl randomly permute the order in which it samples the threads for profiling. It still samples all threads on every pause, but it maintains an array holding the order in which to sample the threads, shuffles that array before each sample, and then samples the threads according to that order.
The hope is that this will reduce the likelihood of introducing artificial contention into the program, by avoiding pausing the threads in a consistent order, which can cause a pileup of threads if they share a mutex, as described here:
#33490 (comment)
We seem to think that the best approach is to change the profiler to:
But this PR is a good step in the right direction.
One potential concern is that this could unacceptably reduce the performance of profiling; that remains to be seen.
I did some small measurements, and shuffling the array seems reasonably fast: