Profile: Shuffle profile round robin thread order before taking every sample #41732

Merged: NHDaly merged 1 commit into JuliaLang:master from NHDaly:nhd-profile-shuffle-thread-order on Aug 2, 2021

Conversation

@NHDaly NHDaly (Member) commented Jul 30, 2021

One approach to addressing #33490.

As described in #9224 (comment), this PR has Profile.jl randomly permute the order in which it samples the threads for profiling. It still samples all threads on every pause, but it maintains an array specifying the order in which to sample them, shuffles that array before each sample, and then samples the threads in that order.

The hope is that this will reduce the likelihood of introducing artificial contention into the program: by no longer pausing the threads in a consistent order, we avoid the pileup of threads that can occur when they share a mutex, as described here:
#33490 (comment)


We seem to think that the best approach is to change the profiler to:

  • use a random sample interval
  • only sample one thread each time, chosen randomly

But this PR is a good step in the right direction.

One potential concern is that this could unacceptably reduce the performance of profiling; that remains to be seen.

I did some small measurements, and shuffling the array seems reasonably fast:

julia> const x = [1,2,3]
3-element Vector{Int64}:
 1
 2
 3

julia> const seed = Ref(0)
Base.RefValue{Int64}(0)

julia> @btime @ccall jl_shuffle_int_array_inplace(pointer(x)::Ptr{Int}, length(x)::Csize_t, seed::Ptr{Int})::Cvoid
  17.172 ns (0 allocations: 0 bytes)
julia> const x = collect(1:100)
100-element Vector{Int64}:
   1
   ⋮
 100

julia> const seed = Ref(0)
Base.RefValue{Int64}(0)

julia> @btime @ccall jl_shuffle_int_array_inplace(pointer(x)::Ptr{Int}, length(x)::Csize_t, seed::Ptr{Int})::Cvoid
  710.507 ns (0 allocations: 0 bytes)

@NHDaly NHDaly added the feature (Indicates new feature / enhancement requests) and performance (Must go faster) labels on Jul 30, 2021
@NHDaly NHDaly requested a review from vtjnash July 30, 2021 01:32
@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

@vtjnash: is this more or less what you had in mind? Thanks!

@vtjnash vtjnash (Member) commented Jul 30, 2021

Note that I don't think this will reduce the appearance of artificial contention, since threads are still likely to pile up against mutexes held while stopped for the profile; the contention will just now be evenly distributed over threads, instead of being proportional to thread id.

@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

I don't know if I have the power to invoke a benchmark suite run. Can you do that? Do I just run NanoSoldier? Are there profiling-specific benchmarks to run here?

@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

Actually, it looks like there aren't any benchmarks for Profile? 🤔
https://github.com/JuliaCI/BaseBenchmarks.jl/search?q=profile

Uses O(n) "modern Fisher–Yates shuffle"
 - https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

Add C buffer to store order for sampling threads in Profile, which is
shuffled on every sample.
@NHDaly NHDaly force-pushed the nhd-profile-shuffle-thread-order branch from c458fec to 8caa33c on August 1, 2021 21:43
@NHDaly NHDaly (Member, Author) commented Aug 1, 2021

Okay excellent, it looks like there's no measurable perf impact here:

Before (Julia 1.7):

julia> Threads.nthreads()
100

julia> function myfunc()
           A = rand(200, 200, 400)
           maximum(A)
       end
myfunc (generic function with 1 method)

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  40.615 ms (2 allocations: 122.07 MiB)
0.9999999702097592

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  42.577 ms (2 allocations: 122.07 MiB)
0.9999998284031375

After (this PR):

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  41.659 ms (2 allocations: 122.07 MiB)
0.9999999728178411

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  41.263 ms (2 allocations: 122.07 MiB)
0.9999999880968238

I also tried with peakflops() with varying delay sizes, and with different numbers of threads, and similarly never saw an impact. :)


So, do I just merge this now? 😬 Sorry, still not sure about the etiquette now that I have commit rights! 😬

@NHDaly NHDaly (Member, Author) commented Aug 2, 2021

Okay, after some offline confirmation, I'm merging this now!

@NHDaly NHDaly merged commit ed13d09 into JuliaLang:master Aug 2, 2021
@NHDaly NHDaly deleted the nhd-profile-shuffle-thread-order branch August 2, 2021 16:49
@vtjnash vtjnash (Member) commented Aug 2, 2021

Excellent!
