Profile: Shuffle profile round robin thread order before taking every sample #41732

Merged: NHDaly merged 1 commit into JuliaLang:master from NHDaly:nhd-profile-shuffle-thread-order on Aug 2, 2021

Conversation

@NHDaly NHDaly (Member) commented Jul 30, 2021

One approach to addressing #33490.

As described in #9224 (comment), this PR has Profile.jl randomly permute the order in which it samples the threads for profiling. It still samples all threads on every pause, but it maintains an array specifying the order in which to sample them, shuffles that array before each sample, and then samples the threads in that order.

The hope is that this will reduce the likelihood of introducing artificial contention into the program: by no longer pausing the threads in a consistent order, we avoid the pileup of threads that can occur when they share a mutex, as described here:
#33490 (comment)


We seem to think that the best approach is to change the profiler to:

  • use a random sample interval
  • only sample one thread each time, chosen randomly

But this PR is a good step in the right direction.

One potential concern is that this could unacceptably reduce the performance of profiling; that remains to be seen.

I did some small measurements, and shuffling the array seems reasonably fast:

julia> const x = [1,2,3]
3-element Vector{Int64}:
 1
 2
 3

julia> const seed = Ref(0)
Base.RefValue{Int64}(0)

julia> @btime @ccall jl_shuffle_int_array_inplace(pointer(x)::Ptr{Int}, length(x)::Csize_t, seed::Ptr{Int})::Cvoid
  17.172 ns (0 allocations: 0 bytes)
julia> const x = collect(1:100)
100-element Vector{Int64}:
   1
   ⋮
 100

julia> const seed = Ref(0)
Base.RefValue{Int64}(0)

julia> @btime @ccall jl_shuffle_int_array_inplace(pointer(x)::Ptr{Int}, length(x)::Csize_t, seed::Ptr{Int})::Cvoid
  710.507 ns (0 allocations: 0 bytes)

@NHDaly NHDaly added the feature (Indicates new feature / enhancement requests) and performance (Must go faster) labels on Jul 30, 2021
@NHDaly NHDaly requested a review from vtjnash July 30, 2021 01:32
@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

@vtjnash: is this more or less what you had in mind? Thanks!

@vtjnash vtjnash (Member) commented Jul 30, 2021

Note that I don't think this will reduce the appearance of artificial contention, since threads are still likely to pile up against mutexes held while stopped for the profile; the contention will just now be evenly distributed over threads, instead of being proportional to thread id.

@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

I don't know if I have the power to invoke a benchmark suite run. Can you do that? Do I just run NanoSoldier? Are there profiling-specific benchmarks to run here?

@NHDaly NHDaly (Member, Author) commented Jul 30, 2021

Actually, it looks like there aren't any benchmarks for Profile? 🤔
https://github.com/JuliaCI/BaseBenchmarks.jl/search?q=profile

Uses O(n) "modern Fisher–Yates shuffle"
 - https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

Add C buffer to store order for sampling threads in Profile, which is
shuffled on every sample.
@NHDaly NHDaly force-pushed the nhd-profile-shuffle-thread-order branch from c458fec to 8caa33c on August 1, 2021 21:43
@NHDaly NHDaly (Member, Author) commented Aug 1, 2021

Okay excellent, it looks like there's no measurable perf impact here:

Before (Julia 1.7):

julia> Threads.nthreads()
100

julia> function myfunc()
           A = rand(200, 200, 400)
           maximum(A)
       end
myfunc (generic function with 1 method)

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  40.615 ms (2 allocations: 122.07 MiB)
0.9999999702097592

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  42.577 ms (2 allocations: 122.07 MiB)
0.9999998284031375

After (this PR):

julia> @btime (Profile.init(n=Int(1e8), delay=0.0001); Profile.clear(); @profile myfunc())
  41.659 ms (2 allocations: 122.07 MiB)
0.9999999728178411

julia> @btime (Profile.init(n=Int(1e8), delay=0.0000001); Profile.clear(); @profile myfunc())
  41.263 ms (2 allocations: 122.07 MiB)
0.9999999880968238

I also tried with peakflops() with varying delay sizes, and with different numbers of threads, and similarly never saw an impact. :)


So, do I just merge this now? 😬 Sorry, still not sure about the etiquette now that I have commit rights! 😬

@NHDaly NHDaly (Member, Author) commented Aug 2, 2021

Okay, after some offline confirmation, I'm merging this now!

@NHDaly NHDaly merged commit ed13d09 into JuliaLang:master Aug 2, 2021
@NHDaly NHDaly deleted the nhd-profile-shuffle-thread-order branch August 2, 2021 16:49
@vtjnash vtjnash (Member) commented Aug 2, 2021

Excellent!
