Add optimized mapreduce implementation for SkipMissing #27743
```julia
i > ilast && return mapreduce_first(f, op, a1)
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
```
The fact that this line is required to get SIMD is really intriguing. In the present case, we know `A[i]` is non-missing, so it should be perfectly fine to keep. And this assertion doesn't seem to force the compiler to incorrectly assume that all entries in `A` are non-missing, since we do get missing values in the loop below.
Minimal reproducer:

```julia
function g(A)
    v = 0
    # Uncomment to get SIMD
    # A[1]::Base.nonmissingtype(eltype(A))
    @simd for i = 1:length(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai
        end
    end
    return v
end

@code_llvm g([1, missing])
```
Maybe open the minimal reproducer in a standalone issue in order to get more attention?
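For reference, here is a sketch of the reproducer with the workaround assertion uncommented; `g_simd` is a hypothetical name used only for this illustration, not something from the PR:

```julia
# Variant of the reproducer above with the type assertion uncommented.
# `g_simd` is a hypothetical name for illustration only.
function g_simd(A)
    v = 0
    # This assertion is the workaround discussed in this thread: it doesn't
    # change the loop's semantics, but it allows SIMD to be emitted.
    A[1]::Base.nonmissingtype(eltype(A))
    @simd for i = 1:length(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai
        end
    end
    return v
end

g_simd(Union{Int,Missing}[1, 2, missing, 4])  # sums the non-missing entries: 7
```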
Seems like something might be preventing LICM (loop-invariant code motion): in the non-SIMD version, it's reloading the data pointer on every iteration, and any non-functional code that uses it in the header will allow for SIMD.
```julia
julia> _false = false

julia> function g(A)
           v = 0
           # Uncomment to get SIMD
           x = A[1]
           if isa(x, Missing) && _false
               error()
           end
           @simd for i = 1:length(A)
               @inbounds ai = A[i]
               if ai !== missing
                   v += ai
               end
           end
           return v
       end
```
Yes, I can file a separate issue if that's useful, but @vtjnash can probably write a much more useful description than I can.
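As a rough local check (not part of the PR), one can search the generated LLVM IR for vector types to see whether a loop was vectorized. This sketch uses `InteractiveUtils.code_llvm`; the function `h` and the regex heuristic are assumptions for illustration:

```julia
using InteractiveUtils  # for code_llvm

# Same shape as the reproducer above, without the workaround assertion.
function h(A)
    v = 0
    @simd for i = 1:length(A)
        @inbounds ai = A[i]
        if ai !== missing
            v += ai
        end
    end
    return v
end

# Capture the LLVM IR as a string and look for SIMD vector types
# such as "<4 x i64>"; absence of a match suggests no vectorization.
ir = sprint(code_llvm, h, Tuple{Vector{Union{Int,Missing}}})
uses_simd = occursin(r"<\d+ x ", ir)
```

Whether `uses_simd` is true will depend on the Julia version and CPU, so this is only a diagnostic aid, not a guarantee.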
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Nanosoldier results are really inconsistent. @ararslan Could something weird be going on with the comparison? I ask because at #27386 there are improvements which I cannot reproduce locally either.
Probable explanation here: #27386 (comment). Let's run it again. @nanosoldier

Restarted the server. @nanosoldier

Unfortunately, I don't think that was the case here. I've not (yet) known Nanosoldier to blatantly lie to us. In this case, it reported comparing 9e7f85d against 0978251. Those are the correct commits — neither the PR nor master had the random change incorporated.

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Indeed, the regressions are still there. I'm really stuck. I've tried again locally after rebasing on master, and for example I get this:

```julia
julia> master["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.462 μs (0.00% GC)
  median time:      1.524 μs (0.00% GC)
  mean time:        1.568 μs (0.00% GC)
  maximum time:     33.038 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> pr["array"]["(\"skipmissing\", sum, Float64, false)"]
BenchmarkTools.Trial:
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     732.000 ns (0.00% GC)
  median time:      793.000 ns (0.00% GC)
  mean time:        5.053 μs (0.00% GC)
  maximum time:     42.571 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

What can we do about that? Would somebody check locally whether my results are confirmed?
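For anyone wanting to check locally without the full BaseBenchmarks setup, a rough Base-only timing sketch along these lines might help; the data layout, missing rate, and iteration count here are arbitrary choices, and the real numbers above come from the much more careful BenchmarkTools harness:

```julia
# Rough Base-only timing sketch; the trials quoted above were produced
# by BenchmarkTools, which handles warm-up and noise much better.
x = Union{Float64,Missing}[rand() < 0.1 ? missing : rand() for _ in 1:1000]

sum(skipmissing(x))  # warm up compilation first
t = @elapsed for _ in 1:10_000
    sum(skipmissing(x))
end
t  # total seconds for 10_000 reductions
```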
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
I've found out why Nanosoldier finds regressions. I'm still not sure why I wasn't able to reproduce the problem before, but at least now I see it when running BaseBenchmarks. It only happens for vectors of ~1000 entries or less (as in the benchmark, but not in my other checks), and only when … I've fixed it by only using the new method when … There's still a 20% regression locally for …

@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
Good news: most regressions have disappeared and turned into improvements. The only remaining related regressions are for … So barring objections I'll merge this tomorrow.
Awesome! But, just to keep in mind, another common operation on datasets is to compute the mean per group, where the size per group may typically be < 1000.
Yeah, I realized that too. Anyway the regression isn't too bad, and the goal is really to enable SIMD, which will be faster in all cases (except maybe for very small vectors).
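To illustrate the small-group scenario mentioned above, a per-group mean over vectors with missing values might look like this (the group names and data here are hypothetical, purely for illustration):

```julia
using Statistics  # for mean

# Hypothetical small groups with missing values, as in the per-group
# mean use case discussed above (each group well under 1000 elements).
groups = Dict(
    :a => Union{Int,Missing}[1, 2, missing, 4],
    :b => Union{Int,Missing}[missing, 10],
)

# mean(skipmissing(v)) averages only the non-missing entries of each group.
group_means = Dict(k => mean(skipmissing(v)) for (k, v) in groups)
```

Each `mean(skipmissing(v))` call here reduces a very short iterator, which is exactly the regime where the small-vector regression would be felt.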
```julia
i > ilast && return Some(mapreduce_first(f, op, a1))
a2 = ai::eltype(itr)
# Unexpectedly, the following assertion allows SIMD instructions to be emitted
A[i]::eltype(itr)
```
Based on the AbstractArray methods. Allows the compiler to emit SIMD instructions for `SkipMissing{Vector{Int}}`, but also improves performance for `SkipMissing{Vector{Float64}}`.

The performance improvement is clear for `Int` (3×-5× less time), but less so for `Float64` (-30%). However, since SIMD now works for `Int`, there's a chance it could eventually be enabled for `Float64`, and it should be quite a bit easier than with the generic `mapreduce` fallback. Also, these 30% are enough to get as fast as R (maybe even a bit faster), which is an important reference point (see this Discourse thread).

Fixes #27679.
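As a quick illustration of the kind of reductions this PR optimizes (a sketch; the expected results in the comments were worked out by hand):

```julia
# Reductions over a skipmissing wrapper ignore the missing entries.
x = Union{Int,Missing}[1, missing, 3, missing, 5]

sum(skipmissing(x))                 # 1 + 3 + 5 = 9
mapreduce(abs2, +, skipmissing(x))  # 1 + 9 + 25 = 35
```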