Simplify zip iterator and improve performance in certain cases#27415
Simplify zip iterator and improve performance in certain cases#27415haampie wants to merge 4 commits intoJuliaLang:masterfrom
Conversation
|
Very cool! Test failure in |
|
Yeah, test failures are annoying, but potentially there's still a workaround without the machinery proposed in #27412. For vectorization it seems crucial to call |
|
It seems this passes the iterators tests Can we tell the compiler somehow the first branch in |
|
Alright, things get vectorized again. This stuff is extremely fragile, so maybe I should add comments with warnings about spoiling vectorization. It could be way more robust and simpler if we did not have to worry about stateful iterators. edit: I stumbled upon an inference issue in the case of 3+ iterators -- let's see if it can be fixed. |
Furthermore, reorder the instructions for the fast path to enable vectorization in some cases.
|
n.b. #26765 |
|
Thanks for the pointer, I wasn't aware of that issue. |
|
Some observations: https://gist.github.com/haampie/5dd6803b217709c4629d3eeb75c16258 Basically, we can do really well if we specialize for stateless iterators. Type inference seems perfect even with recursion up to 14 calls of I cannot get all these great results to work with the current way stateful iterators are handled. If I find time I'll take a look at how we could use simple traits to work around this. |
|
I think I might give up on this pr -- at least on the recursive structure. It's really more of an art than a science to get performance right. The results in the comment above look promising, but in turns out it does not generate optimal code for function vectorize_multiple_products(xs::Vector{T}, ys::Vector{T}, zs::Vector{T}) where {T}
s = zero(T)
for (x,y,z) = zip(xs, ys, zs)
s += x * y * z
end
s
end
@benchmark vectorize_multiple_products(xs, ys, zs) = setup(xs = rand(1000); ys = rand(1000); zs = rand(1000);)which is about 60% slower than the current implementation of zip; unless one interchanges the |
This is a separate pr built on top of #27386. It removes the
Zip2struct, since what used to beZip1is already returning a 1-tuple and hence is a base case (turns out @StefanKarpinski was right in the comments here: eaf5a95, but maybe only afterZip1was updated.)Further it fixes a bug in
isdonewhere the tail of the zip iterator was never checked (https://github.com/JuliaLang/julia/blob/master/base/iterators.jl#L373). Should still add a test for this.I've built this on top of #27386 to show that we now finally get fully optimized code in some cases -- note the lack of
@inbounds:On current master:
On this branch:
This is now on par with writing
for (i, w) = enumerate(v)and with the handwritten loop