Fix performance regression in hvcat of simple matrices#57422
IanButterworth merged 13 commits into JuliaLang:master
Conversation
I don't think the test failure is related? It occurred testing
There's unfortunately extra overhead for everything else not intended to be addressed, looks like mostly because

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.
I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4. Two other aspects were also addressed.

EDIT: There was one trade-off I didn't find an optimal solution for, and settled on resolving in favor of all-arrays as the more common choice (no change from master). If the elements to cat are all arrays, then the dimension calculation in
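The size-based branch described above can be sketched roughly as follows. This is a minimal illustration, not the actual Base code: `copy_block!` and the threshold constant are hypothetical names, and only the >4-element crossover mentioned above is taken from the discussion.

```julia
# Hypothetical sketch of the small-array branch: very small source blocks
# are copied with a scalar loop, larger ones with a whole-block ranged
# assignment. Names and structure are illustrative only.
const SCALAR_LOOP_THRESHOLD = 4

function copy_block!(dest::AbstractMatrix, src::AbstractMatrix,
                     roffset::Integer, coffset::Integer)
    if length(src) <= SCALAR_LOOP_THRESHOLD
        # Scalar loop: avoids block-copy setup overhead for tiny blocks.
        for j in axes(src, 2), i in axes(src, 1)
            dest[roffset + i, coffset + j] = src[i, j]
        end
    else
        # Whole-block copy: one ranged assignment instead of per-element
        # index arithmetic.
        dest[roffset .+ axes(src, 1), coffset .+ axes(src, 2)] = src
    end
    return dest
end
```

The design point is that the branch trades a cheap length check for avoiding the fixed cost of setting up a block copy when the block is only a handful of elements.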
@nanosoldier

Your job failed.
Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?

@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix?
@nanosoldier

Your job failed.
It is a bug in BaseBenchmarks.
Co-authored-by: Andy Dienes <51664769+adienes@users.noreply.github.com>
(force-pushed from c5e9be3 to 8fb69a8)
I couldn't reproduce these slowdowns, unfortunately. I reverted the
I noticed that in this PR I have the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672. I would just need to gate the small-array scalar optimization on
@KristofferC can we please get this into 1.12?
I think the consequences are still not entirely evaluated, e.g. I'm seeing this regression:

It might be easier to merge individual independent pieces of this PR, for example the precomputed

So, turns out this is what the
And after #59025 (on top of this PR), it would be way faster still.

As I understand it, this PR contains four mostly independent optimizations:

The first three seem like pretty safe improvements. It's the last of these that strikes me as more fragile and harder to evaluate. For example
How will #61426 interact with this? What would you recommend I look at to test the block-copying better?

Also, I mentioned earlier that strictly making the for loop optimization only applicable to
#61426 is separate and shouldn't interact with this one way or the other.

I think you've convinced me that I'm being too picky and this is good to go.

@KristofferC let me know if the backport labels are ok here. I know it's not customary to backport performance improvements, but it seems ok given the context of this issue?
As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729. I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Since the compiler has been made a stdlib, I believe that made this PR possible.

Done: Before merging I would also want to check that I didn't hurt the hvncat performance at all.

This should ideally be marked for 1.12 backport.
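To make the copy-as-a-whole idea concrete: rather than computing a linear index for every element, each source block can be placed into the destination with a single ranged assignment built from its per-dimension offsets. The sketch below is illustrative only; `place_block!` and the offsets tuple are hypothetical names, not the actual typed_hvncat internals.

```julia
# Illustrative only: place a source block into `dest` at the given
# per-dimension offsets with one ranged assignment, generalizing the
# 2-D copy-as-a-whole pattern to arbitrary dimensions.
function place_block!(dest::AbstractArray{T,N}, src::AbstractArray{T,N},
                      offsets::NTuple{N,Int}) where {T,N}
    ranges = ntuple(d -> offsets[d] .+ axes(src, d), N)
    dest[ranges...] = src  # one block copy instead of per-element indexing
    return dest
end
```

Recall that the surface syntax `[A B; C D]` lowers to hvcat (and the `;;;`-style syntax to hvncat), so a placement like this runs once per block rather than once per element.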