
Fix performance regression in hvcat of simple matrices#57422

Merged
IanButterworth merged 13 commits into JuliaLang:master from BioTurboNick:fix-perf-hvcat-mats
Apr 8, 2026

Conversation

@BioTurboNick
Contributor

@BioTurboNick BioTurboNick commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729 .

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
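To illustrate the difference, here is a minimal sketch with hypothetical helper names (not the actual Base implementation): the per-element version recomputes a linear destination index with multiplies and adds for every element, while the CartesianIndices version describes the destination block as a range and assigns it in one shot.

```julia
# Hypothetical sketch, not the Base implementation.

# Per-element: compute a linear destination index for every element.
function copy_block_linear!(dest::Array{T,N}, src::Array{T,N},
                            offsets::NTuple{N,Int}) where {T,N}
    st = strides(dest)  # element-unit strides of the destination
    for (k, I) in enumerate(CartesianIndices(src))
        # N multiplies and adds per element, just to find the slot
        li = 1 + sum(ntuple(d -> (offsets[d] + I[d] - 1) * st[d], N))
        dest[li] = src[k]
    end
    return dest
end

# Block copy: describe the destination region once, assign it as a whole.
function copy_block_cartesian!(dest::Array{T,N}, src::Array{T,N},
                               offsets::NTuple{N,Int}) where {T,N}
    dest[CartesianIndex(offsets .+ 1):CartesianIndex(offsets .+ size(src))] = src
    return dest
end
```

Both place src into dest starting just past the given offsets; the second form lets the ranged assignment copy in bulk rather than paying the index arithmetic per element.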

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe that's what made this PR possible.

Before merging I would also want to check that I didn't hurt the hvncat performance at all. Done

This should ideally be marked for 1.12 backport.

@KristofferC KristofferC added performance Must go faster backport 1.12 Change should be backported to release-1.12 labels Feb 15, 2025
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

I don't think the test failure is related? It occurred while testing the Profile module... EDIT: Yep, not related.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

Unfortunately there's extra overhead for everything else this PR wasn't intended to address, apparently mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.
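For context, a hypothetical sketch of why that conversion is costly: recovering a Cartesian index from a linear one requires one integer division per dimension, in the spirit of Base's internal _ind2sub (the function name below is illustrative, not Base's).

```julia
# Hypothetical sketch of linear-to-Cartesian conversion; the real logic
# lives in Base's internal _ind2sub machinery.
function lin2cart(dims::NTuple{N,Int}, li::Int) where N
    sub = Vector{Int}(undef, N)
    r = li - 1
    for d in 1:N
        r, i = divrem(r, dims[d])  # one integer division per dimension
        sub[d] = i + 1
    end
    return (sub...,)
end
```

Integer division is far slower than the multiply-add that goes the other direction, which is why paying it per element adds up.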

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.

@KristofferC KristofferC mentioned this pull request Feb 15, 2025
31 tasks
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 16, 2025

I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
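The shape of that branch can be sketched like this (hypothetical names and simplified signature, not the actual PR code):

```julia
# Hypothetical sketch of the size-based branch; names are illustrative.
function place_block!(dest::Array{T,N}, src::AbstractArray{T,N},
                      offsets::NTuple{N,Int}) where {T,N}
    if length(src) > 4
        # larger blocks: a single ranged assignment can copy in bulk
        dest[CartesianIndex(offsets .+ 1):CartesianIndex(offsets .+ size(src))] = src
    else
        # tiny blocks: a plain loop avoids the block-copy setup overhead
        for I in CartesianIndices(src)
            dest[CartesianIndex(offsets .+ Tuple(I))] = src[I]
        end
    end
    return dest
end
```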

Two other aspects addressed:

  • 1d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified significant performance reduction in an important case (see below), and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though could that be a symptom of a broader regression?
```julia
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower and many more allocations
```

EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common case (no change from master). If the elements to cat are all arrays, the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, iterating over the elements with enumerate is more efficient. Swap the two strategies and there's substantial overhead: indexing into the tuple (mixed arrays and scalars), or performing the iteration itself (just arrays). Ultimately not a big impact, but a bit of a gripe that the compiler can be fickle like that.
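The trade-off can be illustrated with hypothetical stand-ins for the dimension calculation (not the actual _typed_hvncat_dims code): both reductions compute the same total, but which form the compiler handles well depends on whether the tuple is homogeneous.

```julia
# Hypothetical stand-ins: both return the total extent along dimension 1,
# via the two iteration patterns discussed above.

# Indexing the tuple by position: fast when every element is an array.
dims_indexed(as::Tuple) = sum(i -> size(as[i], 1), eachindex(as))

# Iterating the elements: fast when arrays and scalars are mixed, since
# the compiler can union-split on each element's type.
function dims_iterated(as::Tuple)
    total = 0
    for (i, a) in enumerate(as)  # index available, mirroring the real code
        total += a isa AbstractArray ? size(a, 1) : 1
    end
    return total
end
```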

@KristofferC KristofferC mentioned this pull request Feb 17, 2025
24 tasks
This was referenced Mar 24, 2025
@KristofferC KristofferC mentioned this pull request Apr 4, 2025
51 tasks
@KristofferC KristofferC mentioned this pull request Apr 29, 2025
53 tasks
@KristofferC KristofferC mentioned this pull request May 9, 2025
58 tasks
@KristofferC KristofferC mentioned this pull request Jun 6, 2025
60 tasks
@KristofferC KristofferC mentioned this pull request Jul 22, 2025
20 tasks
@KristofferC KristofferC mentioned this pull request Aug 6, 2025
38 tasks
@KristofferC KristofferC mentioned this pull request Aug 19, 2025
27 tasks
This was referenced Sep 24, 2025
Member

@vtjnash vtjnash left a comment


SGTM

@vtjnash
Member

vtjnash commented Oct 16, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@KristofferC KristofferC mentioned this pull request Oct 21, 2025
35 tasks
@BioTurboNick
Contributor Author

Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?

@BioTurboNick
Contributor Author

@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix?

@vtjnash
Member

vtjnash commented Nov 21, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@vtjnash
Member

vtjnash commented Nov 21, 2025

It is a bug in BaseBenchmarks

```
      From worker 3:    ERROR: LoadError: UndefVarError: `WorldView` not defined in `BaseBenchmarks.InferenceBenchmarks`
      From worker 3:    Suggestion: this global was defined as `Compiler.WorldView` but not assigned a value.
      From worker 3:    Stacktrace:
      From worker 3:      [1] top-level scope
      From worker 3:        @ /home/nanosoldier/.julia/dev/BaseBenchmarks/src/inference/InferenceBenchmarks.jl:88
```

@BioTurboNick
Contributor Author

I couldn't reproduce these slowdowns, unfortunately. I reverted the @inline because it doesn't seem to impact anything; I assume I saw a reason to do so originally, but I don't recall what it was.

@BioTurboNick
Contributor Author

I noticed that in this PR, I have the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672

I would just need to gate the small-array scalar optimization on isa(a, Array), but I can save that for a following PR.

@KristofferC KristofferC mentioned this pull request Feb 25, 2026
37 tasks
@BioTurboNick
Contributor Author

@KristofferC can we please get this into 1.12?

@adienes
Member

adienes commented Mar 26, 2026

I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:

```julia
julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR
```

It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 27, 2026

> I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:
>
> ```julia
> julia> using BenchmarkTools
>
> julia> x = [1 2; 3 4;;; 5 6; 7 8];
>
> julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
>   321.659 ns (6 allocations: 544 bytes) # master
>   488.887 ns (10 allocations: 736 bytes) # PR
> ```
>
> It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.

So, turns out this is what the @inline was for. 🙃

```julia
julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf] (caches not reused: 1 for different Julia build configuration)
Precompiling BenchmarkTools finished.
  9 dependencies successfully precompiled in 24 seconds. 8 already precompiled.

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  626.927 ns (10 allocations: 736 bytes)

julia> function Base.setindex_shape_check(X::AbstractArray, I::Integer...)
           @inline
           li = ndims(X)
           lj = length(I)
           i = j = 1
           while true
               ii = length(axes(X,i))
               jj = I[j]
               if i == li || j == lj
                   while i < li
                       i += 1
                       ii *= length(axes(X,i))
                   end
                   while j < lj
                       j += 1
                       jj *= I[j]
                   end
                   if ii != jj
                       Base.throw_setindex_mismatch(X, I)
                   end
                   return
               end
               if ii == jj
                   i += 1
                   j += 1
               elseif ii == 1
                   i += 1
               elseif jj == 1
                   j += 1
               else
                   Base.throw_setindex_mismatch(X, I)
               end
           end
       end

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  209.506 ns (6 allocations: 544 bytes)
```

@adienes
Member

adienes commented Mar 27, 2026

and after #59025 (on top of this PR), it would be way faster still

```julia
julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  107.124 ns (6 allocations: 544 bytes)
```

as I understand it, this PR contains four mostly independent optimizations:

  • cat_similar + hvncat_fill! becomes a reshape call
  • precomputing cumprod
  • skip work for empty arrays (if !any(iszero, outdims))
  • the block-copying for length(a) > 4

The first three seem like pretty safe improvements; it's the last of these that strikes me as more fragile and harder to evaluate.
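For the first item, the shape of the win can be sketched like this (hypothetical function, not the actual Base change): concatenating pure scalars can collect them once and reshape, rather than filling the result element by element through index math.

```julia
# Hypothetical sketch: an hvncat of pure scalars collects once and
# reshapes, replacing a per-element fill loop.
function hvncat_scalars(shape::Dims, xs::Number...)
    prod(shape) == length(xs) || throw(ArgumentError("shape mismatch"))
    # promote to a common element type, collect in one pass, reshape for free
    return reshape(collect(promote(xs...)), shape)
end
```

The reshape is a view-like operation on the freshly collected vector, so the whole construction is one linear pass over the inputs.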

For example, endindex = CartesianIndex(ntuple(i -> offsets[i] + cat_size(a, i), Val(N))) may start allocating where previously the loop was non-allocating. Also, the threshold of length(a) > 4 seems unlikely to be uniformly appropriate across different dimensionalities and array types, given that this is the method for all AbstractArrays. And what about AbstractArray types without block-copying fast paths? Since those will fall back to iterating over the range, will there be more overhead?

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 30, 2026

How will #61426 interact with this?

What would you recommend I look at to better test the block-copying? Also, as I mentioned earlier, I intend to make the for-loop optimization strictly applicable only to Array, to ensure this code works for GPUArrays. If different array types need their own logic here, would it make sense to factor it out and make it part of the array interface?

@adienes
Member

adienes commented Apr 7, 2026

#61426 is separate and shouldn't interact with this one way or the other

I think you've convinced me that I'm being too picky and this is good to go

@KristofferC let me know if the backport labels are ok here. I know it's not customary to backport performance improvements but it seems ok given the context of this issue?

@adienes adienes added merge me PR is reviewed. Merge when all tests are passing backport 1.12 Change should be backported to release-1.12 backport 1.13 Change should be backported to release-1.13 and removed backport 1.12 Change should be backported to release-1.12 labels Apr 7, 2026
@IanButterworth IanButterworth merged commit 211af65 into JuliaLang:master Apr 8, 2026
10 of 12 checks passed
@IanButterworth IanButterworth removed the merge me PR is reviewed. Merge when all tests are passing label Apr 8, 2026