
Fix performance regression in hvcat of simple matrices#57422

Merged
IanButterworth merged 13 commits into JuliaLang:master from BioTurboNick:fix-perf-hvcat-mats
Apr 8, 2026

Conversation

@BioTurboNick
Contributor

@BioTurboNick BioTurboNick commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729 .

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
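To illustrate the difference, here is a minimal sketch with hypothetical helper names (not the actual Base implementation): the per-element version recomputes a linear destination index with multiplies and adds for every element, while the CartesianIndices version describes the destination block as a range and assigns it in one shot.

```julia
# Hypothetical sketch, not the Base implementation.

# Per-element: compute a linear destination index for every element.
function copy_block_linear!(dest::Array{T,N}, src::Array{T,N},
                            offsets::NTuple{N,Int}) where {T,N}
    st = strides(dest)  # element-unit strides of the destination
    for (k, I) in enumerate(CartesianIndices(src))
        # N multiplies and adds per element, just to find the slot
        li = 1 + sum(ntuple(d -> (offsets[d] + I[d] - 1) * st[d], N))
        dest[li] = src[k]
    end
    return dest
end

# Block copy: describe the destination region once, assign it as a whole.
function copy_block_cartesian!(dest::Array{T,N}, src::Array{T,N},
                               offsets::NTuple{N,Int}) where {T,N}
    dest[CartesianIndex(offsets .+ 1):CartesianIndex(offsets .+ size(src))] = src
    return dest
end
```

Both place src into dest starting just past the given offsets; the second form lets the ranged assignment copy in bulk rather than paying the index arithmetic per element.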

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe that's what made this PR possible.

Before merging I would also want to check that I didn't hurt the hvncat performance at all. Done

This should ideally be marked for 1.12 backport.

@KristofferC KristofferC added performance Must go faster backport 1.12 Change should be backported to release-1.12 labels Feb 15, 2025
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

I don't think the test failure is related? It occurred while testing the Profile module... EDIT: Yep, not related.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

Unfortunately there's extra overhead for everything else this PR wasn't intended to address, apparently mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.
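For context, a hypothetical sketch of why that conversion is costly: recovering a Cartesian index from a linear one requires one integer division per dimension, in the spirit of Base's internal _ind2sub (the function name below is illustrative, not Base's).

```julia
# Hypothetical sketch of linear-to-Cartesian conversion; the real logic
# lives in Base's internal _ind2sub machinery.
function lin2cart(dims::NTuple{N,Int}, li::Int) where N
    sub = Vector{Int}(undef, N)
    r = li - 1
    for d in 1:N
        r, i = divrem(r, dims[d])  # one integer division per dimension
        sub[d] = i + 1
    end
    return (sub...,)
end
```

Integer division is far slower than the multiply-add that goes the other direction, which is why paying it per element adds up.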

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.

@KristofferC KristofferC mentioned this pull request Feb 15, 2025
31 tasks
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 16, 2025

I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
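The shape of that branch can be sketched like this (hypothetical names and simplified signature, not the actual PR code):

```julia
# Hypothetical sketch of the size-based branch; names are illustrative.
function place_block!(dest::Array{T,N}, src::AbstractArray{T,N},
                      offsets::NTuple{N,Int}) where {T,N}
    if length(src) > 4
        # larger blocks: a single ranged assignment can copy in bulk
        dest[CartesianIndex(offsets .+ 1):CartesianIndex(offsets .+ size(src))] = src
    else
        # tiny blocks: a plain loop avoids the block-copy setup overhead
        for I in CartesianIndices(src)
            dest[CartesianIndex(offsets .+ Tuple(I))] = src[I]
        end
    end
    return dest
end
```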

Two other aspects addressed:

  • 1d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified significant performance reduction in an important case (see below), and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though could that be a symptom of a broader regression?
```julia
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower and many more allocations
```

EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common case (no change from master). If the elements to cat are all arrays, the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, iterating over the elements with enumerate is more efficient. Swap the two strategies and there's substantial overhead: indexing into the tuple (mixed arrays and scalars), or performing the iteration itself (just arrays). Ultimately not a big impact, but a bit of a gripe that the compiler can be fickle like that.
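The trade-off can be illustrated with hypothetical stand-ins for the dimension calculation (not the actual _typed_hvncat_dims code): both reductions compute the same total, but which form the compiler handles well depends on whether the tuple is homogeneous.

```julia
# Hypothetical stand-ins: both return the total extent along dimension 1,
# via the two iteration patterns discussed above.

# Indexing the tuple by position: fast when every element is an array.
dims_indexed(as::Tuple) = sum(i -> size(as[i], 1), eachindex(as))

# Iterating the elements: fast when arrays and scalars are mixed, since
# the compiler can union-split on each element's type.
function dims_iterated(as::Tuple)
    total = 0
    for (i, a) in enumerate(as)  # index available, mirroring the real code
        total += a isa AbstractArray ? size(a, 1) : 1
    end
    return total
end
```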

@KristofferC KristofferC mentioned this pull request Feb 17, 2025
24 tasks
This was referenced Mar 24, 2025
@KristofferC KristofferC mentioned this pull request Apr 4, 2025
51 tasks
@KristofferC KristofferC mentioned this pull request Apr 29, 2025
53 tasks
@KristofferC KristofferC mentioned this pull request May 9, 2025
58 tasks
@KristofferC KristofferC mentioned this pull request Jun 6, 2025
60 tasks
@KristofferC KristofferC mentioned this pull request Jul 22, 2025
20 tasks
@KristofferC KristofferC mentioned this pull request Aug 6, 2025
38 tasks
@KristofferC KristofferC mentioned this pull request Aug 19, 2025
27 tasks
This was referenced Sep 24, 2025
Member

@vtjnash vtjnash left a comment


SGTM

@vtjnash
Member

vtjnash commented Oct 16, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@KristofferC KristofferC mentioned this pull request Oct 21, 2025
35 tasks
@BioTurboNick
Contributor Author

Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?

@BioTurboNick
Contributor Author

@vtjnash - should nanosoldier be rerun, or is something wrong that I need to fix?

@vtjnash
Member

vtjnash commented Nov 21, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@vtjnash
Member

vtjnash commented Nov 21, 2025

It is a bug in BaseBenchmarks

```
      From worker 3:    ERROR: LoadError: UndefVarError: `WorldView` not defined in `BaseBenchmarks.InferenceBenchmarks`
      From worker 3:    Suggestion: this global was defined as `Compiler.WorldView` but not assigned a value.
      From worker 3:    Stacktrace:
      From worker 3:      [1] top-level scope
      From worker 3:        @ /home/nanosoldier/.julia/dev/BaseBenchmarks/src/inference/InferenceBenchmarks.jl:88
```

@BioTurboNick
Contributor Author

I couldn't reproduce these slowdowns, unfortunately. I reverted the @inline because it doesn't seem to impact anything; I assume I saw a reason to do so originally, but I don't recall what it was.

@BioTurboNick
Contributor Author

I noticed that in this PR, I have the start of a fix for this regression: JuliaGPU/GPUArrays.jl#672

I would just need to gate the small-array scalar optimization on isa(a, Array), but I can save that for a following PR.

@KristofferC KristofferC mentioned this pull request Feb 25, 2026
37 tasks
@BioTurboNick
Contributor Author

@KristofferC can we please get this into 1.12?

@adienes
Member

adienes commented Mar 26, 2026

I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:

```julia
julia> using BenchmarkTools

julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  321.659 ns (6 allocations: 544 bytes) # master
  488.887 ns (10 allocations: 736 bytes) # PR
```

It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 27, 2026

> I think the consequences are still not entirely evaluated. e.g. I'm seeing this regression:
>
> ```julia
> julia> using BenchmarkTools
>
> julia> x = [1 2; 3 4;;; 5 6; 7 8];
>
> julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
>   321.659 ns (6 allocations: 544 bytes) # master
>   488.887 ns (10 allocations: 736 bytes) # PR
> ```
>
> It might be easier to merge individual independent pieces of this PR; for example, the precomputed cumprod seems like an obvious strict win.

So, turns out this is what the @inline was for. 🙃

```julia
julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> using BenchmarkTools
[ Info: Precompiling BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf] (caches not reused: 1 for different Julia build configuration)
Precompiling BenchmarkTools finished.
  9 dependencies successfully precompiled in 24 seconds. 8 already precompiled.

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  626.927 ns (10 allocations: 736 bytes)

julia> function Base.setindex_shape_check(X::AbstractArray, I::Integer...)
           @inline
           li = ndims(X)
           lj = length(I)
           i = j = 1
           while true
               ii = length(axes(X,i))
               jj = I[j]
               if i == li || j == lj
                   while i < li
                       i += 1
                       ii *= length(axes(X,i))
                   end
                   while j < lj
                       j += 1
                       jj *= I[j]
                   end
                   if ii != jj
                       Base.throw_setindex_mismatch(X, I)
                   end
                   return
               end
               if ii == jj
                   i += 1
                   j += 1
               elseif ii == 1
                   i += 1
               elseif jj == 1
                   j += 1
               else
                   Base.throw_setindex_mismatch(X, I)
               end
           end
       end

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  209.506 ns (6 allocations: 544 bytes)
```

@adienes
Member

adienes commented Mar 27, 2026

and after #59025 (on top of this PR), it would be way faster still

```julia
julia> x = [1 2; 3 4;;; 5 6; 7 8];

julia> @btime [$x ;;; $x ;;;; $x ;;; $x];
  107.124 ns (6 allocations: 544 bytes)
```

as I understand it, this PR contains four mostly independent optimizations:

  • cat_similar + hvncat_fill! becomes a reshape call
  • precomputing cumprod
  • skip work for empty arrays (if !any(iszero, outdims))
  • the block-copying for length(a) > 4

The first three seem like pretty safe improvements; it's the last of these that strikes me as more fragile and harder to evaluate.
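For the first item, the shape of the win can be sketched like this (hypothetical function, not the actual Base change): concatenating pure scalars can collect them once and reshape, rather than filling the result element by element through index math.

```julia
# Hypothetical sketch: an hvncat of pure scalars collects once and
# reshapes, replacing a per-element fill loop.
function hvncat_scalars(shape::Dims, xs::Number...)
    prod(shape) == length(xs) || throw(ArgumentError("shape mismatch"))
    # promote to a common element type, collect in one pass, reshape for free
    return reshape(collect(promote(xs...)), shape)
end
```

The reshape is a view-like operation on the freshly collected vector, so the whole construction is one linear pass over the inputs.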

For example, endindex = CartesianIndex(ntuple(i -> offsets[i] + cat_size(a, i), Val(N))) may start allocating where previously the loop was non-allocating. Also, the threshold of length(a) > 4 seems unlikely to be uniformly appropriate across different dimensionalities and array types, given that this is the method for all AbstractArrays. And what about AbstractArray types without block-copying fast paths? Since those will fall back to iterating over the range, will there be more overhead?

@BioTurboNick
Contributor Author

BioTurboNick commented Mar 30, 2026

How will #61426 interact with this?

What would you recommend I look at to better test the block-copying? Also, as I mentioned earlier, I intend to make the for-loop optimization strictly applicable only to Array, to ensure this code works for GPUArrays. If different array types need their own logic here, would it make sense to factor it out and make it part of the array interface?

@adienes
Member

adienes commented Apr 7, 2026

#61426 is separate and shouldn't interact with this one way or the other

I think you've convinced me that I'm being too picky and this is good to go

@KristofferC let me know if the backport labels are ok here. I know it's not customary to backport performance improvements but it seems ok given the context of this issue?

@adienes adienes added merge me PR is reviewed. Merge when all tests are passing backport 1.12 Change should be backported to release-1.12 backport 1.13 Change should be backported to release-1.13 and removed backport 1.12 Change should be backported to release-1.12 labels Apr 7, 2026
@IanButterworth IanButterworth merged commit 211af65 into JuliaLang:master Apr 8, 2026
10 of 12 checks passed
@IanButterworth IanButterworth removed the merge me PR is reviewed. Merge when all tests are passing label Apr 8, 2026