Skip to content

speed up hvcat_fill! by unrolling internal iteration for type-stability#61426

Open
adienes wants to merge 7 commits intoJuliaLang:masterfrom
adienes:type_stable_hvcat_fill
Open

speed up hvcat_fill! by unrolling internal iteration for type-stability#61426
adienes wants to merge 7 commits intoJuliaLang:masterfrom
adienes:type_stable_hvcat_fill

Conversation

@adienes
Copy link
Copy Markdown
Member

@adienes adienes commented Mar 28, 2026

julia> using BenchmarkTools

julia> @btime [1.0 2; 3 4.0];
  75.248 ns (8 allocations: 336 bytes) # master
  9.593 ns (2 allocations: 112 bytes) # PR

julia> @btime [1 2 3im; 4 5 6im;;;]
  104.256 ns (10 allocations: 720 bytes) # master
  14.236 ns (2 allocations: 176 bytes) # PR

replaces #52028 with suggestion given in #52028 (comment)

@adienes adienes added performance Must go faster arrays [a, r, r, a, y, s] labels Mar 28, 2026
@adienes
Copy link
Copy Markdown
Member Author

adienes commented Mar 28, 2026

it looks like this is slightly slightly slower in heterogenous cases where the types don't promote to a concrete type, e.g.

julia> @btime ['a' 2; 'b' 3];
  282.313 ns (25 allocations: 1.16 KiB) # master
  302.454 ns (25 allocations: 1.16 KiB) # PR

this seems acceptable to me given that it's much more rare, and will be slow anyway since you end up with an abstractly-typed collection. but just noting.

adienes added a commit that referenced this pull request Mar 31, 2026
in a similar vein to #61426, we
can speed up `allequal` by unrolling the loop (up to a cap, 32 chosen by
convention)

I suppose this is not particularly a super common bottleneck but we may
as well be faster where possible.

master:
```
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 5))
BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  13.861 ns …   8.303 μs  ┊ GC (min … max): 0.00% … 99.17%
 Time  (median):     18.412 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.582 ns ± 122.345 ns  ┊ GC (mean ± σ):  6.08% ±  1.71%

  ▅▇█▇▅▂        ▁▄▄▄▃▃▁  ▁▄▅▄▃▃▂▁   ▃▄▄▃▂▁▁▂▂▁ ▁▂▂▁   ▁▁▃▂     ▂
  ██████▅▅▃▅▃▁▄▅███████▇▆████████▇▇▇███████████████▇▆▆█████▆▇▆ █
  13.9 ns       Histogram: log(frequency) by time      83.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 12))
BenchmarkTools.Trial: 624 samples with 997 evaluations per sample.
 Range (min … max):  16.090 ns … 42.490 μs  ┊ GC (min … max): 0.00% … 73.54%
 Time  (median):     10.193 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    8.034 μs ±  4.193 μs  ┊ GC (mean ± σ):  0.62% ±  2.94%

  █                                                    ▆▅▆     
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▆▅█▃▂▁▁▁▁▁▁▁▆███▆▃▃ ▃
  16.1 ns         Histogram: frequency by time        11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 56))
BenchmarkTools.Trial: 480 samples with 1 evaluation per sample.
 Range (min … max):   9.840 ms … 48.062 ms  ┊ GC (min … max): 0.00% … 76.38%
 Time  (median):     10.312 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.399 ms ±  1.744 ms  ┊ GC (mean ± σ):  0.74% ±  3.49%

       ▁▇ ▁▆▄▁▂▃▂▃▃▆█▆▃▄▂▁▁▁ ▁                                 
  ▄▄▃▅▇██▇██████████████████▇█▄▄▄▃▂▂▁▃▃▃▂▁▂▂▁▁▁▂▁▁▂▁▂▃▂▁▁▁▁▁▂ ▄
  9.84 ms         Histogram: frequency by time        11.5 ms <

 Memory estimate: 1.45 MiB, allocs estimate: 27954.
```

PR
```
julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 5))
BenchmarkTools.Trial: 10000 samples with 998 evaluations per sample.
 Range (min … max):  14.445 ns … 91.516 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.868 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.809 ns ±  1.603 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                     ▅▃▁       █▁▁▂▁                           
  ▁▂▄▅▄▃▄▄▄▃▂▂▂▁▂▄▇█████▇▅▃▃▂▃▇█████▇▄▃▂▂▃▃▃▄▄▄▅▄▄▃▂▂▁▁▁▁▁▁▁▁ ▃
  14.4 ns         Histogram: frequency by time        19.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 12))
BenchmarkTools.Trial: 952 samples with 998 evaluations per sample.
 Range (min … max):  15.697 ns … 20.862 μs  ┊ GC (min … max): 0.00% … 62.59%
 Time  (median):      6.387 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):    5.256 μs ±  3.257 μs  ┊ GC (mean ± σ):  0.48% ±  2.84%

  █                                                            
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▂▂▃▃▃▃▄▄▃▄▄▄▃▃▄▄▅▄▄▃▃▃▄▃▃▃▃▃▂ ▃
  15.7 ns         Histogram: frequency by time        9.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark allequal(t) setup=(t=ntuple(i->rand((1.0, 2)), 56))
BenchmarkTools.Trial: 645 samples with 1 evaluation per sample.
 Range (min … max):  6.847 ms …  23.438 ms  ┊ GC (min … max): 0.00% … 62.03%
 Time  (median):     7.830 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.730 ms ± 827.062 μs  ┊ GC (mean ± σ):  0.29% ±  2.44%

    ▁▂▃▁                   ▅█▄▁▁                               
  ▃▇████▆█▇▆▄▄▄▄▃▄▃▃▃▃▃▄▄▄▇█████▇▇▆▇▄▅▄▃▄▅▄▃▄▃▃▄▃▃▃▄▃▃▃▃▂▃▁▃▂ ▄
  6.85 ms         Histogram: frequency by time        9.08 ms <

 Memory estimate: 488.16 KiB, allocs estimate: 9482.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant