Move all platforms to use llvm.minimum/llvm.maximum for fmin/fmax #56371

oscardssmith merged 5 commits into master from
Conversation
---
oh wow

```julia
julia> versioninfo()
Julia Version 1.12.0-DEV.1506
Commit 2cdfe062952 (2024-10-28 11:32 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 16 default, 0 interactive, 16 GC (on 16 virtual cores)

julia> function f(dij)
           vmin = Inf
           for x in dij
               vmin = min(x, vmin)
           end
           return vmin
       end
f (generic function with 2 methods)

julia> @be rand(Float64, 512000) f samples=500 evals=100
Benchmark: 6 samples with 100 evaluations
 min    1.764 ms
 median 1.768 ms
 mean   1.775 ms
 max    1.805 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x, vmin)
           end
           return vmin
       end
f_llvm (generic function with 1 method)

julia> Base.@assume_effects :total @inline llvm_min(x::Float64, y::Float64) = ccall("llvm.minimum.f64", llvmcall, Float64, (Float64, Float64), x, y)
llvm_min (generic function with 1 method)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 163 samples with 100 evaluations
 min    49.408 μs
 median 51.382 μs
 mean   51.667 μs
 max    55.443 μs
```
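Part of why the `llvm_min` loop vectorizes while the `min` loop historically did not is that `min` must handle NaN and signed zeros, which a bare floating-point comparison gets wrong. A minimal sketch (the `naive_min` helper below is hypothetical, for illustration only, not Base's definition):

```julia
# Hypothetical min built from a single plain IEEE `<` comparison.
naive_min(x, y) = y < x ? y : x

# Base.min orders signed zeros and propagates NaN; a bare `<` does neither:
min(0.0, -0.0)        # -0.0
naive_min(0.0, -0.0)  # 0.0, because -0.0 < 0.0 is false under IEEE `<`
min(1.0, NaN)         # NaN
naive_min(1.0, NaN)   # 1.0, because NaN < 1.0 is false
```

`llvm.minimum.f64` is documented to propagate NaN and treat -0.0 as less than +0.0, which is what makes it a semantics-preserving replacement that LLVM can still vectorize.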
---
related question, what would it take to make

```julia
julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = pred ? x : vmin
           end
           return vmin
       end
f_isless (generic function with 2 methods)

julia> @be rand(Float64, 512000) f_isless samples=500 evals=100
Benchmark: 5 samples with 100 evaluations
 min    2.108 ms
 median 2.114 ms
 mean   2.118 ms
 max    2.140 ms

julia> function f_llvm(dij)
           vmin = Inf
           for x in dij
               vmin = llvm_min(x, vmin)
           end
           return vmin
       end
f_llvm (generic function with 5 methods)

julia> @be rand(Float64, 512000) f_llvm samples=500 evals=100
Benchmark: 162 samples with 100 evaluations
 min    49.071 μs
 median 51.354 μs
 mean   51.752 μs
 max    71.813 μs
```
---
Are you sure the
---
it's not because of branching:

```julia
julia> function f_isless(dij)
           vmin = Inf
           for x in dij
               pred = isless(x, vmin)
               vmin = ifelse(pred, x, vmin)
           end
           return vmin
       end
f_isless (generic function with 1 method)

julia> @be rand(512000) f_isless samples=100 evals=50
Benchmark: 10 samples with 50 evaluations
 min    2.090 ms
 median 2.101 ms
 mean   2.116 ms
 max    2.211 ms
```
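Even branch-free, `isless` costs more than a plain `<` because it implements a total order. A simplified sketch of such a total-order comparison (hypothetical; Base's actual definition is different and faster):

```julia
# Simplified total-order comparison in the spirit of isless(::Float64, ::Float64):
# NaNs sort after everything, -0.0 sorts before 0.0, otherwise IEEE `<`.
# (Hypothetical sketch; Base implements this differently.)
function isless_sketch(x::Float64, y::Float64)
    isnan(x) && return false          # NaN is never less than anything
    isnan(y) && return true           # everything non-NaN is less than NaN
    if x == y == 0.0                  # distinguish the signed zeros
        return signbit(x) & !signbit(y)
    end
    return x < y
end
```

It is these extra NaN and signed-zero checks that the optimizer fails to fold into a vectorized min reduction, whereas `llvm.minimum` carries the same semantics in a single intrinsic the backend understands.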
---
PowerPC failures are fixed in LLVM 19, so we may want to wait for that to land before merging this.
---
Should these get tfuncs? IIUC, making these be
---
Maybe. I'm not sure we gain much by making them tfuncs. They get annotated
---
This fixes #48487 (CC: @mikmoore) on x86_64 (I think it was already fixed on aarch64):

```asm
# [...]
# %bb.0:                                # %top
        #DEBUG_VALUE: #2:x <- $xmm0
        push    rbp
        mov     rbp, rsp
        movabs  rax, offset .LCPI0_0
        vmovsd  xmm1, qword ptr [rax]   # xmm1 = mem[0],zero
        vmaxsd  xmm0, xmm1, xmm0
        pop     rbp
        ret
```

```asm
# [...]
# %bb.0:                                # %top
        #DEBUG_VALUE: #5:x <- $xmm0
        push    rbp
        mov     rbp, rsp
        movabs  rax, offset .LCPI0_0
        vmovsd  xmm1, qword ptr [rax]   # xmm1 = mem[0],zero
        vmaxsd  xmm0, xmm1, xmm0
        pop     rbp
        ret
```
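The snippets above lower `max` to a single `vmaxsd` rather than a branchy compare/select sequence. Semantically this hinges on `max(-Inf, x)` returning `x` for every Float64, including NaN and -0.0, which `llvm.maximum`'s NaN-propagating semantics preserve. A quick sanity check of that identity (plain Julia, nothing PR-specific):

```julia
# max(-Inf, x) must return x bit-for-bit for every Float64, including
# NaN (which max propagates) and -0.0 (which sorts above -Inf).
for x in (1.0, -1.0, 0.0, -0.0, Inf, -Inf, NaN)
    @assert max(-Inf, x) === x
end
```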
---
Hopefully this also gives us a path to better performance on
---
as far as I can tell. However, I'm continuously interested in what we can do for
---
@Moelf part of the issue is that For example, if one eliminates the branches here, then a factor of 2 speedup is possible. Alas, the ordering of floats becomes
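One known way to eliminate the branches while keeping a total order is to compare order-preserving integer keys instead of the floats themselves, as is done when sorting Float64 arrays. A hedged sketch (the `order_key` helper is hypothetical, and note the caveat that it disagrees with `isless` on negative NaNs, which it sorts first rather than last):

```julia
# Branch-free monotone map from Float64 to Int64: flipping the low 63 bits
# of negative values makes signed integer `<` agree with the float ordering,
# with -0.0 < 0.0 and NaNs at the extremes (not all-last as in isless).
order_key(x::Float64) = (b = reinterpret(Int64, x); b ⊻ ((b >> 63) & typemax(Int64)))
```

With this, e.g. `order_key(-0.0) < order_key(0.0)` and `order_key(-Inf) < order_key(1.0) < order_key(Inf) < order_key(NaN)` all hold, so a reduction can use a single integer compare per element.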
---
Right, but I don't think use
---
I think this discussion should continue in a dedicated ticket 🙂
---
Given that LLVM 19 is taking a while to merge and PowerPC is currently listed as tier 4 (used to build), can we rebase and merge this?
---
I've pushed one additional commit that removes some of the remaining complexity from when this was implemented only with ccalls.
…n but as a test. This used to not work but LLVM now has support for this on all platforms we care about.

Force-pushed from 639a157 to a0af124.
giordano left a comment
Looks good to me as far as I can tell.
I can confirm this works also on riscv64. On this PR:

```julia
julia> code_llvm(x->max(-Inf,x), Tuple{Float64}; debuginfo=:none)
; Function Signature: var"#20"(Float64)
define double @"julia_#20_2131"(double %"x::Float64") #0 {
top:
    #dbg_value(double %"x::Float64", !2, !DIExpression(), !14)
  ret double %"x::Float64"
}
```

while on current master:

```julia
julia> code_llvm(x->max(-Inf,x), Tuple{Float64}; debuginfo=:none)
; Function Signature: var"#26"(Float64)
define double @"julia_#26_4754"(double %"x::Float64") #0 {
top:
    #dbg_value(double %"x::Float64", !3, !DIExpression(), !15)
  %0 = fsub double 0xFFF0000000000000, %"x::Float64"
  %bitcast_coercion = bitcast double %0 to i64
  %1 = icmp sgt i64 %bitcast_coercion, -1
  %2 = select i1 %1, double 0xFFF0000000000000, double %"x::Float64"
  %3 = fcmp ord double %"x::Float64", 0.000000e+00
  %4 = select i1 %3, double %2, double %0
  ret double %4
}
```

which is another confirmation that #48487 is indeed fixed. I see exactly the same improvement on x86_64 (on aarch64 we were already using the LLVM functions, so there's no change there).

Note that tests are passing on all platforms: the x86_64-apple-darwin job is only erroring during a cleanup step at the end, and upload jobs are failing because of the ongoing GitHub outage.
…liaLang#56371)

This used to not work but LLVM now has support for this on all platforms we care about.

Maybe this should be a builtin.

This allows for more vectorization opportunities since LLVM understands the code better.

Fix JuliaLang#48487.

---------

Co-authored-by: Mosè Giordano <765740+giordano@users.noreply.github.com>
Co-authored-by: oscarddssmith <oscar.smith@juliacomputing.com>