Experiment with a slightly adjusted pipeline #52850
Conversation
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Force-pushed ad90755 to 30ed1f0 (Compare)
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Force-pushed 6a92ba1 to e4a27bd (Compare)
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
FPM.addPass(InstCombinePass());
FPM.addPass(AggressiveInstCombinePass());
Does it make sense to do 2 instcombines right next to each other?
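For intuition on why two adjacent combine passes can both make progress: each runs a different rule set to its own fixpoint, so the second pass can fold idioms the first one's rules never match. A toy sketch of that idea (purely illustrative; this is not LLVM's actual mechanism, and the rule names are made up):

```python
# Toy fixpoint rewriter: two "combine" passes run back to back can both
# fire when they carry different rule sets, even though each pass runs
# its own rules to a fixpoint before handing off.
def run_to_fixpoint(expr, rules):
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            if pattern in expr:
                expr = expr.replace(pattern, replacement)
                changed = True
    return expr

# Plain "instcombine": cheap local identities.
basic_rules = [(" + 0", ""), (" * 1", "")]
# "Aggressive" pass: an extra, more expensive idiom fold (hypothetical).
aggressive_rules = [("bit_twiddling_loop", "ctpop")]

expr = "bit_twiddling_loop * 1 + 0"
after_basic = run_to_fixpoint(expr, basic_rules)
after_aggressive = run_to_fixpoint(after_basic, aggressive_rules)
print(after_basic)       # → bit_twiddling_loop
print(after_aggressive)  # → ctpop
```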
It might be worth customizing the AggressiveInstCombinePass slightly, since the defaults include some options that are likely not useful for us specifically (from https://llvm.org/doxygen/AggressiveInstCombine_8cpp.html):

- `foldSqrt` is probably useless because we generate LLVM `sqrt`
- `tryToRecognizePopCount` probably isn't useful since we have `count_ones`
- `foldMemChr`: I don't think we use `memchr` (but not sure)
This is unlikely to matter much, but probably could save a bit of compile time here and there.
Looks like you need to fix a couple of tests. Also rerunning nanosoldier, since a lot of changes have happened since:
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Looks overall pretty good, but there are a couple of 10x regressions (they look like vectorization failures). Is there an easy way from nanosoldier for us to test compile time, to make sure it's comparable?
Isn't that what the inference benchmarks are for? They look like no change to me.
I took a thorough look at it. There are still a couple of regressions, but it seems to be a pretty clear overall win, if anyone wants to take a further look. The 16x regression is now gone with my latest commit.
Force-pushed d6a2afa to b447d87 (Compare)
Do we want to run a PkgEval? I'm slightly worried about the fact that I had to modify passes.
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Nice to see that this improves the allocation elimination :)
Force-pushed a114171 to 99ad967 (Compare)
I can confirm this PR on top of #57380 addresses most of #56145: stores are vectorised. There's still an out-of-bounds section in the preamble; it shouldn't be a big deal performance-wise, although it probably taints effects since it can throw:

```llvm
; Function Signature: var"#59"(Memory{Float64})
; @ REPL[17]:2 within `#59`
define void @"julia_#59_3153"(ptr noundef nonnull align 8 dereferenceable(16) %"v::GenericMemory") local_unnamed_addr #0 {
top:
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #11
  %tls_ppgcstack = getelementptr inbounds i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
; ┌ @ range.jl:917 within `iterate`
; │┌ @ range.jl:688 within `isempty`
; ││┌ @ operators.jl:425 within `>`
; │││┌ @ int.jl:83 within `<`
  %.unbox = load i64, ptr %"v::GenericMemory", align 8
  %0 = icmp slt i64 %.unbox, 1
; └└└└
  br i1 %0, label %L29, label %preloop.pseudo.exit
[...]
oob:                                              ; preds = %L11.postloop, %L11
  %value_phi3.lcssa = phi i64 [ %value_phi3.postloop, %L11.postloop ], [ %6, %L11 ]
; @ REPL[17]:3 within `#59`
; ┌ @ genericmemory.jl:260 within `setindex!`
; │┌ @ genericmemory.jl:252 within `_setindex!`
  %ptls_field = getelementptr inbounds i8, ptr %tls_pgcstack, i64 16
  %ptls_load = load ptr, ptr %ptls_field, align 8
  %"box::GenericMemoryRef" = call noalias nonnull align 8 dereferenceable(32) ptr @ijl_gc_small_alloc(ptr %ptls_load, i32 408, i32 32, i64 140577962397856) #8
  %"box::GenericMemoryRef.tag_addr" = getelementptr inbounds i64, ptr %"box::GenericMemoryRef", i64 -1
  store atomic i64 140577962397856, ptr %"box::GenericMemoryRef.tag_addr" unordered, align 8
  store ptr %memoryref_data, ptr %"box::GenericMemoryRef", align 8
  %.repack16 = getelementptr inbounds { ptr, ptr }, ptr %"box::GenericMemoryRef", i64 0, i32 1
  store ptr %"v::GenericMemory", ptr %.repack16, align 8
  call void @ijl_bounds_error_int(ptr nonnull %"box::GenericMemoryRef", i64 %value_phi3.lcssa)
  unreachable
load:                                             ; preds = %L11
  %memoryref_offset = shl i64 %value_phi3, 3
; ││ @ genericmemory.jl:253 within `_setindex!`
  %gep = getelementptr i8, ptr %invariant.gep, i64 %memoryref_offset
  store i64 4607182418800017408, ptr %gep, align 8
; └└
; @ REPL[17]:4 within `#59`
; ┌ @ range.jl:921 within `iterate`
  %1 = add nuw nsw i64 %value_phi3, 1
; └
  %exitcond39.not = icmp eq i64 %value_phi3, %umax
  br i1 %exitcond39.not, label %main.exit.selector, label %L11
[...]
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = shl i64 %index, 3
  %12 = or disjoint i64 %offset.idx, 8
; ││ @ genericmemory.jl:253 within `_setindex!`
  %13 = getelementptr i8, ptr %invariant.gep, i64 %12
  %14 = getelementptr i64, ptr %13, i64 4
  %15 = getelementptr i64, ptr %13, i64 8
  %16 = getelementptr i64, ptr %13, i64 12
  store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %13, align 8
  store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %14, align 8
  store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %15, align 8
  store <4 x i64> <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>, ptr %16, align 8
  %index.next = add nuw i64 %index, 16
  %17 = icmp eq i64 %index.next, %n.vec
  br i1 %17, label %L11, label %vector.body
[...]
```

It'd be nice to have a test to make sure this doesn't regress, once #57380 is merged.
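As a side note on reading the IR above: the vector stores splat the integer constant 4607182418800017408, which is just the IEEE-754 bit pattern of the `Float64` value 1.0. A quick check (Python used here purely for illustration):

```python
import struct

# Reinterpret the 64-bit integer constant from the IR as an IEEE-754 double.
bits = 4607182418800017408  # 0x3FF0000000000000
value = struct.unpack("<d", struct.pack("<Q", bits))[0]
print(value)  # → 1.0
```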
There are cases where we optimize the SRet more than the pass expected, so try and handle those. I'm trying to get a test for this; this is separated from #52850 to make merging both easier.

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Needs #57380 to merge first