Skip to content

Enable allocation hoisting out of loops#43057

Merged
vchuravy merged 14 commits intoJuliaLang:masterfrom
pchintalapudi:pc/allocation-hoisting
Jan 6, 2022
Merged

Enable allocation hoisting out of loops#43057
vchuravy merged 14 commits intoJuliaLang:masterfrom
pchintalapudi:pc/allocation-hoisting

Conversation

@pchintalapudi
Copy link
Copy Markdown
Member

Allocations in loops which escape outside of the loop can prevent further optimization of the loop. By eliminating those allocations, we can both reduce memory pressure and enable further loop optimizations.

This PR also separates the escape analysis framework from alloc-opt into its own file and refactors it, which allows it to be called from other passes and be more configurable.

@KristofferC
Copy link
Copy Markdown
Member

Just as a note, it would be nice with example codes that now optimize with this and potentially benchmarks. As well as perhaps some checks on potential regressions in compile time.

@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code performance Must go faster labels Nov 12, 2021
@pchintalapudi
Copy link
Copy Markdown
Member Author

Example - loop deletion

Function:

function f(N)
                N <= 0 && return nothing
                x = Ref(-1)
                for i in 1:N
                  x = Ref(i)
                end
                return x
              end

julia-master: https://godbolt.org/z/Po1cb8bYP -- the IR retains the loop through 1:N and keeps the allocation throughout (O(N))
post-PR: https://godbolt.org/z/ffnh5f9nc -- the loop is deleted and a Ref(N) is returned immediately (O(1))

@pchintalapudi
Copy link
Copy Markdown
Member Author

Compile time regressions:

julia-master

Base  ───────────── 24.748508 seconds
ArgTools  ─────────  4.247066 seconds
Artifacts  ────────  0.115716 seconds
Base64  ───────────  0.112100 seconds
CRC32c  ───────────  0.008074 seconds
FileWatching  ─────  0.113021 seconds
Libdl  ────────────  0.001553 seconds
Logging  ──────────  0.037643 seconds
Mmap  ─────────────  0.089043 seconds
NetworkOptions  ───  0.100414 seconds
SHA  ──────────────  0.194215 seconds
Serialization  ────  0.294139 seconds
Sockets  ──────────  0.332315 seconds
Unicode  ──────────  0.081751 seconds
DelimitedFiles  ───  0.106797 seconds
LinearAlgebra  ────  7.887713 seconds
Markdown  ─────────  0.824033 seconds
Printf  ───────────  0.221028 seconds
Random  ───────────  1.206307 seconds
Tar  ──────────────  0.291240 seconds
Dates  ────────────  1.638434 seconds
Distributed  ──────  0.793682 seconds
Future  ───────────  0.005262 seconds
InteractiveUtils  ─  0.507379 seconds
LibGit2  ──────────  1.439142 seconds
Profile  ──────────  0.285505 seconds
SparseArrays  ─────  2.729028 seconds
UUIDs  ────────────  0.016787 seconds
REPL  ─────────────  3.597898 seconds
SharedArrays  ─────  0.535205 seconds
Statistics  ───────  0.192941 seconds
SuiteSparse  ──────  1.658076 seconds
TOML  ─────────────  0.072125 seconds
Test  ─────────────  0.319665 seconds
LibCURL  ──────────  0.430354 seconds
Downloads  ────────  0.347174 seconds
Pkg  ──────────────  4.731190 seconds
LazyArtifacts  ────  0.001611 seconds
Stdlibs total  ──── 35.577262 seconds
Sysimage built. Summary:
Total ───────  60.327274 seconds
Base: ───────  24.748508 seconds 41.0237%
Stdlibs: ────  35.577262 seconds 58.9738%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1458/1493
Precompilation complete. Summary:
Total ─────── 116.517783 seconds
Generation ──  89.111257 seconds 76.4787%
Execution ───  27.406527 seconds 23.5213%

post-PR

Base  ───────────── 24.558715 seconds
ArgTools  ─────────  4.217561 seconds
Artifacts  ────────  0.105677 seconds
Base64  ───────────  0.112183 seconds
CRC32c  ───────────  0.008003 seconds
FileWatching  ─────  0.112550 seconds
Libdl  ────────────  0.001550 seconds
Logging  ──────────  0.037203 seconds
Mmap  ─────────────  0.088297 seconds
NetworkOptions  ───  0.101084 seconds
SHA  ──────────────  0.192668 seconds
Serialization  ────  0.293077 seconds
Sockets  ──────────  0.318113 seconds
Unicode  ──────────  0.083754 seconds
DelimitedFiles  ───  0.107151 seconds
LinearAlgebra  ────  7.830820 seconds
Markdown  ─────────  0.824277 seconds
Printf  ───────────  0.219940 seconds
Random  ───────────  1.198442 seconds
Tar  ──────────────  0.288437 seconds
Dates  ────────────  1.610241 seconds
Distributed  ──────  0.784773 seconds
Future  ───────────  0.006982 seconds
InteractiveUtils  ─  0.519238 seconds
LibGit2  ──────────  1.403073 seconds
Profile  ──────────  0.288860 seconds
SparseArrays  ─────  2.728747 seconds
UUIDs  ────────────  0.015668 seconds
REPL  ─────────────  3.600705 seconds
SharedArrays  ─────  0.522527 seconds
Statistics  ───────  0.190193 seconds
SuiteSparse  ──────  1.628297 seconds
TOML  ─────────────  0.073384 seconds
Test  ─────────────  0.316663 seconds
LibCURL  ──────────  0.444204 seconds
Downloads  ────────  0.345108 seconds
Pkg  ──────────────  4.701191 seconds
LazyArtifacts  ────  0.001628 seconds
Stdlibs total  ──── 35.344183 seconds
Sysimage built. Summary:
Total ───────  59.904386 seconds
Base: ───────  24.558715 seconds 40.9965%
Stdlibs: ────  35.344183 seconds 59.001%
    JULIA usr/lib/julia/sys-o.a
Generating REPL precompile statements... 36/36
Executing precompile statements... 1457/1492
Precompilation complete. Summary:
Total ─────── 112.617546 seconds
Generation ──  86.288129 seconds 76.6205%
Execution ───  26.329417 seconds 23.3795%

Seems like it's approximately equivalent in terms of compile time.

@pchintalapudi
Copy link
Copy Markdown
Member Author

Any idea why this isn't working here?

julia> function f(n)
           ans = 0.0
           for _ in 1:n
               tmp = rand(100)
               ans += sum(tmp)
           end
           ans
       end
f (generic function with 1 method)

julia> @time f(1000)
  0.000331 seconds (1000 allocations: 875.000 KiB)
50090.867115483634

So array allocations and object allocations are treated differently. Object allocations are converted to a julia.gc_alloc_obj function which eventually gets lowered to another allocation function as necessary. Array allocations, on the other hand, delegate to one of 4 functions (definitions in Base/boot.jl and src/array.c) for 1D, 2D, 3D, and ND array allocations respectively. Those array allocations are implemented as direct ccalls, which means they aren't currently handled by any escape analysis, SROA, or allocation optimization in general. I'm separately looking into possible ways to also optimize array allocations in similar ways, but that's a much larger change.

@oscardssmith
Copy link
Copy Markdown
Member

oscardssmith commented Nov 12, 2021

julia> function f(n)
           ans = 0.0
           for _ in 1:n
               tmp = rand(100)
               ans += sum(tmp)
           end
           ans
       end
f (generic function with 1 method)

julia> @time f(1000)
  0.000302 seconds (1000 allocations: 875.000 KiB)
50002.48175259297

Copy link
Copy Markdown
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests that exercise the change in llvm-julia-licm? Preferable in llvmpasses?

@vchuravy
Copy link
Copy Markdown
Member

@nanosoldier runtests(ALL, vs = ":master")

@nanosoldier
Copy link
Copy Markdown
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@KristofferC
Copy link
Copy Markdown
Member

@nanosoldier runtests(["ApproxFunFourier", "BigWig", "ChaosTools", "CombinatorialSpaces", "ConcurrentCollections", "CrypticCrosswords", "DataFrames", "FASTX", "FastAI", "FastFloat16s", "GraphNeuralNetworks", "GraphSignals", "Hashpipe", "IntervalTrees", "JUDI4Cloud", "JWAS", "JetPackWaveFD", "LegolasFlux", "MaxwellBase", "NeuralOperators", "NonconvexCore", "NonconvexSearch", "OILMMs", "OteraEngine", "Pitaya", "PowerGraphics", "Probably", "QuadEig", "QuantumTomography", "SDDP", "SDFResults", "SimSpin", "SpatialJackknife", "StochasticRounding", "TransmuteDims", "YAAD", "Zygote"], vs = ":master")

@nanosoldier
Copy link
Copy Markdown
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run tests: NanosoldierError: failed to upload test log: AWSException: InvalidAccessKeyId -- The AWS Access Key Id you provided does not exist in our records.
HTTP.ExceptionRequest.StatusError(403, "PUT", "/julialang-reports/nanosoldier/pkgeval/by_hash/9805d57_vs_b3e4341/OteraEngine.1.8.0-DEV-fe0fd37020c.log?x-amz-acl=public-read", HTTP.Messages.Response:
"""
HTTP/1.1 403 Forbidden
x-amz-request-id: NJ464G23P4KZM68A
x-amz-id-2: fxS/CjQ50GVLeRo+Cx/UH4VgnGiZgDft0o9xdulp+E3UaTrmBM61Zjb0DH5dIcpnaPaqE8hG754=
Content-Type: application/xml
Transfer-Encoding: chunked
Date: Wed, 15 Dec 2021 20:32:04 GMT
Server: AmazonS3
Connection: close

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InvalidAccessKeyId</Code><Message>The AWS Access Key Id you provided does not exist in our records.</Message><AWSAccessKeyId>AKIA4WZGSTHCMSSHRWL6</AWSAccessKeyId><RequestId>NJ464G23P4KZM68A</RequestId><HostId>fxS/CjQ50GVLeRo+Cx/UH4VgnGiZgDft0o9xdulp+E3UaTrmBM61Zjb0DH5dIcpnaPaqE8hG754=</HostId></Error>""")

Logs and partial data can be found here
cc @maleadt

@maleadt
Copy link
Copy Markdown
Member

maleadt commented Dec 16, 2021

AWS failure.

@nanosoldier runtests(["ApproxFunFourier", "BigWig", "ChaosTools", "CombinatorialSpaces", "ConcurrentCollections", "CrypticCrosswords", "DataFrames", "FASTX", "FastAI", "FastFloat16s", "GraphNeuralNetworks", "GraphSignals", "Hashpipe", "IntervalTrees", "JUDI4Cloud", "JWAS", "JetPackWaveFD", "LegolasFlux", "MaxwellBase", "NeuralOperators", "NonconvexCore", "NonconvexSearch", "OILMMs", "OteraEngine", "Pitaya", "PowerGraphics", "Probably", "QuadEig", "QuantumTomography", "SDDP", "SDFResults", "SimSpin", "SpatialJackknife", "StochasticRounding", "TransmuteDims", "YAAD", "Zygote"], vs = ":master")

@nanosoldier
Copy link
Copy Markdown
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@vchuravy
Copy link
Copy Markdown
Member

@nanosoldier runtests(["ApproxFunFourier", "BigWig", "ChaosTools", "CombinatorialSpaces", "ConcurrentCollections", "CrypticCrosswords", "DataFrames", "FASTX", "FastAI", "FastFloat16s", "GraphNeuralNetworks", "GraphSignals", "Hashpipe", "IntervalTrees", "JUDI4Cloud", "JWAS", "JetPackWaveFD", "LegolasFlux", "MaxwellBase", "NeuralOperators", "NonconvexCore", "NonconvexSearch", "OILMMs", "OteraEngine", "Pitaya", "PowerGraphics", "Probably", "QuadEig", "QuantumTomography", "SDDP", "SDFResults", "SimSpin", "SpatialJackknife", "StochasticRounding", "TransmuteDims", "YAAD", "Zygote"], vs = ":master")

@vchuravy
Copy link
Copy Markdown
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@vchuravy
Copy link
Copy Markdown
Member

If nanosoldier comes back happy I am going to merge this this week.

@pchintalapudi pchintalapudi force-pushed the pc/allocation-hoisting branch from 36abcf3 to 29ae1a1 Compare January 5, 2022 06:31
Co-authored-by: Shuhei Kadowaki <40514306+aviatesk@users.noreply.github.com>
@vchuravy
Copy link
Copy Markdown
Member

vchuravy commented Jan 5, 2022

@nanosoldier runtests(["ApproxFunFourier", "BigWig", "ChaosTools", "CombinatorialSpaces", "ConcurrentCollections", "CrypticCrosswords", "DataFrames", "FASTX", "FastAI", "FastFloat16s", "GraphNeuralNetworks", "GraphSignals", "Hashpipe", "IntervalTrees", "JUDI4Cloud", "JWAS", "JetPackWaveFD", "LegolasFlux", "MaxwellBase", "NeuralOperators", "NonconvexCore", "NonconvexSearch", "OILMMs", "OteraEngine", "Pitaya", "PowerGraphics", "Probably", "QuadEig", "QuantumTomography", "SDDP", "SDFResults", "SimSpin", "SpatialJackknife", "StochasticRounding", "TransmuteDims", "YAAD", "Zygote"], vs = ":master")

@vchuravy
Copy link
Copy Markdown
Member

vchuravy commented Jan 5, 2022

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Copy Markdown
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@nanosoldier
Copy link
Copy Markdown
Collaborator

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/data/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@vchuravy vchuravy merged commit 4c45f29 into JuliaLang:master Jan 6, 2022
@Sacha0
Copy link
Copy Markdown
Member

Sacha0 commented Jan 6, 2022

🎉

@vtjnash
Copy link
Copy Markdown
Member

vtjnash commented Jan 11, 2022

Comment on lines +137 to +138
jl_alloc::runEscapeAnalysis(call, required, jl_alloc::EscapeAnalysisOptionalArgs().with_valid_set(&L->getBlocksSet()));
if (use_info.escaped || use_info.addrescaped) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation: this is depending on more than just escape analysis results, but also is depending on the absence of loop-carried dependencies. Currently, runEscapeAnalysis has no handling implemented for PhiNode, so it looks like it is correct, but if the escape-analysis ever gains the ability to look at the function CFG, we need to fix this usage.

LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
@pchintalapudi pchintalapudi deleted the pc/allocation-hoisting branch March 6, 2022 21:58
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
maleadt added a commit that referenced this pull request Apr 28, 2023
maleadt added a commit that referenced this pull request Apr 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

compiler:codegen Generation of LLVM IR and native code performance Must go faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.