Add JLJITLinkMemoryManager (ports memory manager to JITLink)#60105
Add JLJITLinkMemoryManager (ports memory manager to JITLink)#60105xal-0 merged 1 commit intoJuliaLang:masterfrom
Conversation
5d5362e to
0b04319
Compare
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager. Release JLJITLinkMemoryManager lock when calling FinalizedCallbacks
0b04319 to
c5c9964
Compare
|
About jitlink everywhere, are you planning to address llvm/llvm-project#63236? That caused us some pain when we switched aarch64-linux. |
|
That's what this change addresses, though the fallback to I thought I'd defer that work both because it's separable and because there is a better option on macOS that I am working on now: we can use the new APRR (aka fast permission restrictions). That lets us create special RWX mappings and toggle whether each thread sees it as R-X or RW- independently and without any system calls like mprotect. |
|
Cool, it wasn't to clear to me what were the memory use regressions mentioned in the first post, thanks for explaining it! |
|
cc: @pchintalapudi |
|
Thanks! |
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen. (cherry picked from commit 6fa0e75)
…#60196) Reverts #60105. Nightly builds of aarch64-darwin Julia hang at startup on some systems, notably on GitHub Actions, making all nightly jobs timeout (after 6 hours...), see https://discourse.julialang.org/t/ci-testing-hangs-on-macos-nightly/133909 (I had previously reported this issue on Slack at https://julialang.slack.com/archives/CPWJ5DGG1/p1763055610279379). See also julia-actions/julia-runtest#155. The error message when the process receives a SIGTERM signal is ``` in expression starting at none:0 __psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) unknown function (ip: 0x0) at (unknown file) __psynch_mutexwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) unknown function (ip: 0x0) at (unknown file) Allocations: 1 (Pool: 1; Big: 0); GC: 0 ``` I can reliably reproduce it locally on a M1 box with MacOSX 12.6 21.6.0, I can provide more information as needed. I bisected the issue to #60105.
…ng#60105) Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (JuliaLang#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation `wr_ptr`s. I decided it wasn't worth it to associate `OnFinalizedFunction` callbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen.
Apple ARM CPUs treat the `ic ivau` as a memory read, which causes a confusing crash in DualMapAllocator if we try using it on a wr_addr that has been mprotected to `Prot::NO`, since we are still holding the allocator lock. This re-lands JuliaLang#60105, after it was reverted in JuliaLang#60196. Thanks @giordano!
Apple ARM CPUs treat the `ic ivau` as a memory read, which causes a confusing crash in DualMapAllocator if we try using it on a wr_addr that has been mprotected to `Prot::NO`, since we are still holding the allocator lock. For Apple aarch64 systems with SIP disabled, this will result in some memory savings, since DualMapAllocator will now work there. Like before, other JITLink platforms, namely Linux aarch64 and RISC-V, will benefit too. This re-lands JuliaLang#60105, after it was reverted in JuliaLang#60196. Thanks @giordano!
…ager/#60105) (#60230) Apple ARM CPUs treat the `ic ivau` as a memory read, which causes a confusing crash in DualMapAllocator if we try using it on a `wr_addr` that has been mprotected to `Prot::NO`, since we are still holding the allocator lock. For Apple aarch64 systems with SIP disabled, this will result in some memory savings, since DualMapAllocator will now work there. Like before, other JITLink platforms, namely Linux aarch64 and RISC-V, will benefit too. This re-lands #60105, after it was reverted in #60196. Thanks @giordano!
# Overview This PR overhauls the way linking works in Julia, both in the JIT and AOT. The point is to enable us to generate LLVM IR that depends only on the source IR, eliminating both nondeterminism and statefulness. This serves two purposes. First, if the IR is predictable, we can cache compile objects using the bitcode hash as a key, like how the ThinLTO cache works. #58592 was an early experiment along these lines. Second, we can reuse work that was done in a previous session, like pkgimages, but for the JIT. We accomplish this by generating names that are unique only within the current LLVM module, removing most uses of the `globalUniqueGeneratedNames` counter. The replacement for `jl_codegen_params_t`, `jl_codegen_output_t`, represents a Julia "translation unit", and tracks the information we'll need to link the compiled module into the running session. When linking, we manipulate the JITLink [LinkGraph](https://llvm.org/docs/JITLink.html#linkgraph) (after compilation) instead of renaming functions in the LLVM IR (before). ## Example ``` julia> @noinline foo(x) = x + 2.0 baz(x) = foo(foo(x)) code_llvm(baz, (Int64,); dump_module=true, optimize=false) ``` Nightly: ```llvm [...] @"+Core.Float64#774" = private unnamed_addr constant ptr @"+Core.Float64#774.jit" @"+Core.Float64#774.jit" = private alias ptr, inttoptr (i64 4797624416 to ptr) ; Function Signature: baz(Int64) ; @ REPL[1]:2 within `baz` define double @julia_baz_772(i64 signext %"x::Int64") #0 { top: %pgcstack = call ptr @julia.get_pgcstack() %0 = call double @j_foo_775(i64 signext %"x::Int64") %1 = call double @j_foo_776(double %0) ret double %1 } ; Function Attrs: noinline optnone define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 { top: %pgcstack = call ptr @julia.get_pgcstack() %0 = getelementptr inbounds i8, ptr %"args::Any[]", i32 0 %1 = load ptr, ptr %0, align 8 %.unbox = load i64, ptr %1, align 8 %2 = call double @julia_baz_772(i64 signext %.unbox) %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8 %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64 %3 = inttoptr i64 %Float64 to ptr %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152 %"box::Float64" = call noalias nonnull align 8 dereferenceable(8) ptr @julia.gc_alloc_obj(ptr %current_task, i64 8, ptr %3) #5 store double %2, ptr %"box::Float64", align 8 ret ptr %"box::Float64" } [...] ``` Diff after this PR. Notice how each symbol gets the lowest possible integer suffix that will make it unique to the module, and how the two specializations for `foo` get different names: ```diff @@ -4,18 +4,18 @@ target triple = "arm64-apple-darwin24.6.0" -@"+Core.Float64#774" = external global ptr +@"+Core.Float64#_0" = external global ptr ; Function Signature: baz(Int64) ; @ REPL[1]:2 within `baz` -define double @julia_baz_772(i64 signext %"x::Int64") #0 { +define double @julia_baz_0(i64 signext %"x::Int64") #0 { top: %pgcstack = call ptr @julia.get_pgcstack() - %0 = call double @j_foo_775(i64 signext %"x::Int64") - %1 = call double @j_foo_776(double %0) + %0 = call double @j_foo_0(i64 signext %"x::Int64") + %1 = call double @j_foo_1(double %0) ret double %1 } ; Function Attrs: noinline optnone -define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 { +define nonnull ptr @jfptr_baz_0(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") #1 { top: %pgcstack = call ptr @julia.get_pgcstack() @@ -23,7 +23,7 @@ %1 = load ptr, ptr %0, align 8 %.unbox = load i64, ptr %1, align 8 - %2 = call double @julia_baz_772(i64 signext %.unbox) - %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8 - %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64 + %2 = call double @julia_baz_0(i64 signext %.unbox) + %"+Core.Float64#_0" = load ptr, ptr @"+Core.Float64#_0", align 8 + %Float64 = ptrtoint ptr %"+Core.Float64#_0" to i64 %3 = inttoptr i64 %Float64 to ptr %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152 @@ -39,8 +39,8 @@ ; Function Signature: foo(Int64) -declare double @j_foo_775(i64 signext) #3 +declare double @j_foo_0(i64 signext) #3 ; Function Signature: foo(Float64) -declare double @j_foo_776(double) #4 +declare double @j_foo_1(double) #4 attributes #0 = { "frame-pointer"="all" "julia.fsig"="baz(Int64)" "probe-stack"="inline-asm" } ``` ## List of changes - Many sources of statefulness and nondeterminism in the emitted LLVM IR have been eliminated, namely: - Function symbols defined for CodeInstances - Global symbols referring to data on the Julia heap - Undefined function symbols referring to invoked external CodeInstances - `jl_codeinst_params_t` has become `jl_codegen_output_t`. It now represents one Julia "translation unit". More than one CodeInstance can be emitted to the same `jl_codegen_output_t`, if desired, though in the JIT every CI gets its own right now. One motivation behind this is to allow us to emit code on multiple threads and avoid the bitcode serialize/deserialize step we currently do, if that proves worthwhile. When we are done emitting to a `jl_codegen_output_t`, we call `.finish()`, which discards the intermediate state and returns only the LLVM module and the info needed for linking (`jl_linker_info_t`). - The new `JLMaterializationUnit` wraps emitting Julia LLVM modules and the associated `jl_linker_info_t`. It informs ORC that we can materialize symbols for the CIs defined by that output, and picks globally unique names for them. When it is materialized, it resolves all the call targets and generates trampolines for CodeInstances that are invoked but have the wrong calling convention, or are not yet compiled. - We now postpone linking decisions to after codegen whenever possible. For example, `emit_invoke` no longer tries to find a compiled version of the CodeInstance, and it no longer generates trampolines to adapt calling conventions. `jl_analyze_workqueue`'s job has been absorbed into `JuliaOJIT::linkOutput`. - Some `image_codegen` differences have been removed: - Codegen no longer cares if a compiled CodeInstance came from an image. During ahead-of-time linking, we generate thunk functions that load the address from the fvars table. - In `jl_emit_native_impl`, emit every CodeInstance into one `jl_codegen_output_t`. We now defer the creation of the `llvm::Linker` for llvmcalls, which has construction cost that grows with the size of the destination module, until the very end. - RTDyld is removed completely, since we cannot control linking like we can with JITLink. Since #60105, platforms that previous used the optimized memory manager now use the new one. ### General refactoring - Adapt the `jl_callingconv_t` enum from `staticdata.c` into `jl_invoke_api_t` and use it in more places. There is one enumerator for each special `jl_callptr_t` function that can go in a CodeInstance's `invoke` field, as well as one that indicates an invoke wrapper should be there. There is a convenience function for reading an invoke pointer and getting the API type, and vice versa. - Avoid using magic string values, and try to directly pass pointers to LLVM `Function *` or ORC string pool entries when possible. ## Future work - `DLSymOptimizer` should be mostly removed, in favour of emitting raw ccalls and redirecting them to the appropriate target during linking. - We should support ahead-of-time linking multiple `jl_codegen_output_t`s together, in order to parallelize LLVM IR emission when compiling a system image. - We still pass strings to `emit_call_specfun_other`, even though the prototype for the function is now created by `jl_codegen_output_t::get_call_target`. We should hold on to the calling convention info so it doesn't have to be recomputed.
# Overview This PR overhauls the way linking works in Julia, both in the JIT and AOT. The point is to enable us to generate LLVM IR that depends only on the source IR, eliminating both nondeterminism and statefulness. This serves two purposes. First, if the IR is predictable, we can cache compile objects using the bitcode hash as a key, like how the ThinLTO cache works. JuliaLang#58592 was an early experiment along these lines. Second, we can reuse work that was done in a previous session, like pkgimages, but for the JIT. We accomplish this by generating names that are unique only within the current LLVM module, removing most uses of the `globalUniqueGeneratedNames` counter. The replacement for `jl_codegen_params_t`, `jl_codegen_output_t`, represents a Julia "translation unit", and tracks the information we'll need to link the compiled module into the running session. When linking, we manipulate the JITLink [LinkGraph](https://llvm.org/docs/JITLink.html#linkgraph) (after compilation) instead of renaming functions in the LLVM IR (before). ## Example ``` julia> @noinline foo(x) = x + 2.0 baz(x) = foo(foo(x)) code_llvm(baz, (Int64,); dump_module=true, optimize=false) ``` Nightly: ```llvm [...] @"+Core.Float64#774" = private unnamed_addr constant ptr @"+Core.Float64#774.jit" @"+Core.Float64#774.jit" = private alias ptr, inttoptr (i64 4797624416 to ptr) ; Function Signature: baz(Int64) ; @ REPL[1]:2 within `baz` define double @julia_baz_772(i64 signext %"x::Int64") #0 { top: %pgcstack = call ptr @julia.get_pgcstack() %0 = call double @j_foo_775(i64 signext %"x::Int64") %1 = call double @j_foo_776(double %0) ret double %1 } ; Function Attrs: noinline optnone define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") JuliaLang#1 { top: %pgcstack = call ptr @julia.get_pgcstack() %0 = getelementptr inbounds i8, ptr %"args::Any[]", i32 0 %1 = load ptr, ptr %0, align 8 %.unbox = load i64, ptr %1, align 8 %2 = call double @julia_baz_772(i64 signext %.unbox) %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8 %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64 %3 = inttoptr i64 %Float64 to ptr %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152 %"box::Float64" = call noalias nonnull align 8 dereferenceable(8) ptr @julia.gc_alloc_obj(ptr %current_task, i64 8, ptr %3) JuliaLang#5 store double %2, ptr %"box::Float64", align 8 ret ptr %"box::Float64" } [...] ``` Diff after this PR. Notice how each symbol gets the lowest possible integer suffix that will make it unique to the module, and how the two specializations for `foo` get different names: ```diff @@ -4,18 +4,18 @@ target triple = "arm64-apple-darwin24.6.0" -@"+Core.Float64#774" = external global ptr +@"+Core.Float64#_0" = external global ptr ; Function Signature: baz(Int64) ; @ REPL[1]:2 within `baz` -define double @julia_baz_772(i64 signext %"x::Int64") #0 { +define double @julia_baz_0(i64 signext %"x::Int64") #0 { top: %pgcstack = call ptr @julia.get_pgcstack() - %0 = call double @j_foo_775(i64 signext %"x::Int64") - %1 = call double @j_foo_776(double %0) + %0 = call double @j_foo_0(i64 signext %"x::Int64") + %1 = call double @j_foo_1(double %0) ret double %1 } ; Function Attrs: noinline optnone -define nonnull ptr @jfptr_baz_773(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") JuliaLang#1 { +define nonnull ptr @jfptr_baz_0(ptr %"function::Core.Function", ptr noalias nocapture noundef readonly %"args::Any[]", i32 %"nargs::UInt32") JuliaLang#1 { top: %pgcstack = call ptr @julia.get_pgcstack() @@ -23,7 +23,7 @@ %1 = load ptr, ptr %0, align 8 %.unbox = load i64, ptr %1, align 8 - %2 = call double @julia_baz_772(i64 signext %.unbox) - %"+Core.Float64#774" = load ptr, ptr @"+Core.Float64#774", align 8 - %Float64 = ptrtoint ptr %"+Core.Float64#774" to i64 + %2 = call double @julia_baz_0(i64 signext %.unbox) + %"+Core.Float64#_0" = load ptr, ptr @"+Core.Float64#_0", align 8 + %Float64 = ptrtoint ptr %"+Core.Float64#_0" to i64 %3 = inttoptr i64 %Float64 to ptr %current_task = getelementptr inbounds i8, ptr %pgcstack, i32 -152 @@ -39,8 +39,8 @@ ; Function Signature: foo(Int64) -declare double @j_foo_775(i64 signext) JuliaLang#3 +declare double @j_foo_0(i64 signext) JuliaLang#3 ; Function Signature: foo(Float64) -declare double @j_foo_776(double) JuliaLang#4 +declare double @j_foo_1(double) JuliaLang#4 attributes #0 = { "frame-pointer"="all" "julia.fsig"="baz(Int64)" "probe-stack"="inline-asm" } ``` ## List of changes - Many sources of statefulness and nondeterminism in the emitted LLVM IR have been eliminated, namely: - Function symbols defined for CodeInstances - Global symbols referring to data on the Julia heap - Undefined function symbols referring to invoked external CodeInstances - `jl_codeinst_params_t` has become `jl_codegen_output_t`. It now represents one Julia "translation unit". More than one CodeInstance can be emitted to the same `jl_codegen_output_t`, if desired, though in the JIT every CI gets its own right now. One motivation behind this is to allow us to emit code on multiple threads and avoid the bitcode serialize/deserialize step we currently do, if that proves worthwhile. When we are done emitting to a `jl_codegen_output_t`, we call `.finish()`, which discards the intermediate state and returns only the LLVM module and the info needed for linking (`jl_linker_info_t`). - The new `JLMaterializationUnit` wraps emitting Julia LLVM modules and the associated `jl_linker_info_t`. It informs ORC that we can materialize symbols for the CIs defined by that output, and picks globally unique names for them. When it is materialized, it resolves all the call targets and generates trampolines for CodeInstances that are invoked but have the wrong calling convention, or are not yet compiled. - We now postpone linking decisions to after codegen whenever possible. For example, `emit_invoke` no longer tries to find a compiled version of the CodeInstance, and it no longer generates trampolines to adapt calling conventions. `jl_analyze_workqueue`'s job has been absorbed into `JuliaOJIT::linkOutput`. - Some `image_codegen` differences have been removed: - Codegen no longer cares if a compiled CodeInstance came from an image. During ahead-of-time linking, we generate thunk functions that load the address from the fvars table. - In `jl_emit_native_impl`, emit every CodeInstance into one `jl_codegen_output_t`. We now defer the creation of the `llvm::Linker` for llvmcalls, which has construction cost that grows with the size of the destination module, until the very end. - RTDyld is removed completely, since we cannot control linking like we can with JITLink. Since JuliaLang#60105, platforms that previous used the optimized memory manager now use the new one. ### General refactoring - Adapt the `jl_callingconv_t` enum from `staticdata.c` into `jl_invoke_api_t` and use it in more places. There is one enumerator for each special `jl_callptr_t` function that can go in a CodeInstance's `invoke` field, as well as one that indicates an invoke wrapper should be there. There is a convenience function for reading an invoke pointer and getting the API type, and vice versa. - Avoid using magic string values, and try to directly pass pointers to LLVM `Function *` or ORC string pool entries when possible. ## Future work - `DLSymOptimizer` should be mostly removed, in favour of emitting raw ccalls and redirecting them to the appropriate target during linking. - We should support ahead-of-time linking multiple `jl_codegen_output_t`s together, in order to parallelize LLVM IR emission when compiling a system image. - We still pass strings to `emit_call_specfun_other`, even though the prototype for the function is now created by `jl_codegen_output_t::get_call_target`. We should hold on to the calling convention info so it doesn't have to be recomputed.
Ports our RTDyLD memory manager to JITLink in order to avoid memory use regressions after switching to JITLink everywhere (#60031). This is a direct port: finalization must happen all at once, because it invalidates all allocation
wr_ptrs. I decided it wasn't worth it to associateOnFinalizedFunctioncallbacks with each block, since they are large enough to make it extremely likely that all in-flight allocations land in the same block; everything must be relocated before finalization can happen.I plan to add support for DualMapAllocator on ARM64 macOS, as well as an alternative for executable memory later. For now, we fall back to the old MapperJITLinkMemoryManager.