Refetch PTLS via Task struct (preparation for task migration)#39220
Refetch PTLS via Task struct (preparation for task migration)#39220tkf wants to merge 3 commits intoJuliaLang:masterfrom
Conversation
| // We need `gcstack` in `Task` to allocate Julia objects; *including* the `Task` type. | ||
| // However, to allocate a `Task` via `jl_gc_alloc` as done in `jl_init_root_task`, | ||
| // we need the `Task` type itself. We use stack-allocated "raw" `jl_task_t` struct to | ||
| // workaround this chicken-and-egg problem. Note that this relies on GC to be turned | ||
| // off just above as GC fails because we don't/can't allocate the type tag. | ||
| jl_task_t bootstrap_task; | ||
| bootstrap_task.gcstack = NULL; | ||
| jl_current_task = &bootstrap_task; |
There was a problem hiding this comment.
This is what I suggest for solving the bootstrapping issue I mentioned in the OP. It's a bit scary approach but I find it the most straightforward way to do it.
Also, do we want to zero out the struct? (done in e259cc0)
| LoadInst *ptls_load = ctx.builder.CreateAlignedLoad( | ||
| emit_bitcast(ctx, pptls, T_ppjlvalue), Align(sizeof(void *)), "ptls_load"); | ||
| // Note: Corersponding store (`t->ptls = ptls`) happes in `ctx_switch` of tasks.c. | ||
| ptls_load->setOrdering(AtomicOrdering::Monotonic); // TODO: what should we use? |
There was a problem hiding this comment.
What's the appropriate ordering here?
There was a problem hiding this comment.
Private-Thread-LocalS have no atomic ordering
There was a problem hiding this comment.
Maybe I should've asked how do we prevent a load of PTLS from the field of the task struct to be reordered against a function call? I used atomic ordering so that b in this function f would not be CSE'ed with a:
function f()
a = Threads.threadid()
yield()
b = Threads.threadid()
(a, b)
endIs there other ways to do it? Also, since task struct would be shared between threads once we start migration, I thought some kind of ordering might make sense.
There was a problem hiding this comment.
That's an aliasing question and has nothing to do with atomics
|
Taken from test/compiler/codgen.jl, this function jl_string_ptr(s::String) = ccall(:jl_string_ptr, Ptr{UInt8}, (Any,), s)lowers to ( define i64 @julia_jl_string_ptr_179({}* nonnull %0) {
top:
%thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
%current_task_field3 = getelementptr i8, i8* %thread_ptr, i64 -26168
%1 = bitcast i8* %current_task_field3 to {}***
%current_task4 = load {}**, {}*** %1, align 8
%ptls_field5 = getelementptr inbounds {}*, {}** %current_task4, i64 38
%2 = bitcast {}** %ptls_field5 to {}**
%ptls_load = load atomic {}*, {}** %2 monotonic, align 8
%ptls_load2 = load atomic {}*, {}** %2 monotonic, align 8
%3 = bitcast {}* %0 to {}**
%4 = getelementptr inbounds {}*, {}** %3, i64 1
%5 = ptrtoint {}** %4 to i64
ret i64 %5
}and compiles to ( .text
movq %fs:0, %rax
movq -26168(%rax), %rax
movq 304(%rax), %rcx
movq 304(%rax), %rax
leaq 8(%rdi), %rax
retq
nopw %cs:(%rax,%rax)with this PR. But, in 1.7.0-DEV.266, we have define i64 @julia_jl_string_ptr_197({}* nonnull %0) {
top:
%1 = bitcast {}* %0 to {}**
%2 = getelementptr inbounds {}*, {}** %1, i64 1
%3 = ptrtoint {}** %2 to i64
ret i64 %3
}and .text
leaq 8(%rdi), %rax
retq
nopw %cs:(%rax,%rax)I'm puzzled why LLVM can't (is not allowed to?) eliminate unused loads. I understand memory ordering also imposes the constraints on the reordering of other instructions but didn't expect this result. I need to learn more about this stuff... Anyway, given this constraint (either theoretically or technologically), maybe we need to switch to placeholder function + pass-based solution akin to #39168 and try to place/eliminate load ourselves? Or is there a more clever approach? |
|
This is a start. The next step is to replace |
|
Re my comment #39220 (comment) just above: I just found this in LLVM's manual https://llvm.org/docs/Atomics.html#monotonic
So maybe what I'm seeing is expected? |
This is an alternative approach to #39168. In this patch, PTLS is refetched through
jl_task_t. We can "cache" the pointer to thejl_task_tas SSA variable since it is guaranteed that the task does not change within a function [1].Functionally, I think the first commit is sufficient. In the second commit, I removed
pgcstackfrom PTLS as it's redundant. It turned out there is a bit of bootstrapping issue since we need toTasktype to allocate a task but we needgcstackinTaskfield to createTasktype.I'm not sure what memory ordering of load we should use for refetching PTLS so I just put
monotonicas the weakest placeholder (sinceunorderedgets reordered w.r.t the function calls).As I mentioned in the last comment #39168 (comment), a major disadvantage of this approach is that LICM of
threadid()would not work. However, it might be better to stop usingarray[threadid()]pattern for "thread-local storage" anyway.There are some test failures in codegen test. I'll fix them once this direction is OK.
[1] This won't be the case if we start doing continuation stealing like Cilk. But, I think we can mix the approach taken in #39168 to refetch the pointer to the task struct at the beginning of the continuation conditionally and then propagate the task pointer via phi nodes.
Demo
@code_llvm debuginfo=:none optimize=false f()You can see
%ptls_load2 = load ...after@j_yield_224().@code_llvm debuginfo=:none f()