Conversation
|
@gasche can you please run precheck on it? |
|
@gadmm thank you for this. |
|
Thanks @gasche. CI is failing on Windows with access violations, which is actually good news because the root of the Windows failure had not yet been identified (not for lack of trying). Now, it is unlikely that it is caused by the three last commits because they only change the behaviour in corner cases and were not there before. This makes the first one (using thread_local) a primary suspect. The commit looks trivial but keep in mind that the commit could be correct while suffering from side-effects of pre-existing bugs (I know for instance that Windows is picky about threads that outlive the main thread). @gasche could you please run precheck on 084a512? (or do you need a named branch?) (sorry to ask again, everything would be easier if I managed to reproduce these bugs) |
Thank you, this PR is the first one on the stack. I also pushed a small addition to #11272 to make it independent from all the systhreads PR, so this can also be reviewed now as well. |
I don't mind, but I think you should follow the process to get precheck access rights yourself, which is documented in HACKING.adoc: running INRIA's CI on a publicly-available git branch. |
|
(Yes, it seems that precheck needs a named branch.) |
|
I created the branch According to the procedure in HACKING.adoc I should already have these rights, so it seems that I will have to report the problem to the anonymous e-mail address mentioned. That being said I have already spent too much time on the systhread issues. At the start I simply wanted to have systhread stop accessing |
aed6d4d to
7916f46
Compare
|
Thanks to @dra27's help I managed to reproduce the issue on Windows. It was indeed caused by the first commit (084a512ce). Perhaps Windows experts would know why (incompatibility with dynamic linking? does not like it when non-detached threads are not joined?). In the meanwhile for 5.0 I rebased without the first commit. |
Thanks for fighting through the frustrations - I'm glad the build behaved in the end!
It would certainly be interesting to confirm that it's an implicit TLS issue and potentially fix flexlink, but given the size of the diff (i.e. the fact we don't use that much explicit TLS), I'm not sure it'd be worth devoting time to? Footnotes
|
New precheck run: https://ci.inria.fr/ocaml/job/precheck/739/ |
| st_tls_set(caml_thread_key, new_thread); | ||
|
|
||
| Active_thread = new_thread; | ||
| Tick_thread_running = 0; |
There was a problem hiding this comment.
This is a very nice catch, I wonder how it got in there
otherlibs/systhreads/st_stubs.c
Outdated
| /* Restore the runtime state from the curr_thread descriptor */ | ||
| caml_thread_restore_runtime_state(); | ||
| } else { | ||
| caml_leave_blocking_section_default(); |
There was a problem hiding this comment.
My intuition is that caml_thread_leave_blocking_section (and its enter counterpart) would not be called if initialization isn't done for Thread.
(as these are set as callback only if caml_thread_initialize is called)
Is this extra handling required?
There was a problem hiding this comment.
Ah, maybe there's a weird interaction in case we are talking about another domain's initialization? (not domain 0)
There was a problem hiding this comment.
Yes, for domain ≠ 0, there is a gap between spawning and threads init during which arbitrary OCaml code can run. The solution is not really nice.
I do not really like this solution (look at the removal of [@noalloc] on Thread.self). It would be simpler to avoid this by calling the domain threads init with a hook on the C side. Would you or someone else would be willing to look into such an alternative solution?
There was a problem hiding this comment.
To be quite frank I think the solution here works and we should merge the branch as is.
I agree this is not quite elegant, and I've never been a huge fan of the initialisation process for threads to be called from OCaml.
Maybe we should just open an issue and address this later?
There was a problem hiding this comment.
we should merge the branch as is
it needs an approval first :-)
|
Thanks for the infos @dra27.
For systhreads, no. For flexdll users, it is probably worth investigating/documenting the issue given that |
|
@gadmm this is good to go if you confirm that you are happy to merge now |
|
I think the API change (Thread.self no longer being noalloc) needs an additional opinion. |
|
Right, the more I think about it, the less I'm comfortable with it. ( |
|
I don't think there should be a problem of having a C-side hook To further lift limitations on init, I even wonder if the initialization of systhreads could not be made fully per-domain (using thread-local hooks). (But that's beyond the scope.) |
|
Don't you think there could be a timing issue there: |
|
I don't think there is a timing issue for systhreads initialization on domain 0. The systhreads init is done at the same time as the domain 0 threads init. Either way I agree that this could go in and one can open an issue to ask that the Thread.self situation is fixed, to let someone have a look independently from me. Then depending on how severe this is deemed, the issue can be targeted to 5.0 or later. (One motivation to fix it for 5.0 is that I am not sure that |
|
CAMLprim value caml_thread_self(value unit) /* ML */
{
if (curr_thread == NULL) invalid_argument("Thread.self: not initialized");
return curr_thread->descr;
}(In the current trunk, the function does not raise an exception but, if I understand correctly, it reads an uninitialized field instead, which is not much nicer.) Marking the function Looking for code that uses You can decide to change the runtime implementation in significant ways to make the function truly noalloc, but i think that it could/should go to a separate PR. |
|
This confirms that removing noalloc is a perf regression. I am not sure that this was incorrect though. If you access this function via the Thread module, then thread init has been done and it cannot raise any exception (right?). The check is for other potential callers, but those are not affected by the noalloc annotation. |
|
Could we define (Then I guess |
|
(The suggestion above might be a low-effort approach before a more invasive solution to thread initialization is proposed. I agree that it's worth pulling the thread (no pun intended) of what API Thread needs from the runtime to be correct.) |
|
I agree The simplest solution is to do a fatal error instead of raising an Invalid_argument exception. Justification: the only case where |
Not in multicore. There is also the case where domain-local thread init has not yet been done on the current domain. This is concrete, for instance memprof-limits calls Thread.self from asynchronous callbacks, which can be called in this interval. (So I realize now that even raising is a bad solution, since the code will have to be adapted to be ready for this in multicore.) To focus on this PR, I see two solutions:
|
Right. And very little code is ready to handle an |
|
Even if we remove the "offending commit", we have the problem that the return value of |
|
Raising Invalid_argument from Thread.self even when the Threads library was properly linked with the program is already a correctness bug. So I'd like to see it addressed before anything from this PR is merged. |
7916f46 to
756e4c9
Compare
|
The message was unclear so I took the second solution of removing the commit from this PR. I will open an issue instead.
This does a NULL pointer dereferencing (or a use-after-free in the case of a recycled domain). Idem if There was a risk for |
|
I think leaving this out of this PR and merging the fixes already in there is a good idea, should we just move on and merge this and move the discussion over to #11434 ? |
|
@gadmm this conflicts with your Caml_state=NULL changes, would you mind rebasing before merge? |
This is undefined behaviour; instead they are intended to be reused.
756e4c9 to
fe5e5aa
Compare
|
Rebased. |
More systhreads fixes for 5.0 (cherry picked from commit 05114fb)
|
Cherry-picked in 5.0 as 9bf9c52. |
Through digging into the new systhreads implementation here are a few more fixes to look into for 5.0:
Bugs fixed at 2. and 4. are likely sources of UB (deadlocks or crashes) in practice, which might explain some CI failures on other systhreads PRs, though it is unclear yet.
No change entry needed.
(cc @Engil, @kayceesrk, @gasche)