Simplifications/cleanups/clarifications from multicore review#10966
Simplifications/cleanups/clarifications from multicore review#10966sadiqj merged 7 commits intoocaml:trunkfrom
Conversation
xavierleroy
left a comment
There was a problem hiding this comment.
Thank you for all the attention you gave to the runtime system code. I read commit by commit. The first 5 look obviously correct, and the last 2 are a bit above my grade but I trust you that they do not change behavior yet. So, I'm approving this PR but would prefer to have a second review.
|
I've started reading through and I can finish the review of by mid-week. One thing I am not sure about is whether it's the right thing to reintroduce |
|
I would tend to agree with you regarding However I will document the limitation. |
|
The 6th and 7th commit are best reviewed by comparing to pre-multicore trunk. It mainly reintroduces what was already there. Only for |
So my understanding from rocq-prover/rocq#15464 (comment) and the discussion afterward is that switching to |
|
Right, there are no meaningful finalizers in Coq. Basically, what Coq needs is an extremely cheap way of testing that a signal (namely Whether we detect only pending signals or other pending events, we do not care, as long as the test is cheap and has no false positive. And if the runtime decides to perform a lot more work than just resolving the pending signal, that is also fine with us. |
|
And just to be complete, handling signals is just one of the issues. The other is what to do with |
|
Right. One point is that Coq VM will indeed need to be changed to call So there's maybe the option of not requiring more changes to the CoqVM later. It's worth testing the performance impact of polling in the new version. |
Perhaps, but it is still incredibly slower than the approach available in OCaml 4. In the old world, we could just check the value of atomic_thread_fence(memory_order_acquire);
for (i = 0; i < NSIG_WORDS; i++) {
if (atomic_load_explicit(&caml_pending_signals[i], memory_order_relaxed))
return 1;
}By the way, was the performance of OCaml bytecode interpreter tested? The old one was checking |
|
Believe it or not, this function amounts to a test of a single machine word and is compiled as such! See #10880. How recent is your benchmark? |
|
I am interested in feeback regarding the perf of |
That was the case a while ago, before #10880 was merged. Performance should be much better now, but can probably be improved some more by just testing young_ptr against young_limit. |
This would not work currently. |
Couldn't we make it work? It's good enough for native code, so it should be doable in bytecode too. |
|
I might have misread your comment. For @silene it is probably not enough to test young_limit because it is reset by allocations. Eventually for OCaml's bytecode interpreter the goal is to just have this test, yes. But if the issue mentioned at #10880 (comment) has the same origin then this is fixed on the way to solving #10915 (edit: looking at it again I do not see an indication that the failing test depends on #10915; I do not mean to discourage looking at the issue independently). |
Indeed, my mistake. After recompiling OCaml trunk, it is not as slow, but it is still 70% slower than OCaml 4.08 and 4.13. That said, it does not seem to be due to For the record, in the loop version of Fibonacci, time is spent inside |
sadiqj
left a comment
There was a problem hiding this comment.
This looks good @gadmm, thanks for going through these. Other than the review comment on the diff, I'm still uneasy about having a caml_process_pending_actions_exn that doesn't actually process pending actions, only signals. That said the only real user we can find of the API is probably aware of the problem and we intend to fix it eventually.
otherlibs/systhreads/st_stubs.c
Outdated
| if (atomic_load_acq(&Thread_main_lock.waiters) == 0) return Val_unit; | ||
|
|
||
| caml_process_pending_signals(); | ||
| caml_process_pending_actions(); |
There was a problem hiding this comment.
I'm not convinced this matches the behaviour in 4.x. If we look at https://github.com/ocaml/ocaml/blob/4.14/otherlibs/systhreads/st_stubs.c#L734-L739 we explicitly only deal with signals in caml_thread_yield.
There was a problem hiding this comment.
I agree, I think my reasoning was that in theory with new polling locations, any signal handler could execute other actions and so the distinction is no longer relevant. But the reasoning was probably flawed and in any case not for this PR which is meant to be side-effect-free.
Some documentation comment erroneously disappeared in multicore, which encouraged to keep consistency with caml_enter/leave_blocking_section, which was useful information in this case. I restored the comment.
In light of this comment, the line Current_thread = st_tls_get(Thread_key); was added in caml_thread_enter_blocking_section but not in caml_thread_yield, so there is a now a discrepancy, please have a look.
There was a problem hiding this comment.
Yes, this is a good point. @Engil do you happen to know how we ended up with this? trunk seems to update the current thread when entering a blocking section:
ocaml/otherlibs/systhreads/st_stubs.c
Line 210 in 1f9c958
caml_thread_enter_blocking_section.
(4.14 doesn't do this: https://github.com/ocaml/ocaml/blob/4.14/otherlibs/systhreads/st_stubs.c#L230 )
There was a problem hiding this comment.
I think this is an oversight and is not required.
Weird interactions with backup threads or else do not seem likely at all (since a thread going through caml_thread_enter_blocking_section is the thread actually holding the lock and the thread that has been executing code.).
I think it is safe to remove this.
There was a problem hiding this comment.
Ok, thanks. I'll make a PR doing so.
3a78d18 to
10edc37
Compare
|
I have to leave your remaining remark for another time. |
No change entry needed.
Remove caml_huge_fallback_count and FIXME comment. It never worked (has always been equal to zero) so it has no potential users, and a grep of opam sources returns no hit. There is little risk in removing it, and 5.0 is a good time for this. No change entry needed.
`caml/misc.h`: This header is very commonly included and I think it is not intended to make these functions and macros public. `caml/platform.h`: The whole header is made a CAML_INTERNALS. It contains non-prefixed identifiers and no function of public interest No change entry needed.
…c_interrupt Eventually we will only need the one that does not run finalisers, once it is implemented. No change entry needed.
Please review the last three commits (the others have not changed) and then I'll clean up the history. |
10edc37 to
44dd90e
Compare
- Restore old public API and comments. - Document current limitations, caml_process_pending_actions_exn is uninmplemented for now. - Move pending_actions back to signals.c - Reintroduce caml_process_pending_actions_with_root for subsequent commit - Remove superfluous functions - Try to restore discipline of _removing_ internal functions that directly raise when possible (caml_process_pending_signals). This is very preliminary work to reintroduce control over asynchronous exceptions. The first two points are desirable for 5.0. No behaviour change expected. No change entry needed.
Polling locations were tightened for memprof. They partially disappeared during multicore's rebase on top of 4.12 since multicore did not support memprof or the new polling machinery. In addition, it was inconsistent whether these changes and/or the comments relating them to memprof were kept or removed. These changes will be easy to miss when working to restore memprof unless someone goes through the diff of the multicore PR. Since I have already done it, this will make sure that nothing is accidentally lost. This commit is meant to be behaviour-preserving. No change entry needed.
Shorten the name caml_check_for_pending_signals for consistency
44dd90e to
0a372b2
Compare
|
It is clean now, thanks. |
|
Brilliant, thanks for your work on this @gadmm |
This PR is best read commit-by-commit. I grouped them together because they are all minor changes coming from my review of the multicore PR and are expected to introduce no behaviour change (and I think will interest @kayceesrk and @sadiqj).
caml_modify. Confusion stems from the fact that the PLDI paper describes a publication scheme based on the old concurrent collector design, so we document the difference with the paper. (As a side note, have the benchmarks from the paper been re-run with the release store instead of a relaxed store? If there is an impact then one could try optimizations based on the write barrier.)I believe that this PR will not need too many discussions, but if you want me to split any part of it to a different PR do not hesitate to request it.
Each commit has a detailed commit log that gives more details.