Resource-safe C interface, part 1 (the 4.10 backwards-compatibility edition)#8993
Resource-safe C interface, part 1 (the 4.10 backwards-compatibility edition)#8993gasche merged 6 commits intoocaml:trunkfrom
Conversation
89c299a to
cb1078d
Compare
|
I think that @jhjourdan and @stedolan can be interested and have opinions on the PR. An example of proposed use of the new interface is given at rocq-prover/rocq#10603 (comment). |
|
I think @dbuenzli can be interested and have opinions on the public interface. |
cb1078d to
9c26fab
Compare
jhjourdan
left a comment
There was a problem hiding this comment.
I think this PR goes to the right direction in exposing the _exn variants of these functions.
I left a few comments in the diff. I particular, I would prefer not introducing several variants for functions that essentially do the same thing. I instead prefer using flags or optional parameters.
runtime/signals.c
Outdated
|
|
||
| CAMLexport value caml_process_pending_events_exn (void) | ||
| { | ||
| if (!caml_get_urgent_to_do()) return Val_unit; |
There was a problem hiding this comment.
I don't like this change : even thought it may not be bogus in itself, it makes the code way less robust, for two reasons:
- In native mode, if
caml_something_to_dois not set but the minor GC limit is set toyoung_alloc_end, then we end up in an loop, because we never callcaml_update_young_limit. - If there is a pending callback but
caml_something_to_dois not set, then we need to wait for the next callback to get a chance of executing the first one.
Sure, if we maintain the invariant that caml_something_to_do is always set when there is a pending callback, then these two issues can never happen. But this is based on a rather subtle logic, and, moreover, accesses to caml_something_to_do and young_limit can be racy so that it's not clear that we see the right changes at the right moment.
There was a problem hiding this comment.
Sorry for the loop, I should have realised because I knew it from my work on implementing masking.
Note that whether the user is tasked in calling caml_get_urgent_to_do themselves or this implementation, the problem is the same. The goal here is to avoid exposing assumptions in the public interface about the race between querying signals and running signal handlers. It is not yet clear to me how multicore is going to affect this, but this way we can change it more easily in the future.
What we should do here is implement it differently in native and bytecode, and in native we check for young_limit==young_alloc_end instead (I already have to introduce the symmetric function caml_urgent_notify_program for Sys.mask, see 41a8ced).
There was a problem hiding this comment.
What we should do here is implement it differently in native and bytecode, and in native we check for young_limit==young_alloc_end instead (I already have to introduce the symmetric function caml_urgent_notify_program for Sys.mask, see 41a8ced).
I would still prefer not doing this test at all. Again, because of the race, caml_something_to_do may not be up-to-date in bytecode mode. The same applies for young_limit in native mode. And, moreover, the invariant you are relying on is subtle and may get broken in the future by bogus code. I would prefer being robust.
Do we have an idea of how cost a spurious call to the non-optimized version of this function? How much do we gain from this optimization?
There was a problem hiding this comment.
For the cost, the argument is that you want a C interface to do like OCaml. For instance you might want to remain reactive in a tight loop. In OCaml, we are happy that polling only costs one comparison in that case. The premiss is that there should be a way to do exactly like OCaml (essentially if (check()) { perform(); }).
It is true that checking for young_limit==young_alloc_end is not exactly like young_ptr < young_limit, but I think making young_limit==young_alloc_end an explicit invariant of when pending actions should be performed is technically sound. (I may also be missing your point, in that case could you please give us a more detailed explanation of your objection?)
Furthermore, in bytecode it is a bit more complicated indeed given that now sometimes the test is young_ptr < young_limit, other times it is caml_something_to_do, but I would prefer to simplify bytecode by always checking for the same condition (so, young_limit==young_alloc_end).
I can propose to make the non-checking function internal and provide a public function that does the test before calling the internal function.
The following piece of code from ocamlcc.1.0/runtime/ocamlcc-signals.c makes me think further (it relies on internals):
if (caml_something_to_do) {
long ocamlcc_save_sp_offset;
SaveSp(ocamlcc_save_sp_offset);
sp[-frame_sz - 1] = sp[-frame_sz - 2] = sp[-frame_sz - 3] =
sp[-frame_sz - 4] = sp[-frame_sz - 5] =
sp[-frame_sz - 6] = Val_unit;
OffsetSp(-frame_sz - 6);
do {
caml_something_to_do = 0;
caml_process_event();
} while (caml_something_to_do);
ResetSp(-frame_sz - 6);
RestoreSp(ocamlcc_save_sp_offset);
}If we do want to support the inner loop, it looks to me that we need to expose caml_get_urgent_to_do(), contrary to what I advocate. But I do not know if we want to support something like it. We can keep caml_get_urgent_to_do internal, since they already rely on internals.
We will see what I come up with.
There was a problem hiding this comment.
I initially wrote that in bytecode sometimes the test is young_ptr < young_limit, but this is currently never because young_limit == young_alloc_end. Similarly in native code we can get to caml_garbage_collection by exceeding caml_memprof_young_trigger. In that case, as I now understand, it is not only brittle but wrong to check young_limit == young_alloc_end.
There was a problem hiding this comment.
My point is that, in rare situations, caml_something_to_do and young_limit may not be set even though there is an async callback waiting. I can foresee several reasons:
- Because of the race, writes to these variables may not be visible to the OCaml thread at the right moment. Imagine we get into
caml_check_urgent_gcwhencaml_something_to_dois seen as set but notcaml_pending_signals: we may miss a signal in the first call tocaml_check_urgent_gcand don't check for them later because we return from said function even before checkingcaml_pending_signals. - The fact that
caml_something_to_dois set when there are pending callbacks relies on subtle invariants which we will easily break in the future. - If there are two different callbacks (say, a signal and a finalizer) arriving at the same time, then we set
caml_something_to_doto 0 when starting the first one while the second one is still waiting. We may want to run the second one if the first one is taking long.
In the current situation, the code is robust in the sense that async callbacks are called at the next minor collection even if, for some reason, caml_something_to_do is not set properly. With what you are proposing, we have to wait for the next async callback, which may not come at all.
I can propose to make the non-checking function internal and provide a public function that does the test before calling the internal function.
Perhaps what we should do then is make caml_something_to_do part of the public interface and document that this is the thing one has to check before calling caml_check_urgent_gc. This would both cover ocamlcc and Coq use cases, and make polling very cheap as you desire.
Note, however, that because of the reasons I described above, it may be possible that we miss async callbacks in long-running C code. I guess that's the price to pay?
There was a problem hiding this comment.
You explain well why we need to distinguish the function called internally from the function provided to the user (which will be guarded by a check to caml_something_to_do).
With masking, we will need to make it a hard invariant that caml_something_to_do has to be set if there are pending callbacks (outside the case of process_pending* where it is temporarily set to zero and set back to 1 later if needed). I think it is better to start doing it now.
In the case of a finaliser taking a long time while a signal is pending and caml_something_to_do is temporarily 0 (or vice-versa), we can think further about it but I think this is outside the scope.
These reasons convince me to not make public a check for caml_something_to_do, so that we keep the races under our control and it is easier to fix any issue later.
I am not sure we want to support this exact code of ocamlcc, because I don't think there is a meaning to "please run signals until there are no more".
I am not sure I see under which conditions async callbacks could be forgotten, what scenario do you think about exactly?
There was a problem hiding this comment.
With masking, we will need to make it a hard invariant that caml_something_to_do has to be set if there are pending callbacks (outside the case of process_pending* where it is temporarily set to zero and set back to 1 later if needed). I think it is better to start doing it now.
Why so? Cannot we simply call caml_check_urgent_gc after any unmasking?
These reasons convince me to not make public a check for caml_something_to_do, so that we keep the races under our control and it is easier to fix any issue later.
What about exposing a getter then? This would make it easy to change its content if needed. This getter could even be an inline function, for performance.
I am not sure we want to support this exact code of ocamlcc, because I don't think there is a meaning to "please run signals until there are no more".
That's not what worries me in ocamlcc's use case. What worries me is that it seems important for its performance not to register its extra roots and perform the function call at every single check for signals.
I am not sure I see under which conditions async callbacks could be forgotten, what scenario do you think about exactly?
I don't get your question: I thought my previous message was all about answering it... If caml_something_to_do is 0 when there is a pending async callback, then, with your proposal, we would need to check for the next async callback to come for executing the first one. To me, this is essentially as if we lost it.
There was a problem hiding this comment.
Just to make things clear: my current proposal is to expose in the public interface a getter for caml_something_to_do, and document that it can be used for fast polling in C code. This getter could be an inline function in the signals.h file to make its cost minimal. We would not check for caml_something_to_do at every minor GC, just to make sure we don't miss async callbacks.
Do you agree with this proposal? If not, then, what is the issue? It seems to me this addresses all your concerns.
There was a problem hiding this comment.
Ok, I thought you meant that some signal would be dropped. Now I understand the issue with long-running C code, this is well spotted. I am proposing the following fix: allow users the obtain the same fallback behaviour as there is from OCaml if they cooperate, by setting caml_something_to_do at every minor collection or major slice in CAML_FROM_C mode.
Waiting for a minor collection or a major slice is still bad latency. It would be nice to fix the first race you explain between caml_something_to_do and caml_pending_signals (this beyond my expertise). I imagine that something will have to be done anyway with multicore (I thought I saw it mentioned at some point but could not find a dedicated bug report). (I am also thinking that caml_signals_are_pending might be redundant now with caml_something_to_do, but I am not sure I should do something about it right now.)
I am wary of exposing a getter for caml_something_to_do as much as exposing caml_something_to_do itself: I want to avoid hardcoding backwards-compatibility support for the form if (check()) { execute(); } and everything else users can come up with. This would mean losing control over the critical section between check and execute, limiting future evolution. In particular this limits the space of possible solutions of the two problems and of any other that could arise (allow me to remain abstract; if I had a concrete solution in mind right now I would propose it). Anyway, I do not see a need for it compared to your proposal now that we can replicate the fallback behaviour in caml_process_pending_actions.
Ocamlcc is a special because it is an OCaml compiler so it is not excessive to ask that they keep relying on internals. It is like saying that the public interface does not support the use-case from runtime/interp.c either. The goal of this new API is not to support the creation of alternative OCaml compilers, as this could require to make public many more internal runtime identifiers beyond those.
I hope the new version of the code addresses all your related concerns (I am waiting for tests to finish). I am documenting the invariant for caml_something_to_do so that we go towards fixing it and not breaking it.
(About unmasking: we do not execute callbacks right away for crucial reasons related to reasoning about critical sections for polling. I gave a thought to what made OCaml's polling behaviour nice to reason about in my experience, and I wrote it down here: ocaml-multicore/ocaml-multicore#187 (comment). It is entirely about almost your question. But I guess you just want to poll for signals unconditionally when unmasking and not rely on caml_something_to_do to the extent that it is unreliable: that is something else and very reasonable, do not hesitate to bring up this point again at #8961.)
|
And, BTW, thanks for this PR, this is quite a lot of well-polished work. |
|
I'll get back to this shortly. In the meanwhile, I note that some comments are about choices for a public-facing interface. I would like to mention that anybody should feel free to contribute to these, not just the experts of this code. |
|
Reminder for myself again: we agreed to change the default behavior of |
|
Do you need help to get this merged this week? |
|
No problem to finish it this week, but it would help to have a clear idea of the deadlines in general. Right now it feels like everybody around me is in a hurry and I don't know why. I asked privately about this and I was told "feature freeze somewhere around mid-October", which makes me think we still have time to finish it this week and discuss it comfortably, but it's rather vague.
|
|
This PR looks nice, but it adds quite a few more functions to the C API. In particular, what's the motivation for providing both If it's purely to avoid the overhead of adding an extra root, then I'd like to suggest the following implementation of value caml_process_pending_gc(value extra_root)
{
if (Caml_state->requested_major_slice || Caml_state->requested_minor_gc) {
CAMLparam1 (extra_root);
CAML_INSTR_INT ("force_minor/caml_process_pending_gc@", 1);
caml_gc_dispatch();
CAMLdrop;
}
return extra_root;
}This version incurs the small overhead of adding a root only when actually invoking the GC, which is an expensive enough operation that the overhead doesn't matter. |
|
The motivation was backwards compatibility: the with_root version was an alias for caml_check_urgent_gc. The one I would like to expose does not take an additional root (and the reason it's enough is close to your suggestion that additional roots are easy to add locally). Anyway, the decision to change the semantics of caml_check_urgent_gc will bring a few changes and simplifications. I'll keep your tip in mind to use if need be. |
1aa3356 to
6bafd46
Compare
|
Here is the new version. I will open a different PR for the commit that changes I have to admit that I do not understand the CAMLno_tsan business, thanks if you could have a look and explain any possible mistake. |
|
The first 4 commits are proposed at #9037. This PR and review concerns the 5 last commits. |
6bafd46 to
85a3eaf
Compare
Don't worry too much about that, CI will tell you if you do it wrong. |
runtime/caml/signals.h
Outdated
| FIXME: | ||
|
|
||
| * We could get into caml_process_pending_actions when | ||
| caml_something_to_do is seen as set but not | ||
| caml_pending_signals, making us miss the signal. | ||
|
|
||
| * If there are two different callbacks (say, a signal and a | ||
| finaliser) arriving at the same time, then we set | ||
| caml_something_to_do to 0 when starting the first one while the | ||
| second one is still waiting. We may want to run the second one | ||
| if the first one is taking long. |
There was a problem hiding this comment.
Not sure this deserves a FIXME. The current solution is rather good, and I don't think there is anything better we could do.
There was a problem hiding this comment.
IIUC, in these situations, instead of reacting at the next allocation, it react on the order of 10^5 to 10^6 allocations later. For finalisers the total latency remains on the same order, but for signals this is bad. In particular, have a look at the part of ocaml-multicore/ocaml-multicore#187 (comment) which says "guaranteed interrupt at fixed delays... Based on the maximum latency of interrupt required...". From their point of view of guaranteed latency, this is still buggy.
The second bullet point can be solved if we call signal handlers first, what do you think about it?
It would be nice to fix the first one as well.
There was a problem hiding this comment.
The second bullet point can be solved if we call signal handlers first, what do you think about it?
Alright, let's do that. Note, however, that the issue is still there for finalisers or memprof. That's sort of an unsolvable problem.
Anyway, for both issues, they will only occur at very rare occasions, and we have a workaround. And it's unlikely we will find a "fix" any time. Hence I don't think the "FIXME" is deserved.
But that's a detail, I don't care, let's move forward.
There was a problem hiding this comment.
The second bullet point has been fixed. Signal handlers are now safe from this (newly-introduced, AFAIU) race. It is no longer a FIXME, because it is not a bug for finalisers (their latency remains on the same order of magnitude), and it is not a bug for memprof callbacks either, because although we want them to run ASAP, they are served on a best-effort basis.
I keep the first item as a FIXME because it is a bug if we take into account code that tries to allocate little, or if there are latency guarantees as assumed in the work on better safe points. If it is a deviation to the intended behaviour and it does not appear easily solvable with the current mechanism, then I think a FIXME is the best option.
|
Windows CI is failing, it is probably due to a segfault, but I am not sure. Let's see what happens when you will update this. |
85a3eaf to
b42176f
Compare
|
I am OK with removing the functions with the extra root. Here is the patch I propose: I had to add roots at three places:
let () =
for i = 0 to 100000000 do
ignore (Float.Array.create 2)
doneI think this is acceptable. Nobody will have such a tight allocation pattern. Do you agree with this change? |
|
Ok, I worked on a patch in the meanwhile. Let us cross-check our patches. |
|
I'm OK with the change, but for |
|
Is the change to |
|
Re
The new API proposed by Guillaume looks nice, and I'm not saying this to discourage you from simplifying things -- a 5% degradation on a microbenchmark does sound reasonable in favor of better ressource-safety in the runtime. But optimizing the runtime code also makes sense. |
No, that's a mistake. |
It's with the conditional rooting. I am not sure whether it is noise (the stack frame is different, this can have impact on alignment), or whether there is another effect (the size of the code is larger -> impact on code cache, the stack frame is larger...). |
|
In any case, I don't think the performance degradation of either inilining or not inlining is worth carrying. So do as you prefer. But I don't really like the idea of keeping the two aliases just for one function used in one very particular way. This is not mentionning the fact that virtually nobody use |
I can believe this, and many other effects. There was an interesting talk at Strange Loop about this (https://www.youtube.com/watch?v=r-TLSBdHe1A, I especially like the part where a difference in benchmarks was due to them running in sessions with user names with different lengths). BTW it has nothing to do with the absence of CAMLno_tsan? That does not affect generated code? IIUC you also need to add a root to |
3eab45c to
4845e9c
Compare
According to @stedolan, there is no reason that making a GC root should be costly, if somebody wanted to work on it. |
This only affect the generated code when the thread sanitizer is activated. Well, at least, I would consider it to be a bug...
Indeed, well spotted!
For About So, I see two solutions:
|
Ah good to know, thanks for digging up a reference.
Yes, see the latest version, this is what I was now proposing.
I already use Stephen's trick to make it negligible. I think it is fine as it is. |
jhjourdan
left a comment
There was a problem hiding this comment.
Apart from the bug in caml_array_gather, I'm happy with this version.
I have a minor comment about the name caml_do_actions, which I don't find coherent with the other names.
Another point: forget my recent proposal to change the definition of caml_process_pending_actions. I had not thought enough of it, what I was thinking about does not work.
Introduce caml_process_pending_actions and
caml_process_pending_actions_exn: a variant of the former which does
not raise but returns a value that has to be checked against
Is_exception_value.
I keep the current conventions from caml_callback{,_exn}: For a
resource-safe interface, we mostly care about the _exn variants, but
every time there is a public _exn function I provide a function that
raises directly for convenience.
They are introduced and documented in caml/signals.h.
Private functions are converted to their _exn variant on the way as
needed: for internal functions of the runtime, it is desirable to go
towards a complete elimination of functions that raise implicitly.
Get rid of the distant logic of caml_raise_in_async_callback. Instead,
caml_process_pending_events takes care itself of its something_to_do
"resource". This avoids calling the former function in places
unrelated to asynchronous callbacks.
point following every minor collection or major slice. Also run signal handlers first. Indeed, in some cases, caml_something_to_do is not reliable (spotted by @jhjourdan): * We could get into caml_process_pending_actions when caml_something_to_do is seen as set but not caml_pending_signals, making us miss the signal. * If there are two different callbacks (say, a signal and a finaliser) arriving at the same time, then we set caml_something_to_do to 0 when starting the first one while the second one is still waiting. We may want to run the second one if the first one is taking long. In the latter case, the additional fix is to favour signals, which have a lower latency requirement, whereas the latency of finalisers keeps the same order of magnitude, and memprof callbacks are served on a best-effort basis.
4845e9c to
890bb90
Compare
|
Roadmap:
|
I don't think there is any issue with this. |
|
The PR is ready as is. It can be merged if CI passes. |
|
Green light on my side as well. I am very satisfied with the final state of the PR, especially with the important improvements brought via Jacques-Henri's review. Thanks! |
gasche
left a comment
There was a problem hiding this comment.
Approved on jhjourdan's behalf.
This introduces new functions
caml_process_pending_*(new public-facing name forcaml_check_urgent_gc) with variants that return an exception instead of raising it immediately (in the style ofcaml_callback_exn. The overall goal is to allow writing resource-safe C functions in the presence of asynchronous exceptions.This is a follow-up to #8691 and #8941. #8691 made this proposal possible by limiting the places where exceptions can be raised in C code.
Unfortunately, it can also break code for users with long-running C functions, who now have to change their code to poll regularly.I think this is a sound design, but users need to be accompanied and the new usage has to be documented and given support. Documentation has a PR at #8941 which will need to be updated if this PR is accepted. Support is provided by this PR (part 1), specifically limited to what would be nice to do for 4.10 (in particular providing a transition for the Coq proof assistant). (Note, there are two things: settling on an interface with nice names for the explicit polling, and in addition to that, what is needed to support Coq's use-case. But their current code relies on caml internals, so there is no obligation to be nice.)This contains a proposal for a public-facing interface. This is best read commit-per-commit, each commit log includes a detailed rationale. Let me know if the github interface makes this impractical.
To do