
Resource-safe C interface, part 1 (the 4.10 backwards-compatibility edition)#8993

Merged
gasche merged 6 commits into ocaml:trunk from gadmm:resource-safe-api-pt1
Oct 18, 2019
Conversation

@gadmm
Contributor

@gadmm gadmm commented Sep 27, 2019

This introduces new functions caml_process_pending_* (the new public-facing name for caml_check_urgent_gc) with variants that return an exception instead of raising it immediately (in the style of caml_callback_exn). The overall goal is to allow writing resource-safe C functions in the presence of asynchronous exceptions.
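As a rough illustration of the intended usage, here is a sketch under assumptions: the worker loop and buffer are hypothetical, and only the function names follow this PR's proposal. The _exn variant is what makes the cleanup possible, since no exception can be raised between the poll and the free.

```c
#include <stdlib.h>
#include <caml/mlvalues.h>
#include <caml/fail.h>
#include <caml/callback.h>
#include <caml/signals.h>

/* Hypothetical long-running primitive. caml_process_pending_actions_exn
   returns the asynchronous exception instead of raising it, so the C
   resource can be released before the exception propagates. */
CAMLprim value long_running_stub(value unit)
{
  char *buf = malloc(4096);                    /* C resource to protect */
  value exn = Val_unit;
  for (long i = 0; i < 100000000; i++) {
    /* ... do a slice of work on buf ... */
    exn = caml_process_pending_actions_exn();  /* poll regularly */
    if (Is_exception_result(exn)) break;
  }
  free(buf);                                   /* runs even when interrupted */
  return caml_raise_if_exception(exn);
}
```

Without the _exn variant, an asynchronous exception raised inside the loop would unwind straight past the free, leaking the buffer.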

This is a follow-up to #8691 and #8941. #8691 made this proposal possible by limiting the places where exceptions can be raised in C code. Unfortunately, it can also break code for users with long-running C functions, who now have to change their code to poll regularly. I think this is a sound design, but users need to be supported through the transition, and the new usage has to be documented. Documentation has a PR at #8941, which will need to be updated if this PR is accepted. Support is provided by this PR (part 1), limited specifically to what would be nice to have for 4.10 (in particular, providing a transition for the Coq proof assistant). (Note that there are two things: settling on an interface with nice names for explicit polling, and, in addition to that, what is needed to support Coq's use-case. But Coq's current code relies on caml internals, so there is no obligation to be nice.)

This contains a proposal for a public-facing interface. It is best read commit by commit; each commit log includes a detailed rationale. Let me know if the GitHub interface makes this impractical.

To do

@gadmm
Contributor Author

gadmm commented Sep 28, 2019

I think that @jhjourdan and @stedolan may be interested and have opinions on the PR.

An example of proposed use of the new interface is given at rocq-prover/rocq#10603 (comment).

@gadmm
Contributor Author

gadmm commented Sep 28, 2019

I think @dbuenzli may be interested and have opinions on the public interface.

Contributor

@jhjourdan jhjourdan left a comment


I think this PR goes in the right direction in exposing the _exn variants of these functions.

I left a few comments in the diff. In particular, I would prefer not introducing several variants of functions that essentially do the same thing; I would instead prefer using flags or optional parameters.


CAMLexport value caml_process_pending_events_exn (void)
{
if (!caml_get_urgent_to_do()) return Val_unit;
Contributor


I don't like this change: even though it may not be bogus in itself, it makes the code much less robust, for two reasons:

  • In native mode, if caml_something_to_do is not set but the minor GC limit is set to young_alloc_end, then we end up in a loop, because we never call caml_update_young_limit.
  • If there is a pending callback but caml_something_to_do is not set, then we need to wait for the next callback to get a chance of executing the first one.

Sure, if we maintain the invariant that caml_something_to_do is always set when there is a pending callback, then these two issues can never happen. But this is based on a rather subtle logic, and, moreover, accesses to caml_something_to_do and young_limit can be racy so that it's not clear that we see the right changes at the right moment.

Contributor Author


Sorry about the loop; I should have realised, since I knew about it from my work on implementing masking.

Note that whether the user is tasked with calling caml_get_urgent_to_do themselves, or this implementation does it for them, the problem is the same. The goal here is to avoid exposing assumptions in the public interface about the race between querying signals and running signal handlers. It is not yet clear to me how multicore is going to affect this, but this way we can change it more easily in the future.

What we should do here is implement it differently in native and bytecode, and in native we check for young_limit==young_alloc_end instead (I already have to introduce the symmetric function caml_urgent_notify_program for Sys.mask, see 41a8ced).

Contributor


What we should do here is implement it differently in native and bytecode, and in native we check for young_limit==young_alloc_end instead (I already have to introduce the symmetric function caml_urgent_notify_program for Sys.mask, see 41a8ced).

I would still prefer not doing this test at all. Again, because of the race, caml_something_to_do may not be up-to-date in bytecode mode. The same applies to young_limit in native mode. And, moreover, the invariant you are relying on is subtle and may get broken in the future by bogus code. I would prefer being robust.

Do we have an idea of the cost of a spurious call to the non-optimized version of this function? How much do we gain from this optimization?

Contributor Author

@gadmm gadmm Oct 8, 2019


For the cost, the argument is that you want a C interface that behaves like OCaml. For instance, you might want to remain reactive in a tight loop. In OCaml, we are happy that polling only costs one comparison in that case. The premise is that there should be a way to do exactly as OCaml does (essentially if (check()) { perform(); }).

It is true that checking for young_limit==young_alloc_end is not exactly like young_ptr < young_limit, but I think making young_limit==young_alloc_end an explicit invariant of when pending actions should be performed is technically sound. (I may also be missing your point, in that case could you please give us a more detailed explanation of your objection?)

Furthermore, bytecode is indeed a bit more complicated, given that the test is now sometimes young_ptr < young_limit and other times caml_something_to_do; but I would prefer to simplify bytecode by always checking the same condition (namely, young_limit==young_alloc_end).

I can propose to make the non-checking function internal and provide a public function that does the test before calling the internal function.

The following piece of code from ocamlcc.1.0/runtime/ocamlcc-signals.c makes me think further (it relies on internals):

  if (caml_something_to_do) {
    long ocamlcc_save_sp_offset;
    SaveSp(ocamlcc_save_sp_offset);
    sp[-frame_sz - 1] = sp[-frame_sz - 2] = sp[-frame_sz - 3] =
      sp[-frame_sz - 4] = sp[-frame_sz - 5] =
      sp[-frame_sz - 6] = Val_unit;
    OffsetSp(-frame_sz - 6);
    do {
      caml_something_to_do = 0;
      caml_process_event();
    } while (caml_something_to_do);
    ResetSp(-frame_sz - 6);
    RestoreSp(ocamlcc_save_sp_offset);
  }

If we do want to support the inner loop, it looks to me like we need to expose caml_get_urgent_to_do(), contrary to what I advocate. But I do not know whether we want to support something like it. We can keep caml_get_urgent_to_do internal, since ocamlcc already relies on internals.

We will see what I come up with.

Contributor Author


I initially wrote that in bytecode the test is sometimes young_ptr < young_limit, but this currently never happens, because young_limit == young_alloc_end. Similarly, in native code we can get to caml_garbage_collection by exceeding caml_memprof_young_trigger. In that case, as I now understand, it is not only brittle but wrong to check young_limit == young_alloc_end.

Contributor


My point is that, in rare situations, caml_something_to_do and young_limit may not be set even though there is an async callback waiting. I can foresee several reasons:

  • Because of the race, writes to these variables may not be visible to the OCaml thread at the right moment. Imagine we get into caml_check_urgent_gc when caml_something_to_do is seen as set but caml_pending_signals is not: we may miss a signal in the first call to caml_check_urgent_gc and not check for it later, because we return from that function even before checking caml_pending_signals.
  • The fact that caml_something_to_do is set when there are pending callbacks relies on subtle invariants which we will easily break in the future.
  • If there are two different callbacks (say, a signal and a finalizer) arriving at the same time, then we set caml_something_to_do to 0 when starting the first one while the second one is still waiting. We may want to run the second one if the first one is taking long.

In the current situation, the code is robust in the sense that async callbacks are called at the next minor collection even if, for some reason, caml_something_to_do is not set properly. With what you are proposing, we have to wait for the next async callback, which may not come at all.

I can propose to make the non-checking function internal and provide a public function that does the test before calling the internal function.

Perhaps what we should do then is make caml_something_to_do part of the public interface and document that this is what one has to check before calling caml_check_urgent_gc. This would cover both the ocamlcc and Coq use cases, and make polling very cheap, as you desire.

Note, however, that because of the reasons I described above, it may be possible that we miss async callbacks in long-running C code. I guess that's the price to pay?

Contributor Author


You explain well why we need to distinguish the function called internally from the function provided to the user (which will be guarded by a check to caml_something_to_do).

With masking, we will need to make it a hard invariant that caml_something_to_do has to be set if there are pending callbacks (outside the case of process_pending* where it is temporarily set to zero and set back to 1 later if needed). I think it is better to start doing it now.

In the case of a finaliser taking a long time while a signal is pending and caml_something_to_do is temporarily 0 (or vice-versa), we can think further about it but I think this is outside the scope.

These reasons convince me to not make public a check for caml_something_to_do, so that we keep the races under our control and it is easier to fix any issue later.

I am not sure we want to support this exact code of ocamlcc, because I don't think there is a meaning to "please run signals until there are no more".

I am not sure I see under which conditions async callbacks could be forgotten, what scenario do you think about exactly?

Contributor


With masking, we will need to make it a hard invariant that caml_something_to_do has to be set if there are pending callbacks (outside the case of process_pending* where it is temporarily set to zero and set back to 1 later if needed). I think it is better to start doing it now.

Why so? Can't we simply call caml_check_urgent_gc after any unmasking?

These reasons convince me to not make public a check for caml_something_to_do, so that we keep the races under our control and it is easier to fix any issue later.

What about exposing a getter then? This would make it easy to change its content if needed. This getter could even be an inline function, for performance.

I am not sure we want to support this exact code of ocamlcc, because I don't think there is a meaning to "please run signals until there are no more".

That's not what worries me in ocamlcc's use case. What worries me is that it seems important for its performance not to register its extra roots and perform the function call at every single check for signals.

I am not sure I see under which conditions async callbacks could be forgotten, what scenario do you think about exactly?

I don't get your question: I thought my previous message was all about answering it... If caml_something_to_do is 0 when there is a pending async callback, then, with your proposal, we would need to wait for the next async callback to arrive before executing the first one. To me, this is essentially as if we lost it.

Contributor


Just to make things clear: my current proposal is to expose in the public interface a getter for caml_something_to_do, and document that it can be used for fast polling in C code. This getter could be an inline function in the signals.h file to make its cost minimal. At every minor GC, we would not check caml_something_to_do (i.e., we would still process pending actions unconditionally), just to make sure we don't miss async callbacks.

Do you agree with this proposal? If not, then, what is the issue? It seems to me this addresses all your concerns.

Contributor Author

@gadmm gadmm Oct 11, 2019


Ok, I thought you meant that some signal would be dropped. Now I understand the issue with long-running C code; this is well spotted. I propose the following fix: allow users to obtain the same fallback behaviour as from OCaml, if they cooperate, by setting caml_something_to_do at every minor collection or major slice in CAML_FROM_C mode.

Waiting for a minor collection or a major slice is still bad for latency. It would be nice to fix the first race you described between caml_something_to_do and caml_pending_signals (this is beyond my expertise). I imagine that something will have to be done anyway with multicore (I thought I saw it mentioned at some point but could not find a dedicated bug report). (I am also thinking that caml_signals_are_pending might now be redundant with caml_something_to_do, but I am not sure I should do something about it right now.)

I am wary of exposing a getter for caml_something_to_do as much as exposing caml_something_to_do itself: I want to avoid hardcoding backwards-compatibility support for the form if (check()) { execute(); } and everything else users can come up with. This would mean losing control over the critical section between check and execute, limiting future evolution. In particular this limits the space of possible solutions of the two problems and of any other that could arise (allow me to remain abstract; if I had a concrete solution in mind right now I would propose it). Anyway, I do not see a need for it compared to your proposal now that we can replicate the fallback behaviour in caml_process_pending_actions.

Ocamlcc is a special case because it is an OCaml compiler, so it is not excessive to ask that it keep relying on internals. It is like saying that the public interface does not support the use-case of runtime/interp.c either. The goal of this new API is not to support the creation of alternative OCaml compilers, as this could require making public many more internal runtime identifiers beyond those.

I hope the new version of the code addresses all your related concerns (I am waiting for tests to finish). I am documenting the invariant for caml_something_to_do so that we go towards fixing it and not breaking it.

(About unmasking: we do not execute callbacks right away, for crucial reasons related to reasoning about critical sections for polling. I gave some thought to what made OCaml's polling behaviour nice to reason about in my experience, and I wrote it down here: ocaml-multicore/ocaml-multicore#187 (comment). It is almost entirely about your question. But I guess you just want to poll for signals unconditionally when unmasking and not rely on caml_something_to_do, to the extent that it is unreliable: that is something else and very reasonable; do not hesitate to bring up this point again at #8961.)

@jhjourdan
Contributor

And, BTW, thanks for this PR, this is quite a lot of well-polished work.

@gadmm
Contributor Author

gadmm commented Oct 2, 2019

I'll get back to this shortly. In the meantime, I note that some comments are about choices for a public-facing interface. I would like to mention that anybody should feel free to contribute to these, not just the experts on this code.

@gadmm
Contributor Author

gadmm commented Oct 7, 2019

Reminder for myself again: we agreed to change the default behavior of caml_check_urgent_gc over at #8691. It has to be done for 4.10 so I will have to split it in case the current PR is not ready for 4.10. (ETA: during this week)

@jhjourdan
Contributor

Do you need help to get this merged this week?

@gadmm
Contributor Author

gadmm commented Oct 7, 2019 via email

@stedolan
Contributor

stedolan commented Oct 8, 2019

This PR looks nice, but it adds quite a few more functions to the C API. In particular, what's the motivation for providing both caml_process_pending_gc and caml_process_pending_gc_with_root?

If it's purely to avoid the overhead of adding an extra root, then I'd like to suggest the following implementation of caml_process_pending_gc(value) instead:

value caml_process_pending_gc(value extra_root)
{
  if (Caml_state->requested_major_slice || Caml_state->requested_minor_gc) {
    CAMLparam1 (extra_root);
    CAML_INSTR_INT ("force_minor/caml_process_pending_gc@", 1);
    caml_gc_dispatch();
    CAMLdrop;
  }
  return extra_root;
}

This version incurs the small overhead of adding a root only when actually invoking the GC, which is an expensive enough operation that the overhead doesn't matter.
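For context, a primitive calling this variant could look like the following hypothetical sketch (my_make_bytes is an invented name; it assumes the caml_process_pending_gc signature proposed above):

```c
#include <caml/mlvalues.h>
#include <caml/alloc.h>

/* Hypothetical primitive: on the fast path the result is never
   registered as a root; rooting happens inside caml_process_pending_gc
   only when the GC actually needs to run. */
CAMLprim value my_make_bytes(value len)
{
  value res = caml_alloc_string(Long_val(len));
  return caml_process_pending_gc(res);  /* may run the GC; res kept alive */
}
```

The caller stays as cheap as a plain return in the common case, which is the point of folding the root registration into the slow path.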

@gadmm
Contributor Author

gadmm commented Oct 8, 2019

The motivation was backwards compatibility: the with_root version was an alias for caml_check_urgent_gc. The one I would like to expose does not take an additional root (and the reason it's enough is close to your suggestion that additional roots are easy to add locally). Anyway, the decision to change the semantics of caml_check_urgent_gc will bring a few changes and simplifications. I'll keep your tip in mind to use if need be.

@stedolan stedolan mentioned this pull request Oct 9, 2019
@gadmm gadmm force-pushed the resource-safe-api-pt1 branch from 1aa3356 to 6bafd46 Compare October 11, 2019 16:24
@gadmm
Contributor Author

gadmm commented Oct 11, 2019

Here is the new version. I will open a different PR for the commit that changes caml_check_urgent_gc, because it is a separate entry in the Changes file. The current PR is based on top of that other one.

I have to admit that I do not understand the CAMLno_tsan business; I would appreciate it if you could have a look and explain any possible mistake.

@gadmm
Contributor Author

gadmm commented Oct 11, 2019

The first 4 commits are proposed at #9037. This PR and review concern the last 5 commits.

@gadmm gadmm force-pushed the resource-safe-api-pt1 branch from 6bafd46 to 85a3eaf Compare October 11, 2019 18:45
@jhjourdan
Contributor

I have to admit that I do not understand the CAMLno_tsan business; I would appreciate it if you could have a look and explain any possible mistake.

Don't worry too much about that, CI will tell you if you do it wrong.

Contributor

@jhjourdan jhjourdan left a comment


The part of this PR which is not also in #9037 looks good. I have minor comments, nothing important. Once #9037 is merged, this one will be ready to merge.

Comment on lines +54 to +64
FIXME:

* We could get into caml_process_pending_actions when
caml_something_to_do is seen as set but not
caml_pending_signals, making us miss the signal.

* If there are two different callbacks (say, a signal and a
finaliser) arriving at the same time, then we set
caml_something_to_do to 0 when starting the first one while the
second one is still waiting. We may want to run the second one
if the first one is taking long.
Contributor


Not sure this deserves a FIXME. The current solution is rather good, and I don't think there is anything better we could do.

Contributor Author


IIUC, in these situations, instead of reacting at the next allocation, it reacts on the order of 10^5 to 10^6 allocations later. For finalisers the total latency remains on the same order, but for signals this is bad. In particular, have a look at the part of ocaml-multicore/ocaml-multicore#187 (comment) which says "guaranteed interrupt at fixed delays... Based on the maximum latency of interrupt required...". From that point of view of guaranteed latency, this is still buggy.

The second bullet point can be solved if we call signal handlers first, what do you think about it?

It would be nice to fix the first one as well.

Contributor


The second bullet point can be solved if we call signal handlers first, what do you think about it?

Alright, let's do that. Note, however, that the issue is still there for finalisers or memprof. That's sort of an unsolvable problem.

Anyway, both issues will only occur on very rare occasions, and we have a workaround. And it's unlikely we will find a "fix" any time soon. Hence I don't think the "FIXME" is deserved.

But that's a detail, I don't care, let's move forward.

Contributor Author


The second bullet point has been fixed. Signal handlers are now safe from this (newly-introduced, AFAIU) race. It is no longer a FIXME, because it is not a bug for finalisers (their latency remains on the same order of magnitude), and it is not a bug for memprof callbacks either, because although we want them to run ASAP, they are served on a best-effort basis.

I keep the first item as a FIXME because it is a bug if we take into account code that tries to allocate little, or if there are latency guarantees as assumed in the work on better safe points. If it is a deviation from the intended behaviour and does not appear easily solvable with the current mechanism, then I think a FIXME is the best option.

@jhjourdan
Contributor

Windows CI is failing; it is probably due to a segfault, but I am not sure. Let's see what happens when you update this.

@damiendoligez damiendoligez added this to the 4.10 milestone Oct 15, 2019
@gadmm gadmm force-pushed the resource-safe-api-pt1 branch from 85a3eaf to b42176f Compare October 15, 2019 11:03
@jhjourdan
Contributor

I am OK with removing the functions with the extra root. Here is the patch I propose:

diff --git a/runtime/array.c b/runtime/array.c
index f84e365fc..bf509e36f 100644
--- a/runtime/array.c
+++ b/runtime/array.c
@@ -279,8 +279,17 @@ CAMLprim value caml_floatarray_create(value len)
   else {
     result = caml_alloc_shr (wosize, Double_array_tag);
   }
+
   // Give the GC a chance to run, and run memprof callbacks
-  return caml_check_urgent_gc_and_callbacks (result);
+  // This is an inlined version of [caml_process_pending_actions], to
+  // avoid unconditionally registering result as a root.
+  if(caml_something_to_do) {
+    CAMLparam1(result);
+    caml_raise_if_exception(caml_do_urgent_gc_and_callbacks_exn());
+    CAMLdrop;
+  }
+
+  return result;
 }
 
 /* [len] is a [value] representing number of words or floats */
@@ -331,7 +340,8 @@ CAMLprim value caml_make_vect(value len, value init)
     }
   }
   // Give the GC a chance to run, and run memprof callbacks
-  CAMLreturn (caml_check_urgent_gc_and_callbacks(res));
+  caml_process_pending_actions();
+  CAMLreturn (res);
 }
 
 /* [len] is a [value] representing number of floats */
@@ -384,7 +394,8 @@ CAMLprim value caml_make_array(value init)
         Store_double_flat_field(res, i, d);
       }
       // run memprof callbacks
-      CAMLreturn (caml_check_urgent_gc_and_callbacks(res));
+      caml_process_pending_actions();
+      CAMLreturn (res);
     }
   }
 #else
@@ -443,7 +454,7 @@ CAMLprim value caml_array_blit(value a1, value ofs1, value a2, value ofs2,
   }
   /* Many caml_modify in a row can create a lot of old-to-young refs.
      Give the minor GC a chance to run if it needs to. */
-  caml_check_urgent_gc(Val_unit);
+  caml_process_pending_actions();
   return Val_unit;
 }
 
@@ -522,7 +533,7 @@ static value caml_array_gather(intnat num_arrays,
     /* Many caml_initialize in a row can create a lot of old-to-young
        refs.  Give the minor GC a chance to run if it needs to.
        Run memprof callback for the major allocation. */
-    res = caml_check_urgent_gc_and_callbacks (res);
+    caml_process_pending_actions();
   }
   CAMLreturn (res);
 }
diff --git a/runtime/caml/signals.h b/runtime/caml/signals.h
index c2eb9f7ce..3e7872ef3 100644
--- a/runtime/caml/signals.h
+++ b/runtime/caml/signals.h
@@ -76,7 +76,6 @@ void caml_record_signal(int signal_number);
 value caml_process_pending_signals_exn(void);
 void caml_set_action_pending (void);
 value caml_do_urgent_gc_and_callbacks_exn (void);
-value caml_check_urgent_gc_and_callbacks (value extra_root); // raises
 int caml_set_signal_action(int signo, int action);
 void caml_setup_stack_overflow_detection(void);
 
diff --git a/runtime/intern.c b/runtime/intern.c
index e49e20d32..0637b27db 100644
--- a/runtime/intern.c
+++ b/runtime/intern.c
@@ -693,6 +693,7 @@ static header_t* intern_add_to_heap(mlsize_t whsize)
 }
 
 static value intern_end(value res, mlsize_t whsize) {
+  CAMLparam1(res);
   header_t *block = intern_add_to_heap(whsize);
   header_t *blockend = intern_dest;
 
@@ -705,7 +706,8 @@ static value intern_end(value res, mlsize_t whsize) {
     caml_memprof_track_interned(block, blockend);
 
   // Give gc a chance to run, and run memprof callbacks
-  return caml_check_urgent_gc_and_callbacks(res);
+  caml_process_pending_actions();
+  CAMLreturn(res);
 }
 
 /* Parsing the header */
diff --git a/runtime/interp.c b/runtime/interp.c
index b3cea3d6b..ac549b455 100644
--- a/runtime/interp.c
+++ b/runtime/interp.c
@@ -913,7 +913,7 @@ value caml_interprete(code_t prog, asize_t prog_size)
 
     process_actions:
       Setup_for_event;
-      caml_check_urgent_gc_and_callbacks(Val_unit);
+      caml_process_pending_actions();
       Restore_after_event;
       Next;
 
diff --git a/runtime/meta.c b/runtime/meta.c
index d536b68bc..282833287 100644
--- a/runtime/meta.c
+++ b/runtime/meta.c
@@ -186,7 +186,8 @@ CAMLprim value caml_realloc_global(value size)
       Field (new_global_data, i) = Val_long (0);
     }
     // Give gc a chance to run, and run memprof callbacks
-    caml_global_data = caml_check_urgent_gc_and_callbacks(new_global_data);
+    caml_global_data = new_global_data;
+    caml_process_pending_actions();
   }
   return Val_unit;
 }
diff --git a/runtime/obj.c b/runtime/obj.c
index 5c77c2893..d73595dc0 100644
--- a/runtime/obj.c
+++ b/runtime/obj.c
@@ -120,7 +120,7 @@ CAMLprim value caml_obj_with_tag(value new_tag_v, value arg)
     res = caml_alloc_shr(sz, tg);
     for (i = 0; i < sz; i++) caml_initialize(&Field(res, i), Field(arg, i));
     // Give gc a chance to run, and run memprof callbacks
-    res = caml_check_urgent_gc_and_callbacks(res);
+    caml_process_pending_actions();
   }
   CAMLreturn (res);
 }
diff --git a/runtime/signals.c b/runtime/signals.c
index 8f78dfc77..04c18a1db 100644
--- a/runtime/signals.c
+++ b/runtime/signals.c
@@ -122,7 +122,7 @@ void caml_set_action_pending(void)
 /* Record the delivery of a signal, and arrange for it to be processed
    as soon as possible:
    - via caml_something_to_do, processed in
-     caml_check_urgent_gc_and_callbacks_exn.
+     caml_process_pending_actions_exn.
    - by playing with the allocation limit, processed in
      caml_garbage_collection and caml_alloc_small_dispatch.
 */
@@ -324,34 +324,16 @@ exception:
   return exn;
 }
 
-CAMLno_tsan /* The access to [caml_something_to_do] is not synchronized. */
-static inline value caml_check_urgent_gc_and_callbacks_exn(value extra_root)
+CAMLexport inline value caml_process_pending_actions_exn(void)
 {
-  if (caml_something_to_do) {
-    CAMLparam1(extra_root);
-    value exn = caml_do_urgent_gc_and_callbacks_exn();
-    if (Is_exception_result(exn))
-      CAMLreturn(exn);
-    CAMLdrop;
-  }
-  return extra_root;
-}
-
-value caml_check_urgent_gc_and_callbacks(value extra_root)
-{
-  value res = caml_check_urgent_gc_and_callbacks_exn(extra_root);
-  return caml_raise_if_exception(res);
-}
-
-CAMLexport value caml_process_pending_actions_exn(void)
-{
-  return caml_check_urgent_gc_and_callbacks_exn(Val_unit);
+  if (caml_something_to_do)
+    return caml_do_urgent_gc_and_callbacks_exn();
+  return Val_unit;
 }
 
 CAMLexport void caml_process_pending_actions(void)
 {
-  value exn = caml_check_urgent_gc_and_callbacks_exn(Val_unit);
-  caml_raise_if_exception(exn);
+  caml_raise_if_exception(caml_process_pending_actions_exn());
 }
 
 /* OS-independent numbering of signals */
diff --git a/runtime/weak.c b/runtime/weak.c
index e9234bdfc..ee5094cc5 100644
--- a/runtime/weak.c
+++ b/runtime/weak.c
@@ -110,9 +110,12 @@ CAMLexport value caml_ephemeron_create (mlsize_t len)
 
 CAMLprim value caml_ephe_create (value len)
 {
-  value res = caml_ephemeron_create(Long_val(len));
+  CAMLparam0();
+  CAMLlocal1(res);
+  res = caml_ephemeron_create(Long_val(len));
   // run memprof callbacks
-  return caml_check_urgent_gc_and_callbacks(res);
+  caml_process_pending_actions();
+  CAMLreturn(res);
 }
 
 CAMLprim value caml_weak_create (value len)
@@ -294,7 +297,7 @@ static value optionalize(int status, value *x)
   }
   // run memprof callbacks both for the option we are allocating here
   // and the calling function.
-  caml_check_urgent_gc_and_callbacks(Val_unit);
+  caml_process_pending_actions();
   CAMLreturn(res);
 }

I had to add roots at three places:

  • in intern_end and caml_ephe_create: since the overhead of these functions is already quite large, I doubt adding a root can cause any significant performance change
  • in caml_floatarray_create: I am not so sure whether or not this can really have a performance impact on an actual program. Just in case, I added a manual check of caml_something_to_do before registering the root. The performance degradation is about 5% on the following unrealistic micro-benchmark:
let () =
  for i = 0 to 100000000 do
    ignore (Float.Array.create 2)
  done

I think this is acceptable. Nobody will have such a tight allocation pattern.

Do you agree with this change?

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

Ok, I worked on a patch in the meanwhile. Let us cross-check our patches.

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

I'm OK with the change, but for caml_floatarray_create, is the performance degradation with or without the conditional rooting? I agree that nobody will have code like that, so I would prefer to avoid inlining.

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

Is the change to caml_array_blit intentional?

@gasche
Member

gasche commented Oct 17, 2019

Re caml_float_array_create, on the nay-sayer side:

  • I could think of some (strange but realistic) programming patterns that involve creating lots of small float arrays (e.g. representing 2D points as 2-sized arrays, maybe as the natural result of instantiating a dimension-generic library) with only a couple of float operations between them.
  • This benchmark is done with the current implementation of the root-registration API. I can't tell if other GC designs could make it costlier (my understanding is that Multicore reuses the same minor heaps, but maybe there is some extra thread-local-access cost?); we have also discussed safer root-manipulation APIs (e.g. camlroot) which would have a different cost profile, for example in a specific debug mode.

The new API proposed by Guillaume looks nice, and I'm not saying this to discourage you from simplifying things -- a 5% degradation on a micro-benchmark does sound like a reasonable price for better resource safety in the runtime. But optimizing the runtime code also makes sense.

@jhjourdan
Contributor

Is the change to caml_array_blit intentional?

No, that's a mistake.

@jhjourdan
Contributor

I'm OK with the change, but for caml_floatarray_create, is the performance degradation with or without the conditional rooting?

It's with the conditional rooting. I am not sure whether it is noise (the stack frame is different, which can have an impact on alignment), or whether there is another effect (the code is larger, with an impact on the code cache; the stack frame is larger...).

@jhjourdan
Contributor

In any case, I don't think the performance degradation of either inlining or not inlining is worth caring about. So do as you prefer.

But I don't really like the idea of keeping the two aliases just for one function used in one very particular way. That is not to mention the fact that virtually nobody uses floatarray directly (people use float array, which is created using Array.* instead).

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

It's with the conditional rooting. I am not sure whether it is noise (the stack frame is different, this can have impact on alignment), or whether there is another effect (the size of the code is larger -> impact on code cache, the stack frame is larger...).

I can believe this, and many other effects. There was an interesting talk at Strange Loop about this (https://www.youtube.com/watch?v=r-TLSBdHe1A, I especially like the part where a difference in benchmarks was due to them running in sessions with user names with different lengths).

BTW it has nothing to do with the absence of CAMLno_tsan? That does not affect generated code?

IIUC you also need to add a root to caml_array_gather, used via Array.sub, Array.append, and Array.concat. I am not sure either whether the impact is worth caring about, but with two occurrences it arguably justifies a dedicated function. BTW I do not like it either, because I would like to remove all implicitly-raising internal functions; so if you tell me that the impact for those three functions is negligible without inlining by hand as well, then I prefer to remove it. In the meantime, and also because this function perhaps has some convenience value, here is a new version I am satisfied with.

@gadmm gadmm force-pushed the resource-safe-api-pt1 branch from 3eab45c to 4845e9c Compare October 17, 2019 14:06
@gadmm
Contributor Author

gadmm commented Oct 17, 2019

This benchmark is done with the current implementation of the root-registration API. I can't tell if other GC designs could make it costlier

According to @stedolan, there is no reason that making a GC root should be costly, if somebody wanted to work on it.

@jhjourdan
Contributor

BTW it has nothing to do with the absence of CAMLno_tsan? That does not affect generated code?

This only affects the generated code when the thread sanitizer is activated. Well, at least, I would consider anything else to be a bug...

IIUC you also need to add a root to caml_array_gather, used via Array.sub, Array.append, and Array.concat.

Indeed, well spotted!

BTW I do not like it either because I would like to remove all implicitly-raising internal functions, so if you tell me that the impact for those three functions is negligible as well without inlining by hand then I prefer to remove it.

For caml_array_gather, the infrastructure around it is much heavier, and the function already registers some roots, so the overhead is really negligible.

About caml_floatarray_create, if we don't inline the function, the impact on said micro-benchmark is closer to 20%. Given that this function has been specifically optimized (cf. #6180), I would prefer not doing this.

So, I see two solutions:

  • Either do the inlining as I proposed above
  • Keep the function with the extra root as you propose, but then please give it a name which makes it clear that it is very close to the other (e.g., a with_root suffix), and define it from the process_actions functions instead of the other way around, to avoid uselessly registering a root containing Val_unit when not needed.
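A minimal sketch of what that second option could look like (hypothetical code, not the PR's actual implementation; it assumes the internals caml_something_to_do and the CAMLparam macros, together with the _exn function this PR introduces):

```c
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/signals.h>

/* Hypothetical sketch of a "with_root" variant built on top of
   caml_process_pending_actions_exn.  extra_root is registered as a GC
   root only when there is actually something to do, so the fast path
   stays root-free. */
CAMLexport value caml_process_pending_actions_with_root_exn(value extra_root)
{
  if (caml_something_to_do) {
    CAMLparam1(extra_root);   /* protect across callbacks and GC */
    value exn = caml_process_pending_actions_exn();
    if (Is_exception_value(exn))
      extra_root = exn;       /* propagate the exception to the caller */
    CAMLreturnT(value, extra_root);
  }
  return extra_root;          /* nothing to do: no root registered */
}
```

The caller then checks the returned value with Is_exception_value (or passes it through caml_raise_if_exception), exactly as with the other _exn functions.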

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

About caml_floatarray_create, if we don't inline the function, the impact on said micro benchmark is closer to 20%. Given that this function has been specifically optimized (cf #6180) , I would prefer not doing this.

Ah good to know, thanks for digging up a reference.

Keep the function with the extra root as you propose, but then please give it a name which make it clear that it is very close to the other (e.g., with_root suffix),

Yes, see the latest version, this is what I was now proposing.

and define it from the process_actions functions instead of the other way around, to avoid uselessly registering a root containing Val_unit if not needed.

I already use Stephen's trick to make it negligible. I think it is fine as it is.

Contributor

@jhjourdan jhjourdan left a comment


Apart from the bug in caml_array_gather, I'm happy with this version.

I have a minor comment about the name caml_do_actions, which I don't find consistent with the other names.

Another point: forget my recent proposal to change the definition of caml_process_pending_actions. I had not thought it through; what I was thinking about does not work.

gadmm added 4 commits October 17, 2019 20:14
Introduce caml_process_pending_actions and
caml_process_pending_actions_exn: a variant of the former which does
not raise but returns a value that has to be checked against
Is_exception_value.

I keep the current conventions from caml_callback{,_exn}: For a
resource-safe interface, we mostly care about the _exn variants, but
every time there is a public _exn function I provide a function that
raises directly for convenience.

They are introduced and documented in caml/signals.h.

Private functions are converted to their _exn variant on the way as
needed: for internal functions of the runtime, it is desirable to go
towards a complete elimination of functions that raise implicitly.

Get rid of the distant logic of caml_raise_in_async_callback. Instead,
caml_process_pending_events takes care itself of its something_to_do
"resource". This avoids calling the former function in places
unrelated to asynchronous callbacks.
point following every minor collection or major slice.

Also run signal handlers first.

Indeed, in some cases, caml_something_to_do is not reliable (spotted
by @jhjourdan):

* We could get into caml_process_pending_actions when
  caml_something_to_do is seen as set but not caml_pending_signals,
  making us miss the signal.

* If there are two different callbacks (say, a signal and a finaliser)
  arriving at the same time, then we set caml_something_to_do to 0
  when starting the first one while the second one is still waiting.
  We may want to run the second one if the first one is taking long.

In the latter case, the additional fix is to favour signals, which
have a lower latency requirement, whereas the latency of finalisers
keeps the same order of magnitude, and memprof callbacks are served on
a best-effort basis.
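To illustrate the intended usage of the interface described in the first commit message, here is a hedged sketch of a resource-safe, long-running C stub (hypothetical code: alloc_buffer, do_some_work, and free_buffer are made-up helpers; caml_process_pending_actions_exn, Is_exception_value, and caml_raise_if_exception are the functions from this PR):

```c
#include <caml/mlvalues.h>
#include <caml/memory.h>
#include <caml/signals.h>

/* Hypothetical long-running stub.  The C buffer must be released even
   if an asynchronous exception (e.g. Sys.Break) is delivered by a
   pending action, so we poll with the _exn variant and clean up
   before re-raising. */
CAMLprim value caml_long_computation(value v_n)
{
  CAMLparam1(v_n);
  long n = Long_val(v_n);
  char *buf = alloc_buffer(n);        /* made-up C resource */
  for (long i = 0; i < n; i++) {
    do_some_work(buf, i);             /* made-up work item */
    /* Poll regularly: the _exn variant returns the exception instead
       of raising, giving us a chance to free the buffer first. */
    value exn = caml_process_pending_actions_exn();
    if (Is_exception_value(exn)) {
      free_buffer(buf);               /* release before raising */
      caml_raise_if_exception(exn);   /* raises; does not return */
    }
  }
  free_buffer(buf);
  CAMLreturn(Val_unit);
}
```

With the plain caml_process_pending_actions (or with implicit raising before #8691), the exception would propagate out of the stub before free_buffer could run, leaking the buffer.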
@gadmm gadmm force-pushed the resource-safe-api-pt1 branch from 4845e9c to 890bb90 Compare October 17, 2019 18:24
@gadmm
Contributor Author

gadmm commented Oct 17, 2019

Roadmap:

  • Corner cases
  • Review of technical soundness
  • Review of public interface proposal
  • Fine-tuning
  • My question about backwards-compatibility for caml/mlvalues.h

@jhjourdan
Contributor

My question about backwards-compatibility for caml/mlvalues.h

I don't think there is any issue with this.

@jhjourdan
Contributor

The PR is ready as is. It can be merged if CI passes.

@gadmm
Contributor Author

gadmm commented Oct 17, 2019

Green light on my side as well. I am very satisfied with the final state of the PR, especially with the important improvements brought via Jacques-Henri's review. Thanks!

Member

@gasche gasche left a comment


Approved on jhjourdan's behalf.

