Use a bit-vector to record pending signals#10880
Conversation
This speeds up `caml_check_for_pending_signals` and the bytecode interpreter significantly.
|
Running the testsuite with the current trunk: Note that we spend as much time checking for signals as interpreting bytecode instructions! |
|
Running the testsuite with this PR: |
There was a problem hiding this comment.
This is a good optimisation and I can't see any issues.
One further possible change. Do we need sequential consistency on the load in caml_check_for_pending_signals? Relaxed may be sufficient, we re-load with seq_cst in caml_process_pending_signals_exn.
Yes, this sounds very plausible to me, and it matches what you did in lf_skiplist.c . Maybe an expert like @maranget can confirm. In the meantime I'll update this PR to use relaxed loads. |
|
Looking at the generated assembly code, I realized that Linux has |
|
@xavierleroy may I ask what command you had used to generate these results? It seems quite useful and important. |
|
This is good old |
ctk21
left a comment
There was a problem hiding this comment.
I like the change, very sensible. I believe it works as intended.
otherlibs/unix/signals.c
Outdated
| #define NSIG 64 | ||
| #endif | ||
|
|
||
| #define BITS_PER_WORD (sizeof(uintnat) * 8) |
There was a problem hiding this comment.
The definitions for BITS_PER_WORD and NSIG_WORDS are duplicated (here and signals.c).
Is there a reason not to define them once in signals.h alongside caml_pending_signals?
There was a problem hiding this comment.
I tried but it causes a clash in yacc/defs.h (!), which defines its own BITS_PER_WORD macro, and indirectly includes <caml/signals.h> (why?), and defines CAML_INTERNAL (why??). I gave up...
There was a problem hiding this comment.
You're right, there's something offensive going on. I believe the issue to be that caml/platform.h is (unnecessarily) including caml/signals.h. This diff should fix it:
diff --git a/runtime/caml/platform.h b/runtime/caml/platform.h
index 68a5875a6..67d32316e 100644
--- a/runtime/caml/platform.h
+++ b/runtime/caml/platform.h
@@ -23,7 +23,6 @@
#include <string.h>
#include "config.h"
#include "mlvalues.h"
-#include "signals.h"
#if defined(MAP_ANON) && !defined(MAP_ANONYMOUS)
#define MAP_ANONYMOUS MAP_ANON
diff --git a/runtime/sync.c b/runtime/sync.c
index d9e54999d..d567c51c8 100644
--- a/runtime/sync.c
+++ b/runtime/sync.c
@@ -25,6 +25,7 @@
#include "caml/fail.h"
#include "caml/memory.h"
+#include "caml/signals.h"
#include "caml/sync.h"
#include "caml/eventlog.h"
There was a problem hiding this comment.
OK, this unblocks the situation indeed. See the latest push on this PR.
Depending on how keen we are to keep it in one word on Linux, we've got some wasted space at 0.. |
Yes, I was thinking along these lines! POSIX 2008 says that signal numbers are positive integers, and that "The value 0 is reserved for use as the null signal (see kill())", so we can fit signals 1 to 64 into one word... Stay tuned! |
POSIX 2008 says that signal numbers are positive integers, and that signal number 0 means "no signal" and is reserved for use by kill(). We take advantage of this to get a smaller caml_pending_signals array under Linux (which has NSIG = 65). The check on signal numbers in caml_install_signal_handler is tightened to reject the signal = 0 case early. (Before, it would cause a Sys_error as sigaction() would return EINVAL.)
caml_record_signal does the check already.
|
OK, I just pushed a version where signal numbers start at 1, not at 0. The compiled C code is even nicer than before :-) I noticed other opportunities for simplifications in signal handling, but this will be for another PR. |
|
Updated code looks correct. One last thing we could do (which came out of a conversation with @dra27) is use ctz for the index to start checking from. Might be an over-optimisation though. |
Also: if CAML_INTERNAL, the standard header file <signal.h> is now always included, whether POSIX signals are supported or not. That's because the standard header should have a chance to define NSIG before <caml/signals.h> provides its fallback definition.
|
I did consider clz / find first bit set (it's a canonical example of use of these instructions!) but they are not portable, i.e. we'd have a builtin function for GCC / Clang and a library function or hand-written code for MSVC. |
MSVC has |
|
For context, polling in bytecode is not supposed to check all signals one by one. The performance impact is more a symptom of the regressions I commented about in my review of the multicore PR (#10831 (comment)). In fact, why should comparing the young_pointer to the young_limit in However anything that rids us of the |
| for (i = 0; i < NSIG; i++) { | ||
| if (atomic_load_explicit(&caml_pending_signals[i], memory_order_seq_cst)) | ||
| for (i = 0; i < NSIG_WORDS; i++) { | ||
| if (atomic_load_explicit(&caml_pending_signals[i], memory_order_relaxed)) |
There was a problem hiding this comment.
One particular ordering to think about: when a signal is recorded, caml_pending_signals is set atomically and then young_limit is set with a release store. This message is received when the comparison young_limit <= young_pointer fails, after which caml_check_for_pending_signals ends up being called via caml_process_pending_signals.
Now the comparison with young_limit is done without synchronisation, so caml_pending_signals could still be seen as unset after the comparison fails. I am not sure this actually happens currently, but only for accidental reasons (an unrelated acquire load of young_limit in handle_gc_interrupts which happens to be called before caml_process_pending_signals inside caml_process_pending_actions, or a whole lot of other work happenning between the two; even so I am not sure how to reason about it).
It would be clearer and more robust to change the comparison pattern to upgrade the relaxed load into an acquire load when it fails using a fence, e.g.
if (young_pointer < young_limit) {
atomic_thread_fence(memory_order_acquire);
…
}
(the code in old trunk was organised in such a way that there is a common entry point to the GC which would be good to bring back, in this case it is an obvious place for such a fence).
There was a problem hiding this comment.
As far as this PR is concerned one solution is to add this fence at the top caml_check_for_pending_signals, but I can move this comment to my review of the multicore PR if you want. (I already commented there about the choice of a volatile read of young_limit instead of an atomic relaxed read which is UB, AFAIU.)
There was a problem hiding this comment.
We could put an "acquire" fence at the beginning of caml_check_for_pending_signals, or use "acquire" loads instead of "relaxed" loads. But it's not clear to me it's doing much of a difference. There will always be races that cause a pending signal to not be processed immediately, no?
Since I'm clearly out of my league here, I'd appreciate experts to chime in and tell me what to do. This PR speeds up the bytecode interpreter considerably, so I'd really like to see it merged soon.
There was a problem hiding this comment.
We're not doing an acquire on young_limit, which I think opens up the theoretical possibility of seeing young_limit updated without caml_pending_signals being set and then resetting young_limit after the check. This would delay processing the pending signal until young_limit is updated again (another signal, stop-the-world of some kind or we enter a blocking section). I think it might be possible to engineer a scenario where this leads to never processing a SIGINT (loop with a safepoint poll and nothing else?).
I think this can happen anyway at the moment. A signal could be delivered at any time before https://github.com/xavierleroy/ocaml/blob/79fb0b0ddaca1051106fcd540b5e82960449218d/runtime/minor_gc.c#L635 and the resulting young_limit would be over-written. I'll add that to the list of signalling issues, we probably need a caml_check_for_pending_signals at the very end of a stw section.
I don't think the memory orderings need hold up the PR, it doesn't worsen the existing situation and we can look to remedy things in the follow-up with the rest of @gadmm 's identified issues on the multicore PR.
There was a problem hiding this comment.
- I'm not sure it can happen at the moment in practice (series of seq_cst loads, the first of which is for signal 0). I agree it's a bit independent from this PR because I'd want the fence anyway so that correctness follows from the structure of the program.
- I'm fine with a fence at the top of
check_for_pending_signalsand we refine it later (I am not sure acquire loads here are a proper fix). - Good point about missing signals anyways; I don't think so, because the former trunk implementation went to length to avoid that, by processing signals first for instance. @sadiqj points out an issue with the multicore implementation, which in old trunk was dealt with by the update_young_limit business. There is still the case of races with syscalls but the reason there is fundamental rather than a programming bug, and there was an idea to fix this too.
There was a problem hiding this comment.
I'm fine with a fence at the top of check_for_pending_signals and we refine it later
OK, done. I added a [MM] (memory model) comment briefly summarizing the discussion and pointing here.
That's an excellent question. I can think of historical reasons (the bytecode interpreter came first, relying on a global "something_to_do" flag that was set by signal handlers and some other runtime functions too), but nothing more compelling. |
|
I also don't see the need for de70ae5 introduced these @Engil might have insight into why it was needed. |
|
For the The gist of it was, at the time, that without these polling point, the testcase would fail here: |
…nals For the reasons discussed in ocaml#10880 and summarized in a comment.
|
Nobody dares to accept this PR, but I think it's in good shape now, so I'll merge tomorrow if there is no more objections. I'll have a look at the signals_alloc test. Perhaps it's over-constrained and should not prevent a nicer, more efficient implementation of polling in the bytecode interpreter. If so we'll do that in a separate PR. |
gadmm
left a comment
There was a problem hiding this comment.
I approve this PR. (Approve button seems temporarily restricted due to previous spam.)
|
(I tried to give an approving review but it's disabled at the moment.) |
I think so too. Do not hesitate to request reviews when appropriate. |
|
Right, there were temporary restrictions in place on who can approve. Should be back to normal now. |
Currently, we use an array of
NSIGatomic 0-or-1 integers to record the presence of pending signals.This PR uses an array of bits instead, encoded as an array of
ceil (NSIG / BITS_PER_WORD)atomic words.Linux has
NSIG= 65 and macOS hasNSIG= 32, so we end up with one or two atomic words that can be tested very efficiently.In turn, this results in a major (x 2) speedup in the bytecode interpreter. See the profiles below for a run of the test suite.