Use a bit-vector to record pending signals #10880

Merged
xavierleroy merged 8 commits into ocaml:trunk from xavierleroy:bitvect-signals
Jan 14, 2022

Conversation

@xavierleroy (Contributor) commented Jan 11, 2022

Currently, we use an array of NSIG atomic 0-or-1 integers to record the presence of pending signals.

This PR uses an array of bits instead, encoded as an array of ceil(NSIG / BITS_PER_WORD) atomic words.

Linux has NSIG = 65 and macOS has NSIG = 32, so we end up with one or two atomic words that can be tested very efficiently.

In turn, this results in a major (roughly 2×) speedup in the bytecode interpreter. See the profiles below for a run of the test suite.
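The encoding can be sketched as follows. This is a minimal standalone illustration using C11 atomics, not the runtime's actual code; the names (`pending`, `record_signal`, `check_pending`) are hypothetical:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdint.h>

#define ILLUS_NSIG 65                                 /* e.g. Linux */
#define BITS_PER_WORD (sizeof(uintptr_t) * CHAR_BIT)
#define NSIG_WORDS ((ILLUS_NSIG + BITS_PER_WORD - 1) / BITS_PER_WORD)

static _Atomic uintptr_t pending[NSIG_WORDS];

/* Mark signal [signo] as pending: set bit (signo % BITS_PER_WORD)
   of word (signo / BITS_PER_WORD). */
static void record_signal(int signo)
{
  atomic_fetch_or(&pending[signo / BITS_PER_WORD],
                  (uintptr_t)1 << (signo % BITS_PER_WORD));
}

/* Any signal pending?  With NSIG = 65 on a 64-bit platform this is
   at most two word-sized loads, instead of up to 65 loads with an
   array of 0-or-1 integers. */
static int check_pending(void)
{
  for (unsigned i = 0; i < NSIG_WORDS; i++)
    if (atomic_load_explicit(&pending[i], memory_order_relaxed))
      return 1;
  return 0;
}
```

Assuming the atomic word type is lock-free, `atomic_fetch_or` is async-signal-safe, which is what permits setting a bit from inside a signal handler.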

This speeds up `caml_check_for_pending_signals` and the bytecode interpreter
significantly.
@xavierleroy (Contributor, Author) commented Jan 11, 2022

Running the testsuite with the current trunk:

real	14m8.349s
user	20m52.485s
sys	0m46.666s

Event count (approx.): 5024892760014
Overhead  Command          Shared Object                       Symbol
  16.43%  Domain0          ocamlrun                            [.] caml_check_for_pending_signals
  14.16%  Domain0          ocamlrun                            [.] caml_interprete
   6.67%  Domain1          ocamlrun                            [.] caml_check_for_pending_signals
   5.12%  Domain1          ocamlrun                            [.] caml_interprete
   4.72%  Domain3          ocamlrun                            [.] caml_check_for_pending_signals
   4.60%  Domain2          ocamlrun                            [.] caml_check_for_pending_signals
   3.74%  Domain3          ocamlrun                            [.] caml_interprete
   3.67%  Domain2          ocamlrun                            [.] caml_interprete
   2.29%  Domain4          ocamlrun                            [.] caml_check_for_pending_signals
   1.34%  Domain4          ocamlrun                            [.] caml_interprete
   1.15%  Domain0          ocamlrun                            [.] caml_ensure_stack_capacity
   0.96%  Domain0          ocamlrun                            [.] caml_MD5Transform
   0.83%  Domain3          ocamlrun                            [.] lf_skiplist_lookup
   0.82%  Domain2          ocamlrun                            [.] caml_sample_gc_stats
   0.81%  Domain3          ocamlrun                            [.] caml_sample_gc_stats
   0.78%  Domain1          ocamlrun                            [.] caml_sample_gc_stats
   0.73%  Domain1          ocamlrun                            [.] caml_ensure_stack_capacity
   0.55%  Domain3          ocamlrun                            [.] caml_ensure_stack_capacity
   0.54%  Domain3          ocamlrun                            [.] caml_find_code_fragment_by_pc
   0.54%  Domain2          ocamlrun                            [.] caml_ensure_stack_capacity

Note that we spend as much time checking for signals as interpreting bytecode instructions!

@xavierleroy (Contributor, Author)

Running the testsuite with this PR:

real	8m32.072s
user	10m16.386s
sys	0m45.736s

Event count (approx.): 2483613262503
Overhead  Command          Shared Object                       Symbol
  22.67%  Domain0          ocamlrun                            [.] caml_interprete
   4.21%  Domain1          ocamlrun                            [.] caml_interprete
   3.20%  Domain3          ocamlrun                            [.] caml_interprete
   3.11%  Domain2          ocamlrun                            [.] caml_interprete
   2.36%  Domain0          ocamlrun                            [.] caml_ensure_stack_capacity
   1.96%  Domain0          ocamlrun                            [.] caml_MD5Transform
   1.66%  Domain4          ocamlrun                            [.] caml_interprete
   0.86%  Domain0          ocamlrun                            [.] caml_shared_try_alloc
   0.83%  Domain1          ocamlrun                            [.] caml_ensure_stack_capacity
   0.80%  Domain0          ocamlrun                            [.] caml_thread_code
   0.74%  Domain0          libm-2.31.so                        [.] __sin_fma
   0.68%  Domain3          ocamlrun                            [.] caml_ensure_stack_capacity
   0.68%  Domain0          ocamlrun                            [.] mark
   0.66%  Domain2          ocamlrun                            [.] caml_ensure_stack_capacity
   0.64%  Domain3          ocamlrun                            [.] lf_skiplist_lookup
   0.58%  Domain2          ocamlrun                            [.] caml_sample_gc_stats
   0.57%  Domain1          ocamlrun                            [.] caml_sample_gc_stats
   0.55%  Domain3          ocamlrun                            [.] caml_sample_gc_stats
   0.53%  Domain0          [unknown]                           [k] 0xffffffffb8275ee7
   0.52%  Domain0          ocamlrun                            [.] pool_sweep
   0.52%  ld               libbfd-2.34-system.so               [.] bfd_link_hash_traverse

@sadiqj (Contributor) left a comment

This is a good optimisation and I can't see any issues.

One further possible change: do we need sequential consistency on the load in caml_check_for_pending_signals? Relaxed may be sufficient; we re-load with seq_cst in caml_process_pending_signals_exn.

@xavierleroy (Contributor, Author)

Do we need sequential consistency on the load in caml_check_for_pending_signals? Relaxed may be sufficient,

Yes, this sounds very plausible to me, and it matches what you did in lf_skiplist.c. Maybe an expert like @maranget can confirm. In the meantime I'll update this PR to use relaxed loads.

@xavierleroy (Contributor, Author)

Looking at the generated assembly code, I realized that Linux has NSIG = 65, so we actually use two words instead of the single word I was hoping for. Still, the code seems pretty efficient.

@kayceesrk (Contributor)

@xavierleroy, may I ask what command you used to generate these results? It seems quite useful and important.

@xavierleroy (Contributor, Author)

This is good old perf under Linux. Nothing fancy!

    cd testsuite
    make clean
    perf record make all
    perf report

@ctk21 (Contributor) left a comment

I like the change, very sensible. I believe it works as intended.

    #define NSIG 64
    #endif

    #define BITS_PER_WORD (sizeof(uintnat) * 8)
Contributor:

The definitions for BITS_PER_WORD and NSIG_WORDS are duplicated (here and signals.c).
Is there a reason not to define them once in signals.h alongside caml_pending_signals?

Contributor (Author):

I tried, but it causes a clash in yacc/defs.h (!), which defines its own BITS_PER_WORD macro, indirectly includes <caml/signals.h> (why?), and defines CAML_INTERNAL (why??). I gave up...

Contributor:

You're right, there's something offensive going on. I believe the issue to be that caml/platform.h is (unnecessarily) including caml/signals.h. This diff should fix it:

diff --git a/runtime/caml/platform.h b/runtime/caml/platform.h
index 68a5875a6..67d32316e 100644
--- a/runtime/caml/platform.h
+++ b/runtime/caml/platform.h
@@ -23,7 +23,6 @@
 #include <string.h>
 #include "config.h"
 #include "mlvalues.h"
-#include "signals.h"

 #if defined(MAP_ANON) && !defined(MAP_ANONYMOUS)
 #define MAP_ANONYMOUS MAP_ANON
diff --git a/runtime/sync.c b/runtime/sync.c
index d9e54999d..d567c51c8 100644
--- a/runtime/sync.c
+++ b/runtime/sync.c
@@ -25,6 +25,7 @@
 #include "caml/fail.h"
 #include "caml/memory.h"

+#include "caml/signals.h"
 #include "caml/sync.h"
 #include "caml/eventlog.h"

Contributor (Author):

OK, this unblocks the situation indeed. See the latest push on this PR.

@sadiqj (Contributor) commented Jan 12, 2022

Looking at the generated assembly code, I realized that Linux has NSIG = 65, so we actually use two words instead of one, as I was hoping. Yet the code seems pretty efficient.

Depending on how keen we are to keep it in one word on Linux, we've got some wasted space at 0..

@xavierleroy (Contributor, Author)

Depending on how keen we are to keep it in one word on Linux, we've got some wasted space at 0..

Yes, I was thinking along these lines! POSIX 2008 says that signal numbers are positive integers, and that "The value 0 is reserved for use as the null signal (see kill())", so we can fit signals 1 to 64 into one word... Stay tuned!
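With signal 0 reserved, the mapping can be sketched like this (illustrative code, not the PR's actual implementation: bit (s - 1) holds signal s, so signals 1 to 64 fit in a single 64-bit word):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Signal 0 being reserved as the null signal, signal number s
   (1 <= s <= 64) is stored in bit (s - 1) of a single word. */
static _Atomic uint64_t pending_word;

static void record_pending(int s)
{
  assert(s >= 1 && s <= 64);
  atomic_fetch_or(&pending_word, (uint64_t)1 << (s - 1));
}

static int is_pending(int s)
{
  assert(s >= 1 && s <= 64);
  return (int)((atomic_load(&pending_word) >> (s - 1)) & 1);
}
```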

POSIX 2008 says that signal numbers are positive integers, and that
signal number 0 means "no signal" and is reserved for use by kill().

We take advantage of this to get a smaller caml_pending_signals array
under Linux (which has NSIG = 65).

The check on signal numbers in caml_install_signal_handler is tightened
to reject the signal = 0 case early.  (Before, it would cause a Sys_error
as sigaction() would return EINVAL.)
caml_record_signal does the check already.
@xavierleroy (Contributor, Author)

OK, I just pushed a version where signal numbers start at 1, not at 0. The compiled C code is even nicer than before :-)

I noticed other opportunities for simplifications in signal handling, but this will be for another PR.

@sadiqj (Contributor) commented Jan 12, 2022

Updated code looks correct.

One last thing we could do (which came out of a conversation with @dra27) is to use ctz to find the index to start checking from. Might be an over-optimisation, though.

ctk21 and others added 2 commits January 12, 2022 13:49
Also: if CAML_INTERNAL, the standard header file <signal.h> is now
always included, whether POSIX signals are supported or not.
That's because the standard header should have a chance to define NSIG
before <caml/signals.h> provides its fallback definition.
@xavierleroy (Contributor, Author)

I did consider clz / find first bit set (it's a canonical example of use of these instructions!) but they are not portable, i.e. we'd have a builtin function for GCC / Clang and a library function or hand-written code for MSVC.

@gretay-js (Contributor)

I did consider clz / find first bit set (it's a canonical example of use of these instructions!) but they are not portable, i.e. we'd have a builtin function for GCC / Clang and a library function or hand-written code for MSVC.

MSVC has _BitScanReverse.
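For reference, a portable wrapper could look like the sketch below. To find the lowest pending signal one wants the forward-scan direction: __builtin_ctzll is the GCC/Clang builtin and _BitScanForward64 the corresponding MSVC intrinsic (x64 only). Whether this extra portability layer is worth it is exactly the trade-off discussed here:

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Index of the lowest set bit of w; w must be nonzero. */
static int first_set_bit(uint64_t w)
{
#if defined(_MSC_VER)
  unsigned long idx;
  _BitScanForward64(&idx, w);   /* MSVC intrinsic, x64 only */
  return (int)idx;
#else
  return __builtin_ctzll(w);    /* GCC/Clang builtin */
#endif
}
```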

@gadmm (Contributor) commented Jan 12, 2022

For context, polling in bytecode is not supposed to check all signals one by one. The performance impact is more a symptom of the regressions I commented about in my review of the multicore PR (#10831 (comment)). In fact, why should comparing the young_pointer to the young_limit in interp.c not be enough, as done for polling in native?

However anything that rids us of the signals_are_pending cache variable from old trunk is good to take; this is beneficial for caml_enter_blocking_section.

    for (i = 0; i < NSIG; i++) {
      if (atomic_load_explicit(&caml_pending_signals[i], memory_order_seq_cst))

    for (i = 0; i < NSIG_WORDS; i++) {
      if (atomic_load_explicit(&caml_pending_signals[i], memory_order_relaxed))
@gadmm (Contributor) commented Jan 12, 2022

One particular ordering to think about: when a signal is recorded, caml_pending_signals is set atomically and then young_limit is set with a release store. This message is received when the comparison young_limit <= young_pointer fails, after which caml_check_for_pending_signals ends up being called via caml_process_pending_signals.

Now the comparison with young_limit is done without synchronisation, so caml_pending_signals could still be seen as unset after the comparison fails. I am not sure this actually happens currently, but only for accidental reasons (an unrelated acquire load of young_limit in handle_gc_interrupts, which happens to be called before caml_process_pending_signals inside caml_process_pending_actions, or a whole lot of other work happening between the two; even so, I am not sure how to reason about it).

It would be clearer and more robust to change the comparison pattern to upgrade the relaxed load into an acquire load when it fails using a fence, e.g.

    if (young_pointer < young_limit) {
      atomic_thread_fence(memory_order_acquire);
      …
    }

(the code in old trunk was organised in such a way that there is a common entry point to the GC which would be good to bring back, in this case it is an obvious place for such a fence).

@gadmm (Contributor) commented Jan 12, 2022

As far as this PR is concerned, one solution is to add this fence at the top of caml_check_for_pending_signals, but I can move this comment to my review of the multicore PR if you want. (I already commented there about the choice of a volatile read of young_limit instead of an atomic relaxed read, which is UB, AFAIU.)

Contributor (Author):

We could put an "acquire" fence at the beginning of caml_check_for_pending_signals, or use "acquire" loads instead of "relaxed" loads. But it's not clear to me that it makes much of a difference. There will always be races that cause a pending signal not to be processed immediately, no?

Since I'm clearly out of my league here, I'd appreciate experts to chime in and tell me what to do. This PR speeds up the bytecode interpreter considerably, so I'd really like to see it merged soon.

Contributor:

We're not doing an acquire on young_limit, which I think opens up the theoretical possibility of seeing young_limit updated without caml_pending_signals being set and then resetting young_limit after the check. This would delay processing the pending signal until young_limit is updated again (another signal, stop-the-world of some kind or we enter a blocking section). I think it might be possible to engineer a scenario where this leads to never processing a SIGINT (loop with a safepoint poll and nothing else?).

I think this can happen anyway at the moment. A signal could be delivered at any time before https://github.com/xavierleroy/ocaml/blob/79fb0b0ddaca1051106fcd540b5e82960449218d/runtime/minor_gc.c#L635 and the resulting young_limit would be over-written. I'll add that to the list of signalling issues, we probably need a caml_check_for_pending_signals at the very end of a stw section.

I don't think the memory orderings need hold up the PR, it doesn't worsen the existing situation and we can look to remedy things in the follow-up with the rest of @gadmm 's identified issues on the multicore PR.

Contributor:

  • I'm not sure it can happen at the moment in practice (series of seq_cst loads, the first of which is for signal 0). I agree it's somewhat independent of this PR, because I'd want the fence anyway so that correctness follows from the structure of the program.
  • I'm fine with a fence at the top of check_for_pending_signals, to be refined later (I am not sure acquire loads here are a proper fix).
  • Good point about missing signals anyway; I don't think so, because the former trunk implementation went to lengths to avoid that, by processing signals first for instance. @sadiqj points out an issue with the multicore implementation, which in old trunk was dealt with by the update_young_limit business. There is still the case of races with syscalls, but the reason there is fundamental rather than a programming bug, and there was an idea to fix that too.

Contributor (Author):

I'm fine with a fence at the top of check_for_pending_signals and we refine it later

OK, done. I added a [MM] (memory model) comment briefly summarizing the discussion and pointing here.
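For concreteness, the agreed-upon pattern can be sketched as follows. This is a simplified standalone version, not the runtime's actual code: NSIG_WORDS is fixed at 2 purely for illustration, and the comment paraphrases the discussion above:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSIG_WORDS 2   /* illustrative: NSIG = 65 on 64-bit Linux */

static _Atomic uintptr_t caml_pending_signals[NSIG_WORDS];

/* [MM] The acquire fence pairs with the release store of young_limit
   performed after a signal is recorded: once the young_limit check
   has failed, the fence ensures that the earlier stores to
   caml_pending_signals are visible to the relaxed loads below. */
static int caml_check_for_pending_signals(void)
{
  atomic_thread_fence(memory_order_acquire);
  for (int i = 0; i < NSIG_WORDS; i++)
    if (atomic_load_explicit(&caml_pending_signals[i],
                             memory_order_relaxed))
      return 1;
  return 0;
}
```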

@xavierleroy (Contributor, Author)

In fact, why should comparing the young_pointer to the young_limit in interp.c not be enough, as done for polling in native?

That's an excellent question. I can think of historical reasons (the bytecode interpreter came first, relying on a global "something_to_do" flag that was set by signal handlers and some other runtime functions too), but nothing more compelling.

@ctk21 (Contributor) commented Jan 13, 2022

I also don't see the need for caml_check_for_pending_signals() in interp.c for POPTRAP & CHECK_SIGNALS.

de70ae5 introduced these caml_check_for_pending_signals() calls as part of ocaml-multicore#722 which fixed up the signals_alloc testcase. Notice that caml_check_for_pending_signals was also added to the macro Enter_gc in this change.

@Engil might have insight into why it was needed.

@abbysmal (Contributor)

For the POPTRAP and CHECK_SIGNALS addition, I added these following some observations made while comparing the bytecode and native versions on the signals_alloc testcase.

The gist of it was, at the time, that without these polling points the testcase would fail here:
https://github.com/ocaml/ocaml/blob/trunk/testsuite/tests/callback/signals_alloc.ml#L17
It would not print 12345, but rather 12435. After debugging for a while I suspected these polling points were necessary to get the behavior the testsuite expected, but maybe the test assertions are a bit too strict?

…nals

For the reasons discussed in ocaml#10880 and summarized in a comment.
@xavierleroy (Contributor, Author)

Nobody dares to accept this PR, but I think it's in good shape now, so I'll merge tomorrow if there are no more objections.

I'll have a look at the signals_alloc test. Perhaps it's over-constrained and should not prevent a nicer, more efficient implementation of polling in the bytecode interpreter. If so we'll do that in a separate PR.

@gadmm (Contributor) left a comment

I approve this PR. (Approve button seems temporarily restricted due to previous spam.)

@sadiqj (Contributor) commented Jan 13, 2022

(I tried to give an approving review but it's disabled at the moment.)

@gadmm (Contributor) commented Jan 13, 2022

Nobody dares to accept this PR, but I think it's in good shape now, so I'll merge tomorrow if there are no more objections.

I think so too. Do not hesitate to request reviews when appropriate.

@xavierleroy (Contributor, Author)

Right, there were temporary restrictions in place on who can approve. Should be back to normal now.

7 participants