Use a bit-vector to record pending signals #10880

Merged
xavierleroy merged 8 commits into ocaml:trunk from xavierleroy:bitvect-signals
Jan 14, 2022

Conversation

@xavierleroy (Contributor) commented Jan 11, 2022

Currently, we use an array of NSIG atomic 0-or-1 integers to record the presence of pending signals.

This PR uses an array of bits instead, encoded as an array of ceil(NSIG / BITS_PER_WORD) atomic words.

Linux has NSIG = 65 and macOS has NSIG = 32, so we end up with one or two atomic words that can be tested very efficiently.

In turn, this results in a major (roughly 2×) speedup in the bytecode interpreter. See the profiles below for a run of the test suite.
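The encoding can be sketched as follows. This is a minimal standalone illustration using C11 atomics, not the runtime's actual code; the names (`pending`, `record_signal`, `check_pending`) are hypothetical:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdint.h>

#define ILLUS_NSIG 65                                 /* e.g. Linux */
#define BITS_PER_WORD (sizeof(uintptr_t) * CHAR_BIT)
#define NSIG_WORDS ((ILLUS_NSIG + BITS_PER_WORD - 1) / BITS_PER_WORD)

static _Atomic uintptr_t pending[NSIG_WORDS];

/* Mark signal [signo] as pending: set bit (signo % BITS_PER_WORD)
   of word (signo / BITS_PER_WORD). */
static void record_signal(int signo)
{
  atomic_fetch_or(&pending[signo / BITS_PER_WORD],
                  (uintptr_t)1 << (signo % BITS_PER_WORD));
}

/* Any signal pending?  With NSIG = 65 on a 64-bit platform this is
   at most two word-sized loads, instead of up to 65 loads with an
   array of 0-or-1 integers. */
static int check_pending(void)
{
  for (unsigned i = 0; i < NSIG_WORDS; i++)
    if (atomic_load_explicit(&pending[i], memory_order_relaxed))
      return 1;
  return 0;
}
```

Assuming the atomic word type is lock-free, `atomic_fetch_or` is async-signal-safe, which is what permits setting a bit from inside a signal handler.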

This speeds up `caml_check_for_pending_signals` and the bytecode interpreter
significantly.
@xavierleroy (Contributor, Author) commented Jan 11, 2022

Running the testsuite with the current trunk:

real	14m8.349s
user	20m52.485s
sys	0m46.666s

Event count (approx.): 5024892760014
Overhead  Command          Shared Object                       Symbol
  16.43%  Domain0          ocamlrun                            [.] caml_check_for_pending_signals
  14.16%  Domain0          ocamlrun                            [.] caml_interprete
   6.67%  Domain1          ocamlrun                            [.] caml_check_for_pending_signals
   5.12%  Domain1          ocamlrun                            [.] caml_interprete
   4.72%  Domain3          ocamlrun                            [.] caml_check_for_pending_signals
   4.60%  Domain2          ocamlrun                            [.] caml_check_for_pending_signals
   3.74%  Domain3          ocamlrun                            [.] caml_interprete
   3.67%  Domain2          ocamlrun                            [.] caml_interprete
   2.29%  Domain4          ocamlrun                            [.] caml_check_for_pending_signals
   1.34%  Domain4          ocamlrun                            [.] caml_interprete
   1.15%  Domain0          ocamlrun                            [.] caml_ensure_stack_capacity
   0.96%  Domain0          ocamlrun                            [.] caml_MD5Transform
   0.83%  Domain3          ocamlrun                            [.] lf_skiplist_lookup
   0.82%  Domain2          ocamlrun                            [.] caml_sample_gc_stats
   0.81%  Domain3          ocamlrun                            [.] caml_sample_gc_stats
   0.78%  Domain1          ocamlrun                            [.] caml_sample_gc_stats
   0.73%  Domain1          ocamlrun                            [.] caml_ensure_stack_capacity
   0.55%  Domain3          ocamlrun                            [.] caml_ensure_stack_capacity
   0.54%  Domain3          ocamlrun                            [.] caml_find_code_fragment_by_pc
   0.54%  Domain2          ocamlrun                            [.] caml_ensure_stack_capacity

Note that we spend as much time checking for signals as interpreting bytecode instructions!

@xavierleroy (Contributor, Author)

Running the testsuite with this PR:

real	8m32.072s
user	10m16.386s
sys	0m45.736s

Event count (approx.): 2483613262503
Overhead  Command          Shared Object                       Symbol
  22.67%  Domain0          ocamlrun                            [.] caml_interprete
   4.21%  Domain1          ocamlrun                            [.] caml_interprete
   3.20%  Domain3          ocamlrun                            [.] caml_interprete
   3.11%  Domain2          ocamlrun                            [.] caml_interprete
   2.36%  Domain0          ocamlrun                            [.] caml_ensure_stack_capacity
   1.96%  Domain0          ocamlrun                            [.] caml_MD5Transform
   1.66%  Domain4          ocamlrun                            [.] caml_interprete
   0.86%  Domain0          ocamlrun                            [.] caml_shared_try_alloc
   0.83%  Domain1          ocamlrun                            [.] caml_ensure_stack_capacity
   0.80%  Domain0          ocamlrun                            [.] caml_thread_code
   0.74%  Domain0          libm-2.31.so                        [.] __sin_fma
   0.68%  Domain3          ocamlrun                            [.] caml_ensure_stack_capacity
   0.68%  Domain0          ocamlrun                            [.] mark
   0.66%  Domain2          ocamlrun                            [.] caml_ensure_stack_capacity
   0.64%  Domain3          ocamlrun                            [.] lf_skiplist_lookup
   0.58%  Domain2          ocamlrun                            [.] caml_sample_gc_stats
   0.57%  Domain1          ocamlrun                            [.] caml_sample_gc_stats
   0.55%  Domain3          ocamlrun                            [.] caml_sample_gc_stats
   0.53%  Domain0          [unknown]                           [k] 0xffffffffb8275ee7
   0.52%  Domain0          ocamlrun                            [.] pool_sweep
   0.52%  ld               libbfd-2.34-system.so               [.] bfd_link_hash_traverse

@sadiqj (Contributor) left a comment

This is a good optimisation and I can't see any issues.

One further possible change: do we need sequential consistency on the load in caml_check_for_pending_signals? Relaxed may be sufficient; we re-load with seq_cst in caml_process_pending_signals_exn.

@xavierleroy (Contributor, Author)

Do we need sequential consistency on the load in caml_check_for_pending_signals? Relaxed may be sufficient,

Yes, this sounds very plausible to me, and it matches what you did in lf_skiplist.c. Maybe an expert like @maranget can confirm. In the meantime I'll update this PR to use relaxed loads.

@xavierleroy (Contributor, Author)

Looking at the generated assembly code, I realized that Linux has NSIG = 65, so we actually use two words instead of the single word I was hoping for. Still, the code seems pretty efficient.

@kayceesrk (Contributor)

@xavierleroy, may I ask what command you used to generate these results? It seems quite useful and important.

@xavierleroy (Contributor, Author)

This is good old perf under Linux. Nothing fancy!

    cd testsuite
    make clean
    perf record make all
    perf report

@ctk21 (Contributor) left a comment

I like the change, very sensible. I believe it works as intended.

    #define NSIG 64
    #endif

    #define BITS_PER_WORD (sizeof(uintnat) * 8)
Contributor:

The definitions for BITS_PER_WORD and NSIG_WORDS are duplicated (here and signals.c).
Is there a reason not to define them once in signals.h alongside caml_pending_signals?

Contributor (Author):

I tried, but it causes a clash in yacc/defs.h (!), which defines its own BITS_PER_WORD macro, indirectly includes <caml/signals.h> (why?), and defines CAML_INTERNAL (why??). I gave up...

Contributor:

You're right, there's something offensive going on. I believe the issue to be that caml/platform.h is (unnecessarily) including caml/signals.h. This diff should fix it:

diff --git a/runtime/caml/platform.h b/runtime/caml/platform.h
index 68a5875a6..67d32316e 100644
--- a/runtime/caml/platform.h
+++ b/runtime/caml/platform.h
@@ -23,7 +23,6 @@
 #include <string.h>
 #include "config.h"
 #include "mlvalues.h"
-#include "signals.h"

 #if defined(MAP_ANON) && !defined(MAP_ANONYMOUS)
 #define MAP_ANONYMOUS MAP_ANON
diff --git a/runtime/sync.c b/runtime/sync.c
index d9e54999d..d567c51c8 100644
--- a/runtime/sync.c
+++ b/runtime/sync.c
@@ -25,6 +25,7 @@
 #include "caml/fail.h"
 #include "caml/memory.h"

+#include "caml/signals.h"
 #include "caml/sync.h"
 #include "caml/eventlog.h"

Contributor (Author):

OK, this unblocks the situation indeed. See the latest push on this PR.

@sadiqj (Contributor) commented Jan 12, 2022

Looking at the generated assembly code, I realized that Linux has NSIG = 65, so we actually use two words instead of one, as I was hoping. Yet the code seems pretty efficient.

Depending on how keen we are to keep it in one word on Linux, we've got some wasted space at 0..

@xavierleroy (Contributor, Author)

Depending on how keen we are to keep it in one word on Linux, we've got some wasted space at 0..

Yes, I was thinking along these lines! POSIX 2008 says that signal numbers are positive integers, and that "The value 0 is reserved for use as the null signal (see kill())", so we can fit signals 1 to 64 into one word... Stay tuned!
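With signal 0 reserved, the mapping can be sketched like this (illustrative code, not the PR's actual implementation: bit (s - 1) holds signal s, so signals 1 to 64 fit in a single 64-bit word):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Signal 0 being reserved as the null signal, signal number s
   (1 <= s <= 64) is stored in bit (s - 1) of a single word. */
static _Atomic uint64_t pending_word;

static void record_pending(int s)
{
  assert(s >= 1 && s <= 64);
  atomic_fetch_or(&pending_word, (uint64_t)1 << (s - 1));
}

static int is_pending(int s)
{
  assert(s >= 1 && s <= 64);
  return (int)((atomic_load(&pending_word) >> (s - 1)) & 1);
}
```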

POSIX 2008 says that signal numbers are positive integers, and that
signal number 0 means "no signal" and is reserved for use by kill().

We take advantage of this to get a smaller caml_pending_signals array
under Linux (which has NSIG = 65).

The check on signal numbers in caml_install_signal_handler is tightened
to reject the signal = 0 case early.  (Before, it would cause a Sys_error
as sigaction() would return EINVAL.)
caml_record_signal does the check already.
@xavierleroy (Contributor, Author)

OK, I just pushed a version where signal numbers start at 1, not at 0. The compiled C code is even nicer than before :-)

I noticed other opportunities for simplifications in signal handling, but this will be for another PR.

@sadiqj (Contributor) commented Jan 12, 2022

Updated code looks correct.

One last thing we could do (which came out of a conversation with @dra27) is to use ctz to find the index to start checking from. Might be an over-optimisation, though.

ctk21 and others added 2 commits January 12, 2022 13:49
Also: if CAML_INTERNAL, the standard header file <signal.h> is now
always included, whether POSIX signals are supported or not.
That's because the standard header should have a chance to define NSIG
before <caml/signals.h> provides its fallback definition.
@xavierleroy (Contributor, Author)

I did consider clz / find first bit set (it's a canonical example of use of these instructions!) but they are not portable, i.e. we'd have a builtin function for GCC / Clang and a library function or hand-written code for MSVC.

@gretay-js (Contributor)

I did consider clz / find first bit set (it's a canonical example of use of these instructions!) but they are not portable, i.e. we'd have a builtin function for GCC / Clang and a library function or hand-written code for MSVC.

MSVC has _BitScanReverse.
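For reference, a portable wrapper could look like the sketch below. To find the lowest pending signal one wants the forward-scan direction: __builtin_ctzll is the GCC/Clang builtin and _BitScanForward64 the corresponding MSVC intrinsic (x64 only). Whether this extra portability layer is worth it is exactly the trade-off discussed here:

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Index of the lowest set bit of w; w must be nonzero. */
static int first_set_bit(uint64_t w)
{
#if defined(_MSC_VER)
  unsigned long idx;
  _BitScanForward64(&idx, w);   /* MSVC intrinsic, x64 only */
  return (int)idx;
#else
  return __builtin_ctzll(w);    /* GCC/Clang builtin */
#endif
}
```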

@gadmm (Contributor) commented Jan 12, 2022

For context, polling in bytecode is not supposed to check all signals one by one. The performance impact is more a symptom of the regressions I commented about in my review of the multicore PR (#10831 (comment)). In fact, why should comparing the young_pointer to the young_limit in interp.c not be enough, as done for polling in native?

However anything that rids us of the signals_are_pending cache variable from old trunk is good to take; this is beneficial for caml_enter_blocking_section.

    for (i = 0; i < NSIG; i++) {
      if (atomic_load_explicit(&caml_pending_signals[i], memory_order_seq_cst))

    for (i = 0; i < NSIG_WORDS; i++) {
      if (atomic_load_explicit(&caml_pending_signals[i], memory_order_relaxed))
@gadmm (Contributor) commented Jan 12, 2022

One particular ordering to think about: when a signal is recorded, caml_pending_signals is set atomically and then young_limit is set with a release store. This message is received when the comparison young_limit <= young_pointer fails, after which caml_check_for_pending_signals ends up being called via caml_process_pending_signals.

Now the comparison with young_limit is done without synchronisation, so caml_pending_signals could still be seen as unset after the comparison fails. I am not sure this actually happens currently, but only for accidental reasons (an unrelated acquire load of young_limit in handle_gc_interrupts, which happens to be called before caml_process_pending_signals inside caml_process_pending_actions, or a whole lot of other work happening between the two; even so, I am not sure how to reason about it).

It would be clearer and more robust to change the comparison pattern to upgrade the relaxed load into an acquire load when it fails using a fence, e.g.

    if (young_pointer < young_limit) {
      atomic_thread_fence(memory_order_acquire);
      …
    }

(the code in old trunk was organised in such a way that there is a common entry point to the GC which would be good to bring back, in this case it is an obvious place for such a fence).

@gadmm (Contributor) commented Jan 12, 2022

As far as this PR is concerned, one solution is to add this fence at the top of caml_check_for_pending_signals, but I can move this comment to my review of the multicore PR if you want. (I already commented there about the choice of a volatile read of young_limit instead of an atomic relaxed read, which is UB, AFAIU.)

Contributor (Author):

We could put an "acquire" fence at the beginning of caml_check_for_pending_signals, or use "acquire" loads instead of "relaxed" loads. But it's not clear to me that it makes much of a difference. There will always be races that cause a pending signal not to be processed immediately, no?

Since I'm clearly out of my league here, I'd appreciate experts to chime in and tell me what to do. This PR speeds up the bytecode interpreter considerably, so I'd really like to see it merged soon.

Contributor:

We're not doing an acquire on young_limit, which I think opens up the theoretical possibility of seeing young_limit updated without caml_pending_signals being set and then resetting young_limit after the check. This would delay processing the pending signal until young_limit is updated again (another signal, stop-the-world of some kind or we enter a blocking section). I think it might be possible to engineer a scenario where this leads to never processing a SIGINT (loop with a safepoint poll and nothing else?).

I think this can happen anyway at the moment. A signal could be delivered at any time before https://github.com/xavierleroy/ocaml/blob/79fb0b0ddaca1051106fcd540b5e82960449218d/runtime/minor_gc.c#L635 and the resulting young_limit would be over-written. I'll add that to the list of signalling issues, we probably need a caml_check_for_pending_signals at the very end of a stw section.

I don't think the memory orderings need hold up the PR, it doesn't worsen the existing situation and we can look to remedy things in the follow-up with the rest of @gadmm 's identified issues on the multicore PR.

Contributor:

  • I'm not sure it can happen at the moment in practice (series of seq_cst loads, the first of which is for signal 0). I agree it's somewhat independent of this PR, because I'd want the fence anyway so that correctness follows from the structure of the program.
  • I'm fine with a fence at the top of check_for_pending_signals, to be refined later (I am not sure acquire loads here are a proper fix).
  • Good point about missing signals anyway; I don't think so, because the former trunk implementation went to lengths to avoid that, by processing signals first for instance. @sadiqj points out an issue with the multicore implementation, which in old trunk was dealt with by the update_young_limit business. There is still the case of races with syscalls, but the reason there is fundamental rather than a programming bug, and there was an idea to fix that too.

Contributor (Author):

I'm fine with a fence at the top of check_for_pending_signals and we refine it later

OK, done. I added a [MM] (memory model) comment briefly summarizing the discussion and pointing here.
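For concreteness, the agreed-upon pattern can be sketched as follows. This is a simplified standalone version, not the runtime's actual code: NSIG_WORDS is fixed at 2 purely for illustration, and the comment paraphrases the discussion above:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NSIG_WORDS 2   /* illustrative: NSIG = 65 on 64-bit Linux */

static _Atomic uintptr_t caml_pending_signals[NSIG_WORDS];

/* [MM] The acquire fence pairs with the release store of young_limit
   performed after a signal is recorded: once the young_limit check
   has failed, the fence ensures that the earlier stores to
   caml_pending_signals are visible to the relaxed loads below. */
static int caml_check_for_pending_signals(void)
{
  atomic_thread_fence(memory_order_acquire);
  for (int i = 0; i < NSIG_WORDS; i++)
    if (atomic_load_explicit(&caml_pending_signals[i],
                             memory_order_relaxed))
      return 1;
  return 0;
}
```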

@xavierleroy (Contributor, Author)

In fact, why should comparing the young_pointer to the young_limit in interp.c not be enough, as done for polling in native?

That's an excellent question. I can think of historical reasons (the bytecode interpreter came first, relying on a global "something_to_do" flag that was set by signal handlers and some other runtime functions too), but nothing more compelling.

@ctk21 (Contributor) commented Jan 13, 2022

I also don't see the need for caml_check_for_pending_signals() in interp.c for POPTRAP & CHECK_SIGNALS.

de70ae5 introduced these caml_check_for_pending_signals() calls as part of ocaml-multicore#722 which fixed up the signals_alloc testcase. Notice that caml_check_for_pending_signals was also added to the macro Enter_gc in this change.

@Engil might have insight into why it was needed.

@abbysmal (Contributor)

For the POPTRAP and CHECK_SIGNALS addition, I added these following some observations made while comparing the bytecode and native versions on the signals_alloc testcase.

The gist of it was, at the time, that without these polling points the testcase would fail here:
https://github.com/ocaml/ocaml/blob/trunk/testsuite/tests/callback/signals_alloc.ml#L17
It would not print 12345, but rather 12435. After debugging for a while I suspected these polling points were necessary to get the behavior the testsuite expected, but maybe the test assertions are a bit too strict?

…nals

For the reasons discussed in ocaml#10880 and summarized in a comment.
@xavierleroy (Contributor, Author)

Nobody dares to accept this PR, but I think it's in good shape now, so I'll merge tomorrow if there are no more objections.

I'll have a look at the signals_alloc test. Perhaps it's over-constrained and should not prevent a nicer, more efficient implementation of polling in the bytecode interpreter. If so we'll do that in a separate PR.

@gadmm (Contributor) left a comment

I approve this PR. (Approve button seems temporarily restricted due to previous spam.)

@sadiqj (Contributor) commented Jan 13, 2022

(I tried to give an approving review but it's disabled at the moment.)

@gadmm (Contributor) commented Jan 13, 2022

Nobody dares to accept this PR, but I think it's in good shape now, so I'll merge tomorrow if there are no more objections.

I think so too. Do not hesitate to request reviews when appropriate.

@xavierleroy (Contributor, Author)

Right, there were temporary restrictions in place on who can approve. Should be back to normal now.

7 participants