Bad locking in io.c #7503
Description
Original bug ID: 7503
Reporter: @stedolan
Status: confirmed (set by @mshinwell on 2017-03-15T16:25:51Z)
Resolution: open
Priority: normal
Severity: minor
Category: runtime system and C interface
Related to: #5141
Monitored by: @stedolan @gasche
Bug description
With ocamlopt, the locking in io.c is not sufficient in the presence of asynchronous operations like signal handlers and finalisers. Here's an example program badlock.ml:
let () =
  Sys.set_signal Sys.sigint (Sys.Signal_handle (fun _ ->
    print_endline "signalled!"));
  while true do
    print_endline "hello"
  done
The exact behaviour varies depending on whether systhreads is enabled. With trunk and 4.05.0+beta2,
ocamlopt badlock.ml -o badlock && ./badlock
^C
gets stuck in an infinite loop in caml_flush
ocamlopt -runtime-variant d badlock.ml -o badlock && ./badlock
^C
crashes with an assertion failure (the same issue)
ocamlopt -thread unix.cmxa threads.cmxa badlock.ml -o badlock && ./badlock
^C
deadlocks in pthread_mutex_lock
Versions 4.04 and earlier do not initialise systhreads when a program is merely linked with threads.cmxa but uses no thread features, so they behave as in the first case even when compiled with -thread.
I think I understand what's going on in the deadlock case, since I came across this issue when trying to fix a similar problem in the multicore branch (until KC pointed out that the same problem likely exists on trunk). Here's a backtrace from the deadlock in 4.05.0+beta2:
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f2cd0d32bc5 in __GI___pthread_mutex_lock (mutex=0x563e7210ea60) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000563e70e55a6e in caml_io_mutex_lock ()
#3 0x0000563e70e6920d in caml_ml_output_bytes (vchannel=, buff=, start=,
length=) at io.c:660
#4 0x0000563e70e2e0da in camlPervasives__output_string_1213 () at pervasives.ml:349
#5 0x0000563e70e2e97a in camlPervasives__print_endline_1310 () at pervasives.ml:472
#6 0x0000563e70e786f4 in caml_start_program ()
#7 0x0000563e70e5db09 in caml_execute_signal (signal_number=signal_number@entry=2, in_signal_handler=in_signal_handler@entry=0)
at signals.c:176
#8 0x0000563e70e5dbda in caml_process_pending_signals () at signals.c:58
#9 0x0000563e70e5dc87 in caml_process_pending_signals () at signals.c:53
#10 caml_leave_blocking_section () at signals.c:131
#11 0x0000563e70e71d2b in caml_write_fd (fd=1, flags=, buf=buf@entry=0x563e720ee7a0, n=n@entry=6) at unix.c:94
#12 0x0000563e70e68480 in caml_flush_partial (channel=0x563e720ee750) at io.c:168
#13 0x0000563e70e68f48 in caml_flush (channel=) at io.c:182
#14 caml_ml_flush (vchannel=) at io.c:612
#15 0x0000563e70e2e9bf in camlPervasives__print_endline_1310 () at pervasives.ml:472
#16 0x0000563e70e24c47 in camlBadlock__entry ()
#17 0x0000563e70e22369 in caml_program ()
#18 0x0000563e70e786f4 in caml_start_program ()
#19 0x0000563e70e5c9b5 in caml_main (argv=0x7ffff6bb5a18) at startup.c:145
#20 0x0000563e70e2203c in main (argc=, argv=) at main.c:37
After performing the write, caml_write_fd calls caml_leave_blocking_section, which runs any pending signal handlers. However, the channel's lock is still held at that point because it was taken in caml_flush, so when the signal handler calls print_endline and tries to acquire the same lock again, it deadlocks. (Without -thread, I think it continues and corrupts the channel state, although I'm not sure of the details.)
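The sequence in the backtrace can be reproduced in a small pure-OCaml model. This is only a sketch: the model_channel type and every function below are hypothetical stand-ins, not the runtime's actual code, and the model raises an exception at the point where a real non-recursive mutex would block forever.

```ocaml
(* Toy model of the deadlock; all names here are hypothetical, not runtime APIs. *)
exception Would_deadlock

type model_channel = {
  mutable locked : bool;                    (* stands in for the channel mutex *)
  mutable pending : (unit -> unit) option;  (* a signal handler waiting to run *)
  buf : Buffer.t;                           (* stands in for the channel buffer *)
}

let lock ch =
  (* A real non-recursive mutex would block here; the model raises instead. *)
  if ch.locked then raise Would_deadlock;
  ch.locked <- true

let unlock ch = ch.locked <- false

(* Models caml_leave_blocking_section: runs the pending handler, if any. *)
let leave_blocking_section ch =
  match ch.pending with
  | None -> ()
  | Some handler -> ch.pending <- None; handler ()

(* Models the write path: the lock is still held when the handler runs. *)
let model_output ch s =
  lock ch;
  match Buffer.add_string ch.buf s; leave_blocking_section ch with
  | () -> unlock ch
  | exception e -> unlock ch; raise e

(* The handler writes to the same channel, as in badlock.ml. *)
let deadlock_detected =
  let ch = { locked = false; pending = None; buf = Buffer.create 16 } in
  ch.pending <- Some (fun () -> model_output ch "signalled!\n");
  match model_output ch "hello\n" with
  | () -> false
  | exception Would_deadlock -> true
```

In this model, deadlock_detected is true because the handler re-enters model_output while the outer call still holds the lock.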
I think that it is unsafe in general to hold mutexes while interacting with the OCaml heap. Due to finalisers and signal handlers, several heap operations can run arbitrary user code (e.g. caml_alloc, caml_enter_blocking_section, and more in multicore), which is to be avoided in a critical section.
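One possible discipline that follows from this observation is to run user code only after the lock has been released. The sketch below illustrates that idea in a pure-OCaml model. It is an assumption about what a fix could look like, not the approach taken by the runtime, and chan, with_lock, run_pending and safe_output are hypothetical names.

```ocaml
(* Toy model: handlers are deferred until the critical section has ended.
   All names are hypothetical and do not correspond to runtime functions. *)
type chan = {
  mutable held : bool;               (* stands in for the channel mutex *)
  data : Buffer.t;                   (* stands in for the channel buffer *)
  handlers : (unit -> unit) Queue.t; (* pending signal handlers *)
}

(* Runs [f] with the lock held; [f] must not run user code. *)
let with_lock ch f =
  if ch.held then failwith "re-entrant lock acquisition";
  ch.held <- true;
  match f () with
  | v -> ch.held <- false; v
  | exception e -> ch.held <- false; raise e

(* Runs pending handlers; only ever called with the lock released. *)
let run_pending ch =
  while not (Queue.is_empty ch.handlers) do
    (Queue.pop ch.handlers) ()
  done

let safe_output ch s =
  with_lock ch (fun () -> Buffer.add_string ch.data s);
  run_pending ch

(* A handler that writes to the same channel now completes normally. *)
let result =
  let ch = { held = false; data = Buffer.create 16; handlers = Queue.create () } in
  Queue.add (fun () -> safe_output ch "signalled!\n") ch.handlers;
  safe_output ch "hello\n";
  Buffer.contents ch.data
```

Because the handler runs after with_lock returns, its own call to safe_output can take the lock without conflict.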