### Describe the issue
Sometimes when I run code in a domain, the GC segfaults. I first saw this in Eio, but I now have a test case that doesn't use Eio. It only happens if the program is also linked with the threads library.
### To reproduce
There is a gist with the code here:
https://gist.github.com/talex5/3852aeebd437436fc516e4ddc77a7e03
Building this Dockerfile reproduces the problem for me:

```dockerfile
FROM ocaml/opam:debian-11-ocaml-4.12-domains
RUN sudo apt-get install wget
RUN wget https://gist.github.com/talex5/3852aeebd437436fc516e4ddc77a7e03/archive/70dca297f7ffce0498c07363be2569d79d971f9b.zip -O demo.zip
RUN unzip -j demo.zip
RUN opam install dune
RUN strings /home/opam/.opam/4.12/bin/ocamlc | grep OCAML_RUNTIME_BUILD_GIT_HASH_IS
RUN opam exec -- dune exec -- ./test.exe
```
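The gist's code is not inlined above, but a minimal reproduction has roughly this shape (a sketch, not the gist's exact contents): domains that allocate enough to trigger major GC slices, in a binary that also links the threads library (e.g. via `threads.posix` in the dune `libraries` field, which is an assumption here).

```ocaml
(* Hypothetical sketch of the domains + threads test case; see the gist
   for the actual code. The crash only reproduces when the threads
   library is linked into the executable. *)
let worker () =
  (* Allocate heavily so the domain runs major GC slices. *)
  let r = ref [] in
  for i = 1 to 100_000 do
    r := string_of_int i :: !r;
    if i mod 10_000 = 0 then r := []
  done

let () =
  for _ = 1 to 3 do
    let d = Domain.spawn worker in
    Domain.join d;
    print_endline "OK"
  done
```

Each successful `Domain.join` prints `OK`, matching the three `OK` lines in the build output below before the segfault.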
I get this output:

```
Sending build context to Docker daemon  2.048kB
Step 1/7 : FROM ocaml/opam:debian-11-ocaml-4.12-domains
 ---> dfeb77d7ef7f
Step 2/7 : RUN sudo apt-get install wget
 ---> Using cache
 ---> f3f22235d2c2
Step 3/7 : RUN wget https://gist.github.com/talex5/3852aeebd437436fc516e4ddc77a7e03/archive/70dca297f7ffce0498c07363be2569d79d971f9b.zip -O demo.zip
 ---> Using cache
 ---> 15e20474ce4f
Step 4/7 : RUN unzip -j demo.zip
 ---> Using cache
 ---> 2d90c4738763
Step 5/7 : RUN opam install dune
 ---> Using cache
 ---> 323d01a88f43
Step 6/7 : RUN strings /home/opam/.opam/4.12/bin/ocamlc | grep OCAML_RUNTIME_BUILD_GIT_HASH_IS
 ---> Running in aa80cc86a82a
OCAML_RUNTIME_BUILD_GIT_HASH_IS_6be47af176
Removing intermediate container aa80cc86a82a
 ---> 28d9e3eb3f3e
Step 7/7 : RUN opam exec -- dune exec -- ./test.exe
 ---> Running in eaffd42e4615
Info: Creating file dune-project with this contents:
| (lang dune 2.9)
OK
OK
OK
Segmentation fault (core dumped)
The command '/bin/sh -c opam exec -- dune exec -- ./test.exe' returned a non-zero code: 139
```
### Multicore OCaml build version

`OCAML_RUNTIME_BUILD_GIT_HASH_IS_6be47af176`
### Did you try running it with the debug runtime and heap verification ON?

Yes. That makes the crash less likely, but it still happens sometimes.
### Backtrace
```
Thread 1 received signal SIGSEGV, Segmentation fault.
caml_darken (v=0, ignored=<optimized out>, state=<optimized out>) at major_gc.c:761
761     major_gc.c: No such file or directory.
(rr) t a a bt

Thread 3 (Thread 207106.207127 (mmap_hardlink_3_test.exe)):
#0  __lll_lock_wait (futex=futex@entry=0x55b832e56090 <all_domains+464>, private=0) at lowlevellock.c:52
#1  0x00007fb03ec89843 in __GI___pthread_mutex_lock (mutex=0x55b832e56090 <all_domains+464>) at ../nptl/pthread_mutex_lock.c:80
#2  0x000055b832dee151 in caml_plat_lock (m=0x55b832e56090 <all_domains+464>) at caml/platform.h:125
#3  backup_thread_func (v=0x55b832e55fe8 <all_domains+296>) at domain.c:623
#4  0x00007fb03ec86ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fb03ea6cdef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 207106.207109 (mmap_hardlink_3_test.exe)):
#0  __lll_lock_wait (futex=futex@entry=0x55b832e55f68 <all_domains+168>, private=0) at lowlevellock.c:52
#1  0x00007fb03ec89843 in __GI___pthread_mutex_lock (mutex=0x55b832e55f68 <all_domains+168>) at ../nptl/pthread_mutex_lock.c:80
#2  0x000055b832dee151 in caml_plat_lock (m=0x55b832e55f68 <all_domains+168>) at caml/platform.h:125
#3  backup_thread_func (v=0x55b832e55ec0 <all_domains>) at domain.c:623
#4  0x00007fb03ec86ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5  0x00007fb03ea6cdef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 207106.207106 (mmap_hardlink_3_test.exe)):
#0  caml_darken (v=0, ignored=<optimized out>, state=<optimized out>) at major_gc.c:761
#1  0x000055b832de7676 in caml_iterate_global_roots (rootlist=0x55b832e55580 <caml_global_roots_old>, rootlist=0x55b832e55580 <caml_global_roots_old>, fdata=0x0, f=0x55b832dcf570 <caml_darken>) at globroots.c:222
#2  caml_scan_global_roots (f=f@entry=0x55b832dcf570 <caml_darken>, fdata=fdata@entry=0x0) at globroots.c:233
#3  0x000055b832dd0008 in cycle_all_domains_callback (domain=domain@entry=0x7fb03e780000, unused=unused@entry=0x0, participating_count=<optimized out>, participating=participating@entry=0x55b832e5f360 <stw_request+64>) at major_gc.c:1093
#4  0x000055b832def4ee in caml_try_run_on_all_domains_with_spin_work (handler=handler@entry=0x55b832dcff40 <cycle_all_domains_callback>, data=data@entry=0x0, leader_setup=leader_setup@entry=0x0, enter_spin_callback=enter_spin_callback@entry=0x0, enter_spin_data=enter_spin_data@entry=0x0) at domain.c:985
#5  0x000055b832def60a in caml_try_run_on_all_domains (handler=handler@entry=0x55b832dcff40 <cycle_all_domains_callback>, data=data@entry=0x0, leader_setup=leader_setup@entry=0x0) at domain.c:999
#6  0x000055b832dd0d4f in major_collection_slice (howmuch=<optimized out>, participant_count=participant_count@entry=0, barrier_participants=barrier_participants@entry=0x0, mode=mode@entry=Slice_interruptible) at major_gc.c:1377
#7  0x000055b832dd0ec8 in caml_major_collection_slice (howmuch=howmuch@entry=-1) at major_gc.c:1396
#8  0x000055b832dedd91 in caml_poll_gc_work () at domain.c:1038
#9  0x000055b832dee038 in stw_handler (domain=<optimized out>) at domain.c:891
#10 0x000055b832dee0d1 in handle_incoming (s=<optimized out>) at domain.c:219
#11 caml_handle_incoming_interrupts () at domain.c:232
#12 handle_gc_interrupt () at domain.c:1060
#13 0x000055b832def667 in caml_handle_gc_interrupt () at domain.c:1077
#14 0x000055b832dce565 in caml_garbage_collection () at signals_nat.c:92
#15 0x000055b832df0a33 in caml_call_gc ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
```