Fix shared library crash on s390x (RHEL/Fedora) #14514
Zaneham wants to merge 6 commits into ocaml:trunk
Explicitly initialize global root skiplists at runtime using a constructor function, rather than relying solely on static initialization. On some s390x systems (RHEL/Fedora), the BSS section of shared libraries may not be properly zeroed, causing the skiplist forward pointers to contain garbage instead of NULL. This leads to a segmentation fault in caml_skiplist_insert during startup. Fixes ocaml#13693
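The approach described above can be sketched as follows. This is a minimal illustration, not the patch itself: `struct skiplist`, `NUM_LEVELS`, `global_roots`, and the helper functions are hypothetical, simplified stand-ins for the runtime's real definitions, and `__attribute__((constructor))` is a GCC/Clang extension rather than standard C.

```c
#include <stddef.h>

/* Hypothetical level count; stands in for the runtime's own constant. */
#define NUM_LEVELS 17

/* Hypothetical, simplified stand-in for the runtime's skiplist type. */
struct skiplist {
  struct skiplist *forward[NUM_LEVELS];
  int level;
};

/* Static initialization: standard C already guarantees that every
   forward pointer starts out as NULL. */
static struct skiplist global_roots = { { NULL }, 0 };

/* Records that the constructor below has run. */
static int roots_initialized = 0;

/* Runs when the executable or shared object is loaded, before main().
   It repeats the initialization explicitly at run time. */
__attribute__((constructor))
static void init_global_roots(void)
{
  for (int i = 0; i < NUM_LEVELS; i++)
    global_roots.forward[i] = NULL;
  global_roots.level = 0;
  roots_initialized = 1;
}

/* Returns 1 if every forward pointer is NULL, i.e. the list is empty. */
int global_roots_is_empty(void)
{
  for (int i = 0; i < NUM_LEVELS; i++)
    if (global_roots.forward[i] != NULL)
      return 0;
  return 1;
}

/* Returns 1 once the constructor has executed. */
int global_roots_constructor_ran(void)
{
  return roots_initialized;
}
```

On a conforming implementation the constructor is redundant, since the static initializer alone leaves the list empty. It only matters if static initialization is somehow not honoured at load time.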
Wouldn't that totally break the semantics of C, which requires objects of static storage duration to be empty-initialized? Besides, these variables have an explicit initializer, though it seems that variables that are explicitly initialized to zero are also put in the BSS (see GCC's |
Fair point, "BSS not zeroed" was an oversimplification. It's more likely an initialisation ordering issue with shared |
Co-authored-by: Antonin Décimo <antonin.decimo@gmail.com>
Thanks for pointing out the potential issue with shared library initialization on s390x, and for the proposed fix. I have the impression there are other global variables in the OCaml runtime system that rely on proper static initialization... How can we find all the global variables that could be affected by the s390x bug? If the bug is fixed in more recent Linux distributions, or in updated packages, we could also elect not to work around it in OCaml. Any details you could provide on this bug (e.g. a discussion on a Red Hat or Ubuntu bug tracker) would be useful.
Hello @xavierleroy, I couldn't find a single upstream bug report that matches exactly, but it looks like an IFUNC resolver ordering issue specific to s390x. glibc's "delayed relocation" fix for IFUNC resolvers was https://patchwork.ozlabs.org/project/glibc/patch/20180606140223.4D11F439942E1@oldenburg.str.redhat.com/. s390x got a no-op stub. A related bug showing the pattern: https://bugzilla.redhat.com/show_bug.cgi?id=1398716
Since it's a gap in glibc's s390x support rather than a fixed bug, I don't think we can count on distro updates.
For auditing other affected globals, grep -rn 'SKIPLIST_STATIC_INITIALIZER' runtime/ should catch the main ones.
Thanks for the tip, but this is not the question I was asking. The OCaml runtime system has a bunch of global variables that are statically initialized. Some are skiplists, as in Lines 46 to 51 in 3b09092 and others are not, such as Lines 157 to 160 in 3b09092 Your patch replaces static initialization by initialization by a constructor function in the first example above, but not in the second example above. Why? Are global variables of type
For reference, here is the description of the IFUNC feature: https://sourceware.org/glibc/wiki/GNU_IFUNC . It's about load-time resolution of function symbols. I don't see the connection with static initialization of global variables. Please explain like I'm 6.
We're talking about (possibly) incorrect static initialization of global variables at dynamic loading time. That's a standard feature of C since K&R. If it was really broken, it would affect lots of programs and be treated as a highest-priority issue.
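As a small illustration of the guarantee under discussion: in standard C, objects of static storage duration are initialized before the program starts, whether or not they carry an explicit initializer. The names below (`struct node`, `implicit_zero`, `explicit_zero`, `table`, `statics_are_zeroed`) are hypothetical and not taken from the OCaml runtime.

```c
#include <stddef.h>

/* Hypothetical node type, used only for this illustration. */
struct node {
  struct node *next;
  int key;
};

/* No initializer: pointers become null and integers become zero. */
static struct node implicit_zero;

/* Explicit zero initializer: same resulting value. */
static struct node explicit_zero = { NULL, 0 };

/* Arrays of pointers are covered by the same rule. */
static struct node *table[8];

/* Returns 1 if all three objects hold their guaranteed initial values. */
int statics_are_zeroed(void)
{
  if (implicit_zero.next != NULL || implicit_zero.key != 0)
    return 0;
  if (explicit_zero.next != NULL || explicit_zero.key != 0)
    return 0;
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (table[i] != NULL)
      return 0;
  return 1;
}
```

A platform where `statics_are_zeroed()` could return 0 at startup would break a very large number of C programs, which is the point made above.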
I was working from what I know from my (admittedly hobbyist) work with mainframes and my interpretation of the docs. I can see my interpretation is wrong: I saw the issue, took a swing at it, and evidently missed. If you'd like, I am happy to close this PR and look into this further later, unless someone has the means to reproduce the bug.
No problem. Would you be interested in getting access to our System-Z / RHEL test machine to investigate the #13693 issue further? If so, please e-mail me (https://xavierleroy.org/contact.html).
Superseded by #14547
Fixes #13693
Explicitly initialise global root skiplists at runtime using a constructor function, rather than relying solely on static initialisation.
On some s390x systems (RHEL/Fedora), the BSS section of shared libraries may not be properly zeroed, causing the skiplist forward pointers to contain garbage instead of NULL. This leads to a segmentation fault in caml_skiplist_insert during startup.
I don't have access to s390x RHEL/Fedora hardware. Could someone with access to the affected platform verify this fix?