Fix shared library crash on s390x (RHEL/Fedora)#14514

Closed
Zaneham wants to merge 6 commits into ocaml:trunk from Zaneham:s390x

@Zaneham
Contributor

Zaneham commented Feb 3, 2026

Fixes #13693

Explicitly initialise global root skiplists at runtime using a constructor function, rather than relying solely on static initialisation.

On some s390x systems (RHEL/Fedora), the BSS section of shared libraries may not be properly zeroed, causing the skiplist forward pointers to contain garbage instead of NULL. This leads to a segmentation fault in caml_skiplist_insert during startup.
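
A minimal sketch of the constructor approach described above, using a hypothetical toy skiplist rather than the runtime's real `struct skiplist`. The names `toy_skiplist`, `TOY_NUM_LEVELS`, `toy_roots`, `toy_skiplist_init`, `toy_roots_init`, and `toy_skiplist_is_empty` are illustrative assumptions, and `__attribute__((constructor))` is a GCC/Clang extension, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's skiplist; all names are illustrative. */
#define TOY_NUM_LEVELS 17

struct toy_skiplist {
  void *forward[TOY_NUM_LEVELS]; /* forward pointers, NULL when the list is empty */
  int level;                     /* highest level currently in use */
};

/* Static initialisation: in a conforming C environment this is enough on its own. */
static struct toy_skiplist toy_roots = { { NULL }, 0 };

/* Reset every field explicitly so the list is empty regardless of
   what the memory held beforehand. */
static void toy_skiplist_init(struct toy_skiplist *sk)
{
  assert(sk != NULL);
  for (int i = 0; i < TOY_NUM_LEVELS; i++) sk->forward[i] = NULL;
  sk->level = 0;
}

/* GCC/Clang extension: runs when the executable or shared library is
   loaded, before main(). */
__attribute__((constructor))
static void toy_roots_init(void)
{
  toy_skiplist_init(&toy_roots);
}

/* Returns 1 if the list has no element at any level, 0 otherwise. */
static int toy_skiplist_is_empty(const struct toy_skiplist *sk)
{
  for (int i = 0; i < TOY_NUM_LEVELS; i++)
    if (sk->forward[i] != NULL) return 0;
  return sk->level == 0;
}
```

By the time `main` starts, `toy_skiplist_is_empty(&toy_roots)` should hold whether the zeros came from the static initializer or from the constructor. Note that the relative order of constructors across translation units is not guaranteed unless priorities are given.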

I don't have access to s390x RHEL/Fedora hardware. Could someone with access to the affected platform verify this fix?

@MisterDA
Contributor

MisterDA commented Feb 3, 2026

> BSS section of shared libraries may not be properly zeroed

Wouldn't that totally break the semantics of C, which requires objects of static storage duration to be empty-initialized? Besides, these variables have an explicit initializer, though it seems that variables explicitly initialized to zero are also put in the BSS (see GCC's -fno-zero-initialized-in-bss).
Out of curiosity, how can I double-check your claim?
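
One way to probe the C guarantee mentioned here is a small test program, sketched below with hypothetical names (`pair`, `no_initializer`, `zero_initializer`, `pair_is_zero`). It checks only the language-level behaviour: objects with static storage duration start out zeroed whether or not they carry an explicit zero initializer. It says nothing about which section a given toolchain places them in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example type; not taken from the OCaml runtime. */
struct pair {
  void *ptr;
  long count;
};

/* No initializer: static storage duration implies zero-initialisation. */
static struct pair no_initializer;

/* Explicit initializer whose value happens to be all zeros. */
static struct pair zero_initializer = { NULL, 0 };

/* Returns 1 if both fields hold their zero values, 0 otherwise. */
static int pair_is_zero(const struct pair *p)
{
  assert(p != NULL);
  return p->ptr == NULL && p->count == 0;
}
```

Calling `pair_is_zero` on both objects at the start of `main` should return 1. To see where the toolchain places each object, one could compile with and without GCC's `-fno-zero-initialized-in-bss` and compare the symbol tables, for instance with `nm` or `objdump -t`.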

@Zaneham
Contributor Author

Zaneham commented Feb 3, 2026

> BSS section of shared libraries may not be properly zeroed

> Wouldn't that totally break the semantics of C, which requires objects of static storage duration to be empty-initialized? Besides, these variables have an explicit initializer, though it seems that variables explicitly initialized to zero are also put in the BSS (see GCC's -fno-zero-initialized-in-bss). Out of curiosity, how can I double-check your claim?

Fair point, "BSS not zeroed" was an oversimplification. It's more likely an initialisation ordering issue with shared libraries on RHEL/Fedora specifically. I can't reproduce locally since I don't have s390x RHEL/Fedora. The constructor approach should fix any initialisation ordering issues, but I'd appreciate someone with access verifying it.

Co-authored-by: Antonin Décimo <antonin.decimo@gmail.com>
Zaneham requested a review from MisterDA on February 3, 2026, 09:36
@xavierleroy
Contributor

Thanks for pointing out the potential issue with shared library initialization on s390x, and for the proposed fix.

I have the impression there are other global variables in the OCaml runtime system that rely on proper static initialization... How can we find all the global variables that could be affected by the s390x bug?

If the bug is fixed in more recent Linux distributions, or in updated packages, we could also elect not to work around it in OCaml. Any details you could provide on this bug (e.g. a discussion on a RedHat or Ubuntu bug tracker) would be useful.

@Zaneham
Contributor Author

Zaneham commented Feb 6, 2026

> Thanks for pointing out the potential issue with shared library initialization on s390x, and for the proposed fix.
>
> I have the impression there are other global variables in the OCaml runtime system that rely on proper static initialization... How can we find all the global variables that could be affected by the s390x bug?
>
> If the bug is fixed in more recent Linux distributions, or in updated packages, we could also elect not to work around it in OCaml. Any details you could provide on this bug (e.g. a discussion on a RedHat or Ubuntu bug tracker) would be useful.

Hello @xavierleroy I couldn't find a single upstream bug report that matches exactly, but it looks like an IFUNC resolver ordering issue specific to s390x. glibc's "delayed relocation" fix for IFUNC resolvers was https://patchwork.ozlabs.org/project/glibc/patch/20180606140223.4D11F439942E1@oldenburg.str.redhat.com/. s390x got a no-op stub. Some related bugs showing the pattern:

https://bugzilla.redhat.com/show_bug.cgi?id=1398716
https://bugzilla.redhat.com/show_bug.cgi?id=1312462

Since it's a gap in glibc's s390x support rather than a fixed bug, I don't think we can count on distro updates resolving it. I think the constructor approach seems like the safest workaround.

For auditing other affected globals, grep -rn 'SKIPLIST_STATIC_INITIALIZER' runtime/ should catch the main ones.

@xavierleroy
Contributor

> For auditing other affected globals, grep -rn 'SKIPLIST_STATIC_INITIALIZER' runtime/ should catch the main ones

Thanks for the tip, but this is not the question I was asking.

The OCaml runtime system has a bunch of global variables that are statically initialized. Some are skiplists, as in

struct skiplist caml_global_roots = SKIPLIST_STATIC_INITIALIZER;
/* mutable roots, don't know whether old or young */
struct skiplist caml_global_roots_young = SKIPLIST_STATIC_INITIALIZER;
/* generational roots pointing to minor or major heap */
struct skiplist caml_global_roots_old = SKIPLIST_STATIC_INITIALIZER;
/* generational roots pointing to major heap */

and others are not, such as

ocaml/runtime/signals.c

Lines 157 to 160 in 3b09092

CAMLexport void (*caml_enter_blocking_section_hook)(void) =
   caml_enter_blocking_section_default;
CAMLexport void (*caml_leave_blocking_section_hook)(void) =
   caml_leave_blocking_section_default;

Your patch replaces static initialization by initialization by a constructor function in the first example above, but not in the second example above. Why? Are global variables of type struct skiplist the only ones that are affected by the s390x bug on static initialization? What's so special about those global variables?
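
To make the contrast concrete, here is a small sketch with hypothetical names (`demo_list`, `demo_roots`, `demo_hook`, `demo_default_hook`, `demo_statics_ok`). It declares one global whose static initializer is all zeros and one whose initializer is the address of a function, mirroring the two examples above. In a correct C implementation both hold their initial values before any user code runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two kinds of runtime globals. */
struct demo_list {
  void *head;
  int level;
};

static int demo_hook_calls = 0;

/* Default hook: only counts how often it is invoked. */
static void demo_default_hook(void)
{
  demo_hook_calls++;
}

/* Kind 1: the initializer is all zeros. */
struct demo_list demo_roots = { NULL, 0 };

/* Kind 2: the initializer is the address of a function. In
   position-independent code the loader typically fills this in by
   applying a relocation. */
void (*demo_hook)(void) = demo_default_hook;

/* Returns 1 if both globals still hold their static initial values. */
static int demo_statics_ok(void)
{
  return demo_roots.head == NULL
      && demo_roots.level == 0
      && demo_hook == demo_default_hook;
}
```

If static initialization were unreliable in general, one would expect `demo_statics_ok()` to fail for the second kind of global as well, which is the point of the question above.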

> it looks like an IFUNC resolver ordering issue specific to s390x. glibc's "delayed relocation" fix for IFUNC resolvers was https://patchwork.ozlabs.org/project/glibc/patch/20180606140223.4D11F439942E1@oldenburg.str.redhat.com/. s390x got a no-op stub.

For reference, here is the description of the IFUNC feature: https://sourceware.org/glibc/wiki/GNU_IFUNC . It's about load-time resolution of function symbols. I don't see the connection with static initialization of global variables. Please explain like I'm 6.

> Since it's a gap in glibc's s390x support rather than a fixed bug, I don't think we can count on distro updates resolving it. I think the constructor approach seems like the safest workaround.

We're talking about (possibly) incorrect static initialization of global variables at dynamic loading time. That's a standard feature of C since K&R. If it was really broken, it would affect lots of programs and be treated as a highest-priority issue.

@Zaneham
Contributor Author

Zaneham commented Feb 10, 2026

I was working from what I know through my (admittedly hobbyist) work with mainframes and my reading of the docs. I can see my interpretation is wrong: I saw the issue, took a swing at it, and evidently missed. If you'd like, I'm happy to close this PR and look into this further later, unless someone has the means to reproduce the bug.

@xavierleroy
Contributor

No problem. Would you be interested in getting access to our System-Z / RHEL test machine to investigate the #13693 issue further? If so, please e-mail me (https://xavierleroy.org/contact.html).

@nojb
Contributor

nojb commented Mar 2, 2026

Superseded by #14547

nojb closed this on Mar 2, 2026
Development

Successfully merging this pull request may close these issues.

libasmrun_shared.so segfaulting on s390x Fedora/RHEL

6 participants