s390x: use GOT-indirect calls to fix heap corruption with shared runt…#14547
s390x: use GOT-indirect calls to fix heap corruption with shared runt…#14547nojb merged 4 commits intoocaml:trunkfrom
Conversation
|
Nice detective work, thank you!
That's correct, but it's intentional: the standard ABI is too inefficient for a functional language. (We're not wasting 160 bytes for every stack frame.)
We're careful to switch to the C stack before calling C functions (see the
That's a good fix. The only downside is an increase in code size, but it's not too bad: +3% for the OCaml standard library. I didn't try to measure a possible increase in running time. A fix that would avoid increasing code size would be to force "bind now" resolution, as per Yet another option would be to maintain a "red zone" of 160 free bytes at the bottom of the OCaml stacks. There would be a single 160-byte area, not one per stack frame. That would give the dynamic resolver the space it expects. However, unlike the C stack, the OCaml stacks don't grow automatically, so it's still dangerous to run code produced by a C compiler (incl. the dynamic resolver) on the OCaml stacks. So, I'm OK with merging this PR, but I'll wait a bit for others to comment. |
|
Thanks for working on this! I've pushed a commit updating the The failure can be seen on Jenkins for any of the recent trunk runs by searching the logs for |
|
Would another option here be to effectively move all the OCaml-to-C parts of s390.S to the startup object of any program i.e. leave |
|
I thought about treating calls to runtime glue functions (the ones defined in s390x.S) specially, but then I realized that ocamlopt also generates Thinking some more about it, I believe So, the only Emitting load address + indirect call for all OCaml function calls doesn't hurt code size much, from my measurements, but I don't know if there's an impact on execution time. Naively, all these indirect calls can be predicted perfectly, but it could use a lot of branch prediction table entries. |
|
Hello @xavierleroy, apologies on the late reply I ended up being busy with a few things. Thanks for the information on the intentional ABI deviation. I'll actually go and see what narrowing it to just runtime glue calls will do and measure the branch prediction and performance. I'll report back when I have a spot of time. @dra27 Thanks for testing this! I might also give your idea a try as well and I'll report back. My solution is a bit like trying to defuse a bomb with a sledgehammer so I'll try and find a more elegant solution that limits the increase in code size. If anything I'll have learned a thing or two about OCamls codegen. |
Use GOT-indirect (lgrl @got + basr) only for runtime glue functions (caml_call_gc, caml_alloc*, caml_c_call, caml_raise_exn, etc.) where r15 may point at the OCaml fiber stack rather than a C stack frame. Regular OCaml-to-OCaml calls and tailcalls use brasl @plt as before, since these resolve eagerly and never hit the PLT lazy binding trampoline that assumes a C-ABI stack layout. This narrows the original fix (which made all calls GOT-indirect) per review feedback, reducing code size and preserving direct branch prediction for inter-module OCaml calls.
2ef6880 to
85fcbcb
Compare
|
Okay so I tested my fix versus the narrower (and elegant) solution. I did consider @dra27's startup object approach too. It's arguably cleaner but would need several build system changes across multiple files and I'm not quite there yet in comfort doing that I'm still learning the codebase and all of its quirks. Thanks again for your time. |
xavierleroy
left a comment
There was a problem hiding this comment.
Looks very good to me. Thanks !
|
I took the liberty of pushing two tiny tweaks. Good to go once CI passes. |
|
@Octachron: OK to cherry-pick to 5.5 (together with #14609)? |
Re-add missing commit from #14547
|
Yes, as this is a bug fix and we are still in the alpha/beta release phase. |
|
Cherry-pick to 5.5: 392c4ec |
This was a fun one. Classic storage overlay on s390x, on z/OS you'd get a nice S0C4 abend with a dump to work from,
but on Linux all you get is signal 11 and a long night with GDB.
The PLT lazy binding trampoline (_dl_runtime_resolve_vx) saves float registers relative to r15, assuming it's a proper
C stack frame. OCaml's r15 points to its own fiber stack, which lives on the heap. So, when runtime functions like
caml_alloc1 get called through the PLT with libasmrun_shared.so, the trampoline happily saves +infinity and epsilon
right over a skiplist cell. caml_skiplist_insert then tries to follow a forward pointer that's actually IEEE 754
infinity, and down we go.
Linking with -Wl,-z,now also fixes this by forcing eager symbol resolution at load time, but that just sidesteps the
problem. The codegen is still emitting calls that violate the s390x ABI whenever r15 is on the OCaml stack, and
anything else that assumes a valid C frame at r15 would hit the same class of bug.
The underlying ABI violation exists on all distros. It's confirmed on RHEL 8.9 and Fedora 41 but not observed on
Ubuntu 24.04, maybe due to differences in glibc's PLT resolver behaviour or heap layout. If anyone knows more, I would like your input. The bug is latent regardless.
Fix is straightforward: swap @plt calls for GOT-indirect calls (lgrl/basr instead of brasl). The GOT entries resolve at load time, so the trampoline never fires. This is the same pattern already used for data symbols in
emit_load_symbol_addr.
Tested on s390x RHEL 8.9:
Fixes #13693
Related: #14514 - my previous PR attempt which was a "flop"