Skip to content

reintroduce native compiler for s390x#11712

Merged
xavierleroy merged 24 commits intoocaml:trunkfrom
AlekseiNikiforovIBM:s390x_support
May 23, 2023
Merged

reintroduce native compiler for s390x#11712
xavierleroy merged 24 commits intoocaml:trunkfrom
AlekseiNikiforovIBM:s390x_support

Conversation

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor

It's first version of reintroduction of native compiler for s390x.

There are still some issues remain, some stuff to improve, and some questions. Help is appreciated!

  1. ocaml stack size is increased. This is a workaround for stack exhaustion crashes, like when running command "../../ocamlc.opt -nostdlib -I ../../stdlib -g -c -for-pack Dynlink_compilerlibs -strict-sequence -principal -absname -w +a-4-9-40-41-42-44-45-48 -warn-error +A -bin-annot -strict-formats -I byte -I dynlink_compilerlibs -o dynlink_compilerlibs/clflags.cmo dynlink_compilerlibs/clflags.ml" from "otherlibs/dynlink" directory, which is running when building native compiler.

  2. CFI is not working correctly yet.

  3. It looks like ocaml doesn't follow s390x ABI when generating s390x assembly code. It often doesn't allocate 160 bytes on stack. For example in function "camlStdlib.entry":

Dump of assembler code for function camlStdlib.entry:
   0x00000000014cf890 <+0>:     lay     %r15,-8(%r15)
   0x00000000014cf896 <+6>:     stg     %r14,0(%r15)
...
   0x00000000014d1268 <+6616>:  lay     %r11,-56(%r11)
   0x00000000014d126e <+6622>:  clg     %r11,0(%r10)
   0x00000000014d1274 <+6628>:  clg     %r11,0(%r10)
   0x00000000014d127a <+6634>:  jgl     0x14d1558 <camlStdlib.entry+7368>
   0x00000000014d1280 <+6640>:  la      %r3,8(%r11)
   0x00000000014d1284 <+6644>:  lghi    %r4,6144
   0x00000000014d1288 <+6648>:  stg     %r4,-8(%r3)
...
   0x00000000014d1534 <+7332>:  lg      %r14,0(%r15)
   0x00000000014d153a <+7338>:  la      %r15,8(%r15)
   0x00000000014d153e <+7342>:  br      %r14
   0x00000000014d1540 <+7344>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d1546 <+7350>:  jg      0x14d1440 <camlStdlib.entry+7088>
   0x00000000014d154c <+7356>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d1552 <+7362>:  jg      0x14d139c <camlStdlib.entry+6924>
   0x00000000014d1558 <+7368>:  brasl   %r14,0x159085c <caml_call_gc>
=> 0x00000000014d155e <+7374>:  jg      0x14d1280 <camlStdlib.entry+6640>
   0x00000000014d1564 <+7380>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d156a <+7386>:  jg      0x14d056a <camlStdlib.entry+3290>
   0x00000000014d1570 <+7392>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d1576 <+7398>:  jg      0x14d04d6 <camlStdlib.entry+3142>
   0x00000000014d157c <+7404>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d1582 <+7410>:  jg      0x14d0442 <camlStdlib.entry+2994>
   0x00000000014d1588 <+7416>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d158e <+7422>:  jg      0x14d03ae <camlStdlib.entry+2846>
   0x00000000014d1594 <+7428>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d159a <+7434>:  jg      0x14d031a <camlStdlib.entry+2698>
   0x00000000014d15a0 <+7440>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d15a6 <+7446>:  jg      0x14d0286 <camlStdlib.entry+2550>
   0x00000000014d15ac <+7452>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d15b2 <+7458>:  jg      0x14cfdae <camlStdlib.entry+1310>
   0x00000000014d15b8 <+7464>:  brasl   %r14,0x159085c <caml_call_gc>
   0x00000000014d15be <+7470>:  jg      0x14cfcb0 <camlStdlib.entry+1056>

Between call to entry to function camlStdlib.entry on 0x00000000014cf890 and caml_call_gc on 0x00000000014d1558 only 16 bytes on stack (register %r15) are allocated while it should be 160 bytes at least according to s390x ABI. Is this deviation from ABI intended? Changes like these affect how CFI has to be updated.

  1. When generating code for Iextcall operand, the stack is allocated beforehand. For s390x it actually allocates 160 bytes in that case. The problem is that in one of cases it has to switch stack before calling function, and due to that stack allocation happens on wrong stack, not on the one it switches to, but on the one it switches from. As workaround, I've explicitly allocated 160 bytes on that stack. In addition to that, I've allocated 8 more bytes to save old stack pointer. Can't use any other register since it's value is already being used for something else important. At least I didn't find such non-volatile register on s390x yet.

@xavierleroy
Copy link
Copy Markdown
Contributor

Thanks for your work to restore s390x support in OCaml 5, much appreciated.

Re: calling conventions, on pretty much every target, ocamlopt uses its own calling conventions to call OCaml functions, because the standard ELF ABI calling conventions are designed for C and not well suited to a functional language. Typically, OCaml's custom conventions will use more registers for parameter passing and try to reduce stack usage to a mimimum, because we can have deeply nested recursive calls.

For s390x, this means in particular that the 160 reserved bytes at the bottom of the stack are reserved only when calling a C function from OCaml. In OCaml 4, this is ensured by Proc.loc_external_arguments, which requests 160 bytes + whatever is needed for passing arguments on the stack. Then, Istackoffset instructions are generated, bracketing the Iextcall instruction, thus reserving the stack space for the duration of the C call.

There are also a few places in runtime/s390x.S where C functions are called, e.g. caml_garbage_collection. There, the call is also bracketed by lay %r15, -160(%r15) / lay %r15, 160(%r15) stack adjustments.

In OCaml 5, the C stack and the OCaml stack differ, so the stack adjustments are a parameter to Iextcall and are to be performed after the swich to the C stack. Likewise, for the s390x.S functions, I'd expect the adjustments to take place after the stack switch.

Let me konw if you have a more specific question about this stack business.

@xavierleroy
Copy link
Copy Markdown
Contributor

ocaml stack size is increased. This is a workaround for stack exhaustion crashes,

Could you explain what's going on? Is it the OCaml stack that overflows (suprisingly, because it's much larger in OCaml 5 than in OCaml 4) or the C stack? Do you have a gdb backtrace? What is the workaround you mention?

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Could you explain what's going on? Is it the OCaml stack that overflows (suprisingly, because it's much larger in OCaml 5 than in OCaml 4) or the C stack? Do you have a gdb backtrace? What is the workaround you mention?

Crashing command is ../../ocamlc.opt -nostdlib -I ../../stdlib -g -c -for-pack Dynlink_compilerlibs -strict-sequence -principal -absname -w +a-4-9-40-41-42-44-45-48 -warn-error +A -bin-annot -strict-formats -I byte -I dynlink_compilerlibs -o dynlink_compilerlibs/clflags.cmo dynlink_compilerlibs/clflags.ml. But it works fine on amd64. I think it's ocaml stack that's overflowing.

Workaround is here:
ac037e7

Crash looks like this:

Program received signal SIGSEGV, Segmentation fault.
0x0000000001538904 in caml_final_do_young_roots (act=act@entry=0x1547c90 <oldify_one>, fflags=fflags@entry=SCANNING_ONLY_YOUNG_VALUES, fdata=fdata@entry=0x3ffffffe8d0, d=d@entry=0x18edb90, do_last_val=do_last_val@entry=0) at runtime/finalise.c:221
221         Call_action (act, fdata, f->first.table[i].fun);
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.35-4.fc36.s390x
(gdb) bt
#0  0x0000000001538904 in caml_final_do_young_roots (act=act@entry=0x1547c90 <oldify_one>, fflags=fflags@entry=SCANNING_ONLY_YOUNG_VALUES, fdata=fdata@entry=0x3ffffffe8d0, d=d@entry=0x18edb90,
    do_last_val=do_last_val@entry=0) at runtime/finalise.c:221
#1  0x000000000154895a in caml_empty_minor_heap_promote (domain=domain@entry=0x18edb90, participating_count=participating_count@entry=1, participating=participating@entry=0x18de118 <stw_request+64>)
    at runtime/minor_gc.c:585
#2  0x0000000001548cae in caml_stw_empty_minor_heap_no_major_slice (domain=domain@entry=0x18edb90, participating_count=<optimized out>, participating=0x18de118 <stw_request+64>, unused=<optimized out>)
    at runtime/minor_gc.c:695
#3  0x0000000001548e44 in caml_stw_empty_minor_heap (domain=0x18edb90, unused=<optimized out>, participating_count=<optimized out>, participating=<optimized out>) at runtime/minor_gc.c:735
#4  0x0000000001533cbe in caml_try_run_on_all_domains_with_spin_work (handler=handler@entry=0x1548e20 <caml_stw_empty_minor_heap>, data=<optimized out>, data@entry=0x0, leader_setup=<optimized out>,
    leader_setup@entry=0x1547a30 <caml_empty_minor_heap_setup>, enter_spin_callback=<optimized out>, enter_spin_callback@entry=0x1547c20 <caml_do_opportunistic_major_slice>,
    enter_spin_data=enter_spin_data@entry=0x0) at runtime/domain.c:1462
#5  0x0000000001548f3e in caml_try_stw_empty_minor_heap_on_all_domains () at runtime/minor_gc.c:771
#6  caml_empty_minor_heaps_once () at runtime/minor_gc.c:791
#7  0x0000000001533770 in caml_poll_gc_work () at runtime/domain.c:1518
#8  0x000000000154d478 in caml_do_pending_actions_exn () at runtime/signals.c:301
#9  0x0000000001548fa0 in caml_alloc_small_dispatch (dom_st=dom_st@entry=0x18edb90, wosize=<optimized out>, flags=flags@entry=3, nallocs=<optimized out>, encoded_alloc_lens=0x1896fcb "\t\003\003")
    at runtime/minor_gc.c:813
#10 0x0000000001555cfe in caml_garbage_collection () at runtime/signals_nat.c:89
#11 <signal handler called>
#12 0x000003fffffff6ec in ?? ()
PC not saved
(gdb) l
216       uintnat i;
217       struct caml_final_info *f = d->final_info;
218
219       CAMLassert (f->first.old <= f->first.young);
220       for (i = f->first.old; i < f->first.young; i++) {
221         Call_action (act, fdata, f->first.table[i].fun);
222         Call_action (act, fdata, f->first.table[i].val);
223       }
224
225       CAMLassert (f->last.old <= f->last.old);
(gdb) print *f
$1 = {first = {table = 0x3ffed4c7df8, old = 3, young = 4397740953328, size = 4397741080000}, updated_first = 19930832, last = {table = 0x3, old = 4397733761864, young = 4397733734888, size = 19930832},
  updated_last = 4397732843816, todo_head = 0x3ffed4c7e10, todo_tail = 0x3, running_finalisation_function = 4397740953328, next = 0x3ffedcb7dc0}

The contents of *f are definitely wrong.

It looks like there are a lot of nested calls, and stack is eventually exhausted and overwritten, by instructions before marked one:

The place where stack is overwritten looks like this:
(gdb) bt
camlMtype.strengthen_lazy_sig$27_634 () at typing/mtype.ml:63
63        | (SigL_value(_, _, _) as sigelt) :: rem ->
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) disassemble
Dump of assembler code for function camlMtype.strengthen_lazy_sig$27_634:
   0x0000000001301778 <+0>:     lay     %r15,-80(%r15)
   0x000000000130177e <+6>:     stg     %r14,72(%r15)
   0x0000000001301784 <+12>:    lgr     %r7,%r5
   0x0000000001301788 <+16>:    clg     %r11,0(%r10)
   0x000000000130178e <+22>:    jgl     0x1301f8a <camlMtype.strengthen_lazy_sig$27_634+2066>
   0x0000000001301794 <+28>:    tmll    %r4,1
   0x0000000001301798 <+32>:    jge     0x13017ae <camlMtype.strengthen_lazy_sig$27_634+54>
   0x000000000130179e <+38>:    lg      %r14,72(%r15)
   0x00000000013017a4 <+44>:    lghi    %r2,1
   0x00000000013017a8 <+48>:    la      %r15,80(%r15)
   0x00000000013017ac <+52>:    br      %r14
   0x00000000013017ae <+54>:    lg      %r6,0(%r4)
   0x00000000013017b4 <+60>:    stg     %r2,48(%r15)
   0x00000000013017ba <+66>:    stg     %r3,56(%r15)
   0x00000000013017c0 <+72>:    stg     %r4,40(%r15)
=> 0x00000000013017c6 <+78>:    llgc    %r5,-1(%r6)
   0x00000000013017cc <+84>:    stg     %r6,32(%r15)
   0x00000000013017d2 <+90>:    stg     %r7,64(%r15)
   0x00000000013017d8 <+96>:    larl    %r0,0x155e750
   0x00000000013017de <+102>:   sllg    %r1,%r5,2
   0x00000000013017e4 <+108>:   agr     %r1,%r0
   0x00000000013017e8 <+112>:   lgf     %r1,0(%r1)
   0x00000000013017ee <+118>:   agr     %r1,%r0
   0x00000000013017f2 <+122>:   br      %r1
   0x00000000013017f4 <+124>:   lg      %r5,0(%r6)
   0x00000000013017fa <+130>:   lg      %r6,8(%r6)
...

@gasche
Copy link
Copy Markdown
Member

gasche commented Nov 14, 2022

Your workaround changes Stack_init_bsize, but this is only the initial size of the OCaml stack which should be extended as necessary -- see caml_call_realloc_stack in the runtime and how it is generated in function prologues by existing emitters. I haven't looked at your code, but could such stack-limit check be missing from the s390x backend or be adapted incorrectly?

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Yes, it could be the case.

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Implemented stack reallocation and a bit of other stuff. The only missing thing I'm aware of is CFI. I'll try to implement it as well.

@xavierleroy
Copy link
Copy Markdown
Contributor

CFI is nice to have when debugging the OCaml runtime system itself, but we can live without it, so don't go out of your way to implement it. When you're happy with your code, could you please remove the "draft" status? Then we'll try to find reviewers.

@AlekseiNikiforovIBM AlekseiNikiforovIBM changed the title [draft] reintroduce native compiler for s390x reintroduce native compiler for s390x Dec 6, 2022
@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

I think everything is ready for review, although CFI could be further improved.

@kayceesrk kayceesrk self-assigned this Dec 7, 2022
@kayceesrk
Copy link
Copy Markdown
Contributor

Is there a good reference for the s390x memory model?

@xavierleroy
Copy link
Copy Markdown
Contributor

Is there a good reference for the s390x memory model?

It seems to be TSO with a twist, see https://check.cs.princeton.edu/papers/dlustig_thesis.pdf section 3.3.1.

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Is there a way to dump/save all assembly code generated when building ocaml compiler and libraries? It could help to further improve CFI.

@bluddy
Copy link
Copy Markdown
Contributor

bluddy commented Dec 7, 2022

Just to understand, why do we want to add support for a seemingly long-discontinued machine?

@xavierleroy
Copy link
Copy Markdown
Contributor

xavierleroy commented Dec 7, 2022

s390x is a historical name. The current name for this architecture is IBM Z and it's definitely not discontinued.

@xavierleroy
Copy link
Copy Markdown
Contributor

Is there a way to dump/save all assembly code generated when building ocaml compiler and libraries?

You can try to set the OCAMLPARAM environment variable:

OCAMLPARAM="s=1,_" make

This will leave .s files all around the place...

@xavierleroy
Copy link
Copy Markdown
Contributor

Did you run the testsuite recently? I'm getting a bunch of failed tests in native code (segmentation faults). This is on a LinuxONE VM running RHEL 8.5.

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Yeah, I ran it, but I did it both on s390x and amd64. Both of them had similar amount of failures. Should all tests be fixed? Or should I take a look only at specific ones?

@gasche
Copy link
Copy Markdown
Member

gasche commented Dec 8, 2022

To my knowledge, there should be no test failure on trunk, so "0 failures" would be the target.

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Rechecked it on current trunk, and got 0 failed on amd64 as well. I'll work on this pull request further to ensure tests are passing.

@xavierleroy
Copy link
Copy Markdown
Contributor

With the tip of your branch I'm seeing 154 failed tests. One of them seems to loop, so you should run the testsuite with a short timeout: TIMEOUT=60 make -C testsuite all.

@lthls lthls mentioned this pull request Dec 9, 2022
@lthls
Copy link
Copy Markdown
Contributor

lthls commented Dec 11, 2022

It seems that tests involving out-of-bounds array accesses are failing systematically. I haven't managed to get gdb running correctly yet so I don't know exactly what goes wrong. But you can trigger a bug with very small programs:

let () = try [||].(0) with _ -> ()

I get an Illegal instruction error in my qemu setup.

runtime/s390x.S Outdated
stg %r14, 0(%r15)
CFI_OFFSET(14, -168)
ENTER_FUNCTION
LEA_VAR(caml_array_bound_error_asm, %r2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LEA_VAR(caml_array_bound_error_asm, %r2)
LEA_VAR(caml_array_bound_error_asm, ADDITIONAL_ARG)

This should fix a few errors with array bounds checks

@lthls
Copy link
Copy Markdown
Contributor

lthls commented Dec 12, 2022

The next error I'm seeing is a segfault while scanning the stack during the minor collection triggered by domain termination. I suspect the stack is supposed to be empty in that case, but it's not detected as such and the frame descriptor lookup returns a null pointer that is then dereferenced.
I'm done debugging for today; I likely won't have time to investigate further in the next few days, but if someone wants to dig into this I might be able to provide a few more details.

@lthls
Copy link
Copy Markdown
Contributor

lthls commented Dec 16, 2022

I've created a branch with the fixes I've made here.
For now it fixes the issue with array bound errors and one issue with stacks getting wrong when returning from a callback (this fixes the issues on domain termination). There still remain other failures in the testsuite that I haven't investigated yet.
@AlekseiNikiforovIBM if you haven't already made the same patches you may want to cherry-pick my extra commit on your branch.

@lthls
Copy link
Copy Markdown
Contributor

lthls commented Jan 9, 2023

I've updated my branch (still here).
I now have only two failing tests:

  • tests/memory-model/forbidden.ml, which reports some forbidden behaviours (I can share the test report, or run it with more verbose flags).
  • tests/unboxed-primitive-args/test.ml, which fails with an assembler error (operand out of range). It's a big generated file, where the entry function jumps to a label at the end of the function and the relative offset of the jump is too big.

I will try to look at the second issue when I have more time, but I'll need help with the first one.

@AlekseiNikiforovIBM
Copy link
Copy Markdown
Contributor Author

Thanks, I'm also still working on fixing tests. Will take a look at your branch.

@xavierleroy
Copy link
Copy Markdown
Contributor

CLA received in good order. check-typo honored. Time for merging !

@xavierleroy xavierleroy merged commit 3fefff5 into ocaml:trunk May 23, 2023
@xavierleroy
Copy link
Copy Markdown
Contributor

CI shows that the "forbidden" and (to a lesser extent) "publish" tests take a lot of time on our 2-core test VM. Could this mean we need an s390x definition for cpu_relax in caml/platform.h ? The Linux kernel has a "diag 44" instruction for cpu_relax_yield but nothing for cpu_relax...

@xavierleroy
Copy link
Copy Markdown
Contributor

.. and smp_yield_cpu is either a "diag 9c" or "diag 44" instruction...

@xavierleroy
Copy link
Copy Markdown
Contributor

CI also fails on testsuite/tests/callback/callback_effects_gc.ml , but not in flambda mode, go figure...

@xavierleroy
Copy link
Copy Markdown
Contributor

Beginning of an explanation: this test is run with a very small minor heap (s=512 parameter). However, this parameter is also in force when the test is compiled using ocamlopt.opt ! It's ocamlopt.opt that segfaults, but not when compiled with flambda, perhaps because its allocation pattern is different.

When running OCAMLRUNPARAM=v=511,s=512 ./ocamlopt.opt -v, the segfault occurs shortly after "ref_table threshold crossed". s=1024 or ws=2048still crosses the threshold and crash. s=4096 does not.

@xavierleroy
Copy link
Copy Markdown
Contributor

I think I found it: in caml_alloc{1,2,3,N} there are signed comparisons that should be unsigned, otherwise comparison with the "stop-the-world" value for young_limit (0xFFFF...FFFF) gives the wrong result.

Octachron pushed a commit that referenced this pull request May 25, 2023
Update the s390x native-code generator and the s390x.S assembly glue for OCaml 5.

Co-authored-by: Vincent Laviron <vincent.laviron@gmail.com>
(cherry picked from commit 3fefff5)
@Octachron
Copy link
Copy Markdown
Member

Cherry-picked on 5.1 as 6a43309 with the companion fix #12258 (f1dd420).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants