
Assorted fixes found while restarting the Jenkins CI #11037

Merged: xavierleroy merged 12 commits into ocaml:trunk from xavierleroy:make-it-testable on Feb 21, 2022

Conversation

@xavierleroy (Contributor) commented Feb 20, 2022

This PR is best reviewed commit by commit. Here is a summary:

Runtime system

configure tweaks

  • 1ba1118: Report the MSVC, Cygwin, and Mingw-32 ports as not currently supported. (This way, the Jenkins Web interface can show them as "unstable" in tasteful yellow rather than as "failed" in stress-inducing red.) (Edit: Mingw-32 almost works in bytecode-only mode, so let's keep it supported for the moment.)
  • b7a12db: GCC < 4.9 is not supported, for lack of C11 atomics.
  • bf1593f: Revised determination of PACKLD_FLAGS for bytecode-only builds. (Affects the Power ports.)

Jenkins scripts

  • 1a89bab: Update the Jenkins CI "main" script to handle bytecode-only builds better

Fixing the test suite

  • 1a09e47: Make tests/callback/test3.ml work in 32 bits too
  • c9e846c: Scale down some tests in tests/parallel
  • 69a0a8c: tests/callback: fix interleaving of C and OCaml outputs

@kayceesrk (Contributor):

Regarding GCC < 4.9 not being supported, this comment

/* Support for versions of gcc which have built-in atomics but do not
expose stdatomic.h (e.g. gcc 4.8) */

refers to GCC 4.8. Should this support also be removed?

@kayceesrk (Contributor) left a review comment:

The changes look good to me. Regarding

Scale down some tests in tests/parallel

@ctk21 has in the past argued that we keep the long-running tests, as they bring out hard-to-trigger bugs. Perhaps he may have something to add here.

@Octachron (Member):

We probably need at some point a collection of time-consuming tests that are run with a reduced and controlled frequency.

@ctk21 (Contributor) commented Feb 21, 2022

Scale down some tests in tests/parallel

@ctk21 has in the past argued that we keep the long-running tests as they bring out hard to trigger bugs. Perhaps he may have something to add here.

The motivation for the parallel join and domain_parallel_spawn_burn tests is to cause overload and ensure things still complete (in reasonable time). Our experience has been that they are very effective for catching concurrency bugs involving domain lifecycle and/or minor/major GC interactions. They also stress how the runtime spin wait sections and the OS scheduler interact.

There is a history of controversy around these tests, usually over the time spent waiting for results and the fact that they don't adapt to the machine they run on. However, I think we need to torture some machine(s) somewhere as part of the CI process (preferably within the PR process). The tests are crude, and better ways to torture-test a smaller number of machines may exist.

I agree with the scale-down to run on 32-bit hardware because of the 16-domain limit. In c9e846c the number of iterations and the allocation sizes are changed; is there a strong motivation for those changes?

My concern is that we are sweeping bugs under the carpet (and weakening the tests' protection against future bugs). I feel that these tests running longer than 10 minutes (with the current parameters) is surfacing a bug (e.g. #10875 (comment)) that warrants investigation (although that investigation might not lie in the scope of this PR).

@xavierleroy (Author) commented Feb 21, 2022

Don't panic: I'm looking at ways to run the parallel tests with more demanding parameters on selected machines. Most Jenkins CI machines have only two cores and 2 GB of RAM, so the big tests cause timeouts, especially when run in bytecode mode. A test that times out is useless.

@xavierleroy (Author):

So, here is my proposal: the tests can get an idea of how many cores and how much memory they can use by looking up the OCAML_TEST_SIZE environment variable.

  • Commit adacfee adds documentation on the meaning of the variable
  • Commit ee27a9f changes the tests in tests/parallel to use the original parameters (not reduced) if OCAML_TEST_SIZE allows. (The reduced parameters are still the default.)
  • Commit 30e1669, done with @maranget, changes the tests in tests/memory-model to choose the number of cores used taking OCAML_TEST_SIZE into account. (The default is still 2, but 4 and 8 can be selected if possible.)
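As an illustration of how a test might consume the variable, here is a minimal OCaml sketch. The size classes and the mapping to core counts below are assumptions made for the example; the authoritative meaning of OCAML_TEST_SIZE is whatever commit adacfee documents.

```ocaml
(* Hypothetical sketch: pick a core count from OCAML_TEST_SIZE.
   The mapping below (1 -> 2 cores, 2 -> 4 cores, >= 3 -> 8 cores)
   is an assumed encoding for illustration only. *)
let test_size () =
  match Sys.getenv_opt "OCAML_TEST_SIZE" with
  | None -> 1                                  (* default: smallest class *)
  | Some s -> (try int_of_string s with Failure _ -> 1)

let num_cores () =
  match test_size () with
  | n when n >= 3 -> 8
  | 2 -> 4
  | _ -> 2
```

A test would then spawn `num_cores ()` domains instead of a hard-coded count, so the same source scales from the 2-core CI default up to larger machines.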

@xavierleroy (Author) commented Feb 21, 2022

By the way, here is the current status of Jenkins CI:

  • Under Unix (Linux, macOS, BSD, OmniOS):
    • OpenBSD 6.7 and CentOS 7 not supported because GCC is too old
    • Other x86_64 ports work fine, in native code and in bytecode
    • Other architectures (i386, ARM, ARM64, Power, Z) work fine, in bytecode only.
  • Under Windows:
    • Mingw-64 almost working in native code and in bytecode (some tests randomly failing, some compiler crashes)
    • Mingw-32 almost working in bytecode only (likewise)
    • Cygwin and MSVC ports not supported yet.

@avsm (Member) commented Feb 21, 2022

OpenBSD and CentOS 7 not supported because GCC is too old

What version of OpenBSD/amd64 is installed? The default cc has been clang for some time now:

$ cc -v
OpenBSD clang version 13.0.0
Target: amd64-unknown-openbsd7.0
Thread model: posix
InstalledDir: /usr/bin
$ uname -a
OpenBSD duck.recoil.org 7.0 GENERIC.MP#276 amd64

@xavierleroy (Author):

I should have written "the OpenBSD virtual machine we have in our CI", which is quite old (OpenBSD 6.7...). We can try a fresh install of OpenBSD 7.0 some time in the future.

@xavierleroy (Author):

Updating the table of results now that ARM64 native-code generation is back:

  • ARM64/Linux: works fine in native-code and in bytecode
  • ARM64/macOS: the "unwind" test fails, probably due to a lack of CFI annotations in arm64.S, but otherwise it works fine in native code and in bytecode.

Commit messages included in this PR:

  • …ctures

    (This is a temporary workaround until ocaml#10971 is resolved.)

    On 32-bit architectures, the minor heap reservation has to fit a 32-bit address space.

    On some 64-bit systems (e.g. OpenBSD, OmniOS), there is no over-commit, so the reservation has to fit the RAM + swap space.

    Take Minor_heap_max to be 1 Mwords (4 times Minor_heap_def) and limit domains to 16 on 32-bit architectures, instead of 128 on 64-bit architectures.

  • Because of output-complete-obj, PACKLD_FLAGS must be correctly set even for a bytecode-only build, where `$arch = none`. This affects only the 3 Power ports.

  • …etter

    - Don't do "make alldepend" if the native-code compiler is unsupported
    - No need to test flambda in bytecode-only builds

  • We're getting timeouts on some CI machines, and on 32-bit architectures only 16 domains are available.

    - Reduce the number of domains used by join.ml and domain_parallel_spawn_burn.ml.
    - Reduce data size and duration for domain_parallel_spawn_burn.ml and domain_serial_spawn_burn.ml.

  • The C side was not flushing explicitly, causing different interleavings with glibc and with musl.

  • So that they can run longer and use more cores on selected CI machines.
@xavierleroy xavierleroy merged commit 2faed92 into ocaml:trunk Feb 21, 2022
@xavierleroy xavierleroy deleted the make-it-testable branch February 21, 2022 18:15
@gasche (Member) commented Feb 28, 2022

Hi @ctk21 and @xavierleroy; I'm looking at domain_parallel_spawn_burn.ml and I have two naive questions:

  1. There seems to be an obvious race on the running variable, which is a non-atomic reference written in one domain and read in a loop in three others. Is there a reason for not making this variable atomic?

  2. On my machine, the test is noticeably faster in bytecode (4s real time, 11s user time) than in native code (18s real time, 50s user time). Is it just me? Do we understand why that is the case?

@ctk21 (Contributor) commented Mar 1, 2022

  1. Not that I can think of. Do we even need the running variable outside a DEBUG build for the asserts?

  2. The test has highly variable runtime across runs on the same machine, across operating systems, across machines of differing sizes, and between bytecode and native code. It has always bugged me, and it is always on the "must try to understand this better" list.

@gasche (Member) commented Mar 1, 2022

Do we even need the running variable outside a DEBUG build for the asserts?

(I think there is a misunderstanding: you are talking about the s->running variable of interruptors, while I'm talking about a running : bool ref reference in the test code. Sorry for being unclear in the first place.)

The code is currently as follows:

let test_parallel_spawn () =
  for i = 1 to 20 do
    Array.init num_domains (fun _ -> Domain.spawn (fun () -> burn [0]))
    |> Array.iter join
  done

let () =
  let running = ref true in
  let rec run_until_stop fn () =
    while !running do
      fn ();
    done
  in
  let domain_minor_gc = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.minor ())) in
  let domain_major_gc = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.major ())) in
  test_parallel_spawn ();
  running := false;
  join domain_minor_gc;
  join domain_major_gc

If I understand it, the idea is to run Gc.major and Gc.minor in an infinite loop, while other domains are doing a fixed amount of work, and then we stop the Gc.major and Gc.minor domains when the work is done. So we do need running, but currently it is used in a very racy way.

@ctk21 (Contributor) commented Mar 1, 2022

Oh, sorry about the context mix-up; running in the test can become Atomic.get/Atomic.set.
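That suggestion can be sketched as follows. This is a minimal, self-contained version, not the actual test file: `burn` is stubbed out as a stand-in for the test's busy-work function, and only one GC-churning domain is spawned instead of the test's two (requires OCaml >= 5.0 for Domain and Atomic).

```ocaml
(* Stand-in for the test's busy-work function. *)
let burn _ = ()

(* The stop flag as an Atomic.t instead of a plain bool ref,
   so the write in the main domain is safely published to readers. *)
let running = Atomic.make true

let run_until_stop fn () =
  while Atomic.get running do fn () done

let () =
  (* Churn the minor GC in a background domain until told to stop. *)
  let d = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.minor ())) in
  (* ... the fixed amount of work done by the other domains goes here ... *)
  Atomic.set running false;   (* signal the background domain to stop *)
  Domain.join d
```

The structure mirrors the test's run_until_stop loop; only the flag's type and its accesses change.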
