
Assorted fixes found while restarting the Jenkins CI #11037

Merged: xavierleroy merged 12 commits into ocaml:trunk from xavierleroy:make-it-testable on Feb 21, 2022

Conversation

@xavierleroy (Contributor) commented Feb 20, 2022

This PR is best reviewed commit by commit. Here is a summary:

Runtime system

configure tweaks

  • 1ba1118: Report the MSVC, Cygwin, and Mingw-32 ports as not currently supported. (This way, the Jenkins Web interface can show them as "unstable" in tasteful yellow rather than as "failed" in stress-inducing red.) (Edit: Mingw-32 almost works in bytecode-only mode, so let's keep it supported for the moment.)
  • b7a12db: GCC < 4.9 is not supported, for lack of C11 atomics.
  • bf1593f: Revised determination of PACKLD_FLAGS for bytecode-only builds. (Affects the Power ports.)

Jenkins scripts

  • 1a89bab: Update the Jenkins CI "main" script to handle bytecode-only builds better

Fixing the test suite

  • 1a09e47: Make tests/callback/test3.ml work in 32 bits too
  • c9e846c: Scale down some tests in tests/parallel
  • 69a0a8c: tests/callback: fix interleaving of C and OCaml outputs

@kayceesrk (Contributor):

Regarding GCC < 4.9 not being supported, this comment

/* Support for versions of gcc which have built-in atomics but do not
expose stdatomic.h (e.g. gcc 4.8) */

refers to GCC 4.8. Should this support also be removed?

@kayceesrk (Contributor) left a review comment:

The changes look good to me. Regarding

Scale down some tests in tests/parallel

@ctk21 has in the past argued that we keep the long-running tests, as they bring out hard-to-trigger bugs. Perhaps he may have something to add here.

@Octachron (Member):

We probably need at some point a collection of time-consuming tests that are run with a reduced and controlled frequency.

@ctk21 (Contributor) commented Feb 21, 2022

Scale down some tests in tests/parallel

@ctk21 has in the past argued that we keep the long-running tests as they bring out hard to trigger bugs. Perhaps he may have something to add here.

The motivation for the parallel join and domain_parallel_spawn_burn tests is to cause overload and ensure things still complete (in reasonable time). Our experience has been that they are very effective for catching concurrency bugs involving domain lifecycle and/or minor/major GC interactions. They also stress how the runtime spin wait sections and the OS scheduler interact.

There is a history of controversy around these tests, usually over the time spent waiting for results and the fact that they don't adapt to the machine they run on. However, I think we need to torture some machine(s) somewhere as part of the CI process (preferably within the PR process). The tests are crude, and better ways to torture-test a smaller number of machines may exist.

I agree with the scale-down to run on 32-bit hardware because of the 16-domain limit. In c9e846c the number of iterations and the allocation sizes are changed; is there a strong motivation for those changes?

My concern is that we are sweeping bugs under the carpet (and weakening the tests' protection against future bugs). I feel that these tests running longer than 10 minutes (with the current parameters) is surfacing a bug (e.g. #10875 (comment)) that warrants investigation (although that investigation might not lie in the scope of this PR).

@xavierleroy (Author) commented Feb 21, 2022

Don't panic: I'm looking at ways to run the parallel tests with more demanding parameters on selected machines. Most Jenkins CI machines have only two cores and 2 GB of RAM, so the big tests cause timeouts, especially when run in bytecode mode. A test that times out is useless.

@xavierleroy (Author):

So, here is my proposal: the tests can get an idea of how many cores and how much memory they can use by looking up the OCAML_TEST_SIZE environment variable.

  • Commit adacfee adds documentation on the meaning of the variable
  • Commit ee27a9f changes the tests in tests/parallel to use the original parameters (not reduced) if OCAML_TEST_SIZE allows. (The reduced parameters are still the default.)
  • Commit 30e1669, done with @maranget, changes the tests in tests/memory-model to choose the number of cores used taking OCAML_TEST_SIZE into account. (The default is still 2, but 4 and 8 can be selected if possible.)
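As an illustration of how a test might consume the variable, here is a minimal OCaml sketch. The size classes and the mapping to core counts below are assumptions made for the example; the authoritative meaning of OCAML_TEST_SIZE is whatever commit adacfee documents.

```ocaml
(* Hypothetical sketch: pick a core count from OCAML_TEST_SIZE.
   The mapping below (1 -> 2 cores, 2 -> 4 cores, >= 3 -> 8 cores)
   is an assumed encoding for illustration only. *)
let test_size () =
  match Sys.getenv_opt "OCAML_TEST_SIZE" with
  | None -> 1                                  (* default: smallest class *)
  | Some s -> (try int_of_string s with Failure _ -> 1)

let num_cores () =
  match test_size () with
  | n when n >= 3 -> 8
  | 2 -> 4
  | _ -> 2
```

A test would then spawn `num_cores ()` domains instead of a hard-coded count, so the same source scales from the 2-core CI default up to larger machines.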

@xavierleroy (Author) commented Feb 21, 2022

By the way, here is the current status of Jenkins CI:

  • Under Unix (Linux, macOS, BSD, OmniOS):
    • OpenBSD 6.7 and CentOS 7 not supported because GCC is too old
    • Other x86_64 ports work fine, in native code and in bytecode
    • Other architectures (i386, ARM, ARM64, Power, Z) work fine, in bytecode only.
  • Under Windows:
    • Mingw-64 almost working in native code and in bytecode (some tests randomly failing, some compiler crashes)
    • Mingw-32 almost working in bytecode only (likewise)
    • Cygwin and MSVC ports not supported yet.

@avsm (Member) commented Feb 21, 2022

OpenBSD and CentOS 7 not supported because GCC is too old

What version of OpenBSD/amd64 is installed? The default cc has been clang for some time now:

$ cc -v
OpenBSD clang version 13.0.0
Target: amd64-unknown-openbsd7.0
Thread model: posix
InstalledDir: /usr/bin
$ uname -a
OpenBSD duck.recoil.org 7.0 GENERIC.MP#276 amd64

@xavierleroy (Author):

I should have written "the OpenBSD virtual machine we have in our CI", which is quite old (OpenBSD 6.7...). We can try a fresh install of OpenBSD 7.0 some time in the future.

@xavierleroy (Author):

Updating the table of results now that ARM64 native-code generation is back:

  • ARM64/Linux: works fine in native-code and in bytecode
  • ARM64/macOS: the "unwind" test fails, probably due to a lack of CFI annotations in arm64.S, but otherwise it works fine in native code and in bytecode.

Commit messages included in this PR:

  • …ctures

    (This is a temporary workaround until ocaml#10971 is resolved.)

    On 32-bit architectures, the minor heap reservation has to fit a 32-bit address space.

    On some 64-bit systems (e.g. OpenBSD, OmniOS), there is no over-commit, so the reservation has to fit the RAM + swap space.

    Take Minor_heap_max to be 1 Mwords (4 times Minor_heap_def) and limit domains to 16 on 32-bit architectures, instead of 128 on 64-bit architectures.

  • Because of output-complete-obj, PACKLD_FLAGS must be correctly set even for a bytecode-only build, where `$arch = none`. This affects only the 3 Power ports.

  • …etter

    - Don't do "make alldepend" if the native-code compiler is unsupported
    - No need to test flambda in bytecode-only builds

  • We're getting timeouts on some CI machines, and on 32-bit architectures only 16 domains are available.

    - Reduce the number of domains used by join.ml and domain_parallel_spawn_burn.ml.
    - Reduce data size and duration for domain_parallel_spawn_burn.ml and domain_serial_spawn_burn.ml.

  • The C side was not flushing explicitly, causing different interleavings with glibc and with musl.

  • So that they can run longer and use more cores on selected CI machines.
@xavierleroy xavierleroy merged commit 2faed92 into ocaml:trunk Feb 21, 2022
@xavierleroy xavierleroy deleted the make-it-testable branch February 21, 2022 18:15
@gasche (Member) commented Feb 28, 2022

Hi @ctk21 and @xavierleroy; I'm looking at domain_parallel_spawn_burn.ml and I have two naive questions:

  1. There seems to be an obvious race on the running variable, which is a non-atomic reference written in one domain and read in a loop in three others. Is there a reason for not making this variable atomic?

  2. On my machine, the test is noticeably faster in bytecode (4s real time, 11s user time) than in native code (18s real time, 50s user time). Is it just me? Do we understand why that is the case?

@ctk21 (Contributor) commented Mar 1, 2022

  1. Not that I can think of. Do we even need the running variable outside a DEBUG build for the asserts?

  2. The test has highly variable runtime across runs on the same machine, across operating systems, across machines of differing sizes, and between bytecode and native code. It has always bugged me, and it is always on the "must try to understand this better" list.

@gasche (Member) commented Mar 1, 2022

Do we even need the running variable outside a DEBUG build for the asserts?

(I think there is a misunderstanding: you are talking about the s->running variable of interruptors, while I'm talking about a running : bool ref reference in the test code. Sorry for being unclear in the first place.)

The code is currently as follows:

let test_parallel_spawn () =
  for i = 1 to 20 do
    Array.init num_domains (fun _ -> Domain.spawn (fun () -> burn [0]))
    |> Array.iter join
  done

let () =
  let running = ref true in
  let rec run_until_stop fn () =
    while !running do
      fn ();
    done
  in
  let domain_minor_gc = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.minor ())) in
  let domain_major_gc = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.major ())) in
  test_parallel_spawn ();
  running := false;
  join domain_minor_gc;
  join domain_major_gc

If I understand it, the idea is to run Gc.major and Gc.minor in an infinite loop, while other domains are doing a fixed amount of work, and then we stop the Gc.major and Gc.minor domains when the work is done. So we do need running, but currently it is used in a very racy way.

@ctk21 (Contributor) commented Mar 1, 2022

Oh, sorry about the context mix-up; running in the test can become Atomic.get/Atomic.set.
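That suggestion can be sketched as follows. This is a minimal, self-contained version, not the actual test file: `burn` is stubbed out as a stand-in for the test's busy-work function, and only one GC-churning domain is spawned instead of the test's two (requires OCaml >= 5.0 for Domain and Atomic).

```ocaml
(* Stand-in for the test's busy-work function. *)
let burn _ = ()

(* The stop flag as an Atomic.t instead of a plain bool ref,
   so the write in the main domain is safely published to readers. *)
let running = Atomic.make true

let run_until_stop fn () =
  while Atomic.get running do fn () done

let () =
  (* Churn the minor GC in a background domain until told to stop. *)
  let d = Domain.spawn (run_until_stop (fun () -> burn [8]; Gc.minor ())) in
  (* ... the fixed amount of work done by the other domains goes here ... *)
  Atomic.set running false;   (* signal the background domain to stop *)
  Domain.join d
```

The structure mirrors the test's run_until_stop loop; only the flag's type and its accesses change.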
