Ring-buffer based runtime tracing (eventring) #10964
Conversation
I am not sure using the geometric mean is a good basis for the performance-impact argument. Looking at the chart, some test cases became faster with eventring added; how come? And redux has a 10% impact, is that right?
Love to see more work in the tracing space! I wonder if this has any relation to the tracing library we released recently: https://github.com/janestreet/tracing Or to our work on leveraging Intel's Processor Trace: https://blog.janestreet.com/magic-trace/ Or to the work we've done adding probes to OCaml (not sure if this is the best link for that): https://github.com/gretay-js/probes
I should have been more specific above the table. The numbers are the average of the normalised times against trunk (i.e. normalised before averaging rather than after). The geometric mean is the right thing to use in this case: http://www.cse.unsw.edu.au/~cs9242/11/papers/Fleming_Wallace_86.pdf

Yes, some of the benchmarks show a speedup. This PR only adds instructions, so you would think it would always make things slower, but there are a couple of sources of noise. The first is the machine itself: while steps have been taken to reduce the benchmarking machine's variability (https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking), there is still some. I have a batch run of a number of iterations of the macro benchmarking suite going to quantify the run-to-run variability. The second source is that small changes can have knock-on effects: e.g. adding a probe could change the alignment of a later loop and lead to a significant slowdown. http://eecs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf has some depressing examples, and @ctk21 has good notes on his investigations at https://github.com/ocaml-bench/notes/blob/master/apr19.md#what-we-learnt

After this large batch run finishes I'll also kick off a run that only uses instruction counts. There are also benchmarks on parallel programs that need to be done. I'm waiting on these in case we need to revisit the design or implementation significantly.
Thanks for taking a look. The aim of this PR is only to provide a facility for consuming events generated by the runtime's subsystems (mostly the GC, but also some lifecycle events). This is intentional; I'm aware there's already some good observability tooling. Tracing, probes and memprof/memtrace being able to make use of eventring would be really powerful, and I've already had a couple of discussions with JS folk on what some of that might look like. It would be good to identify whether there's anything in this design that precludes this from happening. Interfaces for plugging in a tracing or probe library would be follow-up work requiring some community coordination. Alternatively, we might find we can't cover enough common use cases and eventring only functions as a way of transporting runtime events somewhere else. That's still better than the situation at the moment, but it would feel like a missed opportunity.
Yeah, maybe there's something to share here. I think @gretay-js is probably the right person to connect to here.
I've updated the PR description with a sandmark run for retired instructions. With the exception of
Results for the run-to-run variation in timings are just waiting on me to figure out how to graph them.
I found the issue. I was considering moving this event to the instrumented runtime, but I'm actually not convinced it's that useful even there, so I will remove it and kick off another benchmarking run.

I've rerun and updated the benchmarking results after removing that event.
I looked at the benchmarks and I'm a bit confused (as often). The "retired instructions" measures look like we expect: there is at most a 0.4% increase in instruction count when eventring is enabled, nothing to worry about, and when it is disabled it's mostly lost in the noise except for a couple of outliers. (What's going on with these outliers, I wonder, but given that the difference is so small it's okay anyway.) The "time" benchmarks exhibit wide noise, possibly due to code-placement issues: there are 10% changes in performance, either up or down, with no clear pattern or obvious relation to the change itself. This is not unexpected, and we shouldn't be surprised anymore. What I do find surprising is that there are a handful of outliers (one microbenchmark and two to four macrobenchmarks; really most cases where something "interesting" seems to happen) where the eventring-disabled version is noticeably slower than the eventring-enabled version. This is weird! In particular, both versions use the same code layout (right?), so the strange difference comes from something else. (And there are essentially no programs where the difference goes in the other direction, where the eventring-enabled program is noticeably slower than the eventring-disabled program.) Am I reading the graphs wrong? (I expect that a score >1 means "slower than the baseline" and <1 means "faster than the baseline".) Could you have switched the two labels by mistake?
Thanks for taking a look @gasche. Your comment did get me thinking and I realised that the
So instead I kicked off another benchmarking run, this time with the same runtime but only using environment variables to turn eventring on. There are still benchmarks where eventring enabled takes less time. I picked knucleotide3 and ran the binary (taskset on an isolated core); averaged over 6 runs of each, there's still a ~3.7% performance advantage to having eventring enabled. I wondered whether just having an extra environment variable the same size as
@shakthimaan is currently running benchmarks of this PR on the sandmark benchmarking infrastructure (thanks Shakthi!). The sequential macro benchmarks are done and look remarkably similar to the ones we ran on our own benchmarking setup: there are indeed still benchmarks where eventring enabled performs better than disabled. Parallel benchmarks are still running; I'll update this when they're done.
Personally I'm not so worried about the benchmarks: there is nothing in them that suggests that the change adds noticeable overhead (we are discussing strange noise effects, which suggests that there is nothing else to discuss). And if we were to find a noticeable performance impact later, we have the easy option of moving more probes from the normal runtime to the instrumented runtime. I would consider this point settled. But if we want to make a decision on the PR, we need: (1) a decision that we want this feature, and (2) actual reviews.
So far we haven't had those. Point (1) sounds like an easy sell ("it comes from the same group, we said yes last time, they say it's even better this time, so what's not to like?"), and we have a caml-devel meeting shortly where you can always poll to force a decision. Point (2) is the bottleneck, as always.
abbysmal left a comment
I made a first pass of review; I'm overall pretty enthusiastic about this PR and I hope we can get more people to review it.
It follows up pretty nicely on eventlog's initial target and brings lower overhead to the tracing infrastructure within the runtime, thus lowering the barrier to using these tools (since we no longer have to hide most of the counters and events behind the instrumented runtime).
This will prove especially useful as more users write parallel programs; eventlog proved to be a very useful tool for the Multicore team when investigating interactions between domains during GC events.
Now I will add a few extra comments:
I did not review the win32 code, and I will need to do a second pass of review on the consumer code, as well as on the logic for writing to the eventring.
I remember some early discussions and could make assumptions about parts of the design and how the code interacts with it, but I think adding a few comments and defines could make the review experience a bit nicer.
I think having a nice comment describing the overall topology of the eventring structure could be useful (sort of in the same vein as the one that can be found for the domain lifecycle in domain.c).
I had a few other nitpicks and remarks, all of which can be found in individual comments on the PR.
runtime/caml/eventring.h (outdated):

    };

    #define EVENTRING_NUM_ALLOC_BUCKETS 20
    #define EVENTRING_MAX_MSG_LENGTH (1 << 10)
We do not currently have high-level (as far as C is concerned) functions to actually write user events.
We have a few asserts and checks in place to avoid reaching this condition, so maybe documenting this specific restriction (which looks totally fine) is not required.
It would in any case need to be explicitly stated in a dedicated API, as the logic for writing to the ring buffer is currently hidden behind CAML_INTERNALS and is not a primitive.
runtime/eventring_consumer.c (outdated):

    CAMLlocal2(tmp_callback, callbacks_root);
    CAMLlocalN(params, 4);

    callbacks_root = *((value *)callback_data);
I'd like to take a look at whether there's any problem with the various callback_data bits in this (and related) functions and whether there's any potential for them to be harmfully moved.
runtime/eventring_consumer.c (outdated):

    } while (cursor->current_positions[domain_num] < ring_tail &&
             (max_events == 0 || consumed < max_events) && !early_exit);

    cursor->next_read_domain = (domain_num + 1) % cursor->metadata->max_domains;
Modulus is expensive. Assuming you can't guarantee cursor->metadata->max_domains is a power of two, it might be worth replacing this with domain_num + 1 == cursor->metadata->max_domains ? 0 : domain_num + 1?
patricoferris left a comment

This is a really cool feature, @sadiqj!
I've been looking into using this with ocaml-multicore/eio#138, where I'm connecting to an existing eventring rather than creating them programmatically, hence testing the Some (dir, pid) logic.
Thanks @patricoferris. It looks like that code was broken by a refactoring. I really need to add a test for reading an external process's events, but I may need some help from an ocamltest expert to figure out how we'd do it.
The latest changes move eventring to
The only quirk that I still haven't ironed out (because I'm not entirely sure it needs to be changed) is that now the
Thanks to @dra27 for all kinds of help in getting this moved.
(rebased to trunk just now) I want to check where we are with the PR at the moment, as it would be great to get things merged and follow up with good documentation and tooling in time for 5.0. There's also some follow-up work on the set of events we currently have; there may be some we want to remove (there are some no longer present in 5.0 that are there from 4.x) and others to add. I've purposefully left that out to minimise the diff for this PR. @gasche, are you still planning to take a look at the ring buffer implementation itself?
ctk21 left a comment
I like this: a binary ring buffer is much better than the old string-based file mechanism for dumping events. I think the design of the ring buffer with the wrap-around works with a concurrent reader.
On the testsuite side, it looks like there is good coverage. One area I don't think is covered in the tests is domain slot reuse - that's traditionally been a pain point for anything involving resource reuse. Altering the parallel test to go on to serially spawn/join 256 domains might easily get some coverage there.
I had a niggle around the Domain.id in the consuming API; I think we should consider being a bit more defensive with recursive polls on reading cursors; and there are minor nits here and there.
runtime/caml/eventring.h:

    #ifdef CAML_INTERNALS

    struct runtime_events_buffer_header {

Would a little picture here be useful, denoting how ring_head and ring_tail are operated on by the producer and used by the consumer? (Or even just comments for ring_head/ring_tail describing what the producer and consumer see in these variables.)
Looks like all the things I spotted in review (which focused mainly on the C code side) have been addressed.
As I said during the last development meeting, I have looked at the part of the PR that is not the C code itself. (Maybe not with the level of scrutiny of our usual full reviews, but I'm afraid we have to live with that in the absence of other volunteers.) Now that @ctk21 has carefully reviewed the runtime_event_{consumer,producer}, we may have, in combination, enough assurance for an approval. @ctk21, do you agree with this assessment, or are there other things you would like to look at?
@gasche I agree with your assessment. I don't have further things that need addressing.
Let's go for a formal approval then.
Two remarks:
- I'm still interested in that documentation we have been talking about :-)
- The code I've looked at looked fresh/young; I expect that we will find some other minor issues that we haven't caught during review. I think this is okay for a non-critical piece of code (it does nothing unless enabled, so it does not endanger the correctness of the compiler); it's part of the deal of Multicore not yet being a battle-tested runtime implementation.
Thanks for squashing the commits. The PR is a great addition for the observability of OCaml programs. Looking forward to using it. As a next step, I hope you will contribute the documentation necessary for using eventring.
Great. Thanks everyone for their time in providing feedback and reviewing. Off to documentation and supporting tooling!
This PR breaks the build on OmniOS: OmniOS is probably wrong w.r.t. the types of
Your call.
Having a look at the headers in Illumos, it seems the minimum we need is
Something else is wrong on 32-bit ARM and 32-bit PowerPC: these platforms implement 64-bit atomic operations using locks and library functions. For Clang the functions are linked automatically, but for GCC we need to link
Do we really need 64-bit indices in the ring buffer on a 32-bit platform? Could you please use
Note that a 64-bit atomic int implemented with locks occupies more than 8 bytes, so you're not getting the same memory layout on 32-bit ARM as on, say, x86-64, at any rate.
We don't need to keep 64-bit indices on 32-bit platforms. I can't imagine anyone would really want to monitor a 32-bit process from 64-bit or vice versa (or across architectures). I'll test a combined fix on precheck to see if there's anything else that has fallen out.
Thinking about it... if
Oh yes, this is a good point. I'll get
I'm on it; it's undergoing CI.
I pushed the
Thank you, I'll get the OmniOS fix in soon.

Summary
Eventring is a runtime tracing system designed for continuous monitoring of OCaml applications. It is intended to replace the existing file-logging based eventlog in the runtime. It has very low overhead: negligible when not in use and ~0.5% when enabled. It uses per-domain memory-mapped ring buffers that enable consumption of tracing events both in and out of process. At present only runtime events (the same set as eventlog) are added to the buffers, but the design is intended to eventually support user events as well and to integrate with statmemprof.
Design
Context
The OCaml runtime currently provides an event logging system, "eventlog", which can emit a range of runtime and GC events in the Common Trace Format (CTF) to a log file. It is only enabled in the instrumented runtime, and the file-based logging makes using eventlog for long-running applications very difficult.
Goal
The goal of this PR is to replace eventlog's implementation with a minimal implementation of a more flexible transport for runtime and GC events. It intends to enable consumption of these events from OCaml code both inside and outside the runtime being monitored.
This PR only intends to deal with runtime and GC events, not those originating from user code. There is follow-up work on porting eventlog-tools, on documentation, and on making the Windows support more robust (and potentially more native). We are holding off on those in case design changes are required at this stage, and also because the PR is already fairly large.
Design
Probes
This PR retains the existing probes on trunk but moves most of them into the default runtime. A few of the probes on hot paths (such as free-list allocation) are left in the instrumented runtime. The dividing line is essentially whether the work performed during a particular event is very high compared to the (low) cost of emitting an event.
Transport
The on-disk log file in eventlog is replaced with per-domain memory-mapped ring buffers.
The ring buffer is structured as a flight recorder, overwriting old data when there is insufficient space to write new events. This enables users to potentially only read the ring when some anomalous event occurs or a performance target is missed. No coordination is needed with the consumers that read the events: they detect races with the producer and discard events when that happens.
The producer code is contained here. By default an .eventring file is created in the current directory (overridable by setting OCAML_EVENTRING_DIR). This file contains a ring buffer for each possible domain (Max_domains). It is laid out in a structure that enables sparsity: the on-disk (or in-memory) footprint is proportional to the maximum number of concurrent domains the process has ever run.
On-disk structure
API
The consumer APIs are cursor-based and polled. Users create a cursor with a path and pid and, if it is successfully created, can then poll the cursor with a set of callbacks to be called for each event type they are interested in.
The OCaml API is available in Eventring.mli, the C API is available in eventring.h.
Consumer code is contained in eventring_consumer.c. This contains the implementations of both the OCaml API and the C API (the former being a thin layer over the latter).
Examples
Emitting minor and major slice GC pause times to stdout
Here's a simple consumer that prints the times of minor and major slices. It omits any support for multiple domains, error handling or lost events:
When compiled and run with this PR's branch you will get an output like:
Code such as the above could be spawned in a separate domain to send metrics data to an external service (such as Graphite or Prometheus) over a socket.
Chrome tracing
This is a simplified example of how Eventring can be used to log runtime data in a format that can be readily ingested by Chrome's trace viewer. Running the above project with OCAML_EVENTRING_START=1 _build/default/src/traceevents_lib.exe and then loading the resultant JSON file into Chrome's tracing viewer will give something like the following:

Tests
The tests/eventring directory contains some other examples of using the different APIs.
Benchmarks
Timing
For all macro benchmarks on sandmark that run on trunk, the performance impact when normalised to trunk (f22a84b) averaged over five runs is:
A value <1 indicates eventring's branch was marginally faster than trunk.
The graph below shows the normalised (to trunk) runtime for long-running (>4s) macro benchmarks:
All macro benchmarks:
You can find the raw benchmarking data here: https://github.com/sadiqj/sandmark_benchmark_data/tree/main/eventring_pr
Instructions
Below is a run of sequential sandmark using perf stat to count the number of retired instructions:
Here eventring resulted in a 0.01% increase in instructions when not enabled and 0.09% when enabled.
For each benchmark below there are two rows: this PR without eventring enabled, and this PR with it enabled. "Instructions increase" is the percentage increase in instructions for this PR over trunk. E.g. for the first row, LU_decomposition, this PR leads to an increase of 7 thousandths of a percent (2.25M extra instructions on top of the 32.28B that trunk executes).