
Ring-buffer based runtime tracing (eventring)#10964

Merged
kayceesrk merged 1 commit into ocaml:trunk from sadiqj:eventring-pr
May 24, 2022

Conversation

@sadiqj
Contributor

@sadiqj sadiqj commented Jan 28, 2022

Summary

Eventring is a runtime tracing system designed for continuous monitoring of OCaml applications. It intends to replace the existing file-logging based eventlog in the runtime. It is very low overhead - negligible when not in use and ~0.5% when enabled. It utilises per-domain memory-mapped ring buffers that enable consumption of tracing events both in and out of process. At present only runtime events (the same set as eventlog) are added to the buffers but the design is intended to also eventually support user events as well as integrate with statmemprof.

Design

Context

The OCaml runtime currently provides an event logging system "eventlog" which can emit a range of runtime and GC events in the Common Trace Format (CTF) to a log file. It is currently only enabled in the instrumented runtime. The file-based logging also makes using eventlog for long running applications very difficult.

Goal

The goal of this PR is to replace eventlog's implementation with a minimal implementation of a more flexible transport for runtime and GC events. It intends to enable consumption of these events from OCaml code both inside and outside the runtime being monitored.

This PR only intends to deal with runtime and GC events, not those originating from user code. There is follow-up work on porting eventlog-tools, documentation and making the Windows support more robust (and potentially more native). We are holding off on those in case design changes are required at this stage, and also because the PR is already fairly large.

Design

Probes

This PR retains the existing probes on trunk but moves most into the default runtime. A few of the probes on hot paths (such as free-list allocation) are left in the instrumented runtime. The dividing line is essentially where the work performed during a particular event is very high compared to the (low) cost of emitting an event.

Transport

The on-disk log file in eventlog is replaced with per-domain memory-mapped ring buffers.

The ring buffer is structured as a flight recorder, overwriting old data when there is insufficient space to write new events. This enables users to read the ring only when some anomalous event occurs or a performance target is missed. No coordination is needed with consumers reading the events - they detect races with the producer and discard events when that happens.

The producer code is contained here. By default an .eventring file is created in the current directory (overridable by setting OCAML_EVENTRING_DIR). This file contains a ring buffer for each possible domain (Max_domains). It is laid out in a structure that enables sparsity: the on-disk (or in-memory) footprint is proportional to the maximum number of concurrent domains the process has ever run.

On-disk structure

----------------------------------------------------------------
| File header (version, offsets, etc..)                        |
----------------------------------------------------------------
| Ring 0..Max_domains metadata                                 |
| (head and tail indexes, one per cache line)                  |
----------------------------------------------------------------
| Ring 0..Max_domains data                                     |
| (actual ring data, default 2^16 words = 512k bytes)          |
----------------------------------------------------------------

API

The consumer APIs are cursor-based and polled. Users create a cursor with a path and pid; if it is successfully created, they can then poll the cursor with a set of callbacks to call on each event type they are interested in.

The OCaml API is available in Eventring.mli, the C API is available in eventring.h.

Consumer code is contained in eventring_consumer.c. This contains both the implementations for the OCaml API and C API (the former being a thin layer over the latter).

Examples

Emitting minor and major slice GC pause times to stdout

Here's a simple consumer that prints the times of minor and major slices. It omits any support for multiple domains, error handling or lost events:

let minor_started = ref Int64.zero
let major_started = ref Int64.zero

let runtime_begin _ ts phase =
    match phase with
    | Eventring.EV_MAJOR_SLICE ->
        major_started := Eventring.Timestamp.to_int64 ts
    | Eventring.EV_MINOR ->
        minor_started := Eventring.Timestamp.to_int64 ts
    | _ -> ()

let ts_to_duration_ns ts start_ns =
  Int64.sub (Eventring.Timestamp.to_int64 ts) start_ns

let runtime_end _ ts phase =
    match phase with
    | Eventring.EV_MAJOR_SLICE ->
        Printf.printf "Major slice: %Ldns\n"
          (ts_to_duration_ns ts !major_started)
    | Eventring.EV_MINOR ->
        Printf.printf "Minor: %Ldns\n"
          (ts_to_duration_ns ts !minor_started)
    | _ -> ()

let () =
    let list_ref = ref [] in
    let cursor = Eventring.create_cursor None in
    let callbacks = Eventring.Callbacks.create ~runtime_begin ~runtime_end ()
    in
    while true do
        for _ = 1 to 100 do
            list_ref := [];
            for _ = 1 to 10 do
                list_ref := (Sys.opaque_identity (ref 42)) :: !list_ref
            done;
            Gc.full_major ()
        done;
        ignore (Eventring.read_poll cursor callbacks None);
        Unix.sleep 1
    done

When compiled and run with this PR's branch you will get an output like:

Minor: 1496ns
Minor: 1341ns
Minor: 1308ns
Major slice: 17605ns
Minor: 1435ns
Minor: 2986ns
Minor: 2688ns
Major slice: 26577ns
Minor: 2263ns
Minor: 1098ns
Minor: 973ns
Major slice: 17719ns
...

Code such as the above could be spawned in a separate domain and send metrics data to an external service (such as graphite or prometheus) over a socket.

Chrome tracing

This is a simplified example of how Eventring can be used to log runtime data in a format that can be easily ingested by Chrome's trace viewer. Running the above project with OCAML_EVENTRING_START=1 _build/default/src/traceevents_lib.exe and then loading the resulting json file into Chrome's tracing viewer will give something like the following:

[Image: example of Chrome tracing output]

Tests

The tests/eventring directory contains some other examples of using the different APIs.

Benchmarks

Timing

For all macro benchmarks on sandmark that run on trunk, the performance impact when normalised to trunk (f22a84b) averaged over five runs is:

Variant Geomean
Eventring disabled 0.998507
Eventring enabled 0.997360

<1 indicates eventring's branches were marginally faster than trunk.

The graph below shows the normalised (to trunk) runtime for long-running (>4s) macro benchmarks:

image

All macro benchmarks:

image

You can find the raw benchmarking data here: https://github.com/sadiqj/sandmark_benchmark_data/tree/main/eventring_pr

Instructions

Below is a run of sequential sandmark using perf stat to count the number of retired instructions:

image

Variant Geomean
Eventring disabled 1.000133
Eventring enabled 1.000913

Here eventring resulted in a 0.01% increase in instructions when not enabled and 0.09% when enabled.

For each benchmark below there are two rows: this PR without eventring enabled, and this PR with it enabled. The instructions increase is the percentage increase in instructions for this PR over trunk, e.g. for the first row, LU_decomposition, this PR leads to an increase of 7 thousandths of a percent (2.25M extra instructions on top of the 32.28B that trunk executes).

name instructions variant instructions increase
LU_decomposition 1024 32.284B disabled 0.007%
LU_decomposition 1024 32.338B enabled 0.175%
alt-ergo benchmarkalt-ergfill.why 10.726B disabled 0.004%
alt-ergo benchmarkalt-ergfill.why 10.729B enabled 0.035%
alt-ergo benchmarkalt-ergyyll.why 80.394B disabled 0.005%
alt-ergo benchmarkalt-ergyyll.why 80.449B enabled 0.074%
bdd 26 23.379B disabled 0.005%
bdd 26 23.410B enabled 0.139%
binarytrees5 21 64.412B disabled 0.005%
binarytrees5 21 64.438B enabled 0.045%
chameneos_redux_lwt 600000 8.287B disabled 0.007%
chameneos_redux_lwt 600000 8.302B enabled 0.196%
crout_decomposition 10.837B disabled 0.004%
crout_decomposition 10.851B enabled 0.139%
cubicle benchmarkcubiclgerman_pfs.cub 1011.065B disabled 0.009%
cubicle benchmarkcubiclgerman_pfs.cub 1011.628B enabled 0.064%
cubicle benchmarkcubiclszymanski_at.cub 2722.539B disabled 0.006%
cubicle benchmarkcubiclszymanski_at.cub 2725.030B enabled 0.098%
durand_kerner_aberth 1.054B disabled 0.008%
durand_kerner_aberth 1.057B enabled 0.268%
evolutionary_algorithm 10000 10000 253.398B disabled 0.002%
evolutionary_algorithm 10000 10000 253.406B enabled 0.005%
fannkuchredux 12 587.338B disabled -0.001%
fannkuchredux 12 587.338B enabled -0.000%
fannkuchredux2 12 590.380B disabled 0.001%
fannkuchredux2 12 590.380B enabled 0.001%
fasta3 25_000_000 21.914B disabled 0.002%
fasta3 25_000_000 21.926B enabled 0.056%
fasta6 25_000_000 17.562B disabled -0.000%
fasta6 25_000_000 17.562B enabled -0.001%
fft 19.972B disabled 0.005%
fft 19.982B enabled 0.059%
floyd_warshall 512 6.032B disabled 0.002%
floyd_warshall 512 6.031B enabled -0.008%
game_of_life 256 64.563B disabled 0.000%
game_of_life 256 64.563B enabled 0.001%
grammatrix 447.579B disabled -0.000%
grammatrix 447.581B enabled 0.000%
kb 16.056B disabled 0.003%
kb 16.071B enabled 0.097%
kb_no_exc 14.357B disabled 0.270%
kb_no_exc 14.364B enabled 0.318%
knucleotide 290.074B disabled 0.000%
knucleotide 290.075B enabled 0.001%
knucleotide3 304.365B disabled 0.000%
knucleotide3 304.365B enabled 0.000%
levinson_durbin 24.481B disabled 0.008%
levinson_durbin 24.516B enabled 0.150%
main 64.580B disabled 0.002%
main 64.623B enabled 0.069%
mandelbrot6 16_000 185.860B disabled -0.000%
mandelbrot6 16_000 185.860B enabled 0.000%
matrix_multiplication 1024 32.692B disabled -0.000%
matrix_multiplication 1024 32.692B enabled 0.001%
menhir -v --table benchmarkmenhisysver.mly 423.416B disabled 0.007%
menhir -v --table benchmarkmenhisysver.mly 423.641B enabled 0.060%
menhir -v -t benchmarkmenhikeywords.mly benchm... 37.790B disabled 0.008%
menhir -v -t benchmarkmenhikeywords.mly benchm... 37.820B enabled 0.087%
menhir benchmarkmenhiocaml.mly --list-errors -... 1154.743B disabled 0.000%
menhir benchmarkmenhiocaml.mly --list-errors -... 1154.808B enabled 0.006%
minilight-ocaml benchmarkminilighroomfront.ml.txt 80.788B disabled 0.015%
minilight-ocaml benchmarkminilighroomfront.ml.txt 81.023B enabled 0.307%
naive_multilayer 33.856B disabled 0.117%
naive_multilayer 33.911B enabled 0.281%
nbody 50_000_000 40.758B disabled -0.000%
nbody 50_000_000 40.759B enabled 0.001%
pidigits5 10_000 11.882B disabled 0.044%
pidigits5 10_000 11.890B enabled 0.116%
qr_decomposition 19.412B disabled 0.007%
qr_decomposition 19.455B enabled 0.228%
quicksort 4000000 17.125B disabled -0.000%
quicksort 4000000 17.125B enabled 0.001%
regexredux2 93.000B disabled -0.002%
regexredux2 93.003B enabled 0.002%
revcomp2 20.703B disabled 0.002%
revcomp2 20.705B enabled 0.011%
sequence_cps 10000 14.535B disabled 0.011%
sequence_cps 10000 14.576B enabled 0.294%
setrip -enc -rseed 1067894368 2.454B disabled 0.002%
setrip -enc -rseed 1067894368 2.456B enabled 0.075%
spectralnorm2 5_500 59.332B disabled 0.003%
spectralnorm2 5_500 59.393B enabled 0.106%
test_decompress 64 524_288 28.016B disabled -0.013%
test_decompress 64 524_288 28.028B enabled 0.029%
test_lwt 200 136.513B disabled 0.001%
test_lwt 200 136.537B enabled 0.019%
thread_ring_lwt_mvar 20_000 20.404B disabled 0.007%
thread_ring_lwt_mvar 20_000 20.434B enabled 0.154%
thread_ring_lwt_stream 20_000 63.936B disabled 0.007%
thread_ring_lwt_stream 20_000 64.001B enabled 0.108%
ydump -c sample.json 4.046B disabled -0.011%
ydump -c sample.json 4.047B enabled 0.017%
zarith_pi 5000 3.006B disabled 0.063%
zarith_pi 5000 3.017B enabled 0.413%

@ygrek
Contributor

ygrek commented Jan 28, 2022

I am not sure using geomean is a good ground for performance impact argument. Looking at the chart some test cases became faster with eventring added, how come? And redux has 10% impact, is that right?

@yminsky

yminsky commented Jan 29, 2022

Love to see more work in the tracing space!

I wonder if this has any relation to the tracing library we released recently: https://github.com/janestreet/tracing

Or to our work on leveraging Intel's processor trace: https://blog.janestreet.com/magic-trace/

Or to the work we've done adding probes to OCaml. Not sure if this is the best link for that: https://github.com/gretay-js/probes

@sadiqj
Contributor Author

sadiqj commented Jan 29, 2022

I am not sure using geomean is a good ground for performance impact argument. Looking at the chart some test cases became faster with eventring added, how come? And redux has 10% impact, is that right?

I should have been more specific above the table. The numbers are the average of the normalised times against trunk (i.e. normalised before averaging rather than after). The geometric mean is the right thing to use in this case: http://www.cse.unsw.edu.au/~cs9242/11/papers/Fleming_Wallace_86.pdf

Yes, some of the benchmarks show a speedup. This PR only adds instructions, so you would think it would always make things slower, but there are a couple of sources of noise.

The first is the machine itself. While steps have been taken to reduce variability on the benchmarking machine (https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking) there is still some. I have a batch run of a number of iterations of the macro benchmarking suite going to quantify what the run-to-run variability is.

The second source is that small changes can have knock-on effects, e.g. adding a probe could change the alignment of a later loop and lead to a significant slowdown. http://eecs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf has some depressing examples. Also @ctk21 has good notes on his investigations: https://github.com/ocaml-bench/notes/blob/master/apr19.md#what-we-learnt .

After this large batch run finishes I'll also kick off a run that only uses instruction counts.

There are also benchmarks on parallel programs that need to be done. I'm waiting on those in case we need to revisit the design or implementation significantly.

@sadiqj
Contributor Author

sadiqj commented Jan 29, 2022

Love to see more work in the tracing space!

Thanks for taking a look.

The aim for this PR is to only provide a facility for consuming events generated by the runtime's subsystems (mostly the GC but also some lifecycle events). This is intentional, I'm aware there's already some good observability tooling. Tracing, probes and memprof/memtrace being able to make use of eventring would be really powerful and I've already had a couple of discussions with JS folk on what some of that might look like. Eio also has an eye on replacing their tracing implementation with something atop of eventring.

It would be good to identify if there's anything in this design that precludes this from happening. Interfaces for plugging in a tracing or a probe library would be follow-up work requiring some community coordination.

Alternatively, we might find we can't cover enough common use cases and eventring only functions as a way of transporting runtime events somewhere else. That's still better than the current situation, but it would feel like a missed opportunity.

@yminsky

yminsky commented Jan 29, 2022

Yeah, maybe there's something to share here. I think @gretay-js is probably the right person to connect to here.

@sadiqj
Contributor Author

sadiqj commented Jan 31, 2022

I've updated the PR description with a sandmark run for retired instructions.

With the exception of naive_multilayer, this PR adds less than a quarter of a percent to retired instructions when eventring is compiled in but not enabled.

naive_multilayer sees nearly a 1% increase in instructions retired when eventring is disabled and 22% when it is enabled. It's allocating a fair amount (~5.5GB/s and 2.9k minor gcs/s) but I think there might be something else going on. Will investigate.

Results for run-to-run variation in timings are just waiting on me to figure out how to graph them.

@sadiqj
Contributor Author

sadiqj commented Jan 31, 2022

I found the issue with naive_multilayer. It very frequently calls caml_make_vect which ultimately triggers a handle_gc_interrupt on each invocation and this causes an EV_INTERRUPT_GC to be emitted - even if there's nothing to do. Usually we only end up on this code path when there is something non-trivial to do (and so the relative cost of checking whether eventring is enabled and emitting the event is very low).

I was considering moving this event to the instrumented runtime but I'm actually not convinced it's that useful even there so will remove it and kick off another benchmarking run.

@sadiqj
Contributor Author

sadiqj commented Feb 2, 2022

I've rerun and updated the benchmarking results after removing EV_INTERRUPT_GC.

@gasche
Member

gasche commented Feb 2, 2022

I looked at the benchmarks and I'm a bit confused (as often).

The "retired instructions" measures look like we expect: at most there is 0.4% increase in instruction count when eventring is enabled, nothing to worry about, and when it is disabled it's mostly lost in the noise except for a couple outliers. (What's going on with these outliers? I wonder, but given that the difference is so small it's okay anyway.)

The "time" benchmarks exhibit wide noise, possibly due to code-placement issues: there are 10% changes in performance, either up or down, with no clear pattern or obvious relation to the change itself. This is not unexpected, and we shouldn't be surprised anymore. What I do find surprising is that there are a handful of outliers (one microbenchmark and two-four macrobenchmarks; really most cases where something "interesting" seems to happen) where the eventring-disabled version is noticeably slower than the eventring-enabled version. This is weird! In particular, both versions use the same code layout (right?), so the weird difference comes from something else. (And there are essentially no programs where the difference goes in the other direction, where eventring-enabled program is noticeably slower than the eventring-disabled program.)

Am I reading the graphs wrong? (I expect that a score >1 means "slower than the baseline" and <1 means "faster than the baseline".) Could you have switched the two labels by mistake?

@sadiqj
Contributor Author

sadiqj commented Feb 2, 2022

I looked at the benchmarks and I'm a bit confused (as often).

Thanks for taking a look @gasche . Your comment did get me thinking and I realised that the eventring-pr and eventring-enabled branches could result in different runtimes. To better fit sandmark's workflow the eventring-enabled branch is eventring-pr with a one-liner commit atop that always starts eventring, regardless of whether the environment variable is passed.

So instead I kicked off another benchmarking run, this time with the same runtime, using only environment variables to turn eventring on. There are still benchmarks where eventring enabled takes less time.

I picked knucleotide3 and ran the binary (pinned with taskset to an isolated core); averaged over 6 runs of each variant, there's still a ~3.7% performance advantage to having eventring enabled.

I wondered whether just having an extra environment variable the same size as OCAML_EVENTRING_START was enough to account for it, but no dice: it performs the same as when eventring is not enabled.

@sadiqj
Contributor Author

sadiqj commented Feb 8, 2022

@shakthimaan is currently running benchmarks of this PR on the sandmark benchmarking infrastructure (thanks Shakthi!).

The sequential macro benchmarks are done and look remarkably similar to the ones that we did on our benchmarking setup:

image

There are indeed still benchmarks where eventring enabled performs better than disabled.

Parallel benchmarks are still running. I'll update this when they're done.

@gasche
Member

gasche commented Feb 8, 2022

Personally I'm not so worried about the benchmarks: there is nothing in them that suggests the change adds noticeable overhead (we are discussing strange noise effects, which suggests there is nothing else to discuss). And if we were to find a noticeable performance impact later, we have the easy option of moving more probes from the normal runtime to the instrumented runtime. I would consider this point settled.

But if we want to make a decision on the PR, we need:

  1. decisions on whether we like the new design better than the previous Eventlog design
  2. and a review of the implementation

So far we haven't had those. Point (1) sounds like an easy sell ("it comes from the same group, we said yes last time, they say it's even better this time, so what's not to like?"), and we have a caml-devel meeting shortly where you can always poll to force a decision. Point (2) is the bottleneck, as always.

Contributor

@abbysmal abbysmal left a comment


I made a first pass of review, I'm overall pretty enthusiastic about this PR and I hope we can get more people to review it.

It follows up nicely on eventlog's initial target, and brings lower overhead to the tracing infrastructure within the runtime, lowering the barrier to using these tools (since we no longer have to hide most of the counters and events behind the instrumented runtime).
This will prove especially useful as more users will write parallel programs, and eventlog proved to be a very useful tool to the Multicore team to investigate interactions between domains during GC events.

Now I will add a few extra comments:

I did not review the win32 code, and I will need to do a second pass of review on the consumer code, as well as the logic for writing to the eventring.
I remember some early discussions and could make assumptions about parts of the design and how the code interacts with it, but I think adding a few comments and defines could make the review experience a bit nicer.
I think having a nice comment describing the overall topology of the eventring structure could be useful (sort of in the same vein as the one that can be found for domain lifecycle in domain.c).

I had a few other nitpicks and remarks, all of them can be found in individual comments on the PR.

};

#define EVENTRING_NUM_ALLOC_BUCKETS 20
#define EVENTRING_MAX_MSG_LENGTH (1 << 10)
Contributor


We do not currently have high-level (as far as C is concerned) functions to actually write user events.
We have a few asserts and checks in place to avoid reaching this condition, so maybe documenting this specific restriction (which looks totally fine) is not required.
It would anyway need to be explicitly stated in a dedicated API, as the logic for writing to the ring buffer is currently hidden behind CAML_INTERNALS and is not a primitive.

CAMLlocal2(tmp_callback, callbacks_root);
CAMLlocalN(params, 4);

callbacks_root = *((value *)callback_data);
Contributor


I'd like to take a look if there's any problem with the various callback_data bits in this (and related) functions and if there's any potential for it to be harmfully moved.

} while (cursor->current_positions[domain_num] < ring_tail &&
(max_events == 0 || consumed < max_events) && !early_exit);

cursor->next_read_domain = (domain_num + 1) % cursor->metadata->max_domains;


Modulus is expensive. Assuming you can't guarantee cursor->metadata->max_domains is a power of two, it might be worth replacing this with domain_num + 1 == cursor->metadata->max_domains ? 0 : domain_num + 1?

Contributor

@patricoferris patricoferris left a comment


This is a really cool feature @sadiqj !

I've been looking into using this with ocaml-multicore/eio#138, where I'm connecting to an existing eventring rather than creating them programmatically, hence testing the Some (dir, pid) logic.

@sadiqj
Contributor Author

sadiqj commented Mar 6, 2022

This is a really cool feature @sadiqj !

I've been looking into using this with ocaml-multicore/eio#138, where I'm connecting to an existing eventring rather than creating them programmatically, hence testing the Some (dir, pid) logic.

Thanks @patricoferris . It looks like that code was broken after a refactoring. I really need to add a test for reading an external process' events but I may need some help from an ocamltest expert to figure out how we'd do it.

@sadiqj
Contributor Author

sadiqj commented Apr 29, 2022

The latest changes move eventring to otherlibs and rename it to Runtime_events. The headers are split into those functions required by the runtime (for producing) and those required by consumers.

The only quirk that I still haven't ironed out (because I'm not entirely sure it needs to be changed) is that, now that the Runtime_events module is in otherlibs and not the stdlib, you can't start/stop it from OCaml without linking that library. You can, however, turn it on using the environment variable even if the program doesn't link Runtime_events.

Thanks to @dra27 for all kinds of help in getting this moved.

@sadiqj
Contributor Author

sadiqj commented May 4, 2022

(rebased to trunk just now)

I want to check where we are with the PR at the moment, as it would be great to get things merged and follow up with good documentation and tooling in time for 5.0. There's also some follow up work on the set of events we currently have, there may be some we want to remove (there are some no longer present in 5.0 that are there from 4.x) and add others. I've purposefully left that out to minimise the diff for this PR.

@gasche are you still planning to take a look at the ring buffer implementation itself?

Contributor

@ctk21 ctk21 left a comment


I like this - a binary ring buffer is much better than the old string based file mechanism for dumping events. I think the design of the ring buffer with the wrap around works with a concurrent reader.

On the testsuite side, it looks like there is good coverage. One area I don't think is covered in the tests is domain slot reuse - that's traditionally been a pain point for anything involving resource reuse. Altering the parallel test to go on to serially spawn/join 256 domains might easily get some coverage there.

I had a niggle around the Domain.id in the consuming API, I think we should consider being a bit more defensive with recursive polls on reading cursors, and I have minor nits here and there.


#ifdef CAML_INTERNALS

struct runtime_events_buffer_header {
Contributor


Would a little picture be useful here, denoting how ring_head and ring_tail are operated on by the producer and used by the consumer?
(Or even just comments for ring_head/ring_tail describing what the producer and consumer see in these variables.)
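
Something along the lines of the following comment sketch (layout and wording illustrative) is the kind of picture being requested:

```c
/* Producer/consumer view of a ring (indices increase monotonically;
 * slots are indexed modulo the ring size):
 *
 *        ring_head                       ring_tail
 *            v                               v
 *   ...[lost][oldest][ ... valid events ... ][next write]...
 *
 * Producer: writes at ring_tail, then publishes by advancing ring_tail.
 *           When full, it advances ring_head first, discarding the oldest.
 * Consumer: reads in [ring_head, ring_tail); after copying an event it
 *           re-reads ring_head - if it moved past the copy, discard it.
 */
```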

@ctk21
Contributor

ctk21 commented May 17, 2022

Looks like all the things I spotted in review (which focused mainly on the C code side) have been addressed.

@gasche
Member

gasche commented May 19, 2022

As I said during the last development meeting, I have looked at the part of the PR that is not the C code itself. (Maybe not with the level of scrutiny of our usual full reviews, but I'm afraid we have to live with that in the absence of other volunteers.) Now that @ctk21 reviewed carefully the runtime_event_{consumer,producer}, we may have, in combination, enough assurance for an approval. @ctk21, do you agree with this assessment, or are there other things you would like to look at?

@ctk21
Contributor

ctk21 commented May 24, 2022

@gasche I agree with your assessment. I don't have further things that need addressing.

Member

@gasche gasche left a comment


Let's go for a formal approval then.

Two remarks:

  • I'm still interested in that documentation we have been talking about :-)
  • The code I've looked at looked fresh/young, I expect that we will find some other minor issues that we haven't caught during review. I think this is okay for a non-critical piece of code (it does nothing unless enabled, so does not endanger correctness of the compiler), it's part of the deal of Multicore not being a battle-tested runtime implementation yet.

@kayceesrk
Contributor

kayceesrk commented May 24, 2022

Thanks for squashing the commits. The PR is a great addition for observability of OCaml programs. Looking forward to using it.

As a next step, I hope you will contribute the documentation necessary for using eventring.

@kayceesrk kayceesrk merged commit ab54447 into ocaml:trunk May 24, 2022
@sadiqj
Contributor Author

sadiqj commented May 24, 2022

Great. Thanks everyone for their time in providing feedback and reviewing.

Off to documentation and supporting tooling!

@xavierleroy
Contributor

This PR breaks the build on OmniOS:

gcc -m64 -c -O2 -fno-strict-aliasing -fwrapv -D_POSIX_PTHREAD_SEMANTICS -pthread -g -Wall -Werror -fno-common -fexcess-precision=standard -fno-tree-vrp -ffunction-sections  -I./runtime  -D_FILE_OFFSET_BITS=64  -DCAMLDLLIMPORT=  -o runtime/runtime_events.b.o runtime/runtime_events.c
runtime/runtime_events.c: In function 'runtime_events_teardown_raw':
runtime/runtime_events.c:138:12: error: passing argument 1 of 'munmap' from incompatible pointer type [-Werror=incompatible-pointer-types]
     munmap(current_metadata, current_ring_total_size);
            ^~~~~~~~~~~~~~~~
In file included from runtime/runtime_events.c:39:
/usr/include/sys/mman.h:245:19: note: expected 'caddr_t' {aka 'char *'} but argument is of type 'struct runtime_events_metadata_header *'
 extern int munmap(caddr_t, size_t);
                   ^~~~~~~
runtime/runtime_events.c: In function 'runtime_events_create_raw':
runtime/runtime_events.c:295:22: error: assignment to 'struct runtime_events_metadata_header *' from incompatible pointer type 'caddr_t' {aka 'char *'} [-Werror=incompatible-pointer-types]
     current_metadata = mmap(NULL, current_ring_total_size,
                      ^
cc1: all warnings being treated as errors

OmniOS is probably wrong w.r.t. the types of mmap and munmap. Possible fixes include

  • add casts: munmap((void *) current_metadata, ...) and current_metadata = (struct ... *) mmap(...)
  • figure out which defines to add to get POSIX-conformant header files from OmniOS
  • give up on OmniOS support entirely.

Your call.

@sadiqj
Copy link
Copy Markdown
Contributor Author

sadiqj commented Jun 2, 2022

Having a look at the headers in Illumos it seems the minimum we need is _POSIX_C_SOURCE=199309L to get the conformant header. I'll get a branch together and test.

@xavierleroy
Contributor

Something else is wrong on ARM 32 bits and PowerPC 32 bits:

/usr/bin/ld: /home/touraine/ci/builds/workspace/precheck/flambda/false/label/ocaml-arm-32/runtime/runtime_events.c:418: undefined reference to `__atomic_load_8'
/usr/bin/ld: /home/touraine/ci/builds/workspace/precheck/flambda/false/label/ocaml-arm-32/runtime/runtime_events.c:456: undefined reference to `__atomic_store_8'
/usr/bin/ld: /home/touraine/ci/builds/workspace/precheck/flambda/false/label/ocaml-arm-32/runtime/runtime_events.c:490: undefined reference to `__atomic_store_8'
/usr/bin/ld: /home/touraine/ci/builds/workspace/precheck/flambda/false/label/ocaml-arm-32/runtime/runtime_events.c:469: undefined reference to `__atomic_store_8'

These platforms implement 64-bit atomic operations using locks and library functions. For Clang the functions are linked automatically, but for GCC we need to link -latomic explicitly.

Do we really need 64-bit indices in the ring buffer on a 32-bit platform? Could you please use atomic_uintnat instead of atomic_uint_fast64_t ?

Note that a 64-bit atomic int implemented with locks occupies more than 8 bytes, so you're not getting the same memory layout on ARM-32 as on, say, x86-64, at any rate.

@sadiqj
Contributor Author

sadiqj commented Jun 3, 2022

We don't need to keep 64-bit indices on 32-bit platforms. I can't imagine anyone would really want to monitor 32-bit from 64-bit or vice-versa (or across archs).

I'll test a combined fix on precheck to see if there's anything else that's fallen out.

@xavierleroy
Contributor

xavierleroy commented Jun 4, 2022

Thinking about it... if ring_head and ring_tail are 32-bit integers, they will overflow eventually, and this doesn't seem to be handled in the current code... That's a strong reason to keep using 64-bit integers.

@sadiqj
Contributor Author

sadiqj commented Jun 4, 2022

Oh yes, this is a good point. I'll get -latomic included for 32-bit GCC.

@xavierleroy
Contributor

I'm on it, it's undergoing CI.

@xavierleroy
Contributor

I pushed the -latomic fix directly to trunk: 2d8a0cf . CI seems happy.

@sadiqj
Contributor Author

sadiqj commented Jun 4, 2022

Thank you, I'll get the omnios fix in soon.

