Ring-buffer based runtime tracing (eventring) #10964
Conversation
I am not sure using the geometric mean is a good basis for the performance-impact argument. Looking at the chart, some test cases became faster with eventring added; how come? And redux has a 10% impact, is that right?
Love to see more work in the tracing space! I wonder if this has any relation to the tracing library we released recently: https://github.com/janestreet/tracing Or to our work on leveraging Intel's Processor Trace: https://blog.janestreet.com/magic-trace/ Or to the work we've done adding probes to OCaml (not sure if this is the best link for that): https://github.com/gretay-js/probes
I should have been more specific above the table. The numbers are the average of the normalised times against trunk (i.e. normalised before averaging rather than after). The geometric mean is the right thing to use in this case: http://www.cse.unsw.edu.au/~cs9242/11/papers/Fleming_Wallace_86.pdf

Yes, some of the benchmarks show a speedup. This PR only adds instructions, so you would think it would always make things slower, but there are a couple of sources of noise. The first is the machine itself: while steps have been taken to reduce the benchmarking machine's variability (https://github.com/ocaml-bench/ocaml_bench_scripts/#notes-on-hardware-and-os-settings-for-linux-benchmarking), there is still some. I have a batch run of a number of iterations of the macro benchmarking suite going to quantify the run-to-run variability. The second source is that small changes can have knock-on effects: e.g. adding a probe could change the alignment of a later loop and lead to a significant slowdown. http://eecs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf has some depressing examples, and @ctk21 has good notes on his investigations at https://github.com/ocaml-bench/notes/blob/master/apr19.md#what-we-learnt

After this large batch run finishes I'll also kick off a run that only uses instruction counts. There are also benchmarks on parallel programs that need to be done. I'm waiting on these in case we need to revisit the design or implementation significantly.
Thanks for taking a look. The aim of this PR is only to provide a facility for consuming events generated by the runtime's subsystems (mostly the GC, but also some lifecycle events). This is intentional; I'm aware there's already some good observability tooling. Tracing, probes and memprof/memtrace being able to make use of eventring would be really powerful, and I've already had a couple of discussions with JS folk on what some of that might look like. It would be good to identify whether there's anything in this design that precludes this from happening. Interfaces for plugging in a tracing or probe library would be follow-up work requiring some community coordination. Alternatively, we might find we can't cover enough common use cases and eventring only functions as a way of transporting runtime events somewhere else. That's still better than the situation at the moment, but it would feel like a missed opportunity.
Yeah, maybe there's something to share here. I think @gretay-js is probably the right person to connect to here.
I've updated the PR description with a sandmark run for retired instructions. With the exception of
Results for the run-to-run variation in timings are just waiting on me to figure out how to graph them.
I found the issue. I was considering moving this event to the instrumented runtime, but I'm actually not convinced it's that useful even there, so I will remove it and kick off another benchmarking run.

I've rerun and updated the benchmarking results after removing that event.
I looked at the benchmarks and I'm a bit confused (as often). The "retired instructions" measures look like we expect: there is at most a 0.4% increase in instruction count when eventring is enabled, nothing to worry about, and when it is disabled it's mostly lost in the noise except for a couple of outliers. (What's going on with these outliers, I wonder, but given that the difference is so small it's okay anyway.) The "time" benchmarks exhibit wide noise, possibly due to code-placement issues: there are 10% changes in performance, either up or down, with no clear pattern or obvious relation to the change itself. This is not unexpected, and we shouldn't be surprised anymore. What I do find surprising is that there are a handful of outliers (one microbenchmark and two to four macrobenchmarks; really most cases where something "interesting" seems to happen) where the eventring-disabled version is noticeably slower than the eventring-enabled version. This is weird! In particular, both versions use the same code layout (right?), so the strange difference comes from something else. (And there are essentially no programs where the difference goes in the other direction, where the eventring-enabled program is noticeably slower than the eventring-disabled program.) Am I reading the graphs wrong? (I expect that a score >1 means "slower than the baseline" and <1 means "faster than the baseline".) Could you have switched the two labels by mistake?
Thanks for taking a look @gasche. Your comment did get me thinking and I realised that the
So instead I kicked off another benchmarking run, this time with the same runtime but only using environment variables to turn eventring on. There are still benchmarks where eventring enabled takes less time. I picked knucleotide3 and ran the binary (taskset on an isolated core); averaged over 6 runs of each, there's still a ~3.7% performance advantage to having eventring enabled. I wondered whether just having an extra environment variable the same size as
@shakthimaan is currently running benchmarks of this PR on the sandmark benchmarking infrastructure (thanks Shakthi!). The sequential macro benchmarks are done and look remarkably similar to the ones we ran on our own benchmarking setup: there are indeed still benchmarks where eventring enabled performs better than disabled. Parallel benchmarks are still running; I'll update this when they're done.
Personally I'm not so worried about the benchmarks: there is nothing in them that suggests that the change adds noticeable overhead (we are discussing strange noise effects, which suggests that there is nothing else to discuss). And if we were to find a noticeable performance impact later, we have the easy option of moving more probes from the normal runtime to the instrumented runtime. I would consider this point settled. But if we want to make a decision on the PR, we need: (1) a decision that we want this feature, and (2) actual reviews.
So far we haven't had those. Point (1) sounds like an easy sell ("it comes from the same group, we said yes last time, they say it's even better this time, so what's not to like?"), and we have a caml-devel meeting shortly where you can always poll to force a decision. Point (2) is the bottleneck, as always.
abbysmal left a comment
I made a first pass of review; I'm overall pretty enthusiastic about this PR and I hope we can get more people to review it.
It follows up pretty nicely on eventlog's initial target and brings lower overhead to the tracing infrastructure within the runtime, thus lowering the barrier to using these tools (since we no longer have to hide most of the counters and events behind the instrumented runtime).
This will prove especially useful as more users write parallel programs; eventlog proved to be a very useful tool for the Multicore team when investigating interactions between domains during GC events.
Now I will add a few extra comments:
I did not review the win32 code, and I will need to do a second pass of review on the consumer code, as well as on the logic for writing to the eventring.
I remember some early discussions and could make assumptions about parts of the design and how the code interacts with it, but I think adding a few comments and defines could make the review experience a bit nicer.
I think having a nice comment describing the overall topology of the eventring structure could be useful (sort of in the same vein as the one that can be found for the domain lifecycle in domain.c).
I had a few other nitpicks and remarks, all of which can be found in individual comments on the PR.
runtime/caml/eventring.h (outdated):

    };

    #define EVENTRING_NUM_ALLOC_BUCKETS 20
    #define EVENTRING_MAX_MSG_LENGTH (1 << 10)
We do not currently have high-level (as far as C is concerned) functions to actually write user events.
We have a few asserts and checks in place to avoid reaching this condition, so maybe documenting this specific restriction (which looks totally fine) is not required.
It would in any case need to be explicitly stated in a dedicated API, as the logic for writing to the ring buffer is currently hidden behind CAML_INTERNALS and is not a primitive.
runtime/eventring_consumer.c (outdated):

    CAMLlocal2(tmp_callback, callbacks_root);
    CAMLlocalN(params, 4);

    callbacks_root = *((value *)callback_data);
I'd like to take a look at whether there's any problem with the various callback_data bits in this (and related) functions and whether there's any potential for them to be harmfully moved.
runtime/eventring_consumer.c (outdated):

    } while (cursor->current_positions[domain_num] < ring_tail &&
             (max_events == 0 || consumed < max_events) && !early_exit);

    cursor->next_read_domain = (domain_num + 1) % cursor->metadata->max_domains;
Modulus is expensive. Assuming you can't guarantee cursor->metadata->max_domains is a power of two, it might be worth replacing this with domain_num + 1 == cursor->metadata->max_domains ? 0 : domain_num + 1?
patricoferris left a comment

This is a really cool feature, @sadiqj!
I've been looking into using this with ocaml-multicore/eio#138, where I'm connecting to an existing eventring rather than creating them programmatically, hence testing the Some (dir, pid) logic.
Thanks @patricoferris. It looks like that code was broken by a refactoring. I really need to add a test for reading an external process's events, but I may need some help from an ocamltest expert to figure out how we'd do it.
The latest changes move eventring to
The only quirk that I still haven't ironed out (because I'm not entirely sure it needs to be changed) is that now the
Thanks to @dra27 for all kinds of help in getting this moved.
(rebased to trunk just now) I want to check where we are with the PR at the moment, as it would be great to get things merged and follow up with good documentation and tooling in time for 5.0. There's also some follow-up work on the set of events we currently have; there may be some we want to remove (there are some no longer present in 5.0 that are there from 4.x) and others to add. I've purposefully left that out to minimise the diff for this PR. @gasche, are you still planning to take a look at the ring buffer implementation itself?
ctk21 left a comment
I like this: a binary ring buffer is much better than the old string-based file mechanism for dumping events. I think the design of the ring buffer with the wrap-around works with a concurrent reader.
On the testsuite side, it looks like there is good coverage. One area I don't think is covered in the tests is domain slot reuse - that's traditionally been a pain point for anything involving resource reuse. Altering the parallel test to go on to serially spawn/join 256 domains might easily get some coverage there.
I had a niggle around the Domain.id in the consuming API; I think we should consider being a bit more defensive with recursive polls on reading cursors; and there are minor nits here and there.
runtime/caml/eventring.h:

    #ifdef CAML_INTERNALS

    struct runtime_events_buffer_header {

Would a little picture here be useful, denoting how ring_head and ring_tail are operated on by the producer and used by the consumer? (Or even just comments for ring_head/ring_tail describing what the producer and consumer see in these variables.)
Looks like all the things I spotted in review (which focused mainly on the C code side) have been addressed.
As I said during the last development meeting, I have looked at the part of the PR that is not the C code itself. (Maybe not with the level of scrutiny of our usual full reviews, but I'm afraid we have to live with that in the absence of other volunteers.) Now that @ctk21 has carefully reviewed the runtime_event_{consumer,producer}, we may have, in combination, enough assurance for an approval. @ctk21, do you agree with this assessment, or are there other things you would like to look at?
@gasche I agree with your assessment. I don't have further things that need addressing.
Let's go for a formal approval then.
Two remarks:
- I'm still interested in that documentation we have been talking about :-)
- The code I've looked at looked fresh/young; I expect that we will find some other minor issues that we haven't caught during review. I think this is okay for a non-critical piece of code (it does nothing unless enabled, so it does not endanger the correctness of the compiler); it's part of the deal of Multicore not yet being a battle-tested runtime implementation.
Thanks for squashing the commits. The PR is a great addition for the observability of OCaml programs. Looking forward to using it. As a next step, I hope you will contribute the documentation necessary for using eventring.
Great. Thanks everyone for their time in providing feedback and reviewing. Off to documentation and supporting tooling!
This PR breaks the build on OmniOS: OmniOS is probably wrong w.r.t. the types of
Your call.
Having a look at the headers in Illumos, it seems the minimum we need is
Something else is wrong on 32-bit ARM and 32-bit PowerPC: these platforms implement 64-bit atomic operations using locks and library functions. For Clang the functions are linked automatically, but for GCC we need to link
Do we really need 64-bit indices in the ring buffer on a 32-bit platform? Could you please use
Note that a 64-bit atomic int implemented with locks occupies more than 8 bytes, so you're not getting the same memory layout on 32-bit ARM as on, say, x86-64, at any rate.
We don't need to keep 64-bit indices on 32-bit platforms. I can't imagine anyone would really want to monitor a 32-bit process from 64-bit or vice versa (or across architectures). I'll test a combined fix on precheck to see if there's anything else that has fallen out.
Thinking about it... if
Oh yes, this is a good point. I'll get
I'm on it; it's undergoing CI.
I pushed the
Thank you, I'll get the OmniOS fix in soon.

Summary
Eventring is a runtime tracing system designed for continuous monitoring of OCaml applications. It is intended to replace the existing file-logging based eventlog in the runtime. It has very low overhead: negligible when not in use and ~0.5% when enabled. It uses per-domain memory-mapped ring buffers that enable consumption of tracing events both in and out of process. At present only runtime events (the same set as eventlog) are added to the buffers, but the design is intended to eventually support user events as well and to integrate with statmemprof.
Design
Context
The OCaml runtime currently provides an event logging system, "eventlog", which can emit a range of runtime and GC events in the Common Trace Format (CTF) to a log file. It is only enabled in the instrumented runtime, and the file-based logging makes using eventlog for long-running applications very difficult.
Goal
The goal of this PR is to replace eventlog's implementation with a minimal implementation of a more flexible transport for runtime and GC events. It intends to enable consumption of these events from OCaml code both inside and outside the runtime being monitored.
This PR only intends to deal with runtime and GC events, not those originating from user code. There is follow-up work on porting eventlog-tools, on documentation, and on making the Windows support more robust (and potentially more native). We are holding off on those in case design changes are required at this stage, and also because the PR is already fairly large.
Design
Probes
This PR retains the existing probes on trunk but moves most of them into the default runtime. A few of the probes on hot paths (such as free-list allocation) are left in the instrumented runtime. The dividing line is essentially whether the work performed during a particular event is very high compared to the (low) cost of emitting an event.
Transport
The on-disk log file in eventlog is replaced with per-domain memory-mapped ring buffers.
The ring buffer is structured as a flight recorder, overwriting old data when there is insufficient space to write new events. This enables users to potentially only read the ring when some anomalous event occurs or a performance target is missed. No coordination is needed with the consumers that read the events: they detect races with the producer and discard events when that happens.
The producer code is contained here. By default an .eventring file is created in the current directory (overridable by setting OCAML_EVENTRING_DIR). This file contains a ring buffer for each possible domain (Max_domains). It is laid out in a structure that enables sparsity: the on-disk (or in-memory) footprint is proportional to the maximum number of concurrent domains the process has ever run.
On-disk structure
API
The consumer APIs are cursor-based and polled. Users create a cursor with a path and pid and, if it is successfully created, can then poll the cursor with a set of callbacks to be called for each event type they are interested in.
The OCaml API is available in Eventring.mli, the C API is available in eventring.h.
Consumer code is contained in eventring_consumer.c. This contains the implementations of both the OCaml API and the C API (the former being a thin layer over the latter).
Examples
Emitting minor and major slice GC pause times to stdout
Here's a simple consumer that prints the times of minor and major slices. It omits any support for multiple domains, error handling or lost events:
When compiled and run with this PR's branch you will get an output like:
Code such as the above could be spawned in a separate domain to send metrics data to an external service (such as Graphite or Prometheus) over a socket.
Chrome tracing
This is a simplified example of how Eventring can be used to log runtime data in a format that can be readily ingested by Chrome's trace viewer. Running the above project with OCAML_EVENTRING_START=1 _build/default/src/traceevents_lib.exe and then loading the resultant JSON file into Chrome's tracing viewer will give something like the following:

Tests
The tests/eventring directory contains some other examples of using the different APIs.
Benchmarks
Timing
For all macro benchmarks on sandmark that run on trunk, the performance impact when normalised to trunk (f22a84b) averaged over five runs is:
A value <1 indicates eventring's branch was marginally faster than trunk.
The graph below shows the normalised (to trunk) runtime for long-running (>4s) macro benchmarks:
All macro benchmarks:
You can find the raw benchmarking data here: https://github.com/sadiqj/sandmark_benchmark_data/tree/main/eventring_pr
Instructions
Below is a run of sequential sandmark using perf stat to count the number of retired instructions:
Here eventring resulted in a 0.01% increase in instructions when not enabled and 0.09% when enabled.
For each benchmark below there are two rows: this PR without eventring enabled, and this PR with it enabled. "Instructions increase" is the percentage increase in instructions for this PR over trunk. E.g. for the first row, LU_decomposition, this PR leads to an increase of 7 thousandths of a percent (2.25M extra instructions on top of the 32.28B that trunk executes).