Instrumentation with LLVM XRay

## Goal from https://github.com/ClickHouse/ClickHouse/issues/71175

When the code is compiled with LLVM XRay, it adds a space in entry and exit of each large function (filled with nop instructions), that can be patched with callbacks in runtime. This has a low runtime overhead and can be used for production builds, so they can be traced when needed. LLVM already provides various callbacks for global application tracing. But we want to integrate these capabilities into ClickHouse, e.g. - to enable and disable tracing with SYSTEM queries, to dynamically add logs or failpoints (e.g. sleep for a certain time in a certain function), write tracing information into system tables, per-query tracing, and augmentation of traces with information about the query, implementing code coverage in production, etc.

This topic will be interesting for people who like low-level programming and compilers.

## Investigation materials

1. Read the [ClickHouse documentation about LLVM XRay](https://clickhouse.com/docs/knowledgebase/profiling-clickhouse-with-llvm-xray)
2. Read the LLVM XRay documentation: https://llvm.org/docs/XRay.html and https://llvm.org/docs/XRayExample.html
3. Watch the two talks about LLVM XRay to gain deeper understanding on how it works: https://www.youtube.com/watch?v=cBc_MxbzqhY and https://www.youtube.com/watch?v=jyL-__zOGcU. The part with the explanation on how they patch atomically the functions is brilliant. TL;DR they start patching backwards and patch the jmp instruction (2 bytes) the latest one. They make sure the jmp is always at an even address to ensure it doesn't cross the cache line (64b in x86_64), hence the atomicity 👌
4. Understand the different modes XRay has. Some summary:
- The xray-basic mode generates too big traces. We could use them if we enable traces only a little bit, but they grow up very quickly
- The xray-fdr mode [can only be controlled programmatically](https://discourse.llvm.org/t/xray-options-variable-does-not-enable-fdr-mode-and-log-flushing/48825) (enabling it in the source code of an app using XRay) as of now, and uses a circular buffer. It's meant to be something that you switch on and off, and collect the results of the latest that happened
- There's a "new" [xray-profiling](https://discourse.llvm.org/t/rfc-xray-profiling-in-llvm/47651) mode that is not even documented. I saw that in the source code. It's supposed to do what we'd expect when profiling. The size of the traces is minimum compared to xray-basic, since it already computes the times spent in the stack for each function. However, llvm-xray tools don't seem to understand that format. The code that generates the format is there, so I guess it's a matter of checking it out. Can we use this format as it is somehow already in order to avoid creating our own patch methods to profile function calls?
5. Check out the source code from LLVM XRay at https://github.com/llvm/llvm-project/tree/main/compiler-rt/lib/xray, https://github.com/llvm/llvm-project/tree/main/llvm/lib/XRay and the tests at https://github.com/llvm/llvm-project/tree/main/compiler-rt/lib/xray/tests/unit

## Implementation steps proposal

Since I've already done some investigation about this and it is task for an intern, let me write a more comprehensive list of stuff that I think we can do and in different steps, going from simpler to more complex stuff. The goal is to be able to install new sleds in ClickHouse to add instrumentation profiling and other interesting ideas to enhance logging, add sleeps, etc.

1. In order to patch functions, XRay's API uses some ID for each function. The alternative would be to patch all functions and filter the name in our handler, but I think that'd affect performance significantly. We can test it, anyway. So, we'll probably need to map function names to these IDs, since we'd like to have our `SYSTEM` statement call use function names. XRay creates a mapping that is embedded into the final executable in the `xray_instr_map` section of the ELF binary. This can be read with the `llvm-xray` CLI tool. We need to be able to read that section from within the same binary. I'd start writing a simple C++ application with some classes with simple methods, instrumenting that with XRay and trying to add a new sled for one of the existing functions. Check out [InstrumentationMap.cpp](https://github.com/llvm/llvm-project/blob/c2979c58d49bf3c7dc892ed9fb49cdca389130ee/llvm/lib/XRay/InstrumentationMap.cpp).
2. Create a new `SYSTEM` statement that ClickHouse interprets to patch. We can add for now new logging/sleep functionality to desired functions. e.g. `SYSTEM INSTRUMENT SLEEP/LOG [function] [parameters]`. I think every instrumentation should be logged in a new `system.instrumentation` table to know which functions have instrumentation. What other functionality do you think would be interesting that it's easy to implement?
- Example of a PR that adds a new system table: [system.error_log](https://github.com/ClickHouse/ClickHouse/pull/65381/)
- Example of a PR that adds a new system statement: [SYSTEM UNLOAD PRIMARY KEY](https://github.com/ClickHouse/ClickHouse/pull/62738)
3.  Add instrumentation to profiles a function. e.g. `SYSTEM INSTRUMENT PROFILE [function]`
- It can profile only the time it takes to execute that single function, not each of its children, which I think is more complicated. See bullet point 6
- Trace the time for the function in a new system table
4. Now that the new table in system that stores the data, we need to think how to export it to some format that can easily be consumed and visualized. I suggest using the [trace format](https://www.chromium.org/developers/how-tos/trace-event-profiling-tool/) that can be used from Chrome trace viewer, [Speedscope](https://www.speedscope.app/) or [Perfetto](https://ui.perfetto.dev/)
5. We would love to be able to add these instrumentations only for specific queries. In that case, it'd be a setting at query level that enables the type of instrumentation only for the given query. In order to do that, we'd have to do something very similar to bullet point 6: patch the function but filter at the handler level for that specific query
6. I think profiling the children/descendants of a function would be quite useful, but I also think it's not simple at all. How do we identify all functions we need to patch? That seems very difficult. A naïve approach could be as follows, but I haven't tested it. Beware the threading complications of all of it:
- In the entry handler for the function we want to profile/patch we install a handler for all functions. We store that thread id in a `threads_to_monitor` list, since other threads may end up calling this same function.
- We measure the time each function takes, as long as its thread it's in a thread of the `threads_to_monitor` list
- On the exit handler we remove our thread id from `threads_to_monitor` and if it's empty it means we need to unpatch this and all functions
- Of course, patching and unpatching all functions during each execution of a function seems not a good thing for performance and filtering the thread for every single function goes against to what I stated about performance in bullet point 1, but I have no better idea. Maybe the `xray-profiling` mode does it in a different way? Or maybe it does not even consider the children.






Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Instrumentation with LLVM XRay #74249

Goal from #71175

Investigation materials

Implementation steps proposal

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Instrumentation with LLVM XRay #74249

Description

Goal from #71175

Investigation materials

Implementation steps proposal

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions