Skip to content

Add instrumentation using XRay#89173

Merged
pamarcos merged 233 commits intoClickHouse:masterfrom
pamarcos:xray-instrumentation
Dec 2, 2025
Merged

Add instrumentation using XRay#89173
pamarcos merged 233 commits intoClickHouse:masterfrom
pamarcos:xray-instrumentation

Conversation

@pamarcos
Copy link
Copy Markdown
Member

@pamarcos pamarcos commented Oct 29, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add instrumentation at runtime using XRay to debug issues in production and profile deterministically. Resolves #74249

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Details

Add instrumentation to ClickHouse using LLVM's XRay feature to debug issues in production without needing to rebuild. This is a reworked PR to bring the work that @htuall started at #77153 to be production-ready.

Please take a look at the original issue to understand better what this is about. The TL;DR is that we can leverage LLVM's XRay to add a prolog and an epilog when entering and exiting functions that are longer than 200 instructions (by default) to execute extra handlers.

  • Enable the XRay instrumentation for Linux x86_64 and aarch64 for all non-sanitized builds. Using XRay along with other sanitizers is incompatible. See this comment.
  • New SYSTEM INSTRUMENT statement support to add and remove instrumentation points
  • New system table to inspect instrumented symbols: system.instrumentation
  • Add new columns to system.trace_log to query instrumentation log entries
  • Read the xray_instr_map only when there's a need to patch a function
  • 3 types of handlers so far: LOG, SLEEP and PROFILE. For the first two, you can install them on entry or on exit of the function

Closes #74249

Binary size comparison shows a ~4.5% increase for x86_64 and ~7% increase for aarch64.

htuall and others added 30 commits April 9, 2025 21:18
…h custom handler for xray + started working on adding SYSTEM-statements to add/remove instrumentation(wip)
…atements using it + attempted to add system.instrumentation table(not sure if it's coorrect
Let's build with ENABLE_XRAY=1 only for the Debug build on AMD
for now. This is just to ensure on the CI that it builds
correctly.

There are some USE_XRAY missing in other files, but it is left
on purpose as an excercise to the reader :)
…andler_name and function_name as identifiers and not string literals, add possibility to pass different type of parameters for differnt hadlers and make them optional
@pamarcos pamarcos enabled auto-merge December 2, 2025 19:42
@pamarcos pamarcos added this pull request to the merge queue Dec 2, 2025
@pamarcos
Copy link
Copy Markdown
Member Author

pamarcos commented Dec 2, 2025

I'm merging this after syncing with master because cross compilation builds affected slightly how XRay flavor needed to be built.
I know there's a performance degradation in sum_map , but I decided to merge anyway to catch what other performance issues we may find.

Merged via the queue into ClickHouse:master with commit 7f36330 Dec 2, 2025
125 of 130 checks passed
@pamarcos pamarcos deleted the xray-instrumentation branch December 2, 2025 20:13
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 2, 2025
@azat
Copy link
Copy Markdown
Member

azat commented Dec 2, 2025

FWIW this breaks incremental build (not a big deal, but still)

I though that it was due to cmake cached variables, since previously we have ENABLE_XRAY=OFF, but even after explicit cmake -DENABLE_XRAY=ON it complains with

/src/ch/clickhouse/src/Interpreters/InstrumentationManager.h:5:10: fatal error: 'xray/xray_interface.h' file not found
    5 | #include <xray/xray_interface.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~

At this point I will simply clean the build cache.

@azat
Copy link
Copy Markdown
Member

azat commented Dec 2, 2025

Oh, this is actually real an issue with my setup, it uses xray_interface.h from system-wide clang, and I'm using clang compiled from sources, but I did not compile compiler-rt, and hence it cannot find llvm-project/compiler-rt/include/xray/xray_interface.h.

Though we cannot use bundled xray_interface.h (sigh)

@pamarcos pamarcos mentioned this pull request Dec 3, 2025
1 task
@pamarcos
Copy link
Copy Markdown
Member Author

pamarcos commented Dec 3, 2025

Quick guide:

@Algunenano
Copy link
Copy Markdown
Member

Though we cannot use bundled xray_interface.h (sigh)

Why can't we? Using the system one seems like a bug to me. We are trying to use as fewer system files as possible as this goes against it.

@Algunenano
Copy link
Copy Markdown
Member

Moved the discussion to an issue: #91475

break;
}

res->instrumentation_point_id = temporary_identifier->as<ASTLiteral>()->value.safeGet<UInt64>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

azat.local$  :) SYSTEM INSTRUMENT REMOVE 'QueryMetricLog::startQuery'
Exception on client:
Code: 170. DB::Exception: Bad get: has String, requested UInt64. (BAD_GET), Stack trace (when copying this message, always include the lines below):

0. /src/ch/clickhouse/contrib/llvm-project/libcxx/include/__exception/exception.h:113: Poco::Exception::Exception(String const&, int) @ 0x00000000218618f9
1. /src/ch/clickhouse/src/Common/Exception.cpp:137: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000012efd6ed
2. /src/ch/clickhouse/src/Common/Exception.h:172: DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000aa89c11
3. /src/ch/clickhouse/src/Common/Exception.h:58: DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000aa8968b
4. /src/ch/clickhouse/src/Common/Exception.h:190: DB::Exception::Exception<std::basic_string_view<char, std::char_traits<char>>, std::basic_string_view<char, std::char_traits<char>>>(int, FormatStringHelperImpl<std::type_identity<std::basic_string_view<char, std::char_traits<char>>>::type, std::type_identity<std::basic_string_view<char, std::char_traits<char>>>::type>, std::basic_string_view<char, std::char_traits<char>>&&, std::basic_string_view<char, std::char_traits<char>>&&) @ 0x00000000178c51fa
5. /src/ch/clickhouse/src/Core/Field.cpp:872: DB::NearestFieldTypeImpl<__decay(unsigned long), void>::Type& DB::Field::safeGet<unsigned long>() & @ 0x00000000178c5673
6. /src/ch/clickhouse/src/Parsers/ParserSystemQuery.cpp:800: DB::ParserSystemQuery::parseImpl(DB::IParser::Pos&, std::shared_ptr<DB::IAST>&, DB::Expected&) @ 0x000000001d32e877
7. /src/ch/clickhouse/src/Parsers/IParserBase.cpp:14: DB::IParserBase::parse(DB::IParser::Pos&, std::shared_ptr<DB::IAST>&, DB::Expected&) @ 0x000000001d2e5476
8. /src/ch/clickhouse/src/Parsers/ParserQuery.cpp:87: DB::ParserQuery::parseImpl(DB::IParser::Pos&, std::shared_ptr<DB::IAST>&, DB::Expected&) @ 0x000000001d31081d
9. /src/ch/clickhouse/src/Parsers/IParserBase.cpp:14: DB::IParserBase::parse(DB::IParser::Pos&, std::shared_ptr<DB::IAST>&, DB::Expected&) @ 0x000000001d2e5476
10. /src/ch/clickhouse/src/Parsers/parseQuery.cpp:321: DB::tryParseQuery(DB::IParser&, char const*&, char const*, String&, bool, String const&, bool, unsigned long, unsigned long, unsigned long, bool) @ 0x000000001d34e1e6
11. /src/ch/clickhouse/src/Client/ClientBase.cpp:411: DB::ClientBase::parseQuery(char const*&, char const*, DB::Settings const&, bool) @ 0x000000001c2d24e6
12. /src/ch/clickhouse/src/Client/ClientBase.cpp:2480: DB::ClientBase::analyzeMultiQueryText(char const*&, char const*&, char const*, std::shared_ptr<DB::IAST>&, std::unique_ptr<DB::Exception, std::default_delete<DB::Exception>>&) @ 0x000000001c2e34fd
13. /src/ch/clickhouse/src/Client/ClientBase.cpp:2604: DB::ClientBase::executeMultiQuery(String const&) @ 0x000000001c2e3c75
14. /src/ch/clickhouse/src/Client/ClientBase.cpp:2963: DB::ClientBase::processQueryText(String const&) @ 0x000000001c2e5165
15. /src/ch/clickhouse/src/Client/ClientBase.cpp:3679: DB::ClientBase::runInteractive() @ 0x000000001c2ede8c
16. /src/ch/clickhouse/programs/client/Client.cpp:402: DB::Client::main(std::vector<String, std::allocator<String>> const&) @ 0x000000001314f52d
17. /src/ch/clickhouse/base/poco/Util/src/Application.cpp:315: Poco::Util::Application::run() @ 0x000000002194e8b4
18. /src/ch/clickhouse/programs/client/Client.cpp:1266: mainEntryClickHouseClient(int, char**) @ 0x000000001315c5ce
19. /src/ch/clickhouse/programs/main.cpp:380: main @ 0x000000000aa7fd3f
20. ../sysdeps/nptl/libc_start_call_main.h:58: __libc_start_call_main @ 0x0000000000027635
21. ../csu/libc-start.c:360: __ieee754_ilogbf128 @ 0x00000000000276e9
22. _start @ 0x000000000aa7512e

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is the expected behavior. It was something like this at first, but then I changed it to take IDs to remove specific instrumentation points because you may want to remove only 1 handler for a function but leave the rest.

I can make REMOVE to take both a specific ID and also an identifier/string that removes all handlers that match that function if you think it's worth it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be nice to support removing by identifier/string, but my point here was that the exception is unexpected, it should be some kind of parsing error

case Type::INSTRUMENT_ADD:
{
ASTPtr temporary_identifier;
if (ParserIdentifier{}.parse(pos, temporary_identifier, expected))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also any reasons for using identifier over string literals?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that when we thought about this, Alina and I were happy considering function names as identifiers the same way we consider table or database names. Now, I agree it's more natural to think of function names as literals. I'm going to do a follow-up PR with UX improvements based on your feedback to make this easier to use.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #93345

@azat
Copy link
Copy Markdown
Member

azat commented Dec 26, 2025

Also right now it binds to first symbol only, while debuggers usually bind to all symbols, i.e. we have a few executeQuery functions and simple SYSTEM INSTRUMENT ADD "executeQuery" LOG EXIT 'test' will bind to DB::YTsaurusClient::executeQuery

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🎅 🎁 gift🎄 To make people wonder performance pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Instrumentation with LLVM XRay