Skip to content

Fix possible deadlock for jemalloc with enabled profiler#66346

Merged
antonio2368 merged 1 commit intoClickHouse:masterfrom
azat:jemalloc-deadlock-fix
Jul 11, 2024
Merged

Fix possible deadlock for jemalloc with enabled profiler#66346
antonio2368 merged 1 commit intoClickHouse:masterfrom
azat:jemalloc-deadlock-fix

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jul 10, 2024

Note: marked as not-for-changelog, since the culprit hadn't been included into any release yet - #65702

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix possible deadlock for jemalloc with enabled profiler

ClickHouse uses JEMALLOC_PROF_LIBGCC - _Unwind_Backtrace() for jemalloc, which calls _Unwind_Backtrace() during bootstrap of jemalloc, and if, dlsym() has allocations it will lead to deadlock:

Details
0  __lll_lock_wait (futex=futex@entry=0x16516410 <init_lock+64>, private=0) at lowlevellock.c:52
1  0x00007ffff7f9d0a3 in __GI___pthread_mutex_lock (mutex=0x16516410 <init_lock+64>) at ../nptl/pthread_mutex_lock.c:80
2  0x000000000ba6ec23 in pthread_mutex_lock (arg=arg@entry=0x16516410 <init_lock+64>) at ./build/RelWithDebInfo/./src/Common/ThreadFuzzer.cpp:445
3  0x0000000014215fd4 in malloc_mutex_lock_final (mutex=0x165163d0 <init_lock>) at ./contrib/jemalloc/include/jemalloc/internal/mutex.h:151
4  malloc_mutex_lock_slow (mutex=0x165163d0 <init_lock>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/mutex.c:90
5  0x00000000141defc6 in malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>) at ./contrib/jemalloc/include/jemalloc/internal/mutex.h:217
6  malloc_init_hard () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2118
7  0x00000000141d05db in malloc_init () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:298
8  imalloc_init_check (sopts=<optimized out>, dopts=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2658
9  imalloc (sopts=<optimized out>, dopts=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2689
10 calloc (num=num@entry=1, size=size@entry=32) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2852
11 0x00007ffff7c42c05 in _dlerror_run (operate=operate@entry=0x7ffff7c42490 <dlsym_doit>, args=args@entry=0x7fffffffda40) at dlerror.c:148
12 0x00007ffff7c42525 in __dlsym (handle=<optimized out>, name=0xe7f568 "dl_iterate_phdr") at dlsym.c:70
13 0x000000001408e1a0 in (anonymous namespace)::getOriginalDLIteratePHDR () at ./build/RelWithDebInfo/./base/base/phdr_cache.cpp:44
14 dl_iterate_phdr (callback=0x16287a60 <libunwind::findUnwindSectionsByPhdr(dl_phdr_info*, unsigned long, void*)>, data=0x7fffffffdb58) at ./build/RelWithDebInfo/./base/base/phdr_cache.cpp:64
15 0x0000000016288468 in libunwind::LocalAddressSpace::findUnwindSections (this=<optimized out>, targetAddr=371765377, info=...) at ./contrib/libunwind/src/AddressSpace.hpp:605
16 libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister (this=this@entry=0x7fffffffde70, isReturnAddress=false) at ./contrib/libunwind/src/UnwindCursor.hpp:2553
17 0x0000000016287617 in __unw_init_local (cursor=cursor@entry=0x7fffffffde70, context=context@entry=0x7fffffffddc8) at ./build/RelWithDebInfo/./contrib/libunwind/src/libunwind.cpp:91
18 0x000000001628b094 in _Unwind_Backtrace (callback=0x14226f40 <prof_unwind_init_callback>, ref=0x0) at ./build/RelWithDebInfo/./contrib/libunwind/src/UnwindLevel1-gcc-ext.c:106
19 0x000000001421a7e0 in prof_boot2 (tsd=tsd@entry=0x7ffff7c357c0, base=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/prof.c:693
20 0x00000000141df3e7 in malloc_init_hard () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2151
21 0x00000000141d96f1 in malloc_init () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:298
22 nallocx (size=32, flags=0) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:4011
23 0x000000000b9fdd9b in _ZN6Memory23getActualAllocationSizeITpTkNSt3__17same_asISt11align_val_tEEJEQsr2DBE16OptionalArgumentIDpT_EEEmmS5_ (size=32) at ./src/Common/memory.h:197
24 _ZN6Memory11trackMemoryITpTkNSt3__17same_asISt11align_val_tEEJEQsr2DBE16OptionalArgumentIDpT_EEEmmR15AllocationTraceS5_ (size=32, trace=...) at ./src/Common/memory.h:208
25 operator new (size=size@entry=32) at ./build/RelWithDebInfo/./src/Common/new_delete.cpp:53
26 0x000000001540e0a6 in google::protobuf::internal::ShutdownData::get () at ./build/RelWithDebInfo/./contrib/google-protobuf/src/google/protobuf/message_lite.cc:679
27 google::protobuf::internal::OnShutdownRun (f=0x15407420 <google::protobuf::internal::DestroyString(void const*)>, arg=0x167a6368 <google::protobuf::internal::fixed_address_empty_string>) at ./build/RelWithDebInfo/./contrib/google-protobuf/src/google/protobuf/message_lite.cc:697
28 0x0000000015407560 in global constructors keyed to 000101 () at ./contrib/google-protobuf/src/google/protobuf/generated_message_util.h:180
29 0x000000001629101d in __libc_csu_init ()
30 0x00007ffff7dba010 in __libc_start_main (main=0x66e3b00 <main(int, char**)>, argc=4, argv=0x7fffffffe258, init=0x16290fd0 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe248) at ../csu/libc-start.c:264
31 0x00000000066e302e in _start ()

And this is indeed what happens for glibc prio 2.34, since only in 1 the allocation had been removed

But, jemalloc provides also JEMALLOC_PROF_LIBUNWIND, yes it is not llvm libunwind, but gnu one, but ClickHouse version of llvm libunwind provides unw_backtrace(), so we can use it, and it will not be called during jemalloc bootstrap, so deadlock should not be possible.

Funny thing that for this deadlock the workaround is to enable background_thread, because a thread created for it will initialize TLS for __dlerror().

Refs: jemalloc/jemalloc#585
Refs: jemalloc/jemalloc#2454
Fixes: #65702 (Cc: @antonio2368 )

ClickHouse uses JEMALLOC_PROF_LIBGCC - _Unwind_Backtrace() for jemalloc,
which calls _Unwind_Backtrace() during bootstrap of jemalloc, and if,
dlsym() has allocations it will lead to deadlock:

    0  __lll_lock_wait (futex=futex@entry=0x16516410 <init_lock+64>, private=0) at lowlevellock.c:52
    1  0x00007ffff7f9d0a3 in __GI___pthread_mutex_lock (mutex=0x16516410 <init_lock+64>) at ../nptl/pthread_mutex_lock.c:80
    2  0x000000000ba6ec23 in pthread_mutex_lock (arg=arg@entry=0x16516410 <init_lock+64>) at ./build/RelWithDebInfo/./src/Common/ThreadFuzzer.cpp:445
    3  0x0000000014215fd4 in malloc_mutex_lock_final (mutex=0x165163d0 <init_lock>) at ./contrib/jemalloc/include/jemalloc/internal/mutex.h:151
    4  malloc_mutex_lock_slow (mutex=0x165163d0 <init_lock>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/mutex.c:90
    5  0x00000000141defc6 in malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>) at ./contrib/jemalloc/include/jemalloc/internal/mutex.h:217
    6  malloc_init_hard () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2118
    7  0x00000000141d05db in malloc_init () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:298
    8  imalloc_init_check (sopts=<optimized out>, dopts=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2658
    9  imalloc (sopts=<optimized out>, dopts=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2689
    10 calloc (num=num@entry=1, size=size@entry=32) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2852
    11 0x00007ffff7c42c05 in _dlerror_run (operate=operate@entry=0x7ffff7c42490 <dlsym_doit>, args=args@entry=0x7fffffffda40) at dlerror.c:148
    12 0x00007ffff7c42525 in __dlsym (handle=<optimized out>, name=0xe7f568 "dl_iterate_phdr") at dlsym.c:70
    13 0x000000001408e1a0 in (anonymous namespace)::getOriginalDLIteratePHDR () at ./build/RelWithDebInfo/./base/base/phdr_cache.cpp:44
    14 dl_iterate_phdr (callback=0x16287a60 <libunwind::findUnwindSectionsByPhdr(dl_phdr_info*, unsigned long, void*)>, data=0x7fffffffdb58) at ./build/RelWithDebInfo/./base/base/phdr_cache.cpp:64
    15 0x0000000016288468 in libunwind::LocalAddressSpace::findUnwindSections (this=<optimized out>, targetAddr=371765377, info=...) at ./contrib/libunwind/src/AddressSpace.hpp:605
    16 libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister (this=this@entry=0x7fffffffde70, isReturnAddress=false) at ./contrib/libunwind/src/UnwindCursor.hpp:2553
    17 0x0000000016287617 in __unw_init_local (cursor=cursor@entry=0x7fffffffde70, context=context@entry=0x7fffffffddc8) at ./build/RelWithDebInfo/./contrib/libunwind/src/libunwind.cpp:91
    18 0x000000001628b094 in _Unwind_Backtrace (callback=0x14226f40 <prof_unwind_init_callback>, ref=0x0) at ./build/RelWithDebInfo/./contrib/libunwind/src/UnwindLevel1-gcc-ext.c:106
    19 0x000000001421a7e0 in prof_boot2 (tsd=tsd@entry=0x7ffff7c357c0, base=<optimized out>) at ./build/RelWithDebInfo/./contrib/jemalloc/src/prof.c:693
    20 0x00000000141df3e7 in malloc_init_hard () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:2151
    21 0x00000000141d96f1 in malloc_init () at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:298
    22 nallocx (size=32, flags=0) at ./build/RelWithDebInfo/./contrib/jemalloc/src/jemalloc.c:4011
    23 0x000000000b9fdd9b in _ZN6Memory23getActualAllocationSizeITpTkNSt3__17same_asISt11align_val_tEEJEQsr2DBE16OptionalArgumentIDpT_EEEmmS5_ (size=32) at ./src/Common/memory.h:197
    24 _ZN6Memory11trackMemoryITpTkNSt3__17same_asISt11align_val_tEEJEQsr2DBE16OptionalArgumentIDpT_EEEmmR15AllocationTraceS5_ (size=32, trace=...) at ./src/Common/memory.h:208
    25 operator new (size=size@entry=32) at ./build/RelWithDebInfo/./src/Common/new_delete.cpp:53
    26 0x000000001540e0a6 in google::protobuf::internal::ShutdownData::get () at ./build/RelWithDebInfo/./contrib/google-protobuf/src/google/protobuf/message_lite.cc:679
    27 google::protobuf::internal::OnShutdownRun (f=0x15407420 <google::protobuf::internal::DestroyString(void const*)>, arg=0x167a6368 <google::protobuf::internal::fixed_address_empty_string>) at ./build/RelWithDebInfo/./contrib/google-protobuf/src/google/protobuf/message_lite.cc:697
    28 0x0000000015407560 in global constructors keyed to 000101 () at ./contrib/google-protobuf/src/google/protobuf/generated_message_util.h:180
    29 0x000000001629101d in __libc_csu_init ()
    30 0x00007ffff7dba010 in __libc_start_main (main=0x66e3b00 <main(int, char**)>, argc=4, argv=0x7fffffffe258, init=0x16290fd0 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe248) at ../csu/libc-start.c:264
    31 0x00000000066e302e in _start ()

And this is indeed what happens for glibc prio 2.34, since only in [2]
the allocation had been removed

  [2]: https://sourceware.org/git/?p=glibc.git;a=commit;h=fada9018199c21c469ff0e731ef75c6020074ac9

But, jemalloc provides also JEMALLOC_PROF_LIBUNWIND, yes it is not llvm
libunwind, but gnu one, but ClickHouse version of llvm libunwind
provides unw_backtrace(), so we can use it, and it will not be called
during jemalloc bootstrap, so deadlock should not be possible.

Funny thing that for this deadlock the workaround is to enable
background_thread, because a thread created for it will initialize TLS for
__dlerror().

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
@robot-ch-test-poll4 robot-ch-test-poll4 added pr-not-for-changelog This PR should not be mentioned in the changelog submodule changed At least one submodule changed in this PR. labels Jul 10, 2024
@antonio2368 antonio2368 self-assigned this Jul 10, 2024
@robot-ch-test-poll4
Copy link
Copy Markdown
Contributor

robot-ch-test-poll4 commented Jul 10, 2024

This is an automated comment for commit e6e8b06 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help❌ failure
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 10, 2024

And I can confirm that this helps:

$ docker run --rm -it ubuntu:20.04 apt-cache policy libc6
libc6:
  Installed: 2.31-0ubuntu9.14
  Candidate: 2.31-0ubuntu9.14
  Version table:
 *** 2.31-0ubuntu9.14 100
        100 /var/lib/dpkg/status
# without background_thread, with profiler, with -DJEMALLOC_PROF_LIBGCC=1
$ docker run -v $(which clickhouse):/bin/clickhouse --rm -it ubuntu:20.04 clickhouse server --help
^C^C^C^C # hunged
# without background_thread, with profiler, with -DJEMALLOC_PROF_LIBUNWIND=1
$ docker run -v $(which clickhouse):/bin/clickhouse --rm -it ubuntu:20.04 clickhouse server --help
usage:
clickhouse [OPTION] [-- [ARG]...]
positional arguments can be used to rewrite config.xml properties, for
example, --http_port=8010

-h, --help                        show help and exit
-V, --version                     show version and exit
-C<file>, --config-file=<file>    load configuration from a given file
-L<file>, --log-file=<file>       use given log file
-E<file>, --errorlog-file=<file>  use given log file for errors only
-P<file>, --pid-file=<file>       use given pidfile
--daemon                          Run application as a daemon.
--umask=mask                      Set the daemon's umask (octal, e.g. 027).
--pidfile=path                    Write the process ID of the application to
                                  given file.

Copy link
Copy Markdown
Member

@antonio2368 antonio2368 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And stacktraces are collected correctly with profiler

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 10, 2024

And so as profiler:

1|azat:.../clickhouse/.cmake (dev) [1]$ jeprof --tools addr2line:/bin/llvm-addr2line,nm:/bin/llvm-nm,objdump:/bin/llvm-objdump,c++filt:/bin/c++filt --text $(which clickhouse) /tmp/jemalloc_clickhouse.1602.3.heap
Using local file /src/ch/clickhouse/.cmake/programs/clickhouse.
Using local file /tmp/jemalloc_clickhouse.1602.3.heap.
Total: 8197.0 MB
  8197.0 100.0% 100.0%   8197.0 100.0% std::__1::vector::__destroy_vector::operator[abi:v15007] (inline)
     0.0   0.0% 100.0%   8197.0 100.0% DB::::FunctionMapOp::execute1
     0.0   0.0% 100.0%   8197.0 100.0% DB::::FunctionMapOp::execute2 (inline)
     0.0   0.0% 100.0%      1.8   0.0% DB::::buildSelectQueryDescription
     0.0   0.0% 100.0%      1.3   0.0% DB::AnyJoinImpl::join
     0.0   0.0% 100.0%      1.3   0.0% DB::AnyJoinState::setValue (inline)
     0.0   0.0% 100.0%      1.3   0.0% DB::Chunk::operator= (inline)
     0.0   0.0% 100.0%   8192.0  99.9% DB::Context::applySettingChange
     0.0   0.0% 100.0%      1.0   0.0% DB::Field (inline)
     0.0   0.0% 100.0%      1.0   0.0% DB::Field::Field (inline)
     0.0   0.0% 100.0%      1.0   0.0% DB::Field::createConcrete (inline)
     0.0   0.0% 100.0%   8196.0 100.0% DB::FunctionMapPopulateSeries::executeImplTyped
     0.0   0.0% 100.0%   8192.0  99.9% DB::MetadataStorageFromStaticFilesWebServer::isDirectory

P.S. you should always use llvm tools, since binutils are significantly slower

@azat
Copy link
Copy Markdown
Member Author

azat commented Jul 11, 2024

Stateless tests (ubsan) — fail: 1, passed: 6684, skipped: 56

Stateless tests (debug) [2/2] — Tests are not finished, fail: 6, passed: 2233, skipped: 28

2024.07.11 04:56:13.783145 [ 11934 ] {4682ae8f-c6ab-46d7-bbe9-1b9ee723803a} <Fatal> : Logical error: 'Inconsistent AST formatting: the query:
SELECT *, e, e2 FROM t1 FULL OUTER JOIN t2 ON (((t1.x = t2.x) AS e) AND ((t2.x IS NOT NULL) AND (t1.x IS NOT NULL))) OR ((t2.x IS NULL) AND (t1.x IS NULL)) AS e2 ORDER BY t1.x ASC NULLS LAST, t2.x ASC NULLS LAST
cannot parse.'.

@antonio2368 antonio2368 added this pull request to the merge queue Jul 11, 2024
Merged via the queue into ClickHouse:master with commit 3d02388 Jul 11, 2024
@azat azat deleted the jemalloc-deadlock-fix branch July 11, 2024 09:50
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 11, 2024
rynewang added a commit to ray-project/ray that referenced this pull request Nov 13, 2024
This fixes a jemalloc profiling deadlock, as hit by clickhouse
ClickHouse/ClickHouse#66346

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
This fixes a jemalloc profiling deadlock, as hit by clickhouse
ClickHouse/ClickHouse#66346

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants