script/run-make: enable ASan #56537

Draft
tchaikov wants to merge 7 commits into ceph:main from tchaikov:wip-cmake-enable-sanitizers

@tchaikov tchaikov commented Mar 27, 2024

when performing tests, we should enable sanitizers for detecting potential issues. so, in this change, we enable ASan, TSan and UBSan.

script/run-make.sh is used by our CI job for testing PRs, so enabling these sanitizers helps us to identify issues as early as possible. because ASan cannot be used along with TSan, we prefer using ASan to capture memory-related issues over using TSan to detect multi-threading issues.

also, because of https://bugs.llvm.org/show_bug.cgi?id=23272, we cannot enable multiple sanitizers. but we should enable UBSan as well, once we can use a Clang version newer than Clang-14. with Clang-14, enabling UBSan results in the following FTBFS:

error: Cannot represent a difference across sections

when compiling src/tools/neorados.cc

@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from fdbbfbf to f6879f8 Compare April 12, 2024 20:53
@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from f6879f8 to 2a81b40 Compare April 25, 2024 00:44
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 7d63d7e to 0e77bc6 Compare May 3, 2024 13:57
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 0e77bc6 to a517fcd Compare May 12, 2024 03:03
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from a517fcd to fb8844a Compare May 17, 2024 00:52
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from fb8844a to 9352cb2 Compare May 30, 2024 04:21
@github-actions github-actions bot added the rgw label May 30, 2024
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 79c2159 to 8233af6 Compare June 5, 2024 15:03
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 8233af6 to c2ce9b7 Compare August 2, 2024 22:05
@github-actions github-actions bot added the script label Aug 2, 2024
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from c2ce9b7 to 995c40b Compare August 5, 2024 23:42
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 995c40b to 88a106b Compare October 3, 2024 08:42
@github-actions github-actions bot added the stale label Dec 2, 2024
@cbodley cbodley removed the stale label Dec 2, 2024
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 56e3c39 to 631d630 Compare June 27, 2025 03:38
@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch 2 times, most recently from 2074de3 to 6ac74ab Compare June 30, 2025 08:47
@Matan-B Matan-B self-requested a review July 6, 2025 13:18
github-actions bot commented Jul 9, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@tchaikov tchaikov force-pushed the wip-cmake-enable-sanitizers branch from 6ac74ab to 81f6fd0 Compare July 13, 2025 06:41
github-actions bot commented Aug 4, 2025

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

Matan-B commented Aug 21, 2025

Really is a nit to all the work that is put here but should we also update "Sanitizers" section under https://github.com/ceph/ceph?tab=readme-ov-file#build-types ?

@tchaikov

> Really is a nit to all the work that is put here but should we also update "Sanitizers" section under https://github.com/ceph/ceph?tab=readme-ov-file#build-types ?

@Matan-B thanks for reviewing this change. i don't think we should update the "Sanitizers" column in this change, because, IMHO, the table in "Build Types" explains the build settings of the different build modes provided by CMake, not those of run-make.sh. yes, do_cmake.sh is referenced by that section as well, but do_cmake.sh is a relatively low-level helper script called by run-make.sh, which in turn is the script changed by this pull request. so neither do_cmake.sh nor the CMake build types are related to this pull request. run-make.sh is mainly used by our CI and probably by some developers.

@tchaikov

jenkins test make check

tchaikov commented Sep 5, 2025

jenkins test make check

github-actions bot commented Nov 5, 2025

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions bot commented Dec 5, 2025

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

tchaikov commented Feb 9, 2026

some of the reported failures can be fixed by the following PRs

@tchaikov

jenkins test make check

@bill-scales bill-scales left a comment

Thanks for fixing

@tchaikov

@bill-scales Hey Bill, thanks for the approval! I think there might be a mix-up though — were you perhaps reviewing #67544? Would love to get your eyes on this one too when you get a chance!

tchaikov and others added 7 commits March 18, 2026 15:45

when performing tests, we should enable sanitizers for detecting
potential issues. so, in this change, we enable ASan, TSan and
UBSan.

script/run-make.sh is used by our CI job for testing PRs, so
enabling these sanitizers helps us to identify issues as early as
possible. because ASan cannot be used along with TSan, we prefer
using ASan to capture memory-related issues over using TSan to
detect multi-threading issues.

also, because of https://bugs.llvm.org/show_bug.cgi?id=23272, we
cannot enable multiple sanitizers. but we should enable UBSan as well,
once we can use a Clang version newer than Clang-14. with
Clang-14, enabling UBSan results in the following FTBFS:
```
error: Cannot represent a difference across sections
```
when compiling `src/tools/neorados.cc`

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Fix ASan CHECK failure when exceptions are thrown during early
initialization, particularly in Python bindings that load Ceph
shared libraries.

ASan reported the following error:

  AddressSanitizer: CHECK failed: asan_interceptors.cpp:335
  "((__interception::real___cxa_throw)) != (0)" (0x0, 0x0)
    #0 CheckUnwind asan_rtl.cpp:69
    #1 CheckFailed sanitizer_termination.cpp:86
    #2 __interceptor___cxa_throw asan_interceptors.cpp:335
    #3 boost::throw_exception<boost::bad_lexical_cast>
    #4 boost::conversion::detail::throw_bad_cast
    #5 boost::lexical_cast<unsigned long, std::string>
    #6 librbd::rbd_features_from_string /ceph/src/librbd/Features.cc:67
    #7 get_rbd_options()::$_2::operator() rbd_options.cc:44
    #8 Option::pre_validate /ceph/src/common/options.cc:94
    #9 md_config_t::md_config_t /ceph/src/common/config.cc:208
    #10 CephContext::CephContext /ceph/src/common/ceph_context.cc:730
    #11 rados_create_cct /ceph/src/librados/librados_c.cc:120
    #12 Python rados module initialization

Root cause: When Python loads the Ceph shared library (e.g., rados.so),
CephContext initialization validates configuration options. The RBD
default features option validator calls rbd_features_from_string(),
which uses boost::lexical_cast to parse the feature string. When the
string is not numeric (e.g., "layering,exclusive-lock,..."), lexical_cast
throws boost::bad_lexical_cast.

This exception is properly caught and handled in the code. However, ASan's
exception interceptor (__cxa_throw) may not be fully initialized when
exceptions are thrown during early library initialization, causing a CHECK
failure.

Why qa/asan.supp is not sufficient:
The existing suppression in qa/asan.supp for __interceptor___cxa_throw
only suppresses ASan *reports* about the interceptor. It does NOT prevent
CHECK failures in ASan's runtime itself. CHECK failures are assertions
that terminate the program immediately, before any suppression mechanism
can be applied. The CHECK fails because real___cxa_throw is NULL (not yet
initialized), which is a precondition violation in ASan's interceptor code.

Suppressions work by filtering ASan's output after an issue is detected,
but they cannot prevent internal CHECK failures in ASan's initialization
logic.

Solution: Disable ASan's C++ exception interception by adding
intercept_cxx_exceptions=0 to ASAN_OPTIONS. This prevents ASan from
intercepting exception throws/catches, avoiding the initialization order
issue. Exception handling still works correctly; we just lose ASan's
ability to detect exception-related memory issues.
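
As a rough illustration of what adding the setting amounts to, the
sketch below appends one key=value pair to an existing ASAN_OPTIONS
string, using the colon separator that sanitizer option strings accept.
The helper name append_sanitizer_option is hypothetical and is not taken
from this change; only the setting itself comes from the text above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one "key=value" setting to an existing
// ASAN_OPTIONS-style string. Settings are separated by ':'; an empty
// input yields just the new setting.
std::string append_sanitizer_option(const std::string& existing,
                                    const std::string& option) {
  assert(!option.empty());
  if (existing.empty()) {
    return option;
  }
  return existing + ":" + option;
}
```

For example, append_sanitizer_option("detect_leaks=1",
"intercept_cxx_exceptions=0") yields
"detect_leaks=1:intercept_cxx_exceptions=0", so options that were
already set are preserved.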

This is a known limitation when using ASan with code that throws
exceptions during static/early initialization, particularly in shared
libraries loaded by interpreters like Python.

Note: This does not hide real bugs - the exception is properly caught
and handled. We're only disabling ASan's interception mechanism to avoid
the initialization order problem.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>

The ConcurrentOperations test had a race condition where threads
create_snap2 and create_snap3 were started before image1 finished
its snap_create and aio_close operations.

Since image1 holds the exclusive lock, when create_snap2 and
create_snap3 try to create snapshots, they must either:
1. Send remote requests to image1 (the lock owner), or
2. Wait to acquire the lock after image1 releases it

However, image1 is busy completing its own snap_create and then
executing aio_close, so it cannot process remote requests properly.
This causes the remote requests to timeout or fail, resulting in
snap_create returning non-zero error codes and triggering the
ceph_assert(r == 0) failures.

The fix ensures image1 fully completes (including aio_close and lock
release) before starting create_snap2 and create_snap3 threads. This
allows image2 or image3 to acquire the lock cleanly instead of trying
to coordinate with a closing image.
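
The ordering can be sketched with plain std::thread, assuming nothing
about librbd itself. The function name and the step labels below are
placeholders for illustration only; the point is simply that the owner
thread is joined before the other two threads are started.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model of the fixed test ordering: the lock owner runs to
// completion (including its close) before the other threads start.
std::vector<std::string> run_owner_then_others() {
  std::mutex mutex;
  std::vector<std::string> log;
  auto record = [&](const char* step) {
    std::lock_guard<std::mutex> guard(mutex);
    log.emplace_back(step);
  };

  std::thread owner([&] {
    record("image1: snap_create");
    record("image1: close");
  });
  owner.join();  // owner is fully done before anyone else starts

  std::thread snap2([&] { record("image2: snap_create"); });
  std::thread snap3([&] { record("image3: snap_create"); });
  snap2.join();
  snap3.join();

  assert(log.size() == 4);
  return log;
}
```

The first two log entries always belong to image1; the relative order of
the image2 and image3 entries is left unspecified, as in the test.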

Fixes: https://tracker.ceph.com/issues/70691
Signed-off-by: Kefu Chai <k.chai@proxmox.com>

Environment variables set via CMake's set_property(TEST ... PROPERTY ENVIRONMENT)
are available in the test shell process but are not automatically exported to child
processes. This causes sanitizer options like ASAN_OPTIONS to not propagate to
spawned ceph daemons and CLI tools.

The safe-to-destroy.sh test and other standalone tests spawn multiple child
processes (ceph, ceph-osd, ceph-mon, Python bindings). When these load shared
libraries during initialization, ASan's exception interceptor fails because
ASAN_OPTIONS=intercept_cxx_exceptions=0 (set in f0e2646) is not inherited.

Fix by explicitly exporting ASAN_OPTIONS, LSAN_OPTIONS, UBSAN_OPTIONS, and
TSAN_OPTIONS. Since detect-build-env-vars.sh is sourced by all standalone test
scripts, this ensures sanitizer options propagate to all child processes uniformly.

Signed-off-by: Kefu Chai <k.chai@proxmox.com>

When running readable.sh with a WITH_ASAN=ON build of ceph-dencoder,
ASAN processes need to find a contiguous 16+ TB shadow memory region
(1/8 of the 128 TB x86-64 user VA space). High ASLR entropy can
fragment the VA space, preventing ASAN from finding a suitable region.
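
The 1/8 figure can be checked with a few lines of arithmetic. The
sketch below assumes a 47-bit user address space (128 TiB) and ASan's
mapping of 8 application bytes to 1 shadow byte; shadow_bytes and kTiB
are illustrative names, not taken from this change.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kTiB = std::uint64_t{1} << 40;

// Size in bytes of a shadow region covering a user address space of
// `va_bits` bits, assuming 1 shadow byte per 8 application bytes.
std::uint64_t shadow_bytes(unsigned va_bits) {
  assert(va_bits >= 3 && va_bits < 64);
  return (std::uint64_t{1} << va_bits) / 8;
}
```

With these assumptions, shadow_bytes(47) is 2^44 bytes, i.e. 16 TiB,
which matches the 1/8 of 128 TB quoted above.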

Instead of requiring system-wide vm.mmap_rnd_bits=28 (which weakens
ASLR security for the entire host), wrap ceph-dencoder with 'setarch
$(uname -m) -R' when ASAN is detected. This disables ASLR only for the
specific ceph-dencoder processes, with no system-wide security impact.

Also simplify parallelism logic: extract NPROC calculation into a shared
variable and use it consistently across FreeBSD, Darwin, and Linux.

Reference: https://clang.llvm.org/docs/AddressSanitizer.html

Signed-off-by: Kefu Chai <k.chai@proxmox.com>

When co_waiter is destroyed, the cancellation slot may still hold a
reference to the op_cancellation callback which captures 'this'. If
the cancellation signal is emitted after co_waiter is destroyed (e.g.,
during co_throttle shutdown), it results in a stack-use-after-scope
error.

Fix by adding a destructor that retrieves the cancellation slot from
the handler (if still active) and clears it before destruction. This
ensures the cancellation callback is removed before the co_waiter
object goes out of scope, preventing use-after-scope errors.

The cancellation slot cannot be stored as a member variable because
it becomes invalid after the handler is moved out in complete(). Instead,
we retrieve it on demand from the handler in the destructor, which is
the only place we need it.
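
A heavily simplified model of the idea, using only the standard
library: Slot and Waiter below are hypothetical stand-ins, not the
Boost.Asio types or the real co_waiter. Unlike the change described
above, this model keeps a reference to the slot instead of retrieving
it from the handler; it only shows why the destructor must clear a
callback that captures 'this'.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a cancellation slot: holds at most one handler.
struct Slot {
  std::function<void()> handler;
  void assign(std::function<void()> h) { handler = std::move(h); }
  void clear() { handler = nullptr; }
  bool is_connected() const { return static_cast<bool>(handler); }
  void emit() { if (handler) handler(); }
};

// Hypothetical waiter whose cancellation handler captures `this`.
class Waiter {
 public:
  explicit Waiter(Slot& slot) : slot_(slot) {
    assert(!slot_.is_connected());
    slot_.assign([this] { cancelled_ = true; });
  }
  // Drop the handler before `this` dies so a late emit() cannot touch it.
  ~Waiter() {
    if (slot_.is_connected()) {
      slot_.clear();
    }
  }
  Waiter(const Waiter&) = delete;
  Waiter& operator=(const Waiter&) = delete;
  bool cancelled() const { return cancelled_; }

 private:
  Slot& slot_;
  bool cancelled_ = false;
};
```

Without the destructor, calling emit() on the slot after the Waiter has
gone out of scope would invoke a lambda holding a dangling 'this'; with
it, the late emit() is a no-op.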

Signed-off-by: Kefu Chai <k.chai@proxmox.com>

rbd_features_from_string() uses boost::lexical_cast which throws
bad_lexical_cast when the input is not numeric. The exception is caught
and handled as "parse as feature name list instead".

This is normal control flow, but when ASAN's __cxa_throw interceptor is
misconfigured (e.g. with intercept_cxx_exceptions=0 leaving real___cxa_throw
NULL), any exception causes a CHECK failure. Even with a correctly configured
ASAN, throwing exceptions during config initialization adds overhead.

Replace the try/catch pattern with boost::conversion::try_lexical_convert,
which returns false on parse failure instead of throwing. This eliminates
the exception entirely, making the code more efficient and avoiding any
interaction with ASAN's exception interceptor.
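
The shape of the non-throwing pattern can be sketched without Boost.
The example below swaps in std::from_chars (C++17) as a stand-in for
boost::conversion::try_lexical_convert, so it is not the code from this
change; try_parse_u64 and the demo function are hypothetical names. It
returns false on a parse failure instead of throwing, which lets the
caller fall back to treating the input as a feature name list.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

// Try to parse `s` as an unsigned decimal number without throwing.
// Returns true and stores the value only if the whole string is numeric.
bool try_parse_u64(const std::string& s, std::uint64_t& out) {
  const char* first = s.data();
  const char* last = first + s.size();
  std::uint64_t value = 0;
  auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc{} || ptr != last) {
    return false;
  }
  out = value;
  return true;
}

// Small usage check: a numeric string parses, a feature-name list does not.
inline void demo_try_parse_u64() {
  std::uint64_t features = 0;
  assert(try_parse_u64("61", features) && features == 61);
  assert(!try_parse_u64("layering,exclusive-lock", features));
  assert(features == 61);  // left untouched on failure
}
```

No exception is raised on either path, so nothing reaches an exception
interceptor during option validation.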

The ASan report:

```
AddressSanitizer: CHECK failed: asan_interceptors.cpp:320 "((__interception::real___cxa_throw)) != (0)" (0x0, 0x0) (tid=30529)
    #0 0x7d668fac9a09 in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7d668faec105 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7d668fa4b194 in __interceptor___cxa_throw ../../../../src/libsanitizer/asan/asan_interceptors.cpp:320
    #3 0x7d668ae4ec0f in void boost::throw_exception<boost::bad_lexical_cast>(boost::bad_lexical_cast const&) /opt/ceph/include/boost/throw_exception.hpp:165
    #4 0x7d668c1e1e0b in void boost::conversion::detail::throw_bad_cast<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long>() /opt/ceph/include/boost/lexical_cast/bad_lexical_cast.hpp:93
    #5 0x7d668c1e0e05 in unsigned long boost::lexical_cast<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /opt/ceph/include/boost/lexical_cast.hpp:43
    #6 0x7d668c1df609 in librbd::rbd_features_from_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*) /ceph/src/librbd/Features.cc:67
    #7 0x7d668b255a35 in get_rbd_options()::$_2::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const src/common/options/rbd_options.cc:44
    #8 0x7d668b255806 in int std::__invoke_impl<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>(std::__invoke_other, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:61
    #9 0x7d668b255754 in std::enable_if<is_invocable_r_v<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, int>::type std::__invoke_r<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>(get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:114
    #10 0x7d668b25563c in std::_Function_handler<int (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*), get_rbd_options()::$_2>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_function.h:290
    #11 0x7d668af924f1 in std::function<int (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_function.h:591
    #12 0x7d668af8c495 in Option::pre_validate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const /ceph/src/common/options.cc:94
    #13 0x7d668aef9ec3 in md_config_t::md_config_t(ConfigValues&, ConfigTracker const&, bool) /ceph/src/common/config.cc:208
    #14 0x7d668ae0040d in ceph::common::ConfigProxy::ConfigProxy(bool) /ceph/src/common/config_proxy.h:70
    #15 0x7d668adf3f5d in ceph::common::CephContext::CephContext(unsigned int, ceph::common::CephContext::create_options const&) /ceph/src/common/ceph_context.cc:730
    #16 0x7d668adf3785 in ceph::common::CephContext::CephContext(unsigned int, code_environment_t, int) /ceph/src/common/ceph_context.cc:724
    #17 0x7d668aef5ee6 in common_preinit(CephInitParameters const&, code_environment_t, int) /ceph/src/common/common_init.cc:40
    #18 0x7d668d4a47af in rados_create_cct(char const*, CephInitParameters*) /ceph/src/librados/librados_c.cc:120
    #19 0x7d668d4a49db in _rados_create2 /ceph/src/librados/librados_c.cc:168
    #20 0x7d668d94dc4b in __pyx_pf_5rados_5Rados_2__setup /ceph/build/src/pybind/rados/rados_processed.c:13219
    #21 0x7d668d94dc4b in __pyx_pw_5rados_5Rados_3__setup /ceph/build/src/pybind/rados/rados_processed.c:12703
    #22 0x7d668d94a347 in __Pyx_CyFunction_CallAsMethod /ceph/build/src/pybind/rados/rados_processed.c:93157
    #23 0x58ac5086d0ba in _PyObject_MakeTpCall (/usr/bin/python3.10+0x1810ba)
    #24 0x58ac508843da  (/usr/bin/python3.10+0x1983da)
    #25 0x58ac50885076 in PyVectorcall_Call (/usr/bin/python3.10+0x199076)
    #26 0x7d668d94cdd7 in __Pyx_PyObject_Call /ceph/build/src/pybind/rados/rados_processed.c:90994
    #27 0x7d668d94cdd7 in __pyx_pf_5rados_5Rados___init__ /ceph/build/src/pybind/rados/rados_processed.c:12474
    #28 0x7d668d94cdd7 in __pyx_pw_5rados_5Rados_1__init__ /ceph/build/src/pybind/rados/rados_processed.c:12443
    #29 0x58ac5086d43a  (/usr/bin/python3.10+0x18143a)
    #30 0x58ac50884d3a in PyObject_Call (/usr/bin/python3.10+0x198d3a)
    #31 0x58ac508637de in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1777de)
    #32 0x58ac5087702b in _PyFunction_Vectorcall (/usr/bin/python3.10+0x18b02b)
    #33 0x58ac508615fe in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1755fe)
    #34 0x58ac5087702b in _PyFunction_Vectorcall (/usr/bin/python3.10+0x18b02b)
    #35 0x58ac508615fe in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1755fe)
```

Signed-off-by: Kefu Chai <k.chai@proxmox.com>