|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
Really is a nit to all the work that is put here but should we also update "Sanitizers" section under https://github.com/ceph/ceph?tab=readme-ov-file#build-types ? |
@Matan-B thanks for reviewing this change. i don't think we should update the "Sanitizers" column in this change. because, IMHO, the table in "Build Types" is to explain the build settings of different build modes provided by CMake, not by |
|
jenkins test make check |
|
This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days. |
|
This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution! |
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
jenkins test make check |
|
@bill-scales Hey Bill, thanks for the approval! I think there might be a mix-up though — were you perhaps reviewing #67544? Would love to get your eyes on this one too when you get a chance! |
when performing tests, we should enable sanitizers for detecting potential issues. so, in this change, we enable ASan, TSan and UBSan.
script/run-make.sh is used by our CI job for testing PRs, so enabling these sanitizers helps us to identify issues as early as possible. because ASan cannot be used along with TSan, we prefer using ASan for capturing memory-related issues over detecting multi-threading issues.
also, because of https://bugs.llvm.org/show_bug.cgi?id=23272, we cannot enable multiple sanitizers. but we should enable UBSan as well, once we can use a higher version of Clang than Clang-14. with Clang-14, when enabling UBSan, we'd have the following FTBFS
```
error: Cannot represent a difference across sections
```
when compiling `src/tools/neorados.cc`
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Fix ASan CHECK failure when exceptions are thrown during early
initialization, particularly in Python bindings that load Ceph
shared libraries.
ASan reported the following error:
AddressSanitizer: CHECK failed: asan_interceptors.cpp:335
"((__interception::real___cxa_throw)) != (0)" (0x0, 0x0)
#0 CheckUnwind asan_rtl.cpp:69
#1 CheckFailed sanitizer_termination.cpp:86
#2 __interceptor___cxa_throw asan_interceptors.cpp:335
#3 boost::throw_exception<boost::bad_lexical_cast>
#4 boost::conversion::detail::throw_bad_cast
#5 boost::lexical_cast<unsigned long, std::string>
#6 librbd::rbd_features_from_string /ceph/src/librbd/Features.cc:67
#7 get_rbd_options()::$_2::operator() rbd_options.cc:44
#8 Option::pre_validate /ceph/src/common/options.cc:94
#9 md_config_t::md_config_t /ceph/src/common/config.cc:208
#10 CephContext::CephContext /ceph/src/common/ceph_context.cc:730
#11 rados_create_cct /ceph/src/librados/librados_c.cc:120
#12 Python rados module initialization
Root cause: When Python loads the Ceph shared library (e.g., rados.so),
CephContext initialization validates configuration options. The RBD
default features option validator calls rbd_features_from_string(),
which uses boost::lexical_cast to parse the feature string. When the
string is not numeric (e.g., "layering,exclusive-lock,..."), lexical_cast
throws boost::bad_lexical_cast.
This exception is properly caught and handled in the code. However, ASan's
exception interceptor (__cxa_throw) may not be fully initialized when
exceptions are thrown during early library initialization, causing a CHECK
failure.
Why qa/asan.supp is not sufficient:
The existing suppression in qa/asan.supp for __interceptor___cxa_throw
only suppresses ASan *reports* about the interceptor. It does NOT prevent
CHECK failures in ASan's runtime itself. CHECK failures are assertions
that terminate the program immediately, before any suppression mechanism
can be applied. The CHECK fails because real___cxa_throw is NULL (not yet
initialized), which is a precondition violation in ASan's interceptor code.
Suppressions work by filtering ASan's output after an issue is detected,
but they cannot prevent internal CHECK failures in ASan's initialization
logic.
Solution: Disable ASan's C++ exception interception by adding
intercept_cxx_exceptions=0 to ASAN_OPTIONS. This prevents ASan from
intercepting exception throws/catches, avoiding the initialization order
issue. Exception handling still works correctly; we just lose ASan's
ability to detect exception-related memory issues.
This is a known limitation when using ASan with code that throws
exceptions during static/early initialization, particularly in shared
libraries loaded by interpreters like Python.
Note: This does not hide real bugs - the exception is properly caught
and handled. We're only disabling ASan's interception mechanism to avoid
the initialization order problem.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
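The option described in the commit message above can be appended to an existing `ASAN_OPTIONS` value without discarding what is already there. The following is only a minimal sketch: the helper name `append_asan_option` and the pre-existing `detect_leaks=1` value are hypothetical and are not taken from this PR, which may set the option elsewhere (for example in the CMake test properties).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): append one option to ASAN_OPTIONS,
# keeping any options that are already set. ASan accepts ':' as a separator.
append_asan_option() {
  local opt=$1
  if [ -n "${ASAN_OPTIONS:-}" ]; then
    ASAN_OPTIONS="${ASAN_OPTIONS}:${opt}"
  else
    ASAN_OPTIONS="${opt}"
  fi
  # Export so that child processes inherit the setting.
  export ASAN_OPTIONS
}

# Illustrative pre-existing value; an assumption made for this example only.
ASAN_OPTIONS="detect_leaks=1"
append_asan_option "intercept_cxx_exceptions=0"
echo "$ASAN_OPTIONS"
```

Appending rather than overwriting keeps any options a developer or CI job has already configured.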
The ConcurrentOperations test had a race condition where threads create_snap2 and create_snap3 were started before image1 finished its snap_create and aio_close operations.
Since image1 holds the exclusive lock, when create_snap2 and create_snap3 try to create snapshots, they must either:
1. Send remote requests to image1 (the lock owner), or
2. Wait to acquire the lock after image1 releases it
However, image1 is busy completing its own snap_create and then executing aio_close, so it cannot process remote requests properly. This causes the remote requests to time out or fail, resulting in snap_create returning non-zero error codes and triggering the ceph_assert(r == 0) failures.
The fix ensures image1 fully completes (including aio_close and lock release) before starting the create_snap2 and create_snap3 threads. This allows image2 or image3 to acquire the lock cleanly instead of trying to coordinate with a closing image.
Fixes: https://tracker.ceph.com/issues/70691
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
Environment variables set via CMake's set_property(TEST ... PROPERTY ENVIRONMENT) are available in the test shell process but are not automatically exported to child processes. This causes sanitizer options like ASAN_OPTIONS to not propagate to spawned ceph daemons and CLI tools.
The safe-to-destroy.sh test and other standalone tests spawn multiple child processes (ceph, ceph-osd, ceph-mon, Python bindings). When these load shared libraries during initialization, ASan's exception interceptor fails because ASAN_OPTIONS=intercept_cxx_exceptions=0 (set in f0e2646) is not inherited.
Fix by explicitly exporting ASAN_OPTIONS, LSAN_OPTIONS, UBSAN_OPTIONS, and TSAN_OPTIONS. Since detect-build-env-vars.sh is sourced by all standalone test scripts, this ensures sanitizer options propagate to all child processes uniformly.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
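The shell-level distinction this fix relies on (a variable that is set but not exported is invisible to child processes) can be sketched as follows. The function name `export_sanitizer_options` is hypothetical and is not the actual content of detect-build-env-vars.sh; only the four variable names and the option value come from the commit message. The `[[ -v ]]` test assumes bash 4.2 or newer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the PR's code): export whichever sanitizer option
# variables are set in the current shell so that child processes inherit them.
export_sanitizer_options() {
  local var
  for var in ASAN_OPTIONS LSAN_OPTIONS UBSAN_OPTIONS TSAN_OPTIONS; do
    # Export only variables that are actually set; unset ones stay unset.
    if [[ -v $var ]]; then
      export "$var"
    fi
  done
}

# A plain assignment creates a shell variable that children do not see.
unset ASAN_OPTIONS
ASAN_OPTIONS="intercept_cxx_exceptions=0"
before=$(bash -c 'echo "${ASAN_OPTIONS:-unset}"')

# After exporting, the same child command sees the value.
export_sanitizer_options
after=$(bash -c 'echo "${ASAN_OPTIONS:-unset}"')

echo "child before export: $before"
echo "child after export:  $after"
```

Because the export happens in one sourced file, every script that sources it gets the same behavior without repeating the loop.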
When running readable.sh with a WITH_ASAN=ON build of ceph-dencoder, ASAN processes need to find a contiguous 16+ TB shadow memory region (1/8 of the 128 TB x86-64 user VA space). High ASLR entropy can fragment the VA space, preventing ASAN from finding a suitable region.
Instead of requiring system-wide vm.mmap_rnd_bits=28 (which weakens ASLR security for the entire host), wrap ceph-dencoder with 'setarch $(uname -m) -R' when ASAN is detected. This disables ASLR only for the specific ceph-dencoder processes, with no system-wide security impact.
Also simplify parallelism logic: extract NPROC calculation into a shared variable and use it consistently across FreeBSD, Darwin, and Linux.
Reference: https://clang.llvm.org/docs/AddressSanitizer.html
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
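The wrapping described above might look roughly like the sketch below. It only builds the command line as an array and prints it, so it does not need `setarch` or `ceph-dencoder` to be installed. The function name `build_dencoder_cmd`, the `DENCODER_CMD` array, and the way ASan is signalled (a caller-supplied `true`/`false` flag) are all assumptions; readable.sh may detect ASan and invoke the tool differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not the PR's code): assemble the ceph-dencoder command
# line, prefixing it with `setarch <arch> -R` only for ASan builds.
build_dencoder_cmd() {
  local with_asan=$1   # "true" when the binary was built WITH_ASAN=ON (assumed flag)
  shift
  DENCODER_CMD=()
  if [ "$with_asan" = "true" ]; then
    # -R (--addr-no-randomize) disables ASLR for this one process tree only.
    DENCODER_CMD+=(setarch "$(uname -m)" -R)
  fi
  DENCODER_CMD+=(ceph-dencoder "$@")
}

build_dencoder_cmd true list_types
# Print the assembled command instead of running it.
printf '%s ' "${DENCODER_CMD[@]}"
echo
```

A caller would run the result with `"${DENCODER_CMD[@]}"`; keeping the command in an array avoids word-splitting problems with arguments that contain spaces.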
When co_waiter is destroyed, the cancellation slot may still hold a reference to the op_cancellation callback which captures 'this'. If the cancellation signal is emitted after co_waiter is destroyed (e.g., during co_throttle shutdown), it results in a stack-use-after-scope error.
Fix by adding a destructor that retrieves the cancellation slot from the handler (if still active) and clears it before destruction. This ensures the cancellation callback is removed before the co_waiter object goes out of scope, preventing use-after-scope errors.
The cancellation slot cannot be stored as a member variable because it becomes invalid after the handler is moved out in complete(). Instead, we retrieve it on demand from the handler in the destructor, which is the only place we need it.
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
rbd_features_from_string() uses boost::lexical_cast which throws
bad_lexical_cast when the input is not numeric. The exception is caught
and handled as "parse as feature name list instead".
This is normal control flow, but when ASAN's __cxa_throw interceptor is
misconfigured (e.g. with intercept_cxx_exceptions=0 leaving real___cxa_throw
NULL), any exception causes a CHECK failure. Even with a correctly configured
ASAN, throwing exceptions during config initialization adds overhead.
Replace the try/catch pattern with boost::conversion::try_lexical_convert,
which returns false on parse failure instead of throwing. This eliminates
the exception entirely, making the code more efficient and avoiding any
interaction with ASAN's exception interceptor.
The ASan report:
```
AddressSanitizer: CHECK failed: asan_interceptors.cpp:320 "((__interception::real___cxa_throw)) != (0)" (0x0, 0x0) (tid=30529)
#0 0x7d668fac9a09 in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:67
#1 0x7d668faec105 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
#2 0x7d668fa4b194 in __interceptor___cxa_throw ../../../../src/libsanitizer/asan/asan_interceptors.cpp:320
#3 0x7d668ae4ec0f in void boost::throw_exception<boost::bad_lexical_cast>(boost::bad_lexical_cast const&) /opt/ceph/include/boost/throw_exception.hpp:165
#4 0x7d668c1e1e0b in void boost::conversion::detail::throw_bad_cast<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long>() /opt/ceph/include/boost/lexical_cast/bad_lexical_cast.hpp:93
#5 0x7d668c1e0e05 in unsigned long boost::lexical_cast<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /opt/ceph/include/boost/lexical_cast.hpp:43
#6 0x7d668c1df609 in librbd::rbd_features_from_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*) /ceph/src/librbd/Features.cc:67
#7 0x7d668b255a35 in get_rbd_options()::$_2::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const src/common/options/rbd_options.cc:44
#8 0x7d668b255806 in int std::__invoke_impl<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>(std::__invoke_other, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:61
#9 0x7d668b255754 in std::enable_if<is_invocable_r_v<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>, int>::type std::__invoke_r<int, get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>(get_rbd_options()::$_2&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/invoke.h:114
#10 0x7d668b25563c in std::_Function_handler<int (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*), get_rbd_options()::$_2>::_M_invoke(std::_Any_data const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*&&) /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_function.h:290
#11 0x7d668af924f1 in std::function<int (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)>::operator()(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const /usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/std_function.h:591
#12 0x7d668af8c495 in Option::pre_validate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const /ceph/src/common/options.cc:94
#13 0x7d668aef9ec3 in md_config_t::md_config_t(ConfigValues&, ConfigTracker const&, bool) /ceph/src/common/config.cc:208
#14 0x7d668ae0040d in ceph::common::ConfigProxy::ConfigProxy(bool) /ceph/src/common/config_proxy.h:70
#15 0x7d668adf3f5d in ceph::common::CephContext::CephContext(unsigned int, ceph::common::CephContext::create_options const&) /ceph/src/common/ceph_context.cc:730
#16 0x7d668adf3785 in ceph::common::CephContext::CephContext(unsigned int, code_environment_t, int) /ceph/src/common/ceph_context.cc:724
#17 0x7d668aef5ee6 in common_preinit(CephInitParameters const&, code_environment_t, int) /ceph/src/common/common_init.cc:40
#18 0x7d668d4a47af in rados_create_cct(char const*, CephInitParameters*) /ceph/src/librados/librados_c.cc:120
#19 0x7d668d4a49db in _rados_create2 /ceph/src/librados/librados_c.cc:168
#20 0x7d668d94dc4b in __pyx_pf_5rados_5Rados_2__setup /ceph/build/src/pybind/rados/rados_processed.c:13219
#21 0x7d668d94dc4b in __pyx_pw_5rados_5Rados_3__setup /ceph/build/src/pybind/rados/rados_processed.c:12703
#22 0x7d668d94a347 in __Pyx_CyFunction_CallAsMethod /ceph/build/src/pybind/rados/rados_processed.c:93157
#23 0x58ac5086d0ba in _PyObject_MakeTpCall (/usr/bin/python3.10+0x1810ba)
#24 0x58ac508843da (/usr/bin/python3.10+0x1983da)
#25 0x58ac50885076 in PyVectorcall_Call (/usr/bin/python3.10+0x199076)
#26 0x7d668d94cdd7 in __Pyx_PyObject_Call /ceph/build/src/pybind/rados/rados_processed.c:90994
#27 0x7d668d94cdd7 in __pyx_pf_5rados_5Rados___init__ /ceph/build/src/pybind/rados/rados_processed.c:12474
#28 0x7d668d94cdd7 in __pyx_pw_5rados_5Rados_1__init__ /ceph/build/src/pybind/rados/rados_processed.c:12443
#29 0x58ac5086d43a (/usr/bin/python3.10+0x18143a)
#30 0x58ac50884d3a in PyObject_Call (/usr/bin/python3.10+0x198d3a)
#31 0x58ac508637de in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1777de)
#32 0x58ac5087702b in _PyFunction_Vectorcall (/usr/bin/python3.10+0x18b02b)
#33 0x58ac508615fe in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1755fe)
#34 0x58ac5087702b in _PyFunction_Vectorcall (/usr/bin/python3.10+0x18b02b)
#35 0x58ac508615fe in _PyEval_EvalFrameDefault (/usr/bin/python3.10+0x1755fe)
```
Signed-off-by: Kefu Chai <k.chai@proxmox.com>
when performing tests, we should enable sanitizers for detecting potential issues. so, in this change, we enable ASan, TSan and UBSan.
script/run-make.sh is used by our CI job for testing PRs, so enabling these sanitizers helps us to identify issues as early as possible. because ASan cannot be used along with TSan, we prefer using ASan for capturing memory-related issues over detecting multi-threading issues.
also, because of https://bugs.llvm.org/show_bug.cgi?id=23272, we cannot enable multiple sanitizers. but we should enable UBSan as well, once we can use a higher version of Clang than Clang-14. with Clang-14, when enabling UBSan, we'd have the following FTBFS when compiling `src/tools/neorados.cc`: `error: Cannot represent a difference across sections`