Bug #71390
Status: Closed
valgrind error: Leak_PossiblyLost operator new[](unsigned long) Objecter::start_tick() Objecter::start(OSDMap const*)
Description
getting valgrind reports of several memory leaks that indicate rgw isn't shutting down cleanly. this only seems to happen in rgw/notifications jobs.
we get the SIGTERM and start shutting down, but several background threads continue running until the log ends 2 minutes later (because rgw_exit_timeout_secs = 2m):
2025-05-19T16:40:11.108+0000 f85d640 -1 received signal: Terminated from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --gen-suppressions=all --xml=yes --xml-file=/var/log/ceph/valgrind/ceph.client.0.log --time-stamp=yes --vgdb=yes --tool=memcheck --max-threads=1024 radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground (PID: 48435) UID: 0
2025-05-19T16:40:11.109+0000 f85d640 1 handle_sigterm
2025-05-19T16:40:11.110+0000 f85d640 1 handle_sigterm set alarm for 120
2025-05-19T16:40:11.111+0000 a17f500 -1 shutting down ...
2025-05-19T16:42:09.208+0000 13c662640 20 rgw reshard worker thread: processing logshard = reshard.0000000015
2025-05-19T16:42:09.212+0000 13c662640 20 rgw reshard worker thread: finish processing logshard = reshard.0000000015 , ret = 0
example runs:
https://pulpito.ceph.com/cbodley-2025-05-19_22:28:25-rgw-wip-cbodley-testing-distro-default-smithi/
https://pulpito.ceph.com/yuvalif-2025-05-19_16:10:52-rgw-wip_s3_full_object_dedup_merged-distro-default-smithi/
rgw log https://qa-proxy.ceph.com/teuthology/yuvalif-2025-05-19_16:10:52-rgw-wip_s3_full_object_dedup_merged-distro-default-smithi/8289254/remote/smithi096/log/rgw.ceph.client.0.log.gz
valgrind log https://qa-proxy.ceph.com/teuthology/yuvalif-2025-05-19_16:10:52-rgw-wip_s3_full_object_dedup_merged-distro-default-smithi/8289254/remote/smithi096/log/valgrind/ceph.client.0.log.gz
Updated by Yuval Lifshitz 10 months ago
- the same fix that was done in https://github.com/ceph/ceph/pull/58765 should be applied here: https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_notify.cc#L151
- set an http client request timeout
Updated by Yuval Lifshitz 10 months ago
http timeout will be handled here: https://tracker.ceph.com/issues/71402
Updated by Yuval Lifshitz 10 months ago
- Assignee set to Yuval Lifshitz
- Backport set to tentacle, reef
Updated by Yuval Lifshitz 10 months ago
- Backport changed from tentacle, reef to tentacle, squid
Updated by Casey Bodley 10 months ago
- Priority changed from Normal to Urgent
raised to urgent prio since it's failing consistently in teuthology
Updated by Yuval Lifshitz 9 months ago
replaced the timer with an async waiter here: https://github.com/ceph/ceph/pull/63986
we should probably make this change; however, it does not resolve the above valgrind issue. see:
https://github.com/ceph/ceph/pull/63986#issuecomment-2981213684
Updated by Yuval Lifshitz 9 months ago
when running locally with bucket notifications tests, i get the following valgrind error:
==00:00:54:05.376 353814== 1,126 (1,056 direct, 70 indirect) bytes in 1 blocks are definitely lost in loss record 865 of 934
==00:00:54:05.376 353814== at 0x4843743: operator new[](unsigned long) (vg_replace_malloc.c:729)
==00:00:54:05.376 353814== by 0x55F3CA7: Objecter::linger_register(object_t const&, object_locator_t const&, int) (Objecter.cc:823)
==00:00:54:05.376 353814== by 0x20D4A1A: neorados::RADOS::watch_(neorados::Object, neorados::IOContext, std::optional<std::chrono::duration<long, std::ratio<1l, 1l> > >, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >, boost::asio::any_completion_handler<void (boost::system::error_code, unsigned long)>) (RADOS.cc:1503)
==00:00:54:05.376 353814== by 0x127B0C4: auto neorados::RADOS::watch<boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&>(neorados::Object, neorados::IOContext, std::optional<std::chrono::duration<long, std::ratio<1l, 1l> > >, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&)::{lambda(auto:1&&, neorados::Object, neorados::IOContext, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >)#1}::operator()<boost::asio::detail::consign_handler<boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::system::error_code, unsigned long>, boost::asio::executor_work_guard<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>, void, void> > >(boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&, neorados::Object, neorados::IOContext, fu2::abi_310::detail::function<fu2::abi_310::detail::config<true, false, 16ul>, fu2::abi_310::detail::property<true, false, void (boost::system::error_code, unsigned long, unsigned long, unsigned long, ceph::buffer::v15_2_0::list&&)> >) const (RADOS.hpp:1627)
==00:00:54:05.376 353814== by 0x127B32F: _ZZZN5boost4asio6detail20awaitable_frame_baseINS0_15any_io_executorEE15await_transformIZNS0_12async_resultINS0_15use_awaitable_tIS3_EEJFvNS_6system10error_codeEmEEE8initiateINS6_INS0_9consign_tIS8_JNS0_19executor_work_guardINS0_10io_context19basic_executor_typeISaIvELm0EEEvvEEEEEJSB_EE12init_wrapperIZN8neorados5RADOS5watchIRKS8_EEDaNSO_6ObjectENSO_9IOContextESt8optionalINSt6chrono8durationIlSt5ratioILl1ELl1EEEEEN3fu27abi_3106detail8functionINS14_6configILb1ELb0ELm16EEENS14_8propertyILb1ELb0EJFvSA_mmmON4ceph6buffer7v15_2_04listEEEEEEEOT_EUlS1I_ST_SU_S1G_E_EEJSt5tupleIJSK_EEST_SU_S1G_EEENS0_9awaitableImS3_EES1H_S8_DpT0_EUlPS1H_E_EEDaS1H_PNSt9enable_ifIXsrSt14is_convertibleINS0_9result_ofIFS1H_PS4_EE4typeEPNS1_16awaitable_threadIS3_EEE5valueEvE4typeEEN6result13await_suspendENSt7__n486116coroutine_handleIvEEENUlPvE_4_FUNES2B_ (consign.hpp:82)
==00:00:54:05.376 353814== by 0x1196EE5: resume (awaitable.hpp:501)
==00:00:54:05.376 353814== by 0x1196EE5: boost::asio::detail::awaitable_thread<boost::asio::any_io_executor>::pump() (awaitable.hpp:769)
==00:00:54:05.376 353814== by 0x1260D87: operator()<boost::container::flat_map<long unsigned int, logback_generation> > (use_awaitable.hpp:103)
==00:00:54:05.376 353814== by 0x1260D87: complete (composed.hpp:155)
==00:00:54:05.376 353814== by 0x1260D87: operator() (async_call.h:59)
==00:00:54:05.376 353814== by 0x1260D87: operator() (bind_handler.hpp:56)
==00:00:54:05.376 353814== by 0x1260D87: void boost::asio::detail::executor_function::complete<boost::asio::detail::binder0<ceph::async::async_dispatch<logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&, , boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >(boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}&&, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&)::{lambda(auto:1&)#1}::operator()<boost::asio::detail::composed_op<{lambda(auto:1&)#1}, boost::asio::detail::composed_work<void (boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >)>, boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::container::flat_map<unsigned long, logback_generation, std::less<unsigned long>, void> >, void (boost::container::flat_map<unsigned long, logback_generation, std::less<unsigned long>, void>)> >(logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}&)::{lambda()#1}::operator()()::{lambda()#1}>, std::allocator<void> >(boost::asio::detail::executor_function::impl_base*, bool) (executor_function.hpp:113)
==00:00:54:05.376 353814== by 0xB77CAA: operator() (executor_function.hpp:61)
==00:00:54:05.376 353814== by 0xB77CAA: void boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul>::execute<boost::asio::detail::executor_function>(boost::asio::detail::executor_function&&) const (io_context.hpp:192)
==00:00:54:05.376 353814== by 0x125BCDD: ceph::async::async_dispatch<logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&, , boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >(boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}&&, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&)::{lambda(auto:1&)#1}::operator()<boost::asio::detail::composed_op<{lambda(auto:1&)#1}, boost::asio::detail::composed_work<void (boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >)>, boost::asio::detail::awaitable_handler<boost::asio::any_io_executor, boost::container::flat_map<unsigned long, logback_generation, std::less<unsigned long>, void> >, void (boost::container::flat_map<unsigned long, logback_generation, std::less<unsigned long>, void>)> >(logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#4}&)::{lambda()#1}::operator()() (any_executor.hpp:681)
==00:00:54:05.376 353814== by 0x126507D: _ZZZN5boost4asio6detail20awaitable_frame_baseINS0_15any_io_executorEE15await_transformIZNS0_12async_resultINS0_15use_awaitable_tIS3_EEJFvNS_9container8flat_mapIm18logback_generationSt4lessImEvEEEEE8initiateINS1_17initiate_composedIZN4ceph5async14async_dispatchIZN19logback_generations5setupEPK18DoutPrefixProvider8log_typeEUlvE2_RKS8_JENS0_6strandINS0_10io_context19basic_executor_typeISaIvELm0EEEEEEEDaT2_OT_OT0_DpOT1_EUlRS11_E_FvSZ_EJSF_EEEJEEENS0_9awaitableISE_S3_EES11_S8_DpT0_EUlPS11_E_EEDaS11_PNSt9enable_ifIXsrSt14is_convertibleINS0_9result_ofIFS11_PS4_EE4typeEPNS1_16awaitable_threadIS3_EEE5valueEvE4typeEEN6result13await_suspendENSt7__n486116coroutine_handleIvEEENUlPvE_4_FUNES20_ (bind_handler.hpp:56)
==00:00:54:05.376 353814== by 0x1196EE5: resume (awaitable.hpp:501)
==00:00:54:05.376 353814== by 0x1196EE5: boost::asio::detail::awaitable_thread<boost::asio::any_io_executor>::pump() (awaitable.hpp:769)
==00:00:54:05.376 353814== by 0x1261F26: operator() (use_awaitable.hpp:74)
==00:00:54:05.376 353814== by 0x1261F26: complete (composed.hpp:155)
==00:00:54:05.376 353814== by 0x1261F26: operator() (async_call.h:95)
==00:00:54:05.376 353814== by 0x1261F26: operator() (bind_handler.hpp:56)
==00:00:54:05.376 353814== by 0x1261F26: void boost::asio::detail::executor_function::complete<boost::asio::detail::binder0<auto ceph::async::async_dispatch<logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#1}, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&, , boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> > >(boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >, logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#1}&&, boost::asio::use_awaitable_t<boost::asio::any_io_executor> const&) requires is_void_v<std::invoke_result<logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#1}>::type>::{lambda(auto:1&)#1}::operator()<boost::asio::detail::composed_op<{lambda(auto:1&)#1}, boost::asio::detail::composed_work<void (boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >)>, boost::asio::detail::awaitable_handler<boost::asio::any_io_executor>, void ()> >(logback_generations::setup(DoutPrefixProvider const*, log_type)::{lambda()#1}&)::{lambda()#1}::operator()()::{lambda()#1}>, std::allocator<void> >(boost::asio::detail::executor_function::impl_base*, bool) (executor_function.hpp:113)
...
Updated by Yuval Lifshitz 9 months ago
in the log file i see the following:
rgw notify: INFO: manager stopped. done processing all queues
however, in the code there are other debug log indications (not seen):
cat ../src/rgw/driver/rados/rgw_notify.cc | ag stopped
ldpp_dout(this, 5) << "INFO: manager stopped. done cleanup for queue: " << queue_name << dendl;
ldpp_dout(this, 5) << "INFO: manager stopped. done processing for queue: " << queue_name << dendl;
ldpp_dout(this, 5) << "INFO: manager stopped. " << processed_queue_count << " queues are still being processed" << dendl;
ldpp_dout(this, 5) << "INFO: manager stopped. done processing all queues" << dendl;
Updated by Yuval Lifshitz 9 months ago
could be related to another valgrind failure that happened with the amqp tests:
https://pulpito.ceph.com/yuvalif-2025-06-30_13:25:26-rgw-wip-yuval-71402-distro-default-smithi/8358602/
failure_reason: 'valgrind error: InvalidRead mempool::pool_t::allocated_bytes() const ceph::common::CephContext::_refresh_perf_values()
this is an invalid access to the perf counters pointer inside cct, also happening on shutdown.
maybe the notification background thread accesses it too late in the shutdown process?
Updated by Yuval Lifshitz 9 months ago
- Status changed from Triaged to In Progress
- Pull request ID set to 63986
- https://github.com/ceph/ceph/pull/63986/commits/44e11e93f0d758eeff8bcaceb109cee315a17fdc - make sure that if we shutdown while sending notifications we won't have to wait on a timer
- https://github.com/ceph/ceph/pull/63986/commits/119771f0d01b212516c2cef034685b4174b5f619 - change all sleeps in the code that exist to avoid busy-waiting to be no longer than 1s, so that graceful shutdown is possible. also, debug logs were added to verify the shutdown process
when running the test locally, the shutdown process seems ok. however, valgrind is still showing errors: https://0x0.st/80aP.err
when running the test in teuthology, logs show graceful shutdown:
2025-07-01T21:01:16.231+0000 13ee67640 20 rgw notify: INFO: queue: :vddunh-9_topic ownership (lock) renewed
2025-07-01T21:01:45.645+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic. was removed. processing will stop
2025-07-01T21:01:45.647+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic. was removed. nothing to unlock
2025-07-01T21:01:45.647+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic not locked (ownership can move)
2025-07-01T21:01:45.647+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic marked for removal
2025-07-01T21:01:46.258+0000 13ee67640 20 rgw notify: INFO: performing stale reservation cleanup for queue: :vddunh-9_topic. next cleanup will happen at: Tue Jul 1 21:02:16 2025
2025-07-01T21:01:46.260+0000 13ee67640 20 rgw notify: INFO: processing queue list. next queues processing will happen at: Tue Jul 1 21:02:16 2025
2025-07-01T21:01:46.261+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic. was removed. cleanup will stop
2025-07-01T21:01:46.262+0000 13ee67640 10 rgw notify: INFO: queue: :vddunh-9_topic was removed
2025-07-01T21:02:16.292+0000 13ee67640 20 rgw notify: INFO: processing queue list. next queues processing will happen at: Tue Jul 1 21:02:46 2025
2025-07-01T21:02:25.269+0000 a1a6500 5 rgw notify: INFO: manager received stop signal. shutting down...
2025-07-01T21:02:26.240+0000 13ee67640 5 rgw notify: INFO: manager stopped. done processing all queues
but valgrind still finds issues:
http://qa-proxy.ceph.com/teuthology/yuvalif-2025-07-01_20:33:04-rgw:notifications-wip-yuval-71390-distro-default-smithi/8364886/remote/smithi186/log/valgrind/
Leak_PossiblyLost
operator new[](unsigned long)
Objecter::start_tick()
Objecter::start(OSDMap const*)
rgw::rados::create_config_store(DoutPrefixProvider const*)
DriverManager::create_config_store(DoutPrefixProvider const*, std::basic_string_view<char, std::char_traits<char> >)
rgw::AppMain::init_storage()
Leak_PossiblyLost
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ceph::buffer::v15_2_0::create(unsigned int)
ceph::buffer::v15_2_0::list::iterator_impl<true>::copy_deep(unsigned int, ceph::buffer::v15_2_0::ptr&)
CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)
KeyRing::set_modifier(char const*, char const*, EntityName&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&)
KeyRing::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)
KeyRing::load(ceph::common::CephContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
KeyRing::from_ceph_context(ceph::common::CephContext*)
MonClient::init()
rgw::rados::create_config_store(DoutPrefixProvider const*)
DriverManager::create_config_store(DoutPrefixProvider const*, std::basic_string_view<char, std::char_traits<char> >)
rgw::AppMain::init_storage()
Leak_PossiblyLost
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ceph::buffer::v15_2_0::create(unsigned int)
ceph::buffer::v15_2_0::list::iterator_impl<true>::copy_deep(unsigned int, ceph::buffer::v15_2_0::ptr&)
CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephXTicketHandler::verify_service_ticket_reply(CryptoKey&, ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephXTicketManager::verify_service_ticket_reply(CryptoKey&, ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephxClientHandler::handle_response(int, ceph::buffer::v15_2_0::list::iterator_impl<true>&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)
MonConnection::handle_auth_done(AuthConnectionMeta*, unsigned long, ceph::buffer::v15_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)
MonClient::handle_auth_done(Connection*, AuthConnectionMeta*, unsigned long, unsigned int, ceph::buffer::v15_2_0::list const&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)
ProtocolV2::handle_auth_done(ceph::buffer::v15_2_0::list&)
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ceph::buffer::v15_2_0::create(unsigned int)
ProtocolV2::read_frame()
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ceph::buffer::v15_2_0::create(unsigned int)
ceph::buffer::v15_2_0::list::iterator_impl<true>::copy_deep(unsigned int, ceph::buffer::v15_2_0::ptr&)
CryptoKey::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephXTicketHandler::verify_service_ticket_reply(CryptoKey&, ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephXTicketManager::verify_service_ticket_reply(CryptoKey&, ceph::buffer::v15_2_0::list::iterator_impl<true>&)
CephxClientHandler::handle_response(int, ceph::buffer::v15_2_0::list::iterator_impl<true>&, CryptoKey*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)
MonConnection::authenticate(MAuthReply*)
MonClient::handle_auth(MAuthReply*)
MonClient::ms_dispatch(Message*)
DispatchQueue::entry()
Leak_PossiblyLost
posix_memalign
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ProtocolV2::read_frame_segment()
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
start_thread
clone
Leak_PossiblyLost
ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int)
ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)
ProtocolV2::read_frame_segment()
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
md_config_t::get_defaults_bl(ConfigValues const&, ceph::buffer::v15_2_0::list*)
MgrClient::_send_open()
MgrClient::service_daemon_register(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)
RGWRados::register_to_service_map(DoutPrefixProvider const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)
rgw::AppMain::init_frontends2(rgw::RGWLib*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
md_config_t::get_config_bl(ConfigValues const&, unsigned long, ceph::buffer::v15_2_0::list*, unsigned long*)
MgrClient::_send_report()
MgrClient::_send_stats()
CommonSafeTimer<std::mutex>::timer_thread()
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
CephXTicketHandler::build_authorizer(unsigned long) const
MonClient::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*)
ProtocolV2::send_auth_request(std::vector<unsigned int, std::allocator<unsigned int> >&)
ProtocolV2::post_client_banner_exchange()
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
CephXAuthorizer::add_challenge(ceph::common::CephContext*, ceph::buffer::v15_2_0::list const&)
MonClient::handle_auth_reply_more(Connection*, AuthConnectionMeta*, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*)
ProtocolV2::handle_auth_reply_more(ceph::buffer::v15_2_0::list&)
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
CephXTicketHandler::build_authorizer(unsigned long) const
MonClient::get_auth_request(Connection*, AuthConnectionMeta*, unsigned int*, std::vector<unsigned int, std::allocator<unsigned int> >*, ceph::buffer::v15_2_0::list*)
ProtocolV2::send_auth_request(std::vector<unsigned int, std::allocator<unsigned int> >&)
ProtocolV2::post_client_banner_exchange()
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
CephXAuthorizer::add_challenge(ceph::common::CephContext*, ceph::buffer::v15_2_0::list const&)
MonClient::handle_auth_reply_more(Connection*, AuthConnectionMeta*, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*)
ProtocolV2::handle_auth_reply_more(ceph::buffer::v15_2_0::list&)
ProtocolV2::run_continuation(Ct<ProtocolV2>&)
AsyncConnection::process()
EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)
Leak_PossiblyLost
ceph::buffer::v15_2_0::list::refill_append_space(unsigned int)
ceph::buffer::v15_2_0::list::append(char const*, unsigned int)
md_config_t::get_defaults_bl(ConfigValues const&, ceph::buffer::v15_2_0::list*)
MgrClient::_send_open()
MgrClient::service_daemon_register(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)
RGWRados::register_to_service_map(DoutPrefixProvider const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)
rgw::AppMain::init_frontends2(rgw::RGWLib*)
Updated by Yuval Lifshitz 9 months ago · Edited
when running valgrind locally, without doing any operation (notification related or other), i get a similar leak report (https://0x0.st/8GvG.err) to the one that i get after running the bucket notification test suite (https://0x0.st/80aP.err)
Updated by J. Eric Ivancich 9 months ago
Getting a lot of these errors, which is making testing more difficult.
Updated by J. Eric Ivancich 8 months ago
- Related to Bug #72211: datalog/backing: valgrind Leak_DefinitelyLost possibly in neorados / Objecter added
Updated by Yuval Lifshitz 8 months ago
Eric,
Is the other failure happening for tests not running bucket notifications?
Updated by Yuval Lifshitz 8 months ago
could this be the same issue as: https://tracker.ceph.com/issues/72211 ?
if so, we can probably mark this as a duplicate.
Updated by Yuval Lifshitz 8 months ago
running the notification tests against the following fix: https://github.com/ceph/ceph/pull/63698
did not fix the issue. see: http://pulpito.front.sepia.ceph.com/yuvalif-2025-08-05_17:41:35-rgw:notifications-wip-yuval-test-71066-distro-default-smithi/
Updated by Yuval Lifshitz 8 months ago
when running locally with valgrind on a cluster with realm defined, i get a different error report, which looks more similar to the failures seen in teuthology:
https://0x0.st/8hfB.err
e.g. it complains about a leak in: Objecter::start_tick()
==00:00:12:44.885 2164549== at 0x4843743: operator new[](unsigned long) (vg_replace_malloc.c:729)
==00:00:12:44.885 2164549== by 0x55F5EFC: allocate (new_allocator.h:151)
==00:00:12:44.885 2164549== by 0x55F5EFC: allocate (allocator.h:196)
==00:00:12:44.885 2164549== by 0x55F5EFC: allocate (alloc_traits.h:478)
==00:00:12:44.885 2164549== by 0x55F5EFC: box_allocate (function2.hpp:418)
==00:00:12:44.885 2164549== by 0x55F5EFC: construct<fu2::abi_400::detail::type_erasure::box<false, std::_Bind<void (Objecter::*(Objecter*))()>, std::allocator<std::_Bind<void (Objecter::*(Objecter*))()> > > > (function2.hpp:953)
==00:00:12:44.885 2164549== by 0x55F5EFC: init<fu2::abi_400::detail::type_erasure::box<false, std::_Bind<void (Objecter::*(Objecter*))()>, std::allocator<std::_Bind<void (Objecter::*(Objecter*))()> > > > (function2.hpp:1001)
==00:00:12:44.885 2164549== by 0x55F5EFC: erasure<std::_Bind<void (Objecter::*(Objecter*))()> > (function2.hpp:1177)
==00:00:12:44.885 2164549== by 0x55F5EFC: function<std::_Bind<void (Objecter::*(Objecter*))()> > (function2.hpp:1593)
==00:00:12:44.885 2164549== by 0x55F5EFC: make_unique<ceph::timer<ceph::coarse_mono_clock>::event, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >&, long unsigned int&, std::_Bind<void (Objecter::*(Objecter*))()> > (unique_ptr.h:1076)
==00:00:12:44.885 2164549== by 0x55F5EFC: add_event<void (Objecter::*)(), Objecter*> (ceph_timer.h:211)
==00:00:12:44.885 2164549== by 0x55F5EFC: add_event<void (Objecter::*)(), Objecter*> (ceph_timer.h:201)
==00:00:12:44.885 2164549== by 0x55F5EFC: Objecter::start_tick() (Objecter.cc:2175)
==00:00:12:44.885 2164549== by 0x55FD15D: Objecter::start(OSDMap const*) (Objecter.cc:432)
==00:00:12:44.885 2164549== by 0x4C174C0: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:314)
==00:00:12:44.885 2164549== by 0x14DE8BF: rgw::rados::create_config_store(DoutPrefixProvider const*) (store.cc:42)
==00:00:12:44.885 2164549== by 0xFE10C2: DriverManager::create_config_store(DoutPrefixProvider const*, std::basic_string_view<char, std::char_traits<char> >) (rgw_sal.cc:391)
==00:00:12:44.885 2164549== by 0xB5445F: rgw::AppMain::init_storage() (rgw_appmain.cc:217)
==00:00:12:44.885 2164549== by 0xB1B768: main (rgw_main.cc:143)
the same errors are seen even if no notification tests are run
Updated by Casey Bodley 8 months ago
i see that Adam's https://github.com/ceph/ceph/pull/63698 merged to resolve the leaks outside of rgw/notifications, but his qa results in https://pulpito.ceph.com/aemerson-2025-08-07_06:15:17-rgw-wip-71066-distro-default-smithi/ still show the issues in this tracker
Updated by Yuval Lifshitz 8 months ago
note that this test: http://pulpito.front.sepia.ceph.com/yuvalif-2025-08-06_15:38:21-rgw:notifications-wip-yuval-test-leak-distro-default-smithi/
is running the bucket notification test suite against an RGW that does not have the notification code enabled.
it does not run kafka, amqp and http clients, and does not run the persistent notifications thread at all (hence, all the test failures).
this still has the same memleaks reported.
Updated by Casey Bodley 8 months ago
thanks, https://github.com/ceph/ceph/pull/63189 may be the culprit since it only adds the watch when a realm is configured. that wasn't backported to tentacle
Updated by Casey Bodley 7 months ago
Casey Bodley wrote in #note-24:
thanks, https://github.com/ceph/ceph/pull/63189 may be the culprit since it only adds the watch when a realm is configured. that wasn't backported to tentacle
testing a revert in https://pulpito.ceph.com/cbodley-2025-08-13_18:44:20-rgw:notifications-wip-71390-distro-default-smithi/
Updated by Casey Bodley 7 months ago
Casey Bodley wrote in #note-25:
testing a revert in https://pulpito.ceph.com/cbodley-2025-08-13_18:44:20-rgw:notifications-wip-71390-distro-default-smithi/
still shows failures :(
Updated by J. Eric Ivancich 6 months ago
- Assignee changed from Yuval Lifshitz to Adam Emerson
Updated by Yuval Lifshitz 6 months ago
suggested way of investigation:
make sure that you build ceph with:
cmake -DWITH_BOOST_VALGRIND=ON ..
run vstart cluster with realm:
MON=1 OSD=1 MDS=0 MGR=0 ../src/test/rgw/test-rgw-multisite.sh 1
if you want to test the RGW, you have to set the right credentials:
export AWS_ACCESS_KEY_ID=1234567890
export AWS_SECRET_ACCESS_KEY=pencil
export AWS_DEFAULT_REGION=zg1
kill the RGW and rerun under valgrind:
valgrind --show-leak-kinds=definite,indirect,possible --leak-check=full --trace-children=no --child-silent-after-fork=yes --num-callers=20 --track-origins=yes --time-stamp=yes --suppressions=../qa/valgrind.supp --tool=memcheck --max-threads=2048 -- ./bin/radosgw -c ./run/c1/ceph.conf --log-file=./run/c1/out/radosgw.8101.log --admin-socket=./run/c1/out/radosgw.8101.asok --pid-file=./run/c1/out/radosgw.8101.pid -n client.rgw.8101 --rgw_frontends="beast port=8101" --debug_ms=0 --debug_rgw=20 --debug_rgw_notification=20 -f &> valgrind.err
Ctrl-C the RGW process (no need to do anything related to bucket notifications, or any other operation)
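The steps above can be collected into one script. This is only a sketch: the `./run/c1` run directory, port 8101, and the credentials are taken from the commands in this note and may differ per setup; it prints the assembled valgrind command rather than executing it, so you can inspect it first.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps above. Assumes a ceph build tree configured
# with -DWITH_BOOST_VALGRIND=ON and a vstart multisite cluster already running
# (see test-rgw-multisite.sh), with the RGW on port 8101 in ./run/c1.
set -u

RGW_PORT=8101
RUN_DIR=./run/c1

# Credentials for the multisite vstart setup (values from this note).
export AWS_ACCESS_KEY_ID=1234567890
export AWS_SECRET_ACCESS_KEY=pencil
export AWS_DEFAULT_REGION=zg1

# Assemble the valgrind invocation as an array so it can be inspected.
cmd=(valgrind
  --show-leak-kinds=definite,indirect,possible --leak-check=full
  --trace-children=no --child-silent-after-fork=yes --num-callers=20
  --track-origins=yes --time-stamp=yes --suppressions=../qa/valgrind.supp
  --tool=memcheck --max-threads=2048 --
  ./bin/radosgw -c "${RUN_DIR}/ceph.conf"
  --log-file="${RUN_DIR}/out/radosgw.${RGW_PORT}.log"
  --admin-socket="${RUN_DIR}/out/radosgw.${RGW_PORT}.asok"
  --pid-file="${RUN_DIR}/out/radosgw.${RGW_PORT}.pid"
  -n "client.rgw.${RGW_PORT}" --rgw_frontends="beast port=${RGW_PORT}"
  --debug_ms=0 --debug_rgw=20 --debug_rgw_notification=20 -f)

# Print the command; after killing the vstart-launched RGW, replace this
# line with: "${cmd[@]}" &> valgrind.err to actually run it.
printf '%s ' "${cmd[@]}"; echo
```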
in the valgrind error log there will be many errors, however, in the context of this tracker, look for:
==00:00:01:58.622 1712310== 24 bytes in 1 blocks are possibly lost in loss record 645 of 1,491
==00:00:01:58.622 1712310== at 0x718E743: operator new[](unsigned long) (vg_replace_malloc.c:729)
==00:00:01:58.622 1712310== by 0x7F64FFC: allocate (new_allocator.h:151)
==00:00:01:58.622 1712310== by 0x7F64FFC: allocate (allocator.h:196)
==00:00:01:58.622 1712310== by 0x7F64FFC: allocate (alloc_traits.h:478)
==00:00:01:58.622 1712310== by 0x7F64FFC: box_allocate (function2.hpp:418)
==00:00:01:58.622 1712310== by 0x7F64FFC: construct<fu2::abi_400::detail::type_erasure::box<false, std::_Bind<void (Objecter::*(Objecter*))()>, std::allocator<std::_Bind<void (Objecter::*(Objecter*))()> > > > (function2.hpp:953)
==00:00:01:58.622 1712310== by 0x7F64FFC: init<fu2::abi_400::detail::type_erasure::box<false, std::_Bind<void (Objecter::*(Objecter*))()>, std::allocator<std::_Bind<void (Objecter::*(Objecter*))()> > > > (function2.hpp:1001)
==00:00:01:58.622 1712310== by 0x7F64FFC: erasure<std::_Bind<void (Objecter::*(Objecter*))()> > (function2.hpp:1177)
==00:00:01:58.622 1712310== by 0x7F64FFC: function<std::_Bind<void (Objecter::*(Objecter*))()> > (function2.hpp:1593)
==00:00:01:58.622 1712310== by 0x7F64FFC: make_unique<ceph::timer<ceph::coarse_mono_clock>::event, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<long int, std::ratio<1, 1000000000> > >&, long unsigned int&, std::_Bind<void (Objecter::*(Objecter*))()> > (unique_ptr.h:1076)
==00:00:01:58.622 1712310== by 0x7F64FFC: add_event<void (Objecter::*)(), Objecter*> (ceph_timer.h:211)
==00:00:01:58.622 1712310== by 0x7F64FFC: add_event<void (Objecter::*)(), Objecter*> (ceph_timer.h:201)
==00:00:01:58.622 1712310== by 0x7F64FFC: Objecter::start_tick() (Objecter.cc:2178)
==00:00:01:58.622 1712310== by 0x7F6C89D: Objecter::start(OSDMap const*) (Objecter.cc:432)
==00:00:01:58.622 1712310== by 0x75621BC: librados::v14_2_0::RadosClient::connect() (RadosClient.cc:314)
==00:00:01:58.623 1712310== by 0x5431DBF: rgw::rados::create_config_store(DoutPrefixProvider const*) (store.cc:42)
==00:00:01:58.623 1712310== by 0x4F14D84: DriverManager::create_config_store(DoutPrefixProvider const*, std::basic_string_view<char, std::char_traits<char> >) (rgw_sal.cc:429)
==00:00:01:58.623 1712310== by 0x4A7AD9F: rgw::AppMain::init_storage() (rgw_appmain.cc:214)
==00:00:01:58.623 1712310== by 0x4A3F228: main (rgw_main.cc:143)
Updated by Yuval Lifshitz 6 months ago
- "tentacle" is running without this issue - "bisect good"
- issue was identified 19-may-25, so we can probably find a bad commit around that date
Updated by Casey Bodley 6 months ago
Yuval Lifshitz wrote in #note-29:
one option is investigating via "git bisect":
- "tentacle" is running without this issue - "bisect good"
- issue was identified 19-may-25, so we can probably find a bad commit around that date
thanks Yuval. looking at the latest tentacle baseline https://pulpito.ceph.com/teuthology-2025-09-26_22:40:03-rgw-tentacle-distro-default-smithi/
the rgw/notifications jobs aren't showing leaks, but one rgw/verify job does:
Leak_PossiblyLost operator new[](unsigned long) Objecter::start_tick() Objecter::start(OSDMap const*)
Updated by Nithya Balachandran about 1 month ago · Edited
- Pull request ID set to 67287
I was able to reproduce the issue with the steps Yuval provided.
The issue is that the RadosRealmWatcher reuses and overwrites the librados::Rados in the ConfigStore when calling watch_start().
The original Objecter is thus not shut down or destroyed when the radosgw is shut down.
With the code changes:
00:00:02:08.799 1664370 LEAK SUMMARY:
00:00:02:08.799 1664370 definitely lost: 0 bytes in 0 blocks
00:00:02:08.799 1664370 indirectly lost: 0 bytes in 0 blocks
00:00:02:08.799 1664370 possibly lost: 0 bytes in 0 blocks
00:00:02:08.799 1664370 still reachable: 36,219 bytes in 1,463 blocks
00:00:02:08.799 1664370 suppressed: 256,781 bytes in 3,295 blocks
Casey, I'm not sure if we want to have a separate librados::Rados instance here. Please take a look at the PR.
Updated by Nithya Balachandran about 1 month ago
- Assignee changed from Adam Emerson to Nithya Balachandran
Updated by Nithya Balachandran about 1 month ago
- Backport changed from tentacle, squid to squid
This is not present in tentacle.
Updated by Nithya Balachandran about 1 month ago
- Status changed from In Progress to Fix Under Review
Updated by Upkeep Bot about 1 month ago
- Status changed from Fix Under Review to Pending Backport
- Merge Commit set to f5ae31d9708a1cb9486a5c325d8e6373f98d5f10
- Fixed In set to v20.3.0-5361-gf5ae31d970
- Upkeep Timestamp set to 2026-02-17T21:42:13+00:00
Updated by Casey Bodley about 1 month ago
Nithya Balachandran wrote in #note-33:
This is not present in tentacle.
@Nithya Balachandran can you please clarify why the tentacle backport is not needed, but squid and main did need it?
Updated by Upkeep Bot about 1 month ago
- Copied to Backport #74994: squid: valgrind error: Leak_PossiblyLost operator new[](unsigned long) Objecter::start_tick() Objecter::start(OSDMap const*) added
Updated by Upkeep Bot about 1 month ago
- Tags (freeform) set to backport_processed
Updated by Casey Bodley about 1 month ago
- Backport changed from squid to squid tentacle
- Tags (freeform) deleted (backport_processed)
@Nithya Balachandran can you please clarify why the tentacle backport is not needed, but squid and main did need it?
adding tentacle back in the meantime. we can close the tentacle tracker if necessary
Updated by Upkeep Bot about 1 month ago
- Copied to Backport #74995: tentacle: valgrind error: Leak_PossiblyLost operator new[](unsigned long) Objecter::start_tick() Objecter::start(OSDMap const*) added
Updated by Upkeep Bot about 1 month ago
- Tags (freeform) set to backport_processed
Updated by Nithya Balachandran about 1 month ago
I don't think it should be present in squid either but I have not got around to checking the code there.
Updated by Casey Bodley about 1 month ago
- Status changed from Pending Backport to Resolved