Project

General

Profile

Actions

Bug #59335

open

Found coredumps on smithi related to sqlite3

Added by Laura Flores almost 3 years ago. Updated 9 days ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Backport:
tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

/a/yuriw-2023-03-28_22:43:59-rados-wip-yuri11-testing-2023-03-28-0950-distro-default-smithi/7223995

2023-03-29T04:48:18.349 DEBUG:teuthology.run_tasks:Unwinding manager internal.sudo
2023-03-29T04:48:18.361 INFO:teuthology.task.internal:Restoring /etc/sudoers...
2023-03-29T04:48:18.361 DEBUG:teuthology.orchestra.run.smithi052:> sudo mv -f /etc/sudoers.orig.teuthology /etc/sudoers
2023-03-29T04:48:18.373 DEBUG:teuthology.orchestra.run.smithi110:> sudo mv -f /etc/sudoers.orig.teuthology /etc/sudoers
2023-03-29T04:48:18.384 DEBUG:teuthology.run_tasks:Unwinding manager internal.coredump
2023-03-29T04:48:18.403 DEBUG:teuthology.orchestra.run.smithi052:> sudo sysctl -w kernel.core_pattern=core && sudo bash -c 'for f in `find /home/ubuntu/cephtest/archive/coredump -type f`; do file $f | grep -q systemd-sysusers && rm $f || true ; done' && rmdir --ignore-fail-on-non-empty -- /home/ubuntu/cephtest/archive/coredump
2023-03-29T04:48:18.415 DEBUG:teuthology.orchestra.run.smithi110:> sudo sysctl -w kernel.core_pattern=core && sudo bash -c 'for f in `find /home/ubuntu/cephtest/archive/coredump -type f`; do file $f | grep -q systemd-sysusers && rm $f || true ; done' && rmdir --ignore-fail-on-non-empty -- /home/ubuntu/cephtest/archive/coredump
2023-03-29T04:48:18.423 INFO:teuthology.orchestra.run.smithi052.stdout:kernel.core_pattern = core
2023-03-29T04:48:18.436 INFO:teuthology.orchestra.run.smithi110.stdout:kernel.core_pattern = core
2023-03-29T04:48:18.467 DEBUG:teuthology.orchestra.run.smithi052:> test -e /home/ubuntu/cephtest/archive/coredump
2023-03-29T04:48:18.490 WARNING:teuthology.task.internal:Found coredumps on ubuntu@smithi052.front.sepia.ceph.com, flagging run as failed
2023-03-29T04:48:18.490 DEBUG:teuthology.orchestra.run.smithi110:> test -e /home/ubuntu/cephtest/archive/coredump
2023-03-29T04:48:18.511 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-03-29T04:48:18.512 DEBUG:teuthology.run_tasks:Unwinding manager internal.archive
2023-03-29T04:48:18.524 INFO:teuthology.task.internal:Transferring archived files...
2023-03-29T04:48:18.524 DEBUG:teuthology.misc:Transferring archived files from smithi052:/home/ubuntu/cephtest/archive to /home/teuthworker/archive/yuriw-2023-03-28_22:43:59-rados-wip-yuri11-testing-2023-03-28-0950-distro-default-smithi/7223995/remote/smithi052
2023-03-29T04:48:18.525 DEBUG:teuthology.orchestra.run.smithi052:> sudo tar cz -f - -C /home/ubuntu/cephtest/archive -- .
2023-03-29T04:48:25.303 INFO:teuthology.task.internal:Transferring binaries for coredumps...
2023-03-29T04:48:25.513 INFO:teuthology.task.internal: core looks like: /home/teuthworker/archive/yuriw-2023-03-28_22:43:59-rados-wip-yuri11-testing-2023-03-28-0950-distro-default-smithi/7223995/remote/smithi052/coredump/1680065090.39304.core.gz: gzip compressed data, was "1680065090.39304.core", last modified: Wed Mar 29 04:48:18 2023, max compression

2023-03-29T04:48:25.514 INFO:teuthology.task.internal:core is compressed, try accessing gzip file ...
2023-03-29T04:48:26.160 INFO:teuthology.task.internal: core looks like: /tmp/tmpsr59t_lu: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'sqlite3 -cmd .output /dev/null -cmd .load libcephsqlite.so -cmd pragma journal_', real uid: 1000, effective uid: 1000, real gid: 1271, effective gid: 1271, execfn: '/usr/bin/sqlite3', platform: 'x86_64'

2023-03-29T04:48:26.190 INFO:teuthology.task.internal: dump_program: sqlite3 -cmd .output /dev/null -cmd .load libcephsqlite.so -cmd pragma journal_
2023-03-29T04:48:26.191 DEBUG:teuthology.orchestra.run.smithi052:> which 'sqlite3 -cmd .output /dev/null -cmd .load libcephsqlite.so -cmd pragma journal_'
2023-03-29T04:48:26.199 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-03-29T04:48:26.200 ERROR:teuthology.run_tasks:Manager failed: internal.archive

There is a core file located at /a/yuriw-2023-03-28_22:43:59-rados-wip-yuri11-testing-2023-03-28-0950-distro-default-smithi/7223995/remote/smithi052/coredump/1680065090.39304.core.gz.

The coredump was not generated by any daemon, and there was no executable. So, I checked the details with `file`. Looks like it has something to do with sqlite3. This is also reflected in the above teuthology log.

~$ file 1680065090.39304.core
1680065090.39304.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'sqlite3 -cmd .output /dev/null -cmd .load libcephsqlite.so -cmd pragma journal_', real uid: 1000, effective uid: 1000, real gid: 1271, effective gid: 1271, execfn: '/usr/bin/sqlite3', platform: 'x86_64'


Related issues 1 (0 open1 closed)

Has duplicate cephsqlite - Bug #63408: libcephsqlite fails with coredumpDuplicate

Actions
Actions #1

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253347

Actions #2

Updated by Laura Flores almost 3 years ago

  • Assignee set to Patrick Donnelly

Patrick would you be the person to go to for this?

Actions #3

Updated by Patrick Donnelly over 2 years ago

The common theme here is that atexit cleanup of sqlite3 is causing segmentation faults or:

2023-08-21T20:21:24.873 INFO:tasks.workunit.client.0.smithi141.stderr:Error: near line 3: stepping, disk I/O error (10)                                                                       
2023-08-21T20:21:24.885 INFO:tasks.workunit.client.0.smithi141.stderr:terminate called after throwing an instance of 'std::runtime_error'                                                     
2023-08-21T20:21:24.885 INFO:tasks.workunit.client.0.smithi141.stderr:  what():  EVP_DecryptInit_ex failed                                                                                    
2023-08-21T20:21:34.850 INFO:tasks.workunit.client.0.smithi141.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test_libcephsqlite.sh: line 129: 40508 Aborted                 (core dumped) sqlite3 -cmd '.output /dev/null' -cmd '.load libcephsqlite.so' -cmd 'pragma journal_mode = PERSIST' -cmd ".open file:///$pool:$ns/baz.db?vfs=ceph" -cmd '.output stdout' <<< "$a" 

/teuthology/pdonnell-2023-08-21_19:21:49-rados:basic-main-distro-default-smithi/7374185/teuthology.log

I've not isolated the cause yet and it's probably some kind of memory corruption :/

It's decidedly just an annoyance because this only happens with atexit cleanup.

Actions #4

Updated by Patrick Donnelly over 2 years ago

  • Has duplicate Bug #63408: libcephsqlite fails with coredump added
Actions #5

Updated by Patrick Donnelly over 2 years ago

/a/yuriw-2023-10-25_14:34:26-rados-wip-yuri5-testing-2023-10-24-0737-pacific-distro-default-smithi/7437307

Actions #6

Updated by Patrick Donnelly over 2 years ago

  • Assignee deleted (Patrick Donnelly)
Actions #7

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-01-19_23:17:02-rados-wip-yuri4-testing-2024-01-18-1257-pacific-distro-default-smithi/7524433

Actions #8

Updated by Laura Flores almost 2 years ago

/a/teuthology-2024-04-28_20:00:15-rados-main-distro-default-smithi/7677031

Actions #9

Updated by Laura Flores 12 months ago

/a/yuriw-2025-03-27_15:03:25-rados-wip-yuri7-testing-2025-03-26-1605-distro-default-smithi/8213263

Actions #10

Updated by Sridhar Seshasayee 12 months ago

/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212787

Actions #11

Updated by Laura Flores 12 months ago

/a/skanta-2025-04-04_06:10:17-rados-wip-bharath10-testing-2025-04-03-2112-distro-default-smithi/8223549

Actions #12

Updated by Sridhar Seshasayee 12 months ago

/a/yuriw-2025-04-07_23:32:09-rados-wip-yuri13-testing-2025-04-07-1144-distro-default-smithi/8229223

Actions #13

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

/a/skanta-2025-04-09_05:31:19-rados-wip-bharath17-testing-2025-04-08-0602-distro-default-smithi/8233188/

Actions #14

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

/a/lflores-2025-04-14_14:00:23-rados-wip-lflores-testing-2-2025-04-11-1133-distro-default-smithi/8239409

Actions #15

Updated by Laura Flores 11 months ago

/a/lflores-2025-04-15_00:57:49-rados-wip-lflores-testing-5-2025-04-14-1635-distro-default-gibba/8241287

Actions #16

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

  • Assignee set to Patrick Donnelly

from rados suite watch: ping @Patrick Donnelly

Hi Patrick,
do you mind taking a look at this, problem seems to be persistence in the main RADOS suite.

Actions #17

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

Patrick will take a look when he has time, he said the core files just needs debugging .. likely happends during atexit handling with the sqlite3 binary.

Actions #18

Updated by Jonathan Bailey 7 months ago

/a/skanta-2025-08-27_01:46:19-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8466510

Actions #19

Updated by Laura Flores 6 months ago

/a/yuriw-2025-09-15_21:18:21-rados-wip-yuri5-testing-2025-09-15-1211-tentacle-distro-default-smithi/8502193

Actions #20

Updated by Laura Flores 6 months ago

  • Backport set to tentacle
Actions #21

Updated by Aishwarya Mathuria 6 months ago

/a/yuriw-2025-09-24_18:46:32-rados-wip-yuri8-testing-2025-09-24-0752-tentacle-distro-default-smithi/8518613

Actions #22

Updated by Naveen Naidu 3 months ago

/a/skanta-2025-11-21_10:17:34-rados-wip-bharath11-testing-2025-11-21-0531-distro-default-smithi/8617954

Actions #23

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-21_20:56:39-rados-main-distro-default-trial/11810

Actions #24

Updated by Nitzan Mordechai about 1 month ago

/a/yuriw-2026-02-04_23:08:40-rados-wip-yuri3-testing-2026-02-04-1948-tentacle-distro-default-trial/35529

Actions #25

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25738

Actions #26

Updated by Nitzan Mordechai about 1 month ago

/a/yaarit-2026-02-10_23:48:52-rados-wip-rocky10-branch-of-the-day-2026-02-09-1770676549-distro-default-trial/44253

Actions #27

Updated by Nitzan Mordechai 13 days ago

/a/yuriw-2026-03-07_15:37:55-rados-wip-rocky10-branch-of-the-day-2026-03-06-1772840606-distro-default-trial/92991

Actions #28

Updated by Brad Hubbard 9 days ago · Edited

Nitzan Mordechai wrote in #note-27:

/a/yuriw-2026-03-07_15:37:55-rados-wip-rocky10-branch-of-the-day-2026-03-06-1772840606-distro-default-trial/92991

I managed to set up an environment to debug the coredump.

(gdb) bt
#0  __pthread_rwlock_rdlock_full64 (abstime=0x0, clockid=0, rwlock=0x0) at ./nptl/pthread_rwlock_common.c:298
#1  ___pthread_rwlock_rdlock (rwlock=0x0) at ./nptl/pthread_rwlock_rdlock.c:26
#2  0x00007f921e456ccd in CRYPTO_THREAD_read_lock (lock=<optimized out>) at ../crypto/threads_pthread.c:85
#3  0x00007f921e449f06 in ossl_lib_ctx_get_data (ctx=0x7f921e6d44e0 <default_context_int.lto_priv>, index=0, meth=0x7f921e68b7a0 <evp_method_store_method.lto_priv>) at ../crypto/context.c:377
#4  0x00007f921e529bf1 in get_evp_method_store (libctx=<optimized out>) at ../crypto/evp/evp_fetch.c:82
#5  inner_evp_generic_fetch.constprop.0 (methdata=methdata@entry=0x7f921a891730, prov=<optimized out>, prov@entry=0x0, operation_id=operation_id@entry=2, name=0x7f921e5dfcdb "id-aes128-GCM", 
    properties=properties@entry=0x7f921e5dbdbc "", new_method=new_method@entry=0x7f921e429de0 <evp_cipher_from_algorithm>, up_ref_method=0x7f921e421320 <evp_cipher_up_ref>, free_method=0x7f921e42a320 <evp_cipher_free>, name_id=0)
    at ../crypto/evp/evp_fetch.c:248
#6  0x00007f921e4212b3 in evp_generic_fetch (free_method=0x7f921e42a320 <evp_cipher_free>, up_ref_method=0x7f921e421320 <evp_cipher_up_ref>, new_method=0x7f921e429de0 <evp_cipher_from_algorithm>, properties=0x7f921e5dbdbc "", 
    name=<optimized out>, operation_id=2, libctx=0x0) at ../crypto/evp/evp_fetch.c:372
#7  EVP_CIPHER_fetch (ctx=ctx@entry=0x0, algorithm=<optimized out>, properties=properties@entry=0x7f921e5dbdbc "") at ../crypto/evp/evp_enc.c:1652
#8  0x00007f921e4297c0 in evp_cipher_init_internal (ctx=0x7f91fc00c0e0, cipher=0x7f921e688c20 <aesni_128_gcm>, impl=<optimized out>, key=0x0, iv=0x0, enc=<optimized out>, params=0x0) at ../crypto/evp/evp_enc.c:175
#9  0x00007f921e429c2f in EVP_CipherInit_ex (ctx=<optimized out>, cipher=<optimized out>, impl=<optimized out>, key=<optimized out>, iv=<optimized out>, enc=<optimized out>) at ../crypto/evp/evp_enc.c:436
#10 0x00007f921ee3e9f4 in ceph::crypto::onwire::AES128GCM_OnWireRxHandler::AES128GCM_OnWireRxHandler (cct=<optimized out>, new_nonce_format=true, nonce=..., key=..., this=0x7f91fc007d10) at msg/./src/msg/async/crypto_onwire.cc:186
#11 std::make_unique<ceph::crypto::onwire::AES128GCM_OnWireRxHandler, ceph::common::CephContext*&, std::array<unsigned char, 16ul>&, ceph::crypto::onwire::nonce_t&, bool&> () at /usr/include/c++/12/bits/unique_ptr.h:1065
#12 ceph::crypto::onwire::rxtx_t::create_handler_pair (cct=0x55efc9d53ba0, auth_meta=..., new_nonce_format=<optimized out>, crossed=<optimized out>) at msg/./src/msg/async/crypto_onwire.cc:309
#13 0x00007f921ee23c40 in ProtocolV2::handle_auth_done (this=0x7f91cc0945f0, payload=...) at /usr/include/c++/12/bits/shared_ptr_base.h:1665
#14 0x00007f921ee16a29 in ProtocolV2::run_continuation (this=0x7f91cc0945f0, continuation=...) at msg/./src/msg/async/ProtocolV2.cc:54
#15 0x00007f921edee56e in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=0, __args#0=<optimized out>, this=0x7f91cc0744d8) at /usr/include/c++/12/bits/std_function.h:591
#16 AsyncConnection::process (this=0x7f91cc074140) at msg/./src/msg/async/AsyncConnection.cc:485
#17 0x00007f921ee3300c in EventCenter::process_events (this=0x55efc9d0a058, timeout_microseconds=<optimized out>, working_dur=0x7f921a891d88) at msg/./src/msg/async/Event.cc:465
#18 0x00007f921ee38bf9 in operator() (__closure=<optimized out>) at msg/./src/msg/async/Stack.cc:50
#19 std::__invoke_impl<void, NetworkStack::add_thread(Worker*)::<lambda()>&> (__f=...) at /usr/include/c++/12/bits/invoke.h:61
#20 std::__invoke_r<void, NetworkStack::add_thread(Worker*)::<lambda()>&> (__fn=...) at /usr/include/c++/12/bits/invoke.h:111
#21 std::_Function_handler<void(), NetworkStack::add_thread(Worker*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/12/bits/std_function.h:290
#22 0x00007f921e81f253 in std::execute_native_thread_routine (__p=0x55efc9e9c5f0) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:82
#23 0x00007f921f5e8ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#24 0x00007f921f67a8d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I wonder if we are racing with this thread?

Thread 3 (Thread 0x7f921f51d740 (LWP 21504)):
#0  _int_free (av=0x7f9204000030, p=0x7f92040111a0, have_lock=0) at ./malloc/malloc.c:4553
#1  0x00007f921f5f9453 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#2  0x00007f921e44105c in OPENSSL_LH_flush (lh=<optimized out>) at ../crypto/lhash/lhash.c:121
#3  OPENSSL_LH_flush (lh=lh@entry=0x7f920400f530) at ../crypto/lhash/lhash.c:109
#4  0x00007f921e4410af in OPENSSL_LH_free (lh=0x7f920400f530) at ../crypto/lhash/lhash.c:101
#5  OPENSSL_LH_free (lh=0x7f920400f530) at ../crypto/lhash/lhash.c:96
#6  0x00007f921e454192 in err_cleanup () at ../crypto/err/err.c:229
#7  OPENSSL_cleanup () at ../crypto/init.c:438
#8  OPENSSL_cleanup () at ../crypto/init.c:338
#9  0x00007f921f599495 in __run_exit_handlers (status=1, listp=0x7f921f76e838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#10 0x00007f921f599610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#11 0x00007f921f57dd97 in __libc_start_call_main (main=main@entry=0x55efa2b3b020, argc=argc@entry=13, argv=argv@entry=0x7ffea71715a8) at ../sysdeps/nptl/libc_start_call_main.h:74
#12 0x00007f921f57de40 in __libc_start_main_impl (main=0x55efa2b3b020, argc=13, argv=0x7ffea71715a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffea7171598) at ../csu/libc-start.c:392
#13 0x000055efa2b3c9c5 in ?? ()

Of course the rwlock = 0x0 is a giveaway that the lock has been destroyed.

Actions #29

Updated by Brad Hubbard 9 days ago · Edited

Brad Hubbard wrote in #note-28:

Nitzan Mordechai wrote in #note-27:

/a/yuriw-2026-03-07_15:37:55-rados-wip-rocky10-branch-of-the-day-2026-03-06-1772840606-distro-default-trial/92991

I managed to set up an environment to debug the coredump.

[...]

I wonder if we are racing with this thread?

[...]

Of course the rwlock = 0x0 is a giveaway that the lock has been destroyed.

Likely race...

Thread 1                         Thread 3
--------                         --------
EVP_CIPHER_fetch()
  get_evp_method_store()

                                 exit()
                                 __run_exit_handlers()
                                 OPENSSL_cleanup()
                                   free EVP store
                                   destroy locks

CRYPTO_THREAD_read_lock(lock=NULL)
Segfault

I wonder if adding something here would suffice or not?

https://github.com/ceph/ceph/blob/4982a6b618ff83769389cc0b7d66b84259b590d2/src/libcephsqlite.cc#L868-L902

Actions

Also available in: Atom PDF