Bug #50743

closed

*: crash in pthread_getname_np

Added by Yaarit Hatuka almost 5 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:

100%

Source:
Q/A
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-5578-g6e72881527
Released In:
v20.2.0~1782
Upkeep Timestamp:
2025-11-01T01:28:38+00:00

Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

Sanitized backtrace:

    /lib64/libpthread.so.0(
    pthread_getname_np()
    ceph::logging::Log::dump_recent()
    MDSDaemon::respawn()
    Context::complete(int)
    MDSRank::respawn()
    MDSRank::handle_write_error(int)
    /usr/bin/ceph-mds(
    Context::complete(int)
    Finisher::finisher_thread_entry()
    /lib64/libpthread.so.0(
    clone()

Crash dump sample:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fca48244b20]",
        "pthread_getname_np()",
        "(ceph::logging::Log::dump_recent()+0x4b3) [0x7fca4980c643]",
        "(MDSDaemon::respawn()+0x15b) [0x55ca473547fb]",
        "(Context::complete(int)+0xd) [0x55ca4735c9ed]",
        "(MDSRank::respawn()+0x1c) [0x55ca4736289c]",
        "(MDSRank::handle_write_error(int)+0x1a6) [0x55ca47366a26]",
        "/usr/bin/ceph-mds(+0x1bef20) [0x55ca47366f20]",
        "(Context::complete(int)+0xd) [0x55ca4735c9ed]",
        "(Finisher::finisher_thread_entry()+0x1a5) [0x7fca495380b5]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fca4823a14a]",
        "clone()" 
    ],
    "ceph_version": "16.2.1",
    "crash_id": "2021-05-08T02:06:04.479049Z_c595a994-0782-4309-9f1b-e2f7dedaf821",
    "entity_name": "mds.20297e33b547c2f168d55f24c5d7328709e9b647",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "7a4dc07de98f9fe5951207ab4b4b599270f9729af0b338a4212d3bf9335cf310",
    "timestamp": "2021-05-08T02:06:04.479049Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-70-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021" 
}


Related issues 4 (0 open, 4 closed)

Related to CephFS - Bug #66923: mds: crash with core dump during respawn (Duplicate)

Copied to CephFS - Backport #68651: reef: *: crash in pthread_getname_np (Resolved, Milind Changire)
Copied to CephFS - Backport #68652: squid: *: crash in pthread_getname_np (Resolved, Milind Changire)
Copied to CephFS - Backport #68653: quincy: *: crash in pthread_getname_np (Rejected, Milind Changire)
Actions #1

Updated by Yaarit Hatuka almost 5 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.1 added
Actions #2

Updated by Neha Ojha almost 5 years ago

  • Project changed from RADOS to CephFS
Actions #3

Updated by Patrick Donnelly almost 5 years ago

  • Subject changed from crash: /lib64/libpthread.so.0( to mds: crash in pthread_getname_np
  • Status changed from New to Need More Info

How do we know what the signal number was? It's not clear to me what to do with this. I don't see anything obviously wrong with the call to pthread_getname_np in Log::dump_recent.

Actions #4

Updated by Yaarit Hatuka almost 5 years ago

Hi Patrick,

We don't have the signal number yet in the telemetry crash reports.

You can see other crash events with the same signature here:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

in the Crashes table:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?viewPanel=26&orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

If it seems like this is not a bug after all, you can change the status to Rejected or Won't Fix.

Actions #5

Updated by Patrick Donnelly almost 5 years ago

Yaarit Hatuka wrote:

Hi Patrick,

We don't have the signal number yet in the telemetry crash reports.

You can see other crash events with the same signature here:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

in the Crashes table:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?viewPanel=26&orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

If it seems like this is not a bug after all, you can change the status to Rejected or Won't Fix.

Sorry, how do you conclude it's not a bug after all?

Actions #6

Updated by Yaarit Hatuka almost 5 years ago

oh, I mean in general, not necessarily in this case.

This was opened automatically by a telemetry-to-redmine bot that I'm working on, which will import the telemetry crash reports into redmine. It was opened with my user, but future issues will be opened with the telemetry bot user.

Actions #7

Updated by Patrick Donnelly almost 5 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from mds: crash in pthread_getname_np to *: crash in pthread_getname_np
  • Target version set to v17.0.0
  • Source changed from Telemetry to Q/A
  • Backport set to pacific
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr:*** Caught signal (Segmentation fault) **
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: in thread 7f819c376700 thread_name:ceph-mon
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: ceph version 16.2.4-225-gf9084200 (f908420004cc81a30edb2b252b4d92f50c526280) pacific (stable)
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f8191010dc0]
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 2: pthread_getname_np()
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 3: (ceph::logging::Log::dump_recent()+0x4b3) [0x7f81938845a3]
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 4: ceph-mon(+0x53110b) [0x5594d213110b]
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 5: /lib64/libpthread.so.0(+0x12dc0) [0x7f8191010dc0]
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 6: gsignal()
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 7: abort()
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 8: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f819351059d]
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 9: /usr/lib64/ceph/libceph-common.so.2(+0x276766) [0x7f8193510766]
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 10: (Monitor::~Monitor()+0xb35) [0x5594d1ee0995]
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 11: (Monitor::~Monitor()+0xd) [0x5594d1ee09ed]
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 12: main()
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 13: __libc_start_main()
2021-06-01T05:31:28.922 INFO:tasks.ceph.mon.a.smithi053.stderr: 14: _start()
2021-06-01T05:31:28.980 INFO:tasks.ceph.mgr.z.smithi179.stderr:daemon-helper: command crashed with signal 15

From: /ceph/teuthology-archive/teuthology-2021-06-01_04:17:03-fs-pacific-distro-basic-smithi/6144511/teuthology.log

The test failed for other reasons, but we finally have this crash showing up in QA.

Actions #8

Updated by Neha Ojha almost 5 years ago

  • Project changed from Ceph to RADOS
Actions #9

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #10

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.15, v15.2.4, v15.2.5, v15.2.7, v15.2.8, v16.2.4, v16.2.5, v16.2.6, v16.2.7 added
Actions #11

Updated by Radoslaw Zarzynski almost 4 years ago

  • Crash signature (v1) updated (diff)

Still Need More Info, as the logs are no longer available after all these months.

Actions #12

Updated by Kotresh Hiremath Ravishankar over 1 year ago

I came across the mds crash from the tracker description on a downstream bz https://bugzilla.redhat.com/show_bug.cgi?id=2293509. Please find the analysis below.

Analysing the mds logs, I see that the ceph-mds daemons were terminated abruptly with SIGTERM around 22 times. This caused the mds to commit suicide and reach STATE::DNE. Upon being notified of this, the monitor failed the mds, which in turn got it blocklisted and triggered a respawn. In three of the 22 instances, the mds crashed with a segmentation fault without a successful respawn, with the traceback below. I think the mds crash is a race between the mds and the journal shutting down while the journal is still writing to the osd. The write failed with "(108) Cannot send after transport endpoint shutdown".

Traceback of mds crash from gdb

  Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ceph-mds --fsid=ddedfd5b-c6e1-446a-b6e1-d1719af7800f --keyring=/etc/ceph/keyrin'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/292532' in core file.
#0  0x00007fe98778154c in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fe97d88e640 (LWP 292532))]
Missing separate debuginfos, use: dnf debuginfo-install ceph-mds-17.2.6-196.el9cp.x86_64
(gdb) bt
#0  0x00007fe98778154c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007fe987734d06 in raise () from /lib64/libc.so.6
#2  0x0000558a568a9eba in handle_oneshot_fatal_signal(int) ()
#3  <signal handler called>
#4  0x00007fe987780eb6 in pthread_getname_np@GLIBC_2.12 () from /lib64/libc.so.6
#5  0x00007fe988079d45 in ceph::logging::Log::dump_recent() () from /usr/lib64/ceph/libceph-common.so.2
#6  0x0000558a565a935a in MDSDaemon::respawn() ()
#7  0x0000558a565929bd in Context::complete(int) ()
#8  0x0000558a565bc7af in MDSRank::handle_write_error(int) ()
#9  0x0000558a565eed34 in LambdaContext<MDSRank::MDSRank(int, ceph::fair_mutex&, std::shared_ptr<LogChannel>&, CommonSafeTimer<ceph::fair_mutex>&, Beacon&, std::unique_ptr<MDSMap, std::default_delete<MDSMap> >&, Messenger*, MonClient*, MgrClient*, Context*, Context*, boost::asio::io_context&)::{lambda(int)#1}>::finish(int) [clone .lto_priv.0] ()
#10 0x0000558a565929bd in Context::complete(int) ()
#11 0x00007fe987e3d635 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.2
#12 0x00007fe98777f802 in start_thread () from /lib64/libc.so.6
#13 0x00007fe98771f450 in clone3 () from /lib64/libc.so.6

The following mds logs confirm the sequence of crash

[khiremat@supportshell-1 ceph]$ pwd
/home/blr/khiremat/03850956/0090-must-gather_yrp_sum2ocp_odf_20240619_2.tar.gz/must-gather.local.1247707847673627001/registry-sum2ocp-nokia-local-8443-odf4-odf-must-gather-rhel9-sha256-a359be47ef2eb0823548ebaa399cad1cafe09cb47c283da1949b7da980803530/ceph

-------
   -81> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
   -80> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 mds.ocs-storagecluster-cephfilesystem-b *** got signal Terminated ***
   -79> 2024-06-18T02:24:57.885+0000 7fe98409b640  1 mds.ocs-storagecluster-cephfilesystem-b suicide! Wanted state up:active
   -78> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands cache drop
   -77> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands cache status
...
...
    -35> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands tag path
   -34> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 mds.beacon.ocs-storagecluster-cephfilesystem-b set_want_state: up:active -> down:dne
   -33> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 mds.beacon.ocs-storagecluster-cephfilesystem-b Sending beacon down:dne seq 49
   -32> 2024-06-18T02:24:57.885+0000 7fe98409b640 10 monclient: _send_mon_message to mon.c at v2:172.30.120.243:3300/0
...
...
   -22> 2024-06-18T02:24:59.125+0000 7fe981896640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-18T02:24:29.125723+0000)
   -21> 2024-06-18T02:25:00.125+0000 7fe981896640 10 monclient: tick
   -20> 2024-06-18T02:25:00.125+0000 7fe981896640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-18T02:24:30.125825+0000)
   -19> 2024-06-18T02:25:00.517+0000 7fe98409b640  1 mds.0.19 shutdown: shutting down rank 0
   -18> 2024-06-18T02:25:00.517+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a5dd37dc0 osd_map(253..253 src has 1..253) v4 from mon.2 v2:172.30.120.243:3300/0
   -17> 2024-06-18T02:25:00.518+0000 7fe97e890640  2 mds.0.cache Memory usage:  total 373192, rss 113508, heap 223448, baseline 182488, 3731 / 8384 inodes have caps, 3744 caps, 0.446565 caps per inode
   -16> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x10000001fe2
   -15> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009f368
   -14> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x100000942b0
   -13> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009ca31
   -12> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009f369
   -11> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a582931e0 mdsmap(e 25) v2 from mon.2 v2:172.30.120.243:3300/0
   -10> 2024-06-18T02:25:00.518+0000 7fe98409b640  5 mds.0.log shutdown
    -9> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58293380 mdsmap(e 4294967295) v2 from mon.2 v2:172.30.120.243:3300/0
    -8> 2024-06-18T02:25:00.518+0000 7fe98409b640  1 mds.0.journaler.mdlog(rw) shutdown
    -7> 2024-06-18T02:25:00.518+0000 7fe97c88c640  4 MDSIOContextBase::complete: dropping for stopping 26C_Locker_FileUpdate_finish
    -6> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58293520 mdsmap(e 26) v2 from mon.2 v2:172.30.120.243:3300/0
    -5> 2024-06-18T02:25:00.518+0000 7fe97d88e640 -1 mds.0.journaler.pq(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
    -4> 2024-06-18T02:25:00.518+0000 7fe97d88e640 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
    -3> 2024-06-18T02:25:00.518+0000 7fe97c88c640  4 MDSIOContextBase::complete: dropping for stopping 18C_PurgeStrayLogged
    -2> 2024-06-18T02:25:00.519+0000 7fe97d88e640 -1 mds.0.19 we have been blocklisted (fenced), respawning...
    -1> 2024-06-18T02:25:00.519+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58292680 mdsmap(e 27) v2 from mon.2 v2:172.30.120.243:3300/0
     0> 2024-06-18T02:25:00.519+0000 7fe97d88e640  1 mds.ocs-storagecluster-cephfilesystem-b respawn!
...
...

Logs confirming that ceph-mds was terminated by SIGTERM around 20 times (4 entries are repeated because of the dump).

[khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | wc -l
24

[khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" 
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T02:21:27.882+0000 7fc0a2dc2640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:   -62> 2024-06-18T02:21:27.882+0000 7fc0a2dc2640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:   -81> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T07:05:27.884+0000 7fe3a71b3640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T07:08:27.885+0000 7fa2ecf4d640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
...
...
ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-a.log.1.gz:2024-06-18T08:50:26.128+0000 7f72aa312640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-a.log.1.gz:2024-06-18T09:09:26.130+0000 7f2596aac640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:2024-06-18T08:49:08.783+0000 7f08e17fd640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:  -351> 2024-06-18T08:49:08.783+0000 7f08e17fd640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:2024-06-18T08:52:38.782+0000 7f2a03346640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:  -110> 2024-06-18T08:52:38.782+0000 7f2a03346640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0

I scrubbed the monitor logs around the time of the SIGTERM. Checking the ceph-mon logs around the time ceph-mds crashed, the monitor is not the one sending the SIGTERM.

2024-06-18T02:24:56.457+0000 7fbc4909a640  0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 13351/40053 objects degraded (33.333%), 181 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2024-06-18T02:24:56.457+0000 7fbc4909a640  1 mon.b@1(leader).osd e252 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 343932928 full_alloc: 348127232 kv_alloc: 318767104
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 prepare_beacon: DNE from [mds.ocs-storagecluster-cephfilesystem-b{0:248126} state up:active seq 41 join_fscid=1 addr [v2:10.128.6.28:6800/4141633146,v1:10.128.6.28:6801/4141633146] compat {c=[1],r=[1],i=[7ff]}]
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 fail_mds_gid 248126 mds.ocs-storagecluster-cephfilesystem-b role 0
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).osd e252 do_prune osdmap full prune enabled
2024-06-18T02:24:57.889+0000 7fbc44090640  1 mon.b@1(leader).osd e253 e253: 6 total, 4 up, 6 in
2024-06-18T02:24:57.891+0000 7fbc44090640  0 log_channel(cluster) log [DBG] : osdmap e253: 6 total, 4 up, 6 in
2024-06-18T02:24:58.462+0000 7fbc4909a640  0 log_channel(cluster) log [WRN] : Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2024-06-18T02:24:58.462+0000 7fbc4909a640 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2024-06-18T02:24:58.464+0000 7fbc44090640  0 mon.b@1(leader).mds e25 new map
2024-06-18T02:24:58.464+0000 7fbc44090640  0 mon.b@1(leader).mds e25 print_map
e25
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ocs-storagecluster-cephfilesystem' (1)
fs_name ocs-storagecluster-cephfilesystem
epoch   25
flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
created 2024-06-12T10:47:45.025981+0000
modified        2024-06-18T02:24:58.462396+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  253
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {}
failed  0
damaged
stopped
data_pools      [11]
metadata_pool   10
inline_data     disabled
balancer
standby_count_wanted    1
[mds.ocs-storagecluster-cephfilesystem-a{0:407035} state up:standby-replay seq 1 join_fscid=1 addr [v2:10.128.4.35:6800/1723782331,v1:10.128.4.35:6801/1723782331] compat {c=[1],r=[1],i=[7ff]}]

2024-06-18T02:24:58.466+0000 7fbc44090640  1 mon.b@1(leader).mds e25  taking over failed mds.0 with 407035/ocs-storagecluster-cephfilesystem-a [v2:10.128.4.35:6800/1723782331,v1:10.128.4.35:6801/1723782331]
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [INF] : Standby daemon mds.ocs-storagecluster-cephfilesystem-a assigned to filesystem ocs-storagecluster-cephfilesystem as rank 0
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [WRN] : Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [INF] : Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2024-06-18T02:24:58.467+0000 7fbc44090640  0 log_channel(cluster) log [DBG] : fsmap ocs-storagecluster-cephfilesystem:0/1 1 up:standby-replay, 1 failed
2024-06-18T02:24:58.469+0000 7fbc44090640  0 mon.b@1(leader).mds e26 new map
2024-06-18T02:24:58.469+0000 7fbc44090640  0 mon.b@1(leader).mds e26 print_map
...
...

Instances of ceph-mon failing mds:

[khiremat@supportshell-1 ceph]$ find ./ceph_daemon_log_sum2wkr0* | grep ceph-mon | xargs grep fail_mds_gid
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:26.129+0000 7fbc46895640  1 mon.b@1(leader).mds e9 fail_mds_gid 14545 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:27.885+0000 7fbc46895640  1 mon.b@1(leader).mds e10 fail_mds_gid 14600 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:58.121+0000 7fbc46895640  1 mon.b@1(leader).mds e14 fail_mds_gid 248120 mds.ocs-storagecluster-cephfilesystem-b role -1
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:24:26.130+0000 7fbc46895640  1 mon.b@1(leader).mds e18 fail_mds_gid 406837 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 fail_mds_gid 248126 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:05:26.129+0000 7fbc46895640  1 mon.b@1(leader).mds e34 fail_mds_gid 407035 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:05:27.884+0000 7fbc46895640  1 mon.b@1(leader).mds e37 fail_mds_gid 407059 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:08:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e45 fail_mds_gid 419368 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:08:27.890+0000 7fbc46895640  1 mon.b@1(leader).mds e46 fail_mds_gid 258236 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:41:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e58 fail_mds_gid 419587 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:41:27.884+0000 7fbc46895640  1 mon.b@1(leader).mds e61 fail_mds_gid 258467 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:44:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e69 fail_mds_gid 423271 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:44:27.885+0000 7fbc46895640  1 mon.b@1(leader).mds e70 fail_mds_gid 262358 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:07:31.032+0000 7fbc46895640  1 mon.b@1(leader).mds e119 fail_mds_gid 466452 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:09:26.130+0000 7fbc46895640  1 mon.b@1(leader).mds e121 fail_mds_gid 424714 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:10:49.602+0000 7fbc46895640  1 mon.b@1(leader).mds e129 fail_mds_gid 474632 mds.ocs-storagecluster-cephfilesystem-b role 0

There is no evidence in the logs of who sent the SIGTERM multiple times around the following times.

---------------------------
khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | awk '{print $1}'| awk -F "gz:" '{print $2}' | sort

2024-06-18T02:21:26.128+0000
2024-06-18T02:21:27.882+0000
2024-06-18T02:24:26.130+0000
2024-06-18T02:24:57.885+0000
2024-06-18T07:05:26.128+0000
2024-06-18T07:05:27.884+0000
2024-06-18T07:08:26.127+0000
2024-06-18T07:08:27.885+0000
2024-06-18T08:41:26.127+0000
2024-06-18T08:41:27.884+0000
2024-06-18T08:44:26.128+0000
2024-06-18T08:44:27.885+0000
2024-06-18T08:47:26.133+0000
2024-06-18T08:50:26.128+0000
2024-06-18T09:09:26.130+0000
2024-06-18T09:10:49.600+0000

I couldn't verify the rook operator logs around this time, as the provided must-gather didn't include rook-operator logs for this period. One suspect is that the rook operator could have sent the SIGTERM.

The ceph-mds crash happened because pthread_getname_np segfaulted on an invalid thread id. I think this is because the thread no longer existed, as the mds was shutting down.

Actions #13

Updated by Kotresh Hiremath Ravishankar over 1 year ago

  • Project changed from RADOS to CephFS
Actions #14

Updated by Venky Shankar over 1 year ago

Kotresh Hiremath Ravishankar wrote in #note-12:

I came across the mds crash from the tracker description on a downstream bz https://bugzilla.redhat.com/show_bug.cgi?id=2293509. Please find the analysis below.

Analysing the mds logs, I see that the ceph-mds daemons were terminated abruptly with SIGTERM around 22 times. This caused the mds to commit suicide and reach STATE::DNE. Upon being notified of this, the monitor failed the mds, which in turn got it blocklisted and triggered a respawn. In three of the 22 instances, the mds crashed with a segmentation fault without a successful respawn, with the traceback below. I think the mds crash is a race between the mds and the journal shutting down while the journal is still writing to the osd. The write failed with "(108) Cannot send after transport endpoint shutdown".

Traceback of mds crash from gdb
[...]

The following mds logs confirm the sequence of crash

[...]

Logs confirming that ceph-mds was terminated by SIGTERM around 20 times (4 entries are repeated because of the dump).

[...]

I scrubbed the monitor logs around the time of the SIGTERM. Checking the ceph-mon logs around the time ceph-mds crashed, the monitor is not the one sending the SIGTERM.

[...]

Instances of ceph-mon failing mds:

[...]

There is no evidence in the logs of who sent the SIGTERM multiple times around the following times.

---------------------------
khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | awk '{print $1}'| awk -F "gz:" '{print $2}' | sort

[...]

I couldn't verify the rook operator logs around this time, as the provided must-gather didn't include rook-operator logs for this period. One suspect is that the rook operator could have sent the SIGTERM.

rook-operator could fail the MDS if it finds that the active MDS is unresponsive, but it probably does not send a SIGTERM signal directly.

The ceph-mds crash happened because pthread_getname_np segfaulted on an invalid thread id. I think this is because the thread no longer existed, as the mds was shutting down.

Has the thread already joined? If not, the pthread call should still succeed.

Actions #15

Updated by Venky Shankar over 1 year ago

  • Related to Bug #66923: mds: crash with core dump during respawn added
Actions #16

Updated by Konstantin Shalygin over 1 year ago

  • Backport deleted (pacific)
Actions #17

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from Need More Info to In Progress
  • Assignee set to Milind Changire
  • Target version changed from v17.0.0 to v20.0.0
  • Backport set to quincy,reef,squid
Actions #18

Updated by Venky Shankar over 1 year ago

Had a conversation with @Milind Changire. The logging infrastructure maintains an Entry structure

class Entry {
public:
  using time = log_time;

  Entry() = delete;
  Entry(short pr, short sub) :
    m_stamp(clock().now()),
    m_thread(pthread_self()),
    m_prio(pr),
    m_subsys(sub)
  {}
  Entry(const Entry &) = default;
  Entry& operator=(const Entry &) = default;
  Entry(Entry &&e) = default;
  Entry& operator=(Entry &&e) = default;
  virtual ~Entry() = default;

  virtual std::string_view strv() const = 0;
  virtual std::size_t size() const = 0;

  time m_stamp;
  pthread_t m_thread;
  short m_prio, m_subsys;

  static log_clock& clock() {
    static log_clock clock;
    return clock;
  }
};

It would suffice to include the thread name in this structure, to avoid calling pthread_getname_np() with a thread id (m_thread) that may be invalid (i.e., for a thread that has already terminated).

Actions #19

Updated by Milind Changire over 1 year ago

  • Pull request ID set to 60058

Pull request has been created

Actions #20

Updated by Venky Shankar over 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #21

Updated by Konstantin Shalygin over 1 year ago

  • Affected Versions v16.2.15, v17.2.7, v18.2.4, v19.2.1 added
  • Affected Versions deleted (v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.15, v15.2.4, v15.2.5, v15.2.7, v15.2.8, v16.2.1, v16.2.4, v16.2.5, v16.2.6, v16.2.7)
Actions #22

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #23

Updated by Upkeep Bot over 1 year ago

Actions #24

Updated by Upkeep Bot over 1 year ago

Actions #25

Updated by Upkeep Bot over 1 year ago

Actions #26

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #27

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
Actions #28

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 6e728815272e79e9934c24bb5a26d290b4aef65a
  • Fixed In set to v19.3.0-5578-g6e72881527
  • Upkeep Timestamp set to 2025-07-14T17:51:35+00:00
Actions #29

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1782
  • Upkeep Timestamp changed from 2025-07-14T17:51:35+00:00 to 2025-11-01T01:28:38+00:00