Bug #50743

closed

*: crash in pthread_getname_np

Added by Yaarit Hatuka almost 5 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:

100%

Source:
Q/A
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-5578-g6e72881527
Released In:
v20.2.0~1782
Upkeep Timestamp:
2025-11-01T01:28:38+00:00

Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

Sanitized backtrace:

    /lib64/libpthread.so.0(
    pthread_getname_np()
    ceph::logging::Log::dump_recent()
    MDSDaemon::respawn()
    Context::complete(int)
    MDSRank::respawn()
    MDSRank::handle_write_error(int)
    /usr/bin/ceph-mds(
    Context::complete(int)
    Finisher::finisher_thread_entry()
    /lib64/libpthread.so.0(
    clone()

Crash dump sample:
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fca48244b20]",
        "pthread_getname_np()",
        "(ceph::logging::Log::dump_recent()+0x4b3) [0x7fca4980c643]",
        "(MDSDaemon::respawn()+0x15b) [0x55ca473547fb]",
        "(Context::complete(int)+0xd) [0x55ca4735c9ed]",
        "(MDSRank::respawn()+0x1c) [0x55ca4736289c]",
        "(MDSRank::handle_write_error(int)+0x1a6) [0x55ca47366a26]",
        "/usr/bin/ceph-mds(+0x1bef20) [0x55ca47366f20]",
        "(Context::complete(int)+0xd) [0x55ca4735c9ed]",
        "(Finisher::finisher_thread_entry()+0x1a5) [0x7fca495380b5]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fca4823a14a]",
        "clone()" 
    ],
    "ceph_version": "16.2.1",
    "crash_id": "2021-05-08T02:06:04.479049Z_c595a994-0782-4309-9f1b-e2f7dedaf821",
    "entity_name": "mds.20297e33b547c2f168d55f24c5d7328709e9b647",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "7a4dc07de98f9fe5951207ab4b4b599270f9729af0b338a4212d3bf9335cf310",
    "timestamp": "2021-05-08T02:06:04.479049Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-70-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#78-Ubuntu SMP Fri Mar 19 13:29:52 UTC 2021" 
}


Related issues 4 (0 open, 4 closed)

Related to CephFS - Bug #66923: mds: crash with core dump during respawn (Duplicate)

Copied to CephFS - Backport #68651: reef: *: crash in pthread_getname_np (Resolved, Milind Changire)
Copied to CephFS - Backport #68652: squid: *: crash in pthread_getname_np (Resolved, Milind Changire)
Copied to CephFS - Backport #68653: quincy: *: crash in pthread_getname_np (Rejected, Milind Changire)
Actions #1

Updated by Yaarit Hatuka almost 5 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.1 added
Actions #2

Updated by Neha Ojha almost 5 years ago

  • Project changed from RADOS to CephFS
Actions #3

Updated by Patrick Donnelly almost 5 years ago

  • Subject changed from crash: /lib64/libpthread.so.0( to mds: crash in pthread_getname_np
  • Status changed from New to Need More Info

How do we know what the signal number was? It's not clear to me what to do with this. I don't see anything obviously wrong with the call to pthread_getname_np in Log::dump_recent.

Actions #4

Updated by Yaarit Hatuka almost 5 years ago

Hi Patrick,

We don't have the signal number yet in the telemetry crash reports.

You can see other crash events with the same signature here:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

in the Crashes table:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?viewPanel=26&orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

If it seems like this is not a bug after all, you can change the status to Rejected or Won't Fix.

Actions #5

Updated by Patrick Donnelly almost 5 years ago

Yaarit Hatuka wrote:

Hi Patrick,

We don't have the signal number yet in the telemetry crash reports.

You can see other crash events with the same signature here:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

in the Crashes table:
http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?viewPanel=26&orgId=1&var-sig_v2=8032fa5f1f2107af12b68e6fe2586cc5fe2c1ad99a09d4b1bbfa47a813852039

If it seems like this is not a bug after all, you can change the status to Rejected or Won't Fix.

Sorry, how do you conclude it's not a bug after all?

Actions #6

Updated by Yaarit Hatuka almost 5 years ago

oh, I mean in general, not necessarily in this case.

This was opened automatically by a telemetry-to-redmine bot that I'm working on, which will import the telemetry crash reports into redmine. It was opened with my user, but future issues will be opened with the telemetry bot user.

Actions #7

Updated by Patrick Donnelly almost 5 years ago

  • Project changed from CephFS to Ceph
  • Subject changed from mds: crash in pthread_getname_np to *: crash in pthread_getname_np
  • Target version set to v17.0.0
  • Source changed from Telemetry to Q/A
  • Backport set to pacific
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr:*** Caught signal (Segmentation fault) **
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: in thread 7f819c376700 thread_name:ceph-mon
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: ceph version 16.2.4-225-gf9084200 (f908420004cc81a30edb2b252b4d92f50c526280) pacific (stable)
2021-06-01T05:31:28.917 INFO:tasks.ceph.mon.a.smithi053.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f8191010dc0]
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 2: pthread_getname_np()
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 3: (ceph::logging::Log::dump_recent()+0x4b3) [0x7f81938845a3]
2021-06-01T05:31:28.918 INFO:tasks.ceph.mon.a.smithi053.stderr: 4: ceph-mon(+0x53110b) [0x5594d213110b]
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 5: /lib64/libpthread.so.0(+0x12dc0) [0x7f8191010dc0]
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 6: gsignal()
2021-06-01T05:31:28.919 INFO:tasks.ceph.mon.a.smithi053.stderr: 7: abort()
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 8: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f819351059d]
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 9: /usr/lib64/ceph/libceph-common.so.2(+0x276766) [0x7f8193510766]
2021-06-01T05:31:28.920 INFO:tasks.ceph.mon.a.smithi053.stderr: 10: (Monitor::~Monitor()+0xb35) [0x5594d1ee0995]
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 11: (Monitor::~Monitor()+0xd) [0x5594d1ee09ed]
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 12: main()
2021-06-01T05:31:28.921 INFO:tasks.ceph.mon.a.smithi053.stderr: 13: __libc_start_main()
2021-06-01T05:31:28.922 INFO:tasks.ceph.mon.a.smithi053.stderr: 14: _start()
2021-06-01T05:31:28.980 INFO:tasks.ceph.mgr.z.smithi179.stderr:daemon-helper: command crashed with signal 15

From: /ceph/teuthology-archive/teuthology-2021-06-01_04:17:03-fs-pacific-distro-basic-smithi/6144511/teuthology.log

The test failed for other reasons, but we finally have this crash showing up in QA.

Actions #8

Updated by Neha Ojha almost 5 years ago

  • Project changed from Ceph to RADOS
Actions #9

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #10

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.15, v15.2.4, v15.2.5, v15.2.7, v15.2.8, v16.2.4, v16.2.5, v16.2.6, v16.2.7 added
Actions #11

Updated by Radoslaw Zarzynski almost 4 years ago

  • Crash signature (v1) updated (diff)

Still Need More Info, as the logs are no longer available after all these months.

Actions #12

Updated by Kotresh Hiremath Ravishankar over 1 year ago

I came across the mds crash from the tracker description on a downstream bz https://bugzilla.redhat.com/show_bug.cgi?id=2293509. Please find the analysis below.

Analysing the mds logs, I see that the ceph-mds daemons were terminated abruptly with SIGTERM around 22 times. This caused the mds to commit suicide and reach STATE::DNE. Upon being notified of this, the monitor failed the mds, which in turn got it blocklisted and triggered a respawn. In three of the 22 instances, the mds crashed with a segmentation fault without a successful respawn, with the traceback below. I think the mds crash is a race between the mds and the journal shutting down while the journal is still writing to the osd. The write failed with "(108) Cannot send after transport endpoint shutdown".

Traceback of mds crash from gdb

  Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `ceph-mds --fsid=ddedfd5b-c6e1-446a-b6e1-d1719af7800f --keyring=/etc/ceph/keyrin'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/292532' in core file.
#0  0x00007fe98778154c in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fe97d88e640 (LWP 292532))]
Missing separate debuginfos, use: dnf debuginfo-install ceph-mds-17.2.6-196.el9cp.x86_64
(gdb) bt
#0  0x00007fe98778154c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007fe987734d06 in raise () from /lib64/libc.so.6
#2  0x0000558a568a9eba in handle_oneshot_fatal_signal(int) ()
#3  <signal handler called>
#4  0x00007fe987780eb6 in pthread_getname_np@GLIBC_2.12 () from /lib64/libc.so.6
#5  0x00007fe988079d45 in ceph::logging::Log::dump_recent() () from /usr/lib64/ceph/libceph-common.so.2
#6  0x0000558a565a935a in MDSDaemon::respawn() ()
#7  0x0000558a565929bd in Context::complete(int) ()
#8  0x0000558a565bc7af in MDSRank::handle_write_error(int) ()
#9  0x0000558a565eed34 in LambdaContext<MDSRank::MDSRank(int, ceph::fair_mutex&, std::shared_ptr<LogChannel>&, CommonSafeTimer<ceph::fair_mutex>&, Beacon&, std::unique_ptr<MDSMap, std::default_delete<MDSMap> >&, Messenger*, MonClient*, MgrClient*, Context*, Context*, boost::asio::io_context&)::{lambda(int)#1}>::finish(int) [clone .lto_priv.0] ()
#10 0x0000558a565929bd in Context::complete(int) ()
#11 0x00007fe987e3d635 in Finisher::finisher_thread_entry() () from /usr/lib64/ceph/libceph-common.so.2
#12 0x00007fe98777f802 in start_thread () from /lib64/libc.so.6
#13 0x00007fe98771f450 in clone3 () from /lib64/libc.so.6

The following mds logs confirm the sequence of crash

[khiremat@supportshell-1 ceph]$ pwd
/home/blr/khiremat/03850956/0090-must-gather_yrp_sum2ocp_odf_20240619_2.tar.gz/must-gather.local.1247707847673627001/registry-sum2ocp-nokia-local-8443-odf4-odf-must-gather-rhel9-sha256-a359be47ef2eb0823548ebaa399cad1cafe09cb47c283da1949b7da980803530/ceph

-------
   -81> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
   -80> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 mds.ocs-storagecluster-cephfilesystem-b *** got signal Terminated ***
   -79> 2024-06-18T02:24:57.885+0000 7fe98409b640  1 mds.ocs-storagecluster-cephfilesystem-b suicide! Wanted state up:active
   -78> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands cache drop
   -77> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands cache status
...
...
    -35> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 asok(0x558a574c0000) unregister_commands tag path
   -34> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 mds.beacon.ocs-storagecluster-cephfilesystem-b set_want_state: up:active -> down:dne
   -33> 2024-06-18T02:24:57.885+0000 7fe98409b640  5 mds.beacon.ocs-storagecluster-cephfilesystem-b Sending beacon down:dne seq 49
   -32> 2024-06-18T02:24:57.885+0000 7fe98409b640 10 monclient: _send_mon_message to mon.c at v2:172.30.120.243:3300/0
...
...
   -22> 2024-06-18T02:24:59.125+0000 7fe981896640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-18T02:24:29.125723+0000)
   -21> 2024-06-18T02:25:00.125+0000 7fe981896640 10 monclient: tick
   -20> 2024-06-18T02:25:00.125+0000 7fe981896640 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-18T02:24:30.125825+0000)
   -19> 2024-06-18T02:25:00.517+0000 7fe98409b640  1 mds.0.19 shutdown: shutting down rank 0
   -18> 2024-06-18T02:25:00.517+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a5dd37dc0 osd_map(253..253 src has 1..253) v4 from mon.2 v2:172.30.120.243:3300/0
   -17> 2024-06-18T02:25:00.518+0000 7fe97e890640  2 mds.0.cache Memory usage:  total 373192, rss 113508, heap 223448, baseline 182488, 3731 / 8384 inodes have caps, 3744 caps, 0.446565 caps per inode
   -16> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x10000001fe2
   -15> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009f368
   -14> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x100000942b0
   -13> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009ca31
   -12> 2024-06-18T02:25:00.518+0000 7fe97e890640  4 mds.0.purge_queue push: pushing inode 0x1000009f369
   -11> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a582931e0 mdsmap(e 25) v2 from mon.2 v2:172.30.120.243:3300/0
   -10> 2024-06-18T02:25:00.518+0000 7fe98409b640  5 mds.0.log shutdown
    -9> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58293380 mdsmap(e 4294967295) v2 from mon.2 v2:172.30.120.243:3300/0
    -8> 2024-06-18T02:25:00.518+0000 7fe98409b640  1 mds.0.journaler.mdlog(rw) shutdown
    -7> 2024-06-18T02:25:00.518+0000 7fe97c88c640  4 MDSIOContextBase::complete: dropping for stopping 26C_Locker_FileUpdate_finish
    -6> 2024-06-18T02:25:00.518+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58293520 mdsmap(e 26) v2 from mon.2 v2:172.30.120.243:3300/0
    -5> 2024-06-18T02:25:00.518+0000 7fe97d88e640 -1 mds.0.journaler.pq(rw) _finish_flush got (108) Cannot send after transport endpoint shutdown
    -4> 2024-06-18T02:25:00.518+0000 7fe97d88e640 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
    -3> 2024-06-18T02:25:00.518+0000 7fe97c88c640  4 MDSIOContextBase::complete: dropping for stopping 18C_PurgeStrayLogged
    -2> 2024-06-18T02:25:00.519+0000 7fe97d88e640 -1 mds.0.19 we have been blocklisted (fenced), respawning...
    -1> 2024-06-18T02:25:00.519+0000 7fe982898640  0 ms_deliver_dispatch: unhandled message 0x558a58292680 mdsmap(e 27) v2 from mon.2 v2:172.30.120.243:3300/0
     0> 2024-06-18T02:25:00.519+0000 7fe97d88e640  1 mds.ocs-storagecluster-cephfilesystem-b respawn!
...
...

Logs confirming that ceph-mds was terminated by SIGTERM around 20 times (4 entries are repeated because of the dump).

[khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | wc -l
24

[khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" 
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T02:21:27.882+0000 7fc0a2dc2640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:   -62> 2024-06-18T02:21:27.882+0000 7fc0a2dc2640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:   -81> 2024-06-18T02:24:57.885+0000 7fe98409b640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T07:05:27.884+0000 7fe3a71b3640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr00.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log.1.gz:2024-06-18T07:08:27.885+0000 7fa2ecf4d640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
...
...
ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-a.log.1.gz:2024-06-18T08:50:26.128+0000 7f72aa312640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-a.log.1.gz:2024-06-18T09:09:26.130+0000 7f2596aac640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:2024-06-18T08:49:08.783+0000 7f08e17fd640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:  -351> 2024-06-18T08:49:08.783+0000 7f08e17fd640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:2024-06-18T08:52:38.782+0000 7f2a03346640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
ceph_daemon_log_sum2wkr02.sum2ocp.nokia.local/log/ceph-mds.ocs-storagecluster-cephfilesystem-b.log:  -110> 2024-06-18T08:52:38.782+0000 7f2a03346640 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0

I scrubbed the monitor logs around the time of the SIGTERM. Checking the ceph-mon logs around the time ceph-mds crashed, the monitor is not the one sending the SIGTERM.

2024-06-18T02:24:56.457+0000 7fbc4909a640  0 log_channel(cluster) log [WRN] : Health check update: Degraded data redundancy: 13351/40053 objects degraded (33.333%), 181 pgs degraded, 201 pgs undersized (PG_DEGRADED)
2024-06-18T02:24:56.457+0000 7fbc4909a640  1 mon.b@1(leader).osd e252 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 343932928 full_alloc: 348127232 kv_alloc: 318767104
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 prepare_beacon: DNE from [mds.ocs-storagecluster-cephfilesystem-b{0:248126} state up:active seq 41 join_fscid=1 addr [v2:10.128.6.28:6800/4141633146,v1:10.128.6.28:6801/4141633146] compat {c=[1],r=[1],i=[7ff]}]
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 fail_mds_gid 248126 mds.ocs-storagecluster-cephfilesystem-b role 0
2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).osd e252 do_prune osdmap full prune enabled
2024-06-18T02:24:57.889+0000 7fbc44090640  1 mon.b@1(leader).osd e253 e253: 6 total, 4 up, 6 in
2024-06-18T02:24:57.891+0000 7fbc44090640  0 log_channel(cluster) log [DBG] : osdmap e253: 6 total, 4 up, 6 in
2024-06-18T02:24:58.462+0000 7fbc4909a640  0 log_channel(cluster) log [WRN] : Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2024-06-18T02:24:58.462+0000 7fbc4909a640 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2024-06-18T02:24:58.464+0000 7fbc44090640  0 mon.b@1(leader).mds e25 new map
2024-06-18T02:24:58.464+0000 7fbc44090640  0 mon.b@1(leader).mds e25 print_map
e25
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'ocs-storagecluster-cephfilesystem' (1)
fs_name ocs-storagecluster-cephfilesystem
epoch   25
flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
created 2024-06-12T10:47:45.025981+0000
modified        2024-06-18T02:24:58.462396+0000
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
required_client_features        {}
last_failure    0
last_failure_osd_epoch  253
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {}
failed  0
damaged
stopped
data_pools      [11]
metadata_pool   10
inline_data     disabled
balancer
standby_count_wanted    1
[mds.ocs-storagecluster-cephfilesystem-a{0:407035} state up:standby-replay seq 1 join_fscid=1 addr [v2:10.128.4.35:6800/1723782331,v1:10.128.4.35:6801/1723782331] compat {c=[1],r=[1],i=[7ff]}]

2024-06-18T02:24:58.466+0000 7fbc44090640  1 mon.b@1(leader).mds e25  taking over failed mds.0 with 407035/ocs-storagecluster-cephfilesystem-a [v2:10.128.4.35:6800/1723782331,v1:10.128.4.35:6801/1723782331]
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [INF] : Standby daemon mds.ocs-storagecluster-cephfilesystem-a assigned to filesystem ocs-storagecluster-cephfilesystem as rank 0
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [WRN] : Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)
2024-06-18T02:24:58.466+0000 7fbc44090640  0 log_channel(cluster) log [INF] : Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2024-06-18T02:24:58.467+0000 7fbc44090640  0 log_channel(cluster) log [DBG] : fsmap ocs-storagecluster-cephfilesystem:0/1 1 up:standby-replay, 1 failed
2024-06-18T02:24:58.469+0000 7fbc44090640  0 mon.b@1(leader).mds e26 new map
2024-06-18T02:24:58.469+0000 7fbc44090640  0 mon.b@1(leader).mds e26 print_map
...
...

Instances of ceph-mon failing mds:

[khiremat@supportshell-1 ceph]$ find ./ceph_daemon_log_sum2wkr0* | grep ceph-mon | xargs grep fail_mds_gid
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:26.129+0000 7fbc46895640  1 mon.b@1(leader).mds e9 fail_mds_gid 14545 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:27.885+0000 7fbc46895640  1 mon.b@1(leader).mds e10 fail_mds_gid 14600 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:21:58.121+0000 7fbc46895640  1 mon.b@1(leader).mds e14 fail_mds_gid 248120 mds.ocs-storagecluster-cephfilesystem-b role -1
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:24:26.130+0000 7fbc46895640  1 mon.b@1(leader).mds e18 fail_mds_gid 406837 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T02:24:57.886+0000 7fbc46895640  1 mon.b@1(leader).mds e24 fail_mds_gid 248126 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:05:26.129+0000 7fbc46895640  1 mon.b@1(leader).mds e34 fail_mds_gid 407035 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:05:27.884+0000 7fbc46895640  1 mon.b@1(leader).mds e37 fail_mds_gid 407059 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:08:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e45 fail_mds_gid 419368 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T07:08:27.890+0000 7fbc46895640  1 mon.b@1(leader).mds e46 fail_mds_gid 258236 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:41:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e58 fail_mds_gid 419587 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:41:27.884+0000 7fbc46895640  1 mon.b@1(leader).mds e61 fail_mds_gid 258467 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:44:26.128+0000 7fbc46895640  1 mon.b@1(leader).mds e69 fail_mds_gid 423271 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T08:44:27.885+0000 7fbc46895640  1 mon.b@1(leader).mds e70 fail_mds_gid 262358 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:07:31.032+0000 7fbc46895640  1 mon.b@1(leader).mds e119 fail_mds_gid 466452 mds.ocs-storagecluster-cephfilesystem-b role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:09:26.130+0000 7fbc46895640  1 mon.b@1(leader).mds e121 fail_mds_gid 424714 mds.ocs-storagecluster-cephfilesystem-a role 0
./ceph_daemon_log_sum2wkr01.sum2ocp.nokia.local/log/zcat.ceph-mon.b.log.1.gz:2024-06-18T09:10:49.602+0000 7fbc46895640  1 mon.b@1(leader).mds e129 fail_mds_gid 474632 mds.ocs-storagecluster-cephfilesystem-b role 0

There is no evidence in the logs of who sent the SIGTERM multiple times around the following times.

---------------------------
khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | awk '{print $1}'| awk -F "gz:" '{print $2}' | sort

2024-06-18T02:21:26.128+0000
2024-06-18T02:21:27.882+0000
2024-06-18T02:24:26.130+0000
2024-06-18T02:24:57.885+0000
2024-06-18T07:05:26.128+0000
2024-06-18T07:05:27.884+0000
2024-06-18T07:08:26.127+0000
2024-06-18T07:08:27.885+0000
2024-06-18T08:41:26.127+0000
2024-06-18T08:41:27.884+0000
2024-06-18T08:44:26.128+0000
2024-06-18T08:44:27.885+0000
2024-06-18T08:47:26.133+0000
2024-06-18T08:50:26.128+0000
2024-06-18T09:09:26.130+0000
2024-06-18T09:10:49.600+0000

I couldn't verify the rook operator logs around this time, as the provided must-gather didn't include rook-operator logs for this period. One suspect is that the rook operator could have sent the SIGTERM.

The ceph-mds crash happened because pthread_getname_np segfaulted on an invalid thread id. I think this is because the thread no longer existed, as the mds was shutting down.

Actions #13

Updated by Kotresh Hiremath Ravishankar over 1 year ago

  • Project changed from RADOS to CephFS
Actions #14

Updated by Venky Shankar over 1 year ago

Kotresh Hiremath Ravishankar wrote in #note-12:

I came across the mds crash from the tracker description on a downstream bz https://bugzilla.redhat.com/show_bug.cgi?id=2293509. Please find the analysis below.

Analysing the mds logs, I see that the ceph-mds daemons were terminated abruptly with SIGTERM around 22 times. This caused the mds to commit suicide and reach STATE::DNE. Upon being notified of this, the monitor failed the mds, which in turn got it blocklisted and triggered a respawn. In three of the 22 instances, the mds crashed with a segmentation fault without a successful respawn, with the traceback below. I think the mds crash is a race between the mds and the journal shutting down while the journal is still writing to the osd. The write failed with "(108) Cannot send after transport endpoint shutdown".

Traceback of mds crash from gdb
[...]

The following mds logs confirm the sequence of crash

[...]

Logs confirming that ceph-mds was terminated by SIGTERM around 20 times (4 entries are repeated because of the dump).

[...]

I scrubbed the monitor logs around the time of the SIGTERM. Checking the ceph-mon logs around the time ceph-mds crashed, the monitor is not the one sending the SIGTERM.

[...]

Instances of ceph-mon failing mds:

[...]

There is no evidence in the logs of who sent the SIGTERM multiple times around the following times.

---------------------------
khiremat@supportshell-1 ceph]$ find ceph_daemon_log_sum2wkr0* | grep ceph-mds | xargs zgrep "Terminated from Kernel" | awk '{print $1}'| awk -F "gz:" '{print $2}' | sort

[...]

I couldn't verify the rook operator logs around this time, as the provided must-gather didn't include rook-operator logs for this period. One suspect is that the rook operator could have sent the SIGTERM.

rook-operator could fail the MDS if it finds that the active MDS is unresponsive, but it probably does not send a SIGTERM signal directly.

The ceph-mds crash happened because pthread_getname_np segfaulted on an invalid thread id. I think this is because the thread no longer existed, as the mds was shutting down.

Has the thread already joined? If not, the pthread call should still succeed.

Actions #15

Updated by Venky Shankar over 1 year ago

  • Related to Bug #66923: mds: crash with core dump during respawn added
Actions #16

Updated by Konstantin Shalygin over 1 year ago

  • Backport deleted (pacific)
Actions #17

Updated by Venky Shankar over 1 year ago

  • Category set to Correctness/Safety
  • Status changed from Need More Info to In Progress
  • Assignee set to Milind Changire
  • Target version changed from v17.0.0 to v20.0.0
  • Backport set to quincy,reef,squid
Actions #18

Updated by Venky Shankar over 1 year ago

Had a conversation with @Milind Changire. The logging infrastructure maintains an Entry structure

class Entry {
public:
  using time = log_time;

  Entry() = delete;
  Entry(short pr, short sub) :
    m_stamp(clock().now()),
    m_thread(pthread_self()),
    m_prio(pr),
    m_subsys(sub)
  {}
  Entry(const Entry &) = default;
  Entry& operator=(const Entry &) = default;
  Entry(Entry &&e) = default;
  Entry& operator=(Entry &&e) = default;
  virtual ~Entry() = default;

  virtual std::string_view strv() const = 0;
  virtual std::size_t size() const = 0;

  time m_stamp;
  pthread_t m_thread;
  short m_prio, m_subsys;

  static log_clock& clock() {
    static log_clock clock;
    return clock;
  }
};

It would suffice to include the thread name in this structure, to avoid calling pthread_getname_np() with a thread id (m_thread) that may be invalid (i.e., for a thread that has already terminated).

Actions #19

Updated by Milind Changire over 1 year ago

  • Pull request ID set to 60058

Pull request has been created

Actions #20

Updated by Venky Shankar over 1 year ago

  • Status changed from In Progress to Fix Under Review
Actions #21

Updated by Konstantin Shalygin over 1 year ago

  • Affected Versions v16.2.15, v17.2.7, v18.2.4, v19.2.1 added
  • Affected Versions deleted (v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.15, v15.2.4, v15.2.5, v15.2.7, v15.2.8, v16.2.1, v16.2.4, v16.2.5, v16.2.6, v16.2.7)
Actions #22

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #23

Updated by Upkeep Bot over 1 year ago

Actions #24

Updated by Upkeep Bot over 1 year ago

Actions #25

Updated by Upkeep Bot over 1 year ago

Actions #26

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #27

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
Actions #28

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 6e728815272e79e9934c24bb5a26d290b4aef65a
  • Fixed In set to v19.3.0-5578-g6e72881527
  • Upkeep Timestamp set to 2025-07-14T17:51:35+00:00
Actions #29

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1782
  • Upkeep Timestamp changed from 2025-07-14T17:51:35+00:00 to 2025-11-01T01:28:38+00:00