Bug #39150
closed
mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum
Description
2019-04-06T09:27:34.791 INFO:tasks.ceph.mds.b:Sent signal 15
2019-04-06T09:27:34.791 INFO:tasks.ceph.mon.a:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.c:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.b:Sent signal 15
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 received signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i a (PID: 17117) UID: 0
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 mon.a@0(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 received signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i c (PID: 17101) UID: 0
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 mon.c@2(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 received signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i b (PID: 17099) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 mon.b@1(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 received signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mds -f --cluster ceph -i b (PID: 19872) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 mds.b *** got signal Terminated ***
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f26b7a03340 time 2019-04-06 09:27:34.940966
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: ceph version 15.0.0-122-gcf4d304 (cf4d304f05231b6375986616bc965edc8181a4e1) octopus (dev)
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f26aebba0d2]
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f26aebba2ad]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 3: (Monitor::~Monitor()+0x962) [0x69dfc2]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 4: (Monitor::~Monitor()+0x9) [0x69e039]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 5: (main()+0x2801) [0x578df1]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 6: (__libc_start_main()+0xf0) [0x7f26ad0fd830]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 7: (_start()+0x29) [0x65bee9]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr:*** Caught signal (Aborted) **
2019-04-06T09:27:34.943 INFO:tasks.ceph.mon.c.smithi180.stderr: in thread 7f26b7a03340 thread_name:ceph-mon
From: /ceph/teuthology-archive/pdonnell-2019-04-06_02:21:29-fs-wip-pdonnell-testing-20190405.231924-distro-basic-smithi/3814565/teuthology.log
Seems there were other issues with the mons during that run as well. Mons lost quorum around 08:58:28.846.
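When triaging similar runs, the assertion line in the teuthology excerpt above can be split into daemon, host, source file, line number, and condition. This is a minimal sketch, assuming the `INFO:tasks.ceph.<daemon>.<host>.stderr:` prefix seen in this log. The regular expressions and the name `parse_assert` are illustrative and are not part of teuthology or Ceph.

```python
import re

# Prefix in front of daemon stderr in the excerpt above
# (pattern inferred from this log only, not from teuthology's source).
PREFIX_RE = re.compile(
    r"^(?P<ts>\S+) INFO:tasks\.ceph\.(?P<daemon>\w+\.\w+)\.(?P<host>\w+)\.stderr:(?P<msg>.*)$"
)
# "<file>: <line>: FAILED ceph_assert(<condition>)"
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<cond>.*)\)$"
)


def parse_assert(log_line):
    """Return a dict describing a failed assert, or None if the line is not one."""
    prefix = PREFIX_RE.match(log_line)
    if prefix is None:
        return None
    hit = ASSERT_RE.search(prefix.group("msg"))
    if hit is None:
        return None
    return {
        "daemon": prefix.group("daemon"),
        "host": prefix.group("host"),
        "file": hit.group("file"),
        "line": int(hit.group("line")),
        "condition": hit.group("cond"),
    }


sample = (
    "2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:"
    "/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: 267: "
    "FAILED ceph_assert(session_map.sessions.empty())"
)
info = parse_assert(sample)
print(info)
```

Lines without a host and `stderr` component, such as the "Sent signal 15" entries, do not match the prefix pattern and yield `None`.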
Updated by Greg Farnum almost 7 years ago
- Subject changed from mon: "FAILED ceph_assert(session_map.sessions.empty())" to mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum
- Priority changed from High to Normal
The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...
Updated by Patrick Donnelly almost 7 years ago
Greg Farnum wrote:
The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...
Unless you're looking at something different, the quorum loss happened during the test, not at shutdown. The MDS thrasher had successfully thrashed an MDS (killed it, and a standby took over) about 8 seconds earlier.
Updated by Greg Farnum almost 7 years ago
mon.c timeline:
2019-04-06 08:58:28.846 hits a lease timeout and triggers the election process
2019-04-06 08:58:28.846 first output of "probing" state
2019-04-06 08:58:28.850 first output of "electing" state
2019-04-06 09:27:34.942 crash output line
It does not output the "peon" or "leader" state again in those 29 minutes; it times out 291 elections and starts 294 during that time. I don't know why it happened but mon.c was out of quorum that whole time.
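The "29 minutes" figure follows directly from the first and last timestamps in the timeline. A small sketch of that arithmetic, where the helper name `elapsed` is illustrative:

```python
from datetime import datetime

# Timestamp layout used in the mon.c timeline above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed(start, end):
    """Return the timedelta between two timeline timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Lease timeout / first "probing" output, and the crash output line.
gap = elapsed("2019-04-06 08:58:28.846", "2019-04-06 09:27:34.942")
minutes, seconds = divmod(gap.total_seconds(), 60)
print(f"{int(minutes)} min {seconds:.3f} s")
```

This gives a window of 29 minutes and about 6 seconds between the lease timeout and the crash.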
Updated by Sage Weil almost 7 years ago
mon.c is failing to connect to mon.a:
2019-04-06 09:19:20.484 7f269f20d700 1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=BANNER_CONNECTING pgs=3162 cs=280 l=0 rx=0x4171da0 tx=0x4b2b080)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:20.484 7f269f20d700 1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=START_CONNECT pgs=3162 cs=281 l=0 rx=0x4171da0 tx=0x4b2b080)._fault waiting 15.000000
same in the other direction:
2019-04-06 09:19:00.906 7f8546b47700 1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=BANNER_CONNECTING pgs=581 cs=284 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:00.906 7f8546b47700 1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=START_CONNECT pgs=581 cs=285 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._fault waiting 15.000000
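To follow the reconnect loop across a longer log, the state, `cs` value, and event name can be pulled out of each messenger line. This is a rough sketch whose regular expression is inferred only from the lines quoted here. Treating `cs` as a per-connection attempt counter is an assumption, and `summarize` is a hypothetical helper name.

```python
import re

# Matches the "s=<state> pgs=<n> cs=<n> ... )._<event>" portion of a
# messenger v2 log line (pattern inferred from the excerpts above).
CONN_RE = re.compile(
    r"s=(?P<state>\w+) pgs=(?P<pgs>\d+) cs=(?P<cs>\d+) .*?\)\.(?P<event>\w+)"
)


def summarize(line):
    """Return (state, cs, event) for a connection log line, or None."""
    m = CONN_RE.search(line)
    if m is None:
        return None
    return (m.group("state"), int(m.group("cs")), m.group("event"))


# conn(...) portions of the two mon.c -> mon.a lines quoted above.
lines = [
    "conn(0x3137200 0x2f77600 secure :-1 s=BANNER_CONNECTING pgs=3162 cs=280 "
    "l=0 rx=0x4171da0 tx=0x4b2b080)._handle_peer_banner_payload supported=0 required=0",
    "conn(0x3137200 0x2f77600 secure :-1 s=START_CONNECT pgs=3162 cs=281 "
    "l=0 rx=0x4171da0 tx=0x4b2b080)._fault waiting 15.000000",
]
results = [summarize(line) for line in lines]
for item in results:
    print(item)
```

In both directions the `cs` value steps by one between the banner exchange and the `_fault` line (280 to 281 here, 284 to 285 from mon.a), which fits a connection that keeps retrying without ever completing.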
Updated by Sage Weil almost 7 years ago
(not surprisingly, MON_DOWN is in the ceph.log too, and the run would have failed with that had it not failed for some other reason. will keep an eye out for that!)
Updated by Patrick Donnelly almost 7 years ago
/ceph/teuthology-archive/pdonnell-2019-04-17_06:12:56-kcephfs-wip-pdonnell-testing-20190417.032809-distro-basic-smithi/3857629/teuthology.log
Updated by Neha Ojha almost 7 years ago
/a/yuriw-2019-06-07_19:41:42-rados-wip-yuri4-testing-2019-06-07-1600-nautilus-distro-basic-smithi/4012630/
Updated by Patrick Donnelly about 6 years ago
/ceph/teuthology-archive/pdonnell-2020-02-15_16:51:06-fs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/4767980/teuthology.log
Updated by Neha Ojha almost 5 years ago
/a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049676
Updated by Sridhar Seshasayee almost 5 years ago
/a/sseshasa-2021-05-17_11:08:21-rados-wip-sseshasa-testing-2021-05-17-1504-distro-basic-smithi/6118250
Updated by Neha Ojha almost 5 years ago
- Priority changed from Normal to Urgent
- Backport set to pacific
/a/yuriw-2021-06-02_18:33:05-rados-wip-yuri3-testing-2021-06-02-0826-pacific-distro-basic-smithi/6147408
Updated by Neha Ojha over 4 years ago
/a/yuriw-2021-06-28_17:32:48-rados-wip-yuri2-testing-2021-06-28-0858-pacific-distro-basic-smithi/6239590
Updated by Sage Weil over 4 years ago
- Has duplicate Bug #51882: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Updated by Neha Ojha over 4 years ago
- Backport changed from pacific to pacific, octopus
Updated by Neha Ojha over 4 years ago
/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324701
Updated by Telemetry Bot over 4 years ago
- Crash signature (v1) updated (diff)
- Crash signature (v2) updated (diff)
- Affected Versions v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.2, v15.2.3, v15.2.4, v15.2.5, v15.2.6, v15.2.7, v15.2.8, v15.2.9 added
Assert condition: session_map.sessions.empty()
Assert function: virtual Monitor::~Monitor()
Sanitized backtrace:
pthread_getname_np()
ceph::logging::Log::dump_recent()
Monitor::~Monitor()
Monitor::~Monitor()
main()
__libc_start_main()
_start()
Crash dump sample:
{
"assert_condition": "session_map.sessions.empty()",
"assert_file": "mon/Monitor.cc",
"assert_func": "virtual Monitor::~Monitor()",
"assert_line": 262,
"assert_msg": "mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f4ff8c8c6c0 time 2021-08-03T11:49:35.421508+0000\nmon/Monitor.cc: 262: FAILED ceph_assert(session_map.sessions.empty())",
"assert_thread_name": "ceph-mon",
"backtrace": [
"(()+0x12b20) [0x7f4fed96ab20]",
"(pthread_getname_np()+0x48) [0x7f4fed96bd98]",
"(ceph::logging::Log::dump_recent()+0x428) [0x7f4ff01c4978]",
"(()+0x4be2db) [0x555e399352db]",
"(()+0x12b20) [0x7f4fed96ab20]",
"(gsignal()+0x10f) [0x7f4fec5d27ff]",
"(abort()+0x127) [0x7f4fec5bcc35]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f4fefe74d61]",
"(()+0x27af2a) [0x7f4fefe74f2a]",
"(Monitor::~Monitor()+0xef6) [0x555e39704c26]",
"(Monitor::~Monitor()+0xd) [0x555e39704c7d]",
"(main()+0x565e) [0x555e396974ee]",
"(__libc_start_main()+0xf3) [0x7f4fec5be7b3]",
"(_start()+0x2e) [0x555e396c0d8e]"
],
"ceph_version": "15.2.13",
"crash_id": "2021-08-03T11:49:35.767310Z_62904f71-57d0-4a50-93a8-264c4cc6ff32",
"entity_name": "mon.465717d0783140bdb59100800078d74713f06fc3",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mon",
"stack_sig": "c7d7213859ab7cdabcc40049aff5482ebbf1b9e92d6e65a376ea1d5e89787cf6",
"timestamp": "2021-08-03T11:49:35.767310Z",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-17-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.194-3 (2021-07-18)"
}
Updated by jianwei zhang over 4 years ago
{
"crash_id": "2021-08-26T03:38:46.109584Z_c0f5c111-a3bc-4210-8edd-e72cb5344590",
"timestamp": "2021-08-26T03:38:46.109584Z",
"process_name": "ceph-mon",
"entity_name": "mon.c",
"ceph_version": "v15.2.8.1.0.0",
"utsname_hostname": "node-102",
"utsname_sysname": "Linux",
"utsname_release": "3.10.0-862.el7.x86_64",
"utsname_version": "#1 SMP Fri Apr 20 16:44:24 UTC 2018",
"utsname_machine": "x86_64",
"os_name": "CentOS Linux",
"os_id": "centos",
"os_version_id": "7",
"os_version": "7 (Core)",
"assert_condition": "session_map.sessions.empty()",
"assert_func": "virtual Monitor::~Monitor()",
"assert_file": "/SDS-CICD/release/ceph15-tancz/rpmbuild/BUILD/ceph-15.2.8.1.0.0/src/mon/Monitor.cc",
"assert_line": 262,
"assert_thread_name": "ceph-mon",
"assert_msg": "src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f8e02893340 time 2021-08-26T11:38:46.105871+0800\nsrc/mon/Monitor.cc: 262: FAILED ceph_assert(session_map.sessions.empty())\n",
"backtrace": [
"(()+0xf5d0) [0x7f8df78ce5d0]",
"(gsignal()+0x37) [0x7f8df66c4207]",
"(abort()+0x148) [0x7f8df66c58f8]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7f8df9ac4c9e]",
"(()+0x269e17) [0x7f8df9ac4e17]",
"(Monitor::~Monitor()+0x846) [0x557a6eceded6]",
"(Monitor::~Monitor()+0x9) [0x557a6ecedf29]",
"(main()+0x260a) [0x557a6ec7ba9a]",
"(__libc_start_main()+0xf5) [0x7f8df66b03d5]",
"(()+0x2304f0) [0x557a6ecac4f0]"
]
}
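The two crash dumps above come from different builds but report the same assertion. One simple way to see that is to group reports by the assert function and condition. This sketch uses only fields copied from the dumps above. The `assert_key` grouping is an illustration and is not how Ceph computes `stack_sig`.

```python
from collections import defaultdict


def assert_key(report):
    """Group key for a crash report: (function, condition).

    Illustrative only; this is not Ceph's stack_sig computation.
    """
    return (report["assert_func"], report["assert_condition"])


# Selected fields from the two crash dumps quoted above.
reports = [
    {
        "ceph_version": "15.2.13",
        "assert_func": "virtual Monitor::~Monitor()",
        "assert_condition": "session_map.sessions.empty()",
        "assert_line": 262,
    },
    {
        "ceph_version": "v15.2.8.1.0.0",
        "assert_func": "virtual Monitor::~Monitor()",
        "assert_condition": "session_map.sessions.empty()",
        "assert_line": 262,
    },
]

groups = defaultdict(list)
for report in reports:
    groups[assert_key(report)].append(report["ceph_version"])

for key, versions in groups.items():
    print(key, versions)
```

Keying on the function and condition rather than the line number keeps reports together across releases, since the same assert sits at line 262, 267, or 287 depending on the build.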
Updated by Neha Ojha over 4 years ago
- Has duplicate Bug #52199: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Updated by Neha Ojha over 4 years ago
- Has duplicate Bug #52198: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Updated by Neha Ojha over 4 years ago
- Has duplicate Bug #52142: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Updated by Deepika Upadhyay over 4 years ago
- Crash signature (v1) updated (diff)
2021-10-02T17:30:34.842 INFO:tasks.ceph.mon.a.smithi063.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6-216-g6e2fe4ec/rpm/el8/BUILD/ceph-16.2.6-216-g6e2fe4ec/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 4045240 time 2021-10-02T17:30:34.839243+0000
2021-10-02T17:30:34.843 INFO:tasks.ceph.mon.a.smithi063.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6-216-g6e2fe4ec/rpm/el8/BUILD/ceph-16.2.6-216-g6e2fe4ec/src/mon/Monitor.cc: 287: FAILED ceph_assert(session_map.sessions.empty())
/ceph/teuthology-archive/yuriw-2021-10-02_15:03:31-rados-wip-yuri2-testing-2021-10-01-0902-pacific-distro-basic-smithi/6417691/teuthology.log
Updated by Sage Weil over 4 years ago
/a/sage-2021-10-28_02:19:01-rados-wip-sage3-testing-2021-10-27-1300-distro-basic-smithi/6464204
with logs!
Updated by Aishwarya Mathuria over 4 years ago
/a/yuriw-2021-11-20_18:01:41-rados-wip-yuri8-testing-2021-11-20-0807-distro-basic-smithi/6516396
Updated by Sage Weil over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 44337
Updated by Sage Weil over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 4 years ago
- Copied to Backport #53659: pacific: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum added
Updated by Upkeep Bot over 4 years ago
- Copied to Backport #53660: octopus: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum added
Updated by Telemetry Bot about 4 years ago
- Crash signature (v1) updated (diff)
- Affected Versions v14.2.0, v14.2.1, v14.2.10, v14.2.11, v14.2.13, v14.2.16, v14.2.4, v14.2.5, v14.2.7, v14.2.8, v15.2.0 added
Updated by Neha Ojha almost 4 years ago
- Status changed from Pending Backport to Resolved
- Crash signature (v1) updated (diff)
Updated by Telemetry Bot over 3 years ago
- Related to Bug #56192: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Updated by Upkeep Bot 8 months ago
- Merge Commit set to b55781d412f05e5ad99751cc4247a22d9ada5547
- Fixed In set to v17.0.0-9702-gb55781d412
- Released In set to v17.2.0~275
- Upkeep Timestamp set to 2025-07-18T16:11:29+00:00