Bug #39150

closed

mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum

Added by Patrick Donnelly almost 7 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
% Done:

0%

Component(RADOS):
Monitor
Pull request ID:
Tags (freeform):
Fixed In:
v17.0.0-9702-gb55781d412
Released In:
v17.2.0~275
Upkeep Timestamp:
2025-07-18T16:11:29+00:00

Description

2019-04-06T09:27:34.791 INFO:tasks.ceph.mds.b:Sent signal 15
2019-04-06T09:27:34.791 INFO:tasks.ceph.mon.a:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.c:Sent signal 15
2019-04-06T09:27:34.792 INFO:tasks.ceph.mon.b:Sent signal 15
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i a  (PID: 17117) UID: 0
2019-04-06T09:27:34.803 INFO:tasks.ceph.mon.a.smithi085.stderr:2019-04-06 09:27:34.801 7f854e356700 -1 mon.a@0(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i c  (PID: 17101) UID: 0
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.c.smithi180.stderr:2019-04-06 09:27:34.806 7f26a7a1e700 -1 mon.c@2(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.807 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mon -f --cluster ceph -i b  (PID: 17099) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mon.b.smithi180.stderr:2019-04-06 09:27:34.806 7fc14ddde700 -1 mon.b@1(electing) e1 *** Got Signal Terminated ***
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 received  signal: Terminated from /usr/bin/python /usr/bin/daemon-helper kill ceph-mds -f --cluster ceph -i b  (PID: 19872) UID: 0
2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 mds.b *** got signal Terminated ***
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f26b7a03340 time 2019-04-06 09:27:34.940966
2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: ceph version 15.0.0-122-gcf4d304 (cf4d304f05231b6375986616bc965edc8181a4e1) octopus (dev)
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f26aebba0d2]
2019-04-06T09:27:34.941 INFO:tasks.ceph.mon.c.smithi180.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f26aebba2ad]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 3: (Monitor::~Monitor()+0x962) [0x69dfc2]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 4: (Monitor::~Monitor()+0x9) [0x69e039]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 5: (main()+0x2801) [0x578df1]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 6: (__libc_start_main()+0xf0) [0x7f26ad0fd830]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr: 7: (_start()+0x29) [0x65bee9]
2019-04-06T09:27:34.942 INFO:tasks.ceph.mon.c.smithi180.stderr:*** Caught signal (Aborted) **
2019-04-06T09:27:34.943 INFO:tasks.ceph.mon.c.smithi180.stderr: in thread 7f26b7a03340 thread_name:ceph-mon

From: /ceph/teuthology-archive/pdonnell-2019-04-06_02:21:29-fs-wip-pdonnell-testing-20190405.231924-distro-basic-smithi/3814565/teuthology.log

Seems there were other issues with the mons during that run as well. Mons lost quorum around 08:58:28.846.
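For anyone triaging similar runs, the assert line above can be pulled out of a teuthology log mechanically. The following is a minimal sketch, not part of teuthology or Ceph: the regular expression and the helper name `find_failed_asserts` are assumptions based only on the line format shown in this description.

```python
import re

# Assumed line shape, taken from the log excerpt above:
#   <ts> INFO:tasks.ceph.<daemon>.<host>.stderr:<file>: <line>: FAILED ceph_assert(<cond>)
ASSERT_RE = re.compile(
    r"INFO:tasks\.ceph\.(?P<daemon>[a-z]+\.[^.]+)\.(?P<host>[^.]+)\.stderr:"
    r"(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<cond>.*)\)\s*$"
)


def find_failed_asserts(lines):
    """Return one dict per log line that reports a failed ceph_assert."""
    hits = []
    for text in lines:
        m = ASSERT_RE.search(text)
        if m:
            hit = m.groupdict()
            hit["line"] = int(hit["line"])  # source line number as an integer
            hits.append(hit)
    return hits


# Two lines copied from the description: one signal message, one assert.
SAMPLE = [
    "2019-04-06T09:27:34.808 INFO:tasks.ceph.mds.b.smithi180.stderr:2019-04-06 09:27:34.806 7f97d9dd8700 -1 mds.b *** got signal Terminated ***",
    "2019-04-06T09:27:34.939 INFO:tasks.ceph.mon.c.smithi180.stderr:/build/ceph-15.0.0-122-gcf4d304/src/mon/Monitor.cc: 267: FAILED ceph_assert(session_map.sessions.empty())",
]

if __name__ == "__main__":
    for hit in find_failed_asserts(SAMPLE):
        print(hit["daemon"], hit["host"], hit["file"], hit["line"], hit["cond"])
```

Only the second sample line matches; the signal message is skipped because it does not have the `<file>: <line>: FAILED ceph_assert(...)` shape.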


Related issues 7 (1 open, 6 closed)

Related to RADOS - Bug #56192: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) (Pending Backport, Nitzan Mordechai)
Has duplicate RADOS - Bug #51882: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) (Duplicate)
Has duplicate RADOS - Bug #52199: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) (Duplicate)
Has duplicate RADOS - Bug #52198: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) (Duplicate)
Has duplicate RADOS - Bug #52142: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) (Duplicate)
Copied to RADOS - Backport #53659: pacific: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum (Resolved, Cory Snyder)
Copied to RADOS - Backport #53660: octopus: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum (Resolved, Cory Snyder)

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Subject changed from mon: "FAILED ceph_assert(session_map.sessions.empty())" to mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum
  • Priority changed from High to Normal

The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...

Actions #2

Updated by Patrick Donnelly almost 7 years ago

Greg Farnum wrote:

The monitor was out of quorum for 30 minutes; it probably has to do with holding on to client connections or else not cleaning up the session map from when it was last in. I'm not sure this is high priority though since it's a crash on shutdown in a failure scenario...

Unless you're looking at something different, the quorum loss happened during the test, not at shutdown. The MDS thrasher had successfully thrashed an MDS (killed it, and a standby took over) about 8 seconds earlier.

Actions #3

Updated by Greg Farnum almost 7 years ago

mon.c timeline:
2019-04-06 08:58:28.846 hits a lease timeout and triggers the election process
2019-04-06 08:58:28.846 first output of "probing" state
2019-04-06 08:58:28.850 first output of "electing" state
2019-04-06 09:27:34.942 crash output line

It does not output the "peon" or "leader" state again in those 29 minutes; it times out 291 elections and starts 294 during that time. I don't know why it happened but mon.c was out of quorum that whole time.
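The "29 minutes" figure can be checked directly from the two timestamps in the timeline. This is a small arithmetic sketch; the constant names are made up for illustration, and the per-election interval is only an average derived from the 294 election starts quoted above, not a configured timeout.

```python
from datetime import datetime

# Timestamp format used in the mon log lines quoted in this comment.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two mon-log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


LEASE_TIMEOUT = "2019-04-06 08:58:28.846"  # lease timeout, election triggered
CRASH = "2019-04-06 09:27:34.942"          # crash output line
ELECTIONS_STARTED = 294                    # count quoted in this comment

out_of_quorum = seconds_between(LEASE_TIMEOUT, CRASH)
minutes, seconds = divmod(out_of_quorum, 60)
avg_interval = out_of_quorum / ELECTIONS_STARTED

if __name__ == "__main__":
    print(f"out of quorum for {int(minutes)} min {seconds:.3f} s")
    print(f"average time per election start: {avg_interval:.2f} s")
```

The gap works out to a little over 29 minutes, and the average spacing between election starts is just under 6 seconds.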

Actions #4

Updated by Sage Weil almost 7 years ago

mon.c is failing to connect to mon.a:

2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=BANNER_CONNECTING pgs=3162 cs=280 l=0 rx=0x4171da0 tx=0x4b2b080)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=START_CONNECT pgs=3162 cs=281 l=0 rx=0x4171da0 tx=0x4b2b080)._fault waiting 15.000000

same in the other direction:
2019-04-06 09:19:00.906 7f8546b47700  1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=BANNER_CONNECTING pgs=581 cs=284 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._handle_peer_banner_payload supported=0 required=0
2019-04-06 09:19:00.906 7f8546b47700  1 --2- [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] >> [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] conn(0x361b680 0x33f9b80 secure :-1 s=START_CONNECT pgs=581 cs=285 l=0 rx=0x8ed3ab0 tx=0x4eb4b80)._fault waiting 15.000000
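To see how often a connection is cycling, the state and counters can be extracted from these messenger lines. A rough sketch follows: the regular expression is inferred from the two mon.c lines above, the helper `conn_fields` is hypothetical, and reading `cs=` as the connect sequence is an assumption rather than something confirmed in this thread.

```python
import re

# Fields as they appear inside conn(...) in the lines above:
#   s=<STATE> pgs=<n> cs=<n>
CONN_RE = re.compile(r"\bs=(?P<state>[A-Z_]+) pgs=(?P<pgs>\d+) cs=(?P<cs>\d+)")


def conn_fields(line):
    """Return state, pgs, cs and a fault flag for one messenger line, or None."""
    m = CONN_RE.search(line)
    if m is None:
        return None
    return {
        "state": m["state"],
        "pgs": int(m["pgs"]),
        "cs": int(m["cs"]),           # assumed to be the connect sequence
        "fault": "._fault" in line,   # line reports a fault and a retry wait
    }


# The two mon.c lines quoted above, copied verbatim.
SAMPLE = [
    "2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=BANNER_CONNECTING pgs=3162 cs=280 l=0 rx=0x4171da0 tx=0x4b2b080)._handle_peer_banner_payload supported=0 required=0",
    "2019-04-06 09:19:20.484 7f269f20d700  1 --2- [v2:172.21.15.180:3301/0,v1:172.21.15.180:6790/0] >> [v2:172.21.15.85:3300/0,v1:172.21.15.85:6789/0] conn(0x3137200 0x2f77600 secure :-1 s=START_CONNECT pgs=3162 cs=281 l=0 rx=0x4171da0 tx=0x4b2b080)._fault waiting 15.000000",
]

if __name__ == "__main__":
    for line in SAMPLE:
        print(conn_fields(line))
```

On the sample, `cs` goes from 280 to 281 across the fault, which is consistent with the connection being retried repeatedly rather than ever completing.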

Actions #5

Updated by Sage Weil almost 7 years ago

Not surprisingly, MON_DOWN is in the ceph.log too, and the run would have failed on that had it not already failed for some other reason. Will keep an eye out for that!

Actions #6

Updated by Patrick Donnelly almost 7 years ago

/ceph/teuthology-archive/pdonnell-2019-04-17_06:12:56-kcephfs-wip-pdonnell-testing-20190417.032809-distro-basic-smithi/3857629/teuthology.log

Actions #7

Updated by Neha Ojha almost 7 years ago

/a/yuriw-2019-06-07_19:41:42-rados-wip-yuri4-testing-2019-06-07-1600-nautilus-distro-basic-smithi/4012630/

Actions #8

Updated by Patrick Donnelly about 6 years ago

/ceph/teuthology-archive/pdonnell-2020-02-15_16:51:06-fs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/4767980/teuthology.log

Actions #9

Updated by Neha Ojha almost 5 years ago

/a/nojha-2021-04-15_20:05:27-rados-wip-50217-distro-basic-smithi/6049676

Actions #10

Updated by Sridhar Seshasayee almost 5 years ago

/a/sseshasa-2021-05-17_11:08:21-rados-wip-sseshasa-testing-2021-05-17-1504-distro-basic-smithi/6118250

Actions #11

Updated by Neha Ojha almost 5 years ago

  • Priority changed from Normal to Urgent
  • Backport set to pacific

/a/yuriw-2021-06-02_18:33:05-rados-wip-yuri3-testing-2021-06-02-0826-pacific-distro-basic-smithi/6147408

Actions #12

Updated by Neha Ojha over 4 years ago

/a/yuriw-2021-06-28_17:32:48-rados-wip-yuri2-testing-2021-06-28-0858-pacific-distro-basic-smithi/6239590

Actions #13

Updated by Sage Weil over 4 years ago

  • Has duplicate Bug #51882: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Actions #14

Updated by Neha Ojha over 4 years ago

  • Backport changed from pacific to pacific, octopus
Actions #15

Updated by Neha Ojha over 4 years ago

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324701

Actions #16

Updated by Telemetry Bot over 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.10, v15.2.11, v15.2.12, v15.2.13, v15.2.2, v15.2.3, v15.2.4, v15.2.5, v15.2.6, v15.2.7, v15.2.8, v15.2.9 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=4d653e9c3ee37041dd2a1cf556ea466db3e74addb7a8d3efb38d8e8a268096d3

Assert condition: session_map.sessions.empty()
Assert function: virtual Monitor::~Monitor()

Sanitized backtrace:

    pthread_getname_np()
    ceph::logging::Log::dump_recent()
    Monitor::~Monitor()
    Monitor::~Monitor()
    main()
    __libc_start_main()
    _start()

Crash dump sample:
{
    "assert_condition": "session_map.sessions.empty()",
    "assert_file": "mon/Monitor.cc",
    "assert_func": "virtual Monitor::~Monitor()",
    "assert_line": 262,
    "assert_msg": "mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f4ff8c8c6c0 time 2021-08-03T11:49:35.421508+0000\nmon/Monitor.cc: 262: FAILED ceph_assert(session_map.sessions.empty())",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "(()+0x12b20) [0x7f4fed96ab20]",
        "(pthread_getname_np()+0x48) [0x7f4fed96bd98]",
        "(ceph::logging::Log::dump_recent()+0x428) [0x7f4ff01c4978]",
        "(()+0x4be2db) [0x555e399352db]",
        "(()+0x12b20) [0x7f4fed96ab20]",
        "(gsignal()+0x10f) [0x7f4fec5d27ff]",
        "(abort()+0x127) [0x7f4fec5bcc35]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f4fefe74d61]",
        "(()+0x27af2a) [0x7f4fefe74f2a]",
        "(Monitor::~Monitor()+0xef6) [0x555e39704c26]",
        "(Monitor::~Monitor()+0xd) [0x555e39704c7d]",
        "(main()+0x565e) [0x555e396974ee]",
        "(__libc_start_main()+0xf3) [0x7f4fec5be7b3]",
        "(_start()+0x2e) [0x555e396c0d8e]" 
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-08-03T11:49:35.767310Z_62904f71-57d0-4a50-93a8-264c4cc6ff32",
    "entity_name": "mon.465717d0783140bdb59100800078d74713f06fc3",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mon",
    "stack_sig": "c7d7213859ab7cdabcc40049aff5482ebbf1b9e92d6e65a376ea1d5e89787cf6",
    "timestamp": "2021-08-03T11:49:35.767310Z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.19.0-17-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Debian 4.19.194-3 (2021-07-18)" 
}

Actions #17

Updated by jianwei zhang over 4 years ago

{
    "crash_id": "2021-08-26T03:38:46.109584Z_c0f5c111-a3bc-4210-8edd-e72cb5344590",
    "timestamp": "2021-08-26T03:38:46.109584Z",
    "process_name": "ceph-mon",
    "entity_name": "mon.c",
    "ceph_version": "v15.2.8.1.0.0",
    "utsname_hostname": "node-102",
    "utsname_sysname": "Linux",
    "utsname_release": "3.10.0-862.el7.x86_64",
    "utsname_version": "#1 SMP Fri Apr 20 16:44:24 UTC 2018",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Linux",
    "os_id": "centos",
    "os_version_id": "7",
    "os_version": "7 (Core)",
    "assert_condition": "session_map.sessions.empty()",
    "assert_func": "virtual Monitor::~Monitor()",
    "assert_file": "/SDS-CICD/release/ceph15-tancz/rpmbuild/BUILD/ceph-15.2.8.1.0.0/src/mon/Monitor.cc",
    "assert_line": 262,
    "assert_thread_name": "ceph-mon",
    "assert_msg": "src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 7f8e02893340 time 2021-08-26T11:38:46.105871+0800\nsrc/mon/Monitor.cc: 262: FAILED ceph_assert(session_map.sessions.empty())\n",
    "backtrace": [
        "(()+0xf5d0) [0x7f8df78ce5d0]",
        "(gsignal()+0x37) [0x7f8df66c4207]",
        "(abort()+0x148) [0x7f8df66c58f8]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7f8df9ac4c9e]",
        "(()+0x269e17) [0x7f8df9ac4e17]",
        "(Monitor::~Monitor()+0x846) [0x557a6eceded6]",
        "(Monitor::~Monitor()+0x9) [0x557a6ecedf29]",
        "(main()+0x260a) [0x557a6ec7ba9a]",
        "(__libc_start_main()+0xf5) [0x7f8df66b03d5]",
        "(()+0x2304f0) [0x557a6ecac4f0]" 
    ]
}
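A crash report like the one above can be checked against this bug's signature with a few lines of Python. This is a sketch only: `SIGNATURE`, `matches_signature` and `frame_names` are hypothetical helpers, the matching rule (assert condition plus assert function) is an assumption and not how the telemetry bot computes its signatures, and the sample is a trimmed copy of the report above.

```python
import json
import re

# Fields that identify this bug in the reports posted to this ticket.
SIGNATURE = {
    "assert_condition": "session_map.sessions.empty()",
    "assert_func": "virtual Monitor::~Monitor()",
}

# "(name(args)+0xoff) [addr]" -> name; unnamed frames "(()+0xoff) [addr]" do not match.
FRAME_RE = re.compile(r"^\((?P<name>[^()]+)\(")


def matches_signature(report):
    """True if the report carries this bug's assert condition and function."""
    return all(report.get(key) == value for key, value in SIGNATURE.items())


def frame_names(report):
    """Function names from the backtrace, skipping frames without a symbol."""
    names = []
    for frame in report.get("backtrace", []):
        m = FRAME_RE.match(frame)
        if m:
            names.append(m["name"])
    return names


# Trimmed copy of the crash report above (subset of fields and frames).
SAMPLE = json.loads("""
{
    "entity_name": "mon.c",
    "assert_condition": "session_map.sessions.empty()",
    "assert_func": "virtual Monitor::~Monitor()",
    "assert_line": 262,
    "backtrace": [
        "(()+0xf5d0) [0x7f8df78ce5d0]",
        "(gsignal()+0x37) [0x7f8df66c4207]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19b) [0x7f8df9ac4c9e]",
        "(Monitor::~Monitor()+0x846) [0x557a6eceded6]"
    ]
}
""")

if __name__ == "__main__":
    print(matches_signature(SAMPLE))
    print(frame_names(SAMPLE))
```

Matching on the condition and function rather than on `assert_line` is deliberate here, since the line number differs between builds in this thread (262, 267, 287).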
Actions #18

Updated by Neha Ojha over 4 years ago

  • Has duplicate Bug #52199: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Actions #19

Updated by Neha Ojha over 4 years ago

  • Has duplicate Bug #52198: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Actions #20

Updated by Neha Ojha over 4 years ago

  • Has duplicate Bug #52142: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Actions #21

Updated by Deepika Upadhyay over 4 years ago

  • Crash signature (v1) updated (diff)
2021-10-02T17:30:34.842 INFO:tasks.ceph.mon.a.smithi063.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6-216-g6e2fe4ec/rpm/el8/BUILD/ceph-16.2.6-216-g6e2fe4ec/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 4045240 time 2021-10-02T17:30:34.839243+0000
2021-10-02T17:30:34.843 INFO:tasks.ceph.mon.a.smithi063.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6-216-g6e2fe4ec/rpm/el8/BUILD/ceph-16.2.6-216-g6e2fe4ec/src/mon/Monitor.cc: 287: FAILED ceph_assert(session_map.sessions.empty())

/ceph/teuthology-archive/yuriw-2021-10-02_15:03:31-rados-wip-yuri2-testing-2021-10-01-0902-pacific-distro-basic-smithi/6417691/teuthology.log

Actions #22

Updated by Sage Weil over 4 years ago

/a/sage-2021-10-28_02:19:01-rados-wip-sage3-testing-2021-10-27-1300-distro-basic-smithi/6464204

with logs!

Actions #23

Updated by Aishwarya Mathuria over 4 years ago

/a/yuriw-2021-11-20_18:01:41-rados-wip-yuri8-testing-2021-11-20-0807-distro-basic-smithi/6516396

Actions #24

Updated by Sage Weil over 4 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 44337
Actions #25

Updated by Sage Weil over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #26

Updated by Upkeep Bot over 4 years ago

  • Copied to Backport #53659: pacific: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum added
Actions #27

Updated by Upkeep Bot over 4 years ago

  • Copied to Backport #53660: octopus: mon: "FAILED ceph_assert(session_map.sessions.empty())" when out of quorum added
Actions #28

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v14.2.2, v15.2.14, v15.2.15 added
Actions #29

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
Actions #30

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.0, v16.2.2, v16.2.4, v16.2.5, v16.2.6, v16.2.7 added
Actions #31

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #32

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #33

Updated by Telemetry Bot about 4 years ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v14.2.0, v14.2.1, v14.2.10, v14.2.11, v14.2.13, v14.2.16, v14.2.4, v14.2.5, v14.2.7, v14.2.8, v15.2.0 added
Actions #34

Updated by Neha Ojha almost 4 years ago

  • Status changed from Pending Backport to Resolved
  • Crash signature (v1) updated (diff)
Actions #35

Updated by Telemetry Bot over 3 years ago

  • Related to Bug #56192: crash: virtual Monitor::~Monitor(): assert(session_map.sessions.empty()) added
Actions #36

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to b55781d412f05e5ad99751cc4247a22d9ada5547
  • Fixed In set to v17.0.0-9702-gb55781d412
  • Released In set to v17.2.0~275
  • Upkeep Timestamp set to 2025-07-18T16:11:29+00:00