Bug #64870
Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Description
Description of problem
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587648
Test Description:
rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-comp-lz4} tasks/e2e}
The test 04-osds.e2e-spec.ts marks OSDs down, which raises the OSD_DOWN health warning. The logs show that the warning clears within a few seconds, but because it was logged at all, the run is marked failed even though all of the dashboard tests passed. The warning should probably be added to the ignorelist, since the OSD_DOWN event is expected; a sketch of such an override follows below.
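As a sketch only (the exact suite fragment and file placement here are assumptions, not the eventual fix), the standard teuthology override would be a log-ignorelist entry in the relevant suite YAML:

overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)

The entries are regular expressions matched against cluster log lines, hence the escaped parentheses.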
Actual results
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: pgmap v423: 1 pgs: 1 active+clean; 577 KiB data, 419 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Health check failed: 1 osds down (OSD_DOWN)
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd down", "ids": ["3"]}]': finished
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osdmap e40: 6 total, 5 up, 6 in
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Monitor daemon marked osd.3 down, but it is still running
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: map e40 wrongly marked me down at e40
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osd.3 marked itself dead as of e40
2024-03-10T00:32:14.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osd.3 now down
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: Removing daemon osd.3 from smithi186 -- ports []
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osdmap e41: 6 total, 5 up, 6 in
2024-03-10T00:32:15.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:15 smithi110 ceph-mon[29592]: pgmap v426: 1 pgs: 1 active+clean; 577 KiB data, 208 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "auth rm", "entity": "osd.3"}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "auth rm", "entity": "osd.3"}]': finished
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Health check cleared: OSD_DOWN (was: 1 osds down)
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Cluster is now healthy
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]': finished
2024-03-10T00:32:16.642 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: osdmap e42: 5 total, 5 up, 5 in
Updated by Laura Flores almost 2 years ago
- Subject changed from mgr/dashboard: Health check failed: 1 osds down (OSD_DOWN)" in cluster log to Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Also found in an upgrade test:
description: rados/upgrade/parallel/{0-random-distro$/{ubuntu_22.04} 0-start 1-tasks
mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig
rbd_import_export test_rbd_api test_rbd_python}}
/a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650687
Updated by Laura Flores almost 2 years ago
And in a cephadm test: /a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650670
Updated by Laura Flores almost 2 years ago
- Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Updated by Sridhar Seshasayee almost 2 years ago · Edited
Seen on Squid:
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705413
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705425
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705440
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705454
Test Description:
1. 7705413 - rados/upgrade/parallel/{0-random-distro$/{centos_9.stream_runc} 0-start 1-tasks mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
2. 7705425 - rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-hybrid} tasks/e2e}
3. 7705440 - rados/upgrade/parallel/{0-random-distro$/{centos_9.stream} 0-start 1-tasks mon_election/connectivity upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
4. 7705454 - rados/dashboard/{0-single-container-host debug/mgr mon_election/classic random-objectstore$/{bluestore-bitmap} tasks/e2e}
Updated by Laura Flores almost 2 years ago
- Assignee set to Nitzan Mordechai
@Nitzan Mordechai can you take a look? This should likely be added to the ignorelist.
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-06-26_14:19:02-rados-wip-yuri13-testing-2024-06-25-1409-squid-distro-default-smithi/7773523 - rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-stupid} tasks/e2e}
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-05_14:04:08-rados-wip-yuri3-testing-2024-07-01-1610-distro-default-smithi/7788682
Updated by Aishwarya Mathuria over 1 year ago
- Tags changed from test-failure to test-failure, main-failures
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-17_22:03:42-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7806662
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-07-16_01:05:51-rados-wip-yuri6-testing-2024-07-15-1335-distro-default-smithi/7803568
Updated by Nitzan Mordechai over 1 year ago
- Status changed from New to Pending Backport
Needs to be backported.
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67472: reef: Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67473: squid: Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-07_00:00:30-rados-squid-release-distro-default-smithi/7893238
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921700
Updated by Sridhar Seshasayee about 1 year ago
/a/skanta-2025-01-26_15:56:13-rados-wip-bharath13-testing-2025-01-25-2124-squid-distro-default-smithi/8094267
Test Description:
rados/upgrade/parallel/{0-random-distro$/{ubuntu_22.04} 0-start 1-tasks
mon_election/classic overrides/ignorelist_health upgrade-sequence workload/{ec-rados-default
rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
The OSD_DOWN cluster log warning was generated by an OSD going down during the upgrade sequence:
2025-01-26T17:22:45.411+0000 7fcb84c8e640 1 -- [v2:172.21.15.123:3300/0,v1:172.21.15.123:6789/0] <== osd.0 v2:172.21.15.123:6802/2553244879 20 ==== MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) ==== 119+0+0 (secure 0 0 0) 0x558372053680 con 0x558372329800
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 mon.a@0(leader) e4 _ms_dispatch existing session 0x558372047b00 for osd.0
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 mon.a@0(leader) e4 entity_name osd.0 global_id 14229 (reclaim_ok) caps allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).paxosservice(osdmap 1..423) dispatch 0x558372053680 MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879 con 0x558372329800
2025-01-26T17:22:45.411+0000 7fcb84c8e640 5 mon.a@0(leader).paxos(paxos active c 1758..2505) is_readable = 1 - now=2025-01-26T17:22:45.414490+0000 lease_expire=2025-01-26T17:22:49.716355+0000 has v0 lc 2505
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).osd e423 preprocess_query MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap is_capable service=osd command= exec addr v2:172.21.15.123:6802/2553244879 on cap allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap allow so far , doing grant allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap match
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).osd e423 MOSDMarkMeDown for: osd.0 [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879]
2025-01-26T17:22:45.411+0000 7fcb84c8e640 7 mon.a@0(leader).osd e423 prepare_update MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879
2025-01-26T17:22:45.411+0000 7fcb84c8e640 0 log_channel(cluster) log [INF] : osd.0 marked itself down and dead
...
2025-01-26T17:22:45.703+0000 7fcb87493640 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
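Note that this job's description already includes the overrides/ignorelist_health fragment, yet the run still failed on OSD_DOWN; presumably that fragment does not yet contain a pattern covering this warning, which is what the ignorelist sketch in the description above would address.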
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-03-11_17:59:57-rados-wip-yuri11-testing-2025-03-11-0848-reef-distro-default-smithi/8181680
description: rados/cephadm/osds/{0-distro/ubuntu_22.04 0-nvme-loop 1-start 2-ops/rm-zap-add}
duration: 1388.6809134483337
failure_reason: '"2025-03-11T22:18:04.579236+0000 osd.1 (osd.1) 3 : cluster [WRN]
Monitor daemon marked osd.1 down, but it is still running" in cluster log'
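This failure is on a different cluster log warning than OSD_DOWN. If the "wrongly marked down" message is likewise expected in the rm-zap-add workflow, a companion entry, again only a sketch under the same assumptions as above, would be:

overrides:
  ceph:
    log-ignorelist:
      - but it is still running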
Updated by Laura Flores about 1 year ago
- Tags changed from test-failure, main-failures to test-failure, main-failures, cluster-log-warning
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 3956c4278abb8f3f97e1c48924bf741e64a68d82
- Fixed In set to v19.3.0-3795-g3956c4278ab
- Upkeep Timestamp set to 2025-07-09T13:47:19+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-3795-g3956c4278ab to v19.3.0-3795-g3956c4278a
- Upkeep Timestamp changed from 2025-07-09T13:47:19+00:00 to 2025-07-14T17:41:04+00:00
Updated by Jonathan Bailey 8 months ago
/a/skanta-2025-07-20_02:53:19-rados-wip-bharath6-testing-2025-07-20-0524-reef-distro-default-smithi/8397693
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2369
- Upkeep Timestamp changed from 2025-07-14T17:41:04+00:00 to 2025-11-01T00:58:01+00:00