Bug #64870
Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Description
Description of problem
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587648
Test Description:
rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-comp-lz4} tasks/e2e}
The test 04-osds.e2e-spec.ts marks OSDs down, which raises the OSD_DOWN health warning. The logs show that the warning clears within a few seconds, but because it was logged at all, the run is marked failed even though all of the dashboard tests passed. The warning should probably be added to the ignorelist, since the OSD_DOWN event is expected; a sketch of such an override follows below.
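As a sketch only (the exact suite fragment and file placement here are assumptions, not the eventual fix), the standard teuthology override would be a log-ignorelist entry in the relevant suite YAML:

overrides:
  ceph:
    log-ignorelist:
      - \(OSD_DOWN\)

The entries are regular expressions matched against cluster log lines, hence the escaped parentheses.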
Actual results
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd safe-to-destroy", "ids": ["3"]}]: dispatch
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: pgmap v423: 1 pgs: 1 active+clean; 577 KiB data, 419 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:13.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Health check failed: 1 osds down (OSD_DOWN)
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd down", "ids": ["3"]}]': finished
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osdmap e40: 6 total, 5 up, 6 in
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: Monitor daemon marked osd.3 down, but it is still running
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: map e40 wrongly marked me down at e40
2024-03-10T00:32:13.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:13 smithi110 ceph-mon[29592]: osd.3 marked itself dead as of e40
2024-03-10T00:32:14.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osd.3 now down
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: Removing daemon osd.3 from smithi186 -- ports []
2024-03-10T00:32:14.961 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:14 smithi110 ceph-mon[29592]: osdmap e41: 6 total, 5 up, 6 in
2024-03-10T00:32:15.960 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:15 smithi110 ceph-mon[29592]: pgmap v426: 1 pgs: 1 active+clean; 577 KiB data, 208 MiB used, 313 GiB / 313 GiB avail
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "auth rm", "entity": "osd.3"}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "auth rm", "entity": "osd.3"}]': finished
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd=[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]: dispatch
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Health check cleared: OSD_DOWN (was: 1 osds down)
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: Cluster is now healthy
2024-03-10T00:32:16.641 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: from='mgr.14152 172.21.15.110:0/917657050' entity='mgr.a' cmd='[{"prefix": "osd purge-actual", "id": 3, "yes_i_really_mean_it": true}]': finished
2024-03-10T00:32:16.642 INFO:journalctl@ceph.mon.a.smithi110.stdout:Mar 10 00:32:16 smithi110 ceph-mon[29592]: osdmap e42: 5 total, 5 up, 5 in
Updated by Laura Flores almost 2 years ago
- Subject changed from mgr/dashboard: Health check failed: 1 osds down (OSD_DOWN)" in cluster log to Health check failed: 1 osds down (OSD_DOWN)" in cluster log
Also found in an upgrade test:
description: rados/upgrade/parallel/{0-random-distro$/{ubuntu_22.04} 0-start 1-tasks
mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig
rbd_import_export test_rbd_api test_rbd_python}}
/a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650687
Updated by Laura Flores almost 2 years ago
And in a cephadm test: /a/yuriw-2024-04-10_14:17:51-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7650670
Updated by Laura Flores almost 2 years ago
- Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Updated by Sridhar Seshasayee almost 2 years ago · Edited
Seen on Squid:
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705413
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705425
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705440
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705454
Test Description:
1. 7705413 - rados/upgrade/parallel/{0-random-distro$/{centos_9.stream_runc} 0-start 1-tasks mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
2. 7705425 - rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-hybrid} tasks/e2e}
3. 7705440 - rados/upgrade/parallel/{0-random-distro$/{centos_9.stream} 0-start 1-tasks mon_election/connectivity upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
4. 7705454 - rados/dashboard/{0-single-container-host debug/mgr mon_election/classic random-objectstore$/{bluestore-bitmap} tasks/e2e}
Updated by Laura Flores almost 2 years ago
- Assignee set to Nitzan Mordechai
@Nitzan Mordechai can you take a look? This should likely be added to the ignorelist.
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-06-26_14:19:02-rados-wip-yuri13-testing-2024-06-25-1409-squid-distro-default-smithi/7773523 - rados/dashboard/{0-single-container-host debug/mgr mon_election/connectivity random-objectstore$/{bluestore-stupid} tasks/e2e}
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-05_14:04:08-rados-wip-yuri3-testing-2024-07-01-1610-distro-default-smithi/7788682
Updated by Aishwarya Mathuria over 1 year ago
- Tags changed from test-failure to test-failure, main-failures
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-17_22:03:42-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7806662
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-07-16_01:05:51-rados-wip-yuri6-testing-2024-07-15-1335-distro-default-smithi/7803568
Updated by Nitzan Mordechai over 1 year ago
- Status changed from New to Pending Backport
Needs to be backported.
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67472: reef: Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67473: squid: Health check failed: 1 osds down (OSD_DOWN)" in cluster log added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-07_00:00:30-rados-squid-release-distro-default-smithi/7893238
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921700
Updated by Sridhar Seshasayee about 1 year ago
/a/skanta-2025-01-26_15:56:13-rados-wip-bharath13-testing-2025-01-25-2124-squid-distro-default-smithi/8094267
Test Description:
rados/upgrade/parallel/{0-random-distro$/{ubuntu_22.04} 0-start 1-tasks
mon_election/classic overrides/ignorelist_health upgrade-sequence workload/{ec-rados-default
rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}
The OSD_DOWN cluster log warning was generated by an OSD going down during the upgrade sequence:
2025-01-26T17:22:45.411+0000 7fcb84c8e640 1 -- [v2:172.21.15.123:3300/0,v1:172.21.15.123:6789/0] <== osd.0 v2:172.21.15.123:6802/2553244879 20 ==== MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) ==== 119+0+0 (secure 0 0 0) 0x558372053680 con 0x558372329800
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 mon.a@0(leader) e4 _ms_dispatch existing session 0x558372047b00 for osd.0
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 mon.a@0(leader) e4 entity_name osd.0 global_id 14229 (reclaim_ok) caps allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).paxosservice(osdmap 1..423) dispatch 0x558372053680 MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879 con 0x558372329800
2025-01-26T17:22:45.411+0000 7fcb84c8e640 5 mon.a@0(leader).paxos(paxos active c 1758..2505) is_readable = 1 - now=2025-01-26T17:22:45.414490+0000 lease_expire=2025-01-26T17:22:49.716355+0000 has v0 lc 2505
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).osd e423 preprocess_query MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap is_capable service=osd command= exec addr v2:172.21.15.123:6802/2553244879 on cap allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap allow so far , doing grant allow profile osd
2025-01-26T17:22:45.411+0000 7fcb84c8e640 20 MonCap match
2025-01-26T17:22:45.411+0000 7fcb84c8e640 10 mon.a@0(leader).osd e423 MOSDMarkMeDown for: osd.0 [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879]
2025-01-26T17:22:45.411+0000 7fcb84c8e640 7 mon.a@0(leader).osd e423 prepare_update MOSDMarkMeDown(request_ack=1, down_and_dead=1, osd.0, [v2:172.21.15.123:6802/2553244879,v1:172.21.15.123:6803/2553244879], fsid=74c81e80-dc06-11ef-bb7f-bd4984dce30f) from osd.0 v2:172.21.15.123:6802/2553244879
2025-01-26T17:22:45.411+0000 7fcb84c8e640 0 log_channel(cluster) log [INF] : osd.0 marked itself down and dead
...
2025-01-26T17:22:45.703+0000 7fcb87493640 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
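Note that this job's description already includes the overrides/ignorelist_health fragment, yet the run still failed on OSD_DOWN; presumably that fragment does not yet contain a pattern covering this warning, which is what the ignorelist sketch in the description above would address.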
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-03-11_17:59:57-rados-wip-yuri11-testing-2025-03-11-0848-reef-distro-default-smithi/8181680
description: rados/cephadm/osds/{0-distro/ubuntu_22.04 0-nvme-loop 1-start 2-ops/rm-zap-add}
duration: 1388.6809134483337
failure_reason: '"2025-03-11T22:18:04.579236+0000 osd.1 (osd.1) 3 : cluster [WRN]
Monitor daemon marked osd.1 down, but it is still running" in cluster log'
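This failure is on a different cluster log warning than OSD_DOWN. If the "wrongly marked down" message is likewise expected in the rm-zap-add workflow, a companion entry, again only a sketch under the same assumptions as above, would be:

overrides:
  ceph:
    log-ignorelist:
      - but it is still running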
Updated by Laura Flores about 1 year ago
- Tags changed from test-failure, main-failures to test-failure, main-failures, cluster-log-warning
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 3956c4278abb8f3f97e1c48924bf741e64a68d82
- Fixed In set to v19.3.0-3795-g3956c4278ab
- Upkeep Timestamp set to 2025-07-09T13:47:19+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-3795-g3956c4278ab to v19.3.0-3795-g3956c4278a
- Upkeep Timestamp changed from 2025-07-09T13:47:19+00:00 to 2025-07-14T17:41:04+00:00
Updated by Jonathan Bailey 8 months ago
/a/skanta-2025-07-20_02:53:19-rados-wip-bharath6-testing-2025-07-20-0524-reef-distro-default-smithi/8397693
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2369
- Upkeep Timestamp changed from 2025-07-14T17:41:04+00:00 to 2025-11-01T00:58:01+00:00