Bug #72892

open

rados/cephadm - rm-cluster hung after mgr daemon was recovered

Added by Lee Sanders 7 months ago. Updated 3 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

2025-08-19T21:57:56.259 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:57:55 smithi164 ceph-mon[34295]: Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2025-08-19T21:57:56.259 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:57:55 smithi164 ceph-mon[34295]: Health check failed: 1 osds down (OSD_DOWN)
2025-08-19T21:57:56.259 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:57:55 smithi164 ceph-mon[34295]: Health check failed: 1 host (1 osds) down (OSD_HOST_DOWN)
2025-08-19T21:57:56.259 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:57:55 smithi164 ceph-mon[34295]: Health check failed: 1 root (1 osds) down (OSD_ROOT_DOWN)
2025-08-19T21:58:03.998 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:03 smithi164 ceph-mon[34295]: Health check cleared: OSD_DOWN (was: 1 osds down)
2025-08-19T21:58:03.998 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:03 smithi164 ceph-mon[34295]: Health check cleared: OSD_HOST_DOWN (was: 1 host (1 osds) down)
2025-08-19T21:58:03.998 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:03 smithi164 ceph-mon[34295]: Health check cleared: OSD_ROOT_DOWN (was: 1 root (1 osds) down)
2025-08-19T21:58:06.570 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:06 smithi164 ceph-mon[34295]: Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
2025-08-19T21:58:16.988 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:16 smithi164 ceph-mon[34295]: Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2025-08-19T21:58:22.258 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:58:22 smithi164 ceph-mon[34295]: Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
2025-08-19T21:59:14.008 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:59:13 smithi164 ceph-mon[34295]: Health check failed: 1 osds down (OSD_DOWN)
2025-08-19T21:59:22.078 INFO:journalctl@ceph.mon.a.smithi164.stdout:Aug 19 21:59:21 smithi164 ceph-mon[34295]: Health check cleared: OSD_DOWN (was: 1 osds down)

rados/cephadm test
found on main

2025-08-19T22:02:29.759 INFO:journalctl@ceph.osd.2.smithi164.stdout:Aug 19 22:02:29 smithi164 systemd[1]: Stopped Ceph osd.2 for 2f87e142-7d47-11f0-8741-adfe0268badd.
2025-08-19T22:02:29.759 INFO:journalctl@ceph.osd.2.smithi164.stdout:Aug 19 22:02:29 smithi164 systemd[1]: ceph-2f87e142-7d47-11f0-8741-adfe0268badd@osd.2.service: Consumed 3.089s CPU time.
2025-08-19T22:02:30.501 DEBUG:teuthology.orchestra.run:got remote process result: None
2025-08-19T22:02:30.501 INFO:tasks.cephadm.osd.2:Stopped osd.2
2025-08-19T22:02:30.502 DEBUG:teuthology.orchestra.run.smithi164:> sudo /home/ubuntu/cephtest/cephadm rm-cluster --fsid 2f87e142-7d47-11f0-8741-adfe0268badd --force --keep-logs
2025-08-20T05:45:00.800 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2025-08-20T05:45:00.858 DEBUG:teuthology.task.console_log:Killing console logger for smithi164
2025-08-20T05:45:00.859 DEBUG:teuthology.exit:Finished running handlers

rm-cluster was issued after CEPHADM_FAILED_DAEMON was cleared (i.e., after the mgr daemon had recovered), and the rm-cluster command hung: it was still running when the job was killed by signal 15 roughly 7.5 hours later.

Not convinced this is a duplicate of https://tracker.ceph.com/issues/68586 or https://tracker.ceph.com/issues/69803, as the mgr daemon was up and running at the time.
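A hang like this only surfaces when teuthology's global timeout kills the whole job. One way to turn it into an immediate, detectable failure would be to wrap the rm-cluster call in timeout(1), which exits with status 124 when it kills the command. A minimal sketch of that pattern, using `sleep 10` as a stand-in for the hanging `cephadm rm-cluster` invocation and a hypothetical 2-second limit:

```shell
# Stand-in for the hanging rm-cluster call from the log above; in the real
# run this would be e.g.:
#   timeout 1h sudo .../cephadm rm-cluster --fsid <fsid> --force --keep-logs
# (the 1h limit is an assumption, not something teuthology does today).
timeout 2 sleep 10
rc=$?
# timeout(1) exits 124 when the command was killed for exceeding the limit.
if [ "$rc" -eq 124 ]; then
    echo "rm-cluster hung"
fi
```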

/a/yuriw-2025-08-19_14:49:40-rados-wip-yuri-testing-2025-08-18-1127-distro-default-smithi/8451583
https://pulpito.ceph.com/yuriw-2025-08-19_14:49:40-rados-wip-yuri-testing-2025-08-18-1127-distro-default-smithi/8451583

Actions #1

Updated by Lee Sanders 4 months ago

  • Project changed from Ceph to Orchestrator

Another occurrence here:
/a/skanta-2025-11-01_01:12:28-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8578605

Actions #2

Updated by Lee Sanders 4 months ago

/a/skanta-2025-11-01_01:12:28-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8578605

Actions #3

Updated by Laura Flores about 2 months ago

  • Description updated (diff)

Actions #4

Updated by Laura Flores about 2 months ago

/a/lflores-2026-01-26_23:21:06-rados-wip-yuri12-testing-2026-01-22-2045-distro-default-trial/19091

Actions #5

Updated by Nitzan Mordechai about 2 months ago

/a/yuriw-2026-01-29_18:33:05-rados-wip-yuri2-testing-2026-01-28-1643-tentacle-distro-default-trial/26508

Actions #6

Updated by Nitzan Mordechai about 1 month ago

/a/yuriw-2026-02-04_23:08:40-rados-wip-yuri3-testing-2026-02-04-1948-tentacle-distro-default-trial/35396

Actions #7

Updated by Nitzan Mordechai about 1 month ago

/a/skanta-2026-02-02_23:43:28-rados-wip-bharath9-testing-2026-02-02-0839-distro-default-trial/30412

Actions #8

Updated by Nitzan Mordechai 3 days ago

/a/yaarit-2026-03-19_02:36:58-rados:cephadm-wip-rocky10-branch-of-the-day-2026-03-18-1773834065-tentacle-distro-default-trial/
7 jobs: ['108801', '108788', '108763', '108838', '108776', '108851', '108813']
