Bug #65728
Daemon managed by cephadm in an unknown state (CEPHADM_FAILED_DAEMON)
Status: open
Description
/a/yuriw-2024-04-20_15:32:38-rados-wip-yuriw-testing-20240419.185239-main-distro-default-smithi/7664960/remote/smithi045/log/8d9a18e8-ff41-11ee-bc93-c7b262605968/ceph-mon.a.log.gz
2024-04-20T18:19:18.046+0000 7f3e74eae700 20 mon.a@0(leader).mgrstat health checks:
{
    "CEPHADM_FAILED_DAEMON": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 failed cephadm daemon(s)",
            "count": 1
        },
        "detail": [
            {
                "message": "daemon alertmanager.smithi104 on smithi104 is in unknown state"
            }
        ]
    }
}
The warning cleared on its own about 19 seconds later:
2024-04-20T18:19:00.723654+0000 mon.a (mon.0) 774 : cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2024-04-20T18:19:01.777510+0000 mgr.a (mgr.14427) 39 : cluster [DBG] pgmap v16: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:02.024168+0000 mgr.a (mgr.14427) 40 : cluster [DBG] pgmap v17: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:02.024389+0000 mgr.a (mgr.14427) 41 : cluster [DBG] pgmap v18: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:04.024929+0000 mgr.a (mgr.14427) 42 : cluster [DBG] pgmap v19: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:06.025320+0000 mgr.a (mgr.14427) 43 : cluster [DBG] pgmap v20: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:08.025932+0000 mgr.a (mgr.14427) 44 : cluster [DBG] pgmap v21: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:10.026479+0000 mgr.a (mgr.14427) 45 : cluster [DBG] pgmap v22: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:12.026820+0000 mgr.a (mgr.14427) 46 : cluster [DBG] pgmap v23: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:12.046177+0000 mgr.a (mgr.14427) 47 : cluster [DBG] pgmap v24: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:12.046297+0000 mgr.a (mgr.14427) 48 : cluster [DBG] pgmap v25: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:14.046559+0000 mgr.a (mgr.14427) 49 : cluster [DBG] pgmap v26: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:16.047048+0000 mgr.a (mgr.14427) 50 : cluster [DBG] pgmap v27: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:18.047381+0000 mgr.a (mgr.14427) 51 : cluster [DBG] pgmap v28: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:18.047535+0000 mgr.a (mgr.14427) 52 : cluster [DBG] pgmap v29: 1 pgs: 1 active+clean; 577 KiB data, 85 MiB used, 268 GiB / 268 GiB avail
2024-04-20T18:19:19.045348+0000 mon.a (mon.0) 792 : cluster [INF] Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
2024-04-20T18:19:19.045384+0000 mon.a (mon.0) 793 : cluster [INF] Cluster is now healthy
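Since the warning is expected and self-clearing, the approach tracked in the related Bug #65521 is to ignorelist it in the affected suites rather than treat it as a failure. For completeness, the same health payload the mon dumps above can be read back from the CLI; a minimal sketch, assuming a reachable cluster and the JSON layout shown above (the helper name is my own):

#!/usr/bin/env python3
# Sketch only: pull CEPHADM_FAILED_DAEMON detail messages out of
# `ceph health detail --format json`. Assumes the payload matches the
# mgrstat dump above ({"checks": {"CEPHADM_FAILED_DAEMON": ...}}).
import json
import subprocess

def failed_daemon_messages():
    out = subprocess.check_output(
        ["ceph", "health", "detail", "--format", "json"])
    check = json.loads(out).get("checks", {}).get("CEPHADM_FAILED_DAEMON")
    if check is None:
        return []  # check not currently raised
    return [d["message"] for d in check.get("detail", [])]

if __name__ == "__main__":
    for msg in failed_daemon_messages():
        # e.g. "daemon alertmanager.smithi104 on smithi104 is in unknown state"
        print(msg)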
Updated by Laura Flores almost 2 years ago
- Related to Bug #65521: Add expected warnings in cluster log to ignorelists added
Updated by Laura Flores almost 2 years ago
- Subject changed from Alertmanager in an unknown state to Daemon managed by cephadm in an unknown state (CEPHADM_FAILED_DAEMON)
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652483
2024-04-12T01:05:08.125+0000 7f5bbef19700 20 mon.a@0(leader).mgrstat health checks:
{
    "CEPHADM_FAILED_DAEMON": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 failed cephadm daemon(s)",
            "count": 1
        },
        "detail": [
            {
                "message": "daemon osd.0 on smithi073 is in unknown state"
            }
        ]
    }
}
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652484
Updated by Matan Breizman almost 2 years ago · Edited
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707712
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707734
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707751
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707848
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707867
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707905
/a/yuriw-2024-05-15_21:09:29-rados-wip-yuri5-testing-2024-05-15-0804-distro-default-smithi/7707972
Updated by Laura Flores over 1 year ago
- Tags set to main-failures
/a/yuriw-2024-07-05_14:04:08-rados-wip-yuri3-testing-2024-07-01-1610-distro-default-smithi/7788683
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2024-07-04_14:11:56-rados-wip-yuri4-testing-2024-07-02-0909-distro-default-smithi/7787184/
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-07-17_13:32:02-rados-wip-yuri12-testing-2024-07-16-1122-distro-default-smithi/7805728
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-07-16_01:05:51-rados-wip-yuri6-testing-2024-07-15-1335-distro-default-smithi/7803318
Updated by Nitzan Mordechai over 1 year ago
/a/yuriw-2024-07-31_14:27:44-rados-wip-yuri7-testing-2024-07-30-0859-distro-default-smithi/7828635
/a/yuriw-2024-07-31_14:27:44-rados-wip-yuri7-testing-2024-07-30-0859-distro-default-smithi/7828639
/a/yuriw-2024-07-31_14:27:44-rados-wip-yuri7-testing-2024-07-30-0859-distro-default-smithi/7828628
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-27_06:56:34-rados-wip-bharath14-testing-2024-09-26-2119-squid-distro-default-smithi/7921638
Updated by Sridhar Seshasayee about 1 year ago
/a/skanta-2025-01-26_15:56:13-rados-wip-bharath13-testing-2025-01-25-2124-squid-distro-default-smithi/8094286
The test failure was caused by benign cluster warnings generated during OSD creation.
The osd.1 creation sequence, with the warnings it produced:
2025-01-26T17:07:18.441+0000 7fa418064640 10 mon.smithi046@0(leader).osd e10 prepare_command_osd_new found id 1 to use
2025-01-26T17:07:18.441+0000 7fa418064640 10 mon.smithi046@0(leader).osd e10 prepare_command_osd_new id 1 uuid 0e3eb9b1-b671-44c0-bf55-b227d82229cc
2025-01-26T17:07:18.441+0000 7fa418064640 10 mon.smithi046@0(leader).osd e10 prepare_command_osd_new has lockbox 0 dmcrypt 0
2025-01-26T17:07:18.441+0000 7fa418064640 10 mon.smithi046@0(leader).osd e10 prepare_command_osd_new validate secrets using osd id 1
2025-01-26T17:07:18.441+0000 7fa418064640 10 mon.smithi046@0(leader).auth v7 validate_osd_new osd.1 uuid 0e3eb9b1-b671-44c0-bf55-b227d82229cc
...
2025-01-26T17:07:18.441+0000 7fa418064640 2 mon.smithi046@0(leader).osd e10 osd.1 IN
...
2025-01-26T17:07:19.102+0000 7fa418064640 10 mon.smithi046@0(leader).config refresh_config crush_location for remote_host smithi046 is {root=default}
2025-01-26T17:07:19.102+0000 7fa418064640 20 mon.smithi046@0(leader).config refresh_config osd.1 crush {root=default} device_class
...
2025-01-26T17:07:29.855+0000 7fa418064640 10 mon.smithi046@0(leader).log v121 logging 2025-01-26T17:07:29.095219+0000 mgr.smithi046.itzjxc (mgr.14210) 42 : cephadm [INF] Deploying daemon osd.1 on smithi046
...
2025-01-26T17:07:37.088+0000 7fa41a869640 20 mon.smithi046@0(leader).osd e11 osd.1 laggy halflife 3600 decay_k -0.000192541 down for 18.646313 decay 0.996416
...
2025-01-26T17:07:38.235+0000 7fa418064640 20 mon.smithi046@0(leader) e1 entity_name osd.1 global_id 14228 (new_ok) caps allow profile osd
...
2025-01-26T17:07:38.516+0000 7fa418064640 20 mon.smithi046@0(leader).mgrstat health checks:
{
    "CEPHADM_FAILED_DAEMON": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 failed cephadm daemon(s)",
            "count": 1
        },
        "detail": [
            {
                "message": "daemon osd.1 on smithi046 is in unknown state"
            }
        ]
    }
}
...
2025-01-26T17:07:39.510+0000 7fa41a869640 0 log_channel(cluster) log [WRN] : Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
...
2025-01-26T17:07:51.049+0000 7fa41a869640 0 log_channel(cluster) log [INF] : Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
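The [WRN] line itself is what fails the run: the log scraper flags cluster-log warnings that no ignorelist entry matches, so adding CEPHADM_FAILED_DAEMON to the suite's ignorelist (the Bug #65521 approach) makes this benign window tolerable. A minimal sketch of that style of matching, my own simplification rather than teuthology's actual scraper:

import re

# Ignorelist entries behave as regexes matched against cluster log lines
# (simplified; names and structure here are my own, not teuthology's).
IGNORELIST = [re.compile(p) for p in [
    r"CEPHADM_FAILED_DAEMON",
]]

def unexpected_warnings(cluster_log_lines):
    """Yield [WRN]/[ERR] lines that no ignorelist pattern matches."""
    for line in cluster_log_lines:
        if "[WRN]" not in line and "[ERR]" not in line:
            continue
        if not any(p.search(line) for p in IGNORELIST):
            yield line

# The benign warning from this run would be filtered out:
wrn = ("2025-01-26T17:07:39.510+0000 ... [WRN] : Health check failed: "
       "1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)")
assert list(unexpected_warnings([wrn])) == []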
Updated by Laura Flores about 1 year ago
- Tags changed from main-failures to main-failures, cluster-log-warning
Updated by Sridhar Seshasayee 12 months ago
/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212866
Updated by Laura Flores 11 months ago · Edited
/a/skanta-2025-04-03_15:46:29-rados-wip-bharath5-testing-2025-04-03-1526-reef-distro-default-smithi/8222954
2025-04-03T18:31:27.755+0000 7fdd52e26640 10 --2- [v2:172.21.15.28:3300/0,v1:172.21.15.28:6789/0] >> [v2:172.21.15.136:3300/0,v1:172.21.15.136:6789/0] conn(0x55f46b5f5c00 0x55f46b7f4100 secure :-1 s=THROTTLE_DONE pgs=16 cs=0 l=0 rev1=1 crypto rx=0x55f46c53ce70 tx=0x55f46c86b800 comp rx=0 tx=0).handle_read_frame_dispatch tag=17
2025-04-03T18:31:27.755+0000 7fdd52e26640 5 --2- [v2:172.21.15.28:3300/0,v1:172.21.15.28:6789/0] >> [v2:172.21.15.136:3300/0,v1:172.21.15.136:6789/0] conn(0x55f46b5f5c00 0x55f46b7f4100 secure :-1 s=THROTTLE_DONE pgs=16 cs=0 l=0 rev1=1 crypto rx=0x55f46c53ce70 tx=0x55f46c86b800 comp rx=0 tx=0).handle_message got 1836 + 0 + 0 byte message. envelope type=46 src mon.1 off 0
2025-04-03T18:31:27.755+0000 7fdd52625640 20 mon.a@0(leader).mgrstat health checks:
{
    "CEPHADM_FAILED_DAEMON": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "1 failed cephadm daemon(s)",
            "count": 1
        },
        "detail": [
            {
                "message": "daemon mon.a on smithi028 is in unknown state"
            }
        ]
    }
}
2025-04-03T18:31:33.666159+0000 mon.a (mon.0) 19 : cluster [INF] mon.a calling monitor election
2025-04-03T18:31:33.688281+0000 mon.a (mon.0) 20 : cluster [INF] mon.a is new leader, mons a,b in quorum (ranks 0,1)
2025-04-03T18:31:33.698290+0000 mon.a (mon.0) 21 : cluster [DBG] monmap e2: 2 mons at {a=[v2:172.21.15.28:3300/0,v1:172.21.15.28:6789/0],b=[v2:172.21.15.136:3300/0,v1:172.21.15.136:6789/0]} removed_ranks: {} disallowed_leaders: {}
2025-04-03T18:31:33.718150+0000 mon.a (mon.0) 22 : cluster [DBG] fsmap
2025-04-03T18:31:33.718202+0000 mon.a (mon.0) 23 : cluster [DBG] osdmap e14: 2 total, 2 up, 2 in
2025-04-03T18:31:33.718662+0000 mon.a (mon.0) 24 : cluster [DBG] mgrmap e14: a(active, since 5m), standbys: b
2025-04-03T18:31:33.718901+0000 mon.a (mon.0) 25 : cluster [WRN] Health detail: HEALTH_WARN 1 failed cephadm daemon(s)
2025-04-03T18:31:33.718931+0000 mon.a (mon.0) 26 : cluster [WRN] [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
2025-04-03T18:31:33.718953+0000 mon.a (mon.0) 27 : cluster [WRN] daemon mon.a on smithi028 is in unknown state
2025-04-03T18:31:33.750674+0000 mgr.a (mgr.14150) 238 : cluster [DBG] pgmap v180: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:33.751049+0000 mgr.a (mgr.14150) 239 : cluster [DBG] pgmap v181: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:35.751526+0000 mgr.a (mgr.14150) 240 : cluster [DBG] pgmap v182: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:37.751900+0000 mgr.a (mgr.14150) 241 : cluster [DBG] pgmap v183: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:39.752368+0000 mgr.a (mgr.14150) 242 : cluster [DBG] pgmap v184: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:41.752882+0000 mgr.a (mgr.14150) 243 : cluster [DBG] pgmap v185: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:41.991448+0000 mgr.a (mgr.14150) 244 : cluster [DBG] pgmap v186: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:41.991753+0000 mgr.a (mgr.14150) 245 : cluster [DBG] pgmap v187: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:43.992068+0000 mgr.a (mgr.14150) 246 : cluster [DBG] pgmap v188: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:43.995251+0000 mgr.a (mgr.14150) 247 : cluster [DBG] pgmap v189: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:43.995405+0000 mgr.a (mgr.14150) 248 : cluster [DBG] pgmap v190: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:45.995867+0000 mgr.a (mgr.14150) 249 : cluster [DBG] pgmap v191: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:46.676834+0000 mgr.a (mgr.14150) 250 : cluster [DBG] pgmap v192: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:46.677282+0000 mgr.a (mgr.14150) 251 : cluster [DBG] pgmap v193: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:48.455586+0000 mgr.a (mgr.14150) 252 : cluster [DBG] pgmap v194: 0 pgs: ; 0 B data, 57 MiB used, 179 GiB / 179 GiB avail
2025-04-03T18:31:49.455932+0000 mon.a (mon.0) 51 : cluster [INF] Health check cleared: CEPHADM_FAILED_DAEMON (was: 1 failed cephadm daemon(s))
2025-04-03T18:31:49.455977+0000 mon.a (mon.0) 52 : cluster [INF] Cluster is now healthy
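As in the earlier instances, the check clears on its own (here within about 20 seconds) once cephadm next refreshes the daemon's state, so a consumer that races the refresh can simply wait for the check to drop out of `ceph health`. A hedged sketch; the function name, timeout, and interval are my own choices:

import json
import subprocess
import time

def wait_for_check_to_clear(check="CEPHADM_FAILED_DAEMON",
                            timeout=60.0, interval=5.0):
    """Poll `ceph health --format json` until `check` is gone (sketch only)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(["ceph", "health", "--format", "json"])
        if check not in json.loads(out).get("checks", {}):
            return True  # health check no longer raised
        time.sleep(interval)
    return False  # still raised after the timeout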
Updated by Laura Flores 11 months ago
- Related to Bug #67869: qa: cluster [WRN] Health check failed: 1 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON) with quiesce and fs/misc added
Updated by Laura Flores 6 months ago
/a/yuriw-2025-09-12_19:42:42-rados-wip-yuri3-testing-2025-09-12-0906-distro-default-smithi/8496826
Updated by Laura Flores 4 months ago
/a/lflores-2025-12-02_17:29:40-rados-wip-lflores-testing-4-2025-12-01-1527-distro-default-smithi/8636035
Updated by Laura Flores about 1 month ago
/a/yaarit-2026-02-05_17:05:15-rados:cephadm-wip-rocky10-branch-of-the-day-2026-02-03-1770151121-distro-default-trial/37003
Updated by Nitzan Mordechai 25 days ago
/a/nmordech-2026-02-25_11:35:39-rados:cephadm-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/69855
Updated by Laura Flores 23 days ago
/a/nmordech-2026-02-25_11:36:23-rados-wip-rocky10-branch-of-the-day-2026-02-24-1771941190-distro-default-trial/70007
Updated by Nitzan Mordechai 20 days ago
/a/yaarit-2026-02-26_20:19:38-rados:cephadm-wip-rocky10-branch-of-the-day-2026-02-26-1772108951-distro-default-trial/72747
Updated by Nitzan Mordechai 20 days ago
/a/yaarit-2026-02-26_20:20:34-rados-wip-rocky10-branch-of-the-day-2026-02-26-1772108951-distro-default-trial/
6 jobs: ['73027', '72858', '72899', '72998', '73007', '72766']
Updated by Nitzan Mordechai 18 days ago
/a/yaarit-2026-03-04_01:18:18-rados:cephadm-wip-rocky10-branch-of-the-day-2026-03-03-1772558532-distro-default-trial/79890
Updated by Laura Flores 16 days ago
/a/yaarit-2026-03-05_02:43:32-rados-wip-rocky10-branch-of-the-day-2026-03-04-1772633736-distro-default-trial/86361
Updated by Laura Flores 12 days ago
Description: upgrade/tentacle-x/stress-split/{0-distro/ubuntu_22.04 0-roles 1-start 2-first-half-tasks/snaps-few-objects 3-stress-tasks/{radosbench rbd-cls rbd-import-export rbd_api readwrite snaps-few-objects} 4-second-half-tasks/rbd-import-export mon_election/connectivity overrides/ignorelist_health}
/a/yuriw-2026-03-06_21:35:20-upgrade-wip-rocky10-branch-of-the-day-2026-03-04-1772633736-distro-default-trial/91875
Updated by Nitzan Mordechai 11 days ago
/a/yuriw-2026-03-09_21:03:52-rados-wip-rocky10-branch-of-the-day-2026-03-09-1773079353-tentacle-distro-default-trial/
2 jobs: ['96534', '96721']
Updated by Jaya Prakash 6 days ago
/a/jayaprakash-2026-03-06_10:20:34-rados-wip-jaya-bs-testing-06-03-2025-distro-default-trial/
2 jobs: ['90364', '90340']