Bug #75331

teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together

Added by Vallari Agrawal 17 days ago. Updated 15 days ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

Usually CEPHADM_APPLY_SPEC_FAIL is raised and resolves quickly, within a few seconds.

This one looks different:
https://pulpito.ceph.com/vallariag-2026-03-04_09:20:03-nvmeof-wip-tomer-revive-nvme-module-centos9-only-distro-default-trial/81579

The daemons didn't come back up on their own after 3 out of 4 gateways were removed with "ceph orch daemon rm <>". This only happened when the test removed 3 daemons in the same iteration, so it seems like a niche problem for when 3 daemons are removed together.

2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: Traceback (most recent call last):
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 115, in wrapper
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     return OrchResult(f(*args, **kwargs))
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 2735, in daemon_action
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     d = self.cache.get_daemon(daemon_name)
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/inventory.py", line 1319, in get_daemon
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     raise orchestrator.OrchestratorError(f'Unable to find {daemon_name} daemon(s)')
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: orchestrator._interface.OrchestratorError: Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: 2026-03-04T10:08:30.660+0000 7f21460bf640 -1 mgr.server reply reply (22) Invalid argument Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: 2026-03-04T10:08:30.880+0000 7f21460bf640 -1 log_channel(cephadm) log [ERR] : Unable to find nvmeof.nvmeof.b daemon(s)
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: Traceback (most recent call last):
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 115, in wrapper
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     return OrchResult(f(*args, **kwargs))
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 2735, in daemon_action
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     d = self.cache.get_daemon(daemon_name)
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/inventory.py", line 1319, in get_daemon
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     raise orchestrator.OrchestratorError(f'Unable to find {daemon_name} daemon(s)')
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: orchestrator._interface.OrchestratorError: Unable to find nvmeof.nvmeof.b daemon(s)
2026-03-04T10:08:30.955 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: 2026-03-04T10:08:30.880+0000 7f21460bf640 -1 mgr.server reply reply (22) Invalid argument Unable to find nvmeof.nvmeof.b daemon(s)

Related issues (1 open, 0 closed)

Related to nvme-of - Bug #75383: teuthology: delayed recreations of daemons after "ceph orch daemon rm" (New, Vallari Agrawal)

Actions #1

Updated by Vallari Agrawal 17 days ago

  • Description updated (diff)
Actions #2

Updated by Vallari Agrawal 17 days ago

  • Description updated (diff)
Actions #3

Updated by Vallari Agrawal 17 days ago

  • Description updated (diff)
Actions #4

Updated by Vallari Agrawal 17 days ago

  • Subject changed from teuthology: "Failed to apply 1 service(s): nvmeof.mypool.mygroup0 (CEPHADM_APPLY_SPEC_FAIL)" in cluster log to teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together
Actions #6

Updated by Vallari Agrawal 15 days ago

Let's look at what happened chronologically (only for one of the three gateways):

daemon killed:

2026-03-04T10:04:48.758 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.a
2026-03-04T10:04:48.758 DEBUG:teuthology.orchestra.run.trial032:> ceph orch daemon rm nvmeof.nvmeof.a

2026-03-04T10:05:28.431 INFO:tasks.nvmeof.[nvmeof.thrasher]:waiting for 89 secs before reviving

2026-03-04T10:06:57.432 INFO:tasks.nvmeof.[nvmeof.thrasher]:done waiting before reviving - iteration #4: thrashed- nvmeof.a, nvmeof.b, nvmeof.d (by daemon_remove); 
2026-03-04T10:06:57.432 INFO:tasks.nvmeof.[nvmeof.thrasher]:display and verify stats:
2026-03-04T10:06:57.433 DEBUG:teuthology.orchestra.run.trial032:> sudo systemctl status ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@nvmeof.nvmeof.a
2026-03-04T10:06:57.467 INFO:teuthology.orchestra.run.trial032.stdout:○ ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for 0a26c200-17ad-11f1-8222-d404e6e7d460
2026-03-04T10:06:57.467 INFO:teuthology.orchestra.run.trial032.stdout:     Loaded: loaded (/etc/systemd/system/ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-04T10:06:57.467 INFO:teuthology.orchestra.run.trial032.stdout:     Active: inactive (dead)
2026-03-04T10:06:57.467 INFO:teuthology.orchestra.run.trial032.stdout:

Daemon revival with "ceph orch daemon start" (mostly a formality; we expected the daemon to be back up automatically by now). This command instead caused a string of errors/tracebacks:

2026-03-04T10:08:30.470 DEBUG:teuthology.orchestra.run.trial032:> ceph orch daemon start nvmeof.nvmeof.a
2026-03-04T10:08:30.662 INFO:teuthology.orchestra.run.trial032.stderr:Unable to find nvmeof.nvmeof.a daemon(s)

04T10:08:30.660+0000 7f21460bf640 -1 log_channel(cephadm) log [ERR] : Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: Traceback (most recent call last):
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 115, in wrapper
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     return OrchResult(f(*args, **kwargs))
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 2735, in daemon_action
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     d = self.cache.get_daemon(daemon_name)
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:   File "/usr/share/ceph/mgr/cephadm/inventory.py", line 1319, in get_daemon
2026-03-04T10:08:30.953 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]:     raise orchestrator.OrchestratorError(f'Unable to find {daemon_name} daemon(s)')
2026-03-04T10:08:30.954 INFO:journalctl@ceph.mgr.x.trial032.stdout:Mar 04 10:08:30 trial032 ceph-0a26c200-17ad-11f1-8222-d404e6e7d460-mgr-x[40335]: orchestrator._interface.OrchestratorError: Unable to find nvmeof.nvmeof.a daemon(s)

2026-03-04T10:08:32.058 INFO:journalctl@ceph.mon.a.trial032.stdout:Mar 04 10:08:31 trial032 ceph-mon[529536]: Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-04T10:08:32.058 INFO:journalctl@ceph.mon.a.trial032.stdout:                                           Traceback (most recent call last):
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                             File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 115, in wrapper
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                               return OrchResult(f(*args, **kwargs))
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                             File "/usr/share/ceph/mgr/cephadm/module.py", line 2735, in daemon_action
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                               d = self.cache.get_daemon(daemon_name)
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                             File "/usr/share/ceph/mgr/cephadm/inventory.py", line 1319, in get_daemon
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                               raise orchestrator.OrchestratorError(f'Unable to find {daemon_name} daemon(s)')
2026-03-04T10:08:32.059 INFO:journalctl@ceph.mon.a.trial032.stdout:                                           orchestrator._interface.OrchestratorError: Unable to find nvmeof.nvmeof.a daemon(s)

2026-03-04T10:09:48.092 INFO:tasks.nvmeof.[nvmeof.thrasher]:done waiting before thrashing - everything should be up now
2026-03-04T10:09:48.092 INFO:tasks.nvmeof.[nvmeof.thrasher]:display and verify stats:
2026-03-04T10:09:48.092 DEBUG:teuthology.orchestra.run.trial032:> sudo systemctl status ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@nvmeof.nvmeof.a
2026-03-04T10:09:48.126 INFO:teuthology.orchestra.run.trial032.stdout:○ ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for 0a26c200-17ad-11f1-8222-d404e6e7d460
2026-03-04T10:09:48.126 INFO:teuthology.orchestra.run.trial032.stdout:     Loaded: loaded (/etc/systemd/system/ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-04T10:09:48.126 INFO:teuthology.orchestra.run.trial032.stdout:     Active: inactive (dead)
2026-03-04T10:09:48.126 INFO:teuthology.orchestra.run.trial032.stdout:

The failed-to-apply-spec error is probably also one of the errors caused by this, because "nvmeof.a" isn't back up yet:


2026-03-04T10:11:49.219 INFO:journalctl@ceph.mon.b.trial043.stdout:Mar 04 10:11:48 trial043 ceph-mon[499871]: Failed to apply nvmeof.mypool.mygroup0 spec NvmeofServiceSpec.from_json(yaml.safe_load('''service_type: nvmeof
2026-03-04T10:11:49.219 INFO:journalctl@ceph.mon.b.trial043.stdout:                                           service_id: mypool.mygroup0
2026-03-04T10:11:49.219 INFO:journalctl@ceph.mon.b.trial043.stdout:                                           service_name: nvmeof.mypool.mygroup0
2026-03-04T10:11:49.219 INFO:journalctl@ceph.mon.b.trial043.stdout:                                           placement:
2026-03-04T10:11:49.220 INFO:journalctl@ceph.mon.b.trial043.stdout:                                             hosts:
2026-03-04T10:11:49.220 INFO:journalctl@ceph.mon.b.trial043.stdout:                                             - trial032=nvmeof.a
...
2026-03-04T10:11:49.229 INFO:journalctl@ceph.mon.b.trial043.stdout:                                             verify_nqns: true
2026-03-04T10:11:49.229 INFO:journalctl@ceph.mon.b.trial043.stdout:                                           ''')): Cannot place <NvmeofServiceSpec for service_name=nvmeof.mypool.mygroup0> on trial105: Unknown hosts

daemon finally restarts:

2026-03-04T10:11:52.082 INFO:journalctl@ceph.mon.a.trial032.stdout:Mar 04 10:11:52 trial032 ceph-mon[529536]: from='mgr.14152 10.20.193.32:0/861149802' entity='mgr.x' cmd={"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]} : dispatch
2026-03-04T10:11:52.082 INFO:journalctl@ceph.mon.a.trial032.stdout:Mar 04 10:11:52 trial032 ceph-mon[529536]: from='mgr.14152 ' entity='mgr.x' cmd={"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]} : dispatch
2026-03-04T10:11:52.082 INFO:journalctl@ceph.mon.a.trial032.stdout:Mar 04 10:11:52 trial032 ceph-mon[529536]: from='mgr.14152 ' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]}]': finished

2026-03-04T10:11:52.368 INFO:journalctl@ceph.nvmeof.nvmeof.a.trial032.stdout:Mar 04 10:11:52 trial032 systemd[1]: Starting Ceph nvmeof.nvmeof.a for 0a26c200-17ad-11f1-8222-d404e6e7d460...

The gateway nvmeof.a was killed at 10:04:48 and came back up at 10:11:52 - almost 7 minutes later! The delay here could be due to:
1. 3 daemons being removed together
2. the mon thrasher running in parallel with the nvmeof thrasher
3. 900 namespaces being configured

So it's worth testing with two modifications:
1. A bigger thrashing/reviving wait time
2. Instead of using "ceph orch daemon start" to revive removed daemons, simply poll "systemctl status" until the unit shows as running
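The second modification could be sketched roughly as below. This is only an illustration of the polling idea, not actual teuthology code; the helper names are made up, and the injectable is_active hook is just there so the loop can be exercised without a live cluster (the real check would shell out to systemctl on the remote host).

```python
# Sketch: revive a removed daemon by waiting for its systemd unit to come
# back on its own, instead of issuing "ceph orch daemon start" (which can
# fail with "Unable to find ... daemon(s)" while the daemon is out of the
# cephadm cache, as seen in the logs above).
import subprocess
import time


def unit_is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active.

    "systemctl is-active --quiet" exits 0 only when the unit is active.
    """
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit],
        check=False,
    )
    return result.returncode == 0


def wait_for_daemon(unit: str, timeout: float = 600, interval: float = 10,
                    is_active=unit_is_active) -> bool:
    """Poll until the unit is active or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_active(unit):
            return True
        time.sleep(interval)
    return False


# Hypothetical usage with the unit name from the logs above:
# wait_for_daemon(
#     "ceph-0a26c200-17ad-11f1-8222-d404e6e7d460@nvmeof.nvmeof.a.service")
```

This sidesteps the race entirely: cephadm's serve loop recreates the daemon whenever it gets around to it, and the test just observes the unit state rather than issuing an orch command against a daemon that is temporarily missing from the cache.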

Actions #7

Updated by Vallari Agrawal 15 days ago · Edited

Another thing to fix:
After the "ceph orch daemon start nvmeof.nvmeof.a" command failed, the nvmeof_mon_thrash test didn't raise an exception and stop the test, whereas the nvmeof_thrash test did stop the test immediately.
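A rough sketch of the fail-fast behavior nvmeof_mon_thrash should also have. The helper name and the injectable runner are hypothetical (not the actual teuthology API); the runner parameter exists so the error path can be tested without a cluster.

```python
# Sketch: run "ceph orch daemon start" and raise immediately on failure,
# instead of silently continuing the way nvmeof_mon_thrash currently does.
import subprocess


def run_orch_daemon_start(daemon: str, runner=subprocess.run) -> None:
    """Start a daemon via the orchestrator; raise if the command fails."""
    result = runner(
        ["ceph", "orch", "daemon", "start", daemon],
        capture_output=True, text=True, check=False,
    )
    # "Unable to find ... daemon(s)" is reported on stderr in the logs
    # above, so check stderr as well as the exit code.
    if result.returncode != 0 or "Unable to find" in result.stderr:
        raise RuntimeError(
            f"'ceph orch daemon start {daemon}' failed: "
            f"{result.stderr.strip()}"
        )
```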

Actions #9

Updated by Vallari Agrawal 15 days ago

  • Related to Bug #75383: teuthology: delayed recreations of daemons after "ceph orch daemon rm" added
