Bug #75383

teuthology: delayed recreations of daemons after "ceph orch daemon rm"

Added by Vallari Agrawal 16 days ago. Updated 10 days ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:
0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

https://pulpito.ceph.com/yuriw-2026-03-05_16:03:53-nvmeof-wip-rocky10-branch-of-the-day-2026-03-04-1772633736-distro-default-trial/89441

Gateways did not come back up automatically after "ceph orch daemon rm":

2026-03-05T23:04:19.596 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.a
2026-03-05T23:04:19.596 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon rm nvmeof.nvmeof.a
2026-03-05T23:04:28.912 INFO:teuthology.orchestra.run.trial049.stdout:Removed nvmeof.nvmeof.a from host 'trial049'

2026-03-05T23:04:28.929 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.b
2026-03-05T23:04:28.929 DEBUG:teuthology.orchestra.run.trial156:> ceph orch daemon rm nvmeof.nvmeof.b
2026-03-05T23:04:38.888 INFO:teuthology.orchestra.run.trial156.stdout:Removed nvmeof.nvmeof.b from host 'trial156'

2026-03-05T23:04:38.904 INFO:tasks.nvmeof.[nvmeof.thrasher]:waiting for 60 secs before reviving

2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:done waiting before reviving - iteration #4: thrashed- nvmeof.a, nvmeof.b (by daemon_remove); 
2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:display and verify stats:
2026-03-05T23:05:38.905 DEBUG:teuthology.orchestra.run.trial049:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:
2026-03-05T23:05:38.953 DEBUG:teuthology.orchestra.run.trial156:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b.service - Ceph nvmeof.nvmeof.b for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.997 INFO:teuthology.orchestra.run.trial156.stdout:
2026-03-05T23:05:39.328 DEBUG:teuthology.orchestra.run.trial185:> ceph orch ps --daemon-type nvmeof
2026-03-05T23:05:39.437 INFO:journalctl@ceph.mon.b.trial156.stdout:Mar 05 23:05:39 trial156 ceph-mon[35873]: pgmap v1092: 33 pgs: 33 active+clean; 32 GiB data, 94 GiB used, 5.4 TiB / 5.5 TiB avail; 53 MiB/s rd, 53 MiB/s wr, 5.12k op/s
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:NAME             HOST      PORTS                   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.c  trial167  *:5500,4420,8009,10008  running (31m)     9m ago  32m    1142M        -  1.7.1      c72920b1f78f  b0192de07db2
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.d  trial185  *:5500,4420,8009,10008  error             6m ago  17m        -        -  <unknown>  <unknown>     <unknown>
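Checks against output like the `ceph orch ps --daemon-type nvmeof` listing above can be done programmatically. A minimal sketch (a hypothetical helper, not existing teuthology code) that parses the plain-text table and flags expected gateways that are absent or not in `running` state:

```python
def missing_or_error(orch_ps_output, expected):
    """Parse plain `ceph orch ps` output and return the expected daemon
    names that are absent from the table or whose STATUS column is not
    'running' (e.g. 'error' or '<unknown>')."""
    status = {}
    for line in orch_ps_output.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            status[fields[0]] = fields[3]          # NAME -> STATUS
    return sorted(d for d in expected if status.get(d) != "running")
```

Applied to the output above with the full expected set, this would report `nvmeof.nvmeof.a` (missing), `nvmeof.nvmeof.b` (missing), and `nvmeof.nvmeof.d` (error).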

2026-03-05T23:07:11.889 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon start nvmeof.nvmeof.a
2026-03-05T23:07:12.074 INFO:teuthology.orchestra.run.trial049.stderr:Error EINVAL: Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-05T23:07:12.079 DEBUG:teuthology.orchestra.run:got remote process result: 22
05T23:07:12.073+0000 7f271c3be6c0 -1 mgr.server reply reply (22) Invalid argument Unable to find nvmeof.nvmeof.a daemon(s)
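The EINVAL above happens because cephadm has removed the daemon entry but has not yet re-created it, so `ceph orch daemon start` has nothing to act on. One defensive pattern (a sketch with an injected command runner; names are hypothetical, this is not the thrasher's actual code) is to retry the start while re-creation is pending:

```python
import time

def start_daemon_with_retry(run, daemon, attempts=12, interval=10,
                            sleep=time.sleep):
    """Retry `ceph orch daemon start <daemon>` until it succeeds.

    `run` executes a command and returns its exit status (e.g. a wrapper
    around a teuthology remote). Exit status 22 (EINVAL, "Unable to find
    ... daemon(s)") means cephadm has not re-created the daemon yet, so
    wait and try again instead of failing on the first attempt.
    """
    for _ in range(attempts):
        if run(["ceph", "orch", "daemon", "start", daemon]) == 0:
            return True
        sleep(interval)
    return False
```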

At this point, the thrasher errors out and stops (i.e. no more stop/start commands are run). But in the logs we see that 38 seconds after the "start" command failed, nvmeof.a was automatically re-added (as expected after "ceph orch daemon rm"):

2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd={"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]} : dispatch
2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]}]': finished

2026-03-05T23:07:51.012 INFO:journalctl@ceph.nvmeof.nvmeof.a.trial049.stdout:Mar 05 23:07:50 trial049 systemd[1]: Starting ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460...

This means that after nvmeof.a was removed at 23:04:19, it was not re-created until 23:07:51, i.e. ~3.5 minutes later (the gateway has 900 namespaces). Hence we should probably increase the wait time between thrash/revive iterations.
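An alternative to simply raising the fixed 60 s delay would be to poll the orchestrator until the removed daemon reappears before reviving it. A rough sketch (the helper name, parameters, and timeout are assumptions, not existing thrasher config):

```python
import time

def wait_for_recreation(list_daemons, name, timeout=600, interval=10,
                        sleep=time.sleep):
    """Wait until `name` shows up again in the orchestrator's daemon list.

    `list_daemons` is any callable returning the current daemon names
    (in teuthology it would wrap `ceph orch ps`). With ~3.5 minutes
    observed for a 900-namespace gateway, a 10-minute timeout leaves
    headroom. Returns True if the daemon reappeared in time, else False.
    """
    waited = 0
    while waited <= timeout:
        if name in list_daemons():
            return True
        sleep(interval)
        waited += interval
    return False
```

This makes the thrasher robust to however long re-creation takes, instead of tying correctness to a tuned sleep.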


Related issues (1 open, 0 closed)

Related to nvme-of - Bug #75331: teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together (New, Vallari Agrawal)

Actions #1

Updated by Vallari Agrawal 16 days ago

  • Subject changed from teuthology: "ceph orch daemon rm" to teuthology: delayed recreations of daemons after "ceph orch daemon rm"
  • Description updated (diff)
Actions #2

Updated by Vallari Agrawal 16 days ago

  • Description updated (diff)
Actions #3

Updated by Vallari Agrawal 16 days ago

  • Related to Bug #75331: teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together added