Bug #75383

teuthology: delayed recreations of daemons after "ceph orch daemon rm"

Added by Vallari Agrawal 16 days ago. Updated 10 days ago.

Status:
New
Priority:
Normal
Target version:
-
% Done:
0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

https://pulpito.ceph.com/yuriw-2026-03-05_16:03:53-nvmeof-wip-rocky10-branch-of-the-day-2026-03-04-1772633736-distro-default-trial/89441

Gateways did not come back up automatically after "ceph orch daemon rm":

2026-03-05T23:04:19.596 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.a
2026-03-05T23:04:19.596 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon rm nvmeof.nvmeof.a
2026-03-05T23:04:28.912 INFO:teuthology.orchestra.run.trial049.stdout:Removed nvmeof.nvmeof.a from host 'trial049'

2026-03-05T23:04:28.929 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.b
2026-03-05T23:04:28.929 DEBUG:teuthology.orchestra.run.trial156:> ceph orch daemon rm nvmeof.nvmeof.b
2026-03-05T23:04:38.888 INFO:teuthology.orchestra.run.trial156.stdout:Removed nvmeof.nvmeof.b from host 'trial156'

2026-03-05T23:04:38.904 INFO:tasks.nvmeof.[nvmeof.thrasher]:waiting for 60 secs before reviving

2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:done waiting before reviving - iteration #4: thrashed- nvmeof.a, nvmeof.b (by daemon_remove); 
2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:display and verify stats:
2026-03-05T23:05:38.905 DEBUG:teuthology.orchestra.run.trial049:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:
2026-03-05T23:05:38.953 DEBUG:teuthology.orchestra.run.trial156:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b.service - Ceph nvmeof.nvmeof.b for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.997 INFO:teuthology.orchestra.run.trial156.stdout:
2026-03-05T23:05:39.328 DEBUG:teuthology.orchestra.run.trial185:> ceph orch ps --daemon-type nvmeof
2026-03-05T23:05:39.437 INFO:journalctl@ceph.mon.b.trial156.stdout:Mar 05 23:05:39 trial156 ceph-mon[35873]: pgmap v1092: 33 pgs: 33 active+clean; 32 GiB data, 94 GiB used, 5.4 TiB / 5.5 TiB avail; 53 MiB/s rd, 53 MiB/s wr, 5.12k op/s
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:NAME             HOST      PORTS                   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.c  trial167  *:5500,4420,8009,10008  running (31m)     9m ago  32m    1142M        -  1.7.1      c72920b1f78f  b0192de07db2
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.d  trial185  *:5500,4420,8009,10008  error             6m ago  17m        -        -  <unknown>  <unknown>     <unknown>
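Checks against output like the `ceph orch ps --daemon-type nvmeof` listing above can be done programmatically. A minimal sketch (a hypothetical helper, not existing teuthology code) that parses the plain-text table and flags expected gateways that are absent or not in `running` state:

```python
def missing_or_error(orch_ps_output, expected):
    """Parse plain `ceph orch ps` output and return the expected daemon
    names that are absent from the table or whose STATUS column is not
    'running' (e.g. 'error' or '<unknown>')."""
    status = {}
    for line in orch_ps_output.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            status[fields[0]] = fields[3]          # NAME -> STATUS
    return sorted(d for d in expected if status.get(d) != "running")
```

Applied to the output above with the full expected set, this would report `nvmeof.nvmeof.a` (missing), `nvmeof.nvmeof.b` (missing), and `nvmeof.nvmeof.d` (error).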

2026-03-05T23:07:11.889 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon start nvmeof.nvmeof.a
2026-03-05T23:07:12.074 INFO:teuthology.orchestra.run.trial049.stderr:Error EINVAL: Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-05T23:07:12.079 DEBUG:teuthology.orchestra.run:got remote process result: 22
05T23:07:12.073+0000 7f271c3be6c0 -1 mgr.server reply reply (22) Invalid argument Unable to find nvmeof.nvmeof.a daemon(s)
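The EINVAL above happens because cephadm has removed the daemon entry but has not yet re-created it, so `ceph orch daemon start` has nothing to act on. One defensive pattern (a sketch with an injected command runner; names are hypothetical, this is not the thrasher's actual code) is to retry the start while re-creation is pending:

```python
import time

def start_daemon_with_retry(run, daemon, attempts=12, interval=10,
                            sleep=time.sleep):
    """Retry `ceph orch daemon start <daemon>` until it succeeds.

    `run` executes a command and returns its exit status (e.g. a wrapper
    around a teuthology remote). Exit status 22 (EINVAL, "Unable to find
    ... daemon(s)") means cephadm has not re-created the daemon yet, so
    wait and try again instead of failing on the first attempt.
    """
    for _ in range(attempts):
        if run(["ceph", "orch", "daemon", "start", daemon]) == 0:
            return True
        sleep(interval)
    return False
```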

At this point, the thrasher errors out and stops (i.e. no more stop/start commands are run). But in the logs we see that 38 seconds after the "start" command failed, nvmeof.a was automatically re-added (as expected after "ceph orch daemon rm"):

2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd={"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]} : dispatch
2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]}]': finished

2026-03-05T23:07:51.012 INFO:journalctl@ceph.nvmeof.nvmeof.a.trial049.stdout:Mar 05 23:07:50 trial049 systemd[1]: Starting ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460...

This means that after nvmeof.a was removed at 23:04:19, it was not re-created until 23:07:51, i.e. ~3.5 minutes later (the gateway has 900 namespaces). Hence we should probably increase the wait time between thrash/revive iterations.
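An alternative to simply raising the fixed 60 s delay would be to poll the orchestrator until the removed daemon reappears before reviving it. A rough sketch (the helper name, parameters, and timeout are assumptions, not existing thrasher config):

```python
import time

def wait_for_recreation(list_daemons, name, timeout=600, interval=10,
                        sleep=time.sleep):
    """Wait until `name` shows up again in the orchestrator's daemon list.

    `list_daemons` is any callable returning the current daemon names
    (in teuthology it would wrap `ceph orch ps`). With ~3.5 minutes
    observed for a 900-namespace gateway, a 10-minute timeout leaves
    headroom. Returns True if the daemon reappeared in time, else False.
    """
    waited = 0
    while waited <= timeout:
        if name in list_daemons():
            return True
        sleep(interval)
        waited += interval
    return False
```

This makes the thrasher robust to however long re-creation takes, instead of tying correctness to a tuned sleep.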


Related issues (1 open, 0 closed)

Related to nvme-of - Bug #75331: teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together (New, Vallari Agrawal)

Actions #1

Updated by Vallari Agrawal 16 days ago

  • Subject changed from teuthology: "ceph orch daemon rm" to teuthology: delayed recreations of daemons after "ceph orch daemon rm"
  • Description updated (diff)
Actions #2

Updated by Vallari Agrawal 16 days ago

  • Description updated (diff)
Actions #3

Updated by Vallari Agrawal 16 days ago

  • Related to Bug #75331: teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together added