Bug #75383
teuthology: delayed recreations of daemons after "ceph orch daemon rm"
% Done: 0%
Regression: No
Severity: 3 - minor
Description
Gateways did not come up automatically after "ceph orch daemon rm"?
2026-03-05T23:04:19.596 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.a
2026-03-05T23:04:19.596 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon rm nvmeof.nvmeof.a
2026-03-05T23:04:28.912 INFO:teuthology.orchestra.run.trial049.stdout:Removed nvmeof.nvmeof.a from host 'trial049'
2026-03-05T23:04:28.929 INFO:tasks.nvmeof.[nvmeof.thrasher]:kill nvmeof.b
2026-03-05T23:04:28.929 DEBUG:teuthology.orchestra.run.trial156:> ceph orch daemon rm nvmeof.nvmeof.b
2026-03-05T23:04:38.888 INFO:teuthology.orchestra.run.trial156.stdout:Removed nvmeof.nvmeof.b from host 'trial156'
2026-03-05T23:04:38.904 INFO:tasks.nvmeof.[nvmeof.thrasher]:waiting for 60 secs before reviving
2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:done waiting before reviving - iteration #4: thrashed- nvmeof.a, nvmeof.b (by daemon_remove);
2026-03-05T23:05:38.905 INFO:tasks.nvmeof.[nvmeof.thrasher]:display and verify stats:
2026-03-05T23:05:38.905 DEBUG:teuthology.orchestra.run.trial049:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.950 INFO:teuthology.orchestra.run.trial049.stdout:
2026-03-05T23:05:38.953 DEBUG:teuthology.orchestra.run.trial156:> sudo systemctl status ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:○ ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.b.service - Ceph nvmeof.nvmeof.b for ee248e3a-18e2-11f1-8cb2-d404e6e7d460
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Loaded: loaded (/etc/systemd/system/ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@.service; disabled; preset: disabled)
2026-03-05T23:05:38.996 INFO:teuthology.orchestra.run.trial156.stdout:     Active: inactive (dead)
2026-03-05T23:05:38.997 INFO:teuthology.orchestra.run.trial156.stdout:
2026-03-05T23:05:39.328 DEBUG:teuthology.orchestra.run.trial185:> ceph orch ps --daemon-type nvmeof
2026-03-05T23:05:39.437 INFO:journalctl@ceph.mon.b.trial156.stdout:Mar 05 23:05:39 trial156 ceph-mon[35873]: pgmap v1092: 33 pgs: 33 active+clean; 32 GiB data, 94 GiB used, 5.4 TiB / 5.5 TiB avail; 53 MiB/s rd, 53 MiB/s wr, 5.12k op/s
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:NAME             HOST      PORTS                   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.c  trial167  *:5500,4420,8009,10008  running (31m)  9m ago     32m  1142M    -        1.7.1      c72920b1f78f  b0192de07db2
2026-03-05T23:05:39.506 INFO:teuthology.orchestra.run.trial185.stdout:nvmeof.nvmeof.d  trial185  *:5500,4420,8009,10008  error          6m ago     17m  -        -        <unknown>  <unknown>     <unknown>
2026-03-05T23:07:11.889 DEBUG:teuthology.orchestra.run.trial049:> ceph orch daemon start nvmeof.nvmeof.a
2026-03-05T23:07:12.074 INFO:teuthology.orchestra.run.trial049.stderr:Error EINVAL: Unable to find nvmeof.nvmeof.a daemon(s)
2026-03-05T23:07:12.079 DEBUG:teuthology.orchestra.run:got remote process result: 22
2026-03-05T23:07:12.073+0000 7f271c3be6c0 -1 mgr.server reply reply (22) Invalid argument Unable to find nvmeof.nvmeof.a daemon(s)
At this point, the thrasher errors out and stops (i.e. no more stop/start commands are run). But in the logs, 38 seconds after the "start" command failed, we see nvmeof.a automatically being re-added (as expected after "ceph orch daemon rm"):
2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd={"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]} : dispatch
2026-03-05T23:07:50.356 INFO:journalctl@ceph.mon.c.trial167.stdout:Mar 05 23:07:50 trial167 ceph-mon[36179]: from='mgr.14150 10.20.193.49:0/1931802865' entity='mgr.x' cmd='[{"prefix": "auth get-or-create", "entity": "client.nvmeof.nvmeof.a", "caps": ["mon", "profile rbd", "osd", "profile rbd"]}]': finished
2026-03-05T23:07:51.012 INFO:journalctl@ceph.nvmeof.nvmeof.a.trial049.stdout:Mar 05 23:07:50 trial049 systemd[1]: Starting ceph-ee248e3a-18e2-11f1-8cb2-d404e6e7d460@nvmeof.nvmeof.a.service - Ceph nvmeof.nvmeof.a for ee248e3a-18e2-11f1-8cb2-d404e6e7d460...
This means that after nvmeof.a was removed at 23:04:19, it did not restart until 23:07:51, so recreation took ~3.5 minutes (this gateway has 900 namespaces). Hence we should probably increase the wait time between thrashing/reviving iterations.
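Rather than a longer fixed sleep, the thrasher could wait until cephadm has actually recreated the daemon before issuing "ceph orch daemon start". A minimal sketch (not the actual teuthology thrasher code; `probe` and both function names are hypothetical, and `probe` stands in for a check such as parsing `ceph orch ps` output):

```python
import time
from datetime import datetime


def observed_recreate_latency():
    """Gap between 'daemon rm' and automatic recreation, from the log above."""
    removed = datetime.fromisoformat("2026-03-05T23:04:19.596")
    recreated = datetime.fromisoformat("2026-03-05T23:07:51.012")
    return (recreated - removed).total_seconds()  # ~211 s, i.e. ~3.5 min


def wait_for_recreation(probe, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll probe() (True once the daemon is listed again) up to timeout secs.

    Returns True as soon as the daemon reappears, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

With a poll like this, the 60 s constant only sets a lower bound and the thrasher no longer races the orchestrator on gateways with many namespaces.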
Updated by Vallari Agrawal 16 days ago
- Subject changed from teuthology: "ceph orch daemon rm" to teuthology: delayed recreations of daemons after "ceph orch daemon rm"
- Description updated (diff)
Updated by Vallari Agrawal 16 days ago
- Related to Bug #75331: teuthology: CEPHADM_APPLY_SPEC_FAIL because 3/4 daemon did not come up after being removed together added