Bug #68586
cephadm: task/test_iscsi_container test hits max timeout
Description
/a/yuriw-2024-10-15_14:06:51-rados-wip-yuri8-testing-2024-10-14-1103-distro-default-smithi/7948722
2024-10-15T23:18:27.687 DEBUG:teuthology.orchestra.run.smithi156:> sudo pkill -f 'journalctl -f -n 0 -u ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service'
2024-10-15T23:18:27.980 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 podman[103248]: 2024-10-15 23:18:27.649430921 +0000 UTC m=+0.567753660 container remove 9e8b3c285794692d6e42b3db8efad4f3ee9a44704a7ce8908d54b10f997a6077 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph@sha256:d7bbcb972af274fc68d56157a33d96ea068f505e42b2d9bd756605a2ddb5b1f2, name=ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a-osd-2-deactivate, CEPH_SHA1=1cb63f9c06f1a683e9d663e89d7306865bd51e03, org.label-schema.license=GPLv2, org.label-schema.schema-version=1.0, org.opencontainers.image.authors=Ceph Release Team <ceph-maintainers@ceph.io>, org.label-schema.name=CentOS Stream 9 Base Image, CEPH_GIT_REPO=https://github.com/ceph/ceph-ci.git, FROM_IMAGE=quay.io/centos/centos:stream9, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, io.buildah.version=1.37.2, OSD_FLAVOR=default, org.label-schema.vendor=CentOS, CEPH_REF=wip-yuri8-testing-2024-10-14-1103, org.label-schema.build-date=20241008, org.opencontainers.image.documentation=https://docs.ceph.com/)
2024-10-15T23:18:27.980 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service: Deactivated successfully.
2024-10-15T23:18:27.981 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: Stopped Ceph osd.2 for b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a.
2024-10-15T23:18:27.981 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service: Consumed 3.380s CPU time.
2024-10-15T23:18:28.841 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-10-15T23:18:28.841 INFO:tasks.cephadm.osd.2:Stopped osd.2
2024-10-15T23:18:28.841 DEBUG:teuthology.orchestra.run.smithi156:> sudo /home/ubuntu/cephtest/cephadm rm-cluster --fsid b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a --force --keep-logs
2024-10-16T07:00:06.067 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2024-10-16T07:00:06.091 DEBUG:teuthology.task.console_log:Killing console logger for smithi156
2024-10-16T07:00:06.091 DEBUG:teuthology.exit:Finished running handlers
Updated by Laura Flores over 1 year ago
- Related to Bug #67225: cephadm TLS/SSL connection has been closed added
Updated by Naveen Naidu over 1 year ago
/a/skanta-2024-09-25_00:14:38-rados-wip-bharath4-testing-2024-09-24-1154-distro-default-smithi/7918830
/a/skanta-2024-09-25_00:14:38-rados-wip-bharath4-testing-2024-09-24-1154-distro-default-smithi/7919099
Updated by Aishwarya Mathuria over 1 year ago
/a/yuriw-2024-10-13_19:06:13-rados-wip-yuri4-testing-2024-10-13-0836-distro-default-smithi/7944901
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-10-23_23:17:32-rados-wip-yuri13-testing-2024-10-23-0743-distro-default-smithi/7963779
Updated by Ilya Dryomov over 1 year ago · Edited
- Project changed from rbd to mgr
- Category set to orchestrator
Hi Laura,
"cephadm rm-cluster" might be hanging because tcmu-runner daemon continues to run, spinning on an error that has to do with the rest of the cluster getting destroyed:
2024-10-24T01:56:33.080 INFO:tasks.workunit:Stopping ['cephadm/test_iscsi_pids_limit.sh', 'cephadm/test_iscsi_etc_hosts.sh', 'cephadm/test_iscsi_setup.sh'] on client.0...
2024-10-24T01:56:33.080 DEBUG:teuthology.orchestra.run.smithi153:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2024-10-24T01:56:33.436 DEBUG:teuthology.parallel:result is None
2024-10-24T01:56:33.436 DEBUG:teuthology.orchestra.run.smithi153:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.462 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.462 DEBUG:teuthology.orchestra.run.smithi153:> rmdir -- /home/ubuntu/cephtest/mnt.0
2024-10-24T01:56:33.545 INFO:tasks.workunit:Deleted artificial mount point /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.545 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
2024-10-24T01:56:35.340 INFO:journalctl@ceph.mon.a.smithi153.stdout:Oct 24 01:56:34 smithi153 systemd[1]: Stopped Ceph mon.a for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:37.324 INFO:journalctl@ceph.mgr.a.smithi153.stdout:Oct 24 01:56:37 smithi153 systemd[1]: Stopped Ceph mgr.a for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:47.090 INFO:journalctl@ceph.osd.0.smithi153.stdout:Oct 24 01:56:46 smithi153 systemd[1]: Stopped Ceph osd.0 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:57.090 INFO:journalctl@ceph.osd.1.smithi153.stdout:Oct 24 01:56:56 smithi153 systemd[1]: Stopped Ceph osd.1 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:57:06.340 INFO:journalctl@ceph.osd.2.smithi153.stdout:Oct 24 01:57:05 smithi153 systemd[1]: Stopped Ceph osd.2 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24 01:56:27.921 7 [INFO] tcmu_acquire_dev_lock:490 rbd/foo.disk_2: Read lock acquisition successful
2024-10-24 01:57:23.536 7 [ERROR] tcmu_rbd_handle_timedout_cmd:1266 rbd/foo.disk_1: Timing out cmd.
2024-10-24 01:57:23.536 7 [ERROR] __tcmu_notify_conn_lost:218 rbd/foo.disk_1: Handler connection lost (lock state 3)
2024-10-24 01:57:23.543 7 [INFO] tgt_port_grp_recovery_work_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.
2024-10-24 01:57:53.545 7 [INFO] tcmu_rbd_close:1239 rbd/foo.disk_1: appended blocklist entry: {172.21.15.153:0/2131618207}
2024-10-24 02:02:53.550 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 02:07:54.557 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 02:12:55.562 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
...
2024-10-24 09:29:23.033 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 09:34:24.039 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 09:39:25.044 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
I think the issue is that qa/workunits/cephadm/test_iscsi_setup.sh doesn't clean up after itself. It adds two disks to the target, but never removes them.
@Adam King Moving this to cephadm.
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-11-13_00:17:56-rados-wip-yuri6-testing-2024-11-12-1317-distro-default-smithi/7992301
Updated by Laura Flores over 1 year ago
- Project changed from mgr to rbd
- Category deleted (orchestrator)
- Assignee deleted (Adam King)
Think this should actually be under rbd.
Updated by Ilya Dryomov over 1 year ago
Laura Flores wrote in #note-10:
Think this should actually be under rbd.
Hi Laura,
I moved it to mgr/orchestrator and tagged Adam in https://tracker.ceph.com/issues/68586#note-7 on purpose. qa/workunits/cephadm/test_iscsi_setup.sh doesn't really test iSCSI, but rather how the iSCSI container is set up (whether the settings that cephadm applies to it are as expected and whether the container engine respects them). The issue appears to be that qa/workunits/cephadm/test_iscsi_setup.sh is missing a cleanup step. Even though the fix should be trivial, the RBD component doesn't own this script. It's not part of the rbd suite, only the cephadm suite and, by extension, the rados suite.
Updated by Laura Flores over 1 year ago
- Project changed from rbd to Orchestrator
- Priority changed from Normal to High
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-11-20_16:10:40-rados-wip-yuri2-testing-2024-11-15-0902-distro-default-smithi/8001822
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-12-03_16:16:51-rados-wip-yuri6-testing-2024-12-02-1528-distro-default-smithi/8019067
Updated by Shraddha Agrawal over 1 year ago
/a/skanta-2024-12-12_03:27:32-rados-wip-bharath11-testing-2024-12-11-1511-distro-default-smithi/8031435
Updated by Laura Flores about 1 year ago
/a/yuriw-2024-12-18_15:56:21-rados-wip-yuri6-testing-2024-12-17-1653-distro-default-smithi/8043420
Updated by Naveen Naidu about 1 year ago
/a/skanta-2024-12-11_23:59:30-rados-wip-bharath9-testing-2024-12-10-1652-distro-default-smithi/8031139
Updated by Laura Flores about 1 year ago
/a/yuriw-2024-11-18_15:14:17-rados-wip-yuri3-testing-2024-11-14-0857-distro-default-smithi/7998075
Updated by Shraddha Agrawal about 1 year ago
/a/skanta-2024-10-24_23:59:35-rados-wip-bharath3-testing-2024-10-23-1509-distro-default-smithi/7965775
Updated by Shraddha Agrawal about 1 year ago
/a/skanta-2024-12-26_01:49:37-rados-wip-bharath12-testing-2024-12-24-0842-distro-default-smithi/8053521
Updated by Laura Flores about 1 year ago
/a/skanta-2024-12-22_00:38:58-rados-wip-bharath8-testing-2024-12-21-1707-distro-default-smithi/8047109
Updated by Laura Flores about 1 year ago
- Assignee set to Adam King
Hey @Adam King, can you take a look?
Updated by Aishwarya Mathuria about 1 year ago
/a/skanta-2025-01-26_07:44:24-rados-wip-bharath9-testing-2025-01-25-0527-distro-default-smithi/8094203
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-01-31_15:46:33-rados-wip-yuri5-testing-2025-01-30-1311-distro-default-smithi/8107032
Updated by Laura Flores about 1 year ago
/a/lflores-2025-01-31_21:01:32-rados-wip-bharath8-testing-2025-01-31-1851-distro-default-smithi/8107615
Updated by Laura Flores about 1 year ago
/a/lflores-2025-02-05_22:51:38-rados-wip-yuri5-testing-2025-01-30-1311-distro-default-smithi/8117742
Updated by Shraddha Agrawal about 1 year ago
/a/skanta-2025-02-05_10:08:28-rados-wip-bharath3-testing-2025-02-03-2127-distro-default-smithi/8115725
Updated by Adam King about 1 year ago
This appears to be an issue where `ceph-volume inventory` hangs, and `cephadm rm-cluster` then gets stuck waiting on the cephadm lock that the ceph-volume process is holding. On a node hitting this I saw:
root 101811 0.1 0.1 40360 33620 ? D 20:44 0:00 /usr/bin/python3 -s /usr/sbin/ceph-volume inventory --format=json
showing a ceph-volume process stuck in D state. dmesg reported
[ 1838.865126] iSCSI Login negotiation failed.
[ 1838.865136] connection1:0: detected conn error (1020)
[ 1838.865194] connection1:0: detected conn error (1020)
[ 1840.865626] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1840.865650] iSCSI Login negotiation failed.
[ 1840.865658] connection1:0: detected conn error (1020)
[ 1840.865682] connection1:0: detected conn error (1020)
[ 1842.866178] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1842.866202] iSCSI Login negotiation failed.
[ 1842.866212] connection1:0: detected conn error (1020)
[ 1842.866229] connection1:0: detected conn error (1020)
[ 1844.866628] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1844.866656] iSCSI Login negotiation failed.
[ 1844.866666] connection1:0: detected conn error (1020)
[ 1844.866691] connection1:0: detected conn error (1020)
[ 1845.458700] INFO: task ceph-volume:101811 blocked for more than 122 seconds.
[ 1845.458711] Not tainted 5.14.0-559.el9.x86_64 #68563
I'm unsure whether the iSCSI errors matter at that point, but you can see dmesg reporting that ceph-volume is blocked. It's not yet clear why ceph-volume is getting stuck like that.
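A quick way to pin down where a D-state task is blocked is to read its kernel stack from /proc. This is a minimal diagnostic sketch, not something the test does; it assumes root access on the stuck node, and the pid 101811 is just the ceph-volume process from the ps output above:

from pathlib import Path

def dump_kernel_stack(pid: int) -> str:
    # /proc/<pid>/stack lists the kernel frames the task is parked in,
    # e.g. a block-layer or iSCSI wait if the backing device went away.
    # Reading it requires root.
    return Path(f'/proc/{pid}/stack').read_text()

# Example: the blocked ceph-volume process reported by dmesg above.
print(dump_kernel_stack(101811))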
Updated by Jaya Prakash about 1 year ago
/a/akupczyk-2025-01-27_13:10:42-rados-aclamk-testing-ganymede-2025-01-24-0925-distro-default-smithi/8096466
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-02-05_21:36:43-rados-wip-yuri8-testing-2025-02-04-1046-distro-default-smithi/8117273
Updated by Aishwarya Mathuria about 1 year ago
/a/lflores-2025-02-07_20:42:43-rados-wip-yuri2-testing-2025-01-31-2325-distro-default-smithi/8120815
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-02-12_15:37:38-rados-wip-yuri8-testing-2025-02-10-2350-distro-default-smithi/8127067
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-02-21_18:09:35-rados-wip-pdonnell-testing-20250218.200348-debug-distro-default-smithi/8147350
Updated by Laura Flores about 1 year ago
/a/skanta-2025-03-01_01:42:21-rados-wip-bharath3-testing-2025-03-01-0356-distro-default-smithi/8162128
Updated by Laura Flores about 1 year ago
/a/skanta-2025-03-02_12:28:27-rados-wip-bharath8-testing-2025-03-02-0552-distro-default-smithi/8164443
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-02-27_21:53:52-rados-wip-yuri3-testing-2025-02-27-0658-distro-default-smithi/8159381
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-03-14_20:32:49-rados-wip-yuri7-testing-2025-03-11-0847-distro-default-smithi/8190487
Updated by Laura Flores about 1 year ago
/a/yuriw-2025-03-14_20:21:57-rados-wip-yuri13-testing-2025-03-14-0922-distro-default-smithi/8190139
Updated by Laura Flores 12 months ago
/a/yuriw-2025-03-20_14:48:30-rados-wip-yuri3-testing-2025-03-18-0732-distro-default-smithi/8199784
Updated by Laura Flores 12 months ago
/a/yuriw-2025-03-21_20:26:29-rados-wip-yuri7-testing-2025-03-21-0821-distro-default-smithi/8202022
Updated by Laura Flores 12 months ago
/a/yuriw-2025-03-27_15:03:25-rados-wip-yuri7-testing-2025-03-26-1605-distro-default-smithi/8213377
Updated by Jaya Prakash 12 months ago
/a/yuriw-2025-03-22_14:06:08-rados-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi/8202571
Updated by Jaya Prakash 12 months ago
/a/yuriw-2025-03-22_14:06:07-orch-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi
3 jobs: ['8202588', '8202550', '8202576']
Updated by Jaya Prakash 12 months ago
/a/yuriw-2025-03-21_20:27:17-orch-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi
3 jobs: ['8201516', '8202023', '8201778']
Updated by Jaya Prakash 12 months ago
/a/yuriw-2025-03-21_20:27:12-rados-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi/8202111
Updated by Laura Flores 12 months ago
/a/skanta-2025-04-04_06:10:17-rados-wip-bharath10-testing-2025-04-03-2112-distro-default-smithi/8223663
Updated by Laura Flores 11 months ago
/a/skanta-2025-04-05_15:49:33-rados-wip-bharath8-testing-2025-04-05-1439-distro-default-smithi/8225406
Updated by Kamoltat (Junior) Sirivadhna 11 months ago
/a/skanta-2025-04-09_05:31:19-rados-wip-bharath17-testing-2025-04-08-0602-distro-default-smithi/8233198/
Updated by Laura Flores 11 months ago
/a/lflores-2025-04-11_19:10:45-rados-wip-lflores-testing-3-2025-04-11-1140-distro-default-smithi/8236117
Updated by Kamoltat (Junior) Sirivadhna 11 months ago
suite watch: bump, ping @Adam King
Updated by Sridhar Seshasayee 11 months ago
/a/skanta-2025-04-22_23:21:15-rados-wip-bharath1-testing-2025-04-21-0529-distro-default-smithi/8254494
Updated by Aishwarya Mathuria 11 months ago
/a/yuriw-2025-04-14_18:07:07-rados-wip-yuri10-testing-2025-04-08-0710-distro-default-smithi/
['8239963', '8239944']
Updated by Aishwarya Mathuria 9 months ago
/a/skanta-2025-06-07_23:26:52-rados-wip-bharath5-testing-2025-06-02-2047-distro-default-smithi/8313574
Updated by Kamoltat (Junior) Sirivadhna 9 months ago
/a/skanta-2025-06-07_04:15:58-rados-wip-bharath8-testing-2025-06-02-1508-distro-default-smithi/8312622
Updated by Kamoltat (Junior) Sirivadhna 9 months ago
suite watch: bump, ping @Adam King
Updated by Sridhar Seshasayee 9 months ago
/a/skanta-2025-06-16_03:59:33-rados-wip-bharath10-testing-2025-06-15-0841-distro-default-smithi/8330490
Updated by Sridhar Seshasayee 9 months ago
/a/yuriw-2025-07-01_21:01:48-rados-wip-yuri11-testing-2025-07-01-1146-tentacle-distro-default-smithi/8365657
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-07-04_23:32:34-rados-wip-bharath13-testing-2025-07-04-0559-distro-default-smithi/8370619
Updated by Shraddha Agrawal 9 months ago
/a/skanta-2025-07-03_10:29:59-rados-wip-bharath5-testing-2025-06-30-2106-distro-default-smithi/8368528
Updated by Laura Flores 8 months ago
- Has duplicate Bug #69803: cephadm hangs trying to contact mgr that is down added
Updated by Shraddha Agrawal 8 months ago
/a/yuriw-2025-07-10_01:00:46-rados-wip-yuri-testing-2025-07-09-1458-tentacle-distro-default-smithi/8379410
Updated by Laura Flores 8 months ago
/a/yuriw-2025-07-10_23:00:33-rados-wip-yuri5-testing-2025-07-10-0913-distro-default-smithi/8381004
Updated by Aishwarya Mathuria 8 months ago
/a/skanta-2025-06-29_15:00:39-rados-wip-bharath1-testing-2025-06-28-2149-distro-default-smithi/8356814
Updated by Shraddha Agrawal 8 months ago
/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384540
Updated by Connor Fawcett 8 months ago
/a/skanta-2025-07-19_23:59:58-rados-wip-bharath5-testing-2025-07-18-0518-distro-default-smithi/8397510
Updated by Laura Flores 8 months ago
/a/skanta-2025-07-26_06:22:18-rados-wip-bharath9-testing-2025-07-26-0628-distro-default-smithi/8407538
Updated by Shraddha Agrawal 8 months ago
/a/skanta-2025-07-26_22:27:26-rados-wip-bharath7-testing-2025-07-26-0611-tentacle-distro-default-smithi/8409774
Updated by Laura Flores 8 months ago
/a/yuriw-2025-07-28_23:36:09-rados-tentacle-release-distro-default-smithi/8413652
Updated by Connor Fawcett 7 months ago
/a/skanta-2025-08-14_03:18:47-rados-wip-bharath4-testing-2025-08-13-0949-tentacle-distro-default-smithi/8442204
Updated by Aishwarya Mathuria 7 months ago
/a/skanta-2025-08-14_20:27:05-rados-wip-bharath5-testing-2025-08-13-0959-distro-default-smithi/8443390
Updated by Connor Fawcett 7 months ago
/a/skanta-2025-08-24_15:53:17-rados-wip-bharath4-testing-2025-08-24-0454-distro-default-smithi/8460753
Updated by Sridhar Seshasayee 7 months ago
/a/skanta-2025-08-24_23:24:05-rados-wip-bharath9-testing-2025-08-24-1258-tentacle-distro-default-smithi/8461798
Updated by Aishwarya Mathuria 7 months ago
/a/skanta-2025-08-21_23:24:45-rados-wip-bharath7-testing-2025-08-19-0959-distro-default-smithi/8457147
Updated by Connor Fawcett 7 months ago
/a/skanta-2025-08-31_23:44:30-rados-wip-bharath4-testing-2025-08-31-1138-distro-default-smithi/8474711
Updated by Jonathan Bailey 7 months ago
/a/skanta-2025-08-05_23:48:19-rados-wip-bharath1-testing-2025-08-05-0512-distro-default-smithi/8427217
/a/skanta-2025-08-05_10:12:24-rados-wip-bharath1-testing-2025-08-05-0512-distro-default-smithi/8424820
/a/skanta-2025-08-28_03:20:37-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8467901
/a/skanta-2025-08-27_01:46:19-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8466377
Updated by Connor Fawcett 6 months ago
/a/yuriw-2025-09-06_15:55:33-rados-wip-yuri3-testing-2025-09-04-1437-tentacle-distro-default-smithi/8484429
Updated by Laura Flores 6 months ago
/a/yuriw-2025-09-15_20:16:05-rados-wip-yuri-testing-2025-09-15-1029-tentacle-distro-default-smithi/8501803
Updated by Laura Flores 6 months ago
/a/skanta-2025-09-17_23:08:58-rados-wip-bharath8-testing-2025-09-17-1539-tentacle-distro-default-smithi/8507445
Updated by Aishwarya Mathuria 6 months ago
/a/yuriw-2025-09-24_18:46:32-rados-wip-yuri8-testing-2025-09-24-0752-tentacle-distro-default-smithi/8518480
Updated by Laura Flores 6 months ago
/a/yuriw-2025-09-18_21:29:32-rados-tentacle-release-distro-default-smithi/8510253
Updated by Aishwarya Mathuria 5 months ago
/a/skanta-2025-10-07_22:45:50-rados-wip-bharath1-testing-2025-10-06-2038-distro-default-smithi/8540424
Updated by Nitzan Mordechai 5 months ago
/a/skanta-2025-10-09_23:11:22-rados-wip-bharath3-testing-2025-10-09-0519-distro-default-smithi/8543846
Updated by Laura Flores 5 months ago
/a/yuriw-2025-10-15_20:55:26-rados-tentacle-release-distro-default-smithi/8554142
Updated by Laura Flores 5 months ago
/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8544002
Updated by Sridhar Seshasayee 5 months ago
/a/skanta-2025-10-24_12:45:03-rados-wip-bharath9-testing-2025-10-14-1426-tentacle-distro-default-smithi/8567564
Updated by Aishwarya Mathuria 5 months ago
/a/skanta-2025-11-01_01:03:27-rados-wip-bharath1-testing-2025-10-31-0445-distro-default-smithi/8578575
Updated by Lee Sanders 4 months ago
/a/skanta-2025-10-31_12:25:51-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8577900
How do we get some focus on this issue, please, @Laura Flores @Adam King @Kamoltat (Junior) Sirivadhna?
Updated by Laura Flores 4 months ago · Edited
Lee Sanders wrote in #note-89:
/a/skanta-2025-10-31_12:25:51-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8577900
How do we get some focus on this issue, please, @Laura Flores @Adam King @Kamoltat (Junior) Sirivadhna?
Attempting some analysis:
After checking the job descriptions on all the "task/test_iscsi_container" tests (on this ticket and from the nightlies), I saw that the tests only fail when "agent=on", such as in this description: rados/cephadm/workunits/{0-distro/centos_9.stream_runc agent/on mon_election/connectivity task/test_iscsi_container/{centos_9.stream test_iscsi_container}}
Here is a passing test with "agent/off": https://pulpito.ceph.com/teuthology-2025-11-09_20:00:24-rados-main-distro-default-smithi/8590621/
Here is a failing test with "agent/on": https://pulpito.ceph.com/teuthology-2025-11-02_20:00:23-rados-main-distro-default-smithi/8580054
Looking at this failed test:
/a/teuthology-2025-11-02_20:00:23-rados-main-distro-default-smithi/8580054
In cephadm.log, we send metadata to cephadm's agent endpoint successfully throughout the test:
2025-11-02 21:41:44,475 7fd1c2c97740 DEBUG sending query to https://172.21.15.88:7150/data
2025-11-02 21:41:44,498 7fd1c2c97740 INFO Received mgr response: "Successfully processed metadata." 0.023491 seconds after sending request.
2025-11-02 21:41:51,478 7fd1bbfff640 DEBUG Using specified config: /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:51,479 7fd1bbfff640 DEBUG Using specified fsid: a690d7e0-b833-11f0-8778-adfe0268badd
However, once we disable the cephadm module, connections start getting refused:
2025-11-02 21:41:57,902 7f8ff7e36740 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907', 'shell', '-c', '/etc/ceph/ceph.conf', '-k', '/etc/ceph/ceph.client.admin.keyring', '--fsid', 'a690d7e0-b833-11f0-8778-adfe0268badd', '--', 'ceph', 'mgr', 'module', 'disable', 'cephadm']
2025-11-02 21:41:57,932 7f8ff7e36740 INFO Inferring config /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Using specified fsid: a690d7e0-b833-11f0-8778-adfe0268badd
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Using specified config: /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Running command (timeout=None): /bin/podman run --rm --ipc=host --net=host --privileged --group-add=disk --init -i -e CONTAINER_IMAGE=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 -e NODE_NAME=smithi088 -v /var/run/ceph/a690d7e0-b833-11f0-8778-adfe0268badd:/var/run/ceph:z -v /var/log/ceph/a690d7e0-b833-11f0-8778-adfe0268badd:/var/log/ceph:z -v /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /etc/hosts:/etc/hosts:ro -v /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config:/etc/ceph/ceph.conf:z -v /etc/ceph/ceph.client.admin.keyring:/etc/ceph/ceph.keyring:z --entrypoint ceph quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 mgr module disable cephadm
2025-11-02 21:42:02,069 7fd1c0949640 DEBUG /usr/bin/podman: stdout a7722c22d3e5a38ba24675f606df7a33ca739cfa20e7e9da8f4bcb06c3d55526,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-0
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout c038d5074c4d3f40c945a6c7078c94aa9bfbd51e68607d62b534d3ec4beff8ad,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-1
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout b39ce01316d4ba63487504184e5f88f393d8cea42a8b8558e8dc125d9d22e496,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-2
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout f998d880cca6dc268e7de6b88dbb9b18902dfacf236df2b1f6db675b474d5501,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-iscsi-foo-smithi088-odbkoz
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout dd8453c0bfd1ec2e04807f1d9d5f11d66907b4dce1b2495eae23d47e2eb6681f,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-iscsi-foo-smithi088-odbkoz-tcmu
2025-11-02 21:42:02,174 7fd1c0949640 INFO Change detected in state of daemons. Running full daemon ls
2025-11-02 21:42:03,913 7fd1c2c97740 DEBUG sending query to https://172.21.15.88:7150/data
2025-11-02 21:42:03,915 7fd1c2c97740 DEBUG [Errno 111] Connection refused
2025-11-02 21:42:03,915 7fd1c2c97740 ERROR HTTP error -1 while querying agent endpoint: [Errno 111] Connection refused
2025-11-02 21:42:03,915 7fd1c2c97740 ERROR Failed to send metadata to mgr: non-200 response <-1> from agent endpoint: [Errno 111] Connection refused
So, why are we disabling the cephadm module when we do?
In teuthology.log, it looks like we intentionally disable the cephadm module after all the workunits have finished as a part of the normal teardown process:
2025-11-02T21:41:57.140 INFO:tasks.workunit:Stopping ['cephadm/test_iscsi_pids_limit.sh', 'cephadm/test_iscsi_etc_hosts.sh', 'cephadm/test_iscsi_setup.sh'] on client.0...
2025-11-02T21:41:57.140 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-11-02T21:41:57.206 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:56 smithi088 ceph-mon[34057]: pgmap v228: 33 pgs: 33 active+clean; 580 KiB data, 83 MiB used, 268 GiB / 268 GiB avail; 70 KiB/s rd, 3.3 KiB/s wr, 75 op/s
2025-11-02T21:41:57.495 DEBUG:teuthology.parallel:result is None
2025-11-02T21:41:57.496 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.561 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.561 DEBUG:teuthology.orchestra.run.smithi088:> rmdir -- /home/ubuntu/cephtest/mnt.0
2025-11-02T21:41:57.628 INFO:tasks.workunit:Deleted artificial mount point /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.628 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
2025-11-02T21:41:57.641 INFO:tasks.cephadm:Teardown begin
2025-11-02T21:41:57.642 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -f /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
2025-11-02T21:41:57.694 INFO:tasks.cephadm:Disabling cephadm mgr module
2025-11-02T21:41:57.694 DEBUG:teuthology.orchestra.run.smithi088:> sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid a690d7e0-b833-11f0-8778-adfe0268badd -- ceph mgr module disable cephadm
The attempts to send metadata to the agent endpoint continue for quite a while in the cephadm log during this teardown step. We need to understand why we are continuing to send metadata during teardown.
Right after disabling the cephadm module, we begin stopping all the daemons:
2025-11-02T21:41:58.007 INFO:tasks.cephadm:Stopping all daemons...
2025-11-02T21:41:58.007 INFO:tasks.cephadm.mon.a:Stopping mon.a...
2025-11-02T21:41:58.008 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mon.a
2025-11-02T21:41:58.357 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 systemd[1]: Stopping Ceph mon.a for a690d7e0-b833-11f0-8778-adfe0268badd...
2025-11-02T21:41:58.664 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 ceph-a690d7e0-b833-11f0-8778-adfe0268badd-mon-a[34053]: 2025-11-02T21:41:58.355+0000 7f4d450f2640 -1 received signal: Terminated from /run/podman-init -- /usr/bin/ceph-mon -n mon.a -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-journald=true --default-mon-cluster-log-to-stderr=false (PID: 1) UID: 0
2025-11-02T21:41:58.664 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 ceph-a690d7e0-b833-11f0-8778-adfe0268badd-mon-a[34053]: 2025-11-02T21:41:58.355+0000 7f4d450f2640 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
...
Eventually, the test fails when cephadm tries to remove the cluster:
2025-11-02T21:42:26.283 DEBUG:teuthology.orchestra.run.smithi088:> sudo /home/ubuntu/cephtest/cephadm rm-cluster --fsid a690d7e0-b833-11f0-8778-adfe0268badd --force --keep-logs
2025-11-03T05:23:05.792 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2025-11-03T05:23:05.823 DEBUG:teuthology.task.console_log:Killing console logger for smithi088
2025-11-03T05:23:05.825 DEBUG:teuthology.exit:Finished running handlers
I suspect that we’re stopping other services during teardown, but not the agent service, and the endpoint connection failures are interrupting teardown:
$ cat teuthology.log | grep "sudo systemctl stop"
...
2025-11-02T21:41:58.008 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mon.a
2025-11-02T21:42:00.277 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mgr.a
2025-11-02T21:42:02.250 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.0
2025-11-02T21:42:10.171 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.1
2025-11-02T21:42:18.148 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.2
Here's where we stop all the daemons in qa/tasks/cephadm.py:
log.info('Stopping all daemons...')

# this doesn't block until they are all stopped...
#ctx.cluster.run(args=['sudo', 'systemctl', 'stop', 'ceph.target'])

# stop the daemons we know
for role in ctx.daemons.resolve_role_list(None, CEPH_ROLE_TYPES, True):
    cluster, type_, id_ = teuthology.split_role(role)
    try:
        ctx.daemons.get_daemon(type_, id_, cluster).stop()
    except Exception:
        log.exception(f'Failed to stop "{role}"')
        raise
We're stopping all the daemons that we know via CEPH_ROLE_TYPES. This is defined at the top of the file:
CEPH_ROLE_TYPES = ['mon', 'mgr', 'osd', 'mds', 'rgw', 'prometheus']
I think the issue is that we're missing "agent" here. So, the task is unaware that it needs to stop the agent service during teardown.
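For reference, the change sketched below is roughly what the test branch tries, amounting to one line; whether the agent role actually resolves through ctx.daemons like the other roles do is an assumption:

# Rough sketch of the attempted fix: teach the teardown loop above about
# the agent role so it gets stopped like the other daemons. It is an
# assumption that ctx.daemons tracks the agent under this role name;
# the follow-up comments suggest the agent may need special handling.
CEPH_ROLE_TYPES = ['mon', 'mgr', 'osd', 'mds', 'rgw', 'prometheus', 'agent']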
I created a test branch here: https://github.com/ceph/ceph/compare/main...ljflores:ceph:wip-fix-iscsi-test
And I scheduled a test here: https://pulpito.ceph.com/lflores-2025-11-13_21:17:32-rados:cephadm:workunits-main-distro-default-smithi/8601828/
There might be more going on, but this is my hunch based on analyzing what I could see. At the time of writing this comment, the test has yet to finish, so I'll follow up to see the result and go from there.
Updated by Laura Flores 4 months ago
The test still failed in the same way (note that I stopped it early, since the test normally takes up to 8 hours to time out, which is a problem in itself): it again got stuck on removing the cluster. The agent service was not stopped by the fix I tried either, so I may not have gotten the way to stop it right.
Updated by Laura Flores 4 months ago · Edited
I scheduled a new test here: https://pulpito.ceph.com/lflores-2025-11-13_21:59:24-rados:cephadm:workunits-main-distro-default-smithi/8601889/
When the test got stuck, I ssh'ed to the machine and manually stopped the agent with:
[lflores@smithi077 ~]$ sudo systemctl list-units | grep "ceph-6d31444a-c0de-11f0-877d-adfe0268badd"
ceph-6d31444a-c0de-11f0-877d-adfe0268badd-sidecar@iscsi.foo.smithi077.qfiazg:tcmu.service loaded active running Ceph sidecar iscsi.foo.smithi077.qfiazg:tcmu for 6d31444a-c0de-11f0-877d-adfe0268badd
ceph-6d31444a-c0de-11f0-877d-adfe0268badd@agent.smithi077.service loaded active running cephadm agent for cluster 6d31444a-c0de-11f0-877d-adfe0268badd
ceph-6d31444a-c0de-11f0-877d-adfe0268badd@iscsi.foo.smithi077.qfiazg.service loaded active running Ceph iscsi.foo.smithi077.qfiazg for 6d31444a-c0de-11f0-877d-adfe0268badd
system-ceph\x2d6d31444a\x2dc0de\x2d11f0\x2d877d\x2dadfe0268badd.slice loaded active active Slice /system/ceph-6d31444a-c0de-11f0-877d-adfe0268badd
system-ceph\x2d6d31444a\x2dc0de\x2d11f0\x2d877d\x2dadfe0268badd\x2dsidecar.slice loaded active active Slice /system/ceph-6d31444a-c0de-11f0-877d-adfe0268badd-sidecar
ceph-6d31444a-c0de-11f0-877d-adfe0268badd.target loaded active active Ceph cluster 6d31444a-c0de-11f0-877d-adfe0268badd
[lflores@smithi077 ~]$ sudo systemctl stop ceph-6d31444a-c0de-11f0-877d-adfe0268badd@agent.smithi077.service
This instantly made the test pass, so that was definitely the problem. Now, it's just a matter of making the cephadm task do it properly.
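A minimal sketch of what that could look like in qa/tasks/cephadm.py, assuming the agent unit always follows the ceph-<fsid>@agent.<short hostname>.service pattern observed above; stop_cephadm_agents is a hypothetical helper, not existing task code:

def stop_cephadm_agents(ctx, fsid):
    # Hypothetical teardown step: stop the cephadm agent unit on each test
    # node before `cephadm rm-cluster` runs. Assumes the unit name pattern
    # seen above: ceph-<fsid>@agent.<short hostname>.service.
    for remote in ctx.cluster.remotes.keys():
        unit = f'ceph-{fsid}@agent.{remote.shortname}.service'
        remote.run(
            args=['sudo', 'systemctl', 'stop', unit],
            check_status=False,  # not every node necessarily runs an agent
        )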
Updated by Redouane Kachach Elhicou 4 months ago · Edited
Just to add that the agent is not a "normal" service. To disable the agent, you have to set the following config param:
ceph config set mgr mgr/cephadm/use_agent false
Once set to false, the agent service will be stopped and the daemons will be removed automatically by cephadm.
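In that case the teardown could flip this config and let cephadm remove the agent daemons itself before rm-cluster runs. A rough sketch driving the same command from Python, with the exact integration point in qa/tasks/cephadm.py left open:

import subprocess

# Hedged sketch: disable the agent via config, per the note above, so that
# cephadm itself stops and removes the agent daemons before teardown
# proceeds to `rm-cluster`. Run on a node with an admin keyring; this just
# drives the ceph CLI through `cephadm shell`, as seen in the logs above.
subprocess.run(
    ['sudo', 'cephadm', 'shell', '--',
     'ceph', 'config', 'set', 'mgr', 'mgr/cephadm/use_agent', 'false'],
    check=True,
)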
Updated by Sridhar Seshasayee 4 months ago
/a/skanta-2025-11-13_10:26:04-rados-wip-bharath3-testing-2025-11-12-2038-distro-default-smithi/8601373
Updated by Aishwarya Mathuria 4 months ago
/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639549
Updated by Laura Flores 3 months ago
- Status changed from New to Fix Under Review
- Assignee changed from Adam King to Laura Flores
- Pull request ID set to 66613
Updated by Laura Flores about 2 months ago
- Has duplicate Bug #74578: test_iscsi_container is running continuously without a defined exit criterion. added
Updated by Aishwarya Mathuria about 2 months ago
/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28569
Updated by Aishwarya Mathuria about 1 month ago
/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39114
Updated by Lee Sanders about 1 month ago
/a/skanta-2026-01-29_02:19:11-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/24699
Updated by Lee Sanders about 1 month ago
/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25713
Updated by Lee Sanders about 1 month ago
/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25728
Updated by Aishwarya Mathuria about 1 month ago
/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35652
Updated by Lee Sanders 26 days ago
/a/skanta-2026-02-07_14:54:11-rados-wip-bharath5-testing-2026-02-06-2052-distro-default-trial/39478
Updated by Sridhar Seshasayee 19 days ago
/a/sseshasa-2026-02-26_14:56:45-rados-wip-sseshasa-testing-2026-02-26-1772100687-distro-default-trial/72379
Updated by Aishwarya Mathuria 11 days ago
/a/skanta-2026-03-07_15:39:05-rados-wip-bharath4-testing-2026-03-05-1456-tentacle-distro-default-trial/93044
Updated by Nitzan Mordechai 11 days ago
/a/skanta-2026-03-08_04:44:53-rados-wip-bharath5-testing-2026-03-07-1422-distro-default-trial/94078
Updated by Sridhar Seshasayee 8 days ago
/a/skanta-2026-03-04_23:53:38-rados-wip-bharath1-testing-2026-03-04-1011-distro-default-trial/85629