Bug #68586

cephadm: task/test_iscsi_container test hits max timeout

Added by Laura Flores over 1 year ago. Updated 8 days ago.

Status: Fix Under Review
Priority: High
Assignee: Laura Flores
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Tags (freeform): main-failures
Pull request ID: 66613

Description

/a/yuriw-2024-10-15_14:06:51-rados-wip-yuri8-testing-2024-10-14-1103-distro-default-smithi/7948722

2024-10-15T23:18:27.687 DEBUG:teuthology.orchestra.run.smithi156:> sudo pkill -f 'journalctl -f -n 0 -u ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service'
2024-10-15T23:18:27.980 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 podman[103248]: 2024-10-15 23:18:27.649430921 +0000 UTC m=+0.567753660 container remove 9e8b3c285794692d6e42b3db8efad4f3ee9a44704a7ce8908d54b10f997a6077 (image=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph@sha256:d7bbcb972af274fc68d56157a33d96ea068f505e42b2d9bd756605a2ddb5b1f2, name=ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a-osd-2-deactivate, CEPH_SHA1=1cb63f9c06f1a683e9d663e89d7306865bd51e03, org.label-schema.license=GPLv2, org.label-schema.schema-version=1.0, org.opencontainers.image.authors=Ceph Release Team <ceph-maintainers@ceph.io>, org.label-schema.name=CentOS Stream 9 Base Image, CEPH_GIT_REPO=https://github.com/ceph/ceph-ci.git, FROM_IMAGE=quay.io/centos/centos:stream9, GANESHA_REPO_BASEURL=https://buildlogs.centos.org/centos/$releasever-stream/storage/$basearch/nfsganesha-5/, io.buildah.version=1.37.2, OSD_FLAVOR=default, org.label-schema.vendor=CentOS, CEPH_REF=wip-yuri8-testing-2024-10-14-1103, org.label-schema.build-date=20241008, org.opencontainers.image.documentation=https://docs.ceph.com/)
2024-10-15T23:18:27.980 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service: Deactivated successfully.
2024-10-15T23:18:27.981 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: Stopped Ceph osd.2 for b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a.
2024-10-15T23:18:27.981 INFO:journalctl@ceph.osd.2.smithi156.stdout:Oct 15 23:18:27 smithi156 systemd[1]: ceph-b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a@osd.2.service: Consumed 3.380s CPU time.
2024-10-15T23:18:28.841 DEBUG:teuthology.orchestra.run:got remote process result: None
2024-10-15T23:18:28.841 INFO:tasks.cephadm.osd.2:Stopped osd.2
2024-10-15T23:18:28.841 DEBUG:teuthology.orchestra.run.smithi156:> sudo /home/ubuntu/cephtest/cephadm rm-cluster --fsid b48153fa-8b4a-11ef-bb99-d5e06f7e0c9a --force --keep-logs
2024-10-16T07:00:06.067 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2024-10-16T07:00:06.091 DEBUG:teuthology.task.console_log:Killing console logger for smithi156
2024-10-16T07:00:06.091 DEBUG:teuthology.exit:Finished running handlers


Related issues (3: 1 open, 2 closed)

Related to mgr - Bug #67225: cephadm TLS/SSL connection has been closed (New, Adam King)
Has duplicate Orchestrator - Bug #69803: cephadm hangs trying to contact mgr that is down (Duplicate)
Has duplicate RADOS - Bug #74578: test_iscsi_container is running continuously without a defined exit criterion (Duplicate, Laura Flores)
Actions #1

Updated by Laura Flores over 1 year ago

  • Tags set to main-failures
Actions #2

Updated by Laura Flores over 1 year ago

  • Related to Bug #67225: cephadm TLS/SSL connection has been closed added
Actions #3

Updated by Naveen Naidu over 1 year ago

/a/skanta-2024-09-25_00:14:38-rados-wip-bharath4-testing-2024-09-24-1154-distro-default-smithi/7918830
/a/skanta-2024-09-25_00:14:38-rados-wip-bharath4-testing-2024-09-24-1154-distro-default-smithi/7919099

Actions #4

Updated by Aishwarya Mathuria over 1 year ago

/a/yuriw-2024-10-13_19:06:13-rados-wip-yuri4-testing-2024-10-13-0836-distro-default-smithi/7944901

Actions #5

Updated by Laura Flores over 1 year ago

  • Project changed from RADOS to rbd
Actions #6

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-10-23_23:17:32-rados-wip-yuri13-testing-2024-10-23-0743-distro-default-smithi/7963779

Actions #7

Updated by Ilya Dryomov over 1 year ago · Edited

  • Project changed from rbd to mgr
  • Category set to orchestrator

Hi Laura,

"cephadm rm-cluster" might be hanging because tcmu-runner daemon continues to run, spinning on an error that has to do with the rest of the cluster getting destroyed:

2024-10-24T01:56:33.080 INFO:tasks.workunit:Stopping ['cephadm/test_iscsi_pids_limit.sh', 'cephadm/test_iscsi_etc_hosts.sh', 'cephadm/test_iscsi_setup.sh'] on client.0...
2024-10-24T01:56:33.080 DEBUG:teuthology.orchestra.run.smithi153:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2024-10-24T01:56:33.436 DEBUG:teuthology.parallel:result is None
2024-10-24T01:56:33.436 DEBUG:teuthology.orchestra.run.smithi153:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.462 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.462 DEBUG:teuthology.orchestra.run.smithi153:> rmdir -- /home/ubuntu/cephtest/mnt.0
2024-10-24T01:56:33.545 INFO:tasks.workunit:Deleted artificial mount point /home/ubuntu/cephtest/mnt.0/client.0
2024-10-24T01:56:33.545 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
2024-10-24T01:56:35.340 INFO:journalctl@ceph.mon.a.smithi153.stdout:Oct 24 01:56:34 smithi153 systemd[1]: Stopped Ceph mon.a for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:37.324 INFO:journalctl@ceph.mgr.a.smithi153.stdout:Oct 24 01:56:37 smithi153 systemd[1]: Stopped Ceph mgr.a for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:47.090 INFO:journalctl@ceph.osd.0.smithi153.stdout:Oct 24 01:56:46 smithi153 systemd[1]: Stopped Ceph osd.0 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:56:57.090 INFO:journalctl@ceph.osd.1.smithi153.stdout:Oct 24 01:56:56 smithi153 systemd[1]: Stopped Ceph osd.1 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24T01:57:06.340 INFO:journalctl@ceph.osd.2.smithi153.stdout:Oct 24 01:57:05 smithi153 systemd[1]: Stopped Ceph osd.2 for 319f79ec-91aa-11ef-bb9a-d5e06f7e0c9a.
2024-10-24 01:56:27.921 7 [INFO] tcmu_acquire_dev_lock:490 rbd/foo.disk_2: Read lock acquisition successful
2024-10-24 01:57:23.536 7 [ERROR] tcmu_rbd_handle_timedout_cmd:1266 rbd/foo.disk_1: Timing out cmd.
2024-10-24 01:57:23.536 7 [ERROR] __tcmu_notify_conn_lost:218 rbd/foo.disk_1: Handler connection lost (lock state 3)
2024-10-24 01:57:23.543 7 [INFO] tgt_port_grp_recovery_work_fn:245: Disabled iscsi/iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw/tpgt_1.
2024-10-24 01:57:53.545 7 [INFO] tcmu_rbd_close:1239 rbd/foo.disk_1: appended blocklist entry: {172.21.15.153:0/2131618207}
2024-10-24 02:02:53.550 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 02:07:54.557 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 02:12:55.562 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
...
2024-10-24 09:29:23.033 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 09:34:24.039 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)
2024-10-24 09:39:25.044 7 [ERROR] tcmu_rbd_image_open:640 rbd/foo.disk_1: Could not connect to cluster. (Err -110)

I think the issue is that qa/workunits/cephadm/test_iscsi_setup.sh doesn't clean up after itself. It adds two disks to the target, but never removes them.

@Adam King Moving this to cephadm.

Actions #8

Updated by Laura Flores over 1 year ago

  • Assignee set to Adam King

Hey Adam, mind having a look?

Actions #9

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-11-13_00:17:56-rados-wip-yuri6-testing-2024-11-12-1317-distro-default-smithi/7992301

Actions #10

Updated by Laura Flores over 1 year ago

  • Project changed from mgr to rbd
  • Category deleted (orchestrator)
  • Assignee deleted (Adam King)

Think this should actually be under rbd.

Actions #11

Updated by Ilya Dryomov over 1 year ago

Laura Flores wrote in #note-10:

Think this should actually be under rbd.

Hi Laura,

I moved it to mgr/orchestrator and tagged Adam in https://tracker.ceph.com/issues/68586#note-7 on purpose. qa/workunits/cephadm/test_iscsi_setup.sh doesn't really test iSCSI, but rather how the iSCSI container is set up (whether the settings that cephadm applies to it are as expected and whether the container engine respects them). The issue appears to be that qa/workunits/cephadm/test_iscsi_setup.sh is missing a cleanup step. Even though the fix should be trivial, the RBD component doesn't own this script: it's not part of the rbd suite, only the cephadm suite and, by extension, the rados suite.

Actions #12

Updated by Laura Flores over 1 year ago

  • Project changed from rbd to Orchestrator
  • Priority changed from Normal to High
Actions #13

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-11-20_16:10:40-rados-wip-yuri2-testing-2024-11-15-0902-distro-default-smithi/8001822

Actions #14

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-12-03_16:16:51-rados-wip-yuri6-testing-2024-12-02-1528-distro-default-smithi/8019067

Actions #15

Updated by Shraddha Agrawal over 1 year ago

/a/skanta-2024-12-12_03:27:32-rados-wip-bharath11-testing-2024-12-11-1511-distro-default-smithi/8031435

Actions #16

Updated by Laura Flores about 1 year ago

/a/yuriw-2024-12-18_15:56:21-rados-wip-yuri6-testing-2024-12-17-1653-distro-default-smithi/8043420

Actions #17

Updated by Naveen Naidu about 1 year ago

/a/skanta-2024-12-11_23:59:30-rados-wip-bharath9-testing-2024-12-10-1652-distro-default-smithi/8031139

Actions #18

Updated by Laura Flores about 1 year ago

/a/yuriw-2024-11-18_15:14:17-rados-wip-yuri3-testing-2024-11-14-0857-distro-default-smithi/7998075

Actions #19

Updated by Shraddha Agrawal about 1 year ago

/a/skanta-2024-10-24_23:59:35-rados-wip-bharath3-testing-2024-10-23-1509-distro-default-smithi/7965775

Actions #20

Updated by Shraddha Agrawal about 1 year ago

/a/skanta-2024-12-26_01:49:37-rados-wip-bharath12-testing-2024-12-24-0842-distro-default-smithi/8053521

Actions #21

Updated by Laura Flores about 1 year ago

/a/skanta-2024-12-22_00:38:58-rados-wip-bharath8-testing-2024-12-21-1707-distro-default-smithi/8047109

Actions #22

Updated by Laura Flores about 1 year ago

  • Assignee set to Adam King

Hey @Adam King can you take a look?

Actions #23

Updated by Aishwarya Mathuria about 1 year ago

/a/skanta-2025-01-26_07:44:24-rados-wip-bharath9-testing-2025-01-25-0527-distro-default-smithi/8094203

Actions #24

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-01-31_15:46:33-rados-wip-yuri5-testing-2025-01-30-1311-distro-default-smithi/8107032

Actions #25

Updated by Laura Flores about 1 year ago

/a/lflores-2025-01-31_21:01:32-rados-wip-bharath8-testing-2025-01-31-1851-distro-default-smithi/8107615

Actions #26

Updated by Laura Flores about 1 year ago

/a/lflores-2025-02-05_22:51:38-rados-wip-yuri5-testing-2025-01-30-1311-distro-default-smithi/8117742

Actions #27

Updated by Shraddha Agrawal about 1 year ago

/a/skanta-2025-02-05_10:08:28-rados-wip-bharath3-testing-2025-02-03-2127-distro-default-smithi/8115725

Actions #28

Updated by Adam King about 1 year ago

This appears to be a case of `ceph-volume inventory` hanging, after which `cephadm rm-cluster` gets stuck waiting on the cephadm lock that the ceph-volume process is holding. On a node hitting this I saw:

root      101811  0.1  0.1  40360 33620 ?        D    20:44   0:00 /usr/bin/python3 -s /usr/sbin/ceph-volume inventory --format=json

showing a ceph-volume process stuck in the D (uninterruptible sleep) state. dmesg reported:

[ 1838.865126] iSCSI Login negotiation failed.
[ 1838.865136]  connection1:0: detected conn error (1020)
[ 1838.865194]  connection1:0: detected conn error (1020)
[ 1840.865626] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1840.865650] iSCSI Login negotiation failed.
[ 1840.865658]  connection1:0: detected conn error (1020)
[ 1840.865682]  connection1:0: detected conn error (1020)
[ 1842.866178] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1842.866202] iSCSI Login negotiation failed.
[ 1842.866212]  connection1:0: detected conn error (1020)
[ 1842.866229]  connection1:0: detected conn error (1020)
[ 1844.866628] Unable to locate Target Portal Group on iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw
[ 1844.866656] iSCSI Login negotiation failed.
[ 1844.866666]  connection1:0: detected conn error (1020)
[ 1844.866691]  connection1:0: detected conn error (1020)
[ 1845.458700] INFO: task ceph-volume:101811 blocked for more than 122 seconds.
[ 1845.458711]       Not tainted 5.14.0-559.el9.x86_64 #68563 

I'm unsure whether the iSCSI errors matter at that point, but dmesg clearly reports ceph-volume being blocked. It's not yet clear why ceph-volume gets stuck like that.
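
For illustration of the lock mechanics: cephadm serializes per-host operations with an exclusive flock-style file lock (the exact lock path varies by version), so a child process that hangs while holding it blocks `rm-cluster` indefinitely. A minimal standalone sketch of that failure mode, using a stand-in lock path:

import fcntl
import os
import sys
import time

LOCK_PATH = '/tmp/demo-cephadm.lock'  # stand-in for cephadm's per-fsid lock file

def hold():
    # Simulates the hung ceph-volume child: grab the lock, then block forever
    # (the real process is stuck in D state inside the kernel).
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX)
    print('holder: lock acquired, now hanging')
    time.sleep(1_000_000)

def rm_cluster():
    # Simulates `cephadm rm-cluster`: this flock() call blocks until the
    # holder exits, which in the failure above never happens.
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_RDWR, 0o644)
    print('rm-cluster: waiting for lock...')
    fcntl.flock(fd, fcntl.LOCK_EX)
    print('rm-cluster: acquired lock')  # never reached while the holder hangs

if __name__ == '__main__':
    hold() if sys.argv[1:] == ['hold'] else rm_cluster()

Running `python3 lock_demo.py hold` in one shell and `python3 lock_demo.py` in another reproduces the indefinite wait; `rm-cluster` is in the same position while the D-state ceph-volume holds the real lock.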

Actions #29

Updated by Jaya Prakash about 1 year ago

/a/akupczyk-2025-01-27_13:10:42-rados-aclamk-testing-ganymede-2025-01-24-0925-distro-default-smithi/8096466

Actions #30

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-02-05_21:36:43-rados-wip-yuri8-testing-2025-02-04-1046-distro-default-smithi/8117273

Actions #31

Updated by Aishwarya Mathuria about 1 year ago

/a/lflores-2025-02-07_20:42:43-rados-wip-yuri2-testing-2025-01-31-2325-distro-default-smithi/8120815

Actions #32

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-02-12_15:37:38-rados-wip-yuri8-testing-2025-02-10-2350-distro-default-smithi/8127067

Actions #33

Updated by Laura Flores about 1 year ago

@Adam King is this in @guits's realm?

Actions #34

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-02-21_18:09:35-rados-wip-pdonnell-testing-20250218.200348-debug-distro-default-smithi/8147350

Actions #35

Updated by Laura Flores about 1 year ago

/a/skanta-2025-03-01_01:42:21-rados-wip-bharath3-testing-2025-03-01-0356-distro-default-smithi/8162128

Actions #36

Updated by Laura Flores about 1 year ago

/a/skanta-2025-03-02_12:28:27-rados-wip-bharath8-testing-2025-03-02-0552-distro-default-smithi/8164443

Actions #37

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-02-27_21:53:52-rados-wip-yuri3-testing-2025-02-27-0658-distro-default-smithi/8159381

Actions #38

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-03-14_20:32:49-rados-wip-yuri7-testing-2025-03-11-0847-distro-default-smithi/8190487

Actions #39

Updated by Laura Flores about 1 year ago

/a/yuriw-2025-03-14_20:21:57-rados-wip-yuri13-testing-2025-03-14-0922-distro-default-smithi/8190139

Actions #40

Updated by Laura Flores 12 months ago

/a/yuriw-2025-03-20_14:48:30-rados-wip-yuri3-testing-2025-03-18-0732-distro-default-smithi/8199784

Actions #41

Updated by Laura Flores 12 months ago

/a/yuriw-2025-03-21_20:26:29-rados-wip-yuri7-testing-2025-03-21-0821-distro-default-smithi/8202022

Actions #42

Updated by Laura Flores 12 months ago

/a/yuriw-2025-03-27_15:03:25-rados-wip-yuri7-testing-2025-03-26-1605-distro-default-smithi/8213377

Actions #43

Updated by Jaya Prakash 12 months ago

/a/yuriw-2025-03-22_14:06:08-rados-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi/8202571

Actions #44

Updated by Jaya Prakash 12 months ago

/a/yuriw-2025-03-22_14:06:07-orch-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi
3 jobs: ['8202588', '8202550', '8202576']

Actions #45

Updated by Jaya Prakash 12 months ago

/a/yuriw-2025-03-21_20:27:17-orch-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi
3 jobs: ['8201516', '8202023', '8201778']

Actions #46

Updated by Jaya Prakash 12 months ago

/a/yuriw-2025-03-21_20:27:12-rados-wip-yuri2-testing-2025-03-21-0820-distro-default-smithi/8202111

Actions #47

Updated by Laura Flores 12 months ago

/a/skanta-2025-04-04_06:10:17-rados-wip-bharath10-testing-2025-04-03-2112-distro-default-smithi/8223663

Actions #48

Updated by Laura Flores 11 months ago

/a/skanta-2025-04-05_15:49:33-rados-wip-bharath8-testing-2025-04-05-1439-distro-default-smithi/8225406

Actions #49

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

/a/skanta-2025-04-09_05:31:19-rados-wip-bharath17-testing-2025-04-08-0602-distro-default-smithi/8233198/

Actions #50

Updated by Laura Flores 11 months ago

/a/lflores-2025-04-11_19:10:45-rados-wip-lflores-testing-3-2025-04-11-1140-distro-default-smithi/8236117

Actions #51

Updated by Kamoltat (Junior) Sirivadhna 11 months ago

suite watch: bump, ping @Adam King

Actions #52

Updated by Sridhar Seshasayee 11 months ago

/a/skanta-2025-04-22_23:21:15-rados-wip-bharath1-testing-2025-04-21-0529-distro-default-smithi/8254494

Actions #53

Updated by Aishwarya Mathuria 11 months ago

/a/yuriw-2025-04-14_18:07:07-rados-wip-yuri10-testing-2025-04-08-0710-distro-default-smithi/
['8239963', '8239944']

Actions #54

Updated by Aishwarya Mathuria 9 months ago

/a/skanta-2025-06-07_23:26:52-rados-wip-bharath5-testing-2025-06-02-2047-distro-default-smithi/8313574

Actions #55

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

/a/skanta-2025-06-07_04:15:58-rados-wip-bharath8-testing-2025-06-02-1508-distro-default-smithi/8312622

Actions #56

Updated by Kamoltat (Junior) Sirivadhna 9 months ago

suite watch: bump, ping @Adam King

Actions #57

Updated by Sridhar Seshasayee 9 months ago

/a/skanta-2025-06-16_03:59:33-rados-wip-bharath10-testing-2025-06-15-0841-distro-default-smithi/8330490

Actions #58

Updated by Sridhar Seshasayee 9 months ago

/a/yuriw-2025-07-01_21:01:48-rados-wip-yuri11-testing-2025-07-01-1146-tentacle-distro-default-smithi/8365657

Actions #59

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-07-04_23:32:34-rados-wip-bharath13-testing-2025-07-04-0559-distro-default-smithi/8370619

Actions #60

Updated by Shraddha Agrawal 9 months ago

/a/skanta-2025-07-03_10:29:59-rados-wip-bharath5-testing-2025-06-30-2106-distro-default-smithi/8368528

Actions #61

Updated by Laura Flores 8 months ago

  • Has duplicate Bug #69803: cephadm hangs trying to contact mgr that is down added
Actions #62

Updated by Shraddha Agrawal 8 months ago

/a/yuriw-2025-07-10_01:00:46-rados-wip-yuri-testing-2025-07-09-1458-tentacle-distro-default-smithi/8379410

Actions #63

Updated by Laura Flores 8 months ago

/a/yuriw-2025-07-10_23:00:33-rados-wip-yuri5-testing-2025-07-10-0913-distro-default-smithi/8381004

Actions #64

Updated by Aishwarya Mathuria 8 months ago

/a/skanta-2025-06-29_15:00:39-rados-wip-bharath1-testing-2025-06-28-2149-distro-default-smithi/8356814

Actions #65

Updated by Shraddha Agrawal 8 months ago

/a/skanta-2025-07-13_23:08:24-rados-wip-bharath4-testing-2025-07-13-0539-distro-default-smithi/8384540

Actions #66

Updated by Connor Fawcett 8 months ago

/a/skanta-2025-07-19_23:59:58-rados-wip-bharath5-testing-2025-07-18-0518-distro-default-smithi/8397510

Actions #67

Updated by Aishwarya Mathuria 8 months ago

suite watch: bump, ping @Adam King

Actions #68

Updated by Laura Flores 8 months ago

/a/skanta-2025-07-26_06:22:18-rados-wip-bharath9-testing-2025-07-26-0628-distro-default-smithi/8407538

Actions #69

Updated by Shraddha Agrawal 8 months ago

/a/skanta-2025-07-26_22:27:26-rados-wip-bharath7-testing-2025-07-26-0611-tentacle-distro-default-smithi/8409774

Actions #70

Updated by Laura Flores 8 months ago

/a/yuriw-2025-07-28_23:36:09-rados-tentacle-release-distro-default-smithi/8413652

Actions #71

Updated by Connor Fawcett 7 months ago

/a/skanta-2025-08-14_03:18:47-rados-wip-bharath4-testing-2025-08-13-0949-tentacle-distro-default-smithi/8442204

Actions #72

Updated by Aishwarya Mathuria 7 months ago

/a/skanta-2025-08-14_20:27:05-rados-wip-bharath5-testing-2025-08-13-0959-distro-default-smithi/8443390

Actions #73

Updated by Connor Fawcett 7 months ago

/a/skanta-2025-08-24_15:53:17-rados-wip-bharath4-testing-2025-08-24-0454-distro-default-smithi/8460753

Actions #74

Updated by Sridhar Seshasayee 7 months ago

/a/skanta-2025-08-24_23:24:05-rados-wip-bharath9-testing-2025-08-24-1258-tentacle-distro-default-smithi/8461798

Actions #75

Updated by Aishwarya Mathuria 7 months ago

/a/skanta-2025-08-21_23:24:45-rados-wip-bharath7-testing-2025-08-19-0959-distro-default-smithi/8457147

Actions #76

Updated by Connor Fawcett 7 months ago

/a/skanta-2025-08-31_23:44:30-rados-wip-bharath4-testing-2025-08-31-1138-distro-default-smithi/8474711

Actions #77

Updated by Jonathan Bailey 7 months ago

/a/skanta-2025-08-05_23:48:19-rados-wip-bharath1-testing-2025-08-05-0512-distro-default-smithi/8427217
/a/skanta-2025-08-05_10:12:24-rados-wip-bharath1-testing-2025-08-05-0512-distro-default-smithi/8424820
/a/skanta-2025-08-28_03:20:37-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8467901
/a/skanta-2025-08-27_01:46:19-rados-wip-bharath1-testing-2025-08-26-1433-distro-default-smithi/8466377

Actions #78

Updated by Connor Fawcett 6 months ago

/a/yuriw-2025-09-06_15:55:33-rados-wip-yuri3-testing-2025-09-04-1437-tentacle-distro-default-smithi/8484429

Actions #79

Updated by Laura Flores 6 months ago

/a/yuriw-2025-09-15_20:16:05-rados-wip-yuri-testing-2025-09-15-1029-tentacle-distro-default-smithi/8501803

Actions #80

Updated by Laura Flores 6 months ago

/a/skanta-2025-09-17_23:08:58-rados-wip-bharath8-testing-2025-09-17-1539-tentacle-distro-default-smithi/8507445

Actions #81

Updated by Aishwarya Mathuria 6 months ago

/a/yuriw-2025-09-24_18:46:32-rados-wip-yuri8-testing-2025-09-24-0752-tentacle-distro-default-smithi/8518480

Actions #82

Updated by Laura Flores 6 months ago

/a/yuriw-2025-09-18_21:29:32-rados-tentacle-release-distro-default-smithi/8510253

Actions #83

Updated by Aishwarya Mathuria 5 months ago

/a/skanta-2025-10-07_22:45:50-rados-wip-bharath1-testing-2025-10-06-2038-distro-default-smithi/8540424

Actions #84

Updated by Nitzan Mordechai 5 months ago

/a/skanta-2025-10-09_23:11:22-rados-wip-bharath3-testing-2025-10-09-0519-distro-default-smithi/8543846

Actions #85

Updated by Laura Flores 5 months ago

/a/yuriw-2025-10-15_20:55:26-rados-tentacle-release-distro-default-smithi/8554142

Actions #86

Updated by Laura Flores 5 months ago

/a/skanta-2025-10-09_23:38:36-rados-wip-bharath7-testing-2025-10-09-2128-distro-default-smithi/8544002

Actions #87

Updated by Sridhar Seshasayee 5 months ago

/a/skanta-2025-10-24_12:45:03-rados-wip-bharath9-testing-2025-10-14-1426-tentacle-distro-default-smithi/8567564

Actions #88

Updated by Aishwarya Mathuria 5 months ago

/a/skanta-2025-11-01_01:03:27-rados-wip-bharath1-testing-2025-10-31-0445-distro-default-smithi/8578575

Actions #89

Updated by Lee Sanders 4 months ago

/a/skanta-2025-10-31_12:25:51-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8577900

How do we get some focus on this issue, please? @Laura Flores @Adam King @Kamoltat (Junior) Sirivadhna

Actions #90

Updated by Laura Flores 4 months ago · Edited

Lee Sanders wrote in #note-89:

/a/skanta-2025-10-31_12:25:51-rados-wip-bharath5-testing-2025-10-31-1454-distro-default-smithi/8577900

How do we get some focus on this issue, please? @Laura Flores @Adam King @Kamoltat (Junior) Sirivadhna

Attempting some analysis:

After checking the job descriptions on all the "task/test_iscsi_container" tests (on this ticket and from the nightlies), I saw that the tests only fail when "agent=on", such as in this description: rados/cephadm/workunits/{0-distro/centos_9.stream_runc agent/on mon_election/connectivity task/test_iscsi_container/{centos_9.stream test_iscsi_container}}

Here is a passing test with "agent/off": https://pulpito.ceph.com/teuthology-2025-11-09_20:00:24-rados-main-distro-default-smithi/8590621/
Here is a failing test with "agent/on": https://pulpito.ceph.com/teuthology-2025-11-02_20:00:23-rados-main-distro-default-smithi/8580054

Looking at this failed test:
/a/teuthology-2025-11-02_20:00:23-rados-main-distro-default-smithi/8580054

In cephadm.log, the agent successfully sends metadata to the mgr's agent endpoint throughout the test:

2025-11-02 21:41:44,475 7fd1c2c97740 DEBUG sending query to https://172.21.15.88:7150/data
2025-11-02 21:41:44,498 7fd1c2c97740 INFO Received mgr response: "Successfully processed metadata." 0.023491 seconds after sending request.
2025-11-02 21:41:51,478 7fd1bbfff640 DEBUG Using specified config: /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:51,479 7fd1bbfff640 DEBUG Using specified fsid: a690d7e0-b833-11f0-8778-adfe0268badd

However, once we disable the cephadm module, connections start being refused:

2025-11-02 21:41:57,902 7f8ff7e36740 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907', 'shell', '-c', '/etc/ceph/ceph.conf', '-k', '/etc/ceph/ceph.client.admin.keyring', '--fsid', 'a690d7e0-b833-11f0-8778-adfe0268badd', '--', 'ceph', 'mgr', 'module', 'disable', 'cephadm']
2025-11-02 21:41:57,932 7f8ff7e36740 INFO Inferring config /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Using specified fsid: a690d7e0-b833-11f0-8778-adfe0268badd
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Using specified config: /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config
2025-11-02 21:41:57,933 7f8ff7e36740 DEBUG Running command (timeout=None): /bin/podman run --rm --ipc=host --net=host --privileged --group-add=disk --init -i -e CONTAINER_IMAGE=quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 -e NODE_NAME=smithi088 -v /var/run/ceph/a690d7e0-b833-11f0-8778-adfe0268badd:/var/run/ceph:z -v /var/log/ceph/a690d7e0-b833-11f0-8778-adfe0268badd:/var/log/ceph:z -v /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/crash:/var/lib/ceph/crash:z -v /run/systemd/journal:/run/systemd/journal -v /etc/hosts:/etc/hosts:ro -v /var/lib/ceph/a690d7e0-b833-11f0-8778-adfe0268badd/mon.a/config:/etc/ceph/ceph.conf:z -v /etc/ceph/ceph.client.admin.keyring:/etc/ceph/ceph.keyring:z --entrypoint ceph quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 mgr module disable cephadm
2025-11-02 21:42:02,069 7fd1c0949640 DEBUG /usr/bin/podman: stdout a7722c22d3e5a38ba24675f606df7a33ca739cfa20e7e9da8f4bcb06c3d55526,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-0
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout c038d5074c4d3f40c945a6c7078c94aa9bfbd51e68607d62b534d3ec4beff8ad,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-1
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout b39ce01316d4ba63487504184e5f88f393d8cea42a8b8558e8dc125d9d22e496,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-osd-2
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout f998d880cca6dc268e7de6b88dbb9b18902dfacf236df2b1f6db675b474d5501,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-iscsi-foo-smithi088-odbkoz
2025-11-02 21:42:02,070 7fd1c0949640 DEBUG /usr/bin/podman: stdout dd8453c0bfd1ec2e04807f1d9d5f11d66907b4dce1b2495eae23d47e2eb6681f,ceph-a690d7e0-b833-11f0-8778-adfe0268badd-iscsi-foo-smithi088-odbkoz-tcmu
2025-11-02 21:42:02,174 7fd1c0949640 INFO Change detected in state of daemons. Running full daemon ls
2025-11-02 21:42:03,913 7fd1c2c97740 DEBUG sending query to https://172.21.15.88:7150/data
2025-11-02 21:42:03,915 7fd1c2c97740 DEBUG [Errno 111] Connection refused
2025-11-02 21:42:03,915 7fd1c2c97740 ERROR HTTP error -1 while querying agent endpoint: [Errno 111] Connection refused
2025-11-02 21:42:03,915 7fd1c2c97740 ERROR Failed to send metadata to mgr: non-200 response <-1> from agent endpoint: [Errno 111] Connection refused

So, why are we disabling the cephadm module at that point?

In teuthology.log, it looks like we intentionally disable the cephadm module after all the workunits have finished, as part of the normal teardown process:

2025-11-02T21:41:57.140 INFO:tasks.workunit:Stopping ['cephadm/test_iscsi_pids_limit.sh', 'cephadm/test_iscsi_etc_hosts.sh', 'cephadm/test_iscsi_setup.sh'] on client.0...
2025-11-02T21:41:57.140 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2025-11-02T21:41:57.206 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:56 smithi088 ceph-mon[34057]: pgmap v228: 33 pgs: 33 active+clean; 580 KiB data, 83 MiB used, 268 GiB / 268 GiB avail; 70 KiB/s rd, 3.3 KiB/s wr, 75 op/s
2025-11-02T21:41:57.495 DEBUG:teuthology.parallel:result is None
2025-11-02T21:41:57.496 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.561 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.561 DEBUG:teuthology.orchestra.run.smithi088:> rmdir -- /home/ubuntu/cephtest/mnt.0
2025-11-02T21:41:57.628 INFO:tasks.workunit:Deleted artificial mount point /home/ubuntu/cephtest/mnt.0/client.0
2025-11-02T21:41:57.628 DEBUG:teuthology.run_tasks:Unwinding manager cephadm
2025-11-02T21:41:57.641 INFO:tasks.cephadm:Teardown begin
2025-11-02T21:41:57.642 DEBUG:teuthology.orchestra.run.smithi088:> sudo rm -f /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring
2025-11-02T21:41:57.694 INFO:tasks.cephadm:Disabling cephadm mgr module
2025-11-02T21:41:57.694 DEBUG:teuthology.orchestra.run.smithi088:> sudo /home/ubuntu/cephtest/cephadm --image quay-quay-quay.apps.os.sepia.ceph.com/ceph-ci/ceph:986f61cbfa1c86c4982b34421a7ecc2c21a03907 shell -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring --fsid a690d7e0-b833-11f0-8778-adfe0268badd -- ceph mgr module disable cephadm

The attempts to send metadata to the agent endpoint continue for quite a while in the cephadm log during this teardown step. We need to understand why we keep sending metadata during teardown.

Right after disabling the cephadm module, we begin stopping all the daemons:

2025-11-02T21:41:58.007 INFO:tasks.cephadm:Stopping all daemons...
2025-11-02T21:41:58.007 INFO:tasks.cephadm.mon.a:Stopping mon.a...
2025-11-02T21:41:58.008 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mon.a
2025-11-02T21:41:58.357 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 systemd[1]: Stopping Ceph mon.a for a690d7e0-b833-11f0-8778-adfe0268badd...
2025-11-02T21:41:58.664 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 ceph-a690d7e0-b833-11f0-8778-adfe0268badd-mon-a[34053]: 2025-11-02T21:41:58.355+0000 7f4d450f2640 -1 received  signal: Terminated from /run/podman-init -- /usr/bin/ceph-mon -n mon.a -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-journald=true --default-mon-cluster-log-to-stderr=false  (PID: 1) UID: 0
2025-11-02T21:41:58.664 INFO:journalctl@ceph.mon.a.smithi088.stdout:Nov 02 21:41:58 smithi088 ceph-a690d7e0-b833-11f0-8778-adfe0268badd-mon-a[34053]: 2025-11-02T21:41:58.355+0000 7f4d450f2640 -1 mon.a@0(leader) e1 *** Got Signal Terminated ***
...

Eventually, the test fails when cephadm tries to remove the cluster:

2025-11-02T21:42:26.283 DEBUG:teuthology.orchestra.run.smithi088:> sudo /home/ubuntu/cephtest/cephadm rm-cluster --fsid a690d7e0-b833-11f0-8778-adfe0268badd --force --keep-logs
2025-11-03T05:23:05.792 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2025-11-03T05:23:05.823 DEBUG:teuthology.task.console_log:Killing console logger for smithi088
2025-11-03T05:23:05.825 DEBUG:teuthology.exit:Finished running handlers

I suspect that we’re stopping other services during teardown, but not the agent service, and the endpoint connection failures are interrupting teardown:

$ cat teuthology.log | grep "sudo systemctl stop" 
...
2025-11-02T21:41:58.008 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mon.a
2025-11-02T21:42:00.277 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@mgr.a
2025-11-02T21:42:02.250 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.0
2025-11-02T21:42:10.171 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.1
2025-11-02T21:42:18.148 DEBUG:teuthology.orchestra.run.smithi088:> sudo systemctl stop ceph-a690d7e0-b833-11f0-8778-adfe0268badd@osd.2

Here's where we stop all the daemons in qa/tasks/cephadm.py:

log.info('Stopping all daemons...')

# this doesn't block until they are all stopped...
#ctx.cluster.run(args=['sudo', 'systemctl', 'stop', 'ceph.target'])

# stop the daemons we know
for role in ctx.daemons.resolve_role_list(None, CEPH_ROLE_TYPES, True):
    cluster, type_, id_ = teuthology.split_role(role)
    try:
        ctx.daemons.get_daemon(type_, id_, cluster).stop()
    except Exception:
        log.exception(f'Failed to stop "{role}"')
        raise

We're stopping all the daemons that we know via CEPH_ROLE_TYPES. This is defined at the top of the file:

CEPH_ROLE_TYPES = ['mon', 'mgr', 'osd', 'mds', 'rgw', 'prometheus']

I think the issue is that we're missing "agent" here. So, the task is unaware that it needs to stop the agent service during teardown.
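
A minimal sketch of that idea (a guess, not a confirmed fix; whether the agent registers with ctx.daemons as a daemon role like the others is an assumption):

# qa/tasks/cephadm.py (sketch): teach the teardown loop about the agent by
# adding it to the known role types it iterates over when stopping daemons.
CEPH_ROLE_TYPES = ['mon', 'mgr', 'osd', 'mds', 'rgw', 'prometheus', 'agent']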

I created a test branch here: https://github.com/ceph/ceph/compare/main...ljflores:ceph:wip-fix-iscsi-test
And I scheduled a test here: https://pulpito.ceph.com/lflores-2025-11-13_21:17:32-rados:cephadm:workunits-main-distro-default-smithi/8601828/

There might be more going on, but this is my hunch based on analyzing what I could see. At the time of writing this comment, the test has yet to finish, so I'll follow up to see the result and go from there.

Actions #91

Updated by Laura Flores 4 months ago

The test still failed in the same way (note that I stopped it early, since the test normally takes up to 8 hours to time out, which is a problem in itself). The test still got stuck on removing the cluster, and the agent service was not stopped by the fix I tried, so I may not have gotten the stopping mechanism right.

Actions #92

Updated by Laura Flores 4 months ago · Edited

I scheduled a new test here: https://pulpito.ceph.com/lflores-2025-11-13_21:59:24-rados:cephadm:workunits-main-distro-default-smithi/8601889/

When the test got stuck, I ssh'ed to the machine and manually stopped the agent with:

[lflores@smithi077 ~]$ sudo systemctl list-units | grep "ceph-6d31444a-c0de-11f0-877d-adfe0268badd" 
  ceph-6d31444a-c0de-11f0-877d-adfe0268badd-sidecar@iscsi.foo.smithi077.qfiazg:tcmu.service                        loaded active running   Ceph sidecar iscsi.foo.smithi077.qfiazg:tcmu for 6d31444a-c0de-11f0-877d-adfe0268badd
  ceph-6d31444a-c0de-11f0-877d-adfe0268badd@agent.smithi077.service                                                loaded active running   cephadm agent for cluster 6d31444a-c0de-11f0-877d-adfe0268badd
  ceph-6d31444a-c0de-11f0-877d-adfe0268badd@iscsi.foo.smithi077.qfiazg.service                                     loaded active running   Ceph iscsi.foo.smithi077.qfiazg for 6d31444a-c0de-11f0-877d-adfe0268badd
  system-ceph\x2d6d31444a\x2dc0de\x2d11f0\x2d877d\x2dadfe0268badd.slice                                            loaded active active    Slice /system/ceph-6d31444a-c0de-11f0-877d-adfe0268badd
  system-ceph\x2d6d31444a\x2dc0de\x2d11f0\x2d877d\x2dadfe0268badd\x2dsidecar.slice                                 loaded active active    Slice /system/ceph-6d31444a-c0de-11f0-877d-adfe0268badd-sidecar
  ceph-6d31444a-c0de-11f0-877d-adfe0268badd.target                                                                 loaded active active    Ceph cluster 6d31444a-c0de-11f0-877d-adfe0268badd
[lflores@smithi077 ~]$ sudo systemctl stop ceph-6d31444a-c0de-11f0-877d-adfe0268badd@agent.smithi077.service

This instantly made the test pass, so that was definitely the problem. Now, it's just a matter of making the cephadm task do it properly.
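
A sketch of what doing it properly might look like in qa/tasks/cephadm.py (hypothetical: the unit name format is copied from the listing above, and the availability of `fsid` and the per-remote loop at that point in the task are assumptions):

# Sketch: explicitly stop any cephadm agent unit on each remote before
# running rm-cluster. check_status=False tolerates hosts with no agent.
for remote in ctx.cluster.remotes.keys():
    remote.run(
        args=['sudo', 'systemctl', 'stop',
              f'ceph-{fsid}@agent.{remote.shortname}.service'],
        check_status=False,
    )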

Actions #93

Updated by Redouane Kachach Elhicou 4 months ago · Edited

Just to add that the agent is not a "normal service". To disable the agent you have to set the following config param:

ceph config set mgr mgr/cephadm/use_agent false

Once it is set to false, the agent service will be stopped and the agent daemons will be removed automatically by cephadm.
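
If the teuthology task went this route instead of systemctl, a sketch of issuing that command during teardown (assuming the `_shell` helper that qa/tasks/cephadm.py uses for running ceph commands; hypothetical placement):

# Sketch: flip the knob while the mgr and the cephadm module are still up,
# i.e. before the "Disabling cephadm mgr module" teardown step, so cephadm
# itself stops and removes the agent daemons.
_shell(ctx, cluster_name, remote, [
    'ceph', 'config', 'set', 'mgr', 'mgr/cephadm/use_agent', 'false',
])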

Actions #94

Updated by Sridhar Seshasayee 4 months ago

/a/skanta-2025-11-13_10:26:04-rados-wip-bharath3-testing-2025-11-12-2038-distro-default-smithi/8601373

Actions #95

Updated by Aishwarya Mathuria 4 months ago

/a/yuriw-2025-12-03_15:44:36-rados-wip-yuri5-testing-2025-12-02-1256-distro-default-smithi/8639549

Actions #96

Updated by Laura Flores 3 months ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Adam King to Laura Flores
  • Pull request ID set to 66613
Actions #97

Updated by Laura Flores about 2 months ago

  • Has duplicate Bug #74578: test_iscsi_container is running continuously without a defined exit criterion. added
Actions #98

Updated by Aishwarya Mathuria about 2 months ago

/a/skanta-2026-01-30_23:46:16-rados-wip-bharath7-testing-2026-01-29-2016-distro-default-trial/28569

Actions #99

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-07_00:02:26-rados-wip-bharath7-testing-2026-02-06-0906-distro-default-trial/39114

Actions #100

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_02:19:11-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/24699

Actions #101

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25713

Actions #102

Updated by Lee Sanders about 1 month ago

/a/skanta-2026-01-29_13:05:02-rados-wip-bharath5-testing-2026-01-28-2018-distro-default-trial/25728

Actions #103

Updated by Aishwarya Mathuria about 1 month ago

/a/skanta-2026-02-05_03:38:32-rados-wip-bharath2-testing-2026-02-03-0542-distro-default-trial/35652

Actions #104

Updated by Lee Sanders 26 days ago

/a/skanta-2026-02-07_14:54:11-rados-wip-bharath5-testing-2026-02-06-2052-distro-default-trial/39478

Actions #105

Updated by Sridhar Seshasayee 19 days ago

/a/sseshasa-2026-02-26_14:56:45-rados-wip-sseshasa-testing-2026-02-26-1772100687-distro-default-trial/72379

Actions #106

Updated by Aishwarya Mathuria 11 days ago

/a/skanta-2026-03-07_15:39:05-rados-wip-bharath4-testing-2026-03-05-1456-tentacle-distro-default-trial/93044

Actions #107

Updated by Nitzan Mordechai 11 days ago

/a/skanta-2026-03-08_04:44:53-rados-wip-bharath5-testing-2026-03-07-1422-distro-default-trial/94078

Actions #108

Updated by Sridhar Seshasayee 8 days ago

/a/skanta-2026-03-04_23:53:38-rados-wip-bharath1-testing-2026-03-04-1011-distro-default-trial/85629
