Bug #59380
Status: closed
rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
Description
2023-03-31T00:25:12.169 INFO:tasks.rgw.client.0.smithi157.stdout:2023-03-31T00:25:12.167+0000 7f470012f700 -1 received signal: Terminated from /usr/bin/python3 /bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground (PID: 110884) UID: 0
2023-03-31T00:25:12.169 INFO:tasks.rgw.client.0.smithi157.stdout:2023-03-31T00:25:12.167+0000 7f47041857c0 -1 shutting down
2023-03-31T00:25:14.065 INFO:tasks.ceph.mon.a.smithi157.stderr:2023-03-31T00:25:14.063+0000 7f8bfe441700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2023-03-31T00:25:15.662 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2023-03-31T00:25:17.523 INFO:tasks.ceph.mds.a.smithi157.stderr:2023-03-31T00:25:17.522+0000 7f1f542c9700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2023-03-31T00:25:18.270 INFO:tasks.rgw.client.0:Stopped
2023-03-31T00:25:18.271 DEBUG:teuthology.orchestra.run.smithi157:> rm -f /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock
2023-03-31T00:25:18.298 DEBUG:teuthology.orchestra.run.smithi157:> sudo rm -f /etc/ceph/vault-root-token
2023-03-31T00:25:18.372 DEBUG:teuthology.orchestra.run.smithi157:> radosgw-admin gc process --include-all
2023-03-31T00:25:21.168 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
2023-03-31T12:11:11.597 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2023-03-31T12:11:11.634 DEBUG:teuthology.task.console_log:Killing console logger for smithi157
2023-03-31T12:11:11.635 DEBUG:teuthology.exit:Finished running handlers
/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227455
/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231345
Updated by Laura Flores almost 3 years ago
Also in the logs:
/a/yuriw-2023-04-04_15:22:56-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7231083
2023-04-04T18:50:13.147 INFO:tasks.rgw.client.0.smithi191.stdout:2023-04-04T18:50:13.142+0000 7ff108317700 -1 received signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground (PID: 29378) UID: 0
2023-04-04T18:50:13.147 INFO:tasks.rgw.client.0.smithi191.stdout:2023-04-04T18:50:13.142+0000 7ff10bb62f80 -1 shutting down
2023-04-04T18:50:13.997 INFO:tasks.ceph.mon.a.smithi191.stderr:2023-04-04T18:50:13.994+0000 7fe05488a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-04-04T18:50:17.024 INFO:tasks.ceph.mon.a.smithi191.stderr:2023-04-04T18:50:17.022+0000 7fe05488a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2023-04-04T18:50:18.211 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2023-04-04T18:50:19.020 INFO:tasks.ceph.mds.a.smithi191.stderr:2023-04-04T18:50:19.018+0000 7fecb7b6d700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2023-04-04T18:50:19.249 INFO:tasks.rgw.client.0:Stopped
2023-04-04T18:50:19.249 DEBUG:teuthology.orchestra.run.smithi191:> rm -f /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock
2023-04-04T18:50:19.258 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -f /etc/ceph/vault-root-token
2023-04-04T18:50:19.274 DEBUG:teuthology.orchestra.run.smithi191:> radosgw-admin gc process --include-all
2023-04-04T18:50:23.714 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
2023-04-05T06:39:49.154 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2023-04-05T06:39:49.186 DEBUG:teuthology.task.console_log:Killing console logger for smithi191
2023-04-05T06:39:49.187 DEBUG:teuthology.exit:Finished running handlers
Updated by Laura Flores almost 3 years ago
- Subject changed from rados/singleton-nomsgr: Health check failed: 1 full osd(s) (OSD_FULL) to rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253420
Updated by Nitzan Mordechai almost 3 years ago
- Project changed from RADOS to rgw
It doesn't look like an OSD_FULL issue; we never got to the point of scanning for log-ignore messages, because the job got stuck for 12 hours before teuthology killed it.
I think it's the rgw client we are waiting for, which had some core dump that we didn't get:
-306> 2023-04-04T22:32:57.496+0000 7f8b3005f7c0 5 rgw main: tl::expected<std::pair<boost::container::flat_map<long unsigned int, logback_generation>, obj_version>, boost::system::error_code> logback_generations: :read(const DoutPrefixProvider*, optional_yield):410: oid=data_loggenerations_metadata not found
-305> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
-304> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
-303> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
-302> 2023-04-04T22:32:57.526+0000 7f8b11fd4700 2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
-301> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-300> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
-299> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-298> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
-297> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-296> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
-295> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-294> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
-293> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-292> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
-291> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
-290> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 2 auth: KeyRing::load: loaded key file /etc/ceph/ceph.client.0.keyring
-289> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 asok(0x55f851b3e000) register_command cache list hook 0x55f851854050
-288> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 asok(0x55f851b3e000) register_command cache inspect hook 0x55f851854050
-287> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 asok(0x55f851b3e000) register_command cache erase hook 0x55f851854050
-286> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 5 asok(0x55f851b3e000) register_command cache zap hook 0x55f851854050
-285> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: get_monmap_and_config
-284> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: build_initial_monmap
-283> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 1 build_initial for_mkfs: 0
-282> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: monmap:
Could someone from RGW please take a look at this?
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250243
Updated by Casey Bodley almost 3 years ago
I don't see evidence of an rgw crash, but if an OSD is full, couldn't that cause radosgw to hang waiting on requests to it?
Updated by Nitzan Mordechai almost 3 years ago
Casey Bodley, can you please check: /a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250243/remote/smithi138/log/rgw.ceph.client.0.log.gz
for example
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276080
/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276235
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271099
/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271254
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-05-19_19:19:25-rados-wip-yuri11-testing-2023-05-19-0836-distro-default-smithi/7279399
/a/yuriw-2023-05-19_19:19:25-rados-wip-yuri11-testing-2023-05-19-0836-distro-default-smithi/7279244
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-06-01_19:33:38-rados-wip-yuri-testing-2023-06-01-0746-distro-default-smithi/7293994
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-06-06_14:45:18-rados-wip-yuri7-testing-2023-06-05-1505-distro-default-smithi/7296832
Updated by Laura Flores almost 3 years ago
/a/yuriw-2023-06-13_18:33:48-rados-wip-yuri10-testing-2023-06-02-1406-distro-default-smithi/7302949
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-07-03_15:32:36-rados-wip-yuri5-testing-2023-06-28-1515-distro-default-smithi/7325426
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-07-03_15:30:40-rados-wip-yuri7-testing-2023-06-23-1022-distro-default-smithi/7324875
Updated by Casey Bodley over 2 years ago
i do see rgw crashes in these logs:
-12> 2023-05-17T02:30:05.789+0000 7f8007d53700 1 do_command 'config diff' '{prefix=config diff}'
-11> 2023-05-17T02:30:05.789+0000 7f8007d53700 1 do_command 'config diff' '{prefix=config diff}' result is 0 bytes
-10> 2023-05-17T02:30:05.789+0000 7f8007d53700 1 do_command 'config help' '{prefix=config help}'
-9> 2023-05-17T02:30:05.797+0000 7f8007d53700 1 do_command 'config help' '{prefix=config help}' result is 0 bytes
-8> 2023-05-17T02:30:05.801+0000 7f8007d53700 1 do_command 'config show' '{prefix=config show}'
-7> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'config show' '{prefix=config show}' result is 0 bytes
-6> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'counter dump' '{prefix=counter dump}'
-5> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'counter dump' '{prefix=counter dump}' result is 0 bytes
-4> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'counter schema' '{prefix=counter schema}'
-3> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'counter schema' '{prefix=counter schema}' result is 0 bytes
-2> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'injectargs' '{prefix=injectargs}'
-1> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'injectargs' '{prefix=injectargs}' result is 0 bytes
0> 2023-05-17T02:30:05.805+0000 7f8007d53700 1 do_command 'log dump' '{prefix=log dump}'
but there's no stack trace, and ceph_test_admin_socket_output passes anyway
Updated by Kamoltat (Junior) Sirivadhna over 2 years ago
/a/yuriw-2023-07-10_18:41:02-rados-wip-yuri6-testing-2023-07-10-0816-distro-default-smithi/7332405
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-07-17_14:37:31-rados-wip-yuri-testing-2023-07-14-1641-distro-default-smithi/7341584
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-08-17_21:18:20-rados-wip-yuri11-testing-2023-08-17-0823-distro-default-smithi/7372055
Updated by Matan Breizman over 2 years ago
/a//yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376756
/a//yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376912
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-08-15_18:58:56-rados-wip-yuri3-testing-2023-08-15-0955-distro-default-smithi/7369398
Updated by Laura Flores over 2 years ago
/a/lflores-2023-09-05_22:05:24-rados-wip-yuri8-testing-2023-08-28-1340-distro-default-smithi/7389178
Updated by Laura Flores over 2 years ago
/a/lflores-2023-09-06_18:23:03-rados-wip-yuri-testing-2023-08-25-0809-distro-default-smithi/7390358
Updated by Laura Flores over 2 years ago
/a/lflores-2023-09-08_20:36:06-rados-wip-lflores-testing-2-2023-09-08-1755-distro-default-smithi/7391626
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-10-24_00:11:54-rados-wip-yuri4-testing-2023-10-23-0903-distro-default-smithi/7435687
Updated by Nitzan Mordechai over 2 years ago
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441165
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441319
Updated by Laura Flores over 2 years ago
- Backport set to reef
/a/yuriw-2023-10-31_14:43:48-rados-wip-yuri4-testing-2023-10-30-1117-distro-default-smithi/7442070
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-10-24_00:11:03-rados-wip-yuri2-testing-2023-10-23-0917-distro-default-smithi/7435871
Updated by Laura Flores over 2 years ago
/a/yuriw-2023-12-07_16:42:12-rados-wip-yuri2-testing-2023-12-06-1239-distro-default-smithi/7482272
Updated by Nitzan Mordechai over 2 years ago
/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487804
/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487806
/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487647
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505468
Updated by Laura Flores about 2 years ago
/a/yuriw-2024-01-18_21:18:17-rados-wip-yuri8-testing-2024-01-18-0823-distro-default-smithi/7521174
Updated by Matan Breizman about 2 years ago
/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553401
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2024-03-15_19:59:43-rados-wip-yuri6-testing-2024-03-15-0709-distro-default-smithi/7603379/
/a/yuriw-2024-03-15_19:59:43-rados-wip-yuri6-testing-2024-03-15-0709-distro-default-smithi/7603639/
Updated by Aishwarya Mathuria about 2 years ago
/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609873
Updated by Aishwarya Mathuria almost 2 years ago
/a/yuriw-2024-03-20_14:28:32-rados-wip-yuri2-testing-2024-03-13-0827-distro-default-smithi/7612365/
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-03-20_18:33:32-rados-wip-yuri6-testing-2024-03-18-1406-squid-distro-default-smithi/7613140
Updated by Aishwarya Mathuria almost 2 years ago · Edited
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7681095/
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680996/
Updated by Laura Flores almost 2 years ago
/a/yuriw-2024-05-02_23:59:28-rados-wip-yuriw11-testing-20240501.200505-squid-distro-default-smithi/7687043
Updated by Sridhar Seshasayee almost 2 years ago
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705412
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705439
Updated by Aishwarya Mathuria almost 2 years ago
/a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7747149
/a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7746999
Updated by Neha Ojha almost 2 years ago
Looking at the logs for /a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705412/, it looks to me like the log dump command is invoked before osd.1 becomes full (Injected full).
-1> 2024-05-14T01:39:53.618+0000 7f0769c3c640 1 do_command 'injectargs' '{prefix=injectargs}' result is 0 bytes
0> 2024-05-14T01:39:53.618+0000 7f0769c3c640 1 do_command 'log dump' '{prefix=log dump}'
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
2024-05-14T01:40:00.834+0000 7f5fee7cb640 20 osd.1 30 check_full_status cur ratio 0.000345268, physical ratio 0.000345268, new state full (Injected)
2024-05-14T01:40:00.834+0000 7f5fee7cb640 10 osd.1 30 check_full_status none -> full
...
2024-05-14T01:40:01.606+0000 7f6009812640 10 osd.1 30 send_full_update want state full
2024-05-14T01:40:01.606+0000 7f6009812640 1 -- [v2:172.21.15.22:6808/3868053108,v1:172.21.15.22:6809/3868053108] --> [v2:172.21.15.22:3300/0,v1:172.21.15.22:6789/0] -- osd_full(e30 full v30) -- 0x563536ccb380 con 0x563535752400
Updated by Brad Hubbard almost 2 years ago
So in the logs I can still see that we are running the ceph_test_admin_socket_output command, which reflects the available commands back to each daemon it is testing. That includes the 'exit' admin socket command on the mds, and I'm wondering if this is the actual problem?
$ egrep "(bash -c 'ceph_test_admin_socket_output|Exiting due to admin socket command)" /a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7747149/teuthology.log
2024-06-10T05:54:49.744 DEBUG:teuthology.orchestra.run.smithi150:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph_test_admin_socket_output --all'
2024-06-10T05:54:50.859 INFO:tasks.ceph.mds.a.smithi150.stderr:2024-06-10T05:54:50.860+0000 7f7b6bbd7640 -1 mds.a Exiting due to admin socket command
2024-06-10T05:54:53.436 INFO:tasks.ceph.mds.a.smithi150.stderr: -63> 2024-06-10T05:54:50.860+0000 7f7b6bbd7640 -1 mds.a Exiting due to admin socket command
$ egrep "(bash -c 'ceph_test_admin_socket_output|Exiting due to admin socket command)" /a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7746999/teuthology.log
2024-06-10T04:37:47.117 DEBUG:teuthology.orchestra.run.smithi133:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph_test_admin_socket_output --all'
2024-06-10T04:37:48.233 INFO:tasks.ceph.mds.a.smithi133.stderr:2024-06-10T04:37:48.244+0000 7f3d41732640 -1 mds.a Exiting due to admin socket command
2024-06-10T04:37:50.808 INFO:tasks.ceph.mds.a.smithi133.stderr: -63> 2024-06-10T04:37:48.244+0000 7f3d41732640 -1 mds.a Exiting due to admin socket command
That looks probable to me at this stage; I'm trying to reproduce it to see if I can test that theory.
By a curious quirk of fate I wrote that test so it immediately caught my attention.
Updated by Brad Hubbard almost 2 years ago
So my first speculation is a "no" I believe :)
I'm working through this but here's what I've got so far.
2024-05-14T01:57:46.093 DEBUG:teuthology.run_tasks:Unwinding manager rgw
2024-05-14T01:57:46.103 DEBUG:tasks.rgw.client.0:waiting for process to exit
2024-05-14T01:57:46.103 INFO:teuthology.orchestra.run:waiting for 300
2024-05-14T01:57:46.136 INFO:tasks.rgw.client.0.smithi089.stdout:2024-05-14T01:57:46.136+0000 7fad828e9640 -1 received signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground (PID: 26042) UID: 0
2024-05-14T01:57:46.136 INFO:tasks.rgw.client.0.smithi089.stdout:2024-05-14T01:57:46.136+0000 7fad86131a40 -1 shutting down
2024-05-14T01:57:47.752 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-05-14T01:57:50.011 INFO:tasks.ceph.mon.a.smithi089.stderr:2024-05-14T01:57:50.008+0000 7f098360b640 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2024-05-14T01:57:50.996 INFO:tasks.ceph.mds.a.smithi089.stderr:2024-05-14T01:57:50.996+0000 7f885e903640 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-05-14T01:57:52.206 INFO:tasks.rgw.client.0:Stopped
radosgw was ceph.rgw.client.0 and died on receiving a signal, more on this soon.
The watchdog has marked ceph.rgw.client.0 as failed.
2024-05-14T01:57:52.324 INFO:tasks.util.rgw:rgwadmin: client.0 : ['gc', 'process', '--include-all']
2024-05-14T01:57:52.325 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'ceph', 'gc', 'process', '--include-all']
2024-05-14T01:57:52.325 DEBUG:teuthology.orchestra.run.smithi089:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster ceph gc process --include-all
2024-05-14T01:57:53.256 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
radosgw-admin starts running and I believe it hangs because there is no radosgw daemon running (confirmed in my reproducer). However, it becomes ceph.rgw.client.0 as far as the daemon watchdog is concerned, so the watchdog thinks the rgw daemon is back, does not consider anything to have failed, and is going to wait 12 hours for the test to "finish" before it times out [0] (see the sketch after the references below).
Now why does the radosgw daemon receive a signal? I believe it is probably coming from ceph_test_admin_socket_output sending it the fairly recently added "raise" signal [1], which kills it. I need to do some more testing to confirm that, but certainly if I leave the radosgw-admin process alone it runs indefinitely, and if I kill it the test finishes (fails) immediately.
I need to wrap this up and put a bow on it, but I think we are zeroing in on the root cause.
[0] https://github.com/ceph/ceph/blob/3394d1bbd4a653a92107b32da59890d10f286646/qa/tasks/daemonwatchdog.py#L98-L115
[1] https://github.com/ceph/ceph/commit/7cc54b2b7323096bf7394d8502ff232a78e0c355
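For illustration only, here is a much-simplified sketch of the watchdog loop referenced in [0]; the names and structure below are hypothetical and not the actual qa/tasks/daemonwatchdog.py code. The point is that any live process attributed to the role clears the failure record, so a lingering radosgw-admin process counted as ceph.rgw.client.0 keeps the watchdog from ever barking:

import time

SENTINEL_PERIOD = 5    # seconds between checks (assumed value)
DAEMON_TIMEOUT = 300   # how long a daemon may stay down before the watchdog barks (assumed value)

def watch(daemons, should_stop=lambda: False):
    # `daemons` is assumed to be an iterable of objects with a `.role` string
    # and a `.running()` -> bool method.
    failure_times = {}
    while not should_stop():
        for d in daemons:
            if d.running():
                # Any live process for the role clears the record; this is the
                # "daemon ceph.rgw.client.0 has been restored" case above.
                failure_times.pop(d.role, None)
            else:
                # First time we notice the daemon is down: "is failed for ~0s".
                failure_times.setdefault(d.role, time.time())
        if any(time.time() - t > DAEMON_TIMEOUT for t in failure_times.values()):
            raise RuntimeError('watchdog bark: a daemon stayed down too long')
        time.sleep(SENTINEL_PERIOD)

Because the hanging radosgw-admin process is mistaken for the rgw daemon, the bark never happens and the job only ends when teuthology's overall timeout kills it.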
Updated by Neha Ojha almost 2 years ago
- Priority changed from Normal to Urgent
Updated by Neha Ojha almost 2 years ago
The rados/singleton-nomsgr/{all/admin_socket_output tests have been failing for more than a year now.
https://pulpito.ceph.com/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227455/ - oldest occurrence I could find. I suspect https://github.com/ceph/ceph/pull/45693/files#diff-01a30802cc74ea92dc086059fdd77bb5bcc4c04a2eb3b74f5620385615916f0bR242 could be related.
Updated by Brad Hubbard almost 2 years ago
Casey Bodley wrote in #note-16:
i do see rgw crashes in these logs:
[...] but there's no stack trace, and ceph_test_admin_socket_output passes anyway
Right, and those entries are just the 'log dump' command if I'm not mistaken?
From this line it's clear the rgw instance is being shut down by teuthology as a normal operation after the tasks have all completed.
teuthology.run_tasks:Unwinding manager rgw
If I add the following change to qa/ this test does not fail.
bhubbard@teuthology:~/src/git.ceph.com_ceph_7b7a5d375cf4e4afab96b9d4d229b699ae8304bb$ git diff
diff --git a/qa/tasks/rgw.py b/qa/tasks/rgw.py
index 2e44c0bad5..880f72a6ff 100644
--- a/qa/tasks/rgw.py
+++ b/qa/tasks/rgw.py
@@ -266,7 +266,7 @@ def start_rgw(ctx, config, clients):
)
ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', token_path])
ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', url_file])
- rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)
+ # rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)
def assign_endpoints(ctx, config, default_cert):
role_endpoints = {}
So there are two issues here that need to be resolved AFAICT at this stage.
1: The radosgw-admin gc command runs indefinitely.
2: Per comment #45, even though the rgw daemon has been stopped normally, the daemon_watchdog thinks it is running, since it incorrectly detects the radosgw-admin gc task as the rgw daemon restarting. Because the radosgw-admin gc task runs indefinitely, the daemon_watchdog waits until twelve hours later to time out and fail the test.
I created the following patch and ran the test again in an attempt to gather additional information from the radosgw-admin process.
bhubbard@teuthology:~/src/git.ceph.com_ceph_7b7a5d375cf4e4afab96b9d4d229b699ae8304bb$ git diff
diff --git a/qa/tasks/rgw.py b/qa/tasks/rgw.py
index 2e44c0bad5..262d3eb968 100644
--- a/qa/tasks/rgw.py
+++ b/qa/tasks/rgw.py
@@ -266,7 +266,7 @@ def start_rgw(ctx, config, clients):
)
ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', token_path])
ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', url_file])
- rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)
+ rgwadmin(ctx, client, cmd=['--debug-rgw', '20', '--debug-rados', '20', 'gc', 'process', '--include-all'], check_status=True)
def assign_endpoints(ctx, config, default_cert):
role_endpoints = {}
It produced the following logs. I also looked at gstack output, and every thread appears to be waiting, so it seems to be waiting for something that never arrives.
2024-06-20 03:04:54,634.634 DEBUG:teuthology.orchestra.run.smithi057:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster ceph --debug-rgw 20 --debug-rados 20 gc process --include-all
2024-06-20 03:04:54,747.747 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800 1 librados: starting msgr at
2024-06-20 03:04:54,747.747 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800 1 librados: starting objecter
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800 1 librados: setting wanted keys
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800 1 librados: calling monclient init
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.749+0000 7f9a1b525800 1 librados: init done
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.749+0000 7f9a1b525800 10 librados: wait_for_osdmap waiting
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.750+0000 7f9a1b525800 10 librados: wait_for_osdmap done waiting
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.750+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.751+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.751+0000 7f9a1b525800 10 librados: read oid=zone_names.default nspace=
2024-06-20 03:04:54,750.750 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,750.750 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: call oid=zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887 nspace=
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: Objecter returned from call r=0
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: read oid=zonegroups_names.default nspace=
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: call oid=zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5 nspace=
2024-06-20 03:04:54,752.752 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: Objecter returned from call r=0
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.757+0000 7f9a1b525800 1 librados: starting msgr at
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.757+0000 7f9a1b525800 1 librados: starting objecter
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.758+0000 7f9a1b525800 1 librados: setting wanted keys
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.758+0000 7f9a1b525800 1 librados: calling monclient init
2024-06-20 03:04:54,757.757 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.759+0000 7f9a1b525800 1 librados: init done
2024-06-20 03:04:54,759.759 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,759.759 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 realm
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 4 RGWPeriod::init failed to init realm id : (2) No such file or directory
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: read oid=zone_names.default nspace=
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=46
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: read oid=zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887 nspace=
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=993
2024-06-20 03:04:54,763.763 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 searching for the correct realm
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got default.zonegroup.
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got default.zone.
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zone_names.default
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zonegroups_names.default
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: read oid=zonegroups_names.default nspace=
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=46
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: read oid=zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5 nspace=
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=436
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 zone default found
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 4 Realm: ()
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 4 ZoneGroup: default (a1d12ca3-3683-47ea-815c-25f465240cc5)
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 4 Zone: default (389a7194-1154-447c-91b8-1bf4ae4c5887)
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 cannot find current period zonegroup using local zonegroup configuration
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 zonegroup default
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: read oid=period_config.default nspace=
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: call oid=bucket.sync-source-hints. nspace=
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from call r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: call oid=bucket.sync-target-hints. nspace=
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 10 librados: Objecter returned from call r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 started sync module instance, tier type =
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 started zone id=389a7194-1154-447c-91b8-1bf4ae4c5887 (name=default) with tier type =
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 10 librados: create oid=notify.0 nspace=
2024-06-20 03:04:56,380.380 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
So it looks like there are some unexpected ENOENTs there? Casey, are you able to provide any feedback on this at this time?
Updated by Casey Bodley almost 2 years ago
when radosgw-admin hangs, are all of the osds still up and healthy?
Updated by Brad Hubbard over 1 year ago
Hmm... I see your point... :P
Let me work out a way to clear that.
Updated by Casey Bodley over 1 year ago
Brad Hubbard wrote in #note-50:
Hmm... I see your point... :P
Let me work out a way to clear that.
if that proves difficult, the rgw task could make that command conditional on a gc-on-shutdown flag so this job can disable it
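For illustration, a minimal sketch of what that might look like in qa/tasks/rgw.py, assuming a per-client 'gc-on-shutdown' config key (the key name and helper function are hypothetical; rgwadmin is the helper the task already uses, as seen in the tasks.util.rgw log lines above):

from tasks.util.rgw import rgwadmin  # same helper the existing shutdown path uses

def maybe_gc_on_shutdown(ctx, client, client_config):
    # Hypothetical helper: skip the shutdown-time GC pass when the job opts out,
    # e.g. a suite that deliberately leaves an OSD in an injected-full state.
    if client_config.get('gc-on-shutdown', True):
        rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'],
                 check_status=True)

A job like admin_socket_output.yaml could then set gc-on-shutdown: false under its rgw client config to avoid the hang.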
Updated by Brad Hubbard over 1 year ago
No Casey, it's OK in this instance (although this could possibly be a cautionary tale).
This fixes it and I'll submit a PR next week, thanks.
diff --git a/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml b/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
index 341a559f3f..2bf32a1fd9 100644
--- a/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
+++ b/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
@@ -26,3 +26,4 @@ tasks:
- exec:
client.0:
- ceph_test_admin_socket_output --all
+ - sudo ceph tell osd.* injectfull none # Fixes https://tracker.ceph.com/issues/59380
The problem was of course that we were sending injectfull to one of the osds and this caused IO, and therefore the radosgw-admin gc command, to hang. Clearing injectfull allows IO to proceed since the osd is not actually full.
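For other suites that exercise injectfull and then run client IO, a rough helper along these lines could clear the injected state and wait for the OSD_FULL check to go away before later tasks unwind. This is only a sketch: it reuses `ceph tell osd.* injectfull none` (the same command as the YAML fix above) and assumes `ceph health detail --format json` reports active checks under a 'checks' key.

import json
import subprocess
import time

def clear_injected_full(timeout=60):
    # Clear the injected full state on every OSD (mirrors the YAML fix).
    subprocess.run(['sudo', 'ceph', 'tell', 'osd.*', 'injectfull', 'none'], check=True)
    # Wait for the OSD_FULL health check to clear so that client IO can proceed.
    deadline = time.time() + timeout
    while time.time() < deadline:
        health = json.loads(subprocess.check_output(
            ['ceph', 'health', 'detail', '--format', 'json']))
        if 'OSD_FULL' not in health.get('checks', {}):
            return
        time.sleep(2)
    raise TimeoutError('OSD_FULL health check did not clear')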
Updated by Brad Hubbard over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 58215
Updated by Matan Breizman over 1 year ago
- Tags changed from test-failure to test-failure, main-failures
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
The PR that fixes this issue is currently being tested upstream.
Updated by Laura Flores over 1 year ago
Note from bug scrub: Waiting on needs-qa
Updated by Laura Flores over 1 year ago
Test review is almost done for this one.
Updated by Laura Flores over 1 year ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from reef to squid,reef
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67157: squid: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" added
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67158: reef: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" added
Updated by Upkeep Bot over 1 year ago
- Tags (freeform) set to backport_processed
Updated by Nitzan Mordechai over 1 year ago
Looks like we have a few more suites that use the same exec and are still hitting the issue:
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812731
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812720
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812726
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812725
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812735
Updated by Laura Flores over 1 year ago
Nitzan Mordechai wrote in #note-64:
Looks like we have a few more suites that use the same exec and are still hitting the issue:
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812731
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812720
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812726
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812725
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812735
Was this from an older run? The fix was only merged two weeks ago.
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-08-02_15:32:29-rados-squid-release-distro-default-smithi/7832978
Updated by Brad Hubbard over 1 year ago ยท Edited
$ git checkout wip-yuri2-testing-2024-07-19-0817
$ tail -3 qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
- exec:
client.0:
- ceph_test_admin_socket_output --all
That branch does not include the commit, @Nitzan Mordechai?
Updated by Laura Flores over 1 year ago
/a/yuriw-2024-08-29_14:15:40-rados-wip-yuri8-testing-2024-08-28-1632-squid-distro-default-smithi/7880134
Updated by Laura Flores over 1 year ago
/a/skanta-2024-09-07_00:00:30-rados-squid-release-distro-default-smithi/7893165
Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2024-10-16_22:34:49-rados-wip-yuri10-testing-2024-10-15-1007-distro-default-smithi/7952510/
/a/yuriw-2024-10-16_22:34:49-rados-wip-yuri10-testing-2024-10-15-1007-distro-default-smithi/7952657/
Updated by Naveen Naidu over 1 year ago
/a/yuriw-2024-10-17_14:30:20-rados-wip-yuri-testing-2024-10-09-1524-squid-distro-default-smithi/7954336
/a/yuriw-2024-10-17_14:30:20-rados-wip-yuri-testing-2024-10-09-1524-squid-distro-default-smithi/7954347
Updated by Brad Hubbard over 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 4adc795c49306d4a8ef0d55959b41cd3374c6420
- Fixed In set to v19.3.0-3681-g4adc795c493
- Upkeep Timestamp set to 2025-07-12T14:39:50+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-3681-g4adc795c493 to v19.3.0-3681-g4adc795c49
- Upkeep Timestamp changed from 2025-07-12T14:39:50+00:00 to 2025-07-14T23:40:15+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2411
- Upkeep Timestamp changed from 2025-07-14T23:40:15+00:00 to 2025-11-01T01:34:34+00:00