Bug #59380


rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"

Added by Laura Flores almost 3 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-3681-g4adc795c49
Released In:
v20.2.0~2411
Upkeep Timestamp:
2025-11-01T01:34:34+00:00

Description

2023-03-31T00:25:12.169 INFO:tasks.rgw.client.0.smithi157.stdout:2023-03-31T00:25:12.167+0000 7f470012f700 -1 received  signal: Terminated from /usr/bin/python3 /bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground  (PID: 110884) UID: 0
2023-03-31T00:25:12.169 INFO:tasks.rgw.client.0.smithi157.stdout:2023-03-31T00:25:12.167+0000 7f47041857c0 -1 shutting down
2023-03-31T00:25:14.065 INFO:tasks.ceph.mon.a.smithi157.stderr:2023-03-31T00:25:14.063+0000 7f8bfe441700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2023-03-31T00:25:15.662 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2023-03-31T00:25:17.523 INFO:tasks.ceph.mds.a.smithi157.stderr:2023-03-31T00:25:17.522+0000 7f1f542c9700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2023-03-31T00:25:18.270 INFO:tasks.rgw.client.0:Stopped
2023-03-31T00:25:18.271 DEBUG:teuthology.orchestra.run.smithi157:> rm -f /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock
2023-03-31T00:25:18.298 DEBUG:teuthology.orchestra.run.smithi157:> sudo rm -f /etc/ceph/vault-root-token
2023-03-31T00:25:18.372 DEBUG:teuthology.orchestra.run.smithi157:> radosgw-admin gc process --include-all
2023-03-31T00:25:21.168 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
2023-03-31T12:11:11.597 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2023-03-31T12:11:11.634 DEBUG:teuthology.task.console_log:Killing console logger for smithi157
2023-03-31T12:11:11.635 DEBUG:teuthology.exit:Finished running handlers

/a/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227455
/a/yuriw-2023-04-04_15:24:40-rados-wip-yuri4-testing-2023-03-31-1237-distro-default-smithi/7231345


Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #67157: squid: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" (Resolved, Brad Hubbard)
Copied to RADOS - Backport #67158: reef: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" (Resolved, Brad Hubbard)
Actions #1

Updated by Laura Flores almost 3 years ago

Also in the logs:

/a/yuriw-2023-04-04_15:22:56-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7231083

2023-04-04T18:50:13.147 INFO:tasks.rgw.client.0.smithi191.stdout:2023-04-04T18:50:13.142+0000 7ff108317700 -1 received  signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground  (PID: 29378) UID: 0
2023-04-04T18:50:13.147 INFO:tasks.rgw.client.0.smithi191.stdout:2023-04-04T18:50:13.142+0000 7ff10bb62f80 -1 shutting down
2023-04-04T18:50:13.997 INFO:tasks.ceph.mon.a.smithi191.stderr:2023-04-04T18:50:13.994+0000 7fe05488a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-04-04T18:50:17.024 INFO:tasks.ceph.mon.a.smithi191.stderr:2023-04-04T18:50:17.022+0000 7fe05488a700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2023-04-04T18:50:18.211 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2023-04-04T18:50:19.020 INFO:tasks.ceph.mds.a.smithi191.stderr:2023-04-04T18:50:19.018+0000 7fecb7b6d700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2023-04-04T18:50:19.249 INFO:tasks.rgw.client.0:Stopped
2023-04-04T18:50:19.249 DEBUG:teuthology.orchestra.run.smithi191:> rm -f /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock
2023-04-04T18:50:19.258 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -f /etc/ceph/vault-root-token
2023-04-04T18:50:19.274 DEBUG:teuthology.orchestra.run.smithi191:> radosgw-admin gc process --include-all
2023-04-04T18:50:23.714 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored
2023-04-05T06:39:49.154 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2023-04-05T06:39:49.186 DEBUG:teuthology.task.console_log:Killing console logger for smithi191
2023-04-05T06:39:49.187 DEBUG:teuthology.exit:Finished running handlers

Actions #2

Updated by Laura Flores almost 3 years ago

  • Subject changed from rados/singleton-nomsgr: Health check failed: 1 full osd(s) (OSD_FULL) to rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)"
Actions #3

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-04-25_21:30:50-rados-wip-yuri3-testing-2023-04-25-1147-distro-default-smithi/7253420

Actions #4

Updated by Nitzan Mordechai almost 3 years ago

  • Project changed from RADOS to rgw

This doesn't look like an OSD_FULL issue: we never got to the point of scanning for log-ignore messages, and the job was stuck for 12 hours before teuthology killed it.
I think it's the rgw client we are waiting on, which had a core dump that we didn't collect:

  -306> 2023-04-04T22:32:57.496+0000 7f8b3005f7c0  5 rgw main: tl::expected<std::pair<boost::container::flat_map<long unsigned int, logback_generation>, obj_version>, boost::system::error_code> logback_generations::read(const DoutPrefixProvider*, optional_yield):410: oid=data_loggenerations_metadata not found
  -305> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
  -304> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
  -303> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding auth protocol: cephx
  -302> 2023-04-04T22:32:57.526+0000 7f8b11fd4700  2 rgw data changes log: RGWDataChangesLog::ChangesRenewThread: start
  -301> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -300> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
  -299> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -298> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
  -297> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -296> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
  -295> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -294> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
  -293> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -292> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: crc
  -291> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 AuthRegistry(0x7ffe6a3da110) adding con mode: secure
  -290> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  2 auth: KeyRing::load: loaded key file /etc/ceph/ceph.client.0.keyring
  -289> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 asok(0x55f851b3e000) register_command cache list hook 0x55f851854050
  -288> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 asok(0x55f851b3e000) register_command cache inspect hook 0x55f851854050
  -287> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 asok(0x55f851b3e000) register_command cache erase hook 0x55f851854050
  -286> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  5 asok(0x55f851b3e000) register_command cache zap hook 0x55f851854050
  -285> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: get_monmap_and_config
  -284> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: build_initial_monmap
  -283> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0  1 build_initial for_mkfs: 0
  -282> 2023-04-04T22:32:57.526+0000 7f8b3005f7c0 10 monclient: monmap:

Could someone from RGW please take a look?

Actions #5

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250243

Actions #6

Updated by Casey Bodley almost 3 years ago

I don't see evidence of an rgw crash, but if an OSD is full, couldn't that cause radosgw to hang waiting on requests to it?

Actions #7

Updated by Nitzan Mordechai almost 3 years ago

Casey Bodley, can you please check: /a/yuriw-2023-04-24_22:54:45-rados-wip-yuri7-testing-2023-04-19-1343-distro-default-smithi/7250243/remote/smithi138/log/rgw.ceph.client.0.log.gz
for example

Actions #8

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276080
/a/yuriw-2023-05-16_23:44:06-rados-wip-yuri10-testing-2023-05-16-1243-distro-default-smithi/7276235

Actions #9

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271099
/a/yuriw-2023-05-11_15:01:38-rados-wip-yuri8-testing-2023-05-10-1402-distro-default-smithi/7271254

Actions #10

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-05-19_19:19:25-rados-wip-yuri11-testing-2023-05-19-0836-distro-default-smithi/7279399
/a/yuriw-2023-05-19_19:19:25-rados-wip-yuri11-testing-2023-05-19-0836-distro-default-smithi/7279244

Actions #11

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-06-01_19:33:38-rados-wip-yuri-testing-2023-06-01-0746-distro-default-smithi/7293994

Actions #12

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-06-06_14:45:18-rados-wip-yuri7-testing-2023-06-05-1505-distro-default-smithi/7296832

Actions #13

Updated by Laura Flores almost 3 years ago

/a/yuriw-2023-06-13_18:33:48-rados-wip-yuri10-testing-2023-06-02-1406-distro-default-smithi/7302949

Actions #14

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-07-03_15:32:36-rados-wip-yuri5-testing-2023-06-28-1515-distro-default-smithi/7325426

Actions #15

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-07-03_15:30:40-rados-wip-yuri7-testing-2023-06-23-1022-distro-default-smithi/7324875

Actions #16

Updated by Casey Bodley over 2 years ago

I do see rgw crashes in these logs:

   -12> 2023-05-17T02:30:05.789+0000 7f8007d53700  1 do_command 'config diff' '{prefix=config diff}'
   -11> 2023-05-17T02:30:05.789+0000 7f8007d53700  1 do_command 'config diff' '{prefix=config diff}' result is 0 bytes
   -10> 2023-05-17T02:30:05.789+0000 7f8007d53700  1 do_command 'config help' '{prefix=config help}'
    -9> 2023-05-17T02:30:05.797+0000 7f8007d53700  1 do_command 'config help' '{prefix=config help}' result is 0 bytes
    -8> 2023-05-17T02:30:05.801+0000 7f8007d53700  1 do_command 'config show' '{prefix=config show}'
    -7> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'config show' '{prefix=config show}' result is 0 bytes
    -6> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'counter dump' '{prefix=counter dump}'
    -5> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'counter dump' '{prefix=counter dump}' result is 0 bytes
    -4> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'counter schema' '{prefix=counter schema}'
    -3> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'counter schema' '{prefix=counter schema}' result is 0 bytes
    -2> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'injectargs' '{prefix=injectargs}'
    -1> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'injectargs' '{prefix=injectargs}' result is 0 bytes
     0> 2023-05-17T02:30:05.805+0000 7f8007d53700  1 do_command 'log dump' '{prefix=log dump}'

but there's no stack trace, and ceph_test_admin_socket_output passes anyway

Actions #17

Updated by Kamoltat (Junior) Sirivadhna over 2 years ago

/a/yuriw-2023-07-10_18:41:02-rados-wip-yuri6-testing-2023-07-10-0816-distro-default-smithi/7332405

Actions #18

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-07-17_14:37:31-rados-wip-yuri-testing-2023-07-14-1641-distro-default-smithi/7341584

Actions #19

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-08-17_21:18:20-rados-wip-yuri11-testing-2023-08-17-0823-distro-default-smithi/7372055

Actions #20

Updated by Matan Breizman over 2 years ago

/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376756
/a/yuriw-2023-08-22_18:16:03-rados-wip-yuri10-testing-2023-08-17-1444-distro-default-smithi/7376912

Actions #21

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-08-15_18:58:56-rados-wip-yuri3-testing-2023-08-15-0955-distro-default-smithi/7369398

Actions #22

Updated by Laura Flores over 2 years ago

/a/lflores-2023-09-05_22:05:24-rados-wip-yuri8-testing-2023-08-28-1340-distro-default-smithi/7389178

Actions #23

Updated by Laura Flores over 2 years ago

/a/lflores-2023-09-06_18:23:03-rados-wip-yuri-testing-2023-08-25-0809-distro-default-smithi/7390358

Actions #24

Updated by Laura Flores over 2 years ago

/a/lflores-2023-09-08_20:36:06-rados-wip-lflores-testing-2-2023-09-08-1755-distro-default-smithi/7391626

Actions #25

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-10-24_00:11:54-rados-wip-yuri4-testing-2023-10-23-0903-distro-default-smithi/7435687

Actions #26

Updated by Nitzan Mordechai over 2 years ago

/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441165
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441319

Actions #27

Updated by Laura Flores over 2 years ago

  • Backport set to reef

/a/yuriw-2023-10-31_14:43:48-rados-wip-yuri4-testing-2023-10-30-1117-distro-default-smithi/7442070

Actions #28

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-10-24_00:11:03-rados-wip-yuri2-testing-2023-10-23-0917-distro-default-smithi/7435871

Actions #29

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-12-07_16:42:12-rados-wip-yuri2-testing-2023-12-06-1239-distro-default-smithi/7482272

Actions #30

Updated by Nitzan Mordechai over 2 years ago

/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487804
/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487806
/a/yuriw-2023-12-11_23:27:14-rados-wip-yuri8-testing-2023-12-11-1101-distro-default-smithi/7487647

Actions #31

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2024-01-03_16:19:00-rados-wip-yuri6-testing-2024-01-02-0832-distro-default-smithi/7505468

Actions #32

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-01-18_21:18:17-rados-wip-yuri8-testing-2024-01-18-0823-distro-default-smithi/7521174

Actions #33

Updated by Matan Breizman about 2 years ago

/a/yuriw-2024-02-09_00:15:46-rados-wip-yuri2-testing-2024-02-08-0727-distro-default-smithi/7553401

Actions #34

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2024-03-15_19:59:43-rados-wip-yuri6-testing-2024-03-15-0709-distro-default-smithi/7603379/
/a/yuriw-2024-03-15_19:59:43-rados-wip-yuri6-testing-2024-03-15-0709-distro-default-smithi/7603639/

Actions #35

Updated by Aishwarya Mathuria about 2 years ago

/a/yuriw-2024-03-19_00:09:45-rados-wip-yuri5-testing-2024-03-18-1144-distro-default-smithi/7609873

Actions #36

Updated by Aishwarya Mathuria almost 2 years ago

/a/yuriw-2024-03-20_14:28:32-rados-wip-yuri2-testing-2024-03-13-0827-distro-default-smithi/7612365/

Actions #37

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-03-20_18:33:32-rados-wip-yuri6-testing-2024-03-18-1406-squid-distro-default-smithi/7613140

Actions #38

Updated by Aishwarya Mathuria almost 2 years ago · Edited

/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7681095/
/a/yuriw-2024-04-30_14:17:59-rados-wip-yuri5-testing-2024-04-17-1400-distro-default-smithi/7680996/

Actions #39

Updated by Laura Flores almost 2 years ago

/a/yuriw-2024-05-02_23:59:28-rados-wip-yuriw11-testing-20240501.200505-squid-distro-default-smithi/7687043

Actions #40

Updated by Sridhar Seshasayee almost 2 years ago

/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705412
/a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705439

Actions #41

Updated by Aishwarya Mathuria almost 2 years ago

/a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7747149
/a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7746999

Actions #42

Updated by Neha Ojha almost 2 years ago

Looking at the logs for /a/yuriw-2024-05-14_00:32:08-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7705412/, it looks to me like the log dump command is invoked before osd.1 becomes full (Injected full).

    -1> 2024-05-14T01:39:53.618+0000 7f0769c3c640  1 do_command 'injectargs' '{prefix=injectargs}' result is 0 bytes
     0> 2024-05-14T01:39:53.618+0000 7f0769c3c640  1 do_command 'log dump' '{prefix=log dump}'
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
2024-05-14T01:40:00.834+0000 7f5fee7cb640 20 osd.1 30 check_full_status cur ratio 0.000345268, physical ratio 0.000345268, new state full (Injected)
2024-05-14T01:40:00.834+0000 7f5fee7cb640 10 osd.1 30 check_full_status none -> full
...
2024-05-14T01:40:01.606+0000 7f6009812640 10 osd.1 30 send_full_update want state full
2024-05-14T01:40:01.606+0000 7f6009812640  1 -- [v2:172.21.15.22:6808/3868053108,v1:172.21.15.22:6809/3868053108] --> [v2:172.21.15.22:3300/0,v1:172.21.15.22:6789/0] -- osd_full(e30 full v30) -- 0x563536ccb380 con 0x563535752400
Actions #43

Updated by Neha Ojha almost 2 years ago

  • Assignee set to Brad Hubbard
Actions #44

Updated by Brad Hubbard almost 2 years ago

So in the logs I can still see that we are running the ceph_test_admin_socket_output command, which reflects the available commands back to each daemon it is testing. That includes the 'exit' admin socket command on the mds, and I'm wondering if this is the actual problem.

$ egrep "(bash -c 'ceph_test_admin_socket_output|Exiting due to admin socket command)" /a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7747149/teuthology.log 
2024-06-10T05:54:49.744 DEBUG:teuthology.orchestra.run.smithi150:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph_test_admin_socket_output --all'
2024-06-10T05:54:50.859 INFO:tasks.ceph.mds.a.smithi150.stderr:2024-06-10T05:54:50.860+0000 7f7b6bbd7640 -1 mds.a Exiting due to admin socket command
2024-06-10T05:54:53.436 INFO:tasks.ceph.mds.a.smithi150.stderr:   -63> 2024-06-10T05:54:50.860+0000 7f7b6bbd7640 -1 mds.a Exiting due to admin socket command

$ egrep "(bash -c 'ceph_test_admin_socket_output|Exiting due to admin socket command)" /a/teuthology-2024-06-09_21:00:03-rados-squid-distro-default-smithi/7746999/teuthology.log 
2024-06-10T04:37:47.117 DEBUG:teuthology.orchestra.run.smithi133:> sudo TESTDIR=/home/ubuntu/cephtest bash -c 'ceph_test_admin_socket_output --all'
2024-06-10T04:37:48.233 INFO:tasks.ceph.mds.a.smithi133.stderr:2024-06-10T04:37:48.244+0000 7f3d41732640 -1 mds.a Exiting due to admin socket command
2024-06-10T04:37:50.808 INFO:tasks.ceph.mds.a.smithi133.stderr:   -63> 2024-06-10T04:37:48.244+0000 7f3d41732640 -1 mds.a Exiting due to admin socket command

That looks probable to me at this stage, trying to reproduce and see if I can test that theory.

By a curious quirk of fate I wrote that test so it immediately caught my attention.

Actions #45

Updated by Brad Hubbard almost 2 years ago

So my first speculation is a "no" I believe :)

I'm working through this but here's what I've got so far.

2024-05-14T01:57:46.093 DEBUG:teuthology.run_tasks:Unwinding manager rgw
2024-05-14T01:57:46.103 DEBUG:tasks.rgw.client.0:waiting for process to exit
2024-05-14T01:57:46.103 INFO:teuthology.orchestra.run:waiting for 300
2024-05-14T01:57:46.136 INFO:tasks.rgw.client.0.smithi089.stdout:2024-05-14T01:57:46.136+0000 7fad828e9640 -1 received  signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term radosgw --rgw-frontends beast port=80 -n client.0 --cluster ceph -k /etc/ceph/ceph.client.0.keyring --log-file /var/log/ceph/rgw.ceph.client.0.log --rgw_ops_log_socket_path /home/ubuntu/cephtest/rgw.opslog.ceph.client.0.sock --foreground  (PID: 26042) UID: 0
2024-05-14T01:57:46.136 INFO:tasks.rgw.client.0.smithi089.stdout:2024-05-14T01:57:46.136+0000 7fad86131a40 -1 shutting down
2024-05-14T01:57:47.752 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 is failed for ~0s
2024-05-14T01:57:50.011 INFO:tasks.ceph.mon.a.smithi089.stderr:2024-05-14T01:57:50.008+0000 7f098360b640 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL)
2024-05-14T01:57:50.996 INFO:tasks.ceph.mds.a.smithi089.stderr:2024-05-14T01:57:50.996+0000 7f885e903640 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-05-14T01:57:52.206 INFO:tasks.rgw.client.0:Stopped

radosgw was ceph.rgw.client.0 and died on receiving a signal, more on this soon.

The watchdog has marked ceph.rgw.client.0 as failed.

2024-05-14T01:57:52.324 INFO:tasks.util.rgw:rgwadmin: client.0 : ['gc', 'process', '--include-all']
2024-05-14T01:57:52.325 DEBUG:tasks.util.rgw:rgwadmin: cmd=['adjust-ulimits', 'ceph-coverage', '/home/ubuntu/cephtest/archive/coverage', 'radosgw-admin', '--log-to-stderr', '--format', 'json', '-n', 'client.0', '--cluster', 'ceph', 'gc', 'process', '--include-all']
2024-05-14T01:57:52.325 DEBUG:teuthology.orchestra.run.smithi089:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster ceph gc process --include-all
2024-05-14T01:57:53.256 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored

radosgw-admin starts running, and I believe it hangs because there is no radosgw daemon running (confirmed in my reproducer). But it gets detected as ceph.rgw.client.0, so the daemon watchdog thinks the rgw daemon is back, does not think anything failed, and will wait 12 hours for the test to "finish" before it times out [0].

Now why does the radosgw daemon receive a signal? I believe it is probably the fairly recently added "raise" signal [1] sent by ceph_test_admin_socket_output, and this is killing it. I need to do some more testing to confirm that, but certainly if I leave the radosgw-admin process alone it runs indefinitely, and if I kill it the test finishes (fails) immediately.

Need to wrap this up and put a bow on it but I think we are zeroing in on the RC.

[0] https://github.com/ceph/ceph/blob/3394d1bbd4a653a92107b32da59890d10f286646/qa/tasks/daemonwatchdog.py#L98-L115
[1] https://github.com/ceph/ceph/commit/7cc54b2b7323096bf7394d8502ff232a78e0c355
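The watchdog misdetection described above can be sketched as follows. This is a hypothetical, simplified illustration of how a liveness check keyed on a loose command-name prefix can be fooled; the function names and structure are invented and this is not the actual qa/tasks/daemonwatchdog.py logic.

```python
# Hypothetical sketch: a watchdog that decides a daemon is alive by checking
# whether any running command line starts with the daemon's name prefix.
# An unrelated admin tool with a similar name then "restores" the daemon.
# (Simplified illustration only, not the real daemonwatchdog code.)

def watch(daemons, find_procs):
    """daemons: {daemon_name: expected_cmd_prefix}; find_procs: () -> list of cmdlines."""
    status = {}
    procs = find_procs()
    for name, prefix in daemons.items():
        # The pitfall illustrated: 'radosgw-admin ...' starts with 'radosgw',
        # so it satisfies a check intended for the radosgw daemon itself.
        alive = any(cmd.startswith(prefix) for cmd in procs)
        status[name] = "running" if alive else "failed"
    return status

# The rgw daemon has exited, but a radosgw-admin command is running:
print(watch({"ceph.rgw.client.0": "radosgw"},
            lambda: ["radosgw-admin gc process --include-all"]))
# reports ceph.rgw.client.0 as "running" even though the daemon is gone
```

Under this sketch, the watchdog never re-flags the daemon as failed, which matches the observed 12-hour hang until teuthology's global timeout fires.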

Actions #47

Updated by Neha Ojha almost 2 years ago

rados/singleton-nomsgr/{all/admin_socket_output tests have been failing for more than a year now.

https://pulpito.ceph.com/yuriw-2023-03-30_21:29:24-rados-wip-yuri2-testing-2023-03-30-0826-distro-default-smithi/7227455/ - oldest occurrence I could find. I suspect https://github.com/ceph/ceph/pull/45693/files#diff-01a30802cc74ea92dc086059fdd77bb5bcc4c04a2eb3b74f5620385615916f0bR242 could be related.

Actions #48

Updated by Brad Hubbard almost 2 years ago

Casey Bodley wrote in #note-16:

i do see rgw crashes in these logs:
[...]

but there's no stack trace, and ceph_test_admin_socket_output passes anyway

Right, and those entries are just the 'log dump' command if I'm not mistaken?

From this line it's clear the rgw instance is being shut down by teuthology as a normal operation after the tasks have all completed.

teuthology.run_tasks:Unwinding manager rgw

If I add the following change to qa/ this test does not fail.

bhubbard@teuthology:~/src/git.ceph.com_ceph_7b7a5d375cf4e4afab96b9d4d229b699ae8304bb$ git diff
diff --git a/qa/tasks/rgw.py b/qa/tasks/rgw.py
index 2e44c0bad5..880f72a6ff 100644
--- a/qa/tasks/rgw.py
+++ b/qa/tasks/rgw.py
@@ -266,7 +266,7 @@ def start_rgw(ctx, config, clients):
                 )
             ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', token_path])
             ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', url_file])
-            rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)
+            # rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)

 def assign_endpoints(ctx, config, default_cert):
     role_endpoints = {}

So there are two issues here that need to be resolved AFAICT at this stage.

1: The radosgw-admin gc command runs indefinitely.
2: Per comment #45, even though the rgw daemon was stopped normally, the daemon_watchdog thinks it is running because it incorrectly detects the radosgw-admin gc task as the rgw daemon restarting. Since the radosgw-admin gc task runs indefinitely, the daemon_watchdog waits twelve hours before timing out and failing the test.
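One possible guard against the first issue (an assumption for illustration, not the fix that was actually merged) would be to bound the teardown call with coreutils timeout(1), which kills the command and exits with status 124 when the limit expires. Demonstrated here with `sleep` standing in for the hung radosgw-admin call:

```shell
# Demonstration of timeout(1) semantics, with `sleep 30` standing in for a
# hung radosgw-admin process: timeout kills the command after 1 second and
# exits with status 124, which a wrapper can detect instead of blocking.
timeout 1 sleep 30
rc=$?
if [ "$rc" -eq 124 ]; then
    echo "command timed out (rc=$rc)"
fi
```

A teuthology task could use the same pattern to fail fast with a clear error instead of stalling until the 12-hour job timeout.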

I created the following patch and ran the test again in an attempt to gather additional information from the radosgw-admin process.

bhubbard@teuthology:~/src/git.ceph.com_ceph_7b7a5d375cf4e4afab96b9d4d229b699ae8304bb$ git diff
diff --git a/qa/tasks/rgw.py b/qa/tasks/rgw.py
index 2e44c0bad5..262d3eb968 100644
--- a/qa/tasks/rgw.py
+++ b/qa/tasks/rgw.py
@@ -266,7 +266,7 @@ def start_rgw(ctx, config, clients):
                 )
             ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', token_path])
             ctx.cluster.only(client).run(args=['sudo', 'rm', '-f', url_file])
-            rgwadmin(ctx, client, cmd=['gc', 'process', '--include-all'], check_status=True)
+            rgwadmin(ctx, client, cmd=['--debug-rgw', '20', '--debug-rados', '20', 'gc', 'process', '--include-all'], check_status=True)

 def assign_endpoints(ctx, config, default_cert):
     role_endpoints = {}

It produced the following logs. I also looked at gstack output; every thread appears to be waiting, so the process seems to be waiting for something that never arrives.

2024-06-20 03:04:54,634.634 DEBUG:teuthology.orchestra.run.smithi057:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage radosgw-admin --log-to-stderr --format json -n client.0 --cluster ceph 
--debug-rgw 20 --debug-rados 20 gc process --include-all                                                                                                                                                          
2024-06-20 03:04:54,747.747 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800  1 librados: starting msgr at                    
2024-06-20 03:04:54,747.747 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800  1 librados: starting objecter              
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800  1 librados: setting wanted keys                 
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.748+0000 7f9a1b525800  1 librados: calling monclient init                    
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.749+0000 7f9a1b525800  1 librados: init done                                                            
2024-06-20 03:04:54,748.748 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.749+0000 7f9a1b525800 10 librados: wait_for_osdmap waiting
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.750+0000 7f9a1b525800 10 librados: wait_for_osdmap done waiting  
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.750+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=  
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.751+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,749.749 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.751+0000 7f9a1b525800 10 librados: read oid=zone_names.default nspace=
2024-06-20 03:04:54,750.750 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: Objecter returned from read r=0          
2024-06-20 03:04:54,750.750 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: call oid=zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887 nspace=
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.752+0000 7f9a1b525800 10 librados: Objecter returned from call r=0
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: read oid=zonegroups_names.default nspace=
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: Objecter returned from read r=0                                     
2024-06-20 03:04:54,751.751 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: call oid=zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5 nspace=
2024-06-20 03:04:54,752.752 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.753+0000 7f9a1b525800 10 librados: Objecter returned from call r=0
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.757+0000 7f9a1b525800  1 librados: starting msgr at
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.757+0000 7f9a1b525800  1 librados: starting objecter       
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.758+0000 7f9a1b525800  1 librados: setting wanted keys                                         
2024-06-20 03:04:54,756.756 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.758+0000 7f9a1b525800  1 librados: calling monclient init                                      
2024-06-20 03:04:54,757.757 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.759+0000 7f9a1b525800  1 librados: init done                                                     
2024-06-20 03:04:54,759.759 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,759.759 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2      
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0      
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 realm                               
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,760.760 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.761+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=            
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800  4 RGWPeriod::init failed to init realm  id  : (2) No such file or directory
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados->read ofs=0 len=0                             
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=  
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0     
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 20 rados->read ofs=0 len=0                                                             
2024-06-20 03:04:54,761.761 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.762+0000 7f9a1b525800 10 librados: read oid=zone_names.default nspace=
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: Objecter returned from read r=0
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=46
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: read oid=zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887 nspace=
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 10 librados: Objecter returned from read r=0                                                 
2024-06-20 03:04:54,762.762 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=993                                                     
2024-06-20 03:04:54,763.763 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.763+0000 7f9a1b525800 20 searching for the correct realm                                                           
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zone_info.389a7194-1154-447c-91b8-1bf4ae4c5887                
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got default.zonegroup.
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got default.zone.
2024-06-20 03:04:54,770.770 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zone_names.default
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zonegroups_names.default
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 RGWRados::pool_iterate: got zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados->read ofs=0 len=0          
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: read oid=default.realm nspace=
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0      
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 20 rados->read ofs=0 len=0                      
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.770+0000 7f9a1b525800 10 librados: read oid=zonegroups_names.default nspace=
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=0                                
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=46     
2024-06-20 03:04:54,771.771 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0                            
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: read oid=zonegroup_info.a1d12ca3-3683-47ea-815c-25f465240cc5 nspace=
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=0                                     
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=0 bl.length=436    
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 zone default found        
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800  4 Realm:                          ()
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800  4 ZoneGroup: default              (a1d12ca3-3683-47ea-815c-25f465240cc5)
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800  4 Zone:      default              (389a7194-1154-447c-91b8-1bf4ae4c5887)
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 cannot find current period zonegroup using local zonegroup configuration
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 zonegroup default      
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0                 
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: read oid=period_config.default nspace=
2024-06-20 03:04:54,772.772 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from read r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: call oid=bucket.sync-source-hints. nspace=
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: Objecter returned from call r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 20 rados->read ofs=0 len=0                                                  
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.771+0000 7f9a1b525800 10 librados: call oid=bucket.sync-target-hints. nspace=
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 10 librados: Objecter returned from call r=-2
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 rados_obj.operate() r=-2 bl.length=0      
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 started sync module instance, tier type =
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 20 started zone id=389a7194-1154-447c-91b8-1bf4ae4c5887 (name=default) with tier type =
2024-06-20 03:04:54,773.773 INFO:teuthology.orchestra.run.smithi057.stderr:2024-06-20T03:04:54.772+0000 7f9a1b525800 10 librados: create oid=notify.0 nspace=        
2024-06-20 03:04:56,380.380 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.rgw.client.0 has been restored

So it looks like there are some unexpected ENOENTs there? Casey, are you able to provide any feedback on this at this time?

Actions #49

Updated by Casey Bodley almost 2 years ago

when radosgw-admin hangs, are all of the osds still up and healthy?

Actions #50

Updated by Brad Hubbard over 1 year ago

Hmm... I see your point... :P

Let me work out a way to clear that.

Actions #51

Updated by Casey Bodley over 1 year ago

Brad Hubbard wrote in #note-50:

Hmm... I see your point... :P

Let me work out a way to clear that.

if that proves difficult, the rgw task could make that command conditional on a gc-on-shutdown flag so this job can disable it

Actions #52

Updated by Brad Hubbard over 1 year ago

No, Casey, it's OK in this instance (although this could possibly be a cautionary tale).

This fixes it and I'll submit a PR next week, thanks.

diff --git a/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml b/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
index 341a559f3f..2bf32a1fd9 100644
--- a/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
+++ b/qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
@@ -26,3 +26,4 @@ tasks:
 - exec:
     client.0:
       - ceph_test_admin_socket_output --all
+      - sudo ceph tell osd.* injectfull none # Fixes https://tracker.ceph.com/issues/59380

The problem, of course, was that we were sending injectfull to one of the OSDs, which caused IO, and therefore the radosgw-admin gc command, to hang. Clearing injectfull allows IO to proceed, since the OSD is not actually full.
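For reference, the inject-and-clear sequence the test and workaround rely on can be sketched as follows. This is a minimal sketch against a running test cluster (osd.0 and the exact health wording are illustrative; output varies by release):

```shell
# Mark an OSD as full without actually filling it -- this is what the
# test exercises, and it makes client IO to that OSD block:
sudo ceph tell osd.0 injectfull full

# Health now reports OSD_FULL, and commands that do IO through that OSD
# (such as 'radosgw-admin gc process --include-all') hang:
ceph health detail

# Clear the injected full state on all OSDs; since no OSD is actually
# full, blocked IO simply resumes with no data movement:
sudo ceph tell osd.* injectfull none
```

This is why the fix runs `injectfull none` in the job's final exec step, before the rgw task's gc-on-shutdown command is issued.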

Actions #53

Updated by Brad Hubbard over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 58215
Actions #54

Updated by Matan Breizman over 1 year ago

  • Tags changed from test-failure to test-failure, main-failures
Actions #55

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

The PR that fixes this issue is currently being tested upstream.

Actions #57

Updated by J. Eric Ivancich over 1 year ago

  • Project changed from rgw to RADOS
Actions #58

Updated by Laura Flores over 1 year ago

Note from bug scrub: Waiting on needs-qa

Actions #59

Updated by Laura Flores over 1 year ago

Test review is almost done for this one.

Actions #60

Updated by Laura Flores over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from reef to squid,reef
Actions #61

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67157: squid: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" added
Actions #62

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67158: reef: rados/singleton-nomsgr: test failing from "Health check failed: 1 full osd(s) (OSD_FULL)" and "Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)" added
Actions #63

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
Actions #64

Updated by Nitzan Mordechai over 1 year ago

Looks like we have a few more suites that use the same exec and are still hitting the issue:
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812731
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812720
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812726
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812725
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812735

Actions #65

Updated by Laura Flores over 1 year ago

Nitzan Mordechai wrote in #note-64:

Looks like we have a few more suites that use the same exec and are still hitting the issue:
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812731
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812720
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812726
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812725
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812735

Was this from an older run? The fix was only merged two weeks ago.

Actions #66

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-08-02_15:32:29-rados-squid-release-distro-default-smithi/7832978

Actions #67

Updated by Brad Hubbard over 1 year ago · Edited

$ git checkout wip-yuri2-testing-2024-07-19-0817
$ tail -3 qa/suites/rados/singleton-nomsgr/all/admin_socket_output.yaml
- exec:
    client.0:
      - ceph_test_admin_socket_output --all

That branch does not include the commit, @Nitzan Mordechai?

Actions #68

Updated by Laura Flores over 1 year ago

/a/yuriw-2024-08-29_14:15:40-rados-wip-yuri8-testing-2024-08-28-1632-squid-distro-default-smithi/7880134

Actions #69

Updated by Laura Flores over 1 year ago

/a/skanta-2024-09-07_00:00:30-rados-squid-release-distro-default-smithi/7893165

Actions #70

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2024-10-16_22:34:49-rados-wip-yuri10-testing-2024-10-15-1007-distro-default-smithi/7952510/
/a/yuriw-2024-10-16_22:34:49-rados-wip-yuri10-testing-2024-10-15-1007-distro-default-smithi/7952657/

Actions #71

Updated by Naveen Naidu over 1 year ago

/a/yuriw-2024-10-17_14:30:20-rados-wip-yuri-testing-2024-10-09-1524-squid-distro-default-smithi/7954336
/a/yuriw-2024-10-17_14:30:20-rados-wip-yuri-testing-2024-10-09-1524-squid-distro-default-smithi/7954347

Actions #72

Updated by Brad Hubbard over 1 year ago

  • Status changed from Pending Backport to Resolved
Actions #73

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 4adc795c49306d4a8ef0d55959b41cd3374c6420
  • Fixed In set to v19.3.0-3681-g4adc795c493
  • Upkeep Timestamp set to 2025-07-12T14:39:50+00:00
Actions #74

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-3681-g4adc795c493 to v19.3.0-3681-g4adc795c49
  • Upkeep Timestamp changed from 2025-07-12T14:39:50+00:00 to 2025-07-14T23:40:15+00:00
Actions #75

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2411
  • Upkeep Timestamp changed from 2025-07-14T23:40:15+00:00 to 2025-11-01T01:34:34+00:00