Actions
Bug #67406
closedtasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Testing
Target version:
% Done:
0%
Source:
Q/A
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
quiesce
Labels (FS):
qa, qa-failure
Pull request ID:
Tags (freeform):
backport_processed
Merge Commit:
Fixed In:
v19.3.0-4533-g48cb2d6e7b
Released In:
v20.2.0~2151
Upkeep Timestamp:
2025-11-01T01:35:52+00:00
Description
/a/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711
description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate 3-modules} clusters/1a5s-mds-1c-client
conf/{client mds mgr mon osd} distro/{centos_latest} mount/fuse msgr-failures/osd-mds-delay
objectstore-ec/bluestore-comp-ec-root overrides/{client-shutdown frag ignorelist_health
ignorelist_wrongly_marked_down pg_health prefetch_dirfrags/no prefetch_dirfrags/yes
prefetch_entire_dirfrags/no prefetch_entire_dirfrags/yes races session_timeout thrashosds-health}
ranks/3 tasks/{1-thrash/with-quiesce 2-workunit/suites/pjd}}
pjd.sh failed since daemon watchdog unmounted due to failed quiescer task:
2024-08-01T15:18:34.532 ERROR:tasks.quiescer.fs.[cephfs]:exception:
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_ceph-c_1e006c2acbb487c272896d2c2951fa5f0332117f/qa/tasks/quiescer.py", line 357, in _run
self.do_quiesce(d)
File "/home/teuthworker/src/git.ceph.com_ceph-c_1e006c2acbb487c272896d2c2951fa5f0332117f/qa/tasks/quiescer.py", line 313, in do_quiesce
raise RuntimeError(f"Error quiescing set '{set_id}': {rcinfo(rc)}")
RuntimeError: Error quiescing set '8c1c8735': 110 (ETIMEDOUT)
Updated by Venky Shankar over 1 year ago
- Status changed from New to Triaged
- Assignee set to Patrick Donnelly
- ceph-qa-suite fs added
Updated by Patrick Donnelly over 1 year ago
- Description updated (diff)
correcting the description (wrong yaml set pasted)
Updated by Patrick Donnelly over 1 year ago
- Status changed from Triaged to In Progress
socket failure:
2024-08-01T15:17:04.482+0000 7ffb65200640 2 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 crc :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x5623a678f400 seq=33 lock(a=lock ifile 0x1.head) ... 2024-08-01T15:17:04.483+0000 7ffb65200640 0 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 msgr2=0x5623a68ae100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure 2024-08-01T15:17:04.483+0000 7ffb65200640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 msgr2=0x5623a68ae100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe 2024-08-01T15:17:04.483+0000 7ffb65200640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 crc :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed 2024-08-01T15:17:04.484+0000 7ffb65200640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby ... 2024-08-01T15:17:04.485+0000 7ffb60c00640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] send_to--> mds [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] -- lock(a=sync ifile 0x1.head) -- ?+0 0x5623a6665c00 2024-08-01T15:17:04.485+0000 7ffb60c00640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] -- lock(a=sync ifile 0x1.head) -- 0x5623a6665c00 con 0x5623a683c900 2024-08-01T15:17:04.485+0000 7ffb60c00640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] send_to--> mds [v2:172.21.15.53:6833/3227231508,v1:172.21.15.53:6836/3227231508] -- lock(a=sync ifile 0x1.head) -- ?+0 0x5623a6896200 2024-08-01T15:17:04.485+0000 7ffb60c00640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --> [v2:172.21.15.53:6833/3227231508,v1:172.21.15.53:6836/3227231508] -- lock(a=sync ifile 0x1.head) -- 0x5623a6896200 con 0x5623a684cd80 2024-08-01T15:17:04.485+0000 7ffb65200640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=STANDBY pgs=13 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).connect ... 2024-08-01T15:17:04.485+0000 7ffb64800640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683cd80 0x5623a6900000 crc :-1 s=SESSION_ACCEPTING pgs=23 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect reconnect race detected, replacing existing=0x5623a683c900 socket by this connection's socket 2024-08-01T15:17:04.485+0000 7ffb64800640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683cd80 0x5623a6900000 crc :-1 s=SESSION_ACCEPTING pgs=23 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect reconnect to existing=0x5623a683c900 ... 2024-08-01T15:17:04.486+0000 7ffb60c00640 10 mds.f new session 0x5623a681d400 for mds.0 v2:172.21.15.94:6834/2050407724 con 0x5623a683c900 2024-08-01T15:17:04.486+0000 7ffb64800640 1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=23 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=mds.1 client_cookie=3f8b6a2bb5a1f933 server_cookie=2128a6c96319b669 in_seq=34 out_seq=33 2024-08-01T15:17:04.486+0000 7ffb60c00640 1 mds.f parse_caps: cannot decode auth caps buffer of length 0 2024-08-01T15:17:04.486+0000 7ffb60c00640 10 mds.f ms_handle_accept v2:172.21.15.94:6834/2050407724 con 0x5623a683c900 session 0x5623a681d400 2024-08-01T15:17:04.486+0000 7ffb64800640 2 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=23 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x5623a6665c00 seq=34 lock(a=sync ifile 0x1.head) ... 2024-08-01T15:17:04.487+0000 7ffb60c00640 1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] <== mds.1 v2:172.21.15.94:6834/2050407724 35 ==== ==== 0+0+0 (unknown 0 0 0) 0x5623a68db200 con 0x5623a683c900 2024-08-01T15:17:04.487+0000 7ffb55800640 20 quiesce.mgr.4375 <quiesce_db_thread_main> db idle, age: 61.085994720s next_event_at_age: 106.079994202s 2024-08-01T15:17:04.487+0000 7ffb60c00640 -1 quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2] 2024-08-01T15:17:04.487+0000 7ffb60c00640 10 quiesce.mds.0 <quiesce_dispatch> dump: 2024-08-01T15:17:04.487+0000 7ffb60c00640 0 ms_deliver_dispatch: unhandled message 0x5623a68db200 from mds.1 v2:172.21.15.94:6834/2050407724
From: /teuthology/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711/remote/smithi094/log/ceph-mds.f.log.gz
then
2024-08-01T15:17:04.483+0000 7ff13d600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] <== mds.0 v2:172.21.15.94:6835/3679291363 33 ==== lock(a=lock ifile 0x1.head) ==== 69+0+0 (crc 0 0 0) 0x55f20181da00 con 0x55f201768000 ... 2024-08-01T15:17:04.483+0000 7ff13d600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- lock(a=lockack ifile 0x1.head) -- ?+0 0x55f20181ca00 2024-08-01T15:17:04.483+0000 7ff13d600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- lock(a=lockack ifile 0x1.head) -- 0x55f20181ca00 con 0x55f201768000 ... 2024-08-01T15:17:04.483+0000 7ff141200640 2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f20181ca00 seq=33 lock(a=lockack ifile 0x1.head) 2024-08-01T15:17:04.483+0000 7ff133600640 10 quiesce.mds.1 <operator()> sending ack q-map[v:(14:1) roots:0/0] to the leader 4375 2024-08-01T15:17:04.483+0000 7ff133600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- -- ?+0 0x55f2017cc480 2024-08-01T15:17:04.483+0000 7ff133600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- -- 0x55f2017cc480 con 0x55f201768000 2024-08-01T15:17:04.483+0000 7ff133600640 20 quiesce.agt <agent_thread_main> old = (0:0), current = (14:1) 2024-08-01T15:17:04.483+0000 7ff133600640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(14:1) roots:1/0] 2024-08-01T15:17:04.483+0000 7ff133600640 10 quiesce.mds.1 <operator()> sending ack q-map[v:(14:1) roots:1/0] to the leader 4375 2024-08-01T15:17:04.483+0000 7ff141200640 2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cc480 seq=34 2024-08-01T15:17:04.483+0000 7ff133600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- -- ?+0 0x55f2017cd380 2024-08-01T15:17:04.483+0000 7ff133600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- -- 0x55f2017cd380 con 0x55f201768000 2024-08-01T15:17:04.483+0000 7ff141200640 2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cd380 seq=35 2024-08-01T15:17:04.483+0000 7ff141200640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 msgr2=0x55f2017e0680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38 2024-08-01T15:17:04.483+0000 7ff141200640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 msgr2=0x55f2017e0680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed 2024-08-01T15:17:04.483+0000 7ff141200640 1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 2024-08-01T15:17:04.484+0000 7ff141200640 1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 unknown :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect 2024-08-01T15:17:04.484+0000 7ff141200640 1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 unknown :-1 s=BANNER_CONNECTING pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0 ... 2024-08-01T15:17:04.486+0000 7ff141200640 1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=mds.0 client_cookie=3f8b6a2bb5a1f933 server_cookie=2128a6c96319b669 in_seq=33 out_seq=34 2024-08-01T15:17:04.486+0000 7ff141200640 2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cd380 seq=35 2024-08-01T15:17:04.486+0000 7ff13d600640 1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] <== mds.0 v2:172.21.15.94:6835/3679291363 34 ==== lock(a=sync ifile 0x1.head) ==== 283+0+0 (crc 0 0 0) 0x55f2015a3400 con 0x55f201768000
From: /teuthology/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711/remote/smithi094/log/ceph-mds.b.log.gz
It seems the payload got zero'd. I'm looking into this more.
Updated by Patrick Donnelly over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 59176
- Component(FS) quiesce added
Updated by Patrick Donnelly over 1 year ago
- Has duplicate Enhancement #66107: QuiesceDB: improve resiliency to lost agent acks added
Updated by Patrick Donnelly over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Upkeep Bot over 1 year ago
- Copied to Backport #67745: squid: tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed added
Updated by Patrick Donnelly over 1 year ago
- Has duplicate Bug #66650: mds: quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2] added
Updated by Upkeep Bot 8 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp set to 2025-07-08T18:13:25+00:00
Updated by Upkeep Bot 8 months ago
- Merge Commit set to 48cb2d6e7bd2254f71dcc328570be85e2b7ee019
- Fixed In set to v19.3.0-4533-g48cb2d6e7b
- Upkeep Timestamp changed from 2025-07-08T18:13:25+00:00 to 2025-08-02T04:50:26+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~2151
- Upkeep Timestamp changed from 2025-08-02T04:50:26+00:00 to 2025-11-01T01:35:52+00:00
Actions