Bug #67406


tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed

Added by Venky Shankar over 1 year ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Category:
Testing
Target version:
% Done:

0%

Source:
Q/A
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
quiesce
Labels (FS):
qa, qa-failure
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v19.3.0-4533-g48cb2d6e7b
Released In:
v20.2.0~2151
Upkeep Timestamp:
2025-11-01T01:35:52+00:00

Description

/a/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711

  description: fs/thrash/workloads/{begin/{0-install 1-ceph 2-logrotate 3-modules} clusters/1a5s-mds-1c-client
    conf/{client mds mgr mon osd} distro/{centos_latest} mount/fuse msgr-failures/osd-mds-delay
    objectstore-ec/bluestore-comp-ec-root overrides/{client-shutdown frag ignorelist_health
    ignorelist_wrongly_marked_down pg_health prefetch_dirfrags/no prefetch_dirfrags/yes
    prefetch_entire_dirfrags/no prefetch_entire_dirfrags/yes races session_timeout thrashosds-health}
    ranks/3 tasks/{1-thrash/with-quiesce 2-workunit/suites/pjd}}

pjd.sh failed because the daemon watchdog unmounted the clients in response to the failed quiescer task:

2024-08-01T15:18:34.532 ERROR:tasks.quiescer.fs.[cephfs]:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1e006c2acbb487c272896d2c2951fa5f0332117f/qa/tasks/quiescer.py", line 357, in _run
    self.do_quiesce(d)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_1e006c2acbb487c272896d2c2951fa5f0332117f/qa/tasks/quiescer.py", line 313, in do_quiesce
    raise RuntimeError(f"Error quiescing set '{set_id}': {rcinfo(rc)}")
RuntimeError: Error quiescing set '8c1c8735': 110 (ETIMEDOUT)
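For context, the quiescer task fails the run by raising as soon as the quiesce set reports a nonzero rc; the watchdog then reacts to the dead task. A minimal sketch of that error path (the `rcinfo` helper and the `do_quiesce` check are paraphrased from the traceback above, not copied from qa/tasks/quiescer.py, so exact signatures may differ):

```python
import errno

def rcinfo(rc: int) -> str:
    """Format a quiesce return code the way the traceback shows it,
    e.g. on Linux rc 110 renders as '110 (ETIMEDOUT)'."""
    return f"{rc} ({errno.errorcode.get(rc, 'unknown')})"

def check_quiesce_rc(set_id: str, rc: int) -> None:
    # Sketch of the do_quiesce error path: any nonzero rc aborts the
    # quiescer task, which in turn trips the daemon watchdog.
    if rc != 0:
        raise RuntimeError(f"Error quiescing set '{set_id}': {rcinfo(rc)}")
```

Calling `check_quiesce_rc('8c1c8735', errno.ETIMEDOUT)` reproduces the message shape seen in the traceback.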

Related issues 3 (0 open, 3 closed)

Has duplicate: CephFS - Enhancement #66107: QuiesceDB: improve resiliency to lost agent acks (Duplicate) - Dhairya Parmar

Has duplicate: CephFS - Bug #66650: mds: quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2] (Duplicate) - Kotresh Hiremath Ravishankar

Copied to: CephFS - Backport #67745: squid: tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed (Resolved) - Patrick Donnelly
#1

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • ceph-qa-suite fs added
#2

Updated by Patrick Donnelly over 1 year ago

  • Description updated (diff)

Correcting the description (the wrong YAML set was pasted).

#3

Updated by Patrick Donnelly over 1 year ago

  • Status changed from Triaged to In Progress

socket failure:

2024-08-01T15:17:04.482+0000 7ffb65200640  2 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 crc :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x5623a678f400 seq=33 lock(a=lock ifile 0x1.head)
...
2024-08-01T15:17:04.483+0000 7ffb65200640  0 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 msgr2=0x5623a68ae100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure
2024-08-01T15:17:04.483+0000 7ffb65200640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 msgr2=0x5623a68ae100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2024-08-01T15:17:04.483+0000 7ffb65200640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 crc :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed
2024-08-01T15:17:04.484+0000 7ffb65200640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=13 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault with nothing to send, going to standby
...
2024-08-01T15:17:04.485+0000 7ffb60c00640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] send_to--> mds [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] -- lock(a=sync ifile 0x1.head) -- ?+0 0x5623a6665c00
2024-08-01T15:17:04.485+0000 7ffb60c00640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] -- lock(a=sync ifile 0x1.head) -- 0x5623a6665c00 con 0x5623a683c900
2024-08-01T15:17:04.485+0000 7ffb60c00640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] send_to--> mds [v2:172.21.15.53:6833/3227231508,v1:172.21.15.53:6836/3227231508] -- lock(a=sync ifile 0x1.head) -- ?+0 0x5623a6896200
2024-08-01T15:17:04.485+0000 7ffb60c00640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --> [v2:172.21.15.53:6833/3227231508,v1:172.21.15.53:6836/3227231508] -- lock(a=sync ifile 0x1.head) -- 0x5623a6896200 con 0x5623a684cd80
2024-08-01T15:17:04.485+0000 7ffb65200640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=STANDBY pgs=13 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).connect
...
2024-08-01T15:17:04.485+0000 7ffb64800640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683cd80 0x5623a6900000 crc :-1 s=SESSION_ACCEPTING pgs=23 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect reconnect race detected, replacing existing=0x5623a683c900 socket by this connection's socket
2024-08-01T15:17:04.485+0000 7ffb64800640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683cd80 0x5623a6900000 crc :-1 s=SESSION_ACCEPTING pgs=23 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect reconnect to existing=0x5623a683c900
...
2024-08-01T15:17:04.486+0000 7ffb60c00640 10 mds.f  new session 0x5623a681d400 for mds.0 v2:172.21.15.94:6834/2050407724 con 0x5623a683c900
2024-08-01T15:17:04.486+0000 7ffb64800640  1 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=23 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=mds.1 client_cookie=3f8b6a2bb5a1f933 server_cookie=2128a6c96319b669 in_seq=34 out_seq=33
2024-08-01T15:17:04.486+0000 7ffb60c00640  1 mds.f parse_caps: cannot decode auth caps buffer of length 0
2024-08-01T15:17:04.486+0000 7ffb60c00640 10 mds.f ms_handle_accept v2:172.21.15.94:6834/2050407724 con 0x5623a683c900 session 0x5623a681d400
2024-08-01T15:17:04.486+0000 7ffb64800640  2 --2- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] >> [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] conn(0x5623a683c900 0x5623a68ae100 unknown :-1 s=READY pgs=23 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x5623a6665c00 seq=34 lock(a=sync ifile 0x1.head)
...
2024-08-01T15:17:04.487+0000 7ffb60c00640  1 -- [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] <== mds.1 v2:172.21.15.94:6834/2050407724 35 ====  ==== 0+0+0 (unknown 0 0 0) 0x5623a68db200 con 0x5623a683c900
2024-08-01T15:17:04.487+0000 7ffb55800640 20 quiesce.mgr.4375 <quiesce_db_thread_main> db idle, age: 61.085994720s next_event_at_age: 106.079994202s
2024-08-01T15:17:04.487+0000 7ffb60c00640 -1 quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2]
2024-08-01T15:17:04.487+0000 7ffb60c00640 10 quiesce.mds.0 <quiesce_dispatch> dump:

2024-08-01T15:17:04.487+0000 7ffb60c00640  0 ms_deliver_dispatch: unhandled message 0x5623a68db200  from mds.1 v2:172.21.15.94:6834/2050407724

From: /teuthology/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711/remote/smithi094/log/ceph-mds.f.log.gz

Then, on the peer MDS:

2024-08-01T15:17:04.483+0000 7ff13d600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] <== mds.0 v2:172.21.15.94:6835/3679291363 33 ==== lock(a=lock ifile 0x1.head) ==== 69+0+0 (crc 0 0 0) 0x55f20181da00 con 0x55f201768000
...
2024-08-01T15:17:04.483+0000 7ff13d600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- lock(a=lockack ifile 0x1.head) -- ?+0 0x55f20181ca00
2024-08-01T15:17:04.483+0000 7ff13d600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] -- lock(a=lockack ifile 0x1.head) -- 0x55f20181ca00 con 0x55f201768000
...
2024-08-01T15:17:04.483+0000 7ff141200640  2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f20181ca00 seq=33 lock(a=lockack ifile 0x1.head)
2024-08-01T15:17:04.483+0000 7ff133600640 10 quiesce.mds.1 <operator()> sending ack q-map[v:(14:1) roots:0/0] to the leader 4375
2024-08-01T15:17:04.483+0000 7ff133600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --  -- ?+0 0x55f2017cc480
2024-08-01T15:17:04.483+0000 7ff133600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --  -- 0x55f2017cc480 con 0x55f201768000
2024-08-01T15:17:04.483+0000 7ff133600640 20 quiesce.agt <agent_thread_main> old = (0:0), current = (14:1)
2024-08-01T15:17:04.483+0000 7ff133600640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(14:1) roots:1/0]
2024-08-01T15:17:04.483+0000 7ff133600640 10 quiesce.mds.1 <operator()> sending ack q-map[v:(14:1) roots:1/0] to the leader 4375
2024-08-01T15:17:04.483+0000 7ff141200640  2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cc480 seq=34
2024-08-01T15:17:04.483+0000 7ff133600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] send_to--> mds [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --  -- ?+0 0x55f2017cd380
2024-08-01T15:17:04.483+0000 7ff133600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] --> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] --  -- 0x55f2017cd380 con 0x55f201768000
2024-08-01T15:17:04.483+0000 7ff141200640  2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cd380 seq=35
2024-08-01T15:17:04.483+0000 7ff141200640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 msgr2=0x55f2017e0680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38
2024-08-01T15:17:04.483+0000 7ff141200640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 msgr2=0x55f2017e0680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2024-08-01T15:17:04.483+0000 7ff141200640  1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-08-01T15:17:04.484+0000 7ff141200640  1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 unknown :-1 s=READY pgs=16 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect
2024-08-01T15:17:04.484+0000 7ff141200640  1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 unknown :-1 s=BANNER_CONNECTING pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
...
2024-08-01T15:17:04.486+0000 7ff141200640  1 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=mds.0 client_cookie=3f8b6a2bb5a1f933 server_cookie=2128a6c96319b669 in_seq=33 out_seq=34
2024-08-01T15:17:04.486+0000 7ff141200640  2 --2- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] >> [v2:172.21.15.94:6835/3679291363,v1:172.21.15.94:6838/3679291363] conn(0x55f201768000 0x55f2017e0680 crc :-1 s=READY pgs=16 cs=1 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message sending message m=0x55f2017cd380 seq=35
2024-08-01T15:17:04.486+0000 7ff13d600640  1 -- [v2:172.21.15.94:6834/2050407724,v1:172.21.15.94:6837/2050407724] <== mds.0 v2:172.21.15.94:6835/3679291363 34 ==== lock(a=sync ifile 0x1.head) ==== 283+0+0 (crc 0 0 0) 0x55f2015a3400 con 0x55f201768000

From: /teuthology/vshankar-2024-08-01_09:47:36-fs-wip-vshankar-testing-20240801.065302-debug-testing-default-smithi/7829711/remote/smithi094/log/ceph-mds.b.log.gz

It seems the payload got zeroed. I'm looking into this further.
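The "End of buffer" failure is consistent with a zeroed payload: the receive line in the mds.f log shows the type-1286 message arriving with `0+0+0` bytes, so the decoder runs out of data at the very first field. A toy illustration of why an empty payload produces exactly this class of error (`decode_u32` and `EndOfBuffer` are hypothetical stand-ins for Ceph's bounds-checked C++ decode, not the real API):

```python
import struct

class EndOfBuffer(Exception):
    """Stand-in for the buffer::end_of_buffer error Ceph logs as 'End of buffer'."""

def decode_u32(buf: bytes, off: int = 0) -> tuple[int, int]:
    # Bounds-checked decode: refuse to read past the end of the payload
    # instead of returning garbage; a 0-byte payload fails immediately.
    if off + 4 > len(buf):
        raise EndOfBuffer("End of buffer")
    (val,) = struct.unpack_from("<I", buf, off)
    return val, off + 4
```

With this sketch, `decode_u32(b"")` raises `EndOfBuffer` on the first field, matching the behavior seen when the quiesce message body was empty.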

#4

Updated by Patrick Donnelly over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 59176
  • Component(FS) quiesce added
#5

Updated by Patrick Donnelly over 1 year ago

  • Has duplicate Enhancement #66107: QuiesceDB: improve resiliency to lost agent acks added
#6

Updated by Patrick Donnelly over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
#7

Updated by Upkeep Bot over 1 year ago

  • Copied to Backport #67745: squid: tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed added
#8

Updated by Upkeep Bot over 1 year ago

  • Tags (freeform) set to backport_processed
#9

Updated by Patrick Donnelly over 1 year ago

  • Has duplicate Bug #66650: mds: quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2] added
#10

Updated by Upkeep Bot 8 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp set to 2025-07-08T18:13:25+00:00
#11

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 48cb2d6e7bd2254f71dcc328570be85e2b7ee019
  • Fixed In set to v19.3.0-4533-g48cb2d6e7b
  • Upkeep Timestamp changed from 2025-07-08T18:13:25+00:00 to 2025-08-02T04:50:26+00:00
#12

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~2151
  • Upkeep Timestamp changed from 2025-08-02T04:50:26+00:00 to 2025-11-01T01:35:52+00:00