Actions
Bug #66229
open[quiesce timeout] unable to xlock a number of files due to a laggy client
% Done:
0%
Source:
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
quiesce
Labels (FS):
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:
Description
[leonidus@vossi04 leonidus-2024-05-26_10:48:16-fs-wip-lusov-quiesce-distro-default-smithi]$ grep "ERROR:tasks.quiescer" */teuthology.log 7726344/teuthology.log:2024-05-26T12:51:16.035 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout: 7726344/teuthology.log:2024-05-26T12:51:16.035 ERROR:tasks.quiescer.fs.[cephfs]:exception:
[leonidus@vossi04 leonidus-2024-05-26_10:48:16-fs-wip-lusov-quiesce-distro-default-smithi]$ grep Outstanding 7726344/teuthology.log | head -2 2024-05-26T12:50:48.644 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 2 (g) for 420c46f1-1: 'internal op quiesce_inode:mds.2:111 fp=#0x20000000210' 2024-05-26T12:50:48.648 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 2 (g) for 420c46f1-1: 'internal op quiesce_inode:mds.2:115 fp=#0x2000000026d'
ยป cat ops-420c46f1-1-mds.g.json | jq '.ops | map(select(.type_data.op_name == "quiesce_inode" and (.type_data.flag_point | test("quiesce complete|quiescing children")|not)) | [.type_data.flag_point, .description, (.type_data.locks | map(.lock.type))] )
[
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:111 fp=#0x20000000210", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:115 fp=#0x2000000026d", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:116 fp=#0x20000000205", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:120 fp=#0x20000000209", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:122 fp=#0x20000000266", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:123 fp=#0x2000000021e", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:126 fp=#0x2000000020a", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:128 fp=#0x20000000201", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:102 fp=#0x2000000021b", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:134 fp=#0x2000000023f", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:135 fp=#0x20000000227", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:104 fp=#0x20000000207", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:136 fp=#0x20000000211", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:105 fp=#0x2000000022b", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:137 fp=#0x20000000260", ["iquiesce","ipolicy"]],
["failed to xlock, waiting", "internal op quiesce_inode:mds.2:109 fp=#0x20000000269", ["iquiesce","ipolicy"]]
]
There are two observations to be made:
1. It doesn't look like a deadlock, as there are no pending operations other than the quiesce ops
2. The quiesce ops are failing to xlock the file lock - the only other xlock quiesce takes is the quiesce, which all of the operations already hold along with the policy
This suggests that we must be waiting for a cap revoke. This is to be corroborated with the mds.g.log
Updated by Leonid Usov almost 2 years ago
Evidence from the logs about the client being laggy:
2024-05-26T12:50:41.997+0000 7f9062800700 20 mds.2.21 get_session have 0x5580dafcc800 client.14925 192.168.0.1:0/1626256772 state open 2024-05-26T12:50:41.997+0000 7f9062800700 3 mds.2.server handle_client_session client_session(request_renewcaps seq 4) from client.14925 2024-05-26T12:50:41.997+0000 7f9062800700 10 mds.2.sessionmap touch_session s=0x5580dafcc800 name=client.14925 2024-05-26T12:50:41.997+0000 7f9062800700 10 mds.2.21 send_message_client client.14925 192.168.0.1:0/1626256772 client_session(renewcaps seq 4) 2024-05-26T12:50:43.349+0000 7f9060a00700 20 mds.2.server laggiest active session is client.14925 192.168.0.1:0/1626256772 and renewed caps recently (1.35197s ago) 2024-05-26T12:50:43.349+0000 7f9060a00700 20 mds.2.locker caps_tick age = 58.319965 client.14925.0x2000000021b 2024-05-26T12:50:48.349+0000 7f9060a00700 20 mds.2.server laggiest active session is client.14925 192.168.0.1:0/1626256772 and renewed caps recently (6.35187s ago) 2024-05-26T12:50:48.349+0000 7f9060a00700 20 mds.2.locker caps_tick age = 63.320044 client.14925.0x2000000021b 2024-05-26T12:50:53.349+0000 7f9060a00700 20 mds.2.server laggiest active session is client.14925 192.168.0.1:0/1626256772 and renewed caps recently (11.3518s ago) 2024-05-26T12:50:53.349+0000 7f9060a00700 20 mds.2.locker caps_tick age = 68.320127 client.14925.0x2000000021b 2024-05-26T12:50:58.349+0000 7f9060a00700 20 mds.2.server laggiest active session is client.14925 192.168.0.1:0/1626256772 and renewed caps recently (16.3517s ago) 2024-05-26T12:50:58.349+0000 7f9060a00700 20 mds.2.locker caps_tick age = 73.320200 client.14925.0x2000000021b 2024-05-26T12:51:03.349+0000 7f9060a00700 20 mds.2.server laggiest active session is client.14925 192.168.0.1:0/1626256772 and renewed caps recently (21.3516s ago) 2024-05-26T12:51:03.349+0000 7f9060a00700 20 mds.2.locker caps_tick age = 78.320323 client.14925.0x2000000021b 2024-05-26T12:51:03.729+0000 7f9062800700 1 -- [v2:172.21.15.113:6838/1193729985,v1:172.21.15.113:6839/1193729985] <== client.14925 192.168.0.1:0/1626256772 1360 ==== client_caps(update ino 0x20000000205 28 seq 5 caps=pAsLsXsFc dirty=- wanted=- follows 0 size 104857600/0 mtime 2024-05-26T12:49:21.138274+0000 ctime 2024-05-26T12:49:21.138274+0000 change_attr 35840 xattrs(v=1451578575 l=0)) ==== 268+0+0 (secure 0 0 0) 0x5580db427500 con 0x5580db048c00 2024-05-26T12:51:03.729+0000 7f9062800700 20 mds.2.21 get_session have 0x5580dafcc800 client.14925 192.168.0.1:0/1626256772 state open 2024-05-26T12:51:03.729+0000 7f9062800700 20 mds.2.locker client.14925 pending pAsLsXsFc allowed pAsLsXsFscr wanted - 2024-05-26T12:51:03.729+0000 7f9062800700 20 mds.2.locker client.14925 pending pAsLsXsFc allowed pAsLsXsFcb wanted - 2024-05-26T12:51:04.011+0000 7f9062800700 1 -- [v2:172.21.15.113:6838/1193729985,v1:172.21.15.113:6839/1193729985] <== client.14925 192.168.0.1:0/1626256772 1361 ==== client_cap_release(1) ==== 32+0+0 (secure 0 0 0) 0x5580dae86000 con 0x5580db048c00
Updated by Leonid Usov almost 2 years ago
- Subject changed from [quiesce timeout] unable to xlock a number of files: presumably, waiting for the client to [quiesce timeout] unable to xlock a number of files due to a laggy client
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Dhairya Parmar
- Target version set to v20.0.0
- Backport set to squid
Actions