Bug #65607
mds: deadlock between 'lookup' and the 'rename/create, etc' requests
Status: Closed
Description
This was reported by Erich; for more detail, see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html
The lookup request was stuck waiting for a wrlock:
{
    "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
    "initiated_at": "2024-04-18T12:44:36.548789+0000",
    "age": 28621.412714391001,
    "duration": 28621.412753429999,
    "type_data": {
        "flag_point": "failed to wrlock, waiting",
        "reqid": {
            "entity": {
                "type": "client",
                "num": 139490
            },
            "tid": 14003440
        },
        "op_type": "client_request",
        "events": [
            {
                "time": "2024-04-18T12:44:36.548789+0000",
                "event": "initiated"
            },
            {
                "time": "2024-04-18T12:44:36.548790+0000",
                "event": "throttled"
            },
            {
                "time": "2024-04-18T12:44:36.548789+0000",
                "event": "header_read"
            },
            {
                "time": "2024-04-18T12:44:36.548794+0000",
                "event": "all_read"
            },
            {
                "time": "2024-04-18T12:44:36.551501+0000",
                "event": "dispatched"
            },
            {
                "time": "2024-04-18T12:44:36.555540+0000",
                "event": "acquired locks"
            },
            {
                "time": "2024-04-18T12:44:36.555589+0000",
                "event": "failed to rdlock, waiting"
            },
            {
                "time": "2024-04-18T12:44:36.558280+0000",
                "event": "failed to wrlock, waiting"
            }
        ],
        "locks": null
    }
},
From the above events we can see that the lookup request did successfully acquire its locks in path_traverse(). Since handle_client_getattr() itself does not need to acquire any wrlock, after the locks were acquired in path_traverse() they must have been dropped, the lookup request retried, and the retry entered path_traverse() again and got stuck there.
The rename request, meanwhile, also went into path_traverse() and successfully acquired the filelock and nestlock, but failed to acquire the xlock on dn->lock:
  lov.add_wrlock(&cur->filelock);
  lov.add_wrlock(&cur->nestlock);
  if (rdlock_authlock)
    lov.add_rdlock(&cur->authlock);
}
lov.add_xlock(&dn->lock);
{
    "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
    "initiated_at": "2024-04-18T12:44:36.552941+0000",
    "age": 28621.408561975,
    "duration": 28621.408643424002,
    "type_data": {
        "flag_point": "failed to xlock, waiting",
        "reqid": {
            "entity": {
                "type": "mds",
                "num": 1
            },
            "tid": 239710
        },
        "op_type": "client_request",
        "events": [
            {
                "time": "2024-04-18T12:44:36.552941+0000",
                "event": "initiated"
            },
            {
                "time": "2024-04-18T12:44:36.552941+0000",
                "event": "throttled"
            },
            {
                "time": "2024-04-18T12:44:36.552941+0000",
                "event": "header_read"
            },
            {
                "time": "2024-04-18T12:44:36.552941+0000",
                "event": "all_read"
            },
            {
                "time": "2024-04-18T12:44:36.554035+0000",
                "event": "dispatched"
            },
            {
                "time": "2024-04-18T12:44:36.556007+0000",
                "event": "failed to wrlock, waiting"
            },
            {
                "time": "2024-04-18T12:44:36.557760+0000",
                "event": "failed to xlock, waiting"
            }
        ],
        "locks": null
    }
},
The "failed to xlock, waiting" event must have come from path_traverse(), not from handle_client_rename(): if it had come from the latter, an "acquired locks" event, which is emitted in path_traverse(), would have appeared before it.
So the earlier lookup request must be waiting on either the filelock or the nestlock.
But I still couldn't figure out exactly what is holding the dn->lock: the lookup request, or something else.
Updated by Xiubo Li almost 2 years ago
As Erich mentioned, he has multiple active MDSs enabled, but he only collected the blocked ops from one MDS. Perhaps another MDS was holding the dn->lock?
Updated by Xiubo Li almost 2 years ago
- Status changed from In Progress to Need More Info
I have suggested that Erich set max_mds = 1 and try to reproduce it, to get rid of the noise.
Updated by Xiubo Li almost 2 years ago
This is possibly caused by the lock-order issue described in https://tracker.ceph.com/issues/62123.
Updated by Xiubo Li over 1 year ago
The client.233090:699937 create request successfully acquired ifile(wrlock) --> inest(wrlock) --> iauth(rdlock) --> dn(xlock) --> dversion(wrlock):
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.server dispatch_client_request client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.233090:699937 nref=2 cr=0x55616e2dc300) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker try_rdlock_snap_layout request(client.233090:699937 nref=3 cr=0x55616e2dc300) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) hit -> (target.dat,head)
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.233090:699937 nref=3 cr=0x55616e2dc300)
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker must wrlock (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker must wrlock (inest lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker must rdlock (iauth sync r=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker must xlock (dn lock) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker must wrlock (dversion lock) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker must authpin [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker must authpin [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker already auth_pinned [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker already auth_pinned [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker already wrlocked (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker already wrlocked (inest lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker already rdlocked (iauth sync r=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.locker xlock_start on (dn lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.locker simple_xlock on (dn lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=1 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache.den(0x100244d7fcb target.dat) auth_pin by 0x5583df4dafe0 on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00] now 7
2024-06-21T22:47:06.173+0000 7f76d8e28700 15 mds.0.cache.dir(0x100244d7fcb) adjust_nested_auth_pins 1 on [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth v=2184 cv=2127/2127 ap=0+7 state=1610612737|complete f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000) hs=0+1,ss=0+0 dirty=1 | child=1 dirty=1 authpin=0 0x555c5019ad00] by 0x5583df4dafe0 count now 0/7
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker got xlock on (dn xlock x=1 by 0x55814520fa80) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 7 mds.0.locker local_wrlock_start on (dversion lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker got wrlock on (dversion lock w=1 last_client=233090) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache traverse: null+readable dentry at [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache path_traverse finish on snapid head
...
But it never released the ifile(wrlock) when the request finished:
Early reply:
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server journal_and_reply tracei 0x555e4c9d2c00 tracedn 0x5583df4daf00
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.locker set_xlocks_done on (dn xlock x=1 by 0x55814520fa80) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) pv=2185 v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist snapid head
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1000866c4c9 seq 1 lc 0 cr 1 cps 2 snaps={} last_modified 0.000000 change_attr 0 0x555c51c078c0) len=96
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.ino(0x100244d7fcb) pfile 0 pauth 0 plink 0 pxattr 0 plocal 1 mtime 2024-06-21T22:47:06.129071+0000 ctime 2024-06-21T22:47:06.129071+0000 change_attr 887 valid=1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) encode_inodestat issuing pAsLsXs seq 26
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) encode_inodestat caps pAsLsXs seq 26 mseq 0 xattrv 0
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added diri [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 pv2197 ap=7 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock w=1 last_client=233090) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@26,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added dir [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth pv=2186 v=2184 cv=2127/2127 ap=2+7 state=1610612737|complete f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000) hs=0+1,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019ad00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.den(0x100244d7fcb target.dat) add_client_lease client.233090 on (dn xlockdone x=1)
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.locker issue_client_lease seq 14 dur 30000ms on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=233090) pv=2185 v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.ino(0x10020b95064) pfile 1 pauth 1 plink 1 pxattr 1 plocal 0 mtime 2024-06-21T22:47:06.129071+0000 ctime 2024-06-21T22:47:06.129071+0000 change_attr 0 valid=1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) encode_inodestat issuing pAsxLsXsxFsxcrwb seq 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) encode_inodestat caps pAsxLsXsxFsxcrwb seq 1 mseq 0 xattrv 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) including xattrs version 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added snap head in [inode 0x10020b95064 [2,head] /clusteradmin/weiler/deph-test/trash/target.dat auth v2185 s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={233090=0-4194304@1} caps={233090=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@1},l=233090 | request=1 caps=1 0x555e4c9d2c00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.260398 send_message_client client.233090 v1:10.50.3.4:0/1578958282 client_reply(???:699937 = 0 (0) Success unsafe) v1
2024-06-21T22:47:06.174+0000 7f76d8e28700 1 -- [v2:10.50.1.76:6816/378751528,v1:10.50.1.76:6817/378751528] --> v1:10.50.3.4:0/1578958282 -- client_reply(???:699937 = 0 (0) Success unsafe) v1 -- 0x55859c6a2e00 con 0x555c4a8da400
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server lat 0.042211
Safe reply:
2024-06-21T22:47:06.177+0000 7f76d2e1c700 7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.177+0000 7f76d2e1c700 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x10020b95064
2024-06-21T22:47:06.177+0000 7f76d2e1c700 20 mds.0.sessionmap mark_dirty s=0x555c4a99aa00 name=client.233090 v=612904046
2024-06-21T22:47:06.177+0000 7f76d2e1c700 7 mds.0.locker local_wrlock_finish on (iversion lock w=1 last_client=233090) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2197 ap=7 f(v0 m2024-06-21T22:47:06.129071+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.129071+0000 2=1+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock w=1 last_client=233090) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@26,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.177+0000 7f76d2e1c700 7 mds.0.locker wrlock_finish on (inest lock w=1 dirty) on [inode 0x100244d7fca [...2,head] /clusteradmin/weiler/deph-test/ auth v374 f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:05.901071+0000 3=1+2) (isnap sync r=6) (inest lock w=1 dirty) (ipolicy sync r=6) (iversion lock) caps={139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,151360=pAsLsXsFs/-@2,233090=pAsLsXsFs/-@2,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@2,363400=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 authpin=0 0x555c4f45bb80]
2024-06-21T22:47:06.177+0000 7f76d2e1c700 10 mds.0.locker scatter_eval (inest lock dirty) on [inode 0x100244d7fca [...2,head] /clusteradmin/weiler/deph-test/ auth v374 f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:05.901071+0000 3=1+2) (isnap sync r=6) (inest lock dirty) (ipolicy sync r=6) (iversion lock) caps={139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,151360=pAsLsXsFs/-@2,233090=pAsLsXsFs/-@2,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@2,363400=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 authpin=0 0x555c4f45bb80]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.locker xlock_finish on (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l x=1) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.locker eval_gather (dn xlockdone l) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 7 mds.0.locker eval_gather finished gather on (dn xlockdone l) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.cache.den(0x100244d7fcb target.dat) auth_unpin by 0x5583df4dafe0 on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00] now 6
2024-06-21T22:47:06.178+0000 7f76d2e1c700 15 mds.0.cache.dir(0x100244d7fcb) adjust_nested_auth_pins -1 on [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth v=2186 cv=2127/2127 ap=2+6 state=1610874881|complete f(v0 m2024-06-21T22:47:06.129071+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.129071+0000 1=1+0) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019ad00] by 0x5583df4dafe0 count now 2/6
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-06-21T22:47:06.178+0000 7f76d2e1c700 7 mds.0.server dispatch_client_request client_request(client.236085:2157106 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.160318+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
...
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.cache.ino(0x10020b95064) add_waiter tag 2000000000040000 0x555d9250e440 !ambig 1 !frozen 1 !freezing 1
2024-06-21T22:47:06.180+0000 7f76d2e1c700 15 mds.0.cache.ino(0x10020b95064) taking waiter here
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.locker nudge_log (iauth excl->sync) on [inode 0x10020b95064 [2,head] /clusteradmin/weiler/deph-test/trash/target.dat auth v2185 ap=6 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl->sync) (ifile excl) (ixattr excl) (iversion lock) cr={233090=0-4194304@1} caps={233090=pAsLsXsxFsxcrwb/pAsxLsXsxFsxcrwb/pAsxXsxFxwb@3},l=233090 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x555e4c9d2c00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.locker simple_eval (dn sync l r=5) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l r=5) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700 7 mds.0.locker local_wrlock_finish on (dversion lock w=1 last_client=233090) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l r=5) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.260398 send_message_client client.233090 v1:10.50.3.4:0/1578958282 client_reply(???:699937 = 0 (0) Success safe) v1
2024-06-21T22:47:06.180+0000 7f76d2e1c700 1 -- [v2:10.50.1.76:6816/378751528,v1:10.50.1.76:6817/378751528] --> v1:10.50.3.4:0/1578958282 -- client_reply(???:699937 = 0 (0) Success safe) v1 -- 0x556405160e00 con 0x555c4a8da400
2024-06-21T22:47:06.180+0000 7f76d2e1c700 7 mds.0.cache request_finish request(client.233090:699937 nref=2 cr=0x55616e2dc300)
2024-06-21T22:47:06.180+0000 7f76d2e1c700 15 mds.0.cache request_cleanup request(client.233090:699937 nref=2 cr=0x55616e2dc300)
It is also strange that the safe reply was split into two parts, and in between, 5 other client requests were dispatched.
Next we need to figure out why the ifile wrlock wasn't released.
Updated by Xiubo Li over 1 year ago
2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=1) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=2 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (isnap sync r=2) (inest lock) (ipolicy sync r=2) (ifile lock w=1) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.363400:23748267 create --> lock cache
2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=2) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=2 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=1) (isnap sync r=2) (inest lock w=1) (ipolicy sync r=2) (ifile lock w=2) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.233090:699958 create --> Okay
2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=3 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=2) (isnap sync r=3) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=3) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.820230:5988257 create --> Okay
2024-06-21T22:47:06.611+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=4) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=4 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=3) (isnap sync r=4) (inest lock w=3) (ipolicy sync r=4) (ifile lock w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.236085:2157114 create --> Okay
2024-06-21T22:47:06.616+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=5) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 pv2257 ap=6 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=4) (isnap sync r=5) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=5) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.139550:25098228 lookup --> Okay
2024-06-21T22:47:06.621+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2257 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=4) (isnap sync r=6) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=6) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.361375:28642315 lookup --> Okay
2024-06-21T22:47:06.637+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=7) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2257 ap=7 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=4) (isnap sync r=7) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=7) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.363400:23748271 unlink --> lock cache
2024-06-21T22:47:06.649+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2261 pv2263 ap=4 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=2) (ifile lock w=3) (iversion lock w=1) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@7} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.361375:28642317 create --> Okay
2024-06-21T22:47:06.650+0000 7f76d8e28700 10 mds.0.locker got wrlock on (ifile lock w=4) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 ap=4 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=4) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@7} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
===> client.148495:5294062 create --> Okay
2024-06-21T22:47:06.673+0000 7f76d8e28700 7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.673+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.673+0000 7f76d8e28700 7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.673+0000 7f76d8e28700 10 mds.0.locker found lock cache for create on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=5 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=4) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.673+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.673+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.673+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) hit -> (target.dat,head)
===> client.363400:23748273 create --> found the lock cache created by the "client.363400:23748267 create".
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.server dispatch_client_request client_request(client.363400:23748273 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.665679+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.685+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) hit -> (target.dat,head)
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.363400:23748273 nref=3 cr=0x555e0261f500)
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
Invalidate Lock Caches:
2024-06-21T22:47:06.674+0000 7f76d8e28700 7 mds.0.locker simple_sync on (ifile lock w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=5) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.674+0000 7f76d8e28700 10 mds.0.locker invalidate_lock_caches (ifile lock->sync w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=5) (inest lock w=2) (ipolicy sync r=3) (ifile lock->sync w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
Drop Cache 1:
2024-06-21T22:47:06.680+0000 7f76d8e28700 10 MDSContext::complete: 15C_MDL_DropCache
2024-06-21T22:47:06.680+0000 7f76d8e28700 7 mds.0.locker rdlock_finish on (isnap sync r=4685) on [inode 0x1 [...2,head] / auth{1=6} v7704853 snaprealm=0x555c4a97e900 f(v0 m2024-03-13T23:28:02.843604+0000 24=0+24) n(v976219 rc2036-12-08T15:08:49.000000+0000 b679352367182001 88277252=82000408+6276844)/n(v0 rc2024-03-01T19:27:05.353098+0000 1=0+1) (isnap sync r=4685) (inest mix dirty) (ipolicy sync r=4640) (iversion lock) caps={111580=pAsLsXsFs/-@1,139530=pAsLsXsFs/-@1,139535=pAsLsXsFs/-@1,139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,148515=pAsLsXsFs/-@1,151360=pAsLsXsFs/pAsLsXsFsx@97,233090=pAsLsXsFs/-@1,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@1,361455=pAsLsXsFs/-@2,361460=pAsLsXsFs/-@1,361470=pAsLsXsFs/-@1,361475=pAsLsXsFs/-@1,361490=pAsLsXsFs/-@1,361500=pAsLsXsFs/-@1,362295=pAsLsXsFs/-@1,363400=pAsLsXsFs/-@3,379420=pAsLsXsFs/-@0,458255=pAsLsXsFs/pAsLsXsFs@3,820230=pAsLsXsFs/-@2,1365120=pAsLsXsFs/-@2,1365125=pAsLsXsFs/-@1,1365140=pAsLsXsFs/-@1,1366055=pAsLsXsFs/-@2,1366185=pAsLsXsFs/-@1,1863515=pAsLsXsFs/-@3} mcw={1=pAsLsXsFsxcral} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 replicated=1 dirty=1 authpin=0 0x555c4a976680]
2024-06-21T22:47:06.680+0000 7f76d8e28700 7 mds.0.locker wrlock_finish on (ifile lock->sync w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=3) (isnap sync r=7) (inest lock w=4) (ipolicy sync r=3) (ifile lock->sync w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
Drop Cache 2:
Noop
It seems the wrlock reference taken by the "===> client.363400:23748271 unlink --> lock cache" request was never released. Maybe the CACHED flag races between the lock caches.
Will continue next week.
Updated by Xiubo Li over 1 year ago
This appears to be a deadlock between the lock cache and the create/unlink operations:
From the logs, when the client.363400:23748273 create request was retried for the last time, we can see that it tries to acquire must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb and must rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat:
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.server dispatch_client_request client_request(client.363400:23748273 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.665679+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.685+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) hit -> (target.dat,head)
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.363400:23748273 nref=3 cr=0x555e0261f500)
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker must rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker must authpin [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker must authpin [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker auth_pinning [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) auth_pin by 0x55865d960400 on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600] now 8
2024-06-21T22:47:06.685+0000 7f76d8e28700 15 mds.0.cache.dir(0x100244d7fca) adjust_nested_auth_pins 1 on [dir 0x100244d7fca /clusteradmin/weiler/deph-test/ [2,head] auth pv=2268 v=2266 cv=2138/2138 ap=1+8 state=1610874881|complete f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:06.625679+0000 1=0+1)/n(v9 rc2024-06-21T22:47:05.901071+0000 2=1+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019a880] by 0x555c4f45b600 count now 1/8
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker already auth_pinned [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker wrlock_start (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 7 mds.0.locker wrlock_start waiting on (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) add_waiter tag 2000000040000000 0x55848644e580 !ambig 1 !frozen 1 !freezing 1
2024-06-21T22:47:06.685+0000 7f76d8e28700 15 mds.0.cache.ino(0x100244d7fcb) taking waiter here
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker nudge_log (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache traverse: failed to rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
From the path_traverse() code we can confirm that this request had acquired the lock cache and kept holding it:
8268 int MDCache::path_traverse(MDRequestRef& mdr, MDSContextFactory& cf,
8269 const filepath& path, int flags,
8270 vector<CDentry*> *pdnvec, CInode **pin)
8271 {
8272 bool discover = (flags & MDS_TRAVERSE_DISCOVER);
8273 bool forward = !discover;
8274 bool path_locked = (flags & MDS_TRAVERSE_PATH_LOCKED);
8275 bool want_dentry = (flags & MDS_TRAVERSE_WANT_DENTRY);
8276 bool want_inode = (flags & MDS_TRAVERSE_WANT_INODE);
8277 bool want_auth = (flags & MDS_TRAVERSE_WANT_AUTH);
8278 bool rdlock_snap = (flags & (MDS_TRAVERSE_RDLOCK_SNAP | MDS_TRAVERSE_RDLOCK_SNAP2));
8279 bool rdlock_path = (flags & MDS_TRAVERSE_RDLOCK_PATH);
8280 bool xlock_dentry = (flags & MDS_TRAVERSE_XLOCK_DENTRY);
8281 bool rdlock_authlock = (flags & MDS_TRAVERSE_RDLOCK_AUTHLOCK);
...
8445 if (rdlock_path) {
8446 lov.clear();
8447 // do not xlock the tail dentry if target inode exists and caller wants it
8448 if (xlock_dentry && (dnl->is_null() || !want_inode) &&
8449 depth == path.depth() - 1) {
8450 ceph_assert(dn->is_auth());
8451 if (depth > 0 || !mdr->lock_cache) {
8452 lov.add_wrlock(&cur->filelock);
8453 lov.add_wrlock(&cur->nestlock);
8454 if (rdlock_authlock)
8455 lov.add_rdlock(&cur->authlock);
8456 }
8457 lov.add_xlock(&dn->lock);
8458 } else {
8459 // force client to flush async dir operation if necessary
8460 if (cur->filelock.is_cached())
8461 lov.add_wrlock(&cur->filelock);
8462 lov.add_rdlock(&dn->lock);
8463 }
8464 if (!mds->locker->acquire_locks(mdr, lov)) {
8465 dout(10) << "traverse: failed to rdlock " << dn->lock << " " << *dn << dendl;
8466 return 1;
8467 }
...
Then, when it tried to take the wrlock on (ifile lock->sync w=3) [inode 0x100244d7fcb, it failed and waited for all the lock caches to be released, but this request itself was holding one reference to the create lock cache.
Hence the deadlock.
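The cycle above can be sketched with a tiny, purely illustrative Python model (the class and method names are hypothetical, not the real Ceph C++ types): the create request pins the lock cache, another request starts the ifile lock->sync transition, which cannot complete until every lock-cache reference is dropped, and the pinned request then waits for a wrlock that can only be granted after that transition completes.

```python
# Hypothetical minimal model of the lock-cache self-deadlock described above.
# Names (LockCache, ScatterLock, start_sync, try_wrlock) are illustrative only.

class LockCache:
    """A cached set of locks; invalidation completes only when refs drop to 0."""
    def __init__(self):
        self.refs = 0
        self.invalidating = False

class ScatterLock:
    """Models ifile: a lock->sync transition must first invalidate lock caches."""
    def __init__(self, cache):
        self.cache = cache
        self.transitioning = False

    def start_sync(self):
        # simple_sync(): begin lock->sync and ask the lock caches to go away
        self.transitioning = True
        self.cache.invalidating = True

    def try_wrlock(self):
        # wrlock_start(): a wrlock cannot be granted while lock->sync is
        # pending, and the transition cannot finish until cache.refs == 0
        if self.transitioning and self.cache.refs > 0:
            return False        # "wrlock_start waiting" -- the request parks
        return True

cache = LockCache()
lock = ScatterLock(cache)

# The create request reuses the unlink's lock cache in path_traverse() ...
cache.refs += 1                 # the request now pins the cache
lock.start_sync()               # another request forces ifile lock->sync

# ... then on retry the same request needs a wrlock on ifile:
assert lock.try_wrlock() is False   # waits forever: it holds the ref itself
```

The model only shows why waiting while pinning can never make progress: the wrlock waits on the sync transition, the transition waits on the cache refs, and one of those refs is held by the waiter itself. Breaking the cycle requires the request to drop (or never take) the lock-cache reference on this path.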
Updated by Xiubo Li over 1 year ago
- Status changed from Need More Info to In Progress
Updated by Xiubo Li over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 58474
Updated by sunnat samad over 1 year ago
I was able to find another way to reproduce the issue on demand with the following script:
import os
import time
import errno

class FileLockException(Exception):
    pass

class FileLock(object):
    def __init__(self, file_name, timeout=10, delay=.05):
        if timeout is not None and delay is None:
            raise ValueError("If timeout is not None, then delay must not be None.")
        self.is_locked = False
        self.lockfile = os.path.join(os.getcwd(), "%s.lock" % file_name)
        self.file_name = file_name
        self.timeout = timeout
        self.delay = delay

    def acquire(self):
        start_time = time.time()
        while True:
            try:
                # O_CREAT|O_EXCL makes the open atomic: it fails with EEXIST
                # while another process holds the lock file
                self.fd = os.open(self.lockfile, os.O_CREAT | os.O_EXCL | os.O_RDWR)
                self.is_locked = True
                break
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                if self.timeout is None:
                    raise FileLockException("Could not acquire lock on {}".format(self.file_name))
                if (time.time() - start_time) >= self.timeout:
                    raise FileLockException("Timeout occurred.")
                time.sleep(self.delay)

    def release(self):
        if self.is_locked:
            os.close(self.fd)
            os.unlink(self.lockfile)
            self.is_locked = False

    def __enter__(self):
        if not self.is_locked:
            self.acquire()
        return self

    def __exit__(self, type, value, traceback):
        if self.is_locked:
            self.release()

    def __del__(self):
        self.release()

def main():
    count = 0
    while True:
        with FileLock('.lock'):
            print(f'Doing stuff {count}')
            count += 1

main()
But likely any active file-locking scheme would trigger it. On 18.2.4 this softlocks the MDS within minutes. I was also able to narrow the search down to the Ceph releases where the issue does and does not occur:
16.2.13 (does not have the issue)
17.2.6 (does not have the issue)
16.2.14 (has the issue)
17.2.7 (has the issue)
18.x (has the issue)
This pointed me to the https://github.com/ceph/ceph/pull/51609 commit. I have now tested with it reverted, and the MDS no longer softlocks. So something in that commit causes metadata operations in the MDS to softlock permanently, and the only way to recover is to reboot the MDS server. In the worst case, if the softlock has been present for a long time, journal trimming may not have run for a long time and the journal can grow to multiple TB; after that the MDS cannot start, because the code first needs to load the whole journal into memory before processing it, so recovery requires being able to load multiple TB of data into MDS memory. That commit tried to fix some kernel panic cases, so simply reverting it may not be the best option, though even in our case it might be better than the current situation: we never saw the issue it fixed, but we constantly see the side effect it caused.
Updated by Venky Shankar about 1 year ago
- Category set to Correctness/Safety
- Target version set to v20.0.0
- Source set to Community (user)
- Backport changed from quincy,reef,squid to reef,squid
Updated by sunnat samad about 1 year ago
Can someone assign this to me? I am working on this here.
https://github.com/ceph/ceph/pull/61250
Updated by Venky Shankar about 1 year ago
- Status changed from Fix Under Review to Pending Backport
Sunnat sam wrote in #note-13:
Can someone assign this to me? I am working on this here.
https://github.com/ceph/ceph/pull/61250
Can't see your id in the "Assignee" list. Could you get that sorted?
Updated by Upkeep Bot about 1 year ago
- Copied to Backport #69682: reef: mds deadlock between 'lookup' and the 'rename/create, etc' requests added
Updated by Upkeep Bot about 1 year ago
- Copied to Backport #69683: squid: mds deadlock between 'lookup' and the 'rename/create, etc' requests added
Updated by Upkeep Bot about 1 year ago
- Tags (freeform) set to backport_processed
Updated by Igor Golikov about 1 year ago
- Assignee changed from Xiubo Li to Igor Golikov
Updated by Igor Golikov about 1 year ago
- Assignee changed from Igor Golikov to Alex Markuze
Updated by Venky Shankar about 1 year ago
- Assignee changed from Alex Markuze to Venky Shankar
- Pull request ID changed from 58474 to 61250
Updated by Upkeep Bot 8 months ago
- Merge Commit set to b6a0e49d4db0bfb5de866caea0e28eb3dd7b49c2
- Fixed In set to v19.3.0-7097-gb6a0e49d4db
- Upkeep Timestamp set to 2025-07-09T13:46:28+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v19.3.0-7097-gb6a0e49d4db to v19.3.0-7097-gb6a0e49d4d
- Upkeep Timestamp changed from 2025-07-09T13:46:28+00:00 to 2025-07-14T17:11:35+00:00
Updated by Upkeep Bot 6 months ago
- Status changed from Pending Backport to Resolved
- Upkeep Timestamp changed from 2025-07-14T17:11:35+00:00 to 2025-09-16T12:40:38+00:00
Updated by Upkeep Bot 5 months ago
- Released In set to v20.2.0~1275
- Upkeep Timestamp changed from 2025-09-16T12:40:38+00:00 to 2025-11-01T01:38:23+00:00