Bug #65607


mds deadlock between 'lookup' and the 'rename/create, etc' requests

Added by Xiubo Li almost 2 years ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Community (user)
Backport: reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v19.3.0-7097-gb6a0e49d4d
Released In: v20.2.0~1275
Upkeep Timestamp: 2025-11-01T01:38:23+00:00

Description

This was reported by Erich; for more detail please see https://www.mail-archive.com/ceph-users@ceph.io/msg24587.html

The lookup request was stuck waiting for a wrlock:

        {
            "description": "client_request(client.139490:14003440 lookup #0x2000285540f/family-720-cons-5.fa.nsq 2024-04-18T12:44:36.546296+0000 caller_uid=30101, caller_gid=600{600,622,})",
            "initiated_at": "2024-04-18T12:44:36.548789+0000",                                                                                                                                                                                                
            "age": 28621.412714391001,
            "duration": 28621.412753429999,
            "type_data": {
                "flag_point": "failed to wrlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "client",
                        "num": 139490
                    },
                    "tid": 14003440
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548790+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548789+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.548794+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.551501+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555540+0000",
                        "event": "acquired locks" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.555589+0000",
                        "event": "failed to rdlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.558280+0000",
                        "event": "failed to wrlock, waiting" 
                    }
                ],
                "locks": null
            }
        },                            

From the above events we can see that the lookup request successfully acquired its locks in path_traverse() (the "acquired locks" event). Since handle_client_getattr() itself acquires no wrlock, the locks must have been dropped afterwards; when the lookup request was retried it entered path_traverse() again and got stuck there.
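
This acquire-all-or-drop-and-retry behavior can be sketched with a toy model (purely illustrative; LockTable, try_acquire_all and the lock names here are hypothetical, not Ceph's Locker API):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of the MDS retry pattern: a request tries to take every lock
// it needs; if any lock is unavailable it drops all locks it already took
// and waits to be retried from scratch (path_traverse runs again then).
struct LockTable {
    std::unordered_map<std::string, bool> held;  // lock name -> taken?

    bool try_acquire_all(const std::vector<std::string>& wanted,
                         std::vector<std::string>& taken) {
        for (const auto& name : wanted) {
            if (held[name]) {                 // someone else holds it
                for (const auto& t : taken)   // drop everything we took
                    held[t] = false;
                taken.clear();
                return false;                 // request goes back to the queue
            }
            held[name] = true;
            taken.push_back(name);
        }
        return true;
    }
};
```

After a failed attempt nothing is left held, which is why the retried request has to start over from path_traverse().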

The rename request also went into path_traverse() and successfully acquired the filelock and nestlock, but failed to acquire the xlock on dn->lock:

            lov.add_wrlock(&cur->filelock);
            lov.add_wrlock(&cur->nestlock);
            if (rdlock_authlock)
              lov.add_rdlock(&cur->authlock);
          }
          lov.add_xlock(&dn->lock);
        {
            "description": "client_request(mds.1:239710 rename #0x2000285540f/family-720-cons-5.fa.nsq #0x611/20002860f27 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2024-04-18T12:44:36.552941+0000",
            "age": 28621.408561975,
            "duration": 28621.408643424002,
            "type_data": {
                "flag_point": "failed to xlock, waiting",
                "reqid": {
                    "entity": {
                        "type": "mds",
                        "num": 1
                    },
                    "tid": 239710
                },
                "op_type": "client_request",
                "events": [
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "initiated" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "throttled" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "header_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.552941+0000",
                        "event": "all_read" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.554035+0000",
                        "event": "dispatched" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.556007+0000",
                        "event": "failed to wrlock, waiting" 
                    },
                    {
                        "time": "2024-04-18T12:44:36.557760+0000",                                                                                                                                                                                            
                        "event": "failed to xlock, waiting" 
                    }
                ],
                "locks": null
            }
        },

The "failed to xlock, waiting" event must come from path_traverse(), not from handle_client_rename(): if it came from handle_client_rename() there would have to be an "acquired locks" event, which is emitted in path_traverse(), before it.

So the previous lookup request must be waiting on either the filelock or the nestlock.

But I still couldn't figure out exactly what is holding the dn->lock, whether it is the lookup request or something else.
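
If something on the lookup path were in fact holding the dn->lock while the rename path holds the filelock/nestlock that the lookup needs, the two requests would form a cycle in the wait-for graph, which is the textbook deadlock condition. A minimal, hypothetical cycle check (not Ceph code; Graph and has_deadlock are made up for illustration):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical wait-for graph: an edge A -> B means request A is waiting
// for a lock currently held by request B. Any cycle means deadlock.
using Graph = std::map<std::string, std::set<std::string>>;

static bool dfs(const Graph& g, const std::string& node,
                std::set<std::string>& stack) {
    if (stack.count(node))
        return true;                 // reached a node already on the path: cycle
    stack.insert(node);
    auto it = g.find(node);
    if (it != g.end())
        for (const auto& next : it->second)
            if (dfs(g, next, stack))
                return true;
    stack.erase(node);
    return false;
}

bool has_deadlock(const Graph& g) {
    for (const auto& kv : g) {
        std::set<std::string> stack;
        if (dfs(g, kv.first, stack))
            return true;
    }
    return false;
}
```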


Related issues: 2 (0 open, 2 closed)

Copied to CephFS - Backport #69682: reef: mds deadlock between 'lookup' and the 'rename/create, etc' requests (Resolved, Venky Shankar)
Copied to CephFS - Backport #69683: squid: mds deadlock between 'lookup' and the 'rename/create, etc' requests (Resolved, Venky Shankar)
#1

Updated by Xiubo Li almost 2 years ago

  • Description updated (diff)
#2

Updated by Xiubo Li almost 2 years ago

As Erich mentioned, he enabled multiple active MDSs, but he only uploaded the blocked ops from one MDS. I guess maybe another MDS was holding the dn->lock?

#3

Updated by Xiubo Li almost 2 years ago

  • Status changed from In Progress to Need More Info

I have suggested that Erich set max_mds = 1 when reproducing it, to get rid of the noise.

#4

Updated by Xiubo Li almost 2 years ago

This is possibly caused by the lock-order issue described in https://tracker.ceph.com/issues/62123.

#5

Updated by Xiubo Li over 1 year ago

The client.233090:699937 create request successfully acquired ifile(wrlock) --> inest(wrlock) --> iauth(rdlock) --> dn(xlock) --> dversion(wrlock):

2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.server dispatch_client_request client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.233090:699937 nref=2 cr=0x55616e2dc300) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker try_rdlock_snap_layout request(client.233090:699937 nref=3 cr=0x55616e2dc300) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb)   hit -> (target.dat,head)
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.233090:699937 nref=3 cr=0x55616e2dc300)
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker  must wrlock (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker  must wrlock (inest lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker  must rdlock (iauth sync r=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker  must xlock (dn lock) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 20 mds.0.locker  must wrlock (dversion lock) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  must authpin [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  must authpin [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  already auth_pinned [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  already auth_pinned [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  already wrlocked (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  already wrlocked (inest lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]                                                                                    
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  already rdlocked (iauth sync r=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 ap=6 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@25,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.locker xlock_start on (dn lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=0 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.locker simple_xlock on (dn lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=6 ino=(nil) state=1610612736 | request=1 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache.den(0x100244d7fcb target.dat) auth_pin by 0x5583df4dafe0 on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn lock) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=0 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00] now 7
2024-06-21T22:47:06.173+0000 7f76d8e28700 15 mds.0.cache.dir(0x100244d7fcb) adjust_nested_auth_pins 1 on [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth v=2184 cv=2127/2127 ap=0+7 state=1610612737|complete f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000) hs=0+1,ss=0+0 dirty=1 | child=1 dirty=1 authpin=0 0x555c5019ad00] by 0x5583df4dafe0 count now 0/7
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  got xlock on (dn xlock x=1 by 0x55814520fa80) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700  7 mds.0.locker local_wrlock_start  on (dversion lock) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (dversion lock w=1 last_client=233090) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache traverse: null+readable dentry at [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.173+0000 7f76d8e28700 10 mds.0.cache path_traverse finish on snapid head
...

But it never released the ifile(wrlock) when the request finished:

Early reply:

2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server journal_and_reply tracei 0x555e4c9d2c00 tracedn 0x5583df4daf00
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.locker set_xlocks_done on (dn xlock x=1 by 0x55814520fa80) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlock x=1 by 0x55814520fa80) (dversion lock w=1 last_client=233090) pv=2185 v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=0 0x5583df4daf00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist snapid head
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1000866c4c9 seq 1 lc 0 cr 1 cps 2 snaps={} last_modified 0.000000 change_attr 0 0x555c51c078c0) len=96
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.ino(0x100244d7fcb)  pfile 0 pauth 0 plink 0 pxattr 0 plocal 1 mtime 2024-06-21T22:47:06.129071+0000 ctime 2024-06-21T22:47:06.129071+0000 change_attr 887 valid=1                                    
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) encode_inodestat issuing pAsLsXs seq 26
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) encode_inodestat caps pAsLsXs seq 26 mseq 0 xattrv 0
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added diri [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2195 pv2197 ap=7 f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000 1=0+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock w=1 last_client=233090) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@26,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added dir  [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth pv=2186 v=2184 cv=2127/2127 ap=2+7 state=1610612737|complete f(v0 m2024-06-21T22:47:06.121071+0000) n(v2 rc2024-06-21T22:47:06.121071+0000) hs=0+1,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019ad00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.den(0x100244d7fcb target.dat) add_client_lease client.233090 on (dn xlockdone x=1)
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.locker issue_client_lease seq 14 dur 30000ms  on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=233090) pv=2185 v=2183 ap=7 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.cache.ino(0x10020b95064)  pfile 1 pauth 1 plink 1 pxattr 1 plocal 0 mtime 2024-06-21T22:47:06.129071+0000 ctime 2024-06-21T22:47:06.129071+0000 change_attr 0 valid=1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) encode_inodestat issuing pAsxLsXsxFsxcrwb seq 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) encode_inodestat caps pAsxLsXsxFsxcrwb seq 1 mseq 0 xattrv 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.cache.ino(0x10020b95064) including xattrs version 1
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server set_trace_dist added snap head in [inode 0x10020b95064 [2,head] /clusteradmin/weiler/deph-test/trash/target.dat auth v2185 s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={233090=0-4194304@1} caps={233090=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@1},l=233090 | request=1 caps=1 0x555e4c9d2c00]
2024-06-21T22:47:06.174+0000 7f76d8e28700 10 mds.0.260398 send_message_client client.233090 v1:10.50.3.4:0/1578958282 client_reply(???:699937 = 0 (0) Success unsafe) v1
2024-06-21T22:47:06.174+0000 7f76d8e28700  1 -- [v2:10.50.1.76:6816/378751528,v1:10.50.1.76:6817/378751528] --> v1:10.50.3.4:0/1578958282 -- client_reply(???:699937 = 0 (0) Success unsafe) v1 -- 0x55859c6a2e00 con 0x555c4a8da400
2024-06-21T22:47:06.174+0000 7f76d8e28700 20 mds.0.server lat 0.042211

Safe reply:

2024-06-21T22:47:06.177+0000 7f76d2e1c700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.233090:699937 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.129071+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.177+0000 7f76d2e1c700 10 mds.0.server apply_allocated_inos 0x0 / [] / 0x10020b95064
2024-06-21T22:47:06.177+0000 7f76d2e1c700 20 mds.0.sessionmap mark_dirty s=0x555c4a99aa00 name=client.233090 v=612904046
2024-06-21T22:47:06.177+0000 7f76d2e1c700  7 mds.0.locker local_wrlock_finish  on (iversion lock w=1 last_client=233090) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2197 ap=7 f(v0 m2024-06-21T22:47:06.129071+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.129071+0000 2=1+1) (iauth sync r=6) (isnap sync r=6) (inest lock w=6) (ipolicy sync r=6) (ifile lock w=6) (iversion lock w=1 last_client=233090) caps={148495=pAsLsXs/pAsLsXsFsx@8,148510=pAsLsXs/pAsLsXsFsx@8,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@26,236085=pAsLsXs/pAsLsXsFsx@12,361375=pAsLsXs/pAsLsXsFsx@10,363400=pAsLsXs/-@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.177+0000 7f76d2e1c700  7 mds.0.locker wrlock_finish on (inest lock w=1 dirty) on [inode 0x100244d7fca [...2,head] /clusteradmin/weiler/deph-test/ auth v374 f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:05.901071+0000 3=1+2) (isnap sync r=6) (inest lock w=1 dirty) (ipolicy sync r=6) (iversion lock) caps={139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,151360=pAsLsXsFs/-@2,233090=pAsLsXsFs/-@2,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@2,363400=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 authpin=0 0x555c4f45bb80]
2024-06-21T22:47:06.177+0000 7f76d2e1c700 10 mds.0.locker scatter_eval (inest lock dirty) on [inode 0x100244d7fca [...2,head] /clusteradmin/weiler/deph-test/ auth v374 f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:05.901071+0000 3=1+2) (isnap sync r=6) (inest lock dirty) (ipolicy sync r=6) (iversion lock) caps={139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,151360=pAsLsXsFs/-@2,233090=pAsLsXsFs/-@2,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@2,363400=pAsLsXsFs/-@2} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 authpin=0 0x555c4f45bb80]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.locker xlock_finish on (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l x=1) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.locker eval_gather (dn xlockdone l) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700  7 mds.0.locker eval_gather finished gather on (dn xlockdone l) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn xlockdone l) (dversion lock w=1 last_client=233090) v=2185 ap=7 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 mds.0.cache.den(0x100244d7fcb target.dat) auth_unpin by 0x5583df4dafe0 on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=1 inodepin=1 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00] now 6
2024-06-21T22:47:06.178+0000 7f76d2e1c700 15 mds.0.cache.dir(0x100244d7fcb) adjust_nested_auth_pins -1 on [dir 0x100244d7fcb /clusteradmin/weiler/deph-test/trash/ [2,head] auth v=2186 cv=2127/2127 ap=2+6 state=1610874881|complete f(v0 m2024-06-21T22:47:06.129071+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.129071+0000 1=1+0) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019ad00] by 0x5583df4dafe0 count now 2/6
2024-06-21T22:47:06.178+0000 7f76d2e1c700 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-06-21T22:47:06.178+0000 7f76d2e1c700  7 mds.0.server dispatch_client_request client_request(client.236085:2157106 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.160318+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
...
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.cache.ino(0x10020b95064) add_waiter tag 2000000000040000 0x555d9250e440 !ambig 1 !frozen 1 !freezing 1
2024-06-21T22:47:06.180+0000 7f76d2e1c700 15 mds.0.cache.ino(0x10020b95064) taking waiter here
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.locker nudge_log (iauth excl->sync) on [inode 0x10020b95064 [2,head] /clusteradmin/weiler/deph-test/trash/target.dat auth v2185 ap=6 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl->sync) (ifile excl) (ixattr excl) (iversion lock) cr={233090=0-4194304@1} caps={233090=pAsLsXsxFsxcrwb/pAsxLsXsxFsxcrwb/pAsxXsxFxwb@3},l=233090 | request=1 lock=1 caps=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x555e4c9d2c00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.locker simple_eval (dn sync l r=5) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l r=5) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700  7 mds.0.locker local_wrlock_finish  on (dversion lock w=1 last_client=233090) on [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth (dn sync l r=5) (dversion lock w=1 last_client=233090) v=2185 ap=6 ino=0x10020b95064 state=1610612736 | request=1 lock=2 inodepin=1 dirty=1 waiter=0 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.180+0000 7f76d2e1c700 10 mds.0.260398 send_message_client client.233090 v1:10.50.3.4:0/1578958282 client_reply(???:699937 = 0 (0) Success safe) v1
2024-06-21T22:47:06.180+0000 7f76d2e1c700  1 -- [v2:10.50.1.76:6816/378751528,v1:10.50.1.76:6817/378751528] --> v1:10.50.3.4:0/1578958282 -- client_reply(???:699937 = 0 (0) Success safe) v1 -- 0x556405160e00 con 0x555c4a8da400
2024-06-21T22:47:06.180+0000 7f76d2e1c700  7 mds.0.cache request_finish request(client.233090:699937 nref=2 cr=0x55616e2dc300)
2024-06-21T22:47:06.180+0000 7f76d2e1c700 15 mds.0.cache request_cleanup request(client.233090:699937 nref=2 cr=0x55616e2dc300)

It is also strange that the safe reply was broken into two parts, and in between five other client requests were dispatched.

Next, I need to figure out why the ifile wrlock wasn't released.
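
The (ifile lock w=6) state in the dumps is a counted wrlock: each holder bumps w, and exclusive access is only possible once w drops back to 0. A toy model (CountedWrlock is hypothetical, not Ceph's SimpleLock/ScatterLock) of how a single missing release keeps every later exclusive waiter stuck:

```cpp
#include <cassert>

// Toy counted write lock: wrlocks stack (w=N), and an exclusive waiter
// (e.g. an xlock or a lock-state change) can only proceed once w == 0.
struct CountedWrlock {
    int w = 0;                                 // like "(ifile lock w=6)"
    void wrlock_start()  { ++w; }
    void wrlock_finish() { --w; }
    bool can_go_exclusive() const { return w == 0; }
};
```

In this model, if one request finishes without calling wrlock_finish(), w never reaches 0 and everything queued behind the exclusive transition waits forever, which matches the symptom in the dumps.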

#6

Updated by Xiubo Li over 1 year ago

2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=1) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=2 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (isnap sync r=2) (inest lock) (ipolicy sync r=2) (ifile lock w=1) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=3 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]                                                 

===> client.363400:23748267 create --> lock cache

2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=2) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=2 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=1) (isnap sync r=2) (inest lock w=1) (ipolicy sync r=2) (ifile lock w=2) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.233090:699958 create --> Okay

2024-06-21T22:47:06.606+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=3 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=2) (isnap sync r=3) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=3) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.820230:5988257 create --> Okay

2024-06-21T22:47:06.611+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=4) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 ap=4 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=3) (isnap sync r=4) (inest lock w=3) (ipolicy sync r=4) (ifile lock w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@25,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.236085:2157114 create --> Okay

2024-06-21T22:47:06.616+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=5) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2255 pv2257 ap=6 f(v0 m2024-06-21T22:47:06.501679+0000) n(v2 rc2024-06-21T22:47:06.501679+0000 1=0+1) (iauth sync r=4) (isnap sync r=5) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=5) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.139550:25098228 lookup --> Okay

2024-06-21T22:47:06.621+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=6) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2257 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=4) (isnap sync r=6) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=6) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.361375:28642315 lookup --> Okay

2024-06-21T22:47:06.637+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=7) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2257 ap=7 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=4) (isnap sync r=7) (inest lock w=4) (ipolicy sync r=4) (ifile lock w=7) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@10,148495=pAsLsXs/pAsLsXsFsx@13,148510=pAsLsXs/pAsLsXsFsx@9,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@32,236085=pAsLsXs/pAsLsXsFsx@14,361375=pAsLsXs/pAsLsXsFsx@15,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@6} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.363400:23748271 unlink --> lock cache

2024-06-21T22:47:06.649+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2261 pv2263 ap=4 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=3) (inest lock w=3) (ipolicy sync r=2) (ifile lock w=3) (iversion lock w=1) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@7} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.361375:28642317 create --> Okay

2024-06-21T22:47:06.650+0000 7f76d8e28700 10 mds.0.locker  got wrlock on (ifile lock w=4) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 ap=4 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=4) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@26,820230=pAsLsXs/pAsLsXsFsx@7} | dirtyscattered=0 request=0 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]

===> client.148495:5294062 create --> Okay

2024-06-21T22:47:06.673+0000 7f76d8e28700  7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.673+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.673+0000 7f76d8e28700  7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.673+0000 7f76d8e28700 10 mds.0.locker found lock cache for create on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=5 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=4) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=0 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.673+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.673+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.673+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb)   hit -> (target.dat,head)

===> client.363400:23748273 create --> found the lock cache created by the "client.363400:23748267 create".

2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.server dispatch_client_request client_request(client.363400:23748273 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.665679+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head
2024-06-21T22:47:06.685+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb)   hit -> (target.dat,head)
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.363400:23748273 nref=3 cr=0x555e0261f500)
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker  must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]

Invalidate Lock Caches:

2024-06-21T22:47:06.674+0000 7f76d8e28700  7 mds.0.locker simple_sync on (ifile lock w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=5) (inest lock w=2) (ipolicy sync r=3) (ifile lock w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.674+0000 7f76d8e28700 10 mds.0.locker invalidate_lock_caches (ifile lock->sync w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2263 pv2265 ap=6 f(v0 m2024-06-21T22:47:06.577679+0000 1=1+0) n(v2 rc2024-06-21T22:47:06.577679+0000 2=1+1) (iauth sync r=1) (isnap sync r=5) (inest lock w=2) (ipolicy sync r=3) (ifile lock->sync w=4) (iversion lock w=1 last_client=363400) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=0 authpin=1 0x555c4f45b600]             

Drop Cache 1:

2024-06-21T22:47:06.680+0000 7f76d8e28700 10 MDSContext::complete: 15C_MDL_DropCache
2024-06-21T22:47:06.680+0000 7f76d8e28700  7 mds.0.locker rdlock_finish on (isnap sync r=4685) on [inode 0x1 [...2,head] / auth{1=6} v7704853 snaprealm=0x555c4a97e900 f(v0 m2024-03-13T23:28:02.843604+0000 24=0+24) n(v976219 rc2036-12-08T15:08:49.000000+0000 b679352367182001 88277252=82000408+6276844)/n(v0 rc2024-03-01T19:27:05.353098+0000 1=0+1) (isnap sync r=4685) (inest mix dirty) (ipolicy sync r=4640) (iversion lock) caps={111580=pAsLsXsFs/-@1,139530=pAsLsXsFs/-@1,139535=pAsLsXsFs/-@1,139550=pAsLsXsFs/-@1,148495=pAsLsXsFs/-@2,148510=pAsLsXsFs/-@2,148515=pAsLsXsFs/-@1,151360=pAsLsXsFs/pAsLsXsFsx@97,233090=pAsLsXsFs/-@1,236085=pAsLsXsFs/-@0,361375=pAsLsXsFs/-@1,361455=pAsLsXsFs/-@2,361460=pAsLsXsFs/-@1,361470=pAsLsXsFs/-@1,361475=pAsLsXsFs/-@1,361490=pAsLsXsFs/-@1,361500=pAsLsXsFs/-@1,362295=pAsLsXsFs/-@1,363400=pAsLsXsFs/-@3,379420=pAsLsXsFs/-@0,458255=pAsLsXsFs/pAsLsXsFs@3,820230=pAsLsXsFs/-@2,1365120=pAsLsXsFs/-@2,1365125=pAsLsXsFs/-@1,1365140=pAsLsXsFs/-@1,1366055=pAsLsXsFs/-@2,1366185=pAsLsXsFs/-@1,1863515=pAsLsXsFs/-@3} mcw={1=pAsLsXsFsxcral} | dirtyscattered=1 request=0 lock=2 dirfrag=1 caps=1 openingsnapparents=0 replicated=1 dirty=1 authpin=0 0x555c4a976680]
2024-06-21T22:47:06.680+0000 7f76d8e28700  7 mds.0.locker wrlock_finish on (ifile lock->sync w=4) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=3) (isnap sync r=7) (inest lock w=4) (ipolicy sync r=3) (ifile lock->sync w=4) (iversion lock) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@16,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=1 lock=5 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]

Drop Cache 2:

Noop

It seems the wrlock ref taken by the "client.363400:23748271 unlink --> lock cache" was never released. Maybe the CACHED flag races between the lock caches.

Will continue next week.

Actions #7

Updated by Xiubo Li over 1 year ago

It should be a deadlock bug between the lock cache and create/unlink operations:

From the logs, when the client.363400:23748273 create request was retried for the last time, we can see that it tries to acquire must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb and must rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat:

2024-06-21T22:47:06.685+0000 7f76d8e28700 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.server dispatch_client_request client_request(client.363400:23748273 create owner_uid=3495, owner_gid=600 #0x100244d7fcb/target.dat 2024-06-21T22:47:06.665679+0000 caller_uid=3495, caller_gid=600{600,611,999,}) v4
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.server open w/ O_CREAT on #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.server rdlock_path_xlock_dentry request(client.363400:23748273 nref=2 cr=0x555e0261f500) #0x100244d7fcb/target.dat
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.cache traverse: opening base ino 0x100244d7fcb snap head                                                                                                                                                   
2024-06-21T22:47:06.685+0000 7f76d8e28700 12 mds.0.cache traverse: path seg depth 0 'target.dat' snapid head
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb) lookup (target.dat, 'head')
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.cache.dir(0x100244d7fcb)   hit -> (target.dat,head)
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker acquire_locks request(client.363400:23748273 nref=3 cr=0x555e0261f500)
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker  must wrlock (ifile lock->sync w=3) [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 20 mds.0.locker  must rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker  must authpin [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker  must authpin [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker  auth_pinning [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=7 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) auth_pin by 0x55865d960400 on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600] now 8
2024-06-21T22:47:06.685+0000 7f76d8e28700 15 mds.0.cache.dir(0x100244d7fca) adjust_nested_auth_pins 1 on [dir 0x100244d7fca /clusteradmin/weiler/deph-test/ [2,head] auth pv=2268 v=2266 cv=2138/2138 ap=1+8 state=1610874881|complete f(v0 m2024-06-21T21:10:45.592871+0000 1=0+1) n(v9 rc2024-06-21T22:47:06.625679+0000 1=0+1)/n(v9 rc2024-06-21T22:47:05.901071+0000 2=1+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 authpin=1 0x555c5019a880] by 0x555c4f45b600 count now 1/8
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker  already auth_pinned [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker wrlock_start (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700  7 mds.0.locker wrlock_start waiting on (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache.ino(0x100244d7fcb) add_waiter tag 2000000040000000 0x55848644e580 !ambig 1 !frozen 1 !freezing 1
2024-06-21T22:47:06.685+0000 7f76d8e28700 15 mds.0.cache.ino(0x100244d7fcb) taking waiter here
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.locker nudge_log (ifile lock->sync w=3) on [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/ auth v2265 pv2267 ap=8 f(v0 m2024-06-21T22:47:06.625679+0000) n(v2 rc2024-06-21T22:47:06.625679+0000 1=0+1) (iauth sync r=2) (isnap sync r=6) (inest lock w=3) (ipolicy sync r=3) (ifile lock->sync w=3) (iversion lock w=1 last_client=361375) caps={139550=pAsLsXs/pAsLsXsFsx@11,148495=pAsLsXs/pAsLsXsFsx@14,148510=pAsLsXs/pAsLsXsFsx@10,151360=pAsLsXs/-@2,233090=pAsLsXs/pAsLsXsFsx@33,236085=pAsLsXs/pAsLsXsFsx@16,361375=pAsLsXs/pAsLsXsFsx@17,363400=pAsLsXs/-@27,820230=pAsLsXs/pAsLsXsFsx@7,1365120=pAsLsXs/-@1} | dirtyscattered=0 request=3 lock=6 dirfrag=1 caps=1 dirtyrstat=0 dirtyparent=0 dirty=1 waiter=1 authpin=1 0x555c4f45b600]
2024-06-21T22:47:06.685+0000 7f76d8e28700 10 mds.0.cache traverse: failed to rdlock (dn xlockdone l x=1) [dentry #0x1/clusteradmin/weiler/deph-test/trash/target.dat [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=361375) pv=2255 v=2253 ap=6 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 dirty=1 waiter=1 authpin=1 clientlease=1 0x5583df4daf00]

From the path_traverse() code we can confirm that this request had acquired the lock cache and kept holding it:

int MDCache::path_traverse(MDRequestRef& mdr, MDSContextFactory& cf,
                           const filepath& path, int flags,
                           vector<CDentry*> *pdnvec, CInode **pin)
{
  bool discover = (flags & MDS_TRAVERSE_DISCOVER);
  bool forward = !discover;
  bool path_locked = (flags & MDS_TRAVERSE_PATH_LOCKED);
  bool want_dentry = (flags & MDS_TRAVERSE_WANT_DENTRY);
  bool want_inode = (flags & MDS_TRAVERSE_WANT_INODE);
  bool want_auth = (flags & MDS_TRAVERSE_WANT_AUTH);
  bool rdlock_snap = (flags & (MDS_TRAVERSE_RDLOCK_SNAP | MDS_TRAVERSE_RDLOCK_SNAP2));
  bool rdlock_path = (flags & MDS_TRAVERSE_RDLOCK_PATH);
  bool xlock_dentry = (flags & MDS_TRAVERSE_XLOCK_DENTRY);
  bool rdlock_authlock = (flags & MDS_TRAVERSE_RDLOCK_AUTHLOCK);
...
      if (rdlock_path) {
        lov.clear();
        // do not xlock the tail dentry if target inode exists and caller wants it
        if (xlock_dentry && (dnl->is_null() || !want_inode) &&
            depth == path.depth() - 1) {
          ceph_assert(dn->is_auth());
          if (depth > 0 || !mdr->lock_cache) {
            lov.add_wrlock(&cur->filelock);
            lov.add_wrlock(&cur->nestlock);
            if (rdlock_authlock)
              lov.add_rdlock(&cur->authlock);
          }
          lov.add_xlock(&dn->lock);
        } else {
          // force client to flush async dir operation if necessary
          if (cur->filelock.is_cached())
            lov.add_wrlock(&cur->filelock);
          lov.add_rdlock(&dn->lock);
        }
        if (!mds->locker->acquire_locks(mdr, lov)) {
          dout(10) << "traverse: failed to rdlock " << dn->lock << " " << *dn << dendl;
          return 1;
        }
...

Then, when acquiring the wrlock on (ifile lock->sync w=3) [inode 0x100244d7fcb, it failed and had to wait for all the lock caches to be released, but this request itself was holding one reference to the create lock cache.

Hence the deadlock.
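As a rough illustration, here is a toy Python model of the cycle (illustrative names only, not the real Ceph classes): the lock cache pins a counted wrlock on the directory's ifile lock; the lock->sync transition must wait for the wrlock count to drain; and the retried request both waits on that transition and keeps the cache's wrlock pinned:

```python
# Toy model of the MDS deadlock. All class and attribute names are
# hypothetical simplifications of the MDS locker state machine.

class FileLock:
    def __init__(self):
        self.state = "lock"      # current lock state
        self.num_wrlock = 0      # counted wrlock references

    def can_wrlock(self):
        # once a lock->sync transition starts, new wrlocks must wait
        return self.state == "lock"

class LockCache:
    """Created by an async create/unlink; pins a wrlock on the dir filelock."""
    def __init__(self, filelock):
        self.filelock = filelock
        filelock.num_wrlock += 1   # ref held until the cache is released
        self.invalidated = False

    def release(self):
        self.filelock.num_wrlock -= 1

filelock = FileLock()
cache = LockCache(filelock)        # client.363400's create built this cache

# another request forces lock->sync (simple_sync on ifile)
filelock.state = "lock->sync"
cache.invalidated = True           # invalidate_lock_caches marks it stale...

# ...but the retried create still references the cache, so its wrlock ref
# is never dropped: the transition waits for num_wrlock == 0 forever, and
# the create waits for the transition to finish -- a cycle.
transition_can_finish = filelock.num_wrlock == 0
create_can_proceed = filelock.can_wrlock()
print(transition_can_finish, create_can_proceed)   # False False -> deadlock
```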

Actions #8

Updated by Xiubo Li over 1 year ago

  • Status changed from Need More Info to In Progress
Actions #9

Updated by Xiubo Li over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 58474
Actions #10

Updated by sunnat samad over 1 year ago

I was able to find another way to reproduce the issue on demand with the following script:

import os
import time
import errno
class FileLockException(Exception):
    pass
class FileLock(object):
    def __init__(self, file_name, timeout=10, delay=.05):
        if timeout is not None and delay is None:
            raise ValueError("If timeout is not None, then delay must not be None.")
        self.is_locked = False
        self.lockfile = os.path.join(os.getcwd(), "%s.lock" % file_name)
        self.file_name = file_name
        self.timeout = timeout
        self.delay = delay
    def acquire(self):
        start_time = time.time()
        while True:
            try:
                self.fd = os.open(self.lockfile, os.O_CREAT|os.O_EXCL|os.O_RDWR)
                self.is_locked = True
                break
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                if self.timeout is None:
                    raise FileLockException("Could not acquire lock on {}".format(self.file_name))
                if (time.time() - start_time) >= self.timeout:
                    raise FileLockException("Timeout occurred.")
                time.sleep(self.delay)
    def release(self):
        if self.is_locked:
            os.close(self.fd)
            os.unlink(self.lockfile)
            self.is_locked = False
    def __enter__(self):
        if not self.is_locked:
            self.acquire()
        return self
    def __exit__(self, type, value, traceback):
        if self.is_locked:
            self.release()

    def __del__(self):
        self.release()
def main():
    count = 0
    while True:
        with FileLock('.lock'):
            print(f'Doing stuff {count}')
            count += 1
main()
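The lock-file loop above can also be driven by several workers at once, which approximates the concurrent create/unlink stream visible in the MDS logs. A hedged sketch (paths and counts are arbitrary; a real reproduction would run the workers from separate CephFS clients/mounts, threads are used here only to show the contention pattern):

```python
# Hypothetical multi-worker variant of the reproducer above.
import errno
import os
import threading
import time

def lock_unlock_worker(lock_path, iterations):
    """Repeatedly win the O_CREAT|O_EXCL race, then unlink -- the same
    create/unlink stream the MDS logs show arriving from many clients."""
    for _ in range(iterations):
        while True:
            try:
                fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_RDWR)
                break
            except OSError as e:
                if e.errno != errno.EEXIST:
                    raise
                time.sleep(0.005)  # back off and retry, like FileLock.acquire
        os.close(fd)
        os.unlink(lock_path)

if __name__ == "__main__":
    path = os.path.join(os.getcwd(), ".lock")  # point this at a CephFS mount
    workers = [threading.Thread(target=lock_unlock_worker, args=(path, 50))
               for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```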

But likely any active file-locking scheme would work. With 18.2.4 this softlocks the MDS within a few minutes. I was also able to narrow down which Ceph releases do and do not exhibit the issue.

16.2.13 (does not have the issue)
17.2.6 (does not have the issue)

16.2.14 (has the issue)
17.2.7 (has the issue)
18.x (has the issue)

This pointed me to https://github.com/ceph/ceph/pull/51609. I have now tested with that commit reverted, and the softlock no longer occurs, so something in that commit causes metadata operations in the MDS to softlock permanently, and the only way to recover is to reboot the MDS server. In the worst case, if the softlock has been present for a long time, journal trimming may not have run for a long time and the journal can grow to multiple TB in size; the MDS is then unable to start, because the code first loads the whole journal into memory before processing it, so to recover you need enough MDS memory to hold multiple TB of journal data. That commit tried to fix some kernel panic cases, so simply reverting it might not be the best option, though even so it might be better for us than the current situation: we never saw the issue that commit fixed, but we constantly see the side effect it caused.

Actions #11

Updated by Venky Shankar about 1 year ago

  • Category set to Correctness/Safety
  • Target version set to v20.0.0
  • Source set to Community (user)
  • Backport changed from quincy,reef,squid to reef,squid
Actions #12

Updated by sunnat samad about 1 year ago

Can someone assign this to me? I am working on this here:
https://github.com/ceph/ceph/pull/61250

Actions #13

Updated by sunnat samad about 1 year ago

Can someone assign this to me? I am working on this here.
https://github.com/ceph/ceph/pull/61250

Actions #14

Updated by Venky Shankar about 1 year ago

  • Status changed from Fix Under Review to Pending Backport

Sunnat sam wrote in #note-13:

Can someone assign this to me? I am working on this here.
https://github.com/ceph/ceph/pull/61250

Can't see your id in the "Assignee" list. Could you get that sorted?

Actions #15

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #69682: reef: mds deadlock between 'lookup' and the 'rename/create, etc' requests added
Actions #16

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #69683: squid: mds deadlock between 'lookup' and the 'rename/create, etc' requests added
Actions #17

Updated by Upkeep Bot about 1 year ago

  • Tags (freeform) set to backport_processed
Actions #18

Updated by Igor Golikov about 1 year ago

  • Assignee changed from Xiubo Li to Igor Golikov
Actions #19

Updated by Igor Golikov about 1 year ago

  • Assignee changed from Igor Golikov to Alex Markuze
Actions #20

Updated by Venky Shankar about 1 year ago

  • Assignee changed from Alex Markuze to Venky Shankar
  • Pull request ID changed from 58474 to 61250
Actions #21

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to b6a0e49d4db0bfb5de866caea0e28eb3dd7b49c2
  • Fixed In set to v19.3.0-7097-gb6a0e49d4db
  • Upkeep Timestamp set to 2025-07-09T13:46:28+00:00
Actions #22

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v19.3.0-7097-gb6a0e49d4db to v19.3.0-7097-gb6a0e49d4d
  • Upkeep Timestamp changed from 2025-07-09T13:46:28+00:00 to 2025-07-14T17:11:35+00:00
Actions #23

Updated by Upkeep Bot 6 months ago

  • Status changed from Pending Backport to Resolved
  • Upkeep Timestamp changed from 2025-07-14T17:11:35+00:00 to 2025-09-16T12:40:38+00:00
Actions #24

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~1275
  • Upkeep Timestamp changed from 2025-09-16T12:40:38+00:00 to 2025-11-01T01:38:23+00:00