mds: avoid acquiring the wrlock twice for a single request #58474
Conversation
src/mds/Server.cc (Outdated)
```cpp
}

void Server::handle_conf_change(const std::set<std::string>& changed) {
  if (changed.count("mds_lock_cache")) {
```
This needs to be added to the tracked configs in MDSRankDispatcher::get_tracked_conf_keys()?

This config may be more appropriately named mds_allow_async_dirops.

> This needs to be added to the tracked configs in MDSRankDispatcher::get_tracked_conf_keys()?

Yeah, will fix it.

> This config may be more appropriately named mds_allow_async_dirops.

Sounds better.
@batrick BTW, then we should also stop delegating the inode numbers to the kclient, to comply with the name of mds_allow_async_dirops.

IMO this should be fine, because at least on the MDS side we will have a simple way to disable async dirops, instead of doing it in each kclient mount one by one.

What do you think?
Recently I was reminded of mds_client_delegate_inos_pct, which can be set to 0, effectively also disabling async dirops.

It would be nice if we had a single, clearer config to control async dirops (like a simple well-named bool). I still support having the new name. However, we should observe that even without delegated inodes the MDS may create lock caches. So perhaps this config should keep its current name, although "mds_allow_lock_caching" would be better.

> Recently I was reminded of mds_client_delegate_inos_pct, which can be set to 0, effectively also disabling async dirops.

Yeah, correct. I remember this now.
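The tracked-keys plus handle_conf_change pattern discussed in this thread can be sketched in a self-contained way. This is a simplified model, not the real Ceph md_config_obs_t / MDSRankDispatcher API; the key name mds_allow_async_dirops is the rename proposed above, and apply_change stands in for the config dispatcher:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Simplified sketch (not the real Ceph API) of the observer pattern the
// review refers to: a component lists the config keys it tracks, and a
// change is only delivered to it if the changed key is in that list.
struct ConfObserver {
  bool async_dirops_allowed = true;

  // Analogous to MDSRankDispatcher::get_tracked_conf_keys(): a key that is
  // not listed here never reaches handle_conf_change().
  std::vector<std::string> get_tracked_conf_keys() const {
    return {"mds_allow_async_dirops"};
  }

  void handle_conf_change(const std::map<std::string, std::string>& conf,
                          const std::set<std::string>& changed) {
    if (changed.count("mds_allow_async_dirops"))
      async_dirops_allowed = (conf.at("mds_allow_async_dirops") == "true");
  }
};

// Dispatcher side: filter the changed set down to tracked keys first, so an
// observer that forgets to list a key simply never sees its changes -- which
// is exactly the bug the first review comment points out.
inline void apply_change(ConfObserver& obs,
                         const std::map<std::string, std::string>& conf,
                         const std::set<std::string>& changed) {
  std::set<std::string> relevant;
  for (const auto& k : obs.get_tracked_conf_keys())
    if (changed.count(k))
      relevant.insert(k);
  if (!relevant.empty())
    obs.handle_conf_change(conf, relevant);
}
```

This makes the reviewer's point concrete: adding a new option without also registering it among the tracked keys leaves runtime changes to that option silently ignored.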
src/mds/MDCache.cc (Outdated)
```diff
 } else {
   // force client to flush async dir operation if necessary
-  if (cur->filelock.is_cached())
+  if (!mdr->lock_cache && cur->filelock.is_cached())
```
Should we not simply check that the filelock is not already in the lock_cache?
Force-pushed from d49a096 to 2ade6e2 (compare)
jenkins test windows
```diff
-  if (cur->filelock.is_cached())
+  if (cur->filelock.is_cached() &&
+      !(mdr->lock_cache &&
+        static_cast<const MutationImpl*>(mdr->lock_cache)->is_wrlocked(&cur->filelock))) {
```
Deciding if we need to take a lock by checking if it's already taken is a huge code smell. How is this safe? If it is safe, please comment with some justification.
Is there really not a way to know up front what the expected and needed state is? How do we ensure that the lock state doesn't change, if we don't own the lock?
Once a lock_cache is created, the locks belonging to it won't be dropped until the lock_cache itself is removed. And each mdr that attaches the lock_cache holds a reference to it. So it is safe here to check whether cur->filelock is owned by the mdr's lock_cache or not.
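The lifetime argument above can be modeled in a few lines. This is a toy sketch, not Ceph's MDCache/Locker code: LockCache, Mdr, and maybe_add_wrlock are illustrative names standing in for the real structures, and the condition mirrors the one in the patch:

```cpp
#include <set>
#include <vector>

// Toy model of the invariant: a LockCache pins a set of wrlocks for its
// whole lifetime, and each request (mdr) attached to it takes a reference,
// so a lock found in the cache cannot be dropped underneath the request
// while the request is being processed.
struct Lock { int id; };

struct LockCache {
  std::set<const Lock*> wrlocks;  // pinned until the cache itself is removed
  int ref = 0;                    // one reference per attached mdr
  bool is_wrlocked(const Lock* l) const { return wrlocks.count(l) != 0; }
};

struct Mdr {
  LockCache* lock_cache = nullptr;
  void attach(LockCache* lc) { lock_cache = lc; ++lc->ref; }
};

// Mirrors the fixed condition in the patch: only add the wrlock to the
// lock vector if it is not already held by the request's attached cache.
inline void maybe_add_wrlock(const Mdr& mdr, const Lock* filelock,
                             std::vector<const Lock*>& lov) {
  if (!(mdr.lock_cache && mdr.lock_cache->is_wrlocked(filelock)))
    lov.push_back(filelock);
}
```

The point of the reference counting is that the check is not a race: between is_wrlocked() returning true and the request using the lock, the cache (and thus the lock) cannot go away, because the mdr's own reference keeps it alive.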
```cpp
if (cur->filelock.is_cached() &&
    !(mdr->lock_cache &&
      static_cast<const MutationImpl*>(mdr->lock_cache)->is_wrlocked(&cur->filelock))) {
  lov.add_wrlock(&cur->filelock);
}
```
Hmm, so it's correct that the current design of the lock caching is that we skip adding locks if they are held by the lock_cache.

What confuses me here is that this fix presupposes that the mdr is taking this else path. From what I can see in the log posted on the tracker, the mdr should be taking the if path and not adding the filelock at all.

I do think this patch is fixing a genuine issue, however. I think perhaps the right approach is:
```diff
-if (cur->filelock.is_cached() &&
-    !(mdr->lock_cache &&
-      static_cast<const MutationImpl*>(mdr->lock_cache)->is_wrlocked(&cur->filelock))) {
-  lov.add_wrlock(&cur->filelock);
-}
+if (cur->filelock.is_cached()) {
+  if (mdr->lock_cache)
+    mds->locker->put_lock_cache(mdr->lock_cache);
+  lov.add_wrlock(&cur->filelock);
+}
```
> Hmm, so it's correct that the current design of the lock caching is that we skip adding locks if held by the lock_cache.
>
> What confuses me here is that this fix presupposes that the mdr is taking this else path. From what I can see in the log posted on the tracker, the mdr should be taking the if path and not adding the filelock at all.

This is because (dnl->is_null() || !want_inode) was also false for [inode 0x100244d7fcb [...2,head] /clusteradmin/weiler/deph-test/trash/, and so it went to the else path.
> I do think this patch is fixing a genuine issue however. I think perhaps the right approach is:

I am afraid we would introduce other issues here. Earlier in path_traverse() it attaches the lock_cache to the current mdr and then skips the call to try_rdlock_snap_layout():
```cpp
if (flags & MDS_TRAVERSE_CHECK_LOCKCACHE)
  mds->locker->find_and_attach_lock_cache(mdr, cur);

if (mdr && mdr->lock_cache) {
  if (flags & MDS_TRAVERSE_WANT_DIRLAYOUT)
    mdr->dir_layout = mdr->lock_cache->get_dir_layout();
} else if (rdlock_snap) {
  int n = (flags & MDS_TRAVERSE_RDLOCK_SNAP2) ? 1 : 0;
  if ((n == 0 && !(mdr->locking_state & MutationImpl::SNAP_LOCKED)) ||
      (n == 1 && !(mdr->locking_state & MutationImpl::SNAP2_LOCKED))) {
    bool want_layout = (flags & MDS_TRAVERSE_WANT_DIRLAYOUT);
    if (!mds->locker->try_rdlock_snap_layout(cur, mdr, n, want_layout))
      return 1;
  }
}
```
If we just drop the lock_cache here, we would need to do the above beforehand, or retry path_traverse().
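The ordering problem in this objection can be shown with a toy control flow. This is not the real path_traverse(); TraverseState, traverse_prefix, and drop_cache_and_retry are illustrative names, with snap_locked standing in for the SNAP_LOCKED state that try_rdlock_snap_layout() would establish:

```cpp
// Toy model of the ordering constraint: the snap/layout rdlock step is
// skipped when a lock_cache is attached to the mdr, so if the cache is
// dropped later (at the filelock step, as the suggestion proposes), the
// request would proceed without those rdlocks unless the traversal is
// redone from the top.
struct TraverseState {
  bool lock_cache_attached = false;
  bool snap_locked = false;
};

inline void traverse_prefix(TraverseState& st) {
  // Mirrors the quoted code: only take the snap/layout rdlocks when no
  // lock cache is attached to the mdr.
  if (!st.lock_cache_attached)
    st.snap_locked = true;  // stands in for try_rdlock_snap_layout()
}

// Dropping the cache mid-traverse leaves snap_locked false; retrying the
// traversal from the top is what repairs the state.
inline void drop_cache_and_retry(TraverseState& st) {
  st.lock_cache_attached = false;
  traverse_prefix(st);
}
```

In other words, the put_lock_cache() suggestion is not a local change: it invalidates a decision made earlier in the same traversal, which is why the reply asks for the snap/layout locking to be done beforehand or for the traversal to be retried.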
Also, that would mean the lock_cache feature is effectively always disabled, because we would always drop the lock_cache.
This pull request can no longer be automatically merged: a rebase is needed and conflicts have to be manually resolved
The lock cache is buggy and we need to disable it as a workaround.

Fixes: https://tracker.ceph.com/issues/65607
Signed-off-by: Xiubo Li <xiubli@redhat.com>
If the current request has a lock cache attached, then the lock cache must have already acquired the wrlock on the filelock. So currently path_traverse() will acquire the wrlock twice and could deadlock against itself.

Fixes: https://tracker.ceph.com/issues/65607
Signed-off-by: Xiubo Li <xiubli@redhat.com>
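Why does a second wrlock acquisition deadlock rather than just fail? A minimal illustration (this is not Ceph's SimpleLock; WrLock is a toy with no owner tracking, and try_wrlock() is used so the self-deadlock shows up as a failed acquire instead of an actual hang):

```cpp
// A non-reentrant write lock has no notion of "already owned by me": a
// second acquisition by the same request can never succeed, because the
// holder that would release it is the same chain of execution. In real
// code the request waits forever for itself -- the deadlock described in
// the commit message.
struct WrLock {
  bool held = false;
  bool try_wrlock() {
    if (held) return false;  // second acquire: would block forever
    held = true;
    return true;
  }
  void unlock() { held = false; }
};
```

With the fix, path_traverse() checks whether the attached lock cache already holds the wrlock and skips the second acquisition entirely, so this situation never arises.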
Nothing changed, just rebased and resolved the conflicts.
jenkins retest this please
We can confirm that this PR resolves the MDS locking issue. Is there any plan to take these changes in?
This has to be reviewed and tested again before it can be merged and backported. We will try to get it included soon...

cc @kotreshhr for review (@mchangir please keep an eye on this for QA testing once the change has been reviewed and approved).

@vshankar @kotreshhr can we take this forward? This issue is critical for us.

Bumping this up @kotreshhr - PTAL

@vshankar @kotreshhr I'd like to add that this is a critical issue for us as well, and I am excited to see this fix merged!

@vshankar @kotreshhr just a bump on this item!
I opened a new PR according to our discussion in the CephFS standup, as the author has shifted focus away from this project. I am new to the Ceph community; all comments and suggestions for improvement are welcome. The new PR is here.

Thanks for the patch @Sunnatillo - will have it reviewed.
Superseded by #61250 |