mds: getattr just waits the xlock to be released by the previous client #56602

Merged
vshankar merged 1 commit into ceph:main from lxbsz:wip-63906
Aug 23, 2024

Conversation

@lxbsz
Member

@lxbsz lxbsz commented Apr 1, 2024

When the previous client's setattr request is still holding the xlock for the linklock/authlock/xattrlock/filelock locks, if the same client sends a getattr request it will use the projected inode to fill the reply, while for other clients the getattr requests will use the non-projected inode to fill replies. This causes inconsistent file mode across multiple clients.

This will just let the getattr wait until the previous client releases the xlock.

Fixes: https://tracker.ceph.com/issues/63906


@lxbsz lxbsz requested a review from a team April 1, 2024 01:38
@github-actions github-actions bot added the cephfs Ceph File System label Apr 1, 2024
@mchangir
Contributor

also, possibly update the PR title and the commit title to:
mds: always make getattr wait for xlock to be released by the previous client

@lxbsz
Member Author

lxbsz commented Apr 25, 2024

mds: always make getattr wait for xlock to be released by the previous client

This looks good to me; I've fixed them all. Thanks @mchangir

@mchangir
Contributor

... none projected inode to fill replies. This just cause inconsistent file mode across multiple clients.

  • please change this to: non-projected
  • please change the last sentence to: This causes inconsistent ...

Contributor

@leonid-s-usov leonid-s-usov left a comment

I'd like to discuss this. The code doesn't look right, it's trying to hack the locking system. We don't want to add such code outside of the Locker, and even there it's smelly.

Using the projected inode in requests from the holder of the xlock is a feature: it means that the client gets back the recent changes even if they haven't been committed yet.

Going back to the ticket, I'm not sure we're fixing the issue at the correct place. If chmod in POSIX is a synchronous operation, then we should have waited for the change to be committed before we allowed another query to propagate to the MDS.
I haven't yet found out whether fsync is required after a chmod, but in any case, I doubt that we need to fix it on the MDS side.

Contributor

@leonid-s-usov leonid-s-usov left a comment

I forgot to request changes. See my previous review comment.

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

I'd like to discuss this. The code doesn't look right, it's trying to hack the locking system. We don't want to add such code outside of the Locker, and even there it's smelly.

Then where should it be?

Using the projected inode in requests from the holder of the xlock is a feature: it means that the client gets back the recent changes even if they haven't been committed yet.

Yeah, correct, but this will only be readable by the client holding the xlock; the other clients will still read the old metadata, which apparently violates POSIX semantics compared with local filesystems.

Going back to the ticket I'm not sure we're fixing the issue at the correct place. If chmod in POSIX is a synchronous operation, then we should've waited for the change to be committed, before we allowed another query to propagate to the MDS.

Yeah, this change is doing exactly that. This should be opaque to users; we shouldn't fail the request by telling users that the MDS is not ready yet and that they should wait a while before doing the query.

I haven't yet found out whether fsync is required after a chmod, but in any case, I doubt that we need to fix it on the MDS side

No, this isn't a must; we shouldn't ask users to do this because POSIX doesn't mention it. Even if we do fsync here it won't resolve this issue, because the chmod and fsync are not atomic, so we can't guarantee the fsync has already finished when another client tries to query. Most importantly, for most RHEL 8 and other old clients the fsync will only flush the client-side cache to the MDS, and won't guarantee flushing the MDS's MDLog to the pool.

@leonid-s-usov Any better approach to resolve this?

Member

@gregsfortytwo gregsfortytwo left a comment

Hmm I agree with Leonid here, we should definitely not be poking into the locking system.
Looking at the tracker ticket, @lxbsz notes that we are batching the getattr from two separate clients, one of which has the xlock (so it can see the projected inode, and the other client can't). I suspect this is the root cause of the issue — a non-xlocker should not read out-of-date inodes; the locking system should block that data access until the inode is stable.

This sounds to me like a bug with the batching implementation, which I have not investigated. :( But I suspect we need to adjust the batching system so that it doesn't batch operations from clients with different caps?

@gregsfortytwo
Member

If this isn't something caused by batching, then it is very scary. You can see down at https://github.com/ceph/ceph/pull/56602/files#diff-277dc6e796ccecb6aa14c9357f7a86898d6ddcf4113c110a4e00e0a10f6fefa7R4189 where we require the rdlock on the authlock, and I believe that should prevent exactly the read-stable-while-there-is-a-projected-inode issue described in the ticket?

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

... none projected inode to fill replies. This just cause inconsistent file mode across multiple clients.

  • please change this to: non-projected
  • please change the last sentence to: This causes inconsistent ...

Done, thanks @mchangir

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

Hmm I agree with Leonid here, we should definitely not be poking into the locking system. Looking at the tracker ticket, @lxbsz notes that we are batching the getattr from two separate clients, one of which has the xlock (so it can see the projected inode, and the other client can't). I suspect this is the root cause of the issue — a non-xlocker should not read out-of-date inodes; the locking system should block that data access until the inode is stable.

This sounds to me like a bug with the batching implementation, which I have not investigated. :( But I suspect we need to adjust the batching system so that it doesn't batch operations from clients with different caps?

It already does this: only requests with the same mask will be batched. Maybe we can just avoid batching non-xlocked requests with the xlocked one.

@gregsfortytwo
Member

@batrick there's a FIXME in handle_client_getattr stemming from #27866 — do you have any idea what's going on there?

I don't really understand how what @lxbsz described in https://tracker.ceph.com/issues/63906#note-9 is possible — you can clearly see Server::handle_client_getattr() invoking rdlock_path_pin_ref and then

if ((mask & CEPH_CAP_AUTH_SHARED) && !(issued & CEPH_CAP_AUTH_EXCL))
  lov.add_rdlock(&ref->authlock);
...
if (!mds->locker->acquire_locks(mdr, lov))
  return;

So how can we possibly be returning data while there's a projected inode outstanding?

@gregsfortytwo
Member

Hmm I agree with Leonid here, we should definitely not be poking into the locking system. Looking at the tracker ticket, @lxbsz notes that we are batching the getattr from two separate clients, one of which has the xlock (so it can see the projected inode, and the other client can't). I suspect this is the root cause of the issue — a non-xlocker should not read out-of-date inodes; the locking system should block that data access until the inode is stable.
This sounds to me like a bug with the batching implementation, which I have not investigated. :( But I suspect we need to adjust the batching system so that it doesn't batch operations from clients with different caps?

Maybe we can just avoid batching non-xlocked requests to the xlocked one.

Oh, yes, I said "caps" but I suppose you can still have an xlock assigned to you without different issued caps. This is what I meant.

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

@batrick there's a FIXME in handle_client_getattr stemming from #27866 — do you have any idea what's going on there?

I don't really understand how what @lxbsz described in https://tracker.ceph.com/issues/63906#note-9 is possible — you can clearly see Server::handle_client_getattr() invoking rdlock_path_pin_ref and then

I just made a mistake and misled you in our 1:1; yeah, it's a bug in the batch ops.

1. Client A does a sync setattr in the MDS, and then in the MDS it holds the xlock for the authlock.
2. Client A sends a getattr requestA, which will be marked as the batch head.
3. Client B sends a getattr requestB too, which will be added to the batch since it has the same mask as the getattr from client A.

So here we just need to make sure getattr requestA and getattr requestB won't be batched. Each getattr request will try to take the rdlock, which will wait for the previous xlock to be released. When the xlock is released the projected inode (pi) will be popped.

if ((mask & CEPH_CAP_AUTH_SHARED) && !(issued & CEPH_CAP_AUTH_EXCL))
  lov.add_rdlock(&ref->authlock);

I think this is because once the xlock is acquired by any client, all the clients' requests need to acquire the rdlock here. Only in the excl mode does the corresponding client not have to acquire the rdlock here.

...
if (!mds->locker->acquire_locks(mdr, lov))
  return;


So how can we possibly be returning data while there's a projected inode outstanding?

Once the xlock is held by any client, all the non-xlocked clients need to wait for the xlock to be released, while the xlocked client can get the latest info from the projected inode (pi) directly without waiting.

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

Updated the patch; the new change will avoid batching the ops when any of the xlocks is held. Later, when the xlock is released, they can be batched again on the next try.

@leonid-s-usov
Contributor

I think it all starts in the early_reply. There's this code:

  // mark xlocks "done", indicating that we are exposing uncommitted changes.
  //
  //_rename_finish() does not send dentry link/unlink message to replicas.
  // so do not set xlocks on dentries "done", the xlocks prevent dentries
  // that have projected linkages from getting new replica.
  mds->locker->set_xlocks_done(mdr.get(), req->get_op() == CEPH_MDS_OP_RENAME);

The comment suggests that we should somehow expose projected changes. Now, the set_xlock_done method only resets the xlock_by field, but not the xlock_by_client field:

  void set_xlock_done() {
    ceph_assert(more()->xlock_by);
    ceph_assert(state == LOCK_XLOCK || is_locallock() ||
	   state == LOCK_LOCK /* if we are a peer */);
    if (!is_locallock())
      state = LOCK_XLOCKDONE;
    more()->xlock_by.reset();
  }

and that field is what causes pauth to be 1 (true) in encode_inodestat for only one of the clients:

  bool pauth = authlock.is_xlocked_by_client(client) || get_loner() == client;

That check doesn't comply with the comment in early_reply; maybe that comment is misleading or I misinterpret it. Anyway, by choosing to reply early we create this difference in how we respond to the same client vs other clients, so either we shouldn't respond early, or we should change encode_inodestat to expose the projected inode if the xlock is done.

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

That check doesn't comply with the comment in early_reply; maybe that comment is misleading or I misinterpret it. Anyway, by choosing to reply early we create this difference in how we respond to the same client vs other clients, so either we shouldn't respond early, or we should change encode_inodestat to expose the projected inode if the xlock is done.

I think the comment is misleading; it should be about exposing uncommitted changes between mdrs, not between clients.

@leonid-s-usov
Contributor

leonid-s-usov commented Apr 26, 2024

Because each getattr request will try to do rdlock, which will wait for the previous xlock to be released. When the xlock is released the pi will be popped.

The client that holds the xlock will be able to read (projected) immediately in the XLOCKDONE state, while the other will have to wait. But both will return the new projected value.

                      // stable     loner  rep state  r     rp   rd   wr   fwr  l    x    caps,other
    [LOCK_XLOCK]     = { LOCK_SYNC, false, LOCK_LOCK, 0,    XCL, 0,   0,   0,   0,   0,   0,0,0,0 },
    [LOCK_XLOCKDONE] = { LOCK_SYNC, false, LOCK_LOCK, XCL,  XCL, XCL, 0,   0,   XCL, 0,   0,0,CEPH_CAP_GSHARED,0 },
  bool can_read(client_t client) const {
    return get_sm()->states[state].can_read == ANY ||
      (get_sm()->states[state].can_read == AUTH && parent->is_auth()) ||
      (get_sm()->states[state].can_read == XCL && client >= 0 && get_xlock_by_client() == client);
  }

You are right, @lxbsz, preventing the batch should help. But I'm still not sure about the early_reply and what it meant by "exposing uncommitted changes". Is that just for the xlocking client to be able to rdlock?

@lxbsz
Member Author

lxbsz commented Apr 26, 2024

You are right, @lxbsz, preventing the batch should help. But I'm still not sure about the early_reply and what it meant by "exposing uncommitted changes". Is that just for the xlocking client to be able to rdlock?

   2   8492  mds/MDCache.cc <<GLOBAL>>
             !(dn->lock.is_xlocked() && dn->lock.get_xlock_by() == mdr)) {

The get_xlock_by() is called in path_traverse() and tries to expose the dn between mdrs. So my understanding is that it will try to expose the uncommitted changes between mdrs of the same client.

Comment on lines +4112 to +4115
if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked())) ||
    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked())) ||
    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked())) ||
    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked()))) {
Contributor

@leonid-s-usov leonid-s-usov Apr 26, 2024

We can reduce impact by making sure that the xlocker is never batched. This will allow batching multiple non-xlocker clients even if some of the xlocks are held, which should be safe as long as the xlocker isn't the batch head.

Suggested change
if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked())) ||
    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked())) ||
    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked())) ||
    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked()))) {

if (((mask & CEPH_CAP_LINK_SHARED) && (in->linklock.is_xlocked_by_client(client))) ||
    ((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked_by_client(client))) ||
    ((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked_by_client(client))) ||
    ((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked_by_client(client)))) {

Member Author

Yeah, this will be better.

Member Author

Fixed it and thanks @leonid-s-usov @gregsfortytwo

Contributor

Wait, please double check, my patch above is probably wrong, but the idea stays: don't batch when the xlocker client is the head.

Member Author

This is fine. Requests from the xlocker client won't be batched and won't even be set as batch heads.

With this change all the non-xlocker client requests can be batched, and the batch heads will acquire the rdlock and then wait.

mds: always make getattr wait for xlock to be released by the previous client

When the previous client's setattr request is still holding the xlock
for the linklock/authlock/xattrlock/filelock locks, if the same client
send a getattr request it will use the projected inode to fill the
reply, while for other clients the getattr requests will use the
non-projected inode to fill replies. This causes inconsistent file
mode across multiple clients.

This will just skip batching the ops when any of the xlocks is held.

Fixes: https://tracker.ceph.com/issues/63906
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Contributor

@leonid-s-usov leonid-s-usov left a comment

LGTM

Member

@gregsfortytwo gregsfortytwo left a comment

We are rapidly approaching the point where I want us to move the batching into a proper interface instead of open-coding it, but this LGTM for now. :)

@github-actions

github-actions bot commented Jul 1, 2024

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

@github-actions github-actions bot added the stale label Jul 1, 2024
@vshankar
Contributor

vshankar commented Jul 1, 2024

jenkins test api

@batrick
Member

batrick commented Jul 2, 2024

@batrick there's a FIXME in handle_client_getattr stemming from #27866 — do you have any idea what's going on there?

I don't know why Zheng put those there, sorry.

Member

@batrick batrick left a comment

Yet another example where synthetic cap requests would be really handy for testing.

((mask & CEPH_CAP_AUTH_SHARED) && (in->authlock.is_xlocked_by_client(client))) ||
((mask & CEPH_CAP_XATTR_SHARED) && (in->xattrlock.is_xlocked_by_client(client))) ||
((mask & CEPH_CAP_FILE_SHARED) && (in->filelock.is_xlocked_by_client(client)))) {
r = -1;
Member

There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:

client 1: issued pAx
client 2: getattr pAsLsXsFs issued pLxXx
client 3: getattr pAsLsXsFs issued p

client 2 will skip acquiring the linklock/xattrlock because it's issued LxXx. It will block on recall of Ax from client 1.

Client 2's getattr cannot be a batch head for client 3's getattr.

This all exposes a problem I believe exists with where we're constructing the batch head: this should be constructed only after this acquire_locks fails:

ceph/src/mds/Server.cc

Lines 4217 to 4218 in 2c16096

if (!mds->locker->acquire_locks(mdr, lov))
return;

Before that point, record whether we've skipped any locks due to issued caps or (for this particular bug) the lock is already xlocked by the client. The latter cannot be easily checked without looking at the inode's lock because the Locker state machine hides why a rdlock succeeds (in this case, the client has an xlock already).

Member Author

@lxbsz lxbsz Jul 3, 2024

There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:

client 1: issued pAx
client 2: getattr pAsLsXsFs issued pLxXx
client 3: getattr pAsLsXsFs issued p

@batrick Correct me if I am wrong here.

From mds/lock.c I can see that only the loner client can get the x caps. Could you point out a case where the x caps for different locks could be issued to different clients at the same time?

Could different locks have different loners at the same time?

Member

There is another bug I think. This won't handle the case where we're skipping e.g. a rdlock when the client is issued an exclusive cap, like Ax. Consider:
client 1: issued pAx
client 2: getattr pAsLsXsFs issued pLxXx
client 3: getattr pAsLsXsFs issued p

@batrick Correct me if I am wrong here.

From mds/lock.c I can see that only the loner client can get the x caps. Could you point out a case where the x caps for different locks could be issued to different clients at the same time?

You might be right; I'm not sure. In principle I don't see why it wouldn't be allowed but the state diagram suggests it is not.

Could different locks have different loners at the same time?

It doesn't look like it but it's worth checking.

I think the batch leader construction should still be moved however. And that "detail" of the loner client shouldn't be relied on for the checks in any case.

Member Author

You might be right; I'm not sure. In principle I don't see why it wouldn't be allowed but the state diagram suggests it is not.

Could different locks have different loners at the same time?

It doesn't look like it but it's worth checking.

I think the batch leader construction should still be moved however. And that "detail" of the loner client shouldn't be relied on for the checks in any case.

I can't remember ever seeing this case; I just checked some debug logs, such as the debug logs from:

ceph-post-file: fb9a96f9-5f6d-46b4-b1fa-8580928b2241

I didn't find any case doing this. Also, going through the MDS code, only when a loner is successfully set can the EXCL lock state be set, and all the locks in a CInode will be set to the same single loner.

Member

Then perhaps the better check is: "is this client the loner for the CInode? In that case, do not make it a batch head".

Let's move the batch code below this call to acquire_locks on failure:

ceph/src/mds/Server.cc

Lines 4217 to 4218 in 2c16096

if (!mds->locker->acquire_locks(mdr, lov))
return;

If we fail to acquire the locks, then make it the batch head if one does not exist. If a batch head does exist already, then drop locks and add it to the batch queue.

Member Author

Let's move the batch code below this call to acquire_locks on failure:

ceph/src/mds/Server.cc

Lines 4217 to 4218 in 2c16096

if (!mds->locker->acquire_locks(mdr, lov))
return;

If we fail to acquire the locks, then make it the batch head if one does not exist. If a batch head does exist already, then drop locks and add it to the batch queue.

Sure.

Checked the code carefully again; if my understanding is correct, this won't work as expected?

For example, suppose the first lookup request just wants to acquire the rdlock for the linklock and succeeds, and then later a second lookup request comes and also just wants the rdlock for the linklock. If both of these requests succeed there won't be any chance to batch them. Actually we should batch them, right?

Member

There are no further waits after that acquire_locks so I don't think so?

Member Author

Yeah, I think you're right here. Let me check it more.

Member Author

@lxbsz lxbsz Jul 4, 2024

@batrick If we move the batch code after acquire_locks, we should also adjust acquire_locks and the other callers to make sure they won't add the current request to any waiter, which would retry the request later, and then try to batch this request.

Member

Hm, that's right. Let's leave it this way then.


@vshankar
Contributor

vshankar commented Jul 5, 2024

This PR is under test in https://tracker.ceph.com/issues/66850.

@vshankar
Contributor

This PR is under test in https://tracker.ceph.com/issues/67089.

@vshankar
Contributor

I'm seeing some new failures in the branch which this PR is a part of. Trying to isolate the problematic change. Will update when done.

@vshankar
Contributor

vshankar commented Aug 2, 2024

This PR is under test in https://tracker.ceph.com/issues/67318.


joscollin pushed a commit to joscollin/ceph that referenced this pull request Aug 7, 2024
* refs/pull/56602/head:
	mds: always make getattr wait for xlock to be released by the previous client

Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
vshankar added a commit to vshankar/ceph that referenced this pull request Aug 20, 2024
* refs/pull/56602/head:
	mds: always make getattr wait for xlock to be released by the previous client

Reviewed-by: Greg Farnum <gfarnum@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Leonid Usov <leonid.usov@ibm.com>
Contributor

@vshankar vshankar left a comment


Labels

cephfs Ceph File System

6 participants