
mds: flush mdlog if locked and still has wanted caps not satisfied #45243

Merged
vshankar merged 3 commits into ceph:master from lxbsz:mdslog_bz
Apr 25, 2022

Conversation


@lxbsz lxbsz commented Mar 3, 2022

In _do_cap_update(), if a client is releasing its Fw caps the
relevant client range will be erased, and then new_max will be 0.
The MDS will then skip flushing the mdlog after submitting a journal
log entry, which keeps the wrlock held on the filelock.

So when a new client tries to open the file for reading, file_eval()
may be unable to change the lock state because the filelock is
wrlocked, while at the same time the filelock is in a stable state
such as EXCL or MIX. The MDS may skip flushing the mdlog in the
open-related code path too.

We need to flush the mdlog if any wanted caps still cannot be
satisfied and the filelock is still locked after file_eval().

Fixes: https://tracker.ceph.com/issues/54463
Signed-off-by: Xiubo Li xiubli@redhat.com
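The condition the description arrives at can be sketched as a small predicate. This is an illustrative sketch only, not the Ceph MDS API: the CAP_* names and should_flush_mdlog() below are hypothetical stand-ins for the real cap masks and SimpleLock state in mds/Locker.cc.

```cpp
#include <cassert>

// Hypothetical cap bits standing in for Ceph's CEPH_CAP_* masks.
enum CapBits { CAP_RD = 1, CAP_WR = 2, CAP_BUFFER = 4 };

// Flush the mdlog early when some client still wants caps it has not been
// issued AND the filelock is still held (e.g. wrlocked by a pending journal
// entry): the un-flushed journal entry is what keeps the lock from draining.
bool should_flush_mdlog(int wanted, int issued, bool filelock_locked) {
  bool unsatisfied = (wanted & ~issued) != 0;
  return unsatisfied && filelock_locked;
}
```

In the real code the wanted/issued masks would come from something like get_caps_issued(), and the lock check from is_wrlocked(), as seen later in the file_eval() excerpt in this thread.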

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@github-actions github-actions bot added the cephfs Ceph File System label Mar 3, 2022
@lxbsz lxbsz requested review from a team, gregsfortytwo, kotreshhr and vshankar March 3, 2022 14:04

vshankar commented Mar 3, 2022

@lxbsz This would need to be backported.


lxbsz commented Mar 3, 2022

@lxbsz This would need to be backported.

Sure, I have created one tracker: https://tracker.ceph.com/issues/54463


@gregsfortytwo gregsfortytwo left a comment


Hmm. This patches the specific case we saw but I'm not sure it's the right solution. An awful lot of the time when there are file size changes, the client doing the change is the only one who needs the information, so we don't need a log flush.

@kotreshhr says you came across this commit 6844bf6 as the likely source of the bug, and building off of that seems to me like a better general approach: identifying cases where cap flushes need to be committed immediately so we can grant caps to other clients. That way, we get the benefit of not adding more piecemeal required flushes to the MDS path that end up with us just flushing constantly all the time, while making sure we don't block cross-client IO.

Does that make sense? Am I missing something?


lxbsz commented Mar 4, 2022

Hmm. This patches the specific case we saw but I'm not sure it's the right solution. An awful lot of the time when there are file size changes, the client doing the change is the only one who needs the information, so we don't need a log flush.

@kotreshhr says you came across this commit 6844bf6 as the likely source of the bug, and building off of that seems to me like a better general approach: identifying cases where cap flushes need to be committed immediately so we can grant caps to other clients. That way, we get the benefit of not adding more piecemeal required flushes to the MDS path that end up with us just flushing constantly all the time, while making sure we don't block cross-client IO.

Does that make sense? Am I missing something?

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?

Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

@gregsfortytwo

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.

In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change.
Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.
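The max_size growth heuristic described here can be sketched roughly as follows. The numbers and the one-quarter threshold are assumptions for illustration, not the actual Client/MDS logic:

```cpp
#include <cassert>
#include <cstdint>

// Toy model: when the writer gets close to its committed max_size, request a
// doubled bound. The journaled max_size limits crash-recovery work, and
// doubling keeps renegotiation (and hence journal flushes) infrequent.
uint64_t next_max_size(uint64_t current_max, uint64_t write_pos) {
  // "close enough" here means within the last quarter of the granted range
  // (an assumed threshold).
  if (write_pos + current_max / 4 >= current_max)
    return current_max * 2;  // ask for a doubled, journaled bound
  return current_max;        // plenty of headroom, no renegotiation needed
}
```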

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?

Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

I haven't looked at the specific states the locks were in for this case; did it not transition into an unstable state when it started recalling the loner caps? That's what I naively would have expected to happen.


lxbsz commented Mar 4, 2022

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

Yeah, that's what I think it should be too.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.

In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change. Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.

Okay, sounds reasonable. So in this case, without commit (0bca941), we need to figure out whether there are other cases that could cause the same issue besides the one below. I need to check the code more carefully, but I'm not sure I will find them all.

I am okay with abandoning commit (0bca941) if you think that will be fine.

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?
Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

I haven't looked at the specific states the locks were in for this case; did it not transition into an unstable state when it started recalling the loner caps? That's what I naively would have expected to happen.

The problem here is that before the new client requested to open the file, only one client was operating on this file, and it was the loner client. The loner client stays in the EXCL stable lock state and writes the journal log while holding the wrlock of the ifile lock; that's why we see the ifile lock in the ifile excl w=1 state. Since the loner client issued no new requests after that, there was no chance to nudge the log in time.

Later, when a new client came to open this file and the locks were evaluated, the w=1 blocked the lock state change, and the MDS skipped nudging the mdlog at line 5429 below:

void Locker::file_eval(ScatterLock *lock, bool *need_issue) 
{
  ...
5398   // excl -> *?
5399   if (lock->get_state() == LOCK_EXCL) {
5400     dout(20) << " is excl" << dendl;
5401     int loner_issued, other_issued, xlocker_issued;
5402     in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued, CEPH_CAP_SFILE);
5403     dout(7) << "file_eval loner_issued=" << gcap_string(loner_issued)
5404             << " other_issued=" << gcap_string(other_issued)
5405             << " xlocker_issued=" << gcap_string(xlocker_issued)
5406             << dendl;
5407     if (!((loner_wanted|loner_issued) & (CEPH_CAP_GEXCL|CEPH_CAP_GWR|CEPH_CAP_GBUFFER)) ||
5408         (other_wanted & (CEPH_CAP_GEXCL|CEPH_CAP_GWR|CEPH_CAP_GRD)) ||
5409         (in->is_dir() && in->multiple_nonstale_caps())) {  // FIXME.. :/
5410       dout(20) << " should lose it" << dendl;
5411       // we should lose it.
5412       //  loner  other   want
5413       //  R      R       SYNC
5414       //  R      R|W     MIX
5415       //  R      W       MIX
5416       //  R|W    R       MIX
5417       //  R|W    R|W     MIX
5418       //  R|W    W       MIX
5419       //  W      R       MIX
5420       //  W      R|W     MIX
5421       //  W      W       MIX
5422       // -> any writer means MIX; RD doesn't matter.
5423       if (((other_wanted|loner_wanted) & CEPH_CAP_GWR) ||
5424           lock->is_waiter_for(SimpleLock::WAIT_WR)) {
5425         scatter_mix(lock, need_issue);
5426       } else if (!lock->is_wrlocked()) {  // let excl wrlocks drain first
5427         simple_sync(lock, need_issue);
5428       } else {
5429         dout(10) << " waiting for wrlock to drain" << dendl;
5430       }
5431     }
5432   }
...
}

And from the log file we can see the relevant logs:

 1776 2022-03-02T15:14:26.040874368Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 20 mds.0.locker  is excl
 1777 2022-03-02T15:14:26.040880340Z debug 2022-03-02 15:14:26.038 7f54e5fcb700  7 mds.0.locker file_eval loner_issued=sc other_issued= xlocker_issued=
 1778 2022-03-02T15:14:26.040880340Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 20 mds.0.locker  should lose it
 1779 2022-03-02T15:14:26.040887640Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 10 mds.0.locker  waiting for wrlock to drain 

My second commit (a591756) will fix this.


lxbsz commented Mar 4, 2022

jenkins retest this please

@lxbsz lxbsz force-pushed the mdslog_bz branch 2 times, most recently from 1cb9904 to 8053b52 Compare March 4, 2022 08:58

ljflores commented Mar 5, 2022

jenkins test make check



lxbsz commented Mar 7, 2022

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

Yeah, that's what I think it should be too.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.
In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change. Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.

Okay, sounds reasonable. So in this case, without commit (0bca941), we need to figure out whether there are other cases that could cause the same issue besides the one below. I need to check the code more carefully, but I'm not sure I will find them all.

I am okay with abandoning commit (0bca941) if you think that will be fine.

I have removed this commit; the mdlog is now flushed when needed in commit (0bc5fa7).

@lxbsz lxbsz requested a review from gregsfortytwo March 7, 2022 08:29

@gregsfortytwo gregsfortytwo left a comment


I really like this direction! 👍 Just a style guide nit and I think the third patch is no longer necessary?

int Locker::issue_caps(CInode *in, Capability *only_cap)
int Locker::get_allowed_caps(CInode *in, Capability *cap,
int &all_allowed, int &loner_allowed,
int &xlocker_allowed)

I know the Ceph codebase is inconsistent about this for historical reasons, but I believe the style guide says references should be const, and out parameters should all be pointers. We do try and stick to that convention for new and updated code.
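For illustration, the convention referred to here looks roughly like this on a generic function (the struct and names below are made up for the example, not the actual Locker signature):

```cpp
#include <cassert>

struct Caps { int all = 0; int loner = 0; };

// Style-guide shape: read-only inputs as const references, out parameters as
// pointers, so mutation is visible at the call site (&all_allowed).
void get_allowed(const Caps& issued, int* all_allowed, int* loner_allowed) {
  *all_allowed = issued.all;
  *loner_allowed = issued.loner;
}
```

At the call site, `get_allowed(caps, &all, &loner)` makes it obvious which arguments may be modified.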


Sure, will update it.

simple_sync(lock, need_issue);
} else {
dout(10) << " waiting for wrlock to drain" << dendl;
mds->mdlog->flush();

Isn't this commit made irrelevant by flushing the log on the conflicting caps in previous patches?


Yeah, I will drop this change.


lxbsz commented Mar 17, 2022

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

@gregsfortytwo

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

Yep, sounds good. You should also update the PR description and tracker ticket to make it clearer that this is about caps getting stuck, not specifically about the size changing to 0.


lxbsz commented Mar 17, 2022

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

Yep, sounds good. You should also update the PR description and tracker ticket to make it clearer that this is about caps getting stuck, not specifically about the size changing to 0.

Sure. Will do that. Thanks @gregsfortytwo

lxbsz added 3 commits March 17, 2022 13:00
When lock_state_any is true, the lock state will be ignored.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
In _do_cap_update(), if a client is releasing its Fw caps the
relevant client range will be erased, and then new_max will be 0.
The MDS will then skip flushing the mdlog after submitting a journal
log entry, which keeps the wrlock held on the filelock.

So when a new client tries to open the file for reading, file_eval()
may be unable to change the lock state because the filelock is
wrlocked, while at the same time the filelock is in a stable state
such as EXCL or MIX. The MDS may skip flushing the mdlog in the
open-related code path too.

We need to flush the mdlog if any wanted caps still cannot be
satisfied and the filelock is still locked after file_eval().

Fixes: https://tracker.ceph.com/issues/54463
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@lxbsz lxbsz changed the title from "mds: flush the mdlog when file size changes" to "@lxbsz mds: flush mdlog if locked and still has wanted caps not satisfied" Mar 17, 2022
@lxbsz lxbsz changed the title from "@lxbsz mds: flush mdlog if locked and still has wanted caps not satisfied" to "mds: flush mdlog if locked and still has wanted caps not satisfied" Mar 17, 2022
