
mds: flush mdlog if locked and still has wanted caps not satisfied #45243

Merged
vshankar merged 3 commits into ceph:master from lxbsz:mdslog_bz
Apr 25, 2022

Conversation


@lxbsz lxbsz commented Mar 3, 2022

In _do_cap_update(), if a client is releasing its Fw caps the
relevant client range will be erased, and then new_max will be 0.
The MDS will then skip flushing the mdlog after submitting a journal
log entry, which keeps the wrlock held on the filelock.

So when a new client tries to open the file for reading, file_eval()
may be unable to change the lock state because the filelock is
wrlocked, while at the same time the filelock is in a stable state
such as EXCL or MIX. The MDS may skip flushing the mdlog in the
open-related code path too.

We need to flush the mdlog if any wanted caps still cannot be
satisfied and the filelock is still locked after file_eval().

Fixes: https://tracker.ceph.com/issues/54463
Signed-off-by: Xiubo Li xiubli@redhat.com
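The condition the description arrives at can be sketched as a small predicate. This is an illustrative sketch only, not the Ceph MDS API: the CAP_* names and should_flush_mdlog() below are hypothetical stand-ins for the real cap masks and SimpleLock state in mds/Locker.cc.

```cpp
#include <cassert>

// Hypothetical cap bits standing in for Ceph's CEPH_CAP_* masks.
enum CapBits { CAP_RD = 1, CAP_WR = 2, CAP_BUFFER = 4 };

// Flush the mdlog early when some client still wants caps it has not been
// issued AND the filelock is still held (e.g. wrlocked by a pending journal
// entry): the un-flushed journal entry is what keeps the lock from draining.
bool should_flush_mdlog(int wanted, int issued, bool filelock_locked) {
  bool unsatisfied = (wanted & ~issued) != 0;
  return unsatisfied && filelock_locked;
}
```

In the real code the wanted/issued masks would come from something like get_caps_issued(), and the lock check from is_wrlocked(), as seen later in the file_eval() excerpt in this thread.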

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@github-actions github-actions bot added the cephfs Ceph File System label Mar 3, 2022
@lxbsz lxbsz requested review from a team, gregsfortytwo, kotreshhr and vshankar March 3, 2022 14:04

vshankar commented Mar 3, 2022

@lxbsz This would need to be backported.


lxbsz commented Mar 3, 2022

@lxbsz This would need to be backported.

Sure, I have created one tracker: https://tracker.ceph.com/issues/54463


@gregsfortytwo gregsfortytwo left a comment


Hmm. This patches the specific case we saw but I'm not sure it's the right solution. An awful lot of the time when there are file size changes, the client doing the change is the only one who needs the information, so we don't need a log flush.

@kotreshhr says you came across this commit 6844bf6 as the likely source of the bug, and building off of that seems to me like a better general approach: identifying cases where cap flushes need to be committed immediately so we can grant caps to other clients. That way, we get the benefit of not adding more piecemeal required flushes to the MDS path that end up with us just flushing constantly all the time, while making sure we don't block cross-client IO.

Does that make sense? Am I missing something?


lxbsz commented Mar 4, 2022

Hmm. This patches the specific case we saw but I'm not sure it's the right solution. An awful lot of the time when there are file size changes, the client doing the change is the only one who needs the information, so we don't need a log flush.

@kotreshhr says you came across this commit 6844bf6 as the likely source of the bug, and building off of that seems to me like a better general approach: identifying cases where cap flushes need to be committed immediately so we can grant caps to other clients. That way, we get the benefit of not adding more piecemeal required flushes to the MDS path that end up with us just flushing constantly all the time, while making sure we don't block cross-client IO.

Does that make sense? Am I missing something?

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?

Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

@gregsfortytwo

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.

In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change.
Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.
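The max_size growth heuristic described here can be sketched roughly as follows. The numbers and the one-quarter threshold are assumptions for illustration, not the actual Client/MDS logic:

```cpp
#include <cassert>
#include <cstdint>

// Toy model: when the writer gets close to its committed max_size, request a
// doubled bound. The journaled max_size limits crash-recovery work, and
// doubling keeps renegotiation (and hence journal flushes) infrequent.
uint64_t next_max_size(uint64_t current_max, uint64_t write_pos) {
  // "close enough" here means within the last quarter of the granted range
  // (an assumed threshold).
  if (write_pos + current_max / 4 >= current_max)
    return current_max * 2;  // ask for a doubled, journaled bound
  return current_max;        // plenty of headroom, no renegotiation needed
}
```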

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?

Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

I haven't looked at the specific states the locks were in for this case; did it not transition into an unstable state when it started recalling the loner caps? That's what I naively would have expected to happen.


lxbsz commented Mar 4, 2022

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

Yeah, that's what I think it should be too.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.

In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change. Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.

Okay, sounds reasonable. So in this case, without commit (0bca941), we need to figure out whether there are other cases that could cause the same issue besides the one below. I need to check the code more carefully, but I'm not sure I will find them all.

I am okay with abandoning commit (0bca941) if you think that will be fine.

I think max_size is more likely to increase; a decrease should be less common and IMO happens in the truncate case. Did I miss something here?
Here it is a little different: the ifile lock is in the stable state ifile excl w=1, but with the wrlock held. The current code will only try to flush the mdlog when the ifile lock is unstable and is locked...

I haven't looked at the specific states the locks were in for this case; did it not transition into an unstable state when it started recalling the loner caps? That's what I naively would have expected to happen.

The problem here is that before the new client requested to open the file, only one client was operating on this file, and it was the loner client. The loner client stays in the EXCL stable lock state and writes the journal log while holding the wrlock of the ifile lock; that's why we see the ifile lock in the ifile excl w=1 state. Since the loner client issued no new requests after that, there was no chance to nudge the log in time.

Later, when a new client came to open this file and the locks were evaluated, the w=1 blocked the lock state change, and the MDS skipped nudging the mdlog at line 5429 below:

void Locker::file_eval(ScatterLock *lock, bool *need_issue) 
{
  ...
5398   // excl -> *?
5399   if (lock->get_state() == LOCK_EXCL) {
5400     dout(20) << " is excl" << dendl;
5401     int loner_issued, other_issued, xlocker_issued;
5402     in->get_caps_issued(&loner_issued, &other_issued, &xlocker_issued, CEPH_CAP_SFILE);
5403     dout(7) << "file_eval loner_issued=" << gcap_string(loner_issued)
5404             << " other_issued=" << gcap_string(other_issued)
5405             << " xlocker_issued=" << gcap_string(xlocker_issued)
5406             << dendl;
5407     if (!((loner_wanted|loner_issued) & (CEPH_CAP_GEXCL|CEPH_CAP_GWR|CEPH_CAP_GBUFFER)) ||
5408         (other_wanted & (CEPH_CAP_GEXCL|CEPH_CAP_GWR|CEPH_CAP_GRD)) ||
5409         (in->is_dir() && in->multiple_nonstale_caps())) {  // FIXME.. :/
5410       dout(20) << " should lose it" << dendl;
5411       // we should lose it.
5412       //  loner  other   want
5413       //  R      R       SYNC
5414       //  R      R|W     MIX
5415       //  R      W       MIX
5416       //  R|W    R       MIX
5417       //  R|W    R|W     MIX
5418       //  R|W    W       MIX
5419       //  W      R       MIX
5420       //  W      R|W     MIX
5421       //  W      W       MIX
5422       // -> any writer means MIX; RD doesn't matter.
5423       if (((other_wanted|loner_wanted) & CEPH_CAP_GWR) ||
5424           lock->is_waiter_for(SimpleLock::WAIT_WR)) {
5425         scatter_mix(lock, need_issue);
5426       } else if (!lock->is_wrlocked()) {  // let excl wrlocks drain first
5427         simple_sync(lock, need_issue);
5428       } else {
5429         dout(10) << " waiting for wrlock to drain" << dendl;
5430       }
5431     }
5432   }
...
}

And from the log file we can see the relevant logs:

 1776 2022-03-02T15:14:26.040874368Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 20 mds.0.locker  is excl
 1777 2022-03-02T15:14:26.040880340Z debug 2022-03-02 15:14:26.038 7f54e5fcb700  7 mds.0.locker file_eval loner_issued=sc other_issued= xlocker_issued=
 1778 2022-03-02T15:14:26.040880340Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 20 mds.0.locker  should lose it
 1779 2022-03-02T15:14:26.040887640Z debug 2022-03-02 15:14:26.038 7f54e5fcb700 10 mds.0.locker  waiting for wrlock to drain 

My second commit (a591756) will fix this.


lxbsz commented Mar 4, 2022

jenkins retest this please

@lxbsz lxbsz force-pushed the mdslog_bz branch 2 times, most recently from 1cb9904 to 8053b52 Compare March 4, 2022 08:58

ljflores commented Mar 5, 2022

jenkins test make check



lxbsz commented Mar 7, 2022

Yeah, it makes sense. I was thinking the same thing, and I am still reading the Locker-related code. I will update the PR once I find it. But IMO this patch still makes sense, because currently the mdlog is flushed only when the size increases, so why not when it decreases?

So, we could just commit to disk on every single client op. We don't do that for all the usual batching reasons: doing individual IOs is wildly less efficient, results in less total throughput, and frequently higher average latencies than batching them up does.

Yeah, that's what I think it should be too.

So the question is, when do we not batch? And the answer is: we do immediate commits when we block something that a client is waiting on.
In this case, a client who asks to reduce the size is only waiting for the metadata commit; when you do a truncate or change the size downwards, it never blocks IO for the client doing the size change. Whereas clients have a max_size they are not allowed to exceed under any circumstances, and that must be committed to disk before it can be changed — this is how we bound our recovery effort if the client crashes with write caps on the file. (I don't remember the exact heuristic for changing max_size, but Client does something like double the max size every time they get close enough to the current max size.) And if the client asks for a max_size increase, it's possible-to-likely that any delay in getting that permission is blocking IO at the end of the file.

Okay, sounds reasonable. So in this case, without commit (0bca941), we need to figure out whether there are other cases that could cause the same issue besides the one below. I need to check the code more carefully, but I'm not sure I will find them all.

I am okay with abandoning commit (0bca941) if you think that will be fine.

I have removed this commit; the mdlog is now flushed when needed in commit (0bc5fa7).

@lxbsz lxbsz requested a review from gregsfortytwo March 7, 2022 08:29

@gregsfortytwo gregsfortytwo left a comment


I really like this direction! 👍 Just a style guide nit and I think the third patch is no longer necessary?

int Locker::issue_caps(CInode *in, Capability *only_cap)
int Locker::get_allowed_caps(CInode *in, Capability *cap,
int &all_allowed, int &loner_allowed,
int &xlocker_allowed)

I know the Ceph codebase is inconsistent about this for historical reasons, but I believe the style guide says references should be const, and out parameters should all be pointers. We do try and stick to that convention for new and updated code.
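For illustration, the convention referred to here looks roughly like this on a generic function (the struct and names below are made up for the example, not the actual Locker signature):

```cpp
#include <cassert>

struct Caps { int all = 0; int loner = 0; };

// Style-guide shape: read-only inputs as const references, out parameters as
// pointers, so mutation is visible at the call site (&all_allowed).
void get_allowed(const Caps& issued, int* all_allowed, int* loner_allowed) {
  *all_allowed = issued.all;
  *loner_allowed = issued.loner;
}
```

At the call site, `get_allowed(caps, &all, &loner)` makes it obvious which arguments may be modified.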


Sure, will update it.

simple_sync(lock, need_issue);
} else {
dout(10) << " waiting for wrlock to drain" << dendl;
mds->mdlog->flush();

Isn't this commit made irrelevant by flushing the log on the conflicting caps in previous patches?


Yeah, I will drop this change.


lxbsz commented Mar 17, 2022

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

@gregsfortytwo

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

Yep, sounds good. You should also update the PR description and tracker ticket to make it clearer that this is about caps getting stuck, not specifically about the size changing to 0.


lxbsz commented Mar 17, 2022

I really like this direction! +1 Just a style guide nit and I think the third patch is no longer necessary?

Yeah, the second one already covers it; for our issue it's not necessary. So if the second commit looks okay, I will drop this one.

Yep, sounds good. You should also update the PR description and tracker ticket to make it clearer that this is about caps getting stuck, not specifically about the size changing to 0.

Sure. Will do that. Thanks @gregsfortytwo

lxbsz added 3 commits March 17, 2022 13:00
When lock_state_any is true, the lock state will be ignored.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
In _do_cap_update(), if a client is releasing its Fw caps the
relevant client range will be erased, and then new_max will be 0.
The MDS will then skip flushing the mdlog after submitting a journal
log entry, which keeps the wrlock held on the filelock.

So when a new client tries to open the file for reading, file_eval()
may be unable to change the lock state because the filelock is
wrlocked, while at the same time the filelock is in a stable state
such as EXCL or MIX. The MDS may skip flushing the mdlog in the
open-related code path too.

We need to flush the mdlog if any wanted caps still cannot be
satisfied and the filelock is still locked after file_eval().

Fixes: https://tracker.ceph.com/issues/54463
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
@lxbsz lxbsz changed the title from "mds: flush the mdlog when file size changes" to "@lxbsz mds: flush mdlog if locked and still has wanted caps not satisfied" Mar 17, 2022
@lxbsz lxbsz changed the title from "@lxbsz mds: flush mdlog if locked and still has wanted caps not satisfied" to "mds: flush mdlog if locked and still has wanted caps not satisfied" Mar 17, 2022
