mds: do not evict clients if OSDs are laggy #49971
@lxbsz @vshankar @batrick, I created a helper function to get the value of EDIT: Not valid anymore; I removed the code because the objecter method won't be useful here (see #49971 (comment)).
batrick left a comment
I'm fine (for now) with a simple "is any OSD laggy" check, but the MDS needs to make clear via a cluster health warning that this is not good: it means the MDS cannot evict any client for network partitions or fatal client errors.
This needs a test which uses artificially laggy OSDs. You could simulate that by sending SIGSTOP to an OSD.
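The check and warning described in this comment can be sketched roughly as follows. This is a minimal Python model, not the MDS code from the PR: the function name `plan_evictions`, the client and OSD identifiers, and the warning text are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def plan_evictions(
    stale_clients: Iterable[str], laggy_osds: Iterable[int]
) -> Tuple[List[str], List[str]]:
    """Return (clients_to_evict, health_warnings).

    Hypothetical model of the "is any OSD laggy" check: if at least one
    OSD is laggy, no client is evicted and a warning is emitted instead,
    so the operator can see that eviction is currently blocked.
    """
    stale = list(stale_clients)
    laggy = sorted(laggy_osds)

    if not laggy:
        # Normal path: nothing is laggy, so stale clients can be evicted.
        return stale, []

    if not stale:
        # Laggy OSDs but nothing to evict: no warning needed in this sketch.
        return [], []

    # Deferred path: keep the clients and surface the situation.
    warning = (
        f"{len(stale)} client(s) not evicted because OSD(s) "
        f"{laggy} are laggy"
    )
    return [], [warning]
```

In this model, a single laggy OSD is enough to hold back every eviction, which is why the warning matters: without it, the deferral would be invisible to the operator.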
I tried sending SIGSTOP; it marks the OSD down rather than making it laggy. I had a private conversation with @badone about this, and he suggested using CBT to put some load on the cluster, or running the daemons under valgrind to slow them down a bit.
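For reference, the SIGSTOP approach that was tried looks roughly like this in Python. The sketch uses a sleeping child process as a stand-in for an OSD daemon (an assumption; it does not touch a real cluster) and the helper name `paused_child_is_alive` is hypothetical. It relies on POSIX signals, so it will not run on Windows.

```python
import os
import signal
import subprocess
import sys


def paused_child_is_alive() -> bool:
    """Pause a stand-in daemon with SIGSTOP and report whether it still exists."""
    # Stand-in for an OSD daemon: a child process that only sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        # SIGSTOP cannot be caught or ignored; the process is frozen
        # entirely until it receives SIGCONT.
        os.kill(child.pid, signal.SIGSTOP)
        # poll() returns None while the process has not exited, which is
        # still the case for a stopped process.
        alive = child.poll() is None
        os.kill(child.pid, signal.SIGCONT)
        return alive
    finally:
        # Always clean up the child, even if a signal call failed.
        child.kill()
        child.wait()
```

A stopped process does no work at all rather than doing work slowly, which would be consistent with the observation above that the OSD ends up marked down instead of laggy.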
I think a config to turn off this behavior is also appropriate.
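A switch like the one suggested here could gate the deferral along these lines. This is only a sketch: the option name `defer_eviction_on_laggy_osds` and the `MdsSettings` class are made up for illustration, since the thread does not name the actual option.

```python
from dataclasses import dataclass


@dataclass
class MdsSettings:
    # Hypothetical option name; the real config key is not given in this thread.
    defer_eviction_on_laggy_osds: bool = True


def should_defer_eviction(settings: MdsSettings, any_osd_laggy: bool) -> bool:
    """Defer eviction only if the feature is enabled and some OSD is laggy."""
    return settings.defer_eviction_on_laggy_osds and any_osd_laggy
```

With the option turned off, laggy OSDs no longer hold back eviction, which restores the previous behavior for operators who prefer it.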
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Fixes: https://tracker.ceph.com/issues/58023 Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
Last push: addressed most of Venky's comments, rebased, and resolved a conflict.
@vshankar ready from my side; PTAL and do let me know if any changes are needed.
Last push fixed some QA issues: https://github.com/ceph/ceph/compare/cb0d89414c9b84895576e8987513a7c144c327ed..acef638f6df59ef4228140c286301ca04c3f8641. The run went fine: http://pulpito.front.sepia.ceph.com/dparmar-2023-05-16_12:53:45-fs:functional-wip-58023-distro-default-smithi/ (the only failure is an infra issue).
vshankar left a comment
LGTM. Minor nit and this should be good for integration tests.
A client might get unresponsive/laggy due to laggy OSD(s). This change provides a way to defer client eviction in such scenarios.
It also adds helpers:
- get_laggy_clients()
- clear_laggy_clients()
and calls clear_laggy_clients() before calling related Server methods.
Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
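The helper pattern in this commit message can be modelled roughly as below. Only the names `get_laggy_clients()` and `clear_laggy_clients()` come from the commit message; the `SessionTracker` class, the `find_idle_sessions` method, the `tick` function, and the idle-time threshold are assumptions standing in for the MDS internals.

```python
from typing import Dict, Set


class SessionTracker:
    """Hypothetical holder for the set of clients currently considered laggy."""

    def __init__(self) -> None:
        self._laggy: Set[str] = set()

    def get_laggy_clients(self) -> Set[str]:
        # Return a copy so callers cannot mutate the internal set.
        return set(self._laggy)

    def clear_laggy_clients(self) -> None:
        self._laggy.clear()

    def find_idle_sessions(
        self, idle_seconds: Dict[str, float], timeout: float
    ) -> None:
        # Stand-in for a "related Server method": records every client
        # whose idle time exceeds the timeout.
        for client, idle in idle_seconds.items():
            if idle > timeout:
                self._laggy.add(client)


def tick(
    tracker: SessionTracker, idle_seconds: Dict[str, float], timeout: float
) -> Set[str]:
    # Clear first, so the set reflects only the current pass and clients
    # that have recovered are not reported again.
    tracker.clear_laggy_clients()
    tracker.find_idle_sessions(idle_seconds, timeout)
    return tracker.get_laggy_clients()
```

Clearing before each pass is what keeps a recovered client from lingering in the laggy set; whether the real code clears at exactly this point is not stated beyond the commit message.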
using new MDS health metric
Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
"Is this review incorporated?" Yes.
jenkins test make check
jenkins test api
jenkins test make check
jenkins test api
I think this change passed the fs suite tests, but I need to revisit that since I went on PTO. On it today/tomorrow.
jenkins test make check
jenkins test api
@batrick I re-requested a review from you since you had proposed changes. Could you PTAL?
jenkins retest this please
Merging this in the interest of time. Please start a discussion if some changes are required.
Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>