Skip to content

mds: do not evict clients if OSDs are laggy#49971

Merged
vshankar merged 7 commits intoceph:mainfrom
dparmar18:wip-58023
Jun 28, 2023
Merged

mds: do not evict clients if OSDs are laggy#49971
vshankar merged 7 commits intoceph:mainfrom
dparmar18:wip-58023

Conversation

@dparmar18
Copy link
Contributor

@dparmar18 dparmar18 commented Feb 2, 2023

Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar dparmar@redhat.com

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

@github-actions github-actions bot added the cephfs Ceph File System label Feb 2, 2023
@dparmar18 dparmar18 force-pushed the wip-58023 branch 2 times, most recently from aa08733 to 3bc1b53 Compare February 2, 2023 13:11
@dparmar18 dparmar18 force-pushed the wip-58023 branch 4 times, most recently from 403e72c to 6256dcf Compare February 6, 2023 12:09
@dparmar18 dparmar18 marked this pull request as ready for review February 6, 2023 12:10
@dparmar18 dparmar18 requested review from a team and batrick February 6, 2023 12:10
@dparmar18
Copy link
Contributor Author

dparmar18 commented Feb 6, 2023

@lxbsz @vshankar @batrick, I created a helper function to get the value of l_osdc_osd_laggy which is part of enum in Objecter.cc but had to define it in Objecter.cc instead of single line inline function in Objecter.h, I didn't want to expose it in .cc file since its just a getter function and would never undergo any change but I would need to do something like return logger->get(123262); //value of l_osdc_osd_laggy since I cannot access l_osdc_osd_laggy in .h and it also makes no sense to move the enum into .h since it would be compiled unnecessarily at the places Objecter is invoked. Do you feel this is okay or have any better suggestions?

EDIT: Not valid anymore, removed the code as objecter method won't be useful here #49971 (comment)

@dparmar18 dparmar18 force-pushed the wip-58023 branch 3 times, most recently from 6c78519 to c0c3eac Compare February 7, 2023 13:31
@dparmar18 dparmar18 marked this pull request as draft February 15, 2023 07:43
@github-actions github-actions bot added the core label Feb 16, 2023
@dparmar18 dparmar18 force-pushed the wip-58023 branch 2 times, most recently from f4b63be to c4bd7ba Compare February 17, 2023 10:48
Copy link
Member

@batrick batrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine (for now) with a simple "is any osd laggy" check but the MDS needs to be clear via a cluster health warning that this is not good. It means the MDS cannot evict any client for network partitions or fatal client error.

This needs a test which uses artificially laggy OSDs. You could simulate that by sending SIGSTOP to an OSD.

@dparmar18 dparmar18 force-pushed the wip-58023 branch 3 times, most recently from 40f0b7b to 08ebc36 Compare March 1, 2023 12:15
@dparmar18
Copy link
Contributor Author

I'm fine (for now) with a simple "is any osd laggy" check but the MDS needs to be clear via a cluster health warning that this is not good. It means the MDS cannot evict any client for network partitions or fatal client error.

This needs a test which uses artificially laggy OSDs. You could simulate that by sending SIGSTOP to an OSD.

I tried sending SIGSTOP, it marks the OSD down, doesn't make it laggy. I had a private conversation with @badone on this and he suggested to use CBT to put some load on the cluster or try running daemons under valgrind to slow them down a bit.

@batrick
Copy link
Member

batrick commented Mar 1, 2023

I think a config to turn off this behavior is also appropriate.

@github-actions
Copy link

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

dparmar18 added 2 commits May 15, 2023 15:59
Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
@dparmar18
Copy link
Contributor Author

last push - addressed most of venky' comments, rebase + resolved an conflict

@dparmar18
Copy link
Contributor Author

@vshankar ready from my side, PTAL and do let me know if any changes are needed

@dparmar18
Copy link
Contributor Author

Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor nit and this should be good for integration tests.

dparmar18 added 5 commits May 17, 2023 14:38
A client might get unresponsive/laggy due to laggy OSD(s).
This change provides us a way to defer client eviction in
such scenarios

also adds helpers:
- get_laggy_clients()
- clear_laggy_clients()

and call clear_laggy_clients() before calling related
Server methods

Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
using new MDS health metric

Fixes: https://tracker.ceph.com/issues/58023
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
Signed-off-by: Dhairya Parmar <dparmar@redhat.com>
@dparmar18
Copy link
Contributor Author

@dparmar18
Copy link
Contributor Author

dparmar18 commented May 17, 2023

"Is this review incorporated?" yes

@vshankar
Copy link
Contributor

jenkins test make check

@vshankar
Copy link
Contributor

jenkins test api

@dparmar18
Copy link
Contributor Author

jenkins test make check

@dparmar18
Copy link
Contributor Author

jenkins test api

@vshankar
Copy link
Contributor

I think this change passed fs suite test, but I need to revisit that since I went on PTO.

On it today/tomorrow.

@dparmar18
Copy link
Contributor Author

jenkins test make check

@dparmar18
Copy link
Contributor Author

jenkins test api

Copy link
Contributor

@vshankar vshankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vshankar
Copy link
Contributor

@batrick Rerequested review from you since you had proposed changes. Could you PTAL.

@vshankar
Copy link
Contributor

jenkins retest this please

@vshankar
Copy link
Contributor

@batrick Rerequested review from you since you had proposed changes. Could you PTAL.

Merging this in the interest of time. Please start a discussion if some changes are required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants