Skip to content

osd: Cancel in-progress scrubs (not user requested)#35909

Merged
neha-ojha merged 5 commits intoceph:masterfrom
dzafman:wip-46275
Jul 24, 2020
Merged

osd: Cancel in-progress scrubs (not user requested)#35909
neha-ojha merged 5 commits intoceph:masterfrom
dzafman:wip-46275

Conversation

@dzafman
Copy link
Contributor

@dzafman dzafman commented Jul 3, 2020

Currently, after an aborted scrub the next scrub is stuck. The PG::abort_scrub() isn't sufficient even though it is based on the existing code that aborts scrub when pg changes out from under scrub.

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard backend
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@dzafman dzafman force-pushed the wip-46275 branch 2 times, most recently from 0522dc0 to 39fa981 Compare July 3, 2020 03:33
@dzafman
Copy link
Contributor Author

dzafman commented Jul 3, 2020

@neha-ojha I need to figure out how to clean up properly when aborting a scrub.

@dzafman dzafman requested a review from neha-ojha July 3, 2020 03:36
@dzafman dzafman marked this pull request as ready for review July 6, 2020 20:37
@dzafman
Copy link
Contributor Author

dzafman commented Jul 6, 2020

jenkins test make check

@dzafman
Copy link
Contributor Author

dzafman commented Jul 6, 2020

@neha-ojha I manually tested the following:

After an aborted scrub, with noscrub still set do a user requested scrub, observe it start and finish
After an aborted scrub, unset noscrub, observe it start and finish (scrub was due because "ceph tell 2.0 scrub")

@dzafman
Copy link
Contributor Author

dzafman commented Jul 6, 2020

jenkins test dashboard backend

Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, needs a test in the lines of #35909 (comment)

@tchaikov
Copy link
Contributor

tchaikov commented Jul 8, 2020

i am seeing some "Exiting scrub checking -- not all pgs scrubbed." failures in https://pulpito.ceph.com/kchai-2020-07-07_06:01:29-rados-wip-kefu-testing-2020-07-07-1058-distro-basic-smithi/ . will drop this change from my batch and rerun.

https://pulpito.ceph.com/kchai-2020-07-08_03:24:15-rados-wip-kefu-testing-2020-07-07-1058-distro-basic-smithi/ still has lots of failure. so it's not related.

@dzafman dzafman added DNM and removed needs-test labels Jul 9, 2020
@dzafman dzafman marked this pull request as draft July 9, 2020 03:23
@dzafman dzafman removed the DNM label Jul 9, 2020
@dzafman
Copy link
Contributor Author

dzafman commented Jul 9, 2020

@neha-ojha The now included functional tests pass, but other scrub test are failing with this change. Need to investigate why.

@dzafman dzafman marked this pull request as ready for review July 18, 2020 04:14
@dzafman
Copy link
Contributor Author

dzafman commented Jul 19, 2020

TESTING PASSED

https://pulpito.ceph.com/dzafman-2020-07-18_09:43:13-rados-wip-zafman-testing-distro-basic-smithi

@neha-ojha other unrelated errors on fairly recent master:
5237793 mon thrashing error
5237859 Promise failed cephadm related
5237910 ceph_test_rados crash src/test/osd/RadosModel.h: 1397: FAILED ceph_assert(version == old_value.version)
5238009 status 1: ''sudo yum -y install ceph-mgr-diskprediction-local'''
5238057 5238090 ceph_mons timeout in qa/tasks/cephadm.py
Multiple "Test failure: test_get_export (tasks.mgr.dashboard.test_ganesha.GaneshaTest)" probably fixed by #36091

5238049 Failure in qa/tasks/repair_test.py caused by this change because noscrub and nodeep-scrub set when repair request initiates another scrub to verfy the repair. Testing fix 47792d2

PASSED single jobs
https://pulpito.ceph.com/dzafman-2020-07-19_08:59:11-rados-wip-zafman-testing-distro-basic-smithi/
IN PROGRESS running the same test 10 times
https://pulpito.ceph.com/dzafman-2020-07-19_10:06:33-rados-wip-zafman-testing-distro-basic-smithi/

@dzafman
Copy link
Contributor Author

dzafman commented Jul 19, 2020

@neha-ojha One issue remains. I'm concerned with the auto repair feature which can causes scrub -> deep-scrub -> repair -> deep-scrub. If the original scrub is user requested then all subsequent scrubs should not be blocked by noscrub, nodeep-scrub.

@dzafman
Copy link
Contributor Author

dzafman commented Jul 19, 2020

@dzafman dzafman marked this pull request as draft July 20, 2020 18:40
@dzafman dzafman requested a review from jdurgin July 20, 2020 18:40
@dzafman dzafman marked this pull request as ready for review July 20, 2020 18:40
@dzafman
Copy link
Contributor Author

dzafman commented Jul 21, 2020

@neha-ojha I forgot that auto repair only applies to scheduled scrubs. So any auto repair can be interrupted by noscrub or nodeep-scrub. There is no change needed to handle existing auto repair functionality.

src/osd/PG.cc Outdated
{
// Since repair is only by request and we need to scrub afterward
// treat the same as req_scrub.
if (!scrubber.req_scrub && !scrubber.check_repair) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the check_repair check based on #35909 (comment)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the check_repair check based on #35909 (comment)?

It was discovered by "tasks/repair_test" teuthology test. Since a direct repair does a check_repair (do a deep scrub) it should NOT be affected by noscrub/nodeep-scrub. Of course, now that I think about it, maybe if nodeep-scrub is set after an auto repair already happened and it set check_repair that deep-scrub would NOT be aborted. It could be argued that if you have come that far and completed an auto repair, you should go ahead and finish all the clean-up. On the other hand the user may want all schedule scrubbing related activity to stop. immediately.

Copy link
Member

@neha-ojha neha-ojha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commits look good to me. Let's squash and add release notes.

@dzafman
Copy link
Contributor Author

dzafman commented Jul 24, 2020

The latest commits look good to me. Let's squash and add release notes.

Release note added to PendingReleaseNotes in commit to be squashed.

@dzafman
Copy link
Contributor Author

dzafman commented Jul 24, 2020

Another rados run passed:
https://pulpito.ceph.com/dzafman-2020-07-23_16:48:07-rados-wip-zafman-testing-distro-basic-smithi/

Same unrelated failures as before

dzafman added 5 commits July 24, 2020 11:40
Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
This change adds new scrubber.req_scrub to track user
requested scrubs, deep_scrub or repair.

Fixes: https://tracker.ceph.com/issues/46275

Signed-off-by: David Zafman <dzafman@redhat.com>
Signed-off-by: David Zafman <dzafman@redhat.com>
@dzafman
Copy link
Contributor Author

dzafman commented Jul 24, 2020

jenkins test dashboard backend

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants