osd/scrub: add configuration parameters to control delay duration by ronen-fr · Pull Request #59590 · ceph/ceph

ronen-fr · 2024-09-04T08:35:12Z

to apply to a scrub target following a scrub failure

Specific configuration parameters are added to control the duration
of the delay to apply to the 'not-before' attribute of the failed scrub
target following a scrub failure. Some failure causes now have
their own delay values, while others share a common duration.
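As an illustration of the behaviour described above, here is a minimal Python sketch (not the Ceph C++ implementation). Only `osd_scrub_retry_delay` is an option name taken from this PR; the `FailureCause` enum, the `example_*` keys, the numeric values, and the rule that 'not-before' is never moved earlier are assumptions made for the example.

```python
from enum import Enum, auto


class FailureCause(Enum):
    """Hypothetical failure causes, loosely following the commit messages."""
    NOSCRUB_FLAG = auto()      # no-scrub / no-deep-scrub flag was set
    PG_STATE = auto()          # PG not active or not clean
    SNAP_TRIMMING = auto()     # PG busy with snap-trimming
    INTERVAL_CHANGE = auto()   # scrub aborted because the interval changed
    OTHER = auto()             # any cause without its own delay


# Option named in this PR: the common retry delay.
COMMON_DELAY_KEY = "osd_scrub_retry_delay"

# Hypothetical keys for the causes that have their own delay.
CAUSE_SPECIFIC_KEYS = {
    FailureCause.NOSCRUB_FLAG: "example_retry_after_noscrub",
    FailureCause.PG_STATE: "example_retry_pg_state",
    FailureCause.SNAP_TRIMMING: "example_retry_trimming",
    FailureCause.INTERVAL_CHANGE: "example_retry_new_interval",
}

# Illustrative values in seconds; these are not Ceph defaults.
EXAMPLE_CONF = {
    COMMON_DELAY_KEY: 30,
    "example_retry_after_noscrub": 60,
    "example_retry_pg_state": 60,
    "example_retry_trimming": 10,
    "example_retry_new_interval": 10,
}


def retry_delay(conf, cause):
    """Return the delay for a cause, using the common delay if it has no own key."""
    key = CAUSE_SPECIFIC_KEYS.get(cause, COMMON_DELAY_KEY)
    return conf[key]


def push_not_before(not_before, now, conf, cause):
    """Return the new 'not-before' time of the failed scrub target.

    Assumption: an already later 'not-before' is left unchanged.
    """
    return max(not_before, now + retry_delay(conf, cause))
```

For example, `push_not_before(100.0, 200.0, EXAMPLE_CONF, FailureCause.SNAP_TRIMMING)` uses the cause-specific delay, while `FailureCause.OTHER` uses the common one.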

anthonyeleven · 2024-09-04T11:34:22Z

src/common/options/osd.yaml.in

+- name: osd_scrub_retry_delay
+  type: int
+  level: advanced
+  desc: Period (in seconds) before retrying a specific PG following a scrub failure


nit: This to me reads as though the option is independently applied to each PG. Suggest

desc: Period (in seconds) before retrying a PG that has failed a prior scrub.

anthonyeleven · 2024-09-04T11:38:09Z

src/common/options/osd.yaml.in

+  level: advanced
+  desc: Period (in seconds) before retrying a specific PG following a scrub failure
+  long_desc: Minimum delay after a failed attempt to scrub a PG. See the
+    'see also' for the configuration options for some specific delay reasons


nit: suggest

'see also' for delay options for specific failure reasons.

I also wonder if osd_scrub_retry_delay overrides the below, if it applies to only cases that aren't among the below, or what. In other words, I'd like to see more about the relationship between this first option and the below.

Please see my attempt at clarifying this (in the 'long_desc').

anthonyeleven · 2024-09-04T11:39:02Z

src/common/options/osd.yaml.in

+  desc: Period (in seconds) before retrying to scrub a PG at a specific level
+    after detecting a no-scrub or no-deep-scrub flag
+  long_desc: Minimum delay after a failed attempt to scrub a PG at a level
+    (shallow or deep) that is disabled by cluster or pool no-scrub or no-deep-scrub


I might take out the mentions of level here

@anthonyeleven: We now have two scheduled 'targets' per PG: one for its next shallow scrub, and one for the next deep one. I am trying to convey that a specific one of these targets is to be postponed.

Wouldn't that require two options? osd_shallow_scrub_retry_after_noscrub and osd_deep_scrub_retry_after_noscrub?

There is a single delay, but it is applied to the relevant target (the relevant level): the one that was scheduled to execute and was aborted because the operator set the flag.
I have pushed a new version with the rest of the fixes you suggested. In the description of the default conf, I've also expanded a bit on the two levels.
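To make the "one delay, two targets" point concrete, here is a minimal Python sketch. It is not the Ceph scheduler: `ScrubLevel`, `ScrubTarget`, `PgSchedule`, and `delay_after_failure` are hypothetical names, and keeping the later of the old and new 'not-before' values is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class ScrubLevel(Enum):
    SHALLOW = "shallow"
    DEEP = "deep"


@dataclass
class ScrubTarget:
    """One scheduled scrub of a PG at a given level (hypothetical model)."""
    level: ScrubLevel
    not_before: float = 0.0


@dataclass
class PgSchedule:
    """A PG holds two targets: its next shallow scrub and its next deep scrub."""
    pgid: str
    targets: dict = field(
        default_factory=lambda: {lvl: ScrubTarget(lvl) for lvl in ScrubLevel}
    )

    def delay_after_failure(self, level, now, delay_s):
        """Postpone only the target of the level that failed.

        The single configured delay is applied to that target; the target
        of the other level keeps its schedule.
        """
        target = self.targets[level]
        target.not_before = max(target.not_before, now + delay_s)
        return target
```

With this model, aborting a deep scrub because of a no-deep-scrub flag postpones only the deep target, and the PG's shallow target is unaffected.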

allowing the configuration of lower delay times (compared to 'pg_state', now denoting PGs that are not active or not clean) for PGs that failed to be scrubbed due to performing snap-trimming. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

allowing setting specific delay times for scrubs that were aborted due to the interval being changed. The specified delay should be lower than the default delay used for the other types of mid-scrub aborts. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

ronen-fr · 2024-09-07T06:23:15Z

'make check' failure caused by test environment issues. Retrying.

ronen-fr · 2024-09-07T16:31:44Z

jenkins test make check

ronen-fr · 2024-09-08T13:56:04Z

Merging based on multiple Teuthology tests. Both as this branch, and as wip-rf-delay-conf-w-standalo

That test no longer matches the actual requirements and implementation of scrubbing. It was already deactivated in ceph#59590. Here, it is fully removed, mainly for the sake of backporting. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

That test no longer matches the actual requirements and implementation of scrubbing. It was already deactivated in ceph#59590. Here, it is fully removed, mainly for the sake of backporting. Fixes (original): https://tracker.ceph.com/issues/50245 Fixes (Squid backport): https://tracker.ceph.com/issues/68403 (cherry picked from commit 0c4028a) Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

github-actions bot added common core tests labels Sep 4, 2024

ronen-fr requested a review from a team September 4, 2024 08:35

ronen-fr force-pushed the wip-rf-delay-conf branch from c321b1e to ad79764 Compare September 4, 2024 09:15

ronen-fr marked this pull request as ready for review September 4, 2024 09:15

ronen-fr requested a review from a team as a code owner September 4, 2024 09:15

anthonyeleven reviewed Sep 4, 2024

ronen-fr added 5 commits September 4, 2024 07:07

osd/scrub: add configuration parameters to control length of delay

7069141

to apply to a scrub target following a scrub failure Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

test/osd/scrub: set new scrub-related config options to test values

c0a52a5

shortening the delay times following various scrub events. Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

test/osd: fix 'recovery scrub' standalone test

ec8f61f

Signed-off-by: Ronen Friedman <rfriedma@redhat.com>

ronen-fr force-pushed the wip-rf-delay-conf branch from ad79764 to d7c7aa7 Compare September 4, 2024 12:10

ronen-fr requested review from NitzanMordhai, anthonyeleven and athanatos September 4, 2024 12:13

athanatos approved these changes Sep 7, 2024

ronen-fr merged commit 8b5058c into ceph:main Sep 8, 2024

ronen-fr mentioned this pull request Oct 8, 2024

qa/standalone/scrub: remove TEST_recovery_scrub_2 #60198

Merged

osd/scrub: add configuration parameters to control delay duration #59590
ronen-fr merged 5 commits into ceph:main from ronen-fr:wip-rf-delay-conf

3 participants
